Arxiv今日论文 | 2024-10-30

本篇博文主要展示 2024-10-30 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决视觉-语言模型（Vision-and-Language Models, VLMs）内部任务表示的机制问题，特别是这些模型如何跨不同模态（如文本和图像）和任务指定方式（如示例或指令）一致地编码任务表示。解决方案的关键在于识别并分析VLMs中的任务向量（task vectors），发现这些向量在概念上相似的任务之间具有相似性，且能够在不同模态间传递。此外，通过结合示例和指令生成的任务向量，可以产生更优的任务表示。这些发现揭示了VLMs在处理任务时的三个主要阶段：输入、任务和答案，并展示了其在不同模态和任务指定方式下的一致性处理能力。

链接: https://arxiv.org/abs/2410.22330
作者: Grace Luo,Trevor Darrell,Amir Bar
关键词-EN: investigate the internal, encode task representations, task, internal representations, representations
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar and instruction based task vectors produce better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: this https URL.
摘要：我们研究了视觉与语言模型（Vision-and-Language Models, VLMs）的内部表示，以及它们如何编码任务表示。我们考虑了通过示例或指令指定的任务，这些任务可以采用文本或图像输入。令人惊讶的是，我们发现概念上相似的任务被映射到相似的任务向量表示，无论它们是如何指定的。我们的研究结果表明，为了输出答案，VLMs中的Token经历了三个不同的阶段：输入、任务和答案，这一过程在不同的模态和规范中是一致的。我们在VLMs中识别出的任务向量具有足够的通用性，可以在一种模态（例如文本）中推导出来，并转移到另一种模态（例如图像）中。此外，我们发现将基于示例和基于指令的任务向量进行集成，可以产生更好的任务表示。综合来看，这些见解揭示了VLMs的底层机制，特别是它们在不同模态和任务规范之间以共享方式表示任务的能力。项目页面：this https URL。

[NLP-1] Understanding Synthetic Context Extension via Retrieval Heads

【速读】：该论文试图解决的问题是如何通过合成数据扩展（synthetic context extension）来提升长上下文语言模型（Long-context LLMs）在下游长上下文任务中的表现，特别是在需要检索和推理的任务中。解决方案的关键在于识别和分析在合成数据上微调的模型中出现的“检索头”（retrieval heads），这些检索头在处理长上下文检索任务时起着重要作用。研究发现，尽管合成数据训练的模型在真实数据上的表现不如直接在真实数据上训练的模型，但通过分析检索头的召回率和模型下游性能之间的强相关性，可以解释和预测这种性能差异。此外，通过注意力敲除（attention knockout）和激活补丁（activation patching）实验，证明了检索头在模型性能中的必要性，尽管它们并不完全充分。这些发现为理解和改进合成数据微调的效果提供了新的视角。

链接: https://arxiv.org/abs/2410.22316
作者: Xinyu Zhao,Fangcong Yin,Greg Durrett
关键词-EN: retrieval-augmented generation, synthetic context extension, increasingly in demand, demand for applications, synthetic
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of “needle” concepts to be retrieved and diversity of the surrounding “haystack” context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context: retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data are mostly subsets of the retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
摘要：随着检索增强生成等应用的需求日益增加，长上下文大语言模型（LLM）的需求也在不断增长。为了降低在长上下文上预训练大语言模型的成本，近期的工作采用了一种合成上下文扩展的方法：在训练后阶段，使用合成生成的长上下文数据对大语言模型进行微调。然而，目前尚不清楚这种合成上下文扩展如何以及为何能够赋予模型在下游长上下文任务中的能力。本文针对需要检索和推理的三项长上下文任务，研究了在合成数据上的微调效果。我们通过使用大语言模型构建合成文档、使用模板化关系和创建符号数据集，来改变“针”概念的检索真实性和周围“干草堆”上下文的多样性。我们发现，在合成数据上训练的模型在真实数据上的表现有所不足，但令人惊讶的是，这种不匹配可以通过一组特殊的注意力头来解释和预测，这些注意力头负责在长上下文中进行检索：检索头（Wu et al., 2024）。在合成数据上学习的检索头大多是真实数据上学习的检索头的子集，并且这些头的召回率与模型的下游性能之间存在强相关性。此外，通过注意力敲除和激活修补，我们机械地展示了检索头是必要的，并且能够解释模型性能，尽管它们并不完全充分。我们的研究结果揭示了如何解释合成数据微调性能，以及如何为在长上下文中学习现实世界能力创建更好的数据。

[NLP-2] Natural Language Inference Improves Compositionality in Vision-Language Models CEC

【速读】：该论文试图解决视觉-语言模型（Vision-Language Models, VLMs）在组合推理（compositional reasoning）中难以关联物体、属性和空间关系的问题。解决方案的关键在于提出了基于自然语言推理（Natural Language Inference, NLI）的Caption Expansion with Contradictions and Entailments (CECE)方法。CECE通过生成蕴含句（entailments）和矛盾句（contradictions）来扩展给定的前提，从而在保持句子核心意义的同时增加词汇多样性。这种方法不仅提高了模型的可解释性，还减少了模型对偏见或表面特征的过度依赖，无需额外微调即可在多个基准测试中取得最先进的结果。

链接: https://arxiv.org/abs/2410.22315
作者: Paola Cascante-Bonilla,Yu Hou,Yang Trista Cao,Hal Daumé III,Rachel Rudinger
关键词-EN: Large Language Models, Compositional reasoning, remains challenging, relate objects, spatial relationships
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
摘要：视觉-语言模型 (Vision-Language Models, VLMs) 中的组合推理仍然是一个挑战，因为这些模型通常难以关联对象、属性和空间关系。最近的方法试图通过依赖文本描述的语义，使用大语言模型 (Large Language Models, LLMs) 将其分解为一系列问题和答案来解决这些限制。然而，这些方法主要在表面层次上操作，未能融入更深层次的词汇理解，同时引入了由 LLM 生成的不正确假设。针对这些问题，我们提出了矛盾与蕴含扩展 (Caption Expansion with Contradictions and Entailments, CECE)，这是一种利用自然语言推理 (Natural Language Inference, NLI) 从给定前提生成蕴含和矛盾的原则性方法。CECE 在保持句子核心意义的同时，生成词汇多样化的句子。通过广泛的实验，我们展示了 CECE 增强了可解释性，并减少了对于偏见或表面特征的过度依赖。通过在原始前提上平衡 CECE，我们实现了对先前方法的显著改进，而无需额外的微调，在评估图像-文本对齐的人类判断一致性的基准测试中取得了最先进的结果，并在 Winoground 上实现了 +19.2%（组得分）和在 EqBen 上实现了 +12.9%（组得分）的性能提升，超过了之前最佳的工作（使用针对性数据进行微调）。

[NLP-3] SVIP: Towards Verifiable Inference of Open-source Large Language Models

【速读】：该论文试图解决开源大型语言模型（LLMs）在通过计算服务提供商进行推理时，可能被未经用户同意替换为较小、能力较弱的模型的问题。解决方案的关键是引入了一种基于秘密的验证LLM推理协议（SVIP），该协议利用LLM的中间输出作为独特的模型标识符。通过在中间输出上训练代理任务，并要求计算服务提供商返回生成的文本和处理后的中间输出，用户可以可靠地验证计算服务提供商是否诚实。此外，秘密机制的整合进一步增强了协议的安全性。实验结果表明，SVIP在准确性、通用性、计算效率和抗攻击性方面表现出色，具有低于5%的假阴性率和低于3%的假阳性率，且每次查询的验证时间少于0.01秒。

链接: https://arxiv.org/abs/2410.22307
作者: Yifan Sun,Yuhang Li,Yue Zhang,Yuchen Jin,Huan Zhang
关键词-EN: Open-source Large Language, Large Language Models, natural language understanding, Open-source Large, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 20 pages

点击查看摘要

Abstract:Open-source Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language understanding and generation, leading to widespread adoption across various domains. However, their increasing model sizes render local deployment impractical for individual users, pushing many to rely on computing service providers for inference through a blackbox API. This reliance introduces a new risk: a computing provider may stealthily substitute the requested LLM with a smaller, less capable model without consent from users, thereby delivering inferior outputs while benefiting from cost savings. In this paper, we formalize the problem of verifiable inference for LLMs. Existing verifiable computing solutions based on cryptographic or game-theoretic techniques are either computationally uneconomical or rest on strong assumptions. We introduce SVIP, a secret-based verifiable LLM inference protocol that leverages intermediate outputs from LLM as unique model identifiers. By training a proxy task on these outputs and requiring the computing provider to return both the generated text and the processed intermediate outputs, users can reliably verify whether the computing provider is acting honestly. In addition, the integration of a secret mechanism further enhances the security of our protocol. We thoroughly analyze our protocol under multiple strong and adaptive adversarial scenarios. Our extensive experiments demonstrate that SVIP is accurate, generalizable, computationally efficient, and resistant to various attacks. Notably, SVIP achieves false negative rates below 5% and false positive rates below 3%, while requiring less than 0.01 seconds per query for verification.
摘要：开源大语言模型 (LLM) 近期在自然语言理解和生成方面展示了显著的能力，推动了其在多个领域的广泛应用。然而，随着模型规模的不断增大，本地部署对个人用户而言变得不切实际，导致许多人依赖计算服务提供商通过黑箱 API 进行推理。这种依赖引入了一种新的风险：计算服务提供商可能在未经用户同意的情况下，秘密地将请求的 LLM 替换为规模较小、能力较弱的模型，从而在节省成本的同时提供低质量的输出。本文正式定义了 LLM 的可验证推理问题。现有的基于密码学或博弈论技术的可验证计算解决方案要么在计算上不经济，要么依赖于强假设。我们提出了 SVIP，一种基于秘密的可验证 LLM 推理协议，该协议利用 LLM 的中间输出作为独特的模型标识符。通过在这些输出上训练代理任务，并要求计算服务提供商返回生成的文本和处理后的中间输出，用户可以可靠地验证计算服务提供商是否诚实操作。此外，秘密机制的集成进一步增强了我们协议的安全性。我们在多种强适应性对抗场景下对协议进行了全面分析。广泛的实验表明，SVIP 具有准确性、可泛化性、计算效率高且能抵抗多种攻击。值得注意的是，SVIP 的假阴性率低于 5%，假阳性率低于 3%，而每次查询的验证时间不到 0.01 秒。

[NLP-4] Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

【速读】：该论文试图解决大型语言模型 (LLMs) 在数学推理任务中生成详细和准确推理轨迹的挑战。解决方案的关键在于引入了一种基于在线学习流 (online learning Flows) 的新方法，通过增量输出生产流 (incremental output production Flow) 实现高质量推理轨迹的生成。具体来说，该方法利用多个组件 LLMs 通过迭代通信协作构建解决方案，并采用在线直接偏好优化 (Direct Preference Optimization, DPO) 学习与回滚技术，实时生成 DPO 对并更新模型，从而显著提升 LLMs 在数学推理任务中的表现。

链接: https://arxiv.org/abs/2410.22304
作者: Yihe Deng,Paul Mineiro
关键词-EN: Large Language Models, Large Language, capability for Large, significant challenge, reasoning traces remains
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages, 4 figures, 1 table

点击查看摘要

Abstract:Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning \textbfFlows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.
摘要：数学推理是大语言模型 (LLM) 的一项关键能力，然而生成详细且准确的推理过程仍然是一个重大挑战。本文提出了一种新颖的方法，利用在线学习 Flows 来生成高质量的推理过程，以用于 LLM 的微调。我们的方法采用了一种增量输出生成 Flow，其中组件 LLM 通过迭代通信协作构建解决方案。我们使用在线直接偏好优化 (DPO) 学习进行训练，通过回滚生成每个训练样本的 DPO 对，并实时更新模型。我们直接比较了由我们的方法生成的推理过程与通过直接模型推断生成的推理过程的质量，证明了我们的方法在提升 LLM 在数学推理任务中的表现方面的有效性。

[NLP-5] From melodic note sequences to pitches using word2vec

【速读】：该论文试图解决的问题是如何将词嵌入技术（word2vec）应用于旋律分析，以捕捉音符之间的音高信息。解决方案的关键在于将音符视为句子中的单词，通过构建一个二维的语义空间来定义音符的嵌入表示。具体方法是通过预测当前音符基于其前2、3或4个音符所建立的上下文，并利用多变量分析验证这些语义向量与其音高之间的相关性，结果显示相关系数约为0.80。

链接: https://arxiv.org/abs/2410.22285
作者: Daniel Defays
关键词-EN: language modeling, words in sentences, enables the capture, pitch information, treated as words
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Applying the word2vec technique, commonly used in language modeling, to melodies, where notes are treated as words in sentences, enables the capture of pitch information. This study examines two datasets: 20 children’s songs and an excerpt from a Bach sonata. The semantic space for defining the embeddings is of very small dimension, specifically 2. Notes are predicted based on the 2, 3 or 4 preceding notes that establish the context. A multivariate analysis of the results shows that the semantic vectors representing the notes have a multiple correlation coefficient of approximately 0.80 with their pitches.
摘要：将广泛应用于语言建模的 word2vec 技术应用于旋律分析，其中音符被视为句子中的单词，从而能够捕捉音高信息。本研究考察了两个数据集：20首儿童歌曲和一段巴赫奏鸣曲的节选。用于定义嵌入的语义空间维度非常小，具体为2维。音符的预测基于前2、3或4个音符所建立的上下文。对结果进行的多变量分析显示，表示音符的语义向量与其音高之间的多重相关系数约为0.80。

[NLP-6] Fourier Head: Helping Large Language Models Learn Complex Probability Distributions

【速读】：该论文试图解决的问题是：在将大型语言模型（LLM）应用于非语言领域时，传统的离散分类（softmax over discrete bins）是否能有效捕捉连续结构和复杂分布，从而生成高质量的非语言标记。解决方案的关键在于引入了一种基于傅里叶级数（Fourier series）构建的神经网络层，称为“傅里叶头”（Fourier head）。这一层可以替代任何线性层，使得输出具有更连续的结构。通过在合成数据集、大规模决策制定和时间序列预测任务中的广泛分析，以及理论上的证据支持，该论文证明了傅里叶头在处理具有自然连续结构的数据分布时，能够更好地从数据中学习信号并忽略高频噪声，从而显著提升模型性能。例如，在Atari Seaquest游戏中，傅里叶头使决策变换器（Decision Transformer）代理的回报提高了46%，并在20个未见过的基准测试中，将最先进的时间序列基础模型的预测性能提高了3.5%。

链接: https://arxiv.org/abs/2410.22269
作者: Nate Gillman,Daksh Aggarwal,Michael Freeman,Saurabh Singh,Chen Sun
关键词-EN: large language models, large language, increased interest, model non-linguistic tokens, Decision Transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Project page and code are at this https URL

点击查看摘要

Abstract:As the quality of large language models has improved, there has been increased interest in using them to model non-linguistic tokens. For example, the Decision Transformer recasts agentic decision making as a sequence modeling problem, using a decoder-only LLM to model the distribution over the discrete action space for an Atari agent. However, when adapting LLMs to non-linguistic domains, it remains unclear if softmax over discrete bins captures the continuous structure of the tokens and the potentially complex distributions needed for high quality token generation. We introduce a neural network layer, constructed using Fourier series, which we can easily substitute for any linear layer if we want the outputs to have a more continuous structure. We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. We also provide theoretical evidence that this layer can better learn signal from data while ignoring high-frequency noise. All of our results support the effectiveness of our proposed Fourier head in scenarios where the underlying data distribution has a natural continuous structure. For example, the Fourier head improves a Decision Transformer agent’s returns by 46% on the Atari Seaquest game, and increases a state-of-the-art times series foundation model’s forecasting performance by 3.5% across 20 benchmarks unseen during training.
摘要：随着大语言模型质量的提升，人们越来越有兴趣利用它们来建模非语言的 Token。例如，Decision Transformer 将智能体决策过程重新构建成一个序列建模问题，使用仅包含解码器的 LLM 来建模 Atari 智能体在离散动作空间上的分布。然而，在将 LLM 应用于非语言领域时，尚不清楚离散 bin 上的 softmax 是否能捕捉到 Token 的连续结构以及生成高质量 Token 所需的潜在复杂分布。我们引入了一种基于傅里叶级数构建的神经网络层，如果希望输出具有更连续的结构，可以轻松地将其替换为任何线性层。我们对合成数据集以及大规模决策制定和时间序列预测任务进行了广泛分析。我们还提供了理论证据，证明该层在忽略高频噪声的同时能更好地从数据中学习信号。所有结果都支持了我们提出的傅里叶头在底层数据分布具有自然连续结构场景中的有效性。例如，傅里叶头在 Atari Seaquest 游戏中将 Decision Transformer 智能体的回报率提高了 46%，并在训练期间未见过的 20 个基准测试中将最先进的时间序列基础模型的预测性能提高了 3.5%。

[NLP-7] FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

【速读】：该论文试图解决语言模型（LMs）在实际用户交互中生成内容的事实性问题。解决方案的关键在于提出了VERIFY（Verification and Evidence RetrIeval for FactualitY evaluation）管道，该管道通过评估模型生成内容的可验证性，并根据从网络检索到的证据将内容单元分类为支持、不支持或无法确定，从而更准确地判断事实性。VERIFY与人类评估的相关性优于现有方法，并用于识别在不同主题下引发最高错误率和不确定响应的“幻觉提示”（hallucination prompts），形成FactBench数据集。该数据集包含150个细粒度主题的1000个提示，用于基准测试广泛使用的LMs，揭示了专有模型在事实性上的表现优于开源模型，以及不同模型在拒绝率和主观性上的差异。

链接: https://arxiv.org/abs/2410.22257
作者: Farima Fatahi Bayat,Lechen Zhang,Sheza Munir,Lu Wang
关键词-EN: Language models, increasing number, broad range, Language, VERIFY
类目: Computation and Language (cs.CL)
备注: 25 pages, 10 figures

点击查看摘要

Abstract:Language models (LMs) are widely used by an increasing number of users, underscoring the challenge of maintaining factuality across a broad range of topics. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs’ factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved evidence from the Web. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify “hallucination prompts” across diverse topics, i.e., those eliciting the highest rates of incorrect and inconclusive LM responses. These prompts form FactBench, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from GPT, Gemini, and Llama3.1 family on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance declining from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases. Our code and data are publicly available at this https URL.
摘要：语言模型 (Language Models, LMs) 被越来越多的用户广泛使用，这凸显了在广泛主题范围内维持事实性的挑战。我们首先介绍了 VERIFY（Verification and Evidence RetrIeval for FactualitY evaluation，事实性评估的验证与证据检索），这是一个用于评估语言模型在实际用户交互中事实性的流程。VERIFY 考虑了语言模型生成内容的可验证性，并根据从网络检索到的证据将内容单元分类为支持、不支持或无法确定。重要的是，VERIFY 的事实性判断与人类评估的相关性优于现有方法。使用 VERIFY，我们识别了跨多个主题的“幻觉提示”，即那些引发最高错误率和不确定语言模型响应的提示。这些提示构成了 FactBench，一个包含 150 个细粒度主题的 1000 个提示的数据集。我们的数据集捕捉了现实世界语言模型交互中出现的事实性挑战，并且可以通过新提示定期更新。我们在 FactBench 上对 GPT、Gemini 和 Llama3.1 系列中广泛使用的语言模型进行了基准测试，得出了以下关键发现：(i) 专有模型表现出更好的事实性，其性能从简单到困难的幻觉提示逐渐下降。(ii) Llama3.1-405B-Instruct 在所有评估方法中显示出与 Llama3.1-70B-Instruct 相当或更低的事实准确性，这是由于其更高的主观性导致更多内容被标记为无法确定。(iii) Gemini1.5-Pro 显示出显著的拒绝率，其中超过 25% 的案例中存在过度拒绝。我们的代码和数据可在以下链接公开获取：https URL。

[NLP-8] DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers EMNLP2024

【速读】：该论文试图解决当前机器学习系统中存在的系统性偏差问题，这些偏差通常源于数据集中的标注伪影或某些类别支持不足。解决方案的关键在于引入DISCERN框架，该框架通过语言解释来解释文本分类器中的系统性偏差。DISCERN通过两个大型语言模型之间的交互循环，迭代生成精确的自然语言描述，从而识别和解释这些偏差。最终，这些描述被用于通过合成生成实例或通过主动学习注释示例来增强分类器训练集，从而改进分类器性能。实验结果表明，该框架的语言解释在三个文本分类数据集上均能带来一致的性能提升，且在人类评估中，用户通过语言解释比通过聚类示例更有效地理解和解释系统性偏差。

链接: https://arxiv.org/abs/2410.22239
作者: Rakesh R. Menon,Shashank Srivastava
关键词-EN: high predictive accuracies, current machine learning, machine learning systems, predictive accuracies, current machine
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 9 figures, 15 tables; Accepted to EMNLP 2024

点击查看摘要

Abstract:Despite their high predictive accuracies, current machine learning systems often exhibit systematic biases stemming from annotation artifacts or insufficient support for certain classes in the dataset. Recent work proposes automatic methods for identifying and explaining systematic biases using keywords. We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations. DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models. Finally, we use the descriptions to improve classifiers by augmenting classifier training sets with synthetically generated instances or annotated examples via active learning. On three text-classification datasets, we demonstrate that language explanations from our framework induce consistent performance improvements that go beyond what is achievable with exemplars of systematic bias. Finally, in human evaluations, we show that users can interpret systematic biases more effectively (by over 25% relative) and efficiently when described through language explanations as opposed to cluster exemplars.
摘要：尽管当前的机器学习系统在预测准确性上表现出色，但它们往往表现出由标注伪影或数据集中某些类别支持不足引起的系统性偏差。最近的研究提出了使用关键词自动识别和解释系统性偏差的方法。我们引入了DISCERN框架，该框架利用语言解释来解读文本分类器中的系统性偏差。DISCERN通过两个大语言模型之间的交互循环，迭代生成系统性错误的精确自然语言描述。最后，我们利用这些描述通过增加合成生成的实例或通过主动学习注释的示例来增强分类器训练集，从而改进分类器。在三个文本分类数据集上，我们证明了我们的框架生成的语言解释能够带来持续的性能提升，超越了仅依靠系统性偏差示例所能达到的效果。最后，在人类评估中，我们展示了用户通过语言解释（相对于聚类示例）能够更有效地（相对提高超过25%）和高效地解释系统性偏差。

[NLP-9] Cora: Accelerating Stateful Network Applications with SmartNICs

【速读】：该论文试图解决在将状态网络应用卸载到智能网卡（SmartNICs）时，由于状态操作复杂性、状态资源消耗以及流量与状态之间复杂关系导致的卸载方案次优问题。解决方案的关键在于提出了Cora编译器和运行时系统。Cora编译器通过引入精确的性能模型和高效的编译算法来搜索最优的卸载方案，而Cora运行时则能够监控流量动态并进行自适应调整，以最小化CPU使用率。实验结果表明，Cora在相同吞吐量目标下，能够节省高达94.0%的CPU核心，性能优于基线解决方案1.9倍，并且在相同资源约束下，能够加速网络功能44.9%-82.3%。

链接: https://arxiv.org/abs/2410.22229
作者: Shaoke Xi,Jiaqi Gao,Mengqi Liu,Jiamin Cao,Fuliang Li,Kai Bu,Kui Ren,Minlan Yu,Dennis Cai,Ennan Zhai
关键词-EN: stateful network applications, offloading stateful network, growing performance requirements, stateful network, network applications
类目: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the growing performance requirements on networked applications, there is a new trend of offloading stateful network applications to SmartNICs to improve performance and reduce the total cost of ownership. However, offloading stateful network applications is non-trivial due to state operation complexity, state resource consumption, and the complicated relationship between traffic and state. Naively partitioning the program by state or traffic can result in a suboptimal partition plan with higher CPU usage or even packet drops. In this paper, we propose Cora, a compiler and runtime that offloads stateful network applications to SmartNIC-accelerated hosts. Cora compiler introduces an accurate performance model for each SmartNIC and employs an efficient compiling algorithm to search the offloading plan. Cora runtime can monitor traffic dynamics and adapt to minimize CPU usage. Cora is built atop Netronome Agilio and BlueField 2 SmartNICs. Our evaluation shows that for the same throughput target, Cora can propose partition plans saving up to 94.0% CPU cores, 1.9 times more than baseline solutions. Under the same resource constraint, Cora can accelerate network functions by 44.9%-82.3%. Cora runtime can adapt to traffic changes and keep CPU usage low.
摘要：随着网络应用性能需求的不断提升，将状态网络应用卸载到智能网卡（SmartNIC）以提高性能并降低总拥有成本（TCO）已成为一种新趋势。然而，卸载状态网络应用并非易事，因为状态操作的复杂性、状态资源的消耗以及流量与状态之间的复杂关系。简单地通过状态或流量对程序进行分区，可能会导致次优的分区方案，从而增加CPU使用率甚至导致数据包丢失。本文提出了Cora，一种将状态网络应用卸载到智能网卡加速主机的编译器和运行时系统。Cora编译器引入了针对每种智能网卡的精确性能模型，并采用高效的编译算法来搜索卸载方案。Cora运行时能够监控流量动态并进行适应性调整，以最小化CPU使用率。Cora构建在Netronome Agilio和BlueField 2智能网卡之上。我们的评估结果显示，在相同的吞吐量目标下，Cora可以提出节省高达94.0% CPU核心的分区方案，比基线解决方案高出1.9倍。在相同的资源约束下，Cora可以将网络功能的加速比提升44.9%-82.3%。Cora运行时能够适应流量变化，保持低CPU使用率。

[NLP-10] ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

【速读】：该论文试图解决多模态系统在应用导向场景中的评估问题，特别是如何衡量这些系统在实际操作活动中的理解能力。解决方案的关键在于提出了一个新的评估数据集，称为ProMQA，该数据集包含401个多模态程序性问答对，结合了用户记录的操作活动及其对应的指令。通过采用人机协作的标注方法，利用大型语言模型（LLM）生成问答对并由人工验证，确保了数据集的高效性和准确性。论文还提供了基准测试结果，揭示了当前系统与人类表现之间的显著差距，从而为多模态理解能力的研究提供了新的视角。

链接: https://arxiv.org/abs/2410.22211
作者: Kimihiro Hasegawa,Wiradee Imrattanatrai,Zhi-Qi Cheng,Masaki Asada,Susan Holm,Yuran Wang,Ken Fukuda,Teruko Mitamura
关键词-EN: people follow instructions, achieve their goals, great potential, potential to assist, people follow
类目: Computation and Language (cs.CL)
备注: 18 pages, 11 figures

点击查看摘要

Abstract:Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities coupled with their corresponding instruction. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models’ multimodal understanding capabilities.
摘要：多模态系统在辅助人类进行程序性活动方面具有巨大潜力，这些活动通常需要人们按照指令来实现目标。尽管应用场景多样，系统通常在传统的分类任务上进行评估，例如动作识别或时间动作分割。本文提出了一种新的评估数据集，称为 ProMQA，用于衡量系统在面向应用场景中的进展。ProMQA 包含 401 个多模态程序性问答对，这些问答对基于用户记录的程序性活动及其相应的指令。在问答标注方面，我们采用了一种高效的人机协作方法，即利用大语言模型 (LLM) 生成的问答对来增强现有标注，随后由人工进行验证。随后，我们提供了基准测试结果，以设定 ProMQA 上的基线性能。我们的实验表明，人类表现与当前系统（包括竞争性的专有多模态模型）之间存在显著差距。我们希望该数据集能够揭示模型在多模态理解能力方面的新视角。

[NLP-11] Class-Aware Contrastive Optimization for Imbalanced Text Classification

【速读】：该论文试图解决文本分类任务中的类别不平衡问题，特别是在实际应用中常见的场景。解决方案的关键在于结合类别感知的对比优化（class-aware contrastive optimization）和去噪自编码器（denoising autoencoders）。具体来说，论文提出的方法通过在嵌入空间中结合重构损失（reconstruction loss）和对比类别分离（contrastive class separation），实现了生成嵌入的真实性与模型区分不同类别能力之间的更好平衡。这种方法在多个文本数据集上显著优于传统和当前最先进的方法。

链接: https://arxiv.org/abs/2410.22197
作者: Grigorii Khvatskii,Nuno Moniz,Khoa Doan,Nitesh V Chawla
关键词-EN: data make classification, text data make, complex problem, make classification tasks, text classification tasks
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures, accepted for publication in CODS-COMAD 2024

点击查看摘要

Abstract:The unique characteristics of text data make classification tasks a complex problem. Advances in unsupervised and semi-supervised learning and autoencoder architectures addressed several challenges. However, they still struggle with imbalanced text classification tasks, a common scenario in real-world applications, demonstrating a tendency to produce embeddings with unfavorable properties, such as class overlap. In this paper, we show that leveraging class-aware contrastive optimization combined with denoising autoencoders can successfully tackle imbalanced text classification tasks, achieving better performance than the current state-of-the-art. Concretely, our proposal combines reconstruction loss with contrastive class separation in the embedding space, allowing a better balance between the truthfulness of the generated embeddings and the model’s ability to separate different classes. Compared with an extensive set of traditional and state-of-the-art competing methods, our proposal demonstrates a notable increase in performance across a wide variety of text datasets.
摘要：文本数据的独特特性使得分类任务成为一个复杂的问题。无监督学习和半监督学习以及自编码器架构的进步解决了许多挑战。然而，它们在处理不平衡文本分类任务时仍然存在困难，这是现实应用中的常见场景，表现为倾向于生成具有不利特性的嵌入，例如类别重叠。本文展示了利用类别感知的对比优化结合去噪自编码器可以成功应对不平衡文本分类任务，并取得了优于当前最先进技术的性能。具体而言，我们的方案在嵌入空间中结合了重构损失与对比类别分离，从而在生成嵌入的真实性与模型区分不同类别的能力之间实现了更好的平衡。与一系列传统和最先进的竞争方法相比，我们的方案在多种文本数据集上展示了显著的性能提升。

[NLP-12] ADAM: An Embodied Causal Agent in Open-World Environments

【速读】：该论文试图解决在开放世界环境中（如Minecraft），现有智能体在持续学习结构化知识，特别是因果关系方面面临的挑战。这些挑战源于黑箱模型的固有透明度问题以及训练过程中对先验知识的过度依赖，从而影响了智能体的可解释性和泛化能力。解决方案的关键在于引入ADAM（An emboDied causal Agent in Minecraft），这是一个能够在开放世界中自主导航、感知多模态上下文、学习因果世界知识并通过终身学习解决复杂任务的智能体。ADAM的核心组件包括：1) 交互模块，使智能体能够执行动作并记录交互过程；2) 因果模型模块，负责从零开始构建不断增长的因果图，增强可解释性并减少对先验知识的依赖；3) 控制器模块，包含规划器、执行器和记忆池，利用学习到的因果图完成任务；4) 感知模块，由多模态大型语言模型驱动，使ADAM能够像人类玩家一样感知。实验结果表明，ADAM能够从零开始构建几乎完美的因果图，实现高效的任务分解和执行，并展现出强大的可解释性、鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2410.22194
作者: Shu Yu,Chaochao Lu
关键词-EN: existing agents face, continuously learning structured, agents face challenges, learning structured knowledge, open-world environments
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their interpretability and generalization capability. To this end, we introduce ADAM, An emboDied causal Agent in Minecraft, that can autonomously navigate the open world, perceive multimodal contexts, learn causal world knowledge, and tackle complex tasks through lifelong learning. ADAM is empowered by four key components: 1) an interaction module, enabling the agent to execute actions while documenting the interaction processes; 2) a causal model module, tasked with constructing an ever-growing causal graph from scratch, which enhances interpretability and diminishes reliance on prior knowledge; 3) a controller module, comprising a planner, an actor, and a memory pool, which uses the learned causal graph to accomplish tasks; 4) a perception module, powered by multimodal large language models, which enables ADAM to perceive like a human player. Extensive experiments show that ADAM constructs an almost perfect causal graph from scratch, enabling efficient task decomposition and execution with strong interpretability. Notably, in our modified Minecraft games where no prior knowledge is available, ADAM maintains its performance and shows remarkable robustness and generalization capability. ADAM pioneers a novel paradigm that integrates causal methods and embodied agents in a synergistic manner. Our project page is at this https URL.
摘要：在像 Minecraft 这样的开放世界环境中，现有的 AI 智能体在持续学习结构化知识，特别是因果关系方面面临挑战。这些挑战源于黑箱模型固有的不透明性以及训练过程中对先验知识的过度依赖，这损害了它们的可解释性和泛化能力。为此，我们引入了 ADAM，一个嵌入式因果 AI 智能体，能够在 Minecraft 的开放世界中自主导航，感知多模态上下文，学习因果世界知识，并通过终身学习解决复杂任务。ADAM 由四个关键组件赋能：1) 交互模块，使智能体能够在执行动作的同时记录交互过程；2) 因果模型模块，负责从头构建一个不断增长的因果图，增强可解释性并减少对先验知识的依赖；3) 控制器模块，包括规划器、执行器和记忆池，利用学习到的因果图完成任务；4) 感知模块，由多模态大语言模型驱动，使 ADAM 能够像人类玩家一样感知。广泛的实验表明，ADAM 能够从头构建一个几乎完美的因果图，实现高效的任务分解和执行，并具有强大的可解释性。值得注意的是，在我们修改的 Minecraft 游戏中，没有任何先验知识的情况下，ADAM 仍能保持其性能，显示出显著的鲁棒性和泛化能力。ADAM 开创了一种将因果方法与嵌入式 AI 智能体协同集成的新范式。我们的项目页面位于此 https URL。

[NLP-13] Natural Language Processing for Analyzing Electronic Health Records and Clinical Notes in Cancer Research: A Review

【速读】：该论文旨在解决现有文献中关于自然语言处理 (NLP) 技术在癌症研究中应用的局限性问题，特别是针对特定癌症类型或应用的狭隘视角。解决方案的关键在于提供一个更广泛的视角，涵盖多种癌症类型和NLP应用，并通过综合文献搜索和数据提取，分析了NLP在电子健康记录 (EHRs) 和临床笔记中的应用趋势、方法、挑战和未来方向。关键点包括：识别信息提取和文本分类作为主要NLP任务，观察到从基于规则的方法向高级机器学习技术（特别是基于Transformer的模型）的转变，以及强调模型泛化性、临床语言处理复杂性和扩展到未充分研究的癌症类型的重要性。此外，论文强调了将NLP工具整合到临床实践中并解决伦理问题的必要性，以充分发挥NLP在癌症诊断、治疗和患者预后中的潜力。

链接: https://arxiv.org/abs/2410.22180
作者: Muhammad Bilal,Ameer Hamza,Nadia Malik
关键词-EN: electronic health records, NLP, natural language processing, health records, aims to analyze
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: This review aims to analyze the application of natural language processing (NLP) techniques in cancer research using electronic health records (EHRs) and clinical notes. This review addresses gaps in the existing literature by providing a broader perspective than previous studies focused on specific cancer types or applications. Methods: A comprehensive literature search was conducted using the Scopus database, identifying 94 relevant studies published between 2019 and 2024. Data extraction included study characteristics, cancer types, NLP methodologies, dataset information, performance metrics, challenges, and future directions. Studies were categorized based on cancer types and NLP applications. Results: The results showed a growing trend in NLP applications for cancer research, with breast, lung, and colorectal cancers being the most studied. Information extraction and text classification emerged as predominant NLP tasks. A shift from rule-based to advanced machine learning techniques, particularly transformer-based models, was observed. The Dataset sizes used in existing studies varied widely. Key challenges included the limited generalizability of proposed solutions and the need for improved integration into clinical workflows. Conclusion: NLP techniques show significant potential in analyzing EHRs and clinical notes for cancer research. However, future work should focus on improving model generalizability, enhancing robustness in handling complex clinical language, and expanding applications to understudied cancer types. Integration of NLP tools into clinical practice and addressing ethical considerations remain crucial for utilizing the full potential of NLP in enhancing cancer diagnosis, treatment, and patient outcomes.
摘要：
目的：本综述旨在分析自然语言处理 (NLP) 技术在癌症研究中的应用，特别是利用电子健康记录 (EHRs) 和临床笔记。本综述通过提供比以往专注于特定癌症类型或应用的研究更广泛的视角，填补了现有文献中的空白。

方法：通过 Scopus 数据库进行了全面的文献检索，确定了 2019 年至 2024 年间发表的 94 篇相关研究。数据提取包括研究特征、癌症类型、NLP 方法、数据集信息、性能指标、挑战和未来方向。研究根据癌症类型和 NLP 应用进行了分类。

结果：结果显示，NLP 在癌症研究中的应用呈增长趋势，其中乳腺癌、肺癌和结直肠癌是最受关注的研究对象。信息提取和文本分类成为主要的 NLP 任务。观察到从基于规则的方法向先进的机器学习技术，特别是基于 Transformer 的模型的转变。现有研究中使用的数据集规模差异很大。主要挑战包括所提出解决方案的泛化能力有限，以及需要改进与临床工作流程的整合。

结论：NLP 技术在分析 EHRs 和临床笔记以进行癌症研究方面显示出巨大的潜力。然而，未来的工作应着重于提高模型的泛化能力，增强处理复杂临床语言的鲁棒性，并扩展到研究不足的癌症类型。将 NLP 工具整合到临床实践中，并解决伦理考虑，对于充分发挥 NLP 在提升癌症诊断、治疗和患者预后方面的潜力至关重要。

[NLP-14] Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech NAACL

【速读】：该论文试图解决基于自回归 (AR) Transformer 的序列模型在处理比训练时更长的序列时遇到的泛化问题，特别是在文本到语音 (TTS) 任务中，模型容易出现单词重复或遗漏以及输出不稳定的现象。解决方案的关键在于引入了一种对齐机制，该机制通过相对位置信息为交叉注意力操作提供支持。这种对齐位置作为模型的潜在属性通过反向传播学习，训练过程中无需外部对齐信息。尽管该方法针对 TTS 输入输出对齐的单调性进行了优化，但它仍能受益于交错的多头自注意力和交叉注意力操作的灵活建模能力。论文提出的改进系统，称为 Very Attentive Tacotron，在自然度和表现力上与基于 T5 的基线 TTS 系统相当，同时消除了单词重复或遗漏的问题，并能够泛化到任意实际长度的语音输出。

链接: https://arxiv.org/abs/2410.22179
作者: Eric Battenberg,RJ Skerry-Ryan,Daisy Stanton,Soroosh Mariooryad,Matt Shannon,Julian Salazar,David Kao
关键词-EN: Transformer-based sequence models, Transformer-based sequence, difficulty generalizing, Transformer-based encoder-decoder TTS, sequences longer
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to NAACL

点击查看摘要

Abstract:Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
摘要：基于自回归 (Autoregressive, AR) Transformer 的序列模型在处理长度超过训练时所见序列的任务时，通常表现出泛化能力不足的问题。当应用于文本到语音 (Text-to-Speech, TTS) 任务时，这些模型往往会出现单词遗漏、重复或输出不稳定的情况，尤其是在处理较长语句时。本文提出了一系列针对基于 AR Transformer 的编码器-解码器 TTS 系统的改进措施，旨在解决这些鲁棒性和长度泛化问题。我们的方法采用了一种对齐机制，为交叉注意力操作提供相对位置信息。相关联的对齐位置通过反向传播作为模型的潜在属性进行学习，且在训练过程中无需外部对齐信息。尽管该方法针对 TTS 输入输出对齐的单调性进行了优化，但它仍能从交错的多头自注意力和交叉注意力操作的灵活建模能力中获益。我们称包含这些改进的系统为“非常注意的 Tacotron”，该系统在自然度和表现力上与基于 T5 的基准 TTS 系统相当，同时消除了单词重复或遗漏的问题，并能够泛化到任何实际的语句长度。

[NLP-15] Benchmarking LLM Guardrails in Handling Multilingual Toxicity

【速读】：该论文试图解决大型语言模型（LLMs）在多语言场景中检测和防御有毒内容的能力不足的问题。解决方案的关键在于引入一个全面的多语言测试套件，涵盖七个数据集和超过十种语言，以基准测试现有最先进防护措施的性能。此外，论文还研究了这些防护措施对最新破解技术的抵抗能力，并评估了上下文安全策略和语言资源可用性对防护措施性能的影响。研究发现，现有的防护措施在处理多语言有毒内容方面仍然无效，并且缺乏对破解提示的鲁棒性。该研究旨在识别防护措施的局限性，并构建在多语言场景中更可靠和可信的LLMs。

链接: https://arxiv.org/abs/2410.22153
作者: Yahan Yang,Soham Dan,Dan Roth,Insup Lee
关键词-EN: Large Language Models, ubiquity of Large, Language Models, Large Language, crucial to detect
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigates the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails’ performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to build a more reliable and trustworthy LLMs in multilingual scenarios.
摘要：随着大语言模型（Large Language Models, LLMs）的普及，防护机制（guardrails）在检测和防御有害内容方面变得至关重要。然而，随着LLMs在多语言场景中的日益广泛应用，其在处理多语言有害输入方面的有效性仍不明确。在本研究中，我们引入了一个全面的多语言测试套件，涵盖七个数据集和超过十种语言，以基准测试最先进的防护机制的性能。我们还研究了防护机制对近期破解技术的抵抗能力，并评估了上下文安全策略和语言资源可用性对防护机制性能的影响。我们的研究结果表明，现有的防护机制在处理多语言有害内容方面仍然无效，并且在面对破解提示时缺乏鲁棒性。本研究旨在识别防护机制的局限性，并构建在多语言场景中更为可靠和可信的大语言模型。

[NLP-16] AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLM s with Higher Success Rates in Fewer Attempts

【速读】：该论文试图解决大语言模型 (LLMs) 在面对无意义对抗性后缀 (gibberish adversarial suffixes) 时的脆弱性问题。解决方案的关键在于引入 AmpleGCG-Plus，这是一个增强版的生成模型，能够更高效地生成可定制的无意义对抗性后缀，从而提高攻击成功率 (ASR)。通过一系列探索性实验，论文确定了多种训练策略来改进无意义后缀的学习，使其在白盒和黑盒设置下对不同模型（如 Llama-2-7B-chat 和 GPT-4）的攻击效果显著提升，特别是在黑盒设置下对 GPT-4 的攻击成功率提高了三倍以上。此外，AmpleGCG-Plus 还能有效破解最新版本的 GPT-4o 系列模型，并揭示了针对新提出的电路断路器防御机制的漏洞。

链接: https://arxiv.org/abs/2410.22143
作者: Vishal Kumar,Zeyi Liao,Jaylen Jones,Huan Sun
关键词-EN: carefully crafted prompts, gibberish adversarial suffixes, remain vulnerable, vulnerable to jailbreaking, carefully crafted
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG~\citepliao2024amplegcg, demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies to improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, achieving increases in attack success rate (ASR) of up to 17% in the white-box setting against Llama-2-7B-chat, and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at similar rates to GPT-4, and, uncovers vulnerabilities against the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.
摘要：尽管大语言模型 (LLMs) 通常经过对齐，但它们仍然容易通过精心设计的自然语言提示或有趣的乱码对抗后缀被破解。然而，乱码 Token 虽然成功攻击了对齐的 LLMs，却相对较少受到关注。最近的工作，AmpleGCG~\citepliao2024amplegcg，展示了生成式模型可以快速生成大量可定制的乱码对抗后缀，用于任何有害查询，揭示了分布外 (OOD) 语言空间中的一系列对齐漏洞。为了引起更多关注，我们引入了 AmpleGCG-Plus，这是一个性能更强的版本，能够在更少的尝试中取得更好的效果。通过一系列探索性实验，我们确定了多种训练策略来改进乱码后缀的学习。我们的结果在严格的评估设置下得到验证，显示其在开放权重和闭源模型上均优于 AmpleGCG，在白盒设置下对 Llama-2-7B-chat 的攻击成功率 (ASR) 提高了多达 17%，在黑盒设置下对 GPT-4 的 ASR 提高了三倍以上。值得注意的是，AmpleGCG-Plus 对较新的 GPT-4o 系列模型的破解率与 GPT-4 相当，并揭示了针对最近提出的电路断路器防御的漏洞。我们公开发布了 AmpleGCG-Plus 及其收集的训练数据集。

[NLP-17] RankUp: Boosting Semi-Supervised Regression with an Auxiliary Ranking Classifier NEURIPS2024

【速读】：该论文试图解决现有半监督学习方法（如FixMatch及其变体）在分类任务中表现优异，但无法直接应用于回归任务的问题。解决方案的关键在于提出了一种名为RankUp的方法，通过将原始回归任务转换为排序问题，并与原始回归目标同时训练，从而使现有的半监督分类技术能够应用于回归任务。RankUp的核心在于引入了一个辅助的排序分类器，该分类器输出分类结果，从而能够与现有的半监督分类方法集成。此外，论文还提出了回归分布对齐（Regression Distribution Alignment, RDA）技术，通过优化伪标签的分布来进一步提升RankUp的性能。尽管方法简单，RankUp及其与RDA结合的版本在多个回归基准测试中均达到了最先进的性能。

链接: https://arxiv.org/abs/2410.22124
作者: Pin-Yen Huang,Szu-Wei Fu,Yu Tsao
关键词-EN: demonstrated impressive performance, demonstrated impressive, regression, semi-supervised learning techniques, semi-supervised learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at NeurIPS 2024 (Poster)

点击查看摘要

Abstract:State-of-the-art (SOTA) semi-supervised learning techniques, such as FixMatch and it’s variants, have demonstrated impressive performance in classification tasks. However, these methods are not directly applicable to regression tasks. In this paper, we present RankUp, a simple yet effective approach that adapts existing semi-supervised classification techniques to enhance the performance of regression tasks. RankUp achieves this by converting the original regression task into a ranking problem and training it concurrently with the original regression objective. This auxiliary ranking classifier outputs a classification result, thus enabling integration with existing semi-supervised classification methods. Moreover, we introduce regression distribution alignment (RDA), a complementary technique that further enhances RankUp’s performance by refining pseudo-labels through distribution alignment. Despite its simplicity, RankUp, with or without RDA, achieves SOTA results in across a range of regression benchmarks, including computer vision, audio, and natural language processing tasks. Our code and log data are open-sourced at this https URL.
摘要：当前最先进的（SOTA）半监督学习技术，如 FixMatch 及其变体，在分类任务中展示了令人印象深刻的性能。然而，这些方法并不直接适用于回归任务。本文提出了 RankUp，一种简单而有效的方法，通过将现有的半监督分类技术适应于回归任务，以提升其性能。RankUp 通过将原始回归任务转化为排序问题，并与原始回归目标同时训练，实现了这一目标。这种辅助的排序分类器输出一个分类结果，从而能够与现有的半监督分类方法集成。此外，我们引入了回归分布对齐（RDA），这是一种补充技术，通过分布对齐来优化伪标签，进一步提升了 RankUp 的性能。尽管 RankUp 方法简单，但无论是否结合 RDA，它都在一系列回归基准测试中达到了 SOTA 结果，涵盖了计算机视觉、音频和自然语言处理任务。我们的代码和日志数据已在以下链接开源：https URL。

[NLP-18] he Impact of Inference Acceleration Strategies on Bias of LLM s

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在推理加速过程中可能引入的人口统计学偏差问题。解决方案的关键在于深入且针对具体案例的评估，即在模型经过推理加速优化后，对其输出进行广泛的偏差分析。论文通过多种指标从多个角度探测模型输出的偏差，发现推理加速策略对偏差的影响复杂且不可预测，不同模型对同一加速策略和偏差类型的反应可能截然不同。因此，强调了在推理加速后对模型偏差进行细致评估的必要性。

链接: https://arxiv.org/abs/2410.22118
作者: Elisabeth Kirsten,Ivan Habernal,Vedant Nanda,Muhammad Bilal Zafar
关键词-EN: Large Language Models, Large Language, Language Models, unprecedented advances, advances in capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Last few years have seen unprecedented advances in capabilities of Large Language Models (LLMs). These advancements promise to deeply benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant change in bias. Worryingly, these bias effects are complex and unpredictable. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.
摘要：近年来，大语言模型（Large Language Model, LLM）的能力取得了前所未有的进步。这些进步有望为广泛的应用领域带来深远的影响。然而，由于其庞大的规模，LLM 的推理过程既昂贵又缓慢。因此，近期大量研究提出了各种提升推理效率的策略，如量化（quantization）、剪枝（pruning）和缓存（caching）。这些加速策略通过减少推理成本和延迟，通常达到数倍的提升，同时仍能保持通过常见基准测试衡量的预测性能。在本研究中，我们探讨了 LLM 性能的另一个关键方面：由于推理加速优化导致的模型生成中的群体偏见（demographic bias）。我们使用一系列指标，从多个角度检测模型输出中的偏见。对加速前后输出结果的分析显示，偏见程度发生了显著变化。令人担忧的是，这些偏见效应复杂且难以预测。一种加速策略与某种偏见类型的组合，在一个模型中可能表现出较小的偏见变化，而在另一个模型中可能导致较大的偏见效应。我们的研究结果强调，在模型经过修改以加速推理后，需要进行深入且针对具体情况的偏见评估。

[NLP-19] Protecting Privacy in Multimodal Large Language Models with MLLM U-Bench

【速读】：该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在记忆和泄露个人隐私数据方面的法律和伦理问题。解决方案的关键在于引入了一个名为多模态大语言模型遗忘基准 (Multimodal Large Language Model Unlearning Benchmark, MLLMU-Bench) 的新基准，用于推进对多模态机器遗忘 (multimodal machine unlearning) 的理解。MLLMU-Bench 包含 500 个虚构和 153 个公众人物的档案，每个档案有超过 14 个定制的问答对，从多模态 (图像+文本) 和单模态 (文本) 两个角度进行评估。该基准分为四个部分，用于评估遗忘算法在有效性、泛化性和模型实用性方面的表现。论文还提供了使用现有生成模型遗忘算法的基线结果，实验表明单模态遗忘算法在生成和填空任务中表现优异，而多模态遗忘方法在多模态输入的分类任务中表现更好。

链接: https://arxiv.org/abs/2410.22108
作者: Zheyuan Liu,Guangyao Dou,Mengzhao Jia,Zhaoxuan Tan,Qingkai Zeng,Yongle Yuan,Meng Jiang
关键词-EN: Large Language Models, Multimodal Large Language, Large Language, massive web corpora, disclose individuals’ confidential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages

点击查看摘要

Abstract:Generative models such as Large Language Models (LLM) and Multimodal Large Language models (MLLMs) trained on massive web corpora can memorize and disclose individuals’ confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLM via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles for public celebrities, each profile feature over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets to assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
摘要：生成式模型，如大语言模型 (LLM) 和多模态大语言模型 (MLLMs)，在经过大规模网络语料库训练后，能够记忆并泄露个人的机密和隐私数据，引发法律和伦理问题。尽管许多先前的工作已通过机器遗忘技术解决了 LLM 中的这一问题，但对于 MLLMs 的研究仍处于初步阶段。为应对这一挑战，我们引入了多模态大语言模型遗忘基准 (MLLMU-Bench)，这是一个旨在推进多模态机器遗忘理解的新型基准。MLLMU-Bench 包含 500 个虚构人物档案和 153 个公众人物档案，每个档案包含超过 14 个定制的问答对，从多模态（图像+文本）和单模态（文本）两个角度进行评估。该基准分为四个部分，用于评估遗忘算法在有效性、泛化能力和模型实用性方面的表现。最后，我们使用现有的生成式模型遗忘算法提供了基线结果。令人惊讶的是，我们的实验表明，单模态遗忘算法在生成和填空任务中表现优异，而多模态遗忘方法在处理多模态输入的分类任务时表现更佳。

[NLP-20] Joint Extraction and Classification of Danish Competences for Job Matching

【速读】：该论文试图解决从丹麦语职位发布中自动提取和分类能力（如技能、职业或知识）的问题，以提高招聘人员在寻找适合职位空缺的候选人时的效率。解决方案的关键在于开发了一种基于BERT架构的单一模型，该模型能够同时进行能力的提取和分类，并且在大规模标注的丹麦语语料库上进行训练。这一模型不仅能够提取多种类别的丹麦语能力，而且在实际应用场景中，其整体性能优于现有最先进模型，并且在推理时间上节省了超过50%的时间。

链接: https://arxiv.org/abs/2410.22103
作者: Qiuchi Li,Christina Lioma
关键词-EN: key desiderata, Danish competence extraction, Danish job postings, Danish, Danish job
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The matching of competences, such as skills, occupations or knowledges, is a key desiderata for candidates to be fit for jobs. Automatic extraction of competences from CVs and Jobs can greatly promote recruiters’ productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competence from Danish job postings. Different from existing works on skill extraction and skill classification, our model is trained on a large volume of annotated Danish corpora and is capable of extracting a wide range of Danish competences, including skills, occupations and knowledges of different categories. More importantly, as a single BERT-like architecture for joint extraction and classification, our model is lightweight and efficient at inference. On a real-scenario job matching dataset, our model beats the state-of-the-art models in the overall performance of Danish competence extraction and classification, and saves over 50% time at inference.
摘要：技能、职业或知识的匹配是候选人适合工作的关键需求。从简历和职位描述中自动提取这些能力可以极大地提高招聘人员在寻找适合职位空缺的候选人时的效率。本文首次提出了一个从丹麦职位发布中联合提取和分类能力的模型。与现有的技能提取和分类工作不同，我们的模型基于大量标注的丹麦语语料库进行训练，能够提取包括技能、职业和不同类别知识在内的广泛丹麦语能力。更重要的是，作为一个单一的类似 BERT 架构，用于联合提取和分类，我们的模型在推理时既轻量又高效。在一个真实的职位匹配数据集上，我们的模型在丹麦语能力提取和分类的整体性能上超越了最先进的模型，并且在推理时间上节省了超过 50%。

[NLP-21] Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate

【速读】：该论文试图解决机器遗忘（machine unlearning）问题，即如何从大型语言模型（LLMs）中移除不需要的知识。解决方案的关键在于将机器遗忘问题框架化为一个正则化的多任务优化问题，其中一项任务优化遗忘目标，另一项任务优化模型性能。论文引入了一种归一化梯度差分（normalized gradient difference, NGDiff）算法，该算法能够更好地控制遗忘目标与模型性能之间的权衡，并集成了一种新的自动学习率调度器。通过理论分析和在TOFU和MUSE数据集上的实验，论文展示了NGDiff在现有最先进的遗忘方法中的优越性能，同时表现出稳定的训练过程。

链接: https://arxiv.org/abs/2410.22086
作者: Zhiqi Bu,Xiaomeng Jin,Bhanukiran Vinzamuri,Anil Ramakrishna,Kai-Wei Chang,Volkan Cevher,Mingyi Hong
关键词-EN: remove unwanted knowledge, unwanted knowledge acquired, large language models, Machine unlearning, examine machine unlearning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine unlearning has been used to remove unwanted knowledge acquired by large language models (LLMs). In this paper, we examine machine unlearning from an optimization perspective, framing it as a regularized multi-task optimization problem, where one task optimizes a forgetting objective and another optimizes the model performance. In particular, we introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives, while integrating a new, automatic learning rate scheduler. We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets while exhibiting stable training.
摘要：机器遗忘（Machine unlearning）已被用于移除大语言模型（LLMs）中不希望保留的知识。本文从优化的角度探讨机器遗忘，将其构建为一个正则化的多任务优化问题，其中一个任务优化遗忘目标，另一个任务优化模型性能。特别地，我们引入了一种归一化梯度差分（Normalized Gradient Difference, NGDiff）算法，使我们能够更好地控制目标之间的权衡，同时整合了一种新的自动学习率调度器。我们提供了理论分析，并在TOFU和MUSE数据集上通过实验展示了NGDiff在现有最先进的遗忘方法中的优越性能，同时表现出稳定的训练过程。

[NLP-22] An Actor-Critic Approach to Boosting Text-to-SQL Large Language Model

【速读】：该论文试图解决现有Text-To-SQL (T2S)转换方法缺乏理论性能保证的问题。解决方案的关键在于提出了一种名为Actor-Critic (AC)的简单、通用且性能有保证的增强方法。具体来说，该方法利用同一个大型语言模型（LLM）设计了两个角色：Actor负责生成SQL查询，Critic负责评估生成的SQL。如果Critic认为生成的SQL不正确，它会通知Actor重新生成并再次评估。通过这种简单的迭代过程，理论上可以推导出预期的性能提升。实验结果表明，Actor-Critic方法在Spider及相关数据集上显著提升了T2S转换的性能，证明了其作为T2S转换通用增强方法的有效性。

链接: https://arxiv.org/abs/2410.22082
作者: Ziyang Zheng,Haipeng Jing,Canyu Rui,Askar Hamdulla,Dong Wang
关键词-EN: large language models, query intent expressed, language models, large language, natural language
类目: Databases (cs.DB); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Text-To-SQL (T2S) conversion based on large language models (LLMs) has found a wide range of applications, by leveraging the capabilities of LLMs in interpreting the query intent expressed in natural language. Existing research focuses on suitable representations for data schema and/or questions, task-specific instructions and representative examples, and complicated inference pipelines. All these methods are empirical and task specific, without a theoretical bound on performance. In this paper, we propose a simple, general, and performance guaranteed T2S enhancement approach called Actor-Critic (AC). Specifically, we design two roles using the same LLM: an Actor to produce SQL queries and a Critic to evaluate the produced SQL. If the Critic believes the produced SQL is wrong, it notifies the Actor to reproduce the SQL and perform evaluation again. By this simple iterative process, expected performance can be derived in theory. We conducted extensive experiments on the Spider and related datasets with eleven LLMs, and demonstrated that the Actor-Critic method consistently improves the performance of T2S, thus serving as a general enhancement approach for T2S conversion.
摘要：基于大语言模型 (LLM) 的文本到 SQL (T2S) 转换技术在多个领域得到了广泛应用，这得益于 LLM 在解释自然语言表达的查询意图方面的能力。现有研究主要集中在数据模式和/或问题的合适表示、任务特定的指令和代表性示例，以及复杂的推理流程上。这些方法都是经验性的，且针对特定任务，缺乏性能的理论界限。本文提出了一种简单、通用且性能有保障的 T2S 增强方法，称为 Actor-Critic (AC)。具体而言，我们设计了两个角色，均使用同一个 LLM：一个 Actor 用于生成 SQL 查询，一个 Critic 用于评估生成的 SQL。如果 Critic 认为生成的 SQL 有误，它会通知 Actor 重新生成 SQL 并再次进行评估。通过这种简单的迭代过程，理论上可以推导出预期的性能。我们在 Spider 及相关数据集上对十一个 LLM 进行了广泛的实验，结果表明 Actor-Critic 方法持续提升了 T2S 的性能，从而成为一种通用的 T2S 转换增强方法。

[NLP-23] Choosy Babies Need One Coach: Inducing Mode-Seeking Behavior in BabyLlama with Reverse KL Divergence

【速读】：该论文试图解决在BabyLM Challenge的Strict-Small Track中，如何通过教师-学生蒸馏（teacher-student distillation）方法优化BabyLLaMa模型的性能问题。解决方案的关键在于采用反向Kullback-Leibler散度（reverse Kullback-Leibler divergence）作为目标函数，以促使学生模型表现出模式寻求（mode-seeking）而非模式平均（mode-averaging）的行为。此外，论文还探讨了使用单一教师模型（single teacher）而非多个教师模型的效果，并结合先进的优化策略来进一步提升蒸馏过程的效果。实验结果表明，在反向KL散度下，单一教师模型在大多数任务中表现优于或与多教师模型相当，且引入高级优化技术进一步增强了模型性能，验证了所提出方法的有效性和鲁棒性。

链接: https://arxiv.org/abs/2410.22081
作者: Shaozhen Shi,Yevgen Matusevych,Malvina Nissim
关键词-EN: BabyLM Challenge, Strict-Small Track, study presents, presents our submission, Timiryasov and Tastet
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents our submission to the Strict-Small Track of the 2nd BabyLM Challenge. We use a teacher-student distillation setup with the BabyLLaMa model (Timiryasov and Tastet, 2023) as a backbone. To make the student’s learning process more focused, we replace the objective function with a reverse Kullback-Leibler divergence, known to cause mode-seeking (rather than mode-averaging) behaviour in computational learners. We further experiment with having a single teacher (instead of an ensemble of two teachers) and implement additional optimization strategies to improve the distillation process. Our experiments show that under reverse KL divergence, a single-teacher model often outperforms or matches multiple-teacher models across most tasks. Additionally, incorporating advanced optimization techniques further enhances model performance, demonstrating the effectiveness and robustness of our proposed approach. These findings support our idea that “choosy babies need one coach”.
摘要：本研究展示了我们在第二届BabyLM挑战赛Strict-Small赛道中的提交成果。我们采用了教师-学生蒸馏设置，以BabyLLaMa模型（Timiryasov和Tastet，2023）作为骨干。为了使学生的学习过程更加专注，我们将目标函数替换为反向Kullback-Leibler散度，该散度在计算学习者中已知会导致模式寻求（而非模式平均）行为。我们进一步实验了使用单一教师（而非两个教师的集成），并实施了额外的优化策略以改进蒸馏过程。我们的实验表明，在反向KL散度下，单一教师模型在大多数任务中往往优于或与多教师模型表现相当。此外，结合先进的优化技术进一步提升了模型性能，展示了我们提出的方法的有效性和鲁棒性。这些发现支持了我们的观点：“挑剔的宝宝需要一位教练”。

[NLP-24] Distinguishing Ignorance from Error in LLM Hallucinations

【速读】：该论文试图解决大型语言模型（LLMs）在闭卷问答（CBQA）中产生的幻觉问题，特别是区分两种不同类型的幻觉：模型参数中没有正确答案（情况1）和模型虽然拥有所需知识但回答错误（情况2）。解决方案的关键在于引入了一种名为“尽管拥有正确知识但回答错误”（Wrong Answer despite having Correct Knowledge, WACK）的方法，用于构建针对第二种幻觉类型的模型特定数据集。通过这种方法，研究者能够更好地区分这两种幻觉，并展示了在WACK数据集上训练的探测器在检测情况2幻觉方面比使用通用数据集更为有效。

链接: https://arxiv.org/abs/2410.22071
作者: Adi Simhi,Jonathan Herzig,Idan Szpektor,Yonatan Belinkov
关键词-EN: Large language models, Large language, close-book Question Answering, factually incorrect, prior generations
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are susceptible to hallucinations-outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on close-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model’s internal computation, as the knowledge resides within the model’s parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model’s inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets. The code is available at this https URL .
摘要：大语言模型（LLMs）容易产生幻觉——即输出内容缺乏依据、事实错误或与先前生成的内容不一致。我们专注于闭卷问答（CBQA），在此领域中，先前的工作并未充分区分两种可能的幻觉类型：模型（1）在其参数中不持有正确答案，或（2）尽管拥有所需知识却回答错误。我们认为，区分这两种情况对于检测和缓解幻觉至关重要。具体而言，情况（2）可能通过干预模型的内部计算来缓解，因为知识存在于模型的参数中。相比之下，情况（1）则无法利用参数知识进行缓解，因此应通过求助于外部知识源或选择放弃来解决。为了帮助区分这两种情况，我们引入了“尽管拥有正确知识却给出错误答案”（WACK）方法，用于构建针对第二种幻觉类型的模型特定数据集。我们的探测实验表明，这两种幻觉在模型的内部状态中以不同的方式表现。接下来，我们展示了使用WACK构建的数据集在不同模型之间存在差异，表明即使模型共享某些事实的知识，它们在导致幻觉的具体示例上仍有所不同。最后，我们证明，在WACK数据集上训练的探测器在检测情况（2）幻觉方面比使用通用的一刀切数据集效果更好。代码可在以下链接获取：https URL。

[NLP-25] Sing it Narrate it: Quality Musical Lyrics Translation

【速读】：该论文试图解决音乐剧歌词翻译中翻译质量与可唱性之间的平衡问题。解决方案的关键在于：首先，创建一个用于训练奖励模型（reward models）的数据集，以自动评估翻译质量；其次，通过两阶段训练过程和过滤技术，同时提升翻译质量和可唱性；最后，引入一个推断时优化框架（inference-time optimization framework），用于整首歌曲的翻译。实验结果表明，该方法在自动和人工评估中均显著优于基线方法，验证了各组件的有效性。

链接: https://arxiv.org/abs/2410.22066
作者: Zhuorui Ye,Jinhan Li,Rongwu Xu
关键词-EN: presents unique challenges, unique challenges due, musicals presents unique, ensure high translation, translation quality
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Translating lyrics for musicals presents unique challenges due to the need to ensure high translation quality while adhering to singability requirements such as length and rhyme. Existing song translation approaches often prioritize these singability constraints at the expense of translation quality, which is crucial for musicals. This paper aims to enhance translation quality while maintaining key singability features. Our method consists of three main components. First, we create a dataset to train reward models for the automatic evaluation of translation quality. Second, to enhance both singability and translation quality, we implement a two-stage training process with filtering techniques. Finally, we introduce an inference-time optimization framework for translating entire songs. Extensive experiments, including both automatic and human evaluations, demonstrate significant improvements over baseline methods and validate the effectiveness of each component in our approach.
摘要：为音乐剧翻译歌词面临着独特的挑战，需要在确保高翻译质量的同时，遵守歌唱性要求，如长度和韵律。现有的歌曲翻译方法往往优先考虑这些歌唱性约束，而牺牲了翻译质量，这对于音乐剧至关重要。本文旨在提高翻译质量的同时，保持关键的歌唱性特征。我们的方法由三个主要部分组成。首先，我们创建了一个数据集，用于训练奖励模型，以自动评估翻译质量。其次，为了同时提升歌唱性和翻译质量，我们实施了一个两阶段的训练过程，并结合过滤技术。最后，我们引入了一个推理时优化框架，用于翻译整首歌曲。广泛的实验，包括自动和人工评估，均显示出相较于基线方法的显著改进，并验证了我们方法中每个组件的有效性。

[NLP-26] Are VLMs Really Blind

【速读】：该论文试图解决视觉语言模型在处理低级基本视觉任务（如几何推理）时表现不佳的问题。解决方案的关键在于提出了一种新颖的自动流程，通过使用问题派生的关键词生成图像描述，从而突出与问题相关的图像细节。这种方法避免了直接依赖视觉问答（VQA），而是利用生成的描述来引导语言模型提供精确答案，无需外部微调。

链接: https://arxiv.org/abs/2410.22029
作者: Ayush Singh,Mansi Gupta,Shivank Garg
关键词-EN: Optical Character Recognition, including Optical Character, Visual Question Answering, Character Recognition, Vision Language Models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 2 pages, 1 figure

点击查看摘要

Abstract:Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly “blind” to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
摘要：视觉语言模型在处理包括光学字符识别（OCR）、视觉问答（VQA）以及高级几何推理在内的广泛复杂任务方面表现出色。然而，这些模型在处理人类特别擅长的低级基本视觉任务时表现不佳。本研究的目标是探究这些模型是否真的在几何推理方面“失明”，或者是否存在提升其在这方面能力的方法。我们的工作提出了一种新颖的自动化流程，旨在根据特定问题从图像中提取关键信息。我们不仅依赖于直接的视觉问答，而是利用问题衍生的关键词生成一个强调图像中与问题相关重要细节的描述。随后，该描述被语言模型用于提供精确的答案，而无需外部微调。

[NLP-27] Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation

【速读】：该论文试图解决多语言检索增强语言模型（RALMs）在处理全球知识时面临的语言多样性问题。解决方案的关键在于提出了一个名为Futurepedia的精心设计的基准测试，该基准包含八种代表性语言的平行文本。通过评估六种多语言RALMs在该基准上的表现，研究揭示了语言不平等现象，并提出了改进多语言检索增强生成的建议，包括在单语言知识提取中注意低资源语言到高资源语言的级联错误，在跨语言知识转移中鼓励RALMs在不同语言的文档中直接提供答案，以及在多语言知识选择中通过增加非英语文档和重新定位英语文档来减轻RALMs的选择偏差。

链接: https://arxiv.org/abs/2410.21970
作者: Suhang Wu,Jialong Tang,Baosong Yang,Ante Wang,Kaidi Jia,Jiawei Yu,Junfeng Yao,Jinsong Su
关键词-EN: Retrieval-Augmented Language Models, external textual resources, Language Models, incorporating external textual, textual resources
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:RALMs (Retrieval-Augmented Language Models) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge necessitates RALMs to handle diverse languages, a topic that has received limited research focus. In this work, we propose \textitFuturepedia, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs’ selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs’ selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
摘要：检索增强语言模型 (Retrieval-Augmented Language Models, RALMs) 通过整合外部文本资源来扩展其知识范围。然而，全球知识的多样语言特性要求 RALMs 能够处理多种语言，这一课题目前研究关注较少。在本研究中，我们提出了 \textitFuturepedia，这是一个精心设计的基准测试，包含跨越八种代表性语言的平行文本。我们使用该基准测试评估了六种多语言 RALMs，以探讨多语言 RALMs 面临的挑战。实验结果揭示了语言间的不平等现象：1) 高资源语言在单语言知识提取中表现突出；2) 印欧语系语言引导 RALMs 直接从文档中提供答案，减轻了跨语言表达答案的挑战；3) 英语得益于 RALMs 的选择偏差，在多语言知识选择中占据优势。基于这些发现，我们提出了改进多语言检索增强生成 (Retrieval Augmented Generation) 的建议。对于单语言知识提取，必须注意从低资源语言翻译到高资源语言时可能产生的级联错误。在跨语言知识转移中，鼓励 RALMs 在不同语言的文档中提供答案可以提高转移性能。对于多语言知识选择，增加非英语文档的数量并重新定位英语文档有助于减轻 RALMs 的选择偏差。通过全面的实验，我们强调了多语言 RALMs 固有的复杂性，并为未来的研究提供了宝贵的见解。

[NLP-28] SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types NEURIPS2024

【速读】：该论文试图解决当前大型语言模型（LLM）安全基准的两个主要局限性：一是仅关注判别性或生成性评估范式，忽视了两者之间的联系；二是依赖标准化的输入，忽略了广泛使用的提示技术（如系统提示、少样本示例和思维链提示）对模型安全性的影响。解决方案的关键在于开发了SG-Bench，这是一个新型基准，用于评估LLM在各种任务和提示类型下的安全性泛化能力。SG-Bench整合了生成性和判别性评估任务，并扩展了数据集以考察提示工程和越狱对LLM安全性的影响。通过评估3个先进的专有LLM和10个开源LLM，研究发现大多数LLM在判别性任务上的表现不如生成性任务，且对提示高度敏感，表明在安全性对齐方面泛化能力较差。

链接: https://arxiv.org/abs/2410.21965
作者: Yutao Mou,Shikun Zhang,Wei Ye
关键词-EN: large language model, trustworthy artificial intelligence, developing trustworthy artificial, language model, applications is essential
类目: Computation and Language (cs.CL)
备注: Accepted by NeurIPS2024 (Dataset and Benchmark Track)

点击查看摘要

Abstract:Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark to assess the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment. We also explain these findings quantitatively and qualitatively to provide insights for future research.
摘要：确保大语言模型 (LLM) 应用的安全性对于发展可信赖的人工智能至关重要。当前的 LLM 安全基准存在两个主要局限性。首先，它们仅关注判别或生成评估范式中的某一种，而忽视了两者之间的联系。其次，它们依赖于标准化的输入，忽略了广泛使用的提示技术（如系统提示、少样本演示和思维链提示）的影响。为解决这些问题，我们开发了 SG-Bench，这是一个新颖的基准，用于评估 LLM 安全在各种任务和提示类型中的泛化能力。该基准整合了生成和判别评估任务，并包含扩展数据以考察提示工程和越狱对 LLM 安全性的影响。我们对 3 个先进的专有 LLM 和 10 个开源 LLM 进行的基准测试显示，大多数 LLM 在判别任务上的表现不如生成任务，并且对提示高度敏感，表明在安全对齐方面泛化能力较差。我们还通过定量和定性的方式解释了这些发现，以提供未来研究的见解。

[NLP-29] Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications

【速读】：该论文试图解决大型语言模型（LLMs）在工业领域应用中缺乏特定领域知识和容易产生幻觉的问题。解决方案的关键在于将多模态模型（multimodal models）集成到检索增强生成（Retrieval Augmented Generation, RAG）系统中，以提高其在处理工业领域文档中包含的文本和图像时的性能。论文通过一系列实验，比较了两种图像处理和检索方法（多模态嵌入和图像文本摘要生成）以及两种语言模型（GPT4-Vision和LLaVA）的效果，发现利用图像生成的文本摘要比多模态嵌入更具潜力，能够有效提升RAG系统的性能。

链接: https://arxiv.org/abs/2410.21943
作者: Monica Riedler,Stefan Langer
关键词-EN: Large Language Models, Large Language, demonstrated impressive capabilities, lack domain-specific knowledge, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.
摘要：大语言模型 (LLMs) 在回答问题方面展示了令人印象深刻的能力，但它们缺乏领域特定的知识，并且容易产生幻觉。检索增强生成 (RAG) 是一种解决这些挑战的方法，而多模态模型作为处理文本和图像的有前途的 AI 助手正在崭露头角。本文描述了一系列实验，旨在确定如何最佳地将多模态模型集成到工业领域的 RAG 系统中。实验的目的是确定在工业领域文档中同时包含图像和文本是否能提高 RAG 的性能，并找到这种多模态 RAG 系统的最佳配置。我们的实验包括两种图像处理和检索方法，以及两种大语言模型 (GPT4-Vision 和 LLaVA) 用于答案合成。这些图像处理策略涉及使用多模态嵌入和从图像生成文本摘要。我们采用 LLM-as-a-Judge 方法评估实验。结果显示，多模态 RAG 可以超越单一模态的 RAG 设置，尽管图像检索比文本检索更具挑战性。此外，利用图像的文本摘要比使用多模态嵌入提供了更有前景的方法，为未来的进步提供了更多机会。

[NLP-30] SceneGenAgent : Precise Industrial Scene Generation with Coding Agent

【速读】：该论文试图解决工业场景建模中精确测量和布局规划的挑战，特别是在使用大型语言模型 (LLMs) 生成工业场景时。解决方案的关键是引入了一个基于LLM的代理工具——SceneGenAgent，它通过C#代码生成工业场景，确保布局规划的精确性和可计算性。SceneGenAgent通过结构化格式、布局验证和迭代优化来满足工业场景的量化要求。此外，论文还构建了SceneInstruct数据集，用于微调开源LLMs，以进一步提高其在SceneGenAgent中的集成性能。

链接: https://arxiv.org/abs/2410.21909
作者: Xiao Xia,Dan Zhang,Zibo Liao,Zhenyu Hou,Tianrui Sun,Jing Li,Ling Fu,Yuxiao Dong
关键词-EN: essential for simulations, generating industrial scenes, industrial scenes, industrial, industrial manufacturing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at this https URL .
摘要：工业场景的建模对于工业制造中的仿真至关重要。尽管大语言模型（LLMs）在从文本描述生成一般3D场景方面取得了显著进展，但生成工业场景对LLMs提出了独特的挑战，因为工业场景需要精确的测量和定位，这要求对空间布局进行复杂的规划。为了应对这一挑战，我们引入了SceneGenAgent，这是一个基于LLM的智能体，通过C#代码生成工业场景。SceneGenAgent通过结构化和可计算的格式、布局验证以及迭代优化来确保精确的布局规划，以满足工业场景的定量需求。实验结果表明，由SceneGenAgent赋能的LLMs在实际工业场景生成任务中的成功率高达81.0%，有效满足了大多数场景生成需求。为了进一步提升其易用性，我们构建了SceneInstruct数据集，旨在微调开源LLMs以集成到SceneGenAgent中。实验显示，在SceneInstruct上微调开源LLMs显著提升了性能，Llama3.1-70B的能力接近GPT-4o。我们的代码和数据可通过此https URL获取。

[NLP-31] A Longitudinal Analysis of Racial and Gender Bias in New York Times and Fox News Images and Articles

【速读】：该论文试图解决新闻媒体中不同种族和性别群体的呈现方式及其对公众意见的影响问题。解决方案的关键在于提出了两个机器学习分类器，用于检测图像中人物的种族和年龄，并构建了一个包含123,337张图片和441,321篇在线新闻文章的数据集，分别来自《纽约时报》(NYT)和福克斯新闻(Fox)。通过两种计算方法分析了这些数据：首先，分析了新闻文章中嵌入图片的种族和性别群体的出现频率和显著性，发现少数种族和性别群体在新闻中的代表性不足，且出现时通常不如多数群体显著；其次，分析了文章文本中少数种族群体的出现频率和上下文，揭示了某些种族群体报道范围的狭窄以及不同群体在冲突中被描述为受害者或加害者的频率。这些分析不仅提供了两个新的开源分类器，还揭示了新闻文章中存在的种族和性别偏见，特别是在美国政治光谱两端的新闻媒体中。

链接: https://arxiv.org/abs/2410.21898
作者: Hazem Ibrahim,Nouar AlDahoul,Syed Mustafa Ali Abbasi,Fareed Zaffar,Talal Rahwan,Yasir Zaki
关键词-EN: shaping public opinion, groups, gender groups, racial minority groups, racial and gender
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, and 11 figures

点击查看摘要

Abstract:The manner in which different racial and gender groups are portrayed in news coverage plays a large role in shaping public opinion. As such, understanding how such groups are portrayed in news media is of notable societal value, and has thus been a significant endeavour in both the computer and social sciences. Yet, the literature still lacks a longitudinal study examining both the frequency of appearance of different racial and gender groups in online news articles, as well as the context in which such groups are discussed. To fill this gap, we propose two machine learning classifiers to detect the race and age of a given subject. Next, we compile a dataset of 123,337 images and 441,321 online news articles from New York Times (NYT) and Fox News (Fox), and examine representation through two computational approaches. Firstly, we examine the frequency and prominence of appearance of racial and gender groups in images embedded in news articles, revealing that racial and gender minorities are largely under-represented, and when they do appear, they are featured less prominently compared to majority groups. Furthermore, we find that NYT largely features more images of racial minority groups compared to Fox. Secondly, we examine both the frequency and context with which racial minority groups are presented in article text. This reveals the narrow scope in which certain racial groups are covered and the frequency with which different groups are presented as victims and/or perpetrators in a given conflict. Taken together, our analysis contributes to the literature by providing two novel open-source classifiers to detect race and age from images, and shedding light on the racial and gender biases in news articles from venues on opposite ends of the American political spectrum.
摘要：新闻报道中不同种族和性别群体的呈现方式在塑造公众意见方面起着重要作用。因此，理解这些群体在新闻媒体中的呈现方式具有显著的社会价值，这使得相关研究在计算机科学和社会科学领域都显得尤为重要。然而，现有文献仍缺乏一项纵向研究，该研究不仅考察不同种族和性别群体在在线新闻文章中的出现频率，还探讨这些群体被讨论的背景。为了填补这一空白，我们提出了两个机器学习分类器，用于检测给定主体的种族和年龄。接下来，我们编译了一个包含123,337张图片和441,321篇来自《纽约时报》(NYT)和福克斯新闻(Fox)的在线新闻文章的数据集，并通过两种计算方法来考察这些群体的呈现情况。首先，我们考察了新闻文章中嵌入的图片中种族和性别群体的出现频率和显著性，结果显示，种族和性别少数群体在很大程度上被低估，并且当他们出现时，与多数群体相比，他们的显著性较低。此外，我们发现NYT比Fox更多地展示了种族少数群体的图片。其次，我们考察了文章文本中种族少数群体的出现频率及其被讨论的背景。这揭示了某些种族群体报道范围的狭窄性，以及不同群体在特定冲突中被呈现为受害者或加害者的频率。总的来说，我们的分析通过提供两个新颖的开源分类器来检测图片中的种族和年龄，并揭示了来自美国政治光谱两端的新闻文章中的种族和性别偏见，从而为相关文献做出了贡献。

[NLP-32] Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models

【速读】：该论文试图解决在数据集较小的情况下，基于Transformer的符号回归模型性能下降和过拟合问题。解决方案的关键在于应用k-fold交叉验证技术，将训练数据划分为多个子集，通过迭代训练和验证来提高模型的泛化能力。实验结果表明，这种方法显著提升了模型在验证集上的损失表现，相对改善了53.31%，从而在资源受限的环境中实现了更高效和可行的符号回归。

链接: https://arxiv.org/abs/2410.21896
作者: Kaustubh Kislay,Shlok Singh,Soham Joshi,Rohan Dutta,Jay Shim George Flint,Kevin Zhu
关键词-EN: extensive research focusing, Symbolic Regression, Symbolic Regression remains, NP-Hard problem, remains an NP-Hard
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Symbolic Regression remains an NP-Hard problem, with extensive research focusing on AI models for this task. Transformer models have shown promise in Symbolic Regression, but performance suffers with smaller datasets. We propose applying k-fold cross-validation to a transformer-based symbolic regression model trained on a significantly reduced dataset (15,000 data points, down from 500,000). This technique partitions the training data into multiple subsets (folds), iteratively training on some while validating on others. Our aim is to provide an estimate of model generalization and mitigate overfitting issues associated with smaller datasets. Results show that this process improves the model’s output consistency and generalization by a relative improvement in validation loss of 53.31%. Potentially enabling more efficient and accessible symbolic regression in resource-constrained environments.
摘要：符号回归（Symbolic Regression）仍然是一个 NP-难问题，相关研究主要集中在针对此任务的 AI 模型上。Transformer 模型在符号回归中显示出潜力，但在小数据集上的表现不佳。我们提出将 k 折交叉验证（k-fold cross-validation）应用于基于 Transformer 的符号回归模型，该模型在显著减少的数据集（从 500,000 个数据点减少到 15,000 个数据点）上进行训练。该技术将训练数据划分为多个子集（折），并迭代地在某些子集上训练，同时在其他子集上验证。我们的目标是提供模型泛化能力的估计，并缓解小数据集相关的过拟合问题。结果显示，这一过程通过验证损失相对改善 53.31%，提高了模型输出的稳定性和泛化能力。这可能使得在资源受限的环境中实现更高效且易用的符号回归成为可能。

[NLP-33] Improving In-Context Learning with Small Language Model Ensembles NEURIPS2024

【速读】：该论文试图解决大型语言模型（LLMs）在特定领域任务中的性能局限问题。解决方案的关键是提出了一种名为Ensemble SuperICL的新方法，该方法通过利用多个经过微调的小型语言模型（SLMs）的专业知识来增强上下文学习（ICL）。Ensemble SuperICL不仅在多个自然语言理解基准测试中达到了最先进（SoTA）的结果，还在医疗领域的标签任务中展示了其实用性，通过使用现成的经过一般语言任务微调的SLMs，在大规模数据标注中实现了优于所有基线的准确性。此外，论文通过消融研究和敏感性分析阐明了Ensemble SuperICL的底层机制，为实践者提供了一种廉价且有效的领域专业化方法。

链接: https://arxiv.org/abs/2410.21868
作者: M. Mehdi Mojarradi,Lingyi Yang,Robert McCraith,Adam Mahdi
关键词-EN: shown impressive capabilities, Large language models, tasks remains limited, remains limited, domain-specific tasks remains
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2024 Workshop on Adaptive Foundation Models

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive capabilities across various tasks, but their performance on domain-specific tasks remains limited. While methods like retrieval augmented generation and fine-tuning can help to address this, they require significant resources. In-context learning (ICL) is a cheap and efficient alternative but cannot match the accuracies of advanced methods. We present Ensemble SuperICL, a novel approach that enhances ICL by leveraging the expertise of multiple fine-tuned small language models (SLMs). Ensemble SuperICL achieves state of the art (SoTA) results on several natural language understanding benchmarks. Additionally, we test it on a medical-domain labelling task and showcase its practicality by using off-the-shelf SLMs fine-tuned on a general language task, achieving superior accuracy in large-scale data labelling compared to all baselines. Finally, we conduct an ablation study and sensitivity analyses to elucidate the underlying mechanism of Ensemble SuperICL. Our research contributes to the growing demand for efficient domain specialisation methods in LLMs, offering a cheap and effective method for practitioners.
摘要：大语言模型（LLMs）在各种任务中展现了令人印象深刻的能力，但在特定领域任务中的表现仍然有限。尽管检索增强生成和微调等方法可以解决这一问题，但它们需要大量的资源。上下文学习（ICL）是一种廉价且高效的替代方案，但其准确性无法与先进方法相媲美。我们提出了集成超ICL（Ensemble SuperICL），这是一种新颖的方法，通过利用多个微调的小语言模型（SLMs）的专业知识来增强ICL。集成超ICL在多个自然语言理解基准测试中达到了最先进（SoTA）的结果。此外，我们在一项医疗领域标注任务中测试了该方法，并通过使用经过通用语言任务微调的现成SLMs，展示了其在实际应用中的可行性，在大规模数据标注中实现了比所有基线方法更高的准确性。最后，我们进行了消融研究和敏感性分析，以阐明集成超ICL的内在机制。我们的研究为大语言模型中日益增长的高效领域专业化方法需求做出了贡献，为从业者提供了一种廉价且有效的方法。

[NLP-34] Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

【速读】：该论文试图解决远距离麦克风会议转录中的噪声和混响问题，其解决方案的关键在于引入了一种联合波束成形（beamforming）和说话人属性自动语音识别（SA-ASR）的方法。具体来说，论文首先描述了一种数据对齐和增强方法，用于在真实会议数据上预训练神经波束成形器。随后，比较了固定、混合和全神经波束成形器作为SA-ASR模型的前端的效果。最终，论文提出了联合优化全神经波束成形器和SA-ASR模型的方法。实验结果表明，尽管基于多帧跨通道注意力的通道融合方法未能提升ASR性能，但通过在固定波束成形器输出上微调SA-ASR模型以及联合微调SA-ASR模型与神经波束成形器，分别使得词错误率相对降低了8%和9%。

链接: https://arxiv.org/abs/2410.21849
作者: Can Cui(MULTISPEECH),Imran Ahamad Sheikh,Mostafa Sadeghi(MULTISPEECH),Emmanuel Vincent(MULTISPEECH)
关键词-EN: Distant-microphone meeting transcription, Distant-microphone meeting, challenging task, Distant-microphone, SA-ASR
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that,while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer’s output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
摘要：远距离麦克风会议转录是一项具有挑战性的任务。目前最先进的端到端说话人属性自动语音识别（SA-ASR）架构缺乏多通道噪声和混响抑制前端，这限制了其性能。本文介绍了一种联合波束形成和SA-ASR的方法用于实际会议转录。首先，我们描述了一种数据对齐和增强方法，用于在实际会议数据上预训练神经波束形成器。然后，我们比较了固定、混合和全神经波束形成器作为SA-ASR模型的前端。最后，我们联合优化了全神经波束形成器和SA-ASR模型。在实际的AMI语料库上的实验表明，尽管基于多帧跨通道注意力的通道融合未能提高ASR性能，但在固定波束形成器的输出上微调SA-ASR以及联合微调SA-ASR与神经波束形成器分别将词错误率降低了8%和9%。

[NLP-35] Multi-aspect Depression Severity Assessment via Inductive Dialogue System

【速读】：该论文试图解决在患者对话中进行多方面抑郁症严重程度评估的问题，并提出了一种新颖的任务——通过归纳式对话系统（MaDSA）进行多方面抑郁症严重程度评估。解决方案的关键在于结合评估辅助的响应生成，通过引入辅助情感分类任务和层次化的严重程度评估结构，生成具有心理支持性的对话响应。此外，论文还构建了一个包含八个抑郁症严重程度方面和情感标签的对话数据集，并通过人工评估验证了其鲁棒性。实验结果表明，该初步工作在MaDSA任务上具有潜在的应用前景。

链接: https://arxiv.org/abs/2410.21836
作者: Chaebin Lee,Seungyeon Seo,Heejin Do,Gary Geunbae Lee
关键词-EN: automatic depression detection, gained more attention, advancement of chatbots, growing demand, demand for automatic
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the advancement of chatbots and the growing demand for automatic depression detection, identifying depression in patient conversations has gained more attention. However, prior methods often assess depression in a binary way or only a single score without diverse feedback and lack focus on enhancing dialogue responses. In this paper, we present a novel task of multi-aspect depression severity assessment via an inductive dialogue system (MaDSA), evaluating a patient’s depression level on multiple criteria by incorporating an assessment-aided response generation. Further, we propose a foundational system for MaDSA, which induces psychological dialogue responses with an auxiliary emotion classification task within a hierarchical severity assessment structure. We synthesize the conversational dataset annotated with eight aspects of depression severity alongside emotion labels, proven robust via human evaluations. Experimental results show potential for our preliminary work on MaDSA.
摘要：随着聊天机器人的进步和自动抑郁症检测需求的增加，识别患者对话中的抑郁症问题引起了更多关注。然而，以往的方法通常以二元方式评估抑郁症，或仅提供单一评分，缺乏多样化的反馈，并且忽视了对话响应的改进。本文提出了一项通过归纳对话系统（MaDSA）进行多方面抑郁症严重程度评估的新任务，通过结合评估辅助响应生成，从多个标准评估患者的抑郁症水平。此外，我们为MaDSA提出了一个基础系统，该系统在层次严重程度评估结构中引入辅助情感分类任务，生成心理对话响应。我们合成了一个包含抑郁症严重程度八个方面及情感标签的对话数据集，并通过人工评估验证了其鲁棒性。实验结果显示了我们在MaDSA初步工作中的潜力。

[NLP-36] Self-Preference Bias in LLM -as-a-Judge

【速读】：该论文试图解决的问题是大型语言模型（LLMs）在自动评估对话系统性能时存在的自我偏好偏差（self-preference bias），这种偏差可能导致模型倾向于推广其固有的风格或策略。论文的关键解决方案是引入了一种新的定量指标来测量这种自我偏好偏差。通过实验，研究者发现GPT-4表现出显著的自我偏好偏差，并提出假设认为LLMs可能倾向于更熟悉的输出，这种熟悉性可以通过输出的困惑度（perplexity）来衡量。研究结果表明，LLMs对困惑度较低的输出给予显著更高的评估分数，这一现象在模型自我生成的输出和非自我生成的输出中均存在，表明偏差的核心在于困惑度，即LLMs偏好更熟悉的文本。

链接: https://arxiv.org/abs/2410.21819
作者: Koki Wataoka,Tsubasa Takahashi,Ryokan Ri
关键词-EN: Automated evaluation leveraging, large language models, leveraging large language, self-preference bias, evaluation leveraging large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
摘要：利用大语言模型 (LLM) 进行自动评估，通常称为 LLM 评估器或 LLM 作为评判者，已被广泛用于衡量对话系统的性能。然而，LLM 中的自我偏好偏差带来了显著的风险，包括推广特定风格或策略，这些风格或策略内嵌于 LLM 本身。尽管这一问题的重要性不言而喻，但目前缺乏定量测量自我偏好偏差的方法，其根本原因也未被充分理解。本文中，我们引入了一种新的定量指标来测量自我偏好偏差。我们的实验结果表明，GPT-4 表现出显著的自我偏好偏差。为了探究其原因，我们假设 LLM 可能更倾向于输出内容，这些内容对其来说更为熟悉，表现为较低的困惑度 (perplexity)。我们分析了 LLM 评估与输出困惑度之间的关系。研究发现，无论输出是否由 LLM 自身生成，LLM 对困惑度较低的输出给予的评估显著高于人类评估者。这表明偏差的本质在于困惑度，而自我偏好偏差存在的原因是 LLM 偏好对其更为熟悉的文本。

[NLP-37] Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Models

【速读】：该论文试图解决黑箱模型（black-box models）在可解释性AI（Explainable AI, XAI）中的自解释性与后解释性方法之间的矛盾。解决方案的关键在于提出了一种名为AutoGnothi的新方法，该方法通过在黑箱模型中集成一个小的侧网络（side network），实现了在不改变原始网络参数的情况下生成Shapley值解释。这一侧调优（side-tuning）策略显著降低了内存、训练和推理成本，同时保持了预测准确性，从而在提供理论保证的自解释性的同时，不牺牲模型的性能和效率。

链接: https://arxiv.org/abs/2410.21815
作者: Shaobo Wang,Hongxuan Tang,Mingyang Wang,Hongrui Zhang,Xuyang Liu,Weiya Li,Xuming Hu,Linfeng Zhang
关键词-EN: central to Explainable, XAI, Explainable, self-interpretable models, black-box
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:The debate between self-interpretable models and post-hoc explanations for black-box models is central to Explainable AI (XAI). Self-interpretable models, such as concept-based networks, offer insights by connecting decisions to human-understandable concepts but often struggle with performance and scalability. Conversely, post-hoc methods like Shapley values, while theoretically robust, are computationally expensive and resource-intensive. To bridge the gap between these two lines of research, we propose a novel method that combines their strengths, providing theoretically guaranteed self-interpretability for black-box models without compromising prediction accuracy. Specifically, we introduce a parameter-efficient pipeline, AutoGnothi, which integrates a small side network into the black-box model, allowing it to generate Shapley value explanations without changing the original network parameters. This side-tuning approach significantly reduces memory, training, and inference costs, outperforming traditional parameter-efficient methods, where full fine-tuning serves as the optimal baseline. AutoGnothi enables the black-box model to predict and explain its predictions with minimal overhead. Extensive experiments show that AutoGnothi offers accurate explanations for both vision and language tasks, delivering superior computational efficiency with comparable interpretability.
摘要：在可解释人工智能（Explainable AI, XAI）领域，自解释模型与事后解释方法之间的争论是核心议题。自解释模型，如基于概念的网络，通过将决策与人类可理解的概念相连接来提供洞察，但往往在性能和可扩展性方面存在挑战。相反，事后解释方法如Shapley值虽然在理论上具有鲁棒性，但在计算上却非常昂贵且资源密集。为了弥合这两类研究之间的差距，我们提出了一种结合两者优势的新方法，能够在不牺牲预测准确性的前提下，为黑箱模型提供理论上保证的自解释性。具体而言，我们引入了一个参数高效的流水线，即AutoGnothi，它将一个小型侧网络集成到黑箱模型中，使其能够在不改变原始网络参数的情况下生成Shapley值解释。这种侧调优方法显著降低了内存、训练和推理成本，优于传统的参数高效方法，其中全量微调作为最佳基线。AutoGnothi使黑箱模型能够在最小开销下预测并解释其预测结果。广泛的实验表明，AutoGnothi在视觉和语言任务中均能提供准确的解释，实现了卓越的计算效率，同时保持了可比拟的解释性。

[NLP-38] SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication

【速读】：该论文试图解决在生成式模型驱动的涌现通信中，如何通过自监督学习（Self-Supervised Learning, SSL）方法，特别是SimSiam，来增强表示学习和捕捉不确定性，并在此基础上实现有效的涌现语言生成。解决方案的关键在于提出SimSiam+VAE，这是一种将变分自编码器（Variational Autoencoder, VAE）集成到SimSiam网络预测器中的统一方法，以提升表示学习和不确定性捕捉能力。此外，论文还扩展了这一模型，提出了SimSiam Naming Game (SSNG)，通过结合生成和贝叶斯方法，利用SimSiam的判别过程来促进代理之间的相互理解，从而在动态角色交替的交互中实现有效的涌现语言生成。

链接: https://arxiv.org/abs/2410.21803
作者: Nguyen Le Hoang,Tadahiro Taniguchi,Fang Tianwei,Akira Taniguchi
关键词-EN: describing their individual, representation learning, SimSiam, VAE, individual views
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emergent communication, driven by generative models, enables agents to develop a shared language for describing their individual views of the same objects through interactions. Meanwhile, self-supervised learning (SSL), particularly SimSiam, uses discriminative representation learning to make representations of augmented views of the same data point closer in the representation space. Building on the prior work of VI-SimSiam, which incorporates a generative and Bayesian perspective into the SimSiam framework via variational inference (VI) interpretation, we propose SimSiam+VAE, a unified approach for both representation learning and emergent communication. SimSiam+VAE integrates a variational autoencoder (VAE) into the predictor of the SimSiam network to enhance representation learning and capture uncertainty. Experimental results show that SimSiam+VAE outperforms both SimSiam and VI-SimSiam. We further extend this model into a communication framework called the SimSiam Naming Game (SSNG), which applies the generative and Bayesian approach based on VI to develop internal representations and emergent language, while utilizing the discriminative process of SimSiam to facilitate mutual understanding between agents. In experiments with established models, despite the dynamic alternation of agent roles during interactions, SSNG demonstrates comparable performance to the referential game and slightly outperforms the Metropolis-Hastings naming game.
摘要：由生成模型驱动的涌现通信使得智能体能够通过交互发展出一种共享语言，用于描述各自对同一对象的视角。同时，自监督学习 (Self-Supervised Learning, SSL)，特别是 SimSiam，利用判别表示学习使同一数据点的增强视图在表示空间中更加接近。基于先前的工作 VI-SimSiam，该工作通过变分推断 (Variational Inference, VI) 解释将生成和贝叶斯视角融入 SimSiam 框架，我们提出了 SimSiam+VAE，一种统一表示学习和涌现通信的方法。SimSiam+VAE 将变分自编码器 (Variational Autoencoder, VAE) 集成到 SimSiam 网络的预测器中，以增强表示学习并捕捉不确定性。实验结果表明，SimSiam+VAE 优于 SimSiam 和 VI-SimSiam。我们进一步将该模型扩展为一个通信框架，称为 SimSiam 命名游戏 (SimSiam Naming Game, SSNG)，该框架基于 VI 应用生成和贝叶斯方法来发展内部表示和涌现语言，同时利用 SimSiam 的判别过程促进智能体之间的相互理解。在与现有模型的实验中，尽管在交互过程中智能体角色动态交替，SSNG 的性能与指称游戏相当，并略优于 Metropolis-Hastings 命名游戏。

[NLP-39] Enhancing Adversarial Attacks through Chain of Thought

【速读】：该论文试图解决针对对齐的大型语言模型 (Large Language Models, LLMs) 的对抗攻击的鲁棒性问题。解决方案的关键在于将链式思维提示 (Chain of Thought, CoT) 与贪婪坐标梯度 (Greedy Coordinate Gradient, GCG) 技术相结合，形成 CoT-GCG 方法。通过使用 CoT 触发器而非肯定目标，该方法能够激发后端 LLMs 的推理能力，从而提高对抗攻击的可转移性和普遍性。实验结果表明，CoT-GCG 方法在性能上优于传统的 GCG 攻击和单独的 CoT 提示，并且通过 Llama Guard 评估潜在有害交互，提供了比匹配输出到拒绝短语更客观的对话风险评估。

链接: https://arxiv.org/abs/2410.21791
作者: Jingbo Su
关键词-EN: Large language models, demonstrated impressive performance, Large language, language models, safety concerns
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance across various domains but remain susceptible to safety concerns. Prior research indicates that gradient-based adversarial attacks are particularly effective against aligned LLMs and the chain of thought (CoT) prompting can elicit desired answers through step-by-step reasoning. This paper proposes enhancing the robustness of adversarial attacks on aligned LLMs by integrating CoT prompts with the greedy coordinate gradient (GCG) technique. Using CoT triggers instead of affirmative targets stimulates the reasoning abilities of backend LLMs, thereby improving the transferability and universality of adversarial attacks. We conducted an ablation study comparing our CoT-GCG approach with Amazon Web Services auto-cot. Results revealed our approach outperformed both the baseline GCG attack and CoT prompting. Additionally, we used Llama Guard to evaluate potentially harmful interactions, providing a more objective risk assessment of entire conversations compared to matching outputs to rejection phrases. The code of this paper is available at this https URL.
摘要：大语言模型（LLMs）在多个领域展示了令人印象深刻的表现，但仍然存在安全性问题。先前的研究表明，基于梯度的对抗攻击对对齐的 LLMs 尤为有效，而思维链（CoT）提示可以通过逐步推理引出期望的答案。本文提出通过将 CoT 提示与贪婪坐标梯度（GCG）技术结合，增强对齐 LLMs 的对抗攻击鲁棒性。使用 CoT 触发器而非肯定目标激发后端 LLMs 的推理能力，从而提高对抗攻击的可转移性和普遍性。我们进行了消融研究，比较了我们的 CoT-GCG 方法与 Amazon Web Services 的 auto-cot。结果显示，我们的方法在基线 GCG 攻击和 CoT 提示方面均表现更优。此外，我们使用 Llama Guard 评估潜在的有害交互，提供了比匹配输出到拒绝短语更客观的整个对话风险评估。本文代码可在以下链接获取：https URL。

[NLP-40] MARCO: Multi-Agent Real-time Chat Orchestration EMNLP2024

【速读】：该论文试图解决利用大型语言模型（LLMs）进行复杂、多步骤任务执行时面临的挑战，特别是任务自动化过程中可能出现的输出格式不一致、函数和参数幻觉（function and parameter hallucination）以及缺乏领域知识等问题。解决方案的关键在于提出了MARCO框架，这是一个多智能体实时聊天编排框架，通过引入强大的防护机制（guardrails）来引导LLM行为、验证输出并从错误中恢复。这些防护机制包括对输出格式的严格检查、对函数和参数的正确性验证以及对领域知识的补充，从而显著提高了任务执行的准确性和效率。实验结果表明，MARCO在数字餐厅服务平台对话和零售对话数据集上的任务执行准确率分别达到了94.48%和92.74%，同时延迟减少了44.91%，成本降低了33.71%。

链接: https://arxiv.org/abs/2410.21784
作者: Anubhav Shrimal,Stanley Kanagaraj,Kriti Biswas,Swarnalatha Raghuraman,Anish Nediyanchath,Yi Zhang,Promod Yenigalla
关键词-EN: Large language model, Real-time Chat Orchestration, Large language, Multi-Agent Real-time Chat, Chat Orchestration framework
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: EMNLP 2024 Industry Track

点击查看摘要

Abstract:Large language model advancements have enabled the development of multi-agent frameworks to tackle complex, real-world problems such as to automate tasks that require interactions with diverse tools, reasoning, and human collaboration. We present MARCO, a Multi-Agent Real-time Chat Orchestration framework for automating tasks using LLMs. MARCO addresses key challenges in utilizing LLMs for complex, multi-step task execution. It incorporates robust guardrails to steer LLM behavior, validate outputs, and recover from errors that stem from inconsistent output formatting, function and parameter hallucination, and lack of domain knowledge. Through extensive experiments we demonstrate MARCO’s superior performance with 94.48% and 92.74% accuracy on task execution for Digital Restaurant Service Platform conversations and Retail conversations datasets respectively along with 44.91% improved latency and 33.71% cost reduction. We also report effects of guardrails in performance gain along with comparisons of various LLM models, both open-source and proprietary. The modular and generic design of MARCO allows it to be adapted for automating tasks across domains and to execute complex usecases through multi-turn interactions.
摘要：大语言模型的进步推动了多智能体框架的发展，这些框架能够解决复杂的现实世界问题，例如自动化需要与多种工具交互、推理以及人类协作的任务。我们提出了 MARCO，这是一个用于自动化任务的多智能体实时聊天编排框架，利用了大语言模型。MARCO 解决了在复杂多步骤任务执行中使用大语言模型的关键挑战。它集成了强大的防护机制来引导大语言模型的行为，验证输出，并从由于输出格式不一致、函数和参数幻觉以及缺乏领域知识导致的错误中恢复。通过广泛的实验，我们展示了 MARCO 在数字餐厅服务平台对话和零售对话数据集上的任务执行中分别达到了 94.48% 和 92.74% 的准确率，同时延迟减少了 44.91%，成本降低了 33.71%。我们还报告了防护机制对性能提升的影响，并比较了各种开源和专有大语言模型。MARCO 的模块化和通用设计使其能够适应跨领域的任务自动化，并通过多轮交互执行复杂的用例。

[NLP-41] Leveraging LLM s for Hypothetical Deduction in Logical Inference: A Neuro-Symbolic Approach

【速读】：该论文试图解决大型语言模型（LLMs）在逻辑推理任务中面临的两个关键问题：一是外部逻辑符号求解器方法在处理具有不同特征的问题时泛化能力差；二是这些方法不可避免地会导致问题信息的丢失。解决方案的关键在于引入了一种名为LINA的LLM驱动的神经符号方法，通过使LLM自主完成从命题逻辑提取到复杂逻辑推理的转换，增强了推理过程的鲁棒性，并消除了对外部求解器的依赖。此外，LINA采用假设-演绎推理范式，有效规避了传统前向推理方法面临的搜索空间扩展问题。

链接: https://arxiv.org/abs/2410.21779
作者: Qingchuan Li,Jiatong Li,Tongxuan Liu,Yuting Zeng,Mingyue Cheng,Weizhe Huang,Qi Liu
关键词-EN: Large Language Models, Large Language, Language Models, exhibited remarkable potential, including logical reasoning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have exhibited remarkable potential across a wide array of reasoning tasks, including logical reasoning. Although massive efforts have been made to empower the logical reasoning ability of LLMs via external logical symbolic solvers, crucial challenges of the poor generalization ability to questions with different features and inevitable question information loss of symbolic solver-driven approaches remain unresolved. To mitigate these issues, we introduce LINA, a LLM-driven neuro-symbolic approach for faithful logical reasoning. By enabling an LLM to autonomously perform the transition from propositional logic extraction to sophisticated logical reasoning, LINA not only bolsters the resilience of the reasoning process but also eliminates the dependency on external solvers. Additionally, through its adoption of a hypothetical-deductive reasoning paradigm, LINA effectively circumvents the expansive search space challenge that plagues traditional forward reasoning methods. Empirical evaluations demonstrate that LINA substantially outperforms both established propositional logic frameworks and conventional prompting techniques across a spectrum of five logical reasoning tasks. Specifically, LINA achieves an improvement of 24.34% over LINC on the FOLIO dataset, while also surpassing prompting strategies like CoT and CoT-SC by up to 24.02%. Our code is available at this https URL.
摘要：大语言模型 (LLMs) 在包括逻辑推理在内的多种推理任务中展现了显著的潜力。尽管通过外部逻辑符号求解器来增强 LLMs 的逻辑推理能力已经付出了巨大努力，但符号求解器驱动方法在面对具有不同特征的问题时泛化能力差以及不可避免的问题信息损失等关键挑战仍未得到解决。为缓解这些问题，我们提出了 LINA，一种基于 LLM 的神经符号推理方法，用于忠实的逻辑推理。通过使 LLM 自主完成从命题逻辑提取到复杂逻辑推理的转换，LINA 不仅增强了推理过程的鲁棒性，还消除了对外部求解器的依赖。此外，通过采用假设-演绎推理范式，LINA 有效地规避了传统前向推理方法所面临的搜索空间扩展挑战。实证评估表明，LINA 在五个逻辑推理任务中显著优于既有的命题逻辑框架和传统的提示技术。具体而言，LINA 在 FOLIO 数据集上比 LINC 提升了 24.34%，同时超越了如 CoT 和 CoT-SC 等提示策略最多达 24.02%。我们的代码可在以下链接获取：https URL。

[NLP-42] RELATE: A Modern Processing Platform for Romanian Language

【速读】：该论文试图解决罗马尼亚语自然语言处理（Natural Language Processing, NLP）的高性能平台设计问题，特别是针对文本和音频处理的需求。解决方案的关键在于设计和开发了RELATE平台，该平台不仅提供了针对罗马尼亚语的文本处理功能，还通过最近的更新整合了音频处理工具，从而实现了对罗马尼亚语语料库的全面处理。核心组件的技术细节和实际应用场景的展示进一步证明了RELATE平台在处理罗马尼亚语方面的成熟性、现代性和先进性。最近的发展还包括平台内的双模态（文本和音频）功能，这进一步增强了其处理能力。

链接: https://arxiv.org/abs/2410.21778
作者: Vasile Păiş,Radu Ion,Andrei-Marius Avram,Maria Mitrofan,Dan Tufiş
关键词-EN: design and evolution, Romanian language, Abstract, processing, Romanian
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents the design and evolution of the RELATE platform. It provides a high-performance environment for natural language processing activities, specially constructed for Romanian language. Initially developed for text processing, it has been recently updated to integrate audio processing tools. Technical details are provided with regard to core components. We further present different usage scenarios, derived from actual use in national and international research projects, thus demonstrating that RELATE is a mature, modern, state-of-the-art platform for processing Romanian language corpora. Finally, we present very recent developments including bimodal (text and audio) features available within the platform.
摘要：本文介绍了 RELATE 平台的设计及其演进过程。该平台为罗马尼亚语的自然语言处理活动提供了一个高性能的环境。最初开发用于文本处理，最近已更新以整合音频处理工具。本文详细介绍了核心组件的技术细节。此外，我们还展示了从国家及国际研究项目中的实际应用中得出的不同使用场景，从而证明 RELATE 是一个成熟、现代且处于技术前沿的罗马尼亚语语料库处理平台。最后，我们介绍了平台中最新开发的包括双模态（文本和音频）功能在内的最新进展。

[NLP-43] Learning and Unlearning of Fabricated Knowledge in Language Models

【速读】：该论文试图解决的问题是：当新知识被引入到大语言模型（LM）的训练数据中时，这些知识在模型继续训练的过程中能持续多久，以及这些知识对模型行为的影响。解决方案的关键在于识别出在事实新颖性谱中，介于与世界知识一致和完全随机之间的“最佳点”，在这个点上，注入的记忆最为持久。具体来说，论文发现与常识冲突的事实能够在模型中持续数万次训练步骤，而与常识不冲突的事实（平凡的）和随机打乱的事实则很快被遗忘。此外，与常识冲突的事实能够影响模型在逻辑上不相关的提示上的幻觉生成，显示出其非目标泛化的倾向，而平凡和随机打乱的事实则影响较小。最后，论文提出了一种多步稀疏更新的方法，可以有效消除与常识冲突的事实对模型的长期影响，同时保持模型的训练能力，这为缓解训练数据中毒问题提供了直接的解决方案。

链接: https://arxiv.org/abs/2410.21750
作者: Chen Sun,Nolan Andrew Miller,Andrey Zhmoginov,Max Vladymyrov,Mark Sandler
关键词-EN: continues to train, large language model, large language, facts, knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:What happens when a new piece of knowledge is introduced into the training data and how long does it last while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, “Outlandish”, which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically we show that facts that conflict with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled) are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime’’ how the language model hallucinates on logically unrelated prompts, showing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that impacts of knowledge-conflicting facts in LMs, though they can be long lasting, can be largely erased by novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.
摘要：当新的知识被引入训练数据中时，大语言模型 (LM) 在继续训练的过程中，这些新知识会持续多久？我们通过将事实注入到一个名为“Outlandish”的新探测数据集中来研究这个问题，该数据集设计用于测试不同类型事实的广泛范围。在研究这些记忆的鲁棒性时，我们发现事实新颖性谱中存在一个最佳点，介于与世界知识的一致性和完全随机性之间，注入的记忆在此处最为持久。具体来说，我们表明与常识冲突的事实会在数万个训练步骤中被记住，而与常识不冲突的提示（平凡的）以及打乱顺序的提示（随机打乱的）则会被更快地遗忘。此外，知识冲突的事实能够“引导”语言模型在逻辑上无关的提示上产生幻觉，显示出其非目标泛化的倾向，而平凡和随机打乱的事实则引导作用显著减弱。最后，我们展示了大语言模型中知识冲突事实的影响，尽管它们可能持续很长时间，但可以通过多步稀疏更新的新颖应用来大幅消除，即使在模型的训练能力得以保留的情况下也是如此。因此，这一非常简单的过程对减轻训练中的数据中毒效应具有直接的实际意义。

[NLP-44] Enhancing Financial Question Answering with a Multi-Agent Reflection Framework

【速读】：该论文试图解决大型语言模型（LLMs）在金融问答（QA）任务中，尤其是在涉及数值推理时的表现不佳的问题。解决方案的关键在于提出了一种多代理框架，该框架包括一个批判代理（critic agent），用于反思推理步骤和最终答案，并通过增加多个专注于答案特定方面的批判代理来进一步增强系统。这种多代理框架显著提高了性能，相较于单一代理推理，平均性能提升了15%（LLaMA3-8B模型）和5%（LLaMA3-70B模型），并且在某些情况下超越了更大规模的单一代理LLMs，如LLaMA3.1-405B和GPT-4o-mini，尽管在某些方面略逊于Claude-3.5 Sonnet，但提供了一种成本效益更高的替代方案。

链接: https://arxiv.org/abs/2410.21741
作者: Sorouralsadat Fatemi,Yuheng Hu
关键词-EN: Natural Language Processing, numerous Natural Language, Large Language Models, Language Processing, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICAIF 24

点击查看摘要

Abstract:While Large Language Models (LLMs) have shown impressive capabilities in numerous Natural Language Processing (NLP) tasks, they still struggle with financial question answering (QA), particularly when numerical reasoning is required. Recently, LLM-based multi-agent frameworks have demonstrated remarkable effectiveness in multi-step reasoning, which is crucial for financial QA tasks as it involves extracting relevant information from tables and text and then performing numerical reasoning on the extracted data to infer answers. In this study, we propose a multi-agent framework incorporating a critic agent that reflects on the reasoning steps and final answers for each question. Additionally, we enhance our system by adding multiple critic agents, each focusing on a specific aspect of the answer. Our results indicate that this framework significantly improves performance compared to single-agent reasoning, with an average performance increase of 15% for the LLaMA3-8B model and 5% for the LLaMA3-70B model. Furthermore, our framework performs on par with, and in some cases surpasses, larger single-agent LLMs such as LLaMA3.1-405B and GPT-4o-mini, though it falls slightly short compared to Claude-3.5 Sonnet. Overall, our framework presents an effective solution to enhance open-source LLMs for financial QA tasks, offering a cost-effective alternative to larger models like Claude-3.5 Sonnet.
摘要：尽管大语言模型 (LLM) 在众多自然语言处理 (NLP) 任务中展现了令人瞩目的能力，但在涉及数值推理的金融问答 (QA) 任务中仍显不足。近期，基于大语言模型的多智能体框架在多步骤推理中表现出了显著的有效性，这对于金融问答任务至关重要，因为它涉及从表格和文本中提取相关信息，并对提取的数据进行数值推理以推导出答案。在本研究中，我们提出了一种包含批评智能体的多智能体框架，该智能体对每个问题的推理步骤和最终答案进行反思。此外，我们通过增加多个批评智能体来增强系统，每个智能体专注于答案的特定方面。我们的结果表明，与单智能体推理相比，该框架显著提升了性能，LLaMA3-8B 模型的平均性能提升了 15%，LLaMA3-70B 模型提升了 5%。此外，我们的框架在某些情况下与更大规模的单智能体大语言模型（如 LLaMA3.1-405B 和 GPT-4o-mini）表现相当，甚至超越，尽管在某些方面略逊于 Claude-3.5 Sonnet。总体而言，我们的框架为增强开源大语言模型在金融问答任务中的表现提供了一种有效的解决方案，为 Claude-3.5 Sonnet 等大型模型提供了一种成本效益更高的替代方案。

[NLP-45] Lets Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

【速读】：该论文试图解决现有链式思维（Chain of Thought, CoT）提示方法在提升大型语言模型（LLMs）推理能力时，仍需大量人工干预或难以生成高质量提示的问题。解决方案的关键在于提出了一种名为LBS3的新型自动推理提示方法，该方法受课程学习（curriculum learning）启发，模拟人类学习习惯。LBS3通过引导LLMs从易到难地回忆与目标查询相关的代理查询（proxy queries），并利用从简单代理查询生成的示范提示逐步指导LLMs解决复杂代理查询，从而确保代理解决方案的高质量。实验结果表明，LBS3在多种推理密集型任务中表现出色，与当前最先进的基线方法相比具有较强的竞争力。

链接: https://arxiv.org/abs/2410.21728
作者: Kangyang Luo,Zichen Ding,Zhenmin Weng,Lingfeng Qiao,Meng Zhao,Xiang Li,Di Yin,Jinlong Shu
关键词-EN: Chain of Thought, large language models, language models, significantly consolidated, capabilities of large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Chain of Thought (CoT) prompting approaches have significantly consolidated the reasoning capabilities of large language models (LLMs), they still face limitations that require extensive human effort or have performance needs to be improved. Existing endeavors have focused on bridging these gaps; however, these approaches either hinge on external data and cannot completely eliminate manual effort, or they fall short in effectively directing LLMs to generate high-quality exemplary prompts. To address the said pitfalls, we propose a novel prompt approach for automatic reasoning named \textbfLBS3, inspired by curriculum learning which better reflects human learning habits. Specifically, LBS3 initially steers LLMs to recall easy-to-hard proxy queries that are pertinent to the target query. Following this, it invokes a progressive strategy that utilizes exemplary prompts stemmed from easy-proxy queries to direct LLMs in solving hard-proxy queries, enabling the high-quality of the proxy solutions. Finally, our extensive experiments in various reasoning-intensive tasks with varying open- and closed-source LLMs show that LBS3 achieves strongly competitive performance compared to the SOTA baselines.
摘要：尽管思维链 (Chain of Thought, CoT) 提示方法显著增强了大型语言模型 (Large Language Models, LLMs) 的推理能力，但它们仍面临需要大量人工努力或性能有待提升的局限。现有的努力主要集中在填补这些差距上；然而，这些方法要么依赖外部数据且无法完全消除人工干预，要么在有效引导 LLMs 生成高质量示例提示方面表现不足。为解决上述问题，我们提出了一种名为 LBS3 的自动推理提示新方法，该方法受课程学习 (curriculum learning) 启发，更符合人类学习习惯。具体而言，LBS3 首先引导 LLMs 回忆与目标查询相关的由易到难的代理查询。随后，它采用渐进策略，利用从简单代理查询中提取的示例提示来指导 LLMs 解决复杂代理查询，从而确保代理解决方案的高质量。最后，我们在多种推理密集型任务中，使用不同开源和闭源的 LLMs 进行了广泛实验，结果显示 LBS3 在与最先进基线方法的对比中表现出色。

[NLP-46] A Bayesian Approach to Harnessing the Power of LLM s in Authorship Attribution

【速读】：该论文试图解决的是在缺乏大量标注数据的情况下，如何利用预训练的大型语言模型（Large Language Models, LLMs）进行一次性作者归属（one-shot authorship attribution）的问题。解决方案的关键在于利用LLMs的深度推理能力和长程文本关联性，通过贝叶斯方法和LLMs的概率输出，计算文本与作者先前作品之间的概率关系，从而实现对作者身份的细致理解。具体方法包括使用预训练的模型如Llama-3-70B，在IMDb和博客数据集上进行实验，结果显示在一次性作者分类任务中达到了85%的准确率，为一次性作者分析设定了新的基准，并扩展了LLMs在法医语言学中的应用范围。

链接: https://arxiv.org/abs/2410.21716
作者: Zhengmian Hu,Tong Zheng,Heng Huang
关键词-EN: aims to identify, identify the origin, Authorship attribution aims, Authorship, one-shot authorship
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Authorship attribution aims to identify the origin or author of a document. Traditional approaches have heavily relied on manual features and fail to capture long-range correlations, limiting their effectiveness. Recent advancements leverage text embeddings from pre-trained language models, which require significant fine-tuning on labeled data, posing challenges in data dependency and limited interpretability. Large Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising alternative. This study explores the potential of pre-trained LLMs in one-shot authorship attribution, specifically utilizing Bayesian approaches and probability outputs of LLMs. Our methodology calculates the probability that a text entails previous writings of an author, reflecting a more nuanced understanding of authorship. By utilizing only pre-trained models such as Llama-3-70B, our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors. Our findings set new baselines for one-shot authorship analysis using LLMs and expand the application scope of these models in forensic linguistics. This work also includes extensive ablation studies to validate our approach.
摘要：作者归属旨在识别文档的来源或作者。传统方法严重依赖手动特征，无法捕捉长距离关联，限制了其有效性。近期进展利用预训练语言模型的文本嵌入，但这些方法需要大量标注数据的微调，存在数据依赖性和解释性有限的问题。大语言模型 (LLM) 凭借其深度推理能力和维持长距离文本关联的能力，提供了一种有前景的替代方案。本研究探讨了预训练 LLM 在单样本作者归属中的潜力，特别是利用贝叶斯方法和 LLM 的概率输出。我们的方法计算文本包含作者先前作品的概率，反映了更细致的作者理解。通过仅使用预训练模型如 Llama-3-70B，我们在 IMDb 和博客数据集上的结果显示，在十位作者中进行单样本作者分类的准确率达到了 85%。我们的发现为使用 LLM 进行单样本作者分析设定了新的基准，并扩展了这些模型在法医语言学中的应用范围。本研究还包括广泛的消融实验以验证我们的方法。

[NLP-47] CFSafety: Comprehensive Fine-grained Safety Assessment for LLM s

【速读】：该论文旨在解决大型语言模型（LLMs）在生成文本时可能带来的安全风险问题，特别是社会偏见、不道德内容以及在特定对抗性指令下可能引发的非法活动。解决方案的关键在于引入了一个名为CFSafety的安全评估基准，该基准整合了5种经典安全场景和5类指令攻击，共10种安全问题类别，形成了一个包含25k提示的测试集。通过结合简单的道德判断和1-5安全评分尺度，对LLMs的自然语言生成（NLG）能力进行评估。研究结果表明，尽管GPT-4在安全性方面表现优异，但包括GPT-4在内的所有LLMs的安全性仍需进一步提升。

链接: https://arxiv.org/abs/2410.21695
作者: Zhihao Liu,Chenhui Hu
关键词-EN: bring significant conveniences, considerable safety risks, rapidly evolve, daily lives, bring significant
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) rapidly evolve, they bring significant conveniences to our work and daily lives, but also introduce considerable safety risks. These models can generate texts with social biases or unethical content, and under specific adversarial instructions, may even incite illegal activities. Therefore, rigorous safety assessments of LLMs are crucial. In this work, we introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions, to form a test set with 25k prompts. This test set was used to evaluate the natural language generation (NLG) capabilities of LLMs, employing a combination of simple moral judgment and a 1-5 safety rating scale for scoring. Using this benchmark, we tested eight popular LLMs, including the GPT series. The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement. The data and code associated with this study are available on GitHub.
摘要：随着大语言模型 (Large Language Models, LLMs) 的迅速发展，它们为我们的工作和日常生活带来了显著的便利，但同时也引入了相当大的安全风险。这些模型可能生成带有社会偏见或不道德内容的文本，并且在特定的对抗性指令下，甚至可能煽动非法活动。因此，对 LLMs 进行严格的安全评估至关重要。在本研究中，我们引入了一个安全评估基准，即 CFSafety，该基准整合了 5 种经典的安全场景和 5 类指令攻击，共计 10 种安全问题类别，形成了包含 25,000 个提示的测试集。该测试集用于评估 LLMs 的自然语言生成 (Natural Language Generation, NLG) 能力，采用简单的道德判断和 1-5 安全评分等级相结合的方式进行评分。利用这一基准，我们测试了包括 GPT 系列在内的八种流行 LLMs。结果表明，尽管 GPT-4 展示了卓越的安全性能，但包括该模型在内的 LLMs 的安全有效性仍有待提高。本研究的相关数据和代码已在 GitHub 上公开。

[NLP-48] Sequential choice in ordered bundles

【速读】：该论文试图解决的问题是如何基于用户对有序捆绑商品（如体育赛事、音乐、视频等）的前序消费模式，预测其是否会继续消费下一个商品。解决方案的关键在于使用了一种基于解码器架构的自定义Transformer模型，该模型在预测个体选择和总体需求方面表现最为准确。该模型能够捕捉到一种普遍的状态依赖性，并通过分析Transformer的注意力权重，发现用户对下一个商品的消费决策基于对所有前序选择的近似等权重考虑。这一解决方案有助于优化商品排队、预测单个商品的需求，并通过个性化促销策略来提高需求。

链接: https://arxiv.org/abs/2410.21670
作者: Rajeev Kohli,Kriste Krstovski,Hengyu Kuang,Hengxu Lin
关键词-EN: Experience goods, artistic events, television series, sporting and artistic, Experience
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Experience goods such as sporting and artistic events, songs, videos, news stories, podcasts, and television series, are often packaged and consumed in bundles. Many such bundles are ordered in the sense that the individual items are consumed sequentially, one at a time. We examine if an individual’s decision to consume the next item in an ordered bundle can be predicted based on his/her consumption pattern for the preceding items. We evaluate several predictive models, including two custom Transformers using decoder-only and encoder-decoder architectures, fine-tuned GPT-3, a custom LSTM model, a reinforcement learning model, two Markov models, and a zero-order model. Using data from Spotify, we find that the custom Transformer with a decoder-only architecture provides the most accurate predictions, both for individual choices and aggregate demand. This model captures a general form of state dependence. Analysis of Transformer attention weights suggests that the consumption of the next item in a bundle is based on approximately equal weighting of all preceding choices. Our results indicate that the Transformer can assist in queuing the next item that an individual is likely to consume from an ordered bundle, predicting the demand for individual items, and personalizing promotions to increase demand.
摘要：体育和艺术活动、歌曲、视频、新闻故事、播客和电视剧等体验性商品，通常以捆绑形式打包和消费。许多此类捆绑包具有顺序性，即其中的单个项目是按顺序逐个消费的。我们探讨了个体是否可以根据其对前序项目的消费模式来预测其是否会继续消费捆绑包中的下一个项目。我们评估了几种预测模型，包括两种使用解码器和编码器-解码器架构的自定义 Transformer 模型、微调的 GPT-3、自定义 LSTM 模型、强化学习模型、两种马尔可夫模型和一个零阶模型。利用来自 Spotify 的数据，我们发现使用解码器架构的自定义 Transformer 模型在个体选择和总体需求预测方面提供了最准确的预测。该模型捕捉到了一种普遍的状态依赖性。对 Transformer 注意力权重的分析表明，捆绑包中下一个项目的消费决策基于对所有前序选择的近似等权重考虑。我们的研究结果表明，Transformer 模型可以帮助排队预测个体可能从有序捆绑包中消费的下一个项目，预测单个项目的需求，并个性化促销以增加需求。

[NLP-49] f-PO: Generalizing Preference Optimization with f-divergence Minimization

【速读】：该论文试图解决语言模型与人类偏好对齐的问题，并提出了一种名为 f-divergence Preference Optimization (f-PO) 的新框架。解决方案的关键在于通过最小化 f-divergence 来优化策略与最优策略之间的差异，从而涵盖了使用各种 divergence 的广泛对齐方法。f-PO 不仅统一了先前的算法如 DPO 和 EXO，还通过选择不同的 f-divergence 提供了新的变体。该方法在多个基准数据集上进行了实验，结果显示其在各种任务中均表现出色，优于现有的对齐方法。此外，论文还探讨了不同 f-divergence 对正则化和性能的影响，为离线偏好优化提供了理论和实践上的贡献。

链接: https://arxiv.org/abs/2410.21662
作者: Jiaqi Han,Mingjian Jiang,Yuxuan Song,Jure Leskovec,Stefano Ermon,Minkai Xu
关键词-EN: significant progress recently, made significant progress, Preference optimization, numerous methods developed, divergence Preference Optimization
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces f -divergence Preference Optimization ( f -PO), a novel framework that generalizes and extends existing approaches. f -PO minimizes f -divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of f -divergences. We provide theoretical analysis of f -PO’s properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate f -PO’s effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, and MT-Bench. Additionally, we present ablation studies exploring the impact of different f -divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at this https URL.
摘要：偏好优化近年来取得了显著进展，开发了多种方法以使语言模型与人类偏好对齐。本文介绍了 f-散度偏好优化（f-PO），这是一种新颖的框架，能够概括和扩展现有方法。f-PO 通过最小化优化策略与最优策略之间的 f-散度，涵盖了使用各种散度的一类广泛的偏好对齐方法。我们的方法统一了先前的算法，如 DPO 和 EXO，同时通过选择不同的 f-散度提供了新的变体。我们对 f-PO 的性质进行了理论分析，并在使用基准数据集的最新语言模型上进行了广泛的实验。结果表明，f-PO 在各种任务中表现出色，在 AlpacaEval 2、Arena-Hard 和 MT-Bench 等流行基准测试中优于现有方法。此外，我们进行了消融研究，探讨了不同 f-散度对结果的影响，提供了关于离线偏好优化中正则化与性能之间权衡的见解。我们的工作为语言模型对齐领域贡献了实用的算法和理论理解。代码可在以下链接获取：https URL。

[NLP-50] Can Language Models Replace Programmers? REPOCOD Says Not Yet

【速读】：该论文试图解决的问题是：现有的代码生成基准测试（如HumanEval和MBPP）无法准确评估大型语言模型（LLMs）在实际软件开发中的应用潜力，因为这些基准测试与真实世界软件开发的复杂性和多样性存在较大差距。论文提出的解决方案之关键是：引入了一个名为REPOCOD的新基准测试，该基准包含从11个流行开源项目中收集的980个问题，其中超过58%的问题需要文件级或仓库级的上下文信息。REPOCOD不仅具有最长的平均规范解决方案长度（331.6 tokens）和高平均圈复杂度（9.00），而且在对十种LLM的评估中，没有模型能达到超过30的pass@1，这揭示了构建更强大的LLM以辅助实际软件开发的必要性。

链接: https://arxiv.org/abs/2410.21647
作者: Shanchao Liang,Yiran Hu,Nan Jiang,Lin Tan
关键词-EN: solving Python coding, Large language models, Python coding problems, shown remarkable ability, Large language
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable ability in code generation with more than 90 pass@1 in solving Python coding problems in HumanEval and MBPP. Such high accuracy leads to the question: can LLMs replace human programmers? Existing manual crafted, simple, or single-line code generation benchmarks cannot answer this question due to their gap with real-world software development. To answer this question, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, with more than 58% of them requiring file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. In our evaluations on ten LLMs, none of the models can achieve more than 30 pass@1 on REPOCOD, disclosing the necessity of building stronger LLMs that can help developers in real-world software development.
摘要：大语言模型 (LLMs) 在代码生成方面展现了显著的能力，在 HumanEval 和 MBPP 的 Python 编程问题解决中，通过率超过 90%。如此高的准确率引发了一个问题：LLMs 能否取代人类程序员？现有的手工设计、简单或单行代码生成的基准测试无法回答这一问题，因为它们与实际软件开发的差距较大。为了回答这一问题，我们提出了 REPOCOD，这是一个包含 980 个问题的代码生成基准测试，这些问题从 11 个流行的实际项目中收集，其中超过 58% 的问题需要文件级或仓库级的上下文信息。此外，REPOCOD 的平均规范解决方案长度（331.6 Token）和平均圈复杂度（9.00）均高于现有基准测试。在我们对十个 LLMs 的评估中，没有任何模型在 REPOCOD 上的通过率超过 30%，这揭示了构建更强大的 LLMs 以帮助开发者在实际软件开发中工作的必要性。

[NLP-51] Are Paraphrases Generated by Large Language Models Invertible?

【速读】：该论文试图解决的问题是释义反转 (paraphrase inversion)，即在给定一个释义文档的情况下，尝试恢复原始文本。解决方案的关键在于微调释义反转模型，并探索使用作者特定的上下文来指导反转过程。具体方法包括使用目标作者的写作示例作为上下文，以及使用学习到的风格表示 (style representations) 来捕捉作者风格的独特特征。研究表明，从机器生成的释义文本开始，可以通过学习到的反转模型恢复文档的显著部分；而从人类写作的文本开始，由于源写作风格的多样性，反转更具挑战性。尽管无法完全恢复原始标记，但反转后的文本在风格上与原始文本相似，这显著提高了依赖风格标记的抄袭检测和作者身份识别系统的性能。

链接: https://arxiv.org/abs/2410.21637
作者: Rafael Rivera Soto,Barry Chen,Nicholas Andrews
关键词-EN: Large language models, produce highly fluent, Large language, highly fluent paraphrases, produce highly
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models can produce highly fluent paraphrases while retaining much of the original meaning. While this capability has a variety of helpful applications, it may also be abused by bad actors, for example to plagiarize content or to conceal their identity. This motivates us to consider the problem of paraphrase inversion: given a paraphrased document, attempt to recover the original text. To explore the feasibility of this task, we fine-tune paraphrase inversion models, both with and without additional author-specific context to help guide the inversion process. We explore two approaches to author-specific inversion: one using in-context examples of the target author’s writing, and another using learned style representations that capture distinctive features of the author’s style. We show that, when starting from paraphrased machine-generated text, we can recover significant portions of the document using a learned inversion model. When starting from human-written text, the variety of source writing styles poses a greater challenge for invertability. However, even when the original tokens can’t be recovered, we find the inverted text is stylistically similar to the original, which significantly improves the performance of plagiarism detectors and authorship identification systems that rely on stylistic markers.
摘要：大语言模型能够在保留大部分原始意义的同时生成高度流畅的释义。尽管这一能力具有多种有益的应用，但也可能被不法分子滥用，例如用于抄袭内容或隐藏身份。这促使我们考虑释义逆向问题：给定一个释义文档，尝试恢复原始文本。为了探索这一任务的可行性，我们对释义逆向模型进行了微调，包括有无额外作者特定上下文来指导逆向过程。我们探索了两种作者特定逆向的方法：一种使用目标作者写作的上下文示例，另一种使用捕捉作者风格独特特征的学习风格表示。我们发现，从机器生成的释义文本开始，使用学习到的逆向模型可以恢复文档的显著部分。而从人类写作的文本开始，源写作风格的多样性对逆向性提出了更大的挑战。然而，即使无法恢复原始Token，我们发现逆向文本在风格上与原始文本相似，这显著提高了依赖风格标记的抄袭检测器和作者身份识别系统的性能。

[NLP-52] MCPDial: A Minecraft Persona-driven Dialogue Dataset

【速读】：该论文试图解决在游戏中生成角色驱动对话的问题，特别是玩家与非玩家角色（NPC）之间的对话。解决方案的关键在于利用大型语言模型（LLMs）生成基于角色特征的对话，并引入了一个名为Minecraft Persona-driven Dialogue dataset (MCPDial)的数据集。该数据集通过从少量专家编写的对话种子开始，使用提出的方法生成数百个额外的对话，每个对话包含丰富的角色描述，并支持深入和广泛的互动。此外，对话中还融入了规范的函数调用（如“Call find a resource on iron ore”），以增强对话的实用性和复杂性。最后，通过定性分析评估数据集的质量和特性。

链接: https://arxiv.org/abs/2410.21627
作者: Seyed Hossein Alavi,Sudha Rao,Ashutosh Adhikari,Gabriel A DesGarennes,Akanksha Malhotra,Chris Brockett,Mahmoud Adada,Raymond T. Ng,Vered Shwartz,Bill Dolan
关键词-EN: large language models, Minecraft Persona-driven Dialogue, Persona-driven Dialogue dataset, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel approach that uses large language models (LLMs) to generate persona-driven conversations between Players and Non-Player Characters (NPC) in games. Showcasing the application of our methodology, we introduce the Minecraft Persona-driven Dialogue dataset (MCPDial). Starting with a small seed of expert-written conversations, we employ our method to generate hundreds of additional conversations. Each conversation in the dataset includes rich character descriptions of the player and NPC. The conversations are long, allowing for in-depth and extensive interactions between the player and NPC. MCPDial extends beyond basic conversations by incorporating canonical function calls (e.g. “Call find a resource on iron ore”) between the utterances. Finally, we conduct a qualitative analysis of the dataset to assess its quality and characteristics.
摘要：我们提出了一种新颖的方法，利用大语言模型 (LLM) 在游戏中生成玩家 (Player) 与非玩家角色 (NPC) 之间的角色驱动对话。通过展示我们的方法论应用，我们引入了《我的世界》角色驱动对话数据集 (MCPDial)。从一小部分专家编写的对话种子开始，我们采用该方法生成了数百个额外的对话。数据集中的每个对话都包含玩家和 NPC 的丰富角色描述。对话内容较长，允许玩家与 NPC 之间进行深入且广泛的互动。MCPDial 不仅限于基本对话，还通过在话语之间嵌入规范函数调用（例如“调用查找铁矿资源”）来扩展对话功能。最后，我们对数据集进行了定性分析，以评估其质量和特征。

[NLP-53] Reducing the Scope of Language Models with Circuit Breakers

【速读】：该论文试图解决语言模型在特定应用场景中回答无关查询的问题，即模型行为范围界定 (scoping)。解决方案的关键在于采用断路器方法 (Circuit Breakers, CB)，这是一种最近提出的通用对齐方法，能够将语言模型的行为限定在特定任务上，如情感分析或摘要生成，甚至在更细粒度的任务上（如仅摘要新闻文章）。与传统的微调或偏好学习方法相比，CB在分布外任务和对抗性提示技术方面表现出更高的鲁棒性。此外，论文还展示了将监督微调 (SFT) 和 CB 结合使用，可以在接受相关查询的同时拒绝无关查询，从而实现最佳性能。

链接: https://arxiv.org/abs/2410.21597
作者: David Yunis,Siyu Huo,Chulaka Gunasekara,Danish Contractor
关键词-EN: user-facing applications, coding assistants, Language models, wide variety, variety of user-facing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. As these models are intended for particular purposes, they should not be able to answer irrelevant queries like requests for poetry or questions about physics, or even worse, queries that can only be answered by humans like sensitive company policies. Instead we would like them to only answer queries corresponding to desired behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods which could be used for scoping the behavior of language models. Among many other results, we show that a recently-proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks like sentiment analysis or summarization or even tasks with finer-grained scoping (e.g. summarizing only news articles). When compared to standard methods like fine-tuning or preference learning, CB is more robust both for out of distribution tasks, and to adversarial prompting techniques. We also show that layering SFT and CB together often results in the best of both worlds: improved performance only on relevant queries, while rejecting irrelevant ones.
摘要：语言模型目前被广泛应用于面向用户的各种应用中，通常用于特定目的，如回答文档相关问题或作为编码助手。由于这些模型旨在用于特定目的，因此它们不应能够回答与诗歌或物理学相关的不相关查询，更不用说那些只能由人类回答的查询，如敏感的公司政策。相反，我们希望它们只回答与期望行为相对应的查询，并拒绝所有其他请求，我们称之为“范围界定”。我们发现，尽管使用了系统提示，两个代表性的语言模型在范围界定方面表现不佳，并且会回答它们不应处理的查询。随后，我们对可用于范围界定语言模型行为的方法进行了全面的实证评估。在众多其他结果中，我们展示了最近提出的一种通用对齐方法——断路器（Circuit Breakers, CB），可以被改编用于范围界定语言模型，使其仅限于非常特定的任务，如情感分析或摘要，甚至更细粒度的任务（例如仅摘要新闻文章）。与标准方法如微调或偏好学习相比，CB在分布外任务和对抗性提示技术方面更为稳健。我们还展示了将监督微调（SFT）和CB结合使用通常能实现两者的最佳效果：仅在相关查询上提高性能，同时拒绝不相关的查询。

[NLP-54] Can Large Language Models Replace Data Scientists in Clinical Research?

【速读】：该论文试图解决的问题是如何评估和提升大型语言模型（LLMs）在临床研究数据科学任务中的实用性和准确性。解决方案的关键在于开发了一个包含293个真实世界数据科学编码任务的数据集，并引入了两种先进的适应方法：链式思维提示（chain-of-thought prompting）和自我反思（self-reflection）。链式思维提示通过提供逐步的数据分析计划，将代码准确性提高了60%；自我反思则通过迭代优化代码，使准确性提高了38%。此外，论文还开发了一个集成LLMs的平台，用于医疗专业人员的数据科学工作流程，通过用户研究验证了LLMs在简化编程过程中的显著作用，尽管不能完全自动化编码任务，但显著提高了数据科学效率。

链接: https://arxiv.org/abs/2410.21591
作者: Zifeng Wang,Benjamin Danek,Ziwei Yang,Zheng Chen,Jimeng Sun
关键词-EN: Data science, Data, data science tasks, clinical research, Data science plays
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Data science plays a critical role in clinical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, these tests do not assess LLMs’ ability to handle data science tasks in medicine, nor do they explore their practical utility in clinical research. To address this, we developed a dataset consisting of 293 real-world data science coding tasks, based on 39 published clinical studies, covering 128 tasks in Python and 165 tasks in R. This dataset simulates realistic clinical research scenarios using patient data. Our findings reveal that cutting-edge LLMs struggle to generate perfect solutions, frequently failing to follow input instructions, understand target data, and adhere to standard analysis practices. Consequently, LLMs are not yet ready to fully automate data science tasks. We benchmarked advanced adaptation methods and found two to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis, which led to a 60% improvement in code accuracy; and self-reflection, enabling LLMs to iteratively refine their code, yielding a 38% accuracy improvement. Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical doctors, we found that while LLMs cannot fully automate coding tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs, when integrated into expert workflows, to enhance data science efficiency in clinical research.
摘要：数据科学在临床研究中扮演着至关重要的角色，但这一领域需要具备编码和医学数据分析专业知识的专家。大语言模型（LLMs）在支持医疗任务和在一般编码测试中表现出色方面展现了巨大潜力。然而，这些测试并未评估LLMs处理医学数据科学任务的能力，也未探讨其在临床研究中的实际应用价值。为此，我们基于39项已发表的临床研究，开发了一个包含293个真实世界数据科学编码任务的数据集，涵盖了128个Python任务和165个R任务。该数据集利用患者数据模拟了真实的临床研究场景。我们的研究发现，最先进的LLMs在生成完美解决方案方面存在困难，常常未能遵循输入指令、理解目标数据和遵守标准分析实践。因此，LLMs尚不具备完全自动化数据科学任务的能力。我们测试了多种高级适应方法，发现其中两种特别有效：思维链提示（chain-of-thought prompting），它为数据分析提供了一个逐步的计划，使得代码准确性提高了60%；以及自我反思（self-reflection），使LLMs能够迭代优化其代码，准确性提高了38%。基于这些发现，我们开发了一个平台，将LLMs集成到医学专业人员的数据科学工作流程中。在一项包含五名医学博士的用户研究中，我们发现尽管LLMs无法完全自动化编码任务，但它们显著简化了编程过程。我们发现，80%的提交代码解决方案来自LLM生成的代码，某些情况下重复使用率高达96%。我们的分析强调了LLMs在融入专家工作流程时，能够提升临床研究中数据科学效率的潜力。

[NLP-55] hank You Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense

【速读】：该论文试图解决多语言大型语言模型（LLMs）在跨语言语义评估中的可靠性问题，特别是在非英语语言中的表现。解决方案的关键在于引入了一个新的跨语言语义消歧基准测试——StingrayBench，并通过使用“假朋友”（false friends）——即在两种语言中拼写相似但含义完全不同的词汇——来揭示LLMs在跨语言语义消歧中的局限性。研究收集了四种语言对的假朋友，并挑战LLMs在上下文中区分这些词汇的使用。分析结果显示，模型倾向于偏向于资源更丰富的语言。此外，论文还提出了新的度量标准，用于量化跨语言语义偏见和理解能力，从而推动更公平的多语言模型发展。

链接: https://arxiv.org/abs/2410.21573
作者: Samuel Cahyawijaya,Ruochen Zhang,Holy Lovenia,Jan Christian Blaise Cruz,Hiroki Nomoto,Alham Fikri Aji
关键词-EN: reliability beyond English, gained prominence, cross-lingual sense disambiguation, concerns arise, cross-lingual sense
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends – words that are orthographically similar but have completely different meanings in two languages – as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish the use of them in context. In our analysis of various models, we observe they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.
摘要：多语言大语言模型 (LLMs) 近年来备受瞩目，但其超越英语的可靠性问题引起了关注。本研究通过引入一种新的跨语言语义消歧基准——StingrayBench，填补了跨语言语义评估的空白。本文利用“假朋友”——即在两种语言中拼写相似但含义完全不同的词汇——作为评估跨语言语义消歧能力的一种方法。我们收集了四种语言对的“假朋友”，包括印尼语-马来语、印尼语-他加禄语、中文-日语和英语-德语，并挑战大语言模型在上下文中区分这些词汇的使用。在分析多种模型时，我们发现它们往往偏向于资源更丰富的语言。此外，我们基于此基准提出了新的量化跨语言语义偏差和理解能力的指标。本研究有助于开发更多样化和包容性的语言模型，促进更广泛的跨语言社区的公平访问。

[NLP-56] Semantic Search Evaluation CIKM2024

【速读】：该论文试图解决内容搜索系统中查询与返回结果之间的语义匹配问题，提出了一种新的评估方法。解决方案的关键在于引入了一个名为“主题相关率 (on-topic rate)”的指标，用于衡量返回结果与查询的相关性百分比。通过设计一个评估管道，包括定义黄金查询集、检索前K个结果，并利用GPT 3.5生成提示，该方法能够识别常见的失败模式并设定改进相关性的目标。

链接: https://arxiv.org/abs/2410.21549
作者: Chujie Zheng,Jeffrey Wang,Shuqian Albee Zhang,Anand Kishore,Siddharth Singh
关键词-EN: content search system, search system, content search, method for evaluating, evaluating the performance
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by 3rd International Workshop on Industrial Recommendation Systems (at CIKM 2024)

点击查看摘要

Abstract:We propose a novel method for evaluating the performance of a content search system that measures the semantic match between a query and the results returned by the search system. We introduce a metric called “on-topic rate” to measure the percentage of results that are relevant to the query. To achieve this, we design a pipeline that defines a golden query set, retrieves the top K results for each query, and sends calls to GPT 3.5 with formulated prompts. Our semantic evaluation pipeline helps identify common failure patterns and goals against the metric for relevance improvements.
摘要：我们提出了一种新颖的方法，用于评估内容搜索系统的性能，该方法通过测量查询与搜索系统返回结果之间的语义匹配度来进行评估。我们引入了一个名为“主题相关率”的指标，用于衡量结果中与查询相关的百分比。为此，我们设计了一个流程，该流程定义了一个黄金查询集，检索每个查询的前 K 个结果，并向 GPT 3.5 发送带有格式化提示的调用。我们的语义评估流程有助于识别常见的失败模式，并针对相关性改进的指标设定目标。

[NLP-57] MultiTok: Variable-Length Tokenization for Efficient LLM s Adapted from LZW Compression

【速读】：该论文试图解决大规模语言模型（LLMs）训练过程中资源消耗巨大的问题。解决方案的关键在于提出了一种新的分词方法——MultiTok，该方法受通用Lempel-Ziv-Welch数据压缩技术启发，将重复短语压缩为多词标记（multi-word tokens）。通过使用MultiTok，论文展示了语言模型能够在显著更高效的情况下进行训练，同时保持与BERT标准分词器相当的准确性，训练速度提高了近2.5倍，且所需训练数据减少了超过30%。

链接: https://arxiv.org/abs/2410.21548
作者: Noel Elias,Homa Esfahanizadeh,Kaan Kale,Sriram Vishwanath,Muriel Medard
关键词-EN: natural language processing, complex natural language, drastically changed, changed the prospects, introducing technologies
类目: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not limited to large amounts of data, expensive machinery, and lengthy training. To solve this problem, this paper proposes a new tokenization method inspired by universal Lempel-Ziv-Welch data compression that compresses repetitive phrases into multi-word tokens. With MultiTok as a new tokenizing tool, we show that language models are able to be trained notably more efficiently while offering a similar accuracy on more succinct and compressed training data. In fact, our results demonstrate that MultiTok achieves a comparable performance to the BERT standard as a tokenizer while also providing close to 2.5x faster training with more than 30% less training data.
摘要：大语言模型通过引入更复杂的自然语言处理技术，极大地改变了 AI 的前景。然而，当前训练这些大语言模型的方法需要大量的资源，包括但不限于大量的数据、昂贵的设备和长时间的训练。为了解决这一问题，本文提出了一种新的 Token 化方法，该方法受到通用 Lempel-Ziv-Welch 数据压缩技术的启发，将重复的短语压缩成多词 Token。通过使用 MultiTok 这一新的 Token 化工具，我们展示了语言模型能够在更高效地训练的同时，在更简洁和压缩的训练数据上提供相似的准确性。事实上，我们的结果表明，MultiTok 在作为 Token 化工具时，其性能与 BERT 标准相当，同时提供了接近 2.5 倍的训练速度，并且减少了超过 30% 的训练数据。

[NLP-58] Unveiling Context-Aware Criteria in Self-Assessing LLM s

【速读】：该论文试图解决现有大型语言模型（LLMs）作为评估者时依赖静态、人为定义的评估标准，导致其在多样化的生成任务中泛化能力受限的问题。解决方案的关键在于提出了一种新的自评估LLM框架，即集成上下文感知标准（SALC）与动态知识，为每个评估实例量身定制相关且上下文感知的洞察，从而提升评估性能。该框架不依赖预定义的人类标准，能够灵活适应各种任务，并通过知识蒸馏技术优化了较小语言模型的标准生成和评估能力，显著提高了评估效果，特别是在直接偏好优化（DPO）中的偏好数据生成方面表现出色。

链接: https://arxiv.org/abs/2410.21545
作者: Taneesh Gupta,Shivam Shandilya,Xuchao Zhang,Supriyo Ghosh,Chetan Bansal,Huaxiu Yao,Saravan Rajmohan
关键词-EN: long-form response assessments, garnered significant attention, significant attention due, rival human-level evaluations, response assessments
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The use of large language models (LLMs) as evaluators has garnered significant attention due to their potential to rival human-level evaluations in long-form response assessments. However, current LLM evaluators rely heavily on static, human-defined criteria, limiting their ability to generalize across diverse generative tasks and incorporate context-specific knowledge. In this paper, we propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance. This instance-level knowledge enhances the LLM evaluator’s performance by providing relevant and context-aware insights that pinpoint the important criteria specific to the current instance. Additionally, the proposed framework adapts seamlessly to various tasks without relying on predefined human criteria, offering a more flexible evaluation approach. Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks, yielding improvements on average 4.8% across a wide variety of datasets. Furthermore, by leveraging knowledge distillation techniques, we fine-tuned smaller language models for criteria generation and evaluation, achieving comparable or superior performance to larger models with much lower cost. Our method also exhibits a improvement in LC Win-Rate in AlpacaEval2 leaderboard up to a 12% when employed for preference data generation in Direct Preference Optimization (DPO), underscoring its efficacy as a robust and scalable evaluation framework.
摘要：大语言模型 (LLM) 作为评估工具的使用因其潜在的与人类水平评估相媲美的能力而在长篇回答评估中引起了广泛关注。然而，当前的 LLM 评估工具严重依赖于静态的、人类定义的标准，这限制了它们在多样生成任务中的泛化能力以及对特定情境知识的整合。本文提出了一种新颖的自评估 LLM 框架，该框架将情境感知标准 (SALC) 与针对每个评估实例定制的动态知识相结合。这种实例级别的知识通过提供相关且情境感知的洞察，增强了 LLM 评估工具的性能，这些洞察能够精准地识别出当前实例的重要标准。此外，所提出的框架能够无缝适应各种任务，而不依赖于预定义的人类标准，提供了更为灵活的评估方法。实证评估表明，我们的方法显著优于现有的基线评估框架，在多种数据集上的平均改进率达到 4.8%。此外，通过利用知识蒸馏技术，我们微调了较小的语言模型用于标准生成和评估，实现了与较大模型相当甚至更优的性能，同时成本大幅降低。我们的方法在 AlpacaEval2 排行榜上的 LC 胜率提升了高达 12%，当应用于直接偏好优化 (DPO) 中的偏好数据生成时，突显了其作为稳健且可扩展评估框架的有效性。

[NLP-59] L3Ms – Lagrange Large Language Models

【速读】：该论文试图解决在监督微调 (Supervised Fine-Tuning, SFT) 和大型语言模型 (Large Language Models, LLMs) 对齐过程中，依赖启发式选择进行优化的问题。解决方案的关键在于将SFT和对齐过程形式化为一个约束优化问题，并通过提出拉格朗日大型语言模型 (Lagrange Large Language Models, L3Ms) 来解决。L3Ms 使用对数障碍法来强制执行约束，从而避免了启发式驱动的过程，并允许在不同应用中定制化对齐。实验结果表明，L3Ms 在实现各种应用的定制化对齐方面具有多功能性和有效性。

链接: https://arxiv.org/abs/2410.21533
作者: Guneet S. Dhillon,Xingjian Shi,Yee Whye Teh,Alex Smola
关键词-EN: good user experience, large language models, Supervised fine-tuning, language models, user experience
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) and alignment of large language models (LLMs) are key steps in providing a good user experience. However, the concept of an appropriate alignment is inherently application-dependent, and current methods often rely on heuristic choices to drive the optimization. In this work, we formulate SFT and alignment as a constrained optimization problem, where the LLM is trained on a task while being required to meet application-specific requirements, without resorting to heuristics. To solve this, we propose Lagrange Large Language Models (L3Ms), which employ logarithmic barriers to enforce the constraints. This approach allows for the customization of L3Ms across diverse applications while avoiding heuristic-driven processes. We demonstrate experimentally the versatility and efficacy of L3Ms in achieving tailored alignments for various applications.
摘要：监督微调 (Supervised Fine-Tuning, SFT) 和大语言模型 (Large Language Models, LLMs) 的对齐是提供良好用户体验的关键步骤。然而，合适的对齐概念本质上是依赖于应用的，当前的方法通常依赖于启发式选择来驱动优化。在这项工作中，我们将 SFT 和对齐形式化为一个约束优化问题，其中大语言模型在执行任务的同时需要满足特定应用的要求，而不依赖于启发式方法。为此，我们提出了拉格朗日大语言模型 (Lagrange Large Language Models, L3Ms)，它采用对数障碍法来强制执行约束。这种方法允许在不同应用中定制 L3Ms，同时避免了启发式驱动的过程。我们通过实验展示了 L3Ms 在实现各种应用的定制对齐方面的多功能性和有效性。

[NLP-60] Not All LLM -Generated Data Are Equal: Rethinking Data Weighting in Text Classification

【速读】：该论文试图解决生成式数据增强（Synthetic data augmentation）中，由大型语言模型（LLMs）生成的数据与真实世界数据分布不一致的问题。解决方案的关键在于提出了一种高效的加权损失方法（weighted-loss approaches），通过强调高质量和多样化的生成数据，并结合少量真实世界数据，来实现生成数据与真实数据分布的对齐。实验结果表明，该方法在多个文本分类任务中显著优于标准交叉熵损失和其他数据加权方法，为有效利用任何合适的生成数据进行模型训练提供了潜在的解决方案。

链接: https://arxiv.org/abs/2410.21526
作者: Hsun-Yu Kuo,Yin-Hsiang Liao,Yu-Chieh Chao,Wei-Yun Ma,Pu-Jen Cheng
关键词-EN: large language models, leverage additional training, Synthetic data augmentation, augmentation via large, large language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can bring deficient outcomes while applying the trained model to applications. Therefore, we proposed efficient weighted-loss approaches to align synthetic data with real-world distribution by emphasizing high-quality and diversified data generated by LLMs with using merely a little real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed leveraging our approaches on a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions to effectively leveraging synthetic data from any suitable data generator for model training.
摘要：通过大语言模型（LLM）进行合成数据增强，使研究人员能够利用额外的训练数据，从而提升下游任务的性能，尤其是在真实世界数据稀缺的情况下。然而，生成的数据可能与真实世界数据存在偏差，这种不一致性在将训练模型应用于实际应用时可能导致不良结果。因此，我们提出了一种高效的加权损失方法，通过强调由大语言模型生成的高质量和多样化数据，并仅使用少量真实世界数据，来对齐合成数据与真实世界数据的分布。我们在多个文本分类任务上实证评估了该方法的有效性，结果显示，在BERT级别的模型上应用我们的方法，其性能显著优于标准的交叉熵损失和其他数据加权方法，为有效利用任何合适的数据生成器生成的合成数据进行模型训练提供了潜在的解决方案。

[NLP-61] LLM -Forest for Health Tabular Data Imputation

【速读】：该论文试图解决表格数据集中的缺失数据插补问题，特别是在医疗领域，数据完整性对于准确分析至关重要。解决方案的关键在于提出了一种名为LLM-Forest的新框架，该框架利用少量样本学习（few-shot learning）的大型语言模型（LLMs）构建了一个“森林”，并通过基于置信度的加权投票机制来提高插补的准确性。LLM-Forest的核心创新在于引入二部信息图（bipartite information graphs）来识别高质量的相关邻近条目，这些条目在特征和值的粒度上都有所体现，从而有效缓解了LLM幻觉（LLM hallucinations）的风险。

链接: https://arxiv.org/abs/2410.21520
作者: Xinrui He,Yikun Ban,Jiaru Zou,Tianxin Wei,Curtiss B. Cook,Jingrui He
关键词-EN: Missing data imputation, Missing data, tabular data imputation, accurate analysis, completeness is vital
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Missing data imputation is a critical challenge in tabular datasets, especially in healthcare, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for tabular data imputation. However, challenges persist in designing effective prompts for a finetuning-free process and in mitigating the risk of LLM hallucinations. To address these issues, we propose a novel framework, LLM-Forest, which introduces a “forest” of few-shot learning LLM “trees” with confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries with both feature and value granularity. Extensive experiments on four real-world healthcare datasets demonstrate the effectiveness and efficiency of LLM-Forest.
摘要：缺失数据插补是表格数据集中的一个关键挑战，尤其是在医疗领域，数据的完整性对于准确分析至关重要。大语言模型（Large Language Models, LLMs）经过大量语料库的训练，在数据生成方面展现出强大的潜力，使其成为表格数据插补的有力工具。然而，在设计无需微调的有效提示以及减轻大语言模型产生幻觉的风险方面仍存在挑战。为解决这些问题，我们提出了一种新颖的框架——LLM-Forest，该框架引入了一种基于信心的加权投票机制，通过少样本学习（Few-shot Learning）的“树”结构来构建“森林”。该框架建立在一个新的二分信息图概念之上，以识别具有特征和值粒度的高质量相关邻近条目。在四个真实世界的医疗数据集上的广泛实验证明了LLM-Forest的有效性和效率。

[NLP-62] Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

【速读】：该论文试图解决稀疏自编码器 (Sparse AutoEncoders, SAEs) 在大规模语言模型 (Large Language Models, LLMs) 中训练计算量大的问题。解决方案的关键在于提出了一种新的训练策略，即将多个连续层的SAEs合并为一个进行训练，从而减少了训练的SAEs数量。通过这种层聚类 (layer clustering) 的方法，实验结果表明在Pythia 160M模型上实现了高达6倍的训练速度提升，同时不降低重建质量和下游任务的性能。

链接: https://arxiv.org/abs/2410.21508
作者: Davide Ghilardi,Federico Belotti,Marco Molinari
关键词-EN: Large Language Models, Large Language, workings of Large, Language Models, Sparse AutoEnocders
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse AutoEnocders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of Large Language Models (LLMs). They reconstruct the model’s activations with a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one for a given group of contiguous layers. Our experimental results on Pythia 160M highlight a speedup of up to 6x without compromising the reconstruction quality and performance on downstream tasks. Therefore, layer clustering presents an efficient approach to train SAEs in modern LLMs.
摘要：稀疏自编码器 (Sparse AutoEncoders, SAEs) 最近被用作一种无监督方法，用于理解大语言模型 (Large Language Models, LLMs) 的内部工作机制。它们通过稀疏线性组合的可解释特征来重建模型的激活状态。然而，训练 SAEs 的计算量巨大，尤其是在模型规模和复杂性增加的情况下。为应对这一挑战，我们提出了一种新颖的训练策略，将每层训练一个 SAE 的方式改为针对一组连续层训练一个 SAE。我们在 Pythia 160M 上的实验结果表明，该方法在不降低重建质量和下游任务性能的前提下，实现了高达 6 倍的加速。因此，层聚类为现代 LLMs 中训练 SAEs 提供了一种高效的方法。

[NLP-63] SandboxAQs submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval EMNLP2024

【速读】：该论文试图解决多语言环境下问答系统 (Question Answering, QA) 和命名实体识别 (Named Entity Recognition, NER) 的性能问题。解决方案的关键在于测试了五种大型语言模型 (Large Language Models) 在不同语言和任务中的表现，并采用了多种提示方法 (prompting methods)，包括零样本学习 (zero-shot)、思维链推理 (chain-of-thought reasoning) 和翻译技术 (translation techniques)。研究发现，高级提示技术通常能提升问答系统的性能，但对命名实体识别的效果不一，且不同任务和语言之间的难度模式存在差异。这表明需要针对具体任务采用特定的多语言自然语言处理 (NLP) 方法，并指出当前模型可能在不同任务上发展出不同的语言能力。

链接: https://arxiv.org/abs/2410.21501
作者: Isidora Chara Tourni,Sayontan Ghosh,Brenda Miao,Constantijn van der Poel
关键词-EN: Named Entity Recognition, Question Answering, Entity Recognition, Named Entity, problems of Question
类目: Computation and Language (cs.CL)
备注: MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval; 4th Multilingual Representation Learning (MRL) Workshop; EMNLP 2024

点击查看摘要

Abstract:This paper explores the problems of Question Answering (QA) and Named Entity Recognition (NER) in five diverse languages. We tested five Large Language Models with various prompting methods, including zero-shot, chain-of-thought reasoning, and translation techniques. Our results show that while some models consistently outperform others, their effectiveness varies significantly across tasks and languages. We saw that advanced prompting techniques generally improved QA performance but had mixed results for NER; and we observed that language difficulty patterns differed between tasks. Our findings highlight the need for task-specific approaches in multilingual NLP and suggest that current models may develop different linguistic competencies for different tasks.
摘要：本文探讨了在五种不同语言中问答系统 (Question Answering, QA) 和命名实体识别 (Named Entity Recognition, NER) 的问题。我们测试了五种大语言模型 (Large Language Model, LLM)，采用了多种提示方法，包括零样本 (zero-shot)、思维链推理 (chain-of-thought reasoning) 和翻译技术。结果显示，尽管某些模型在某些任务中持续表现优于其他模型，但它们在不同任务和语言中的有效性存在显著差异。我们发现，先进的提示技术通常能提升问答系统的性能，但对命名实体识别的效果则参差不齐；此外，我们观察到不同任务之间的语言难度模式存在差异。这些发现强调了在多语言自然语言处理 (NLP) 中需要针对特定任务的方法，并表明当前模型可能在不同任务中发展出不同的语言能力。

[NLP-64] RoBIn: A Transformer-Based Model For Risk Of Bias Inference With Machine Reading Comprehension

【速读】：该论文试图解决科学出版物中偏倚风险（Risk of Bias, RoB）评估的自动化问题。解决方案的关键在于开发了一种名为RoBIn（Risk of Bias Inference）的创新模型，该模型采用双任务方法，从给定文本中提取证据并基于提取的证据评估RoB。具体方法包括使用Cochrane Database of Systematic Reviews (CDSR)的数据作为基准，对PubMed中的开放获取临床试验出版物进行标注，以生成用于机器阅读理解和RoB推断的训练和测试数据集。此外，论文还提出了基于Transformer的两种方法：提取式（RoBInExt）和生成式（RoBInGen），用于有效提取相关证据并分类RoB。实验结果表明，RoBIn在多种场景下优于传统的机器学习和大型语言模型（LLM）方法，实现了0.83的ROC AUC。

链接: https://arxiv.org/abs/2410.21495
作者: Abel Corrêa Dias,Viviane Pereira Moreira,João Luiz Dihl Comba
关键词-EN: Scientific publications play, shaping healthcare policies, Scientific publications, Risk of Bias, testing novel drugs
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Objective: Scientific publications play a crucial role in uncovering insights, testing novel drugs, and shaping healthcare policies. Accessing the quality of publications requires evaluating their Risk of Bias (RoB), a process typically conducted by human reviewers. In this study, we introduce a new dataset for machine reading comprehension and RoB assessment and present RoBIn (Risk of Bias Inference), an innovative model crafted to automate such evaluation. The model employs a dual-task approach, extracting evidence from a given context and assessing the RoB based on the gathered evidence. Methods: We use data from the Cochrane Database of Systematic Reviews (CDSR) as ground truth to label open-access clinical trial publications from PubMed. This process enabled us to develop training and test datasets specifically for machine reading comprehension and RoB inference. Additionally, we created extractive (RoBInExt) and generative (RoBInGen) Transformer-based approaches to extract relevant evidence and classify the RoB effectively. Results: RoBIn is evaluated across various settings and benchmarked against state-of-the-art methods for RoB inference, including large language models in multiple scenarios. In most cases, the best-performing RoBIn variant surpasses traditional machine learning and LLM-based approaches, achieving an ROC AUC of 0.83. Conclusion: Based on the evidence extracted from clinical trial reports, RoBIn performs a binary classification to decide whether the trial is at a low RoB or a high/unclear RoB. We found that both RoBInGen and RoBInExt are robust and have the best results in many settings.
摘要：
目标：科学出版物在揭示新知、测试新药以及塑造医疗政策方面发挥着关键作用。评估出版物的质量需要对其偏倚风险 (Risk of Bias, RoB) 进行评估，这一过程通常由人工评审员完成。在本研究中，我们引入了一个新的机器阅读理解与偏倚风险评估数据集，并提出了 RoBIn (Risk of Bias Inference)，这是一种创新的模型，旨在自动化此类评估。该模型采用双任务方法，从给定上下文中提取证据，并基于收集到的证据评估偏倚风险。

方法：我们使用 Cochrane 系统评价数据库 (Cochrane Database of Systematic Reviews, CDSR) 的数据作为真实标签，从 PubMed 中标记开放获取的临床试验出版物。这一过程使我们能够专门为机器阅读理解和偏倚风险推断开发训练和测试数据集。此外，我们创建了基于 Transformer 的抽取式 (RoBInExt) 和生成式 (RoBInGen) 方法，以有效提取相关证据并分类偏倚风险。

结果：RoBIn 在多种设置下进行了评估，并与当前最先进的偏倚风险推断方法进行了基准测试，包括在多种场景下的大语言模型。在大多数情况下，表现最佳的 RoBIn 变体超越了传统的机器学习和大语言模型方法，实现了 0.83 的 ROC AUC。

结论：基于从临床试验报告中提取的证据，RoBIn 执行二分类以决定试验是处于低偏倚风险还是高/不明确偏倚风险。我们发现，RoBInGen 和 RoBInExt 均表现稳健，并在许多设置中取得了最佳结果。

[NLP-65] FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks

【速读】：该论文试图解决大语言模型 (LLMs) 在集成外部信息时面临的提示注入攻击 (prompt injection attacks) 问题。解决方案的关键在于引入了一种名为基于哈希标签的格式化认证 (Formatting AuThentication with Hash-based tags, FATH) 的新型测试时防御策略。与现有方法不同，FATH 不仅阻止 LLMs 响应外部文本中的额外指令，还通过实施一个认证系统，要求 LLMs 根据安全策略响应所有接收到的指令，并选择性地过滤出最终输出的用户指令响应。该方法利用哈希认证标签标记每个响应，从而根据用户指令准确识别响应，并增强对适应性攻击的鲁棒性。实验结果表明，FATH 能有效防御间接提示注入攻击，在 Llama3 和 GPT3.5 模型上实现了最先进的防御性能。

链接: https://arxiv.org/abs/2410.21492
作者: Jiongxiao Wang,Fangzhou Wu,Wendi Li,Jinsheng Pan,Edward Suh,Z. Morley Mao,Muhao Chen,Chaowei Xiao
关键词-EN: Large language models, Large language, widely deployed, Large, real-world applications
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. However, integrating external information into LLM-integrated applications raises significant security concerns. Among these, prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. While both training-time and test-time defense methods have been developed to mitigate such attacks, the unaffordable training costs associated with training-time methods and the limited effectiveness of existing test-time methods make them impractical. This paper introduces a novel test-time defense strategy, named Formatting AuThentication with Hash-based tags (FATH). Unlike existing approaches that prevent LLMs from answering additional instructions in external text, our method implements an authentication system, requiring LLMs to answer all received instructions with a security policy and selectively filter out responses to user instructions as the final output. To achieve this, we utilize hash-based authentication tags to label each response, facilitating accurate identification of responses according to the user’s instructions and improving the robustness against adaptive attacks. Comprehensive experiments demonstrate that our defense method can effectively defend against indirect prompt injection attacks, achieving state-of-the-art performance under Llama3 and GPT3.5 models across various attack methods. Our code is released at: this https URL
摘要：大语言模型 (LLM) 已被广泛部署为实际应用中的核心组件，并结合了额外的工具和文本信息。然而，将外部信息整合到 LLM 集成的应用中引发了显著的安全问题。其中，提示注入攻击尤为严重，恶意指令通过外部文本信息注入，可以利用 LLM 生成攻击者期望的答案。尽管已经开发了训练时和测试时的防御方法来缓解此类攻击，但训练时方法的高昂训练成本和现有测试时方法的有限效果使其难以实际应用。本文提出了一种新的测试时防御策略，名为基于哈希标签的格式化认证 (Formatting AuThentication with Hash-based tags, FATH)。与现有方法不同，我们的方法实施了一个认证系统，要求 LLM 根据安全策略回答所有接收到的指令，并选择性地过滤掉对用户指令的响应作为最终输出。为此，我们利用基于哈希的认证标签来标记每个响应，从而根据用户指令准确识别响应，并提高对自适应攻击的鲁棒性。综合实验表明，我们的防御方法能够有效防御间接提示注入攻击，在 Llama3 和 GPT3.5 模型下，针对各种攻击方法均达到了最先进的性能。我们的代码已发布于：this https URL

[NLP-66] Can Large Language Models Act as Symbolic Reasoners?

【速读】：该论文试图解决大语言模型 (Large language models, LLMs) 在推理能力方面的局限性问题，特别是它们是否能够解释其推理过程和结论。解决方案的关键在于探讨 LLMs 是否具备内在的推理能力，或者是否需要额外的支持组件来实现推理。论文通过回顾现有文献，评估 LLMs 在特定领域或一般情况下的推理能力，并识别当前研究中的空白和未来研究趋势，以提出改进 LLMs 解释性的潜在方向。

链接: https://arxiv.org/abs/2410.21490
作者: Rob Sullivan,Nelly Elsayed
关键词-EN: Large language models, performance of Large, Large language, language models, broad range
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 18 pages, currently under review

点击查看摘要

Abstract:The performance of Large language models (LLMs) across a broad range of domains has been impressive but have been critiqued as not being able to reason about their process and conclusions derived. This is to explain the conclusions draw, and also for determining a plan or strategy for their approach. This paper explores the current research in investigating symbolic reasoning and LLMs, and whether an LLM can inherently provide some form of reasoning or whether supporting components are necessary, and, if there is evidence for a reasoning capability, is this evident in a specific domain or is this a general capability? In addition, this paper aims to identify the current research gaps and future trends of LLM explainability, presenting a review of the literature, identifying current research into this topic and suggests areas for future work.
摘要：大语言模型 (LLMs) 在广泛领域的性能表现令人印象深刻，但也被批评为无法对其推理过程和得出的结论进行解释。这种解释不仅是为了说明结论的依据，也是为了确定其方法的计划或策略。本文探讨了当前关于符号推理和大语言模型的研究，探讨了 LLM 是否能内在地提供某种形式的推理，或者是否需要支持组件，以及如果有证据表明存在推理能力，这种能力是特定领域的表现还是普遍存在的能力。此外，本文旨在识别当前 LLM 可解释性研究中的空白和未来趋势，通过对文献的综述，识别当前对该主题的研究，并提出未来工作的方向。

[NLP-67] SpeechQE: Estimating the Quality of Direct Speech Translation EMNLP2024

【速读】：该论文试图解决语音翻译（Speech Translation）质量评估的问题，特别是针对语音翻译这一未被充分探索的领域。解决方案的关键在于引入了一种基于预训练文本大语言模型（LLM）的端到端系统（end-to-end system），并构建了一个基准测试。研究结果表明，端到端方法比传统的级联系统（cascaded systems）更适合直接评估语音翻译的质量。论文强调，语音翻译的质量评估应作为一个独立的问题进行研究，并公开了数据和模型以推动该领域的进一步研究。

链接: https://arxiv.org/abs/2410.21485
作者: HyoJung Han,Kevin Duh,Marine Carpuat
关键词-EN: Recent advances, speech modality underexplored, written language, modality underexplored, advances in automatic
类目: Computation and Language (cs.CL)
备注: EMNLP2024

点击查看摘要

Abstract:Recent advances in automatic quality estimation for machine translation have exclusively focused on written language, leaving the speech modality underexplored. In this work, we formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures. In this process, we introduce a novel end-to-end system leveraging pre-trained text LLM. Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than using quality estimation systems designed for text in cascaded systems. More broadly, we argue that quality estimation of speech translation needs to be studied as a separate problem from that of text, and release our data and models to guide further research in this space.
摘要：近年来，机器翻译自动质量评估的进展主要集中在书面语言上，而对语音模式的探索相对不足。在本研究中，我们提出了语音翻译质量评估（SpeechQE）的任务，构建了一个基准测试，并评估了一系列基于级联和端到端架构的系统。在此过程中，我们引入了一种利用预训练文本大语言模型（LLM）的新型端到端系统。结果表明，与使用为文本设计的质量评估系统相比，端到端方法更适合直接评估语音翻译的质量。更广泛地说，我们认为语音翻译的质量评估需要作为一个独立的问题进行研究，并发布了我们的数据和模型，以指导该领域的进一步研究。

[NLP-68] AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification

【速读】：该论文试图解决人工智能（AI）在科学研究中应用时面临的信任和可解释性问题，特别是在图像分类任务中。解决方案的关键在于引入AiSciVision框架，该框架通过两个核心组件实现：(1) 视觉检索增强生成（Visual Retrieval-Augmented Generation, VisRAG）和 (2) 领域特定工具的应用。VisRAG组件通过检索与目标图像最相似的正负标签图像作为上下文，帮助大模态模型（Large Multimodal Models, LMMs）进行更准确的分类。领域特定工具则模拟领域专家的工作流程，允许模型在多轮迭代中选择和应用工具来操作和检查目标图像，从而在最终预测前进行分析的精细化。每个推理过程不仅生成预测结果，还生成详细的自然语言记录，解释推理过程和工具使用情况，从而提高模型的透明度和可解释性。

链接: https://arxiv.org/abs/2410.21480
作者: Brendan Hogan,Anmol Kabra,Felipe Siqueira Pacheco,Laura Greenstreet,Joshua Fan,Aaron Ferber,Marta Ummus,Alecsander Brito,Olivia Graham,Lillian Aoki,Drew Harvell,Alex Flecker,Carla Gomes
关键词-EN: Artificial Intelligence, black boxes offering, boxes offering limited, offering limited transparency, Large Multimodal Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trust and interpretability are crucial for the use of Artificial Intelligence (AI) in scientific research, but current models often operate as black boxes offering limited transparency and justifications for their outputs. We introduce AiSciVision, a framework that specializes Large Multimodal Models (LMMs) into interactive research partners and classification models for image classification tasks in niche scientific domains. Our framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools utilized in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. Then the LMM agent actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction. These VisRAG and tooling components are designed to mirror the processes of domain experts, as humans often compare new data to similar examples and use specialized tools to manipulate and inspect images before arriving at a conclusion. Each inference produces both a prediction and a natural language transcript detailing the reasoning and tool usage that led to the prediction. We evaluate AiSciVision on three real-world scientific image classification datasets: detecting the presence of aquaculture ponds, diseased eelgrass, and solar panels. Across these datasets, our method outperforms fully supervised models in low and full-labeled data settings. AiSciVision is actively deployed in real-world use, specifically for aquaculture research, through a dedicated web application that displays and allows the expert users to converse with the transcripts. This work represents a crucial step toward AI systems that are both interpretable and effective, advancing their use in scientific research and scientific discovery.
摘要：在科学研究中使用人工智能（AI）时，信任和可解释性至关重要，但当前的模型往往作为黑箱运行，提供有限的透明度和对其输出的解释。我们引入了 AiSciVision，这是一个将大模态模型（Large Multimodal Models, LMMs）专门化为交互式研究伙伴和针对特定科学领域图像分类任务的分类模型的框架。我们的框架使用两个关键组件：（1）视觉检索增强生成（Visual Retrieval-Augmented Generation, VisRAG）和（2）在智能体工作流程中使用的领域特定工具。为了对目标图像进行分类，AiSciVision 首先检索与目标图像最相似的正负标签图像作为 LMM 的上下文。然后，LMM 智能体主动选择并应用工具，在多轮次中操作和检查目标图像，在做出最终预测之前细化其分析。这些 VisRAG 和工具组件的设计旨在模拟领域专家的过程，因为人类通常会将新数据与类似示例进行比较，并使用专用工具在得出结论之前操作和检查图像。每次推理都会产生一个预测和一个自然语言记录，详细说明导致该预测的推理和工具使用情况。我们在三个真实世界的科学图像分类数据集上评估了 AiSciVision：检测水产养殖池塘、病态鳗草和太阳能电池板的存在。在这些数据集中，我们的方法在低标签和全标签数据设置下均优于完全监督模型。AiSciVision 正在实际应用中积极部署，特别是在水产养殖研究中，通过一个专门的网络应用程序，专家用户可以查看并与之对话的记录。这项工作代表了向既可解释又有效的 AI 系统迈出的关键一步，推动了其在科学研究和科学发现中的应用。

[NLP-69] ransformLLM : Adapting Large Language Models via LLM -Transformed Reading Comprehension Text

【速读】：该论文试图解决大型语言模型（LLMs）在特定领域（如法律）应用中的准确性和成本问题。解决方案的关键在于通过持续预训练（continued pre-training）和使用大型语言模型进行数据转换，开发出专门针对法律领域的语言模型，如Phi-2-Legal和Mistral-Legal-7B。这些模型基于Phi-2和Mistral-7B-v0.1，并通过超过5亿个法律文本标记的持续预训练，显著提升了在法律任务中的表现，甚至在法律基准测试中超越了使用更大数据集和更多资源训练的模型。这种方法不仅提高了模型的领域专业性，还保留了其通用语言理解能力，强调了领域适应性预训练和阅读理解在开发高效领域特定语言模型中的潜力。

链接: https://arxiv.org/abs/2410.21479
作者: Iftach Arbel,Yehonathan Refael,Ofir Lindenbaum
关键词-EN: Large Language Models, accuracy and costs, Models, Language Models, promise in highly-specialized
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise in highly-specialized domains, however challenges are still present in aspects of accuracy and costs. These limitations restrict the usage of existing models in domain-specific tasks. While fine-tuning pre-trained models have shown promising results, this process can be computationally expensive and require massive datasets of the specialized application in hand. In this work, we bridge that gap. We have developed Phi-2-Legal and Mistral-Legal-7B, which are language models specifically designed for legal applications. These models are based on Phi-2 and Mistral-7B-v0.1, and have gone through continued pre-training with over 500 million tokens of legal texts. Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text. Our legal LLMs have demonstrated superior performance in legal benchmarks, even outperforming models trained on much larger datasets with more resources. This work emphasizes the effectiveness of continued pre-training on domain-specific texts, while using affordable LLMs for data conversion, which gives these models domain expertise while retaining general language understanding capabilities. While this work uses the legal domain as a test case, our method can be scaled and applied to any pre-training dataset, resulting in significant improvements across different tasks. These findings underscore the potential of domain-adaptive pre-training and reading comprehension for the development of highly effective domain-specific language models.
摘要：大语言模型 (LLMs) 在高度专业化的领域中显示出潜力，但在准确性和成本方面仍存在挑战。这些限制限制了现有模型在特定领域任务中的应用。尽管对预训练模型进行微调已显示出有希望的结果，但这一过程可能计算成本高昂，并且需要大量特定应用的数据集。在本研究中，我们填补了这一空白。我们开发了 Phi-2-Legal 和 Mistral-Legal-7B，这些语言模型专门设计用于法律应用。这些模型基于 Phi-2 和 Mistral-7B-v0.1，并经过持续预训练，使用了超过 5 亿个法律文本 Token。我们创新的方法通过使用大语言模型 (LLMs) 将原始训练数据转换为阅读理解文本，显著提升了法律任务的能力。我们的法律大语言模型在法律基准测试中表现出色，甚至优于在更大数据集上训练且资源更多的模型。本研究强调了在特定领域文本上进行持续预训练的有效性，同时使用经济实惠的大语言模型进行数据转换，这使得模型在保留通用语言理解能力的同时获得了领域专业知识。尽管本研究以法律领域为测试案例，但我们的方法可以扩展并应用于任何预训练数据集，从而在不同任务中实现显著改进。这些发现突显了领域适应性预训练和阅读理解在开发高效特定领域语言模型方面的潜力。

[NLP-70] Estimating Causal Effects of Text Interventions Leveraging LLM s

【速读】：该论文试图解决在社会系统中量化文本干预效果的问题，特别是在减少社交媒体帖子中的愤怒情绪以观察其对用户参与度的影响时。由于直接干预真实世界系统通常不可行，因此需要依赖观测数据。传统因果推断方法（通常设计用于二元或离散处理）无法处理文本数据的复杂高维特性。论文提出的解决方案之关键是CausalDANN，这是一种利用大型语言模型（LLMs）进行文本转换的新方法，能够估计因果效应。与现有方法不同，CausalDANN能够适应任意文本干预，并利用具有域适应能力的文本级分类器，即使在仅观察到对照组的情况下，也能产生对抗域偏移的稳健效应估计。这种灵活处理各种文本干预的能力是文本数据因果估计的关键进步，有助于更好地理解人类行为并制定有效的社会系统政策。

链接: https://arxiv.org/abs/2410.21474
作者: Siyi Guo,Myrl G. Marmarelis,Fred Morstatter,Kristina Lerman
关键词-EN: poses significant challenges, social media posts, impact on engagement, poses significant, reducing anger
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantifying the effect of textual interventions in social systems, such as reducing anger in social media posts to see its impact on engagement, poses significant challenges. Direct interventions on real-world systems are often infeasible, necessitating reliance on observational data. Traditional causal inference methods, typically designed for binary or discrete treatments, are inadequate for handling the complex, high-dimensional nature of textual data. This paper addresses these challenges by proposing a novel approach, CausalDANN, to estimate causal effects using text transformations facilitated by large language models (LLMs). Unlike existing methods, our approach accommodates arbitrary textual interventions and leverages text-level classifiers with domain adaptation ability to produce robust effect estimates against domain shifts, even when only the control group is observed. This flexibility in handling various text interventions is a key advancement in causal estimation for textual data, offering opportunities to better understand human behaviors and develop effective policies within social systems.
摘要：量化文本干预在社会系统中的效果，例如减少社交媒体帖子中的愤怒情绪以观察其对参与度的影响，面临着显著的挑战。直接干预现实世界系统通常不可行，因此需要依赖观察数据。传统的因果推断方法，通常设计用于二元或离散处理，不足以处理文本数据的复杂、高维特性。本文通过提出一种新颖的方法——CausalDANN，来应对这些挑战，该方法利用大语言模型（LLM）实现的文本转换来估计因果效应。与现有方法不同，我们的方法能够适应任意文本干预，并利用具备领域适应能力的文本级分类器，即使在仅观察到对照组的情况下，也能针对领域偏移产生稳健的效应估计。这种处理各种文本干预的灵活性是文本数据因果估计的关键进展，为更好地理解人类行为和在社会系统内制定有效政策提供了机会。

[NLP-71] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

【速读】：该论文试图解决长上下文大语言模型（LLMs）在高吞吐量推理中面临的内存占用和访问延迟问题。解决方案的关键是提出了一种名为ShadowKV的高吞吐量长上下文LLM推理系统。ShadowKV通过存储低秩的键缓存（key cache）并将值缓存（value cache）卸载到CPU，从而减少内存占用，支持更大的批处理大小和更长的序列。为了最小化解码延迟，ShadowKV采用了一种精确的键值对选择策略，动态重建最小的稀疏键值对。实验结果表明，ShadowKV在多个基准测试和模型上能够支持高达6倍的批处理大小，并在A100 GPU上将吞吐量提升至3.04倍，同时不牺牲生成质量，甚至在假设无限GPU内存的情况下，性能超过了无限批处理大小的假设性能。

链接: https://arxiv.org/abs/2410.21465
作者: Hanshi Sun,Li-Wen Chang,Wenlei Bao,Size Zheng,Ningxin Zheng,Xin Liu,Harry Dong,Yuejie Chi,Beidi Chen
关键词-EN: long-context large language, large language models, long-context LLM inference, high-throughput long-context LLM, serving long-context LLMs
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory consumption or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6 \times larger batch sizes and boost throughput by up to 3.04 \times on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory. The code is available at this https URL.
摘要：随着长上下文大语言模型（LLM）的广泛部署，对高效支持高吞吐量推理的需求日益增长。然而，随着序列长度的增加，键值（KV）缓存也随之扩展，导致内存占用增加，并且在每次Token生成时都需要访问缓存，这使得在服务长上下文LLM时吞吐量降低。尽管已有多种动态稀疏注意力方法被提出以加速推理并保持生成质量，但它们要么未能充分减少GPU内存消耗，要么通过将KV缓存卸载到CPU而引入了显著的解码延迟。我们提出了ShadowKV，一种高吞吐量的长上下文LLM推理系统，该系统存储低秩键缓存并将值缓存卸载以减少更大批次和更长序列的内存占用。为了最小化解码延迟，ShadowKV采用了一种精确的KV选择策略，即在运行时动态重建最小的稀疏KV对。通过在广泛的基准测试中评估ShadowKV，包括RULER、LongBench和Needle In A Haystack，以及Llama-3.1-8B、Llama-3-8B-1M、GLM-4-9B-1M、Yi-9B-200K、Phi-3-Mini-128K和Qwen2-7B-128K等模型，我们证明了它能够在A100 GPU上支持高达6倍的更大批次，并将吞吐量提升至3.04倍，而不会牺牲准确性，甚至在假设无限GPU内存的情况下，性能超越了无限批次大小的表现。代码可在以下链接获取：https URL。

[NLP-72] UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function

【速读】：该论文试图解决大语言模型（LLM）在经过监督微调（SFT）和对齐（alignment）后出现的灾难性遗忘问题。解决方案的关键在于引入统一微调（Unified Fine-Tuning, UFT），通过将SFT和对齐整合到一个训练阶段，并使用相同的优化目标和损失函数，通过隐式奖励函数实现。这种方法不仅在指令微调数据上优于单独的SFT，而且在结合指令微调数据和对齐数据时，有效防止了灾难性遗忘，显著提升了指令跟随任务和事实性问答任务的表现。

链接: https://arxiv.org/abs/2410.21438
作者: Zhichao Wang,Bin Bi,Zixu Zhu,Xiangbo Mao,Jun Wang,Shiyu Wang
关键词-EN: trillions of tokens, text generation, SFT and alignment, pretraining on trillions, gains the capability
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:By pretraining on trillions of tokens, an LLM gains the capability of text generation. However, to enhance its utility and reduce potential harm, SFT and alignment are applied sequentially to the pretrained model. Due to the differing nature and objective functions of SFT and alignment, catastrophic forgetting has become a significant issue. To address this, we introduce Unified Fine-Tuning (UFT), which integrates SFT and alignment into a single training stage using the same objective and loss functions through an implicit reward function. Our experimental results demonstrate that UFT outperforms SFT on instruction-tuning data alone. Moreover, when combining instruction-tuning data with alignment data, UFT effectively prevents catastrophic forgetting across these two stages and shows a clear advantage over sequentially applying SFT and alignment. This is evident in the significant improvements observed in the \textbfifeval task for instruction-following and the \textbftruthful-qa task for factuality. The proposed general fine-tuning framework UFT establishes an effective and efficient pretraining-UFT paradigm for LLM training.
摘要：通过在数万亿个 Token 上进行预训练，大语言模型 (LLM) 获得了文本生成的能力。然而，为了增强其效用并减少潜在的危害，预训练模型会依次进行监督微调 (SFT) 和对齐 (alignment)。由于 SFT 和对齐的性质和目标函数不同，灾难性遗忘 (catastrophic forgetting) 成为一个显著问题。为解决这一问题，我们提出了统一微调 (Unified Fine-Tuning, UFT)，它通过隐式奖励函数将 SFT 和对齐整合到一个训练阶段，使用相同的目标和损失函数。我们的实验结果表明，仅在指令微调数据上，UFT 优于 SFT。此外，当结合指令微调数据和对齐数据时，UFT 有效地防止了在这两个阶段之间的灾难性遗忘，并显示出明显优于依次应用 SFT 和对齐的优势。这在指令跟随任务 (\textbfifeval) 和事实性问答任务 (\textbftruthful-qa) 中观察到的显著改进中得到了体现。所提出的通用微调框架 UFT 为大语言模型训练建立了一个有效且高效的预训练-UFT 范式。

[NLP-73] Large Language Models for Manufacturing

【速读】：该论文试图解决如何将大型语言模型（Large Language Models, LLMs）应用于制造业，以优化流程、提高效率和推动创新的问题。解决方案的关键在于全面探索LLMs在制造业中的应用潜力，包括产品设计与开发、质量控制、供应链优化和人才管理等多个方面。通过评估LLMs在复杂指令执行、数据洞察提取和知识共享等方面的能力，论文展示了如GPT-4V等先进LLMs在制造业中的显著应用效果。此外，论文还探讨了LLMs在重塑制造业教育、自动化编码流程、增强机器人控制系统以及通过工业元宇宙创建沉浸式数据丰富环境等方面的变革潜力。

链接: https://arxiv.org/abs/2410.21418
作者: Yiwei Li,Huaqin Zhao,Hanqi Jiang,Yi Pan,Zhengliang Liu,Zihao Wu,Peng Shu,Jie Tian,Tianze Yang,Shaochen Xu,Yanjun Lyu,Parker Blenk,Jacob Pence,Jason Rupram,Eliza Banu,Ninghao Liu,Linbing Wang,Wenzhan Song,Xiaoming Zhai,Kenan Song,Dajiang Zhu,Beiwen Li,Xianqiao Wang,Tianming Liu
关键词-EN: Large Language Models, Language Models, Large Language, transform manufacturing industry, advances in Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advances in Large Language Models (LLMs) have the potential to transform manufacturing industry, offering new opportunities to optimize processes, improve efficiency, and drive innovation. This paper provides a comprehensive exploration of the integration of LLMs into the manufacturing domain, focusing on their potential to automate and enhance various aspects of manufacturing, from product design and development to quality control, supply chain optimization, and talent management. Through extensive evaluations across multiple manufacturing tasks, we demonstrate the remarkable capabilities of state-of-the-art LLMs, such as GPT-4V, in understanding and executing complex instructions, extracting valuable insights from vast amounts of data, and facilitating knowledge sharing. We also delve into the transformative potential of LLMs in reshaping manufacturing education, automating coding processes, enhancing robot control systems, and enabling the creation of immersive, data-rich virtual environments through the industrial metaverse. By highlighting the practical applications and emerging use cases of LLMs in manufacturing, this paper aims to provide a valuable resource for professionals, researchers, and decision-makers seeking to harness the power of these technologies to address real-world challenges, drive operational excellence, and unlock sustainable growth in an increasingly competitive landscape.
摘要：大语言模型 (LLM) 的快速发展具有改变制造业的潜力，为优化流程、提高效率和推动创新提供了新的机遇。本文全面探讨了 LLM 在制造业领域的整合，重点介绍了其在自动化和增强制造业各个方面（从产品设计与开发到质量控制、供应链优化和人才管理）的潜力。通过在多个制造任务中的广泛评估，我们展示了如 GPT-4V 等最先进 LLM 在理解和执行复杂指令、从大量数据中提取有价值见解以及促进知识共享方面的显著能力。我们还深入探讨了 LLM 在重塑制造业教育、自动化编码过程、增强机器人控制系统以及通过工业元宇宙创建沉浸式、数据丰富的虚拟环境方面的变革潜力。通过突出 LLM 在制造业中的实际应用和新兴用例，本文旨在为寻求利用这些技术解决现实世界挑战、推动运营卓越并在日益竞争的环境中实现可持续增长的专业人士、研究人员和决策者提供宝贵的资源。

[NLP-74] CT2C-QA: Multimodal Question Answering over Chinese Text Table and Chart

【速读】：该论文试图解决多模态问答（Multimodal Question Answering, MMQA）中现有研究主要集中在两种模态（如图像-文本、表格-文本、图表-文本）的问题，而缺乏对文本、表格和图表三种模态联合分析的研究。解决方案的关键在于提出了C \textT^2 C-QA数据集和AED多代理系统。C \textT^2 C-QA数据集是一个包含文本、表格和图表的推理型问答数据集，模拟真实网页环境，挑战模型在多模态数据中的分析和推理能力。AED系统通过分配代理（Assignment Agent）选择并激活擅长文本、表格和图表的专家代理，决策代理（Decision Agent）则基于专家代理的分析结果做出最终判断，实现多代理协作、信息交互和集体决策。实验结果表明，现有方法包括GPT-4在内，尚未达到该数据集设定的基准。

链接: https://arxiv.org/abs/2410.21414
作者: Bowen Zhao,Tianhao Cheng,Yuejie Zhang,Ying Cheng,Rui Feng,Xiaobo Zhang
关键词-EN: Multimodal Question Answering, diverse data representations, Question Answering, enables comprehensive understanding, understanding and accurate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing researches in MMQA only focus on two modalities such as image-text QA, table-text QA and chart-text QA, and there remains a notable scarcity in studies that investigate the joint analysis of text, tables, and charts. In this paper, we present C \textT^2 C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a great test for the capability of the model to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (\textbfAllocating, \textbfExpert and \textbfDesicion), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.
摘要：多模态问答 (Multimodal Question Answering, MMQA) 至关重要，因为它通过整合来自不同数据表示（如表格、图表和文本）的见解，实现了全面理解和准确回答。现有的大多数 MMQA 研究仅关注两种模态，如图像-文本问答、表格-文本问答和图表-文本问答，而对文本、表格和图表的联合分析研究则明显不足。本文中，我们提出了 C \textT^2 C-QA，这是一个开创性的基于中文推理的问答数据集，包含了从 200 个精选网页中精心编制的广泛文本、表格和图表集合。我们的数据集模拟了真实网页，并作为测试模型分析和推理多模态数据能力的良好工具，因为问题的答案可能出现在各种模态中，甚至可能根本不存在。此外，我们提出了 AED（分配 (Allocating)、专家 (Expert) 和决策 (Decision)），这是一个通过不同智能体之间的协作部署、信息交互和集体决策实现的多智能体系统。具体来说，分配智能体负责选择和激活专家智能体，包括擅长文本、表格和图表的智能体。决策智能体则负责根据这些专家智能体提供的分析见解做出最终判断。我们进行了全面的分析，将 AED 与多种最先进的 MMQA 模型（包括 GPT-4）进行了比较。实验结果表明，包括 GPT-4 在内的当前方法尚未达到我们数据集设定的基准。

[NLP-75] A Survey on Automatic Credibility Assessment of Textual Credibility Signals in the Era of Large Language Models

【速读】：该论文试图解决在社交媒体和生成式 AI (Generative AI) 时代下，自动评估在线社交媒体内容可信度的问题。解决方案的关键在于整合可信度信号 (credibility signals)，这些信号是指内容真实性、偏见或说服技巧等小单位信息，将其转化为整体可信度评分。与当前主要依赖各种（多为潜在的）特征的假新闻检测相比，可信度信号提供了更细粒度、更易解释且更广泛可用的信息。论文通过系统综述175篇研究论文，重点分析了文本可信度信号和自然语言处理 (NLP) 领域，特别是大型语言模型 (LLMs) 带来的显著进展。论文还探讨了可信度评估的方法以及9类可信度信号中的3类（事实性、主观性和偏见，说服技巧和逻辑谬误，声明和真实性），并指出了未来的挑战和机遇，特别关注生成式 AI 的快速发展。

链接: https://arxiv.org/abs/2410.21360
作者: Ivan Srba,Olesya Razuvayevskaya,João A. Leite,Robert Moro,Ipek Baris Schlicht,Sara Tonelli,Francisco Moreno García,Santiago Barrio Lottmann,Denis Teyssou,Valentin Porcellini,Carolina Scarton,Kalina Bontcheva,Maria Bielikova
关键词-EN: online social media, social media content, social media, credibility signals, credibility
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the current era of social media and generative AI, an ability to automatically assess the credibility of online social media content is of tremendous importance. Credibility assessment is fundamentally based on aggregating credibility signals, which refer to small units of information, such as content factuality, bias, or a presence of persuasion techniques, into an overall credibility score. Credibility signals provide a more granular, more easily explainable and widely utilizable information in contrast to currently predominant fake news detection, which utilizes various (mostly latent) features. A growing body of research on automatic credibility assessment and detection of credibility signals can be characterized as highly fragmented and lacking mutual interconnections. This issue is even more prominent due to a lack of an up-to-date overview of research works on automatic credibility assessment. In this survey, we provide such systematic and comprehensive literature review of 175 research papers while focusing on textual credibility signals and Natural Language Processing (NLP), which undergoes a significant advancement due to Large Language Models (LLMs). While positioning the NLP research into the context of other multidisciplinary research works, we tackle with approaches for credibility assessment as well as with 9 categories of credibility signals (we provide a thorough analysis for 3 of them, namely: 1) factuality, subjectivity and bias, 2) persuasion techniques and logical fallacies, and 3) claims and veracity). Following the description of the existing methods, datasets and tools, we identify future challenges and opportunities, while paying a specific attention to recent rapid development of generative AI.
摘要：在当前社交媒体和生成式 AI 的时代，自动评估在线社交媒体内容的可信度具有极其重要的意义。可信度评估从根本上依赖于聚合可信度信号，这些信号指的是诸如内容真实性、偏见或说服技巧存在等小单位信息，并将其整合为一个总体可信度评分。与目前主要采用的各种（大多是潜在的）特征的虚假新闻检测相比，可信度信号提供了更细粒度、更易于解释且更广泛可用的信息。关于自动可信度评估和可信度信号检测的研究日益增多，但这些研究呈现出高度碎片化且缺乏相互联系的特点。由于缺乏对自动可信度评估研究工作的最新综述，这一问题尤为突出。在本综述中，我们系统且全面地回顾了 175 篇研究论文，重点关注文本可信度信号和自然语言处理 (NLP)，后者由于大语言模型 (LLMs) 的发展而取得了显著进步。在将 NLP 研究置于其他多学科研究背景中的同时，我们探讨了可信度评估的方法以及 9 类可信度信号（我们对其中的 3 类进行了深入分析，即：1) 真实性、主观性和偏见，2) 说服技巧和逻辑谬误，以及 3) 声明和真实性）。在描述了现有方法、数据集和工具之后，我们指出了未来的挑战和机遇，特别关注生成式 AI 的快速发展。

[NLP-76] Can Machines Think Like Humans? A Behavioral Evaluation of LLM -Agents in Dictator Games

【速读】：该论文试图解决的问题是如何理解和评估基于大型语言模型（LLM）的代理在实际任务中的亲社会行为，并探讨这些行为与人类行为的差异。解决方案的关键在于通过不同的角色（personas）和实验框架来诱导和基准化LLM代理的亲社会行为，特别是在独裁者游戏中，比较同一LLM家族内、不同家族间以及与人类行为之间的差异。研究发现，仅仅赋予LLM代理人类身份并不能产生类似人类的行为，且LLM代理无法准确预测人类决策，其行为与人类行为的匹配度高度依赖于特定的模型架构和提示语设计，且这种依赖性并无清晰模式。

链接: https://arxiv.org/abs/2410.21359
作者: Ji Ma
关键词-EN: Large Language Model, Large Language, increasingly undertake real-world, undertake real-world tasks, based agents increasingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:As Large Language Model (LLM)-based agents increasingly undertake real-world tasks and engage with human society, how well do we understand their behaviors? This study (1) investigates how LLM agents’ prosocial behaviors – a fundamental social norm – can be induced by different personas and benchmarked against human behaviors; and (2) introduces a behavioral approach to evaluate the performance of LLM agents in complex decision-making scenarios. We explored how different personas and experimental framings affect these AI agents’ altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. Our findings reveal substantial variations and inconsistencies among LLMs and notable differences compared to human behaviors. Merely assigning a human-like identity to LLMs does not produce human-like behaviors. Despite being trained on extensive human-generated data, these AI agents cannot accurately predict human decisions. LLM agents are not able to capture the internal processes of human decision-making, and their alignment with human behavior is highly variable and dependent on specific model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern.
摘要：随着基于大语言模型 (LLM) 的智能体越来越多地承担现实世界的任务并参与人类社会，我们对其行为的理解程度如何？本研究 (1) 探讨了如何通过不同的角色设定来引导 LLM 智能体的亲社会行为——一种基本的社会规范，并将其行为与人类行为进行基准测试；(2) 引入了一种行为方法来评估 LLM 智能体在复杂决策场景中的表现。我们研究了不同的角色设定和实验框架如何影响这些 AI 智能体在独裁者游戏中的利他行为，并比较了同一 LLM 家族内、不同家族间以及与人类行为之间的差异。我们的研究结果揭示了 LLM 之间存在显著的差异和不一致性，并且与人类行为相比存在明显的不同。仅仅赋予 LLM 类似人类的身份并不能产生类似人类的行为。尽管这些 AI 智能体接受了大量人类生成数据的训练，但它们无法准确预测人类的决策。LLM 智能体无法捕捉人类决策的内在过程，它们与人类行为的匹配度高度依赖于特定的模型架构和提示词设计；更糟糕的是，这种依赖性并没有遵循清晰的规律。

[NLP-77] Energy-Based Diffusion Language Models for Text Generation

【速读】：该论文试图解决离散扩散模型在减少采样步骤时性能下降的问题。解决方案的关键在于提出了一种基于能量的扩散语言模型（Energy-based Diffusion Language Model, EDLM），该模型在每个扩散步骤中以全序列级别进行操作，从而改善了扩散模型底层近似的不完美性。具体来说，论文引入了一种残差形式的能量模型（EBM），并通过利用预训练的自回归模型或通过噪声对比估计微调双向Transformer来获得其参数。此外，论文还提出了一种通过并行重要采样的有效生成算法。实验结果表明，该模型在语言建模基准测试中显著优于现有的扩散模型，并且接近自回归模型的困惑度，同时在不降低生成性能的情况下，实现了比现有扩散模型快1.3倍的采样速度。

链接: https://arxiv.org/abs/2410.21357
作者: Minkai Xu,Tomas Geffner,Karsten Kreis,Weili Nie,Yilun Xu,Jure Leskovec,Stefano Ermon,Arash Vahdat
关键词-EN: alternative generative paradigms, diffusion models, actively explored, remarkable progress, generative paradigms
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with the capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform the autoregressive counterparts, with the performance gap increasing when reducing the number of sampling steps. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel important sampling. Comprehensive experiments on language modeling benchmarks show that our model can consistently outperform state-of-the-art diffusion models by a significant margin, and approaches autoregressive models’ perplexity. We further show that, without any generation performance drop, our framework offers a 1.3 \times sampling speedup over existing diffusion models.
摘要：尽管自回归语言模型取得了显著进展，但超越从左到右生成的替代生成范式仍在积极探索中。离散扩散模型凭借其并行生成的能力，最近作为一种有前景的替代方案崭露头角。然而，这些模型在性能上仍落后于自回归模型，尤其是在减少采样步骤数时，性能差距进一步扩大。我们的分析表明，这种性能下降是由于扩散模型所采用的不完美近似导致的。在本研究中，我们提出了基于能量的扩散语言模型（Energy-based Diffusion Language Model, EDLM），这是一种在每个扩散步骤中对整个序列进行操作的基于能量的模型，旨在改进扩散模型所使用的底层近似。更具体地说，我们引入了一种以残差形式存在的能量模型（EBM），并展示了其参数可以通过利用预训练的自回归模型或通过噪声对比估计对双向Transformer进行微调来获得。我们还提出了一种通过并行重要性采样的有效生成算法。在语言建模基准上的综合实验表明，我们的模型能够持续显著优于最先进的扩散模型，并接近自回归模型的困惑度（perplexity）。此外，我们进一步展示了，在不降低生成性能的情况下，我们的框架相比现有扩散模型提供了1.3倍的采样加速。

[NLP-78] Causal Interventions on Causal Paths: Mapping GPT-2s Reasoning From Syntax to Semantics

【速读】：该论文试图解决自然语言推理（natural language reasoning）在基于Transformer的大型语言模型（LLMs）中的解释性问题。解决方案的关键在于通过分析因果关系明确的句子（如“我打开伞因为开始下雨了”）来研究模型内部的因果推理机制。研究发现，因果语法主要集中在模型的前2-3层，而后期层的某些头部（heads）对因果句子的非逻辑变体表现出高度敏感性。这表明模型可能通过检测句法线索和在最终层中隔离专注于语义关系的特定头部来进行推理。

链接: https://arxiv.org/abs/2410.21353
作者: Isabelle Lee,Joshua Lum,Ziyi Liu,Dani Yogatama
关键词-EN: defies easy categorization, internal algorithms utilized, contextuality and ambiguity, defies easy, easy categorization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:While interpretability research has shed light on some internal algorithms utilized by transformer-based LLMs, reasoning in natural language, with its deep contextuality and ambiguity, defies easy categorization. As a result, formulating clear and motivating questions for circuit analysis that rely on well-defined in-domain and out-of-domain examples required for causal interventions is challenging. Although significant work has investigated circuits for specific tasks, such as indirect object identification (IOI), deciphering natural language reasoning through circuits remains difficult due to its inherent complexity. In this work, we take initial steps to characterize causal reasoning in LLMs by analyzing clear-cut cause-and-effect sentences like “I opened an umbrella because it started raining,” where causal interventions may be possible through carefully crafted scenarios using GPT-2 small. Our findings indicate that causal syntax is localized within the first 2-3 layers, while certain heads in later layers exhibit heightened sensitivity to nonsensical variations of causal sentences. This suggests that models may infer reasoning by (1) detecting syntactic cues and (2) isolating distinct heads in the final layers that focus on semantic relationships.
摘要：尽管可解释性研究已经揭示了一些基于 Transformer 的大语言模型（LLM）内部算法的工作机制，但自然语言推理因其深层次的上下文依赖性和模糊性，难以进行简单的分类。因此，为依赖于定义明确的领域内和领域外示例的因果干预电路分析制定清晰且有动机的问题是具有挑战性的。尽管已有大量工作研究了特定任务（如间接对象识别 (IOI)）的电路，但由于自然语言推理的固有复杂性，通过电路解读自然语言推理仍然困难。在本研究中，我们通过分析类似“我打开了一把伞，因为开始下雨了”这样的因果关系明确的句子，初步探讨了 LLM 中的因果推理。我们通过精心设计的 GPT-2 小型模型场景，尝试进行因果干预。研究结果表明，因果句法主要集中在模型的前 2-3 层，而某些后层中的头部对因果句子的无意义变体表现出更高的敏感性。这表明模型可能通过以下两种方式进行推理：(1) 检测句法线索；(2) 在最终层中隔离专注于语义关系的特定头部。

[NLP-79] LLM CBench: Benchmarking Large Language Model Compression for Efficient Deployment

【速读】：该论文试图解决大型语言模型（LLMs）在实际应用中因计算和存储需求高而受限的问题。解决方案的关键在于提出了一个名为“大型语言模型压缩基准测试（LLMCBench）”的全面评估框架。该基准测试通过分析实际模型生产需求，设计了详细的评估赛道和指标，并进行了广泛的实验比较，涵盖多种主流的LLM压缩方法。最终，通过对评估结果的深入分析，提供了对LLM压缩算法设计的宝贵见解，旨在为未来的研究奠定基础。

链接: https://arxiv.org/abs/2410.21352
作者: Ge Yang,Changyi He,Jinyang Guo,Jianyu Wu,Yifu Ding,Aishan Liu,Haotong Qin,Pengliang Ji,Xianglong Liu
关键词-EN: strong intelligence ability, LLM compression, intelligence ability, practical application, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) have demonstrated their strong intelligence ability, the high demand for computation and storage hinders their practical application. To this end, many model compression techniques are proposed to increase the efficiency of LLMs. However, current researches only validate their methods on limited models, datasets, metrics, etc, and still lack a comprehensive evaluation under more general scenarios. So it is still a question of which model compression approach we should use under a specific case. To mitigate this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms. We first analyze the actual model production requirements and carefully design evaluation tracks and metrics. Then, we conduct extensive experiments and comparison using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insight for LLM compression design. We hope our LLMCBench can contribute insightful suggestions for LLM compression algorithm design and serve as a foundation for future research. Our code is available at this https URL.
摘要：尽管大语言模型 (Large Language Model, LLM) 展示了其强大的智能能力，但其对计算和存储的高需求限制了其在实际应用中的普及。为此，许多模型压缩技术被提出以提高 LLM 的效率。然而，当前的研究仅在有限的模型、数据集和指标上验证了其方法，并且在更普遍的场景下缺乏全面的评估。因此，在特定情况下应采用哪种模型压缩方法仍是一个问题。为了缩小这一差距，我们提出了大语言模型压缩基准 (Large Language Model Compression Benchmark, LLMCBench)，这是一个经过精心设计的基准，旨在深入分析 LLM 压缩算法。我们首先分析了实际模型生产需求，并仔细设计了评估赛道和指标。接着，我们使用多种主流 LLM 压缩方法进行了广泛的实验和比较。最后，我们基于评估结果进行了深入分析，并为 LLM 压缩设计提供了有用的见解。我们希望 LLMCBench 能够为 LLM 压缩算法设计提供有见地的建议，并为未来的研究奠定基础。我们的代码可在以下链接获取：https URL。

[NLP-80] Large Language Model Benchmarks in Medical Tasks

【速读】：该论文试图解决的问题是如何评估大型语言模型（LLMs）在医疗领域的性能，特别是通过使用基准数据集。解决方案的关键在于系统性地调查和分类各种用于医疗LLM任务的基准数据集，这些数据集涵盖了文本、图像和多模态数据，并关注电子健康记录（EHRs）、医生-患者对话、医学问答和医学图像描述等多个方面的医疗知识。论文通过讨论这些数据集的重要性、数据结构及其对临床任务（如诊断、报告生成和预测决策支持）中LLM发展的影响，提供了对现有基准的全面概述。关键的基准数据集包括MIMIC-III、MIMIC-IV、BioASQ、PubMedQA和CheXpert，它们在医学报告生成、临床总结和合成数据生成等任务中推动了技术进步。此外，论文强调了利用这些基准数据集推动多模态医疗智能发展的挑战和机遇，特别是需要更多语言多样性、结构化组学数据和创新的数据合成方法。

链接: https://arxiv.org/abs/2410.21348
作者: Lawrence K.Q. Yan,Ming Li,Yichao Zhang,Caitlyn Heqi Yin,Cheng Fei,Benji Peng,Ziqian Bi,Pohsun Feng,Keyu Chen,Junyu Liu,Qian Niu
关键词-EN: large language models, evaluating these models’, models’ performance, medical, datasets
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 5 tables

点击查看摘要

Abstract:With the increasing application of large language models (LLMs) in the medical domain, evaluating these models’ performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks for advancing multimodal medical intelligence, emphasizing the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research in the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.
摘要：随着大语言模型（LLMs）在医疗领域的应用日益增多，使用基准数据集评估这些模型的性能变得至关重要。本文对用于医疗LLM任务的各种基准数据集进行了全面调查。这些数据集涵盖了多种模态，包括文本、图像和多模态基准，重点关注医疗知识的各个方面，如电子健康记录（EHRs）、医生-患者对话、医疗问答和医疗图像描述。调查根据模态对数据集进行了分类，讨论了它们的重要性、数据结构及其对临床任务中LLM发展的影响，如诊断、报告生成和预测决策支持。关键基准包括MIMIC-III、MIMIC-IV、BioASQ、PubMedQA和CheXpert，这些基准促进了医疗报告生成、临床摘要生成和合成数据生成等任务的进展。本文总结了利用这些基准推动多模态医疗智能发展的挑战和机遇，强调了需要更多语言多样性、结构化组学数据和创新合成方法的数据集。这项工作还为未来在医学中应用LLMs的研究奠定了基础，有助于推动医疗人工智能领域的不断发展。

[NLP-81] FinTeamExperts: Role Specialized MOEs For Financial Analysis

【速读】：该论文试图解决大型语言模型（LLMs）在金融领域应用中的局限性问题，特别是在处理复杂金融任务时，单一模型难以全面理解和应对多维度的金融分析需求。解决方案的关键在于提出了FinTeamExperts框架，这是一个基于混合专家模型（Mixture of Experts, MOEs）的角色专业化LLM框架。该框架通过训练三个80亿参数的模型，分别专注于宏观分析师、微观分析师和量化分析师的角色，从而在不同金融领域实现专业化。这种角色特定的专业化增强了模型整合领域特定知识的能力，并通过在下游任务上的指令微调（instruct-tuning）来确保模型与实际金融任务的紧密对齐。实验结果表明，FinTeamExperts在多个数据集上优于同规模及更大规模的模型，特别是在处理复杂任务时表现尤为突出，这验证了角色专业化方法和持续训练策略的成功。

链接: https://arxiv.org/abs/2410.21338
作者: Yue Yu,Prayag Tiwari
关键词-EN: Large Language Models, Large Language, Language Models, leading a significant, significant leap
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), such as ChatGPT, Phi3 and Llama-3, are leading a significant leap in AI, as they can generalize knowledge from their training to new tasks without fine-tuning. However, their application in the financial domain remains relatively limited. The financial field is inherently complex, requiring a deep understanding across various perspectives, from macro, micro economic trend to quantitative analysis. Motivated by this complexity, a mixture of expert LLMs tailored to specific financial domains could offer a more comprehensive understanding for intricate financial tasks. In this paper, we present the FinTeamExperts, a role-specialized LLM framework structured as a Mixture of Experts (MOEs) for financial analysis. The framework simulates a collaborative team setting by training each model to specialize in distinct roles: Macro Analysts, Micro analysts, and Quantitative Analysts. This role-specific specialization enhances the model’s ability to integrate their domain-specific expertise. We achieve this by training three 8-billion parameter models on different corpus, each dedicated to excelling in specific finance-related roles. We then instruct-tune FinTeamExperts on downstream tasks to align with practical financial tasks. The experimental results show that FinTeamExperts outperform all models of the same size and larger on three out of four datasets. On the fourth dataset, which presents a more complex task, FinTeamExperts still surpass all models of the same size. This highlights the success of our role-based specialization approach and the continued training approach for FinTeamExperts.
摘要：大语言模型 (LLMs)，如 ChatGPT、Phi3 和 Llama-3，正在引领 AI 领域的重大飞跃，因为它们能够在无需微调的情况下将训练中的知识推广到新任务中。然而，这些模型在金融领域的应用仍然相对有限。金融领域本身非常复杂，需要从宏观、微观经济趋势到量化分析等多个角度进行深入理解。受此复杂性的启发，针对特定金融领域的专家混合型 LLMs 可以为复杂的金融任务提供更全面的理解。本文介绍了 FinTeamExperts，这是一个专为金融分析设计的角色专业化 LLM 框架，结构为专家混合 (Mixture of Experts, MOEs)。该框架通过训练每个模型专注于不同的角色，模拟了一个协作团队的环境：宏观分析师、微观分析师和量化分析师。这种角色特定的专业化增强了模型整合其领域特定专业知识的能力。我们通过在不同的语料库上训练三个 80 亿参数的模型来实现这一点，每个模型都致力于在特定金融相关角色中表现出色。然后，我们对 FinTeamExperts 进行指令微调，以使其与实际金融任务相匹配。实验结果显示，FinTeamExperts 在四个数据集中的三个上优于所有相同尺寸及更大的模型。在第四个数据集上，尽管任务更为复杂，FinTeamExperts 仍然超越了所有相同尺寸的模型。这突显了我们基于角色的专业化方法和 FinTeamExperts 持续训练方法的成功。

[NLP-82] Fine-tuned Large Language Models (LLM s): Improved Prompt Injection Attacks Detection

【速读】：该论文试图解决大型语言模型（LLMs）在面对提示注入攻击（prompt injection attacks）时的安全漏洞问题。解决方案的关键在于采用两种方法来检测提示是否存在漏洞：1) 使用预训练的LLM进行零样本分类（zero-shot classification）；2) 对预训练的LLM进行监督式微调（supervised fine-tuning），并使用特定任务的标注数据集进行训练。通过对比分析，论文发现经过微调的模型在检测提示注入攻击方面表现出色，准确率达到99.13%，精确度达到100%，召回率达到98.33%，F1分数达到99.15%，显示出高效的安全检测能力。

链接: https://arxiv.org/abs/2410.21337
作者: Md Abdur Rahman,Fan Wu,Alfredo Cuzzocrea,Sheikh Iqbal Ahamed
关键词-EN: Large language models, Large language, language-based tasks, prompt injection attacks, popular tool
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks. However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem. These attacks target LLMs applications through using carefully designed input prompts to divert the model from adhering to original instruction, thereby it could execute unintended actions. These manipulations pose serious security threats which potentially results in data leaks, biased outputs, or harmful responses. This project explores the security vulnerabilities in relation to prompt injection attacks. To detect whether a prompt is vulnerable or not, we follows two approaches: 1) a pre-trained LLM, and 2) a fine-tuned LLM. Then, we conduct a thorough analysis and comparison of the classification performance. Firstly, we use pre-trained XLM-RoBERTa model to detect prompt injections using test dataset without any fine-tuning and evaluate it by zero-shot classification. Then, this proposed work will apply supervised fine-tuning to this pre-trained LLM using a task-specific labeled dataset from deepset in huggingface, and this fine-tuned model achieves impressive results with 99.13% accuracy, 100% precision, 98.33% recall and 99.15% F1-score thorough rigorous experimentation and evaluation. We observe that our approach is highly efficient in detecting prompt injection attacks.
摘要：大语言模型（LLMs）因其处理广泛语言任务能力的显著提升而成为热门工具。然而，LLMs 应用极易受到提示注入攻击，这是一个关键问题。这些攻击通过精心设计的输入提示，使模型偏离原始指令，从而执行非预期的操作。这种操控带来了严重的安全威胁，可能导致数据泄露、输出偏差或有害响应。本项目探讨了与提示注入攻击相关的安全漏洞。为了检测提示是否存在漏洞，我们采用了两种方法：1) 预训练的大语言模型，2) 微调后的大语言模型。随后，我们对分类性能进行了全面分析和比较。首先，我们使用未经微调的 XLM-RoBERTa 模型，通过测试数据集进行零样本分类来检测提示注入，并对其进行评估。接着，本研究提出使用来自 huggingface 的 deepset 提供的任务特定标注数据集对预训练的大语言模型进行监督微调，经过严格实验和评估，该微调模型取得了令人瞩目的成果，准确率达到 99.13%，精确度达到 100%，召回率达到 98.33%，F1 分数达到 99.15%。我们观察到，我们的方法在检测提示注入攻击方面非常高效。

[NLP-83] Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

【速读】：该论文试图解决的问题是确定在何种任务设置下，链式思维（Chain-of-thought, CoT）提示会系统性地降低模型性能。解决方案的关键在于借鉴认知心理学，研究人类在哪些情况下口头思考或深思熟虑会损害表现，并探讨这些人类表现的约束是否适用于语言模型。论文通过识别三种特定情况（隐性统计学习、视觉识别和包含例外的模式分类），在这些情况下，CoT提示显著降低了多种最先进模型的性能（如OpenAI o1-preview与GPT-4o相比，准确率下降高达36.3%）。此外，论文还识别了三个满足条件(i)但不满足条件(ii)的任务，发现尽管口头思考降低了人类在这些任务中的表现，但CoT提示仍能保持或提升模型性能。总体而言，研究结果表明，虽然模型和人类的认知过程不完全平行，但考虑人类思考负面后果的情况有助于识别模型性能受影响的设置。通过将人类深思熟虑的文献与CoT评估相结合，论文提供了一种新的工具，用于理解提示选择和推理时间推理的影响。

链接: https://arxiv.org/abs/2410.21333
作者: Ryan Liu,Jiayi Geng,Addison J. Wu,Ilia Sucholutsky,Tania Lombrozo,Thomas L. Griffiths
关键词-EN: performance, widely used strategy, strategy for working, working with large, CoT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that while there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it negatively impacts models. By connecting the literature on human deliberation with evaluations of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.
摘要：思维链 (Chain-of-thought, CoT) 提示已成为处理大语言模型和多模态模型的广泛使用策略。尽管 CoT 已被证明能提升许多任务的表现，但其有效设置的确定仍是一个持续的努力。特别是，CoT 在何种设置下系统性地降低模型性能仍是一个开放问题。本文旨在通过借鉴认知心理学，识别 CoT 降低性能的任务特征，特别是考察以下两种情况：(i) 言语思维或深思熟虑在人类中损害表现，以及 (ii) 人类表现的约束条件能推广到语言模型。我们发现了三种此类情况：隐性统计学习、视觉识别以及包含例外的模式分类。在所有三种设置的广泛实验中，我们发现一系列最先进的模型在使用推理时间推理时，性能显著下降（例如，OpenAI o1-preview 相比 GPT-4o 的绝对准确率下降高达 36.3%），相较于零样本对应模型。我们还识别了三种满足条件 (i) 但不满足 (ii) 的任务，发现尽管言语思维在这些任务中降低了人类表现，但 CoT 仍保持或提升了模型性能。总体而言，我们的研究结果表明，尽管模型和人类的认知过程并非完全平行，但考虑思维对人类表现产生负面影响的案例，有助于我们识别其对模型产生负面影响的设置。通过将人类深思熟虑的文献与 CoT 评估相结合，我们提供了一种新的工具，可用于理解提示选择和推理时间推理的影响。

[NLP-84] Building Reusing and Generalizing Abstract Representations from Concrete Sequences

【速读】：该论文试图解决序列学习模型在抽象能力上的不足，导致内存效率低下和迁移能力差的问题。解决方案的关键在于引入了一种非参数的分层变量学习模型 (Hierarchical Variable Learning Model, HVM)，该模型能够从序列中学习“块”，并将上下文相似的块抽象为变量。HVM通过有效地组织内存并揭示抽象概念，实现了紧凑的序列表示。在语言数据集上的实验表明，HVM比传统的压缩算法（如Lempel-Ziv）更能高效地学习词典。此外，在序列回忆任务中，HVM的序列似然性与人类回忆时间相关联，显示出其在抽象变量迁移方面的优势，这一点是大型语言模型 (Large Language Models, LLMs) 所难以比拟的。通过HVM的可调节抽象层，研究展示了模型在压缩和泛化之间的精确权衡，从而提供了一种能够捕捉人类认知中抽象表示学习和迁移的认知模型。

链接: https://arxiv.org/abs/2410.21332
作者: Shuchen Wu,Mirko Thalmann,Peter Dayan,Zeynep Akata,Eric Schulz
关键词-EN: filtering out irrelevant, irrelevant details, transferring these generalized, generalized concepts, HVM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humans excel at learning abstract patterns across different sequences, filtering out irrelevant details, and transferring these generalized concepts to new sequences. In contrast, many sequence learning models lack the ability to abstract, which leads to memory inefficiency and poor transfer. We introduce a non-parametric hierarchical variable learning model (HVM) that learns chunks from sequences and abstracts contextually similar chunks as variables. HVM efficiently organizes memory while uncovering abstractions, leading to compact sequence representations. When learning on language datasets such as babyLM, HVM learns a more efficient dictionary than standard compression algorithms such as Lempel-Ziv. In a sequence recall task requiring the acquisition and transfer of variables embedded in sequences, we demonstrate HVM’s sequence likelihood correlates with human recall times. In contrast, large language models (LLMs) struggle to transfer abstract variables as effectively as humans. From HVM’s adjustable layer of abstraction, we demonstrate that the model realizes a precise trade-off between compression and generalization. Our work offers a cognitive model that captures the learning and transfer of abstract representations in human cognition and differentiates itself from the behavior of large language models.
摘要：人类在不同序列中学习抽象模式、过滤无关细节并将这些泛化概念转移到新序列中表现出色。相比之下，许多序列学习模型缺乏抽象能力，导致内存效率低下和迁移效果不佳。我们引入了一种非参数的分层变量学习模型 (Hierarchical Variable Model, HVM)，该模型从序列中学习块，并将上下文相似的块抽象为变量。HVM 在高效组织内存的同时揭示抽象概念，从而实现紧凑的序列表示。在诸如 babyLM 的语言数据集上学习时，HVM 比标准压缩算法如 Lempel-Ziv 学习到更高效的词典。在需要获取和迁移嵌入在序列中的变量的序列回忆任务中，我们展示了 HVM 的序列似然性与人类回忆时间相关。相比之下，大语言模型 (Large Language Models, LLMs) 在迁移抽象变量方面不如人类有效。通过 HVM 的可调节抽象层，我们证明了模型在压缩与泛化之间实现了精确的权衡。我们的工作提供了一种认知模型，捕捉了人类认知中抽象表示的学习和迁移，并区别于大语言模型的行为。

[NLP-85] LLM Robustness Against Misinformation in Biomedical Question Answering

【速读】：该论文试图解决大型语言模型（LLMs）在问答任务中因错误信息注入而导致的答案不准确问题。解决方案的关键在于评估不同LLMs在面对错误信息（prompt-injection attacks）时的鲁棒性，并探讨检索增强生成（RAG）方法在提供正确上下文时的有效性。研究通过对比四种LLMs（Gemma 2, GPT-4o-mini, Llama 3.1, 和 Mixtral）在三种场景下的表现（vanilla LLM answers, “perfect” augmented generation, 和 prompt-injection attacks），发现Llama 3.1在“完美”RAG场景下表现最佳，但所有模型在面对错误信息注入时的准确性显著下降，尤其是Llama 3.1作为攻击者时，其生成的错误上下文对目标模型的影响最大。研究结果强调了评估LLMs对抗攻击鲁棒性的复杂性。

链接: https://arxiv.org/abs/2410.21330
作者: Alexander Bondarenko,Adrian Viehweger
关键词-EN: external knowledge sources, providing additional context, additional context coming, knowledge sources, LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering by retrieving and providing additional context coming from external knowledge sources (e.g., by adding the context to the prompt). However, injecting incorrect information can mislead the LLM to generate an incorrect answer. In this paper, we evaluate the effectiveness and robustness of four LLMs against misinformation - Gemma 2, GPT-4o-mini, Llama~3.1, and Mixtral - in answering biomedical questions. We assess the answer accuracy on yes-no and free-form questions in three scenarios: vanilla LLM answers (no context is provided), “perfect” augmented generation (correct context is provided), and prompt-injection attacks (incorrect context is provided). Our results show that Llama 3.1 (70B parameters) achieves the highest accuracy in both vanilla (0.651) and “perfect” RAG (0.802) scenarios. However, the accuracy gap between the models almost disappears with “perfect” RAG, suggesting its potential to mitigate the LLM’s size-related effectiveness differences. We further evaluate the ability of the LLMs to generate malicious context on one hand and the LLM’s robustness against prompt-injection attacks on the other hand, using metrics such as attack success rate (ASR), accuracy under attack, and accuracy drop. As adversaries, we use the same four LLMs (Gemma 2, GPT-4o-mini, Llama 3.1, and Mixtral) to generate incorrect context that is injected in the target model’s prompt. Interestingly, Llama is shown to be the most effective adversary, causing accuracy drops of up to 0.48 for vanilla answers and 0.63 for “perfect” RAG across target models. Our analysis reveals that robustness rankings vary depending on the evaluation measure, highlighting the complexity of assessing LLM resilience to adversarial attacks. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.21330 [cs.CL] (or arXiv:2410.21330v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.21330 Focus to learn more arXiv-issued DOI via DataCite
摘要：检索增强生成 (Retrieval-Augmented Generation, RAG) 方法通过从外部知识源（例如，通过将上下文添加到提示中）检索并提供额外信息，来减少大语言模型 (Large Language Models, LLMs) 在问答任务中的虚构现象。然而，注入错误信息可能会误导 LLM 生成错误答案。本文中，我们评估了四种 LLM 在应对错误信息时的有效性和鲁棒性——Gemma 2、GPT-4o-mini、Llama~3.1 和 Mixtral，针对生物医学问题的回答。我们在三种场景下评估了答案的准确性：纯 LLM 回答（不提供上下文）、“完美”增强生成（提供正确上下文）和提示注入攻击（提供错误上下文）。结果显示，Llama 3.1（70B 参数）在纯 LLM 回答（0.651）和“完美”RAG（0.802）场景中均达到最高准确率。然而，在“完美”RAG 场景下，各模型之间的准确率差距几乎消失，表明 RAG 可能有助于缓解 LLM 规模相关的效果差异。我们进一步评估了 LLM 生成恶意上下文的能力以及 LLM 对提示注入攻击的鲁棒性，使用攻击成功率 (Attack Success Rate, ASR)、攻击下的准确率和准确率下降等指标。作为攻击者，我们使用相同的四种 LLM（Gemma 2、GPT-4o-mini、Llama 3.1 和 Mixtral）生成错误上下文，并注入目标模型的提示中。有趣的是，Llama 被证明是最有效的攻击者，导致目标模型在纯回答和“完美”RAG 场景下的准确率分别下降高达 0.48 和 0.63。我们的分析表明，鲁棒性排名因评估指标的不同而有所变化，突显了评估 LLM 对对抗攻击的韧性复杂性。

主题：计算与语言 (cs.CL)；人工智能 (cs.AI)
引用方式：arXiv:2410.21330 [cs.CL]（或 arXiv:2410.21330v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.21330
通过 DataCite 发布的 arXiv DOI

[NLP-86] Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts

【速读】：该论文试图解决在STEM文章中理解和提取数学表达式之间依赖关系的问题。解决方案的关键在于构建一个名为“推导图 (derivation graph)”的新对象，该图概括了文章中的数学内容，并通过手工标注的107篇STEM文章数据集来评估分析模型和自然语言处理模型（包括大型语言模型 (LLMs)）在识别和提取这些依赖关系方面的能力。研究发现，尽管NLP领域取得了显著进展，但在理解数学文本方面，当前模型（包括LLMs）的F1分数仅达到40-50%，表明在数学文本理解方面仍有较大的改进空间。

链接: https://arxiv.org/abs/2410.21324
作者: Vishesh Prasad,Brian Kim,Nickvash Kani
关键词-EN: natural language processing, large language models, language processing, natural language, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing textual data, applying analysis to mathematical equations and their relationships within texts has produced mixed results. In this paper, we take the initial steps toward understanding the dependency relationships between mathematical expressions in STEM articles. Our dataset, sourced from a random sampling of the arXiv corpus, contains an analysis of 107 published STEM manuscripts whose inter-equation dependency relationships have been hand-labeled, resulting in a new object we refer to as a derivation graph that summarizes the mathematical content of the manuscript. We exhaustively evaluate analytical and NLP-based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. Our comprehensive testing finds that both analytical and NLP models (including LLMs) achieve \sim 40-50% F1 scores for extracting derivation graphs from articles, revealing that the recent advances in NLP have not made significant inroads in comprehending mathematical texts compared to simpler analytic models. While current approaches offer a solid foundation for extracting mathematical information, further research is necessary to improve accuracy and depth in this area.
摘要：近年来，自然语言处理（Natural Language Processing, NLP）领域的进展，特别是大语言模型（Large Language Model, LLM）的出现，极大地推动了文本分析领域的发展。然而，尽管这些进展在分析文本数据方面取得了显著的进步，但在将分析应用于文本中的数学方程及其关系时，结果却参差不齐。本文首次尝试理解STEM（科学、技术、工程和数学）文章中数学表达式之间的依赖关系。我们的数据集来自arXiv语料库的随机抽样，包含了对107篇已发表的STEM手稿的分析，这些手稿中的方程间依赖关系已由人工标注，从而形成了一种新的对象，我们称之为推导图（derivation graph），它概括了手稿中的数学内容。我们全面评估了基于分析和NLP的模型，以评估它们识别和提取每篇文章推导关系的能力，并将结果与真实情况进行比较。我们的综合测试发现，无论是分析模型还是NLP模型（包括大语言模型），在从文章中提取推导图时，F1分数均在40-50%之间，这表明最近的NLP进展在理解数学文本方面并未比简单的分析模型取得显著进展。尽管当前的方法为提取数学信息提供了坚实的基础，但进一步的研究是必要的，以提高这一领域的准确性和深度。

[NLP-87] User-Aware Multilingual Abusive Content Detection in Social Media

【速读】：该论文试图解决在多语种低资源语言环境下社交媒体上的辱骂内容检测问题。解决方案的关键在于提出了一种新颖的方法，通过分别学习社交和文本上下文特征，并将这些特征整合用于最终的预测。具体来说，该方法首先在两个独立的模块中学习社交和文本上下文特征，然后将这些特征的整合表示用于最终的辱骂内容检测。实验结果表明，该方法在SCIDN和MACI数据集上分别比现有的最先进方法提高了4.08%和9.52%的F1-score，显著提升了检测性能。

链接: https://arxiv.org/abs/2410.21321
作者: Mohammad Zia Ur Rehman,Somya Mehta,Kuldeep Singh,Kunal Kaushik,Nagendra Kumar
关键词-EN: halt distasteful content, multilingualism has added, growing efforts, efforts to halt, halt distasteful
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem. The scarcity of resources makes the challenge even greater when it comes to low-resource languages. This work focuses on providing a novel method for abusive content detection in multiple low-resource Indic languages. Our observation indicates that a post’s tendency to attract abusive comments, as well as features such as user history and social context, significantly aid in the detection of abusive content. The proposed method first learns social and text context features in two separate modules. The integrated representation from these modules is learned and used for the final prediction. To evaluate the performance of our method against different classical and state-of-the-art methods, we have performed extensive experiments on SCIDN and MACI datasets consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with an average increase of 4.08% and 9.52% in F1-scores on SCIDN and MACI datasets, respectively.
摘要：尽管在社交媒体上阻止令人反感的内容的努力不断增加，但多语言性为这一问题增添了新的维度。对于低资源语言而言，资源的稀缺性使得这一挑战更加严峻。本文聚焦于提供一种新颖的方法，用于在多种低资源的印度语言中检测辱骂性内容。我们的观察表明，帖子吸引辱骂性评论的倾向性，以及用户历史和社会背景等特征，显著有助于辱骂性内容的检测。所提出的方法首先在两个独立的模块中学习社会和文本上下文特征。这些模块的集成表示被学习并用于最终的预测。为了评估我们的方法相对于不同经典和最先进方法的性能，我们在SCIDN和MACI数据集上进行了广泛的实验，这两个数据集分别包含150万和66.5万条多语言评论。我们提出的方法在SCIDN和MACI数据集上的F1分数分别平均提高了4.08%和9.52%，优于最先进的基线方法。

[NLP-88] GraphLSS: Integrating Lexical Structural and Semantic Features for Long Document Extractive Summarization ACL

【速读】：该论文试图解决长文档摘要中异构图神经网络模型复杂且依赖外部工具或额外机器学习模型的问题。解决方案的关键在于提出了GraphLSS，一种结合词汇(Lexical)、结构(Structural)和语义(Semantic)特征的异构图构建方法。GraphLSS定义了两个层次的信息（词和句子）以及四种边类型（句子语义相似性、句子出现顺序、词在句子中、词语义相似性），无需辅助学习模型，简化了图结构并提高了模型的直观性。实验结果表明，GraphLSS在两个基准数据集上与顶尖的基于图的方法竞争，并优于最近的非图模型。

链接: https://arxiv.org/abs/2410.21315
作者: Margarita Bugueño,Hazem Abou Hamdan,Gerard de Melo
关键词-EN: node classification task, recently gained attention, graph neural networks, Heterogeneous graph neural, modeling the extraction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Short paper submitted to ACL ARR November cycle

点击查看摘要

Abstract:Heterogeneous graph neural networks have recently gained attention for long document summarization, modeling the extraction as a node classification task. Although effective, these models often require external tools or additional machine learning models to define graph components, producing highly complex and less intuitive structures. We present GraphLSS, a heterogeneous graph construction for long document extractive summarization, incorporating Lexical, Structural, and Semantic features. It defines two levels of information (words and sentences) and four types of edges (sentence semantic similarity, sentence occurrence order, word in sentence, and word semantic similarity) without any need for auxiliary learning models. Experiments on two benchmark datasets show that GraphLSS is competitive with top-performing graph-based methods, outperforming recent non-graph models. We release our code on GitHub.
摘要：异构图神经网络最近在长文档摘要领域引起了关注，将摘要提取建模为节点分类任务。尽管这些模型有效，但它们通常需要外部工具或额外的机器学习模型来定义图的组成部分，从而生成高度复杂且不直观的结构。我们提出了 GraphLSS，这是一种用于长文档抽取式摘要的异构图构建方法，整合了词汇 (Lexical)、结构 (Structural) 和语义 (Semantic) 特征。该方法定义了两个层次的信息（词和句子）以及四种边类型（句子语义相似性、句子出现顺序、词在句子中、词语义相似性），无需任何辅助学习模型。在两个基准数据集上的实验表明，GraphLSS 在与顶级图基方法的竞争中表现出色，优于最近的非图模型。我们在 GitHub 上发布了我们的代码。

[NLP-89] Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

【速读】：该论文试图解决扩散模型（Diffusion Models）在生成高质量图像时，其语义潜在空间（semantic latent spaces）的理解和解释比生成对抗网络（GANs）更为复杂的问题。解决方案的关键在于提出了一种新颖的无监督探索框架，通过直接利用自然语言提示（natural language prompts）和图像标题（image captions）来映射潜在方向（latent directions）。这种方法无需手动解释或训练特定向量，从而实现了对隐藏特征的自动理解，并支持更广泛的分析范围。该框架提供了对扩散模型中编码的语义知识的更可扩展和可解释的理解，有助于全面分析潜在偏差和模型学习的细微表示。

链接: https://arxiv.org/abs/2410.21314
作者: E. Zhixuan Zeng,Yuhao Chen,Alexander Wong
关键词-EN: models powerful tools, creating high-quality images, generation have made, powerful tools, tools for creating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in image generation have made diffusion models powerful tools for creating high-quality images. However, their iterative denoising process makes understanding and interpreting their semantic latent spaces more challenging than other generative models, such as GANs. Recent methods have attempted to address this issue by identifying semantically meaningful directions within the latent space. However, they often need manual interpretation or are limited in the number of vectors that can be trained, restricting their scope and utility. This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. This method allows for the automatic understanding of hidden features and supports a broader range of analysis without the need to train specific vectors. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models, facilitating comprehensive analysis of latent biases and the nuanced representations these models learn. Experimental results show that our framework can uncover hidden patterns and associations in various domains, offering new insights into the interpretability of diffusion model latent spaces.
摘要：近年来，图像生成技术的进步使得扩散模型成为创建高质量图像的强大工具。然而，其迭代的去噪过程使得理解和解释其语义潜在空间比其他生成模型（如 GANs）更具挑战性。最近的方法试图通过识别潜在空间内的语义有意义的方向来解决这一问题。然而，这些方法通常需要手动解释，或者在可训练的向量数量上受到限制，从而限制了其范围和实用性。本文提出了一种新的无监督探索扩散潜在空间的框架。我们直接利用自然语言提示和图像标题来映射潜在方向。这种方法支持自动理解隐藏特征，并支持更广泛的分析，无需训练特定的向量。我们的方法提供了对扩散模型内编码的语义知识更可扩展和可解释的理解，促进了潜在偏差和这些模型所学习到的细微表示的全面分析。实验结果表明，我们的框架能够在各个领域揭示隐藏的模式和关联，为扩散模型潜在空间的解释性提供了新的见解。

[NLP-90] textttPatentAgent : Intelligent Agent for Automated Pharmaceutical Patent Analysis

【速读】：该论文试图解决制药专利分析中缺乏统一智能代理的问题，该代理能够从专利阅读到核心化学结构识别的各个环节提供辅助。解决方案的关键在于引入了一个名为 PatentAgent 的智能代理，该代理由三个关键的端到端模块组成：PA-QA（专利问答）、PA-Img2Mol（图像到分子结构转换）和 PA-CoreId（核心化学结构识别）。这些模块分别解决了专利分析中的问答需求、图像到分子结构的转换问题以及核心化学结构的识别问题。通过利用大型语言模型（LLMs）的能力，PatentAgent 在多个专利基准测试中表现优异，特别是在 PA-Img2Mol 和 PA-CoreId 模块上，分别实现了2.46%到8.37%和7.15%到7.62%的准确性提升。

链接: https://arxiv.org/abs/2410.21312
作者: Xin Wang,Yifan Zhang,Xiaojing Zhang,Longhui Yu,Xinna Lin,Jindong Jiang,Bin Ma,Kaicheng Yu
关键词-EN: unique early access, experimental results, Pharmaceutical patents play, biochemical industries, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages

点击查看摘要

Abstract:Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there still lacks an unified agent that assists every aspect of patent analysis, from patent reading to core chemical identification. Leveraging the capabilities of Large Language Models (LLMs) to understand requests and follow instructions, we introduce the \textbffirst intelligent agent in this domain, \textttPatentAgent , poised to advance and potentially revolutionize the landscape of pharmaceutical research. \textttPatentAgent comprises three key end-to-end modules – \textitPA-QA , \textitPA-Img2Mol , and \textitPA-CoreId – that respectively perform (1) patent question-answering, (2) image-to-molecular-structure conversion, and (3) core chemical structure identification, addressing the essential needs of scientists and practitioners in pharmaceutical patent analysis. Each module of \textttPatentAgent demonstrates significant effectiveness with the updated algorithm and the synergistic design of \textttPatentAgent framework. \textitPA-Img2Mol outperforms existing methods across CLEF, JPO, UOB, and USPTO patent benchmarks with an accuracy gain between 2.46% and 8.37% while \textitPA-CoreId realizes accuracy improvement ranging from 7.15% to 7.62% on PatentNetML benchmark. Our code and dataset will be publicly available.
摘要：在生物化学工业中，尤其是药物发现领域，药品专利扮演着至关重要的角色，为研究人员提供了独特的早期数据、实验结果和研究见解的访问权限。随着机器学习的进步，专利分析已从手工劳动转变为自动化工具辅助的任务。然而，目前仍缺乏一个统一的智能体来辅助从专利阅读到核心化学结构识别的各个方面。利用大语言模型（LLMs）理解请求和遵循指令的能力，我们在此领域引入了首个智能体——\texttt{PatentAgent}，该智能体有望推动并可能彻底改变药品研究的面貌。\texttt{PatentAgent} 包含三个关键的端到端模块——\textit{PA-QA}、\textit{PA-Img2Mol} 和 \textit{PA-CoreId}——分别执行（1）专利问答，（2）图像到分子结构转换，以及（3）核心化学结构识别，满足了科学家和从业者在药品专利分析中的基本需求。每个模块在更新算法和\texttt{PatentAgent}框架的协同设计下均展现出显著的有效性。\textit{PA-Img2Mol}在CLEF、JPO、UOB和USPTO专利基准测试中超越了现有方法，准确率提升在2.46%至8.37%之间，而\textit{PA-CoreId}在PatentNetML基准测试中的准确率提升范围为7.15%至7.62%。我们的代码和数据集将公开发布。

[NLP-91] Natural Language Processing for the Legal Domain: A Survey of Tasks Datasets Models and Challenges

【速读】：该论文旨在系统性地回顾和总结自然语言处理（Natural Language Processing, NLP）在法律领域的应用现状和挑战。解决方案的关键在于识别和分析法律文本处理的独特性，如文档长度、语言复杂性和数据集的有限性，并探讨了针对这些特性的具体NLP任务，包括法律文档摘要（Legal Document Summarization）、法律命名实体识别（legal Named Entity Recognition）、法律问答（Legal Question Answering）、法律文本分类（Legal Text Classification）和法律判决预测（Legal Judgment Prediction）。此外，论文还强调了开发和适应法律领域语言模型的重要性，并提出了15个开放研究挑战，如人工智能应用中的偏见、模型鲁棒性和可解释性的需求，以及提升法律语言和推理复杂性的解释能力。

链接: https://arxiv.org/abs/2410.21306
作者: Farid Ariai,Gianluca Demartini
关键词-EN: Natural Language Processing, Natural Language, Language Processing, Language Models, Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages

点击查看摘要

Abstract:Natural Language Processing is revolutionizing the way legal professionals and laypersons operate in the legal field. The considerable potential for Natural Language Processing in the legal sector, especially in developing computational tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 148 studies, with a final selection of 127 after manual filtering. It explores foundational concepts related to Natural Language Processing in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document length, complex language, and limited open legal datasets. We provide an overview of Natural Language Processing tasks specific to legal text, such as Legal Document Summarization, legal Named Entity Recognition, Legal Question Answering, Legal Text Classification, and Legal Judgment Prediction. In the section on legal Language Models, we analyze both developed Language Models and approaches for adapting general Language Models to the legal domain. Additionally, we identify 15 Open Research Challenges, including bias in Artificial Intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.
摘要：自然语言处理正在彻底改变法律专业人士和普通人在法律领域的工作方式。自然语言处理在法律领域的巨大潜力，特别是在开发各种法律过程的计算工具方面，多年来一直吸引着研究者的兴趣。本综述遵循系统评价和元分析的首选报告项目框架，审查了148项研究，经过手动筛选后最终选择了127项。本文探讨了与法律领域自然语言处理相关的基本概念，阐明了处理法律文本的独特方面和挑战，如文档长度长、语言复杂以及开放法律数据集有限。我们概述了针对法律文本的自然语言处理任务，如法律文档摘要、法律命名实体识别、法律问答、法律文本分类和法律判决预测。在法律语言模型部分，我们分析了已开发的语言模型以及将通用语言模型适应于法律领域的方法。此外，我们识别了15个开放研究挑战，包括人工智能应用中的偏见、对更稳健和可解释模型的需求，以及提高解释性以应对法律语言和推理的复杂性。

[NLP-92] MatExpert: Decomposing Materials Discovery by Mimicking Human Experts

【速读】：该论文试图解决材料发现领域的加速问题，特别是新固态材料的发现和设计。解决方案的关键在于引入了一个名为MatExpert的新框架，该框架结合了大型语言模型（Large Language Models, LLMs）和对比学习（contrastive learning）。MatExpert的解决方案分为三个关键阶段：检索（retrieval）、转换（transition）和生成（generation）。在检索阶段，MatExpert识别与目标标准最接近的现有材料；在转换阶段，它规划将该材料配方修改以满足用户特定需求的必要步骤；在生成阶段，MatExpert进行详细的计算和结构生成，以基于提供的信息创建新材料。实验结果表明，MatExpert在材料生成任务中优于现有最先进的方法，在有效性、分布和稳定性等多个指标上表现出色。

链接: https://arxiv.org/abs/2410.21317
作者: Qianggang Ding,Santiago Miret,Bang Liu
关键词-EN: critical research area, Large Language Models, leverages Large Language, critical research, research area
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Material discovery is a critical research area with profound implications for various industries. In this work, we introduce MatExpert, a novel framework that leverages Large Language Models (LLMs) and contrastive learning to accelerate the discovery and design of new solid-state materials. Inspired by the workflow of human materials design experts, our approach integrates three key stages: retrieval, transition, and generation. First, in the retrieval stage, MatExpert identifies an existing material that closely matches the desired criteria. Second, in the transition stage, MatExpert outlines the necessary modifications to transform this material formulation to meet specific requirements outlined by the initial user query. Third, in the generation state, MatExpert performs detailed computations and structural generation to create new materials based on the provided information. Our experimental results demonstrate that MatExpert outperforms state-of-the-art methods in material generation tasks, achieving superior performance across various metrics including validity, distribution, and stability. As such, MatExpert represents a meaningful advancement in computational material discovery using langauge-based generative models.
摘要：材料发现是一个对多个行业具有深远影响的关键研究领域。本文中，我们介绍了 MatExpert，这是一种利用大语言模型 (LLMs) 和对比学习的新型框架，旨在加速新固态材料的发现与设计。受人类材料设计专家工作流程的启发，我们的方法整合了三个关键阶段：检索、转换和生成。首先，在检索阶段，MatExpert 识别出与所需标准高度匹配的现有材料。其次，在转换阶段，MatExpert 概述了将该材料配方转化为满足用户初始查询中特定要求的必要修改。第三，在生成阶段，MatExpert 基于提供的信息进行详细计算和结构生成，以创造新的材料。我们的实验结果表明，MatExpert 在材料生成任务中优于现有最先进的方法，在有效性、分布和稳定性等多个指标上表现出色。因此，MatExpert 代表了使用基于语言的生成模型进行计算材料发现的重要进展。

人工智能

[AI-0] Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

链接: https://arxiv.org/abs/2410.22325
作者: Guangqi Jiang,Yifei Sun,Tao Huang,Huanyu Li,Yongyuan Liang,Huazhe Xu
关键词-EN: enhanced the efficiency, manipulation centricity, manipulation, Manipulation Centric Representation, tasks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that the “manipulation centricity” is a strong indicator of success rates when applied to downstream tasks. Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework capturing both visual features and the dynamics information such as actions and proprioceptions of manipulation tasks to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot’s proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss to predict actions during pre-training, along with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: this https URL.

[AI-1] An Efficient Approach to Generate Safe Drivable Space by LiDAR-Camera-HDmap Fusion

链接: https://arxiv.org/abs/2410.22314
作者: Minghao Ning,Ahmad Reza Alghooneh,Chen Sun,Ruihe Zhang,Pouya Panahandeh,Steven Tuer,Ehsan Hashemi,Amir Khajepour
关键词-EN: drivable space extraction, drivable space, robust perception module, space extraction, perception module
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we propose an accurate and robust perception module for Autonomous Vehicles (AVs) for drivable space extraction. Perception is crucial in autonomous driving, where many deep learning-based methods, while accurate on benchmark datasets, fail to generalize effectively, especially in diverse and unpredictable environments. Our work introduces a robust easy-to-generalize perception module that leverages LiDAR, camera, and HD map data fusion to deliver a safe and reliable drivable space in all weather conditions. We present an adaptive ground removal and curb detection method integrated with HD map data for enhanced obstacle detection reliability. Additionally, we propose an adaptive DBSCAN clustering algorithm optimized for precipitation noise, and a cost-effective LiDAR-camera frustum association that is resilient to calibration discrepancies. Our comprehensive drivable space representation incorporates all perception data, ensuring compatibility with vehicle dimensions and road regulations. This approach not only improves generalization and efficiency, but also significantly enhances safety in autonomous vehicle operations. Our approach is tested on a real dataset and its reliability is verified during the daily (including harsh snowy weather) operation of our autonomous shuttle, WATonoBus

[AI-2] Effective Guidance for Model Attention with Simple Yes-no Annotations

链接: https://arxiv.org/abs/2410.22312
作者: Seongmin Lee,Ali Payani,Duen Horng(Polo)Chau
关键词-EN: Modern deep learning, deep learning models, leading to biased, limited generalization, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 10 pages, 5 figures, IEEE BigData 2024 Paper

点击查看摘要

Abstract:Modern deep learning models often make predictions by focusing on irrelevant areas, leading to biased performance and limited generalization. Existing methods aimed at rectifying model attention require explicit labels for irrelevant areas or complex pixel-wise ground truth attention maps. We present CRAYON (Correcting Reasoning with Annotations of Yes Or No), offering effective, scalable, and practical solutions to rectify model attention using simple yes-no annotations. CRAYON empowers classical and modern model interpretation techniques to identify and guide model reasoning: CRAYON-ATTENTION directs classic interpretations based on saliency maps to focus on relevant image regions, while CRAYON-PRUNING removes irrelevant neurons identified by modern concept-based methods to mitigate their influence. Through extensive experiments with both quantitative and human evaluation, we showcase CRAYON’s effectiveness, scalability, and practicality in refining model attention. CRAYON achieves state-of-the-art performance, outperforming 12 methods across 3 benchmark datasets, surpassing approaches that require more complex annotations.

[AI-3] mathsfOPA: One-shot Private Aggregation with Single Client Interaction and its Applications to Federated Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.22303
作者: Harish Karthikeyan,Antigoni Polychroniadou
关键词-EN: communication rounds, OPA, One-shot Private Aggregation, aims to minimize, minimize interaction
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear at the NeurIPS 2024 FL@FM workshop

点击查看摘要

Abstract:Our work aims to minimize interaction in secure computation due to the high cost and challenges associated with communication rounds, particularly in scenarios with many clients. In this work, we revisit the problem of secure aggregation in the single-server setting where a single evaluation server can securely aggregate client-held individual inputs. Our key contribution is the introduction of One-shot Private Aggregation ( \mathsfOPA ) where clients speak only once (or even choose not to speak) per aggregation evaluation. Since each client communicates only once per aggregation, this simplifies managing dropouts and dynamic participation, contrasting with multi-round protocols and aligning with plaintext secure aggregation, where clients interact only once. We construct \mathsfOPA based on LWR, LWE, class groups, DCR and demonstrate applications to privacy-preserving Federated Learning (FL) where clients \emphspeak once. This is a sharp departure from prior multi-round FL protocols whose study was initiated by Bonawitz et al. (CCS, 2017). Moreover, unlike the YOSO (You Only Speak Once) model for general secure computation, \mathsfOPA eliminates complex committee selection protocols to achieve adaptive security. Beyond asymptotic improvements, \mathsfOPA is practical, outperforming state-of-the-art solutions. We benchmark logistic regression classifiers for two datasets, while also building an MLP classifier to train on MNIST, CIFAR-10, and CIFAR-100 datasets. We build two flavors of \caps (1) from (threshold) key homomorphic PRF and (2) from seed homomorphic PRG and secret sharing.

[AI-4] ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising WACV2025

链接: https://arxiv.org/abs/2410.22233
作者: Ashutosh Chaubey,Anoubhav Agarwaal,Sartaki Sinha Roy,Aayush Agarwal,Susmita Ghose
关键词-EN: advertising serves ads, Contextual advertising, Contextual advertising serves, serves ads, Contextual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:Contextual advertising serves ads that are aligned to the content that the user is viewing. The rapid growth of video content on social platforms and streaming services, along with privacy concerns, has increased the need for contextual advertising. Placing the right ad in the right context creates a seamless and pleasant ad viewing experience, resulting in higher audience engagement and, ultimately, better ad monetization. From a technology standpoint, effective contextual advertising requires a video retrieval system capable of understanding complex video content at a very granular level. Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources, limiting their practicality and lacking the key functionalities required for ad ecosystem integration. We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising. ContextIQ utilizes modality-specific experts-video, audio, transcript (captions), and metadata such as objects, actions, emotion, etc.-to create semantically rich video representations. We show that our system, without joint training, achieves better or comparable results to state-of-the-art models and commercial solutions on multiple text-to-video retrieval benchmarks. Our ablation studies highlight the benefits of leveraging multiple modalities for enhanced video retrieval accuracy instead of using a vision-language model alone. Furthermore, we show how video retrieval systems such as ContextIQ can be used for contextual advertising in an ad ecosystem while also addressing concerns related to brand safety and filtering inappropriate content.

[AI-5] A Methodology for Gradual Semantics for Structured Argumentation under Incomplete Information

链接: https://arxiv.org/abs/2410.22209
作者: Antonio Rago,Stylianos Loukas Vasileiou,Francesca Toni,Tran Cao Son,William Yeoh
关键词-EN: deploying quantitative bipolar, quantitative bipolar argumentation, Gradual semantics, bipolar argumentation frameworks, demonstrated great potential
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Gradual semantics have demonstrated great potential in argumentation, in particular for deploying quantitative bipolar argumentation frameworks (QBAFs) in a number of real-world settings, from judgmental forecasting to explainable AI. In this paper, we provide a novel methodology for obtaining gradual semantics for structured argumentation frameworks, where the building blocks of arguments and relations between them are known, unlike in QBAFs, where arguments are abstract entities. Differently from existing approaches, our methodology accommodates incomplete information about arguments’ premises. We demonstrate the potential of our approach by introducing two different instantiations of the methodology, leveraging existing gradual semantics for QBAFs in these more complex frameworks. We also define a set of novel properties for gradual semantics in structured argumentation, discuss their suitability over a set of existing properties. Finally, we provide a comprehensive theoretical analysis assessing the instantiations, demonstrating the their advantages over existing gradual semantics for QBAFs and structured argumentation.

[AI-6] Drone Acoustic Analysis for Predicting Psychoacoustic Annoyance via Artificial Neural Networks

链接: https://arxiv.org/abs/2410.22208
作者: Andrea Vaiuso,Marcello Righi,Oier Coretti,Moreno Apicella
关键词-EN: Unmanned Aerial Vehicles, Unmanned Aerial, low operational cost, Aerial Vehicles, operational cost
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注: 20 Pages, 10 Figures, 4 Tables

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) have become widely used in various fields and industrial applications thanks to their low operational cost, compact size and wide accessibility. However, the noise generated by drone propellers has emerged as a significant concern. This may affect the public willingness to implement these vehicles in services that require operation in proximity to residential areas. The standard approaches to address this challenge include sound pressure measurements and noise characteristic analyses. The integration of Artificial Intelligence models in recent years has further streamlined the process by enhancing complex feature detection in drone acoustics data. This study builds upon prior research by examining the efficacy of various Deep Learning models in predicting Psychoacoustic Annoyance, an effective index for measuring perceived annoyance by human ears, based on multiple drone characteristics as input. This is accomplished by constructing a training dataset using precise measurements of various drone models with multiple microphones and analyzing flight data, maneuvers, drone physical characteristics, and perceived annoyance under realistic conditions. The aim of this research is to improve our understanding of drone noise, aid in the development of noise reduction techniques, and encourage the acceptance of drone usage on public spaces.

[AI-7] Democratizing Reward Design for Personal and Representative Value-Alignment

链接: https://arxiv.org/abs/2410.22203
作者: Carter Blair,Kate Larson,Edith Law
关键词-EN: Aligning AI agents, agents with human, challenging due, Aligning, subjective notions
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 19 pages, 16 figures

点击查看摘要

Abstract:Aligning AI agents with human values is challenging due to diverse and subjective notions of values. Standard alignment methods often aggregate crowd feedback, which can result in the suppression of unique or minority preferences. We introduce Interactive-Reflective Dialogue Alignment, a method that iteratively engages users in reflecting on and specifying their subjective value definitions. This system learns individual value definitions through language-model-based preference elicitation and constructs personalized reward models that can be used to align AI behaviour. We evaluated our system through two studies with 30 participants, one focusing on “respect” and the other on ethical decision-making in autonomous vehicles. Our findings demonstrate diverse definitions of value-aligned behaviour and show that our system can accurately capture each person’s unique understanding. This approach enables personalized alignment and can inform more representative and interpretable collective alignment strategies.

[AI-8] Multi-Level Feature Distillation of Joint Teachers Trained on Distinct Image Datasets WACV2025

链接: https://arxiv.org/abs/2410.22184
作者: Adrian Iordache,Bogdan Alexe,Radu Tudor Ionescu
关键词-EN: teacher-student framework, framework to distill, feature distillation, feature distillation procedure, multi-level feature distillation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:We propose a novel teacher-student framework to distill knowledge from multiple teachers trained on distinct datasets. Each teacher is first trained from scratch on its own dataset. Then, the teachers are combined into a joint architecture, which fuses the features of all teachers at multiple representation levels. The joint teacher architecture is fine-tuned on samples from all datasets, thus gathering useful generic information from all data samples. Finally, we employ a multi-level feature distillation procedure to transfer the knowledge to a student model for each of the considered datasets. We conduct image classification experiments on seven benchmarks, and action recognition experiments on three benchmarks. To illustrate the power of our feature distillation procedure, the student architectures are chosen to be identical to those of the individual teachers. To demonstrate the flexibility of our approach, we combine teachers with distinct architectures. We show that our novel Multi-Level Feature Distillation (MLFD) can significantly surpass equivalent architectures that are either trained on individual datasets, or jointly trained on all datasets at once. Furthermore, we confirm that each step of the proposed training procedure is well motivated by a comprehensive ablation study. We publicly release our code at this https URL.

[AI-9] Analyzing Multimodal Interaction Strategies for LLM -Assisted Manipulation of 3D Scenes

链接: https://arxiv.org/abs/2410.22177
作者: Junlong Chen,Jens Grubert,Per Ola Kristensson
关键词-EN: large language models, involve LLMs, interaction patterns, applications of large, identify interaction patterns
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: under review

点击查看摘要

Abstract:As more applications of large language models (LLMs) for 3D content for immersive environments emerge, it is crucial to study user behaviour to identify interaction patterns and potential barriers to guide the future design of immersive content creation and editing systems which involve LLMs. In an empirical user study with 12 participants, we combine quantitative usage data with post-experience questionnaire feedback to reveal common interaction patterns and key barriers in LLM-assisted 3D scene editing systems. We identify opportunities for improving natural language interfaces in 3D design tools and propose design recommendations for future LLM-integrated 3D content creation systems. Through an empirical study, we demonstrate that LLM-assisted interactive systems can be used productively in immersive environments.

[AI-10] Standardization Trends on Safety and Trustworthiness Technology for Advanced AI

链接: https://arxiv.org/abs/2410.22151
作者: Jonghong Jeon
关键词-EN: artificial general intelligence, Artificial Intelligence, image and video, video recognition, scientific reasoning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 13 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Artificial Intelligence (AI) has rapidly evolved over the past decade and has advanced in areas such as language comprehension, image and video recognition, programming, and scientific reasoning. Recent AI technologies based on large language models and foundation models are approaching or surpassing artificial general intelligence. These systems demonstrate superior performance in complex problem solving, natural language processing, and multi-domain tasks, and can potentially transform fields such as science, industry, healthcare, and education. However, these advancements have raised concerns regarding the safety and trustworthiness of advanced AI, including risks related to uncontrollability, ethical conflicts, long-term socioeconomic impacts, and safety assurance. Efforts are being expended to develop internationally agreed-upon standards to ensure the safety and reliability of AI. This study analyzes international trends in safety and trustworthiness standardization for advanced AI, identifies key areas for standardization, proposes future directions and strategies, and draws policy implications. The goal is to support the safe and trustworthy development of advanced AI and enhance international competitiveness through effective standardization.

[AI-11] Lightweight Frequency Masker for Cross-Domain Few-Shot Semantic Segmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.22135
作者: Jintao Tong,Yixiong Zou,Yuhua Li,Ruixuan Li
关键词-EN: Cross-domain few-shot segmentation, large-scale source-domain dataset, data-scarce target-domain datasets, pre-train the model, transfer the model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Cross-domain few-shot segmentation (CD-FSS) is proposed to first pre-train the model on a large-scale source-domain dataset, and then transfer the model to data-scarce target-domain datasets for pixel-level segmentation. The significant domain gap between the source and target datasets leads to a sharp decline in the performance of existing few-shot segmentation (FSS) methods in cross-domain scenarios. In this work, we discover an intriguing phenomenon: simply filtering different frequency components for target domains can lead to a significant performance improvement, sometimes even as high as 14% mIoU. Then, we delve into this phenomenon for an interpretation, and find such improvements stem from the reduced inter-channel correlation in feature maps, which benefits CD-FSS with enhanced robustness against domain gaps and larger activated regions for segmentation. Based on this, we propose a lightweight frequency masker, which further reduces channel correlations by an amplitude-phase-masker (APM) module and an Adaptive Channel Phase Attention (ACPA) module. Notably, APM introduces only 0.01% additional parameters but improves the average performance by over 10%, and ACPA imports only 2.5% parameters but further improves the performance by over 1.5%, which significantly surpasses the state-of-the-art CD-FSS methods.

[AI-12] ProMoE: Fast MoE-based LLM Serving using Proactive Caching

链接: https://arxiv.org/abs/2410.22134
作者: Xiaoniu Song,Zihang Zhong,Rong Chen
关键词-EN: limited GPU memory, GPU memory capacity, large language models, GPU memory demand, GPU memory
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model’s parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.

[AI-13] Solving Epistemic Logic Programs using Generate-and-Test with Propagation

链接: https://arxiv.org/abs/2410.22130
作者: Jorge Fandinno,Lute Lillo
关键词-EN: prove sufficient conditions, epistemic logic programs, prove sufficient, sufficient conditions, general framework
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:This paper introduces a general framework for generate-and-test-based solvers for epistemic logic programs that can be instantiated with different generator and tester programs, and we prove sufficient conditions on those programs for the correctness of the solvers built using this framework. It also introduces a new generator program that incorporates the propagation of epistemic consequences and shows that this can exponentially reduce the number of candidates that need to be tested while only incurring a linear overhead. We implement a new solver based on these theoretical findings and experimentally show that it outperforms existing solvers by achieving a ~3.3x speed-up and solving 91% more instances on well-known benchmarks.

[AI-14] Improving Performance of Commercially Available AI Products in a Multi-Agent Configuration

链接: https://arxiv.org/abs/2410.22129
作者: Cory Hymel,Sida Peng,Kevin Xu,Charath Ranganathan
关键词-EN: large language models, recent years, language models, rapid advancement, advancement of large
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 7 pages, 8 figures

点击查看摘要

Abstract:In recent years, with the rapid advancement of large language models (LLMs), multi-agent systems have become increasingly more capable of practical application. At the same time, the software development industry has had a number of new AI-powered tools developed that improve the software development lifecycle (SDLC). Academically, much attention has been paid to the role of multi-agent systems to the SDLC. And, while single-agent systems have frequently been examined in real-world applications, we have seen comparatively few real-world examples of publicly available commercial tools working together in a multi-agent system with measurable improvements. In this experiment we test context sharing between Crowdbotics PRD AI, a tool for generating software requirements using AI, and GitHub Copilot, an AI pair-programming tool. By sharing business requirements from PRD AI, we improve the code suggestion capabilities of GitHub Copilot by 13.8% and developer task success rate by 24.5% – demonstrating a real-world example of commercially-available AI systems working together with improved outcomes.

[AI-15] Vision Paper: Designing Graph Neural Networks in Compliance with the European Artificial Intelligence Act

链接: https://arxiv.org/abs/2410.22120
作者: Barbara Hoffmann,Jana Vatter,Ruben Mayer
关键词-EN: Union Artificial Intelligence, European Union Artificial, Graph Neural Networks, Artificial Intelligence Act, Artificial Intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The European Union’s Artificial Intelligence Act (AI Act) introduces comprehensive guidelines for the development and oversight of Artificial Intelligence (AI) and Machine Learning (ML) systems, with significant implications for Graph Neural Networks (GNNs). This paper addresses the unique challenges posed by the AI Act for GNNs, which operate on complex graph-structured data. The legislation’s requirements for data management, data governance, robustness, human oversight, and privacy necessitate tailored strategies for GNNs. Our study explores the impact of these requirements on GNN training and proposes methods to ensure compliance. We provide an in-depth analysis of bias, robustness, explainability, and privacy in the context of GNNs, highlighting the need for fair sampling strategies and effective interpretability techniques. Our contributions fill the research gap by offering specific guidance for GNNs under the new legislative framework and identifying open questions and future research directions.

[AI-16] Policy Gradient for Robust Markov Decision Processes

链接: https://arxiv.org/abs/2410.22114
作者: Qiuhao Wang,Shaohang Xu,Chin Pang Ho,Marek Petrick
关键词-EN: Markov Decision Processes, robust Markov Decision, Decision Processes, Markov Decision, policy gradient method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We develop a generic policy gradient method with the global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalable and efficient nature, adapting these methods to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for the policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA). Additionally, we propose innovative parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.

[AI-17] DAGE: DAG Query Answering via Relational Combinator with Logical Constraints

链接: https://arxiv.org/abs/2410.22105
作者: Yunjie He,Bo Xiong,Daniel Hernández,Yuqicheng Zhu,Evgeny Kharlamov,Steffen Staab
关键词-EN: complex reasoning task, Predicting answers, query requires subdividing, query embedding methods, Existing query embedding
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predicting answers to queries over knowledge graphs is called a complex reasoning task because answering a query requires subdividing it into subqueries. Existing query embedding methods use this decomposition to compute the embedding of a query as the combination of the embedding of the subqueries. This requirement limits the answerable queries to queries having a single free variable and being decomposable, which are called tree-form queries and correspond to the \mathcalSROI^- description logic. In this paper, we define a more general set of queries, called DAG queries and formulated in the \mathcalALCOIR description logic, propose a query embedding method for them, called DAGE, and a new benchmark to evaluate query embeddings on them. Given the computational graph of a DAG query, DAGE combines the possibly multiple paths between two nodes into a single path with a trainable operator that represents the intersection of relations and learns DAG-DL from tautologies. We show that it is possible to implement DAGE on top of existing query embedding methods, and we empirically measure the improvement of our method over the results of vanilla methods evaluated in tree-form queries that approximate the DAG queries of our proposed benchmark.

[AI-18] Hyperspectral Imaging-Based Perception in Autonomous Driving Scenarios: Benchmarking Baseline Semantic Segmentation Models

链接: https://arxiv.org/abs/2410.22101
作者: Imad Ali Shah,Jiarong Li,Martin Glavin,Edward Jones,Enda Ward,Brian Deegan
关键词-EN: traditional RGB imaging, RGB imaging, Driving Assistance Systems, Advanced Driving Assistance, traditional RGB
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at IEEE WHISPERS 2024

点击查看摘要

Abstract:Hyperspectral Imaging (HSI) is known for its advantages over traditional RGB imaging in remote sensing, agriculture, and medicine. Recently, it has gained attention for enhancing Advanced Driving Assistance Systems (ADAS) perception. Several HSI datasets such as HyKo, HSI-Drive, HSI-Road, and Hyperspectral City have been made available. However, a comprehensive evaluation of semantic segmentation models (SSM) using these datasets is lacking. To address this gap, we evaluated the available annotated HSI datasets on four deep learning-based baseline SSMs: DeepLab v3+, HRNet, PSPNet, and U-Net, along with its two variants: Coordinate Attention (UNet-CA) and Convolutional Block-Attention Module (UNet-CBAM). The original model architectures were adapted to handle the varying spatial and spectral dimensions of the datasets. These baseline SSMs were trained using a class-weighted loss function for individual HSI datasets and evaluated using mean-based metrics such as intersection over union (IoU), recall, precision, F1 score, specificity, and accuracy. Our results indicate that UNet-CBAM, which extracts channel-wise features, outperforms other SSMs and shows potential to leverage spectral information for enhanced semantic segmentation. This study establishes a baseline SSM benchmark on available annotated datasets for future evaluation of HSI-based ADAS perception. However, limitations of current HSI datasets, such as limited dataset size, high class imbalance, and lack of fine-grained annotations, remain significant constraints for developing robust SSMs for ADAS applications.

[AI-19] ractShapeNet: Efficient Multi-Shape Learning with 3D Tractography Point Clouds

链接: https://arxiv.org/abs/2410.22099
作者: Yui Lo,Yuqian Chen,Dongnan Liu,Jon Haitz Legarreta,Leo Zekelman,Fan Zhang,Jarrett Rushmore,Yogesh Rathi,Nikos Makris,Alexandra J. Golby,Weidong Cai,Lauren J. O’Donnell
关键词-EN: diffusion MRI tractography, MRI tractography geometric, brain white matter, Brain imaging studies, white matter pathways
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 2 figures, 4 tables. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Brain imaging studies have demonstrated that diffusion MRI tractography geometric shape descriptors can inform the study of the brain’s white matter pathways and their relationship to brain function. In this work, we investigate the possibility of utilizing a deep learning model to compute shape measures of the brain’s white matter connections. We introduce a novel framework, TractShapeNet, that leverages a point cloud representation of tractography to compute five shape measures: length, span, volume, total surface area, and irregularity. We assess the performance of the method on a large dataset including 1065 healthy young adults. Experiments for shape measure computation demonstrate that our proposed TractShapeNet outperforms other point cloud-based neural network models in both the Pearson correlation coefficient and normalized error metrics. We compare the inference runtime results with the conventional shape computation tool DSI-Studio. Our results demonstrate that a deep learning approach enables faster and more efficient shape measure computation. We also conduct experiments on two downstream language cognition prediction tasks, showing that shape measures from TractShapeNet perform similarly to those computed by DSI-Studio. Our code will be available at: this https URL.

[AI-20] Mapping the Neuro-Symbolic AI Landscape by Architectures: A Handbook on Augmenting Deep Learning Through Symbolic Reasoning

链接: https://arxiv.org/abs/2410.22077
作者: Jonathan Feldstein,Paulius Dilkas,Vaishak Belle,Efthymia Tsamoura
关键词-EN: Integrating symbolic techniques, artificial intelligence, Integrating symbolic, long-standing problem, problem in artificial
类目: Artificial Intelligence (cs.AI)
*备注: 57 pages

点击查看摘要

Abstract:Integrating symbolic techniques with statistical ones is a long-standing problem in artificial intelligence. The motivation is that the strengths of either area match the weaknesses of the other, and \unicodex2013 by combining the two \unicodex2013 the weaknesses of either method can be limited. Neuro-symbolic AI focuses on this integration where the statistical methods are in particular neural networks. In recent years, there has been significant progress in this research field, where neuro-symbolic systems outperformed logical or neural models alone. Yet, neuro-symbolic AI is, comparatively speaking, still in its infancy and has not been widely adopted by machine learning practitioners. In this survey, we present the first mapping of neuro-symbolic techniques into families of frameworks based on their architectures, with several benefits: Firstly, it allows us to link different strengths of frameworks to their respective architectures. Secondly, it allows us to illustrate how engineers can augment their neural networks while treating the symbolic methods as black-boxes. Thirdly, it allows us to map most of the field so that future researchers can identify closely related frameworks.

[AI-21] Enhance Hyperbolic Representation Learning via Second-order Pooling

链接: https://arxiv.org/abs/2410.22026
作者: Kun Song,Ruben Solozabal,Li hao,Lu Ren,Moloud Abdar,Qing Li,Fakhri Karray,Martin Takac
关键词-EN: Hyperbolic representation learning, capture hierarchical information, hierarchical information, Hyperbolic representation, representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hyperbolic representation learning is well known for its ability to capture hierarchical information. However, the distance between samples from different levels of hierarchical classes can be required large. We reveal that the hyperbolic discriminant objective forces the backbone to capture this hierarchical information, which may inevitably increase the Lipschitz constant of the backbone. This can hinder the full utilization of the backbone’s generalization ability. To address this issue, we introduce second-order pooling into hyperbolic representation learning, as it naturally increases the distance between samples without compromising the generalization ability of the input features. In this way, the Lipschitz constant of the backbone does not necessarily need to be large. However, current off-the-shelf low-dimensional bilinear pooling methods cannot be directly employed in hyperbolic representation learning because they inevitably reduce the distance expansion capability. To solve this problem, we propose a kernel approximation regularization, which enables the low-dimensional bilinear features to approximate the kernel function well in low-dimensional space. Finally, we conduct extensive experiments on graph-structured datasets to demonstrate the effectiveness of the proposed method.

[AI-22] Path-based summary explanations for graph recommenders – extended version

链接: https://arxiv.org/abs/2410.22020
作者: Danae Pla Karidi,Evaggelia Pitoura
关键词-EN: Path-based explanations provide, Path-based explanations, provide intrinsic insights, Steiner Tree, graph-based recommendation models
类目: Artificial Intelligence (cs.AI)
*备注: Extended Version - This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Path-based explanations provide intrinsic insights into graph-based recommendation models. However, most previous work has focused on explaining an individual recommendation of an item to a user. In this paper, we propose summary explanations, i.e., explanations that highlight why a user or a group of users receive a set of item recommendations and why an item, or a group of items, is recommended to a set of users as an effective means to provide insights into the collective behavior of the recommender. We also present a novel method to summarize explanations using efficient graph algorithms, specifically the Steiner Tree and the Prize-Collecting Steiner Tree. Our approach reduces the size and complexity of summary explanations while preserving essential information, making explanations more comprehensible for users and more useful to model developers. Evaluations across multiple metrics demonstrate that our summaries outperform baseline explanation methods in most scenarios, in a variety of quality aspects.

[AI-23] Modeling Temporal Positive and Negative Excitation for Sequential Recommendation

链接: https://arxiv.org/abs/2410.22013
作者: Chengkai Huang,Shoujin Wang,Xianzhi Wang,Lina Yao
关键词-EN: negative excitation, Sequential recommendation aims, Sequential recommendation, interest, dynamic interest
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sequential recommendation aims to predict the next item which interests users via modeling their interest in items over time. Most of the existing works on sequential recommendation model users’ dynamic interest in specific items while overlooking users’ static interest revealed by some static attribute information of items, e.g., category, or brand. Moreover, existing works often only consider the positive excitation of a user’s historical interactions on his/her next choice on candidate items while ignoring the commonly existing negative excitation, resulting in insufficient modeling dynamic interest. The overlook of static interest and negative excitation will lead to incomplete interest modeling and thus impede the recommendation performance. To this end, in this paper, we propose modeling both static interest and negative excitation for dynamic interest to further improve the recommendation performance. Accordingly, we design a novel Static-Dynamic Interest Learning (SDIL) framework featured with a novel Temporal Positive and Negative Excitation Modeling (TPNE) module for accurate sequential recommendation. TPNE is specially designed for comprehensively modeling dynamic interest based on temporal positive and negative excitation learning. Extensive experiments on three real-world datasets show that SDIL can effectively capture both static and dynamic interest and outperforms state-of-the-art baselines.

[AI-24] From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System

链接: https://arxiv.org/abs/2410.21991
作者: Wen-Dong Jiang,Chih-Yung Chang,Hsiang-Chuan Chang,Diptendu Sinha Roy
关键词-EN: violence surveillance tasks, demonstrated outstanding performance, violence surveillance, research based, based on pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages,7 figures

点击查看摘要

Abstract:Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, these black-box systems face challenges regarding explainability during training and inference processes. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule base Violence monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure for different designs for images and text. One of the branches is called the implicit branch, which uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch is called the explicit branch, which utilizes language-image alignment to perform fine-grained classification. For the language channel design in the explicit branch, the proposed RuleCLIP uses the state-of-the-art YOLO-World model to detect objects and actions in video frames, and association rules are identified through data mining methods as descriptions of the video. Leveraging the dual?branch architecture, RuleVM achieves interpretable coarse?grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleCLIP achieved the best performance in both coarse-grained and fine-grained detection, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.

[AI-25] Automated Vulnerability Detection Using Deep Learning Technique

链接: https://arxiv.org/abs/2410.21968
作者: Guan-Yan Yang,Yi-Heng Ko,Farn Wang,Kuo-Hui Yeh,Haw-Shiang Chang,Hsueh-Yi Chen
关键词-EN: SQL injection vulnerabilities, detecting SQL injection, detecting SQL, SQL injection, Long Short-Term Memory
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 4 pages, 1 figures; Presented at The 30st International Conference on Computational Experimental Engineering and Sciences (ICCES2024)

点击查看摘要

Abstract:Our work explores the utilization of deep learning, specifically leveraging the CodeBERT model, to enhance code security testing for Python applications by detecting SQL injection vulnerabilities. Unlike traditional security testing methods that may be slow and error-prone, our approach transforms source code into vector representations and trains a Long Short-Term Memory (LSTM) model to identify vulnerable patterns. When compared with existing static application security testing (SAST) tools, our model displays superior performance, achieving higher precision, recall, and F1-score. The study demonstrates that deep learning techniques, particularly with CodeBERT’s advanced contextual understanding, can significantly improve vulnerability detection, presenting a scalable methodology applicable to various programming languages and vulnerability types.

[AI-26] Dual Conditional Diffusion Models for Sequential Recommendation

链接: https://arxiv.org/abs/2410.21967
作者: Hongtao Huang,Chengkai Huang,Xiaojun Chang,Wen Hu,Lina Yao
关键词-EN: shown promising results, Recent advancements, Dual Conditional Diffusion, conditional diffusion models, diffusion models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in diffusion models have shown promising results in sequential recommendation (SR). However, current diffusion-based methods still exhibit two key limitations. First, they implicitly model the diffusion process for target item embeddings rather than the discrete target item itself, leading to inconsistency in the recommendation process. Second, existing methods rely on either implicit or explicit conditional diffusion models, limiting their ability to fully capture the context of user behavior and leading to less robust target item embeddings. In this paper, we propose the Dual Conditional Diffusion Models for Sequential Recommendation (DCRec), introducing a discrete-to-continuous sequential recommendation diffusion framework. Our framework introduces a complete Markov chain to model the transition from the reversed target item representation to the discrete item index, bridging the discrete and continuous item spaces for diffusion models and ensuring consistency with the diffusion framework. Building on this framework, we present the Dual Conditional Diffusion Transformer (DCDT) that incorporates the implicit conditional and the explicit conditional for diffusion-based SR. Extensive experiments on public benchmark datasets demonstrate that DCRec outperforms state-of-the-art methods.

[AI-27] Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution

链接: https://arxiv.org/abs/2410.21940
作者: Senne Deproost,Denis Steckelmacher,Ann Nowé
关键词-EN: Deep Reinforcement Learning, Deep Reinforcement, Programmatic Reinforcement Learning, Reinforcement Learning, real-world systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in BNAIC/BeNeLearn 2024 conference proceedings

点击查看摘要

Abstract:With Deep Reinforcement Learning (DRL) being increasingly considered for the control of real-world systems, the lack of transparency of the neural network at the core of RL becomes a concern. Programmatic Reinforcement Learning (PRL) is able to to create representations of this black-box in the form of source code, not only increasing the explainability of the controller but also allowing for user adaptations. However, these methods focus on distilling a black-box policy into a program and do so after learning using the Mean Squared Error between produced and wanted behaviour, discarding other elements of the RL algorithm. The distilled policy may therefore perform significantly worse than the black-box learned policy. In this paper, we propose to directly learn a program as the policy of an RL agent. We build on TD3 and use its critics as the basis of the objective function of a genetic algorithm that syntheses the program. Our approach builds the program during training, as opposed to after the fact. This steers the program to actual high rewards, instead of a simple Mean Squared Error. Also, our approach leverages the TD3 critics to achieve high sample-efficiency, as opposed to pure genetic methods that rely on Monte-Carlo evaluations. Our experiments demonstrate the validity, explainability and sample-efficiency of our approach in a simple gridworld environment. Comments: Accepted in BNAIC/BeNeLearn 2024 conference proceedings Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.21940 [cs.LG] (or arXiv:2410.21940v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.21940 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-28] Benchmarking OpenAI o1 in Cyber Security

链接: https://arxiv.org/abs/2410.21939
作者: Dan Ristea,Vasilios Mavroudis,Chris Hicks
关键词-EN: evaluate OpenAI, benchmarking their performance, Nginx challenge project, Abstract, DARPA AI Cyber
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We evaluate OpenAI’s o1-preview and o1-mini models, benchmarking their performance against the earlier GPT-4o model. Our evaluation focuses on their ability to detect vulnerabilities in real-world software by generating structured inputs that trigger known sanitizers. Using DARPA’s AI Cyber Challenge (AIxCC) framework and the Nginx challenge project–a deliberately modified version of the widely-used Nginx web server–we create a well-defined yet complex environment for testing LLMs on automated vulnerability detection (AVD) tasks. Our results show that the o1-preview model significantly outperforms GPT-4o in both success rate and efficiency, especially in more complex scenarios.

[AI-29] ReMix: Training Generalized Person Re-identification on a Mixture of Data WACV2025

链接: https://arxiv.org/abs/2410.21938
作者: Timur Mamedov,Anton Konushin,Vadim Konushin
关键词-EN: Modern person re-identification, capturing environments change, major accuracy drop, Modern person, environments change
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by WACV 2025

点击查看摘要

Abstract:Modern person re-identification (Re-ID) methods have a weak generalization ability and experience a major accuracy drop when capturing environments change. This is because existing multi-camera Re-ID datasets are limited in size and diversity, since such data is difficult to obtain. At the same time, enormous volumes of unlabeled single-camera records are available. Such data can be easily collected, and therefore, it is more diverse. Currently, single-camera data is used only for self-supervised pre-training of Re-ID methods. However, the diversity of single-camera data is suppressed by fine-tuning on limited multi-camera data after pre-training. In this paper, we propose ReMix, a generalized Re-ID method jointly trained on a mixture of limited labeled multi-camera and large unlabeled single-camera data. Effective training of our method is achieved through a novel data sampling strategy and new loss functions that are adapted for joint use with both types of data. Experiments show that ReMix has a high generalization ability and outperforms state-of-the-art methods in generalizable person Re-ID. To the best of our knowledge, this is the first work that explores joint training on a mixture of multi-camera and single-camera data in person Re-ID.

[AI-30] LogSHIELD: A Graph-based Real-time Anomaly Detection Framework using Frequency Analysis

链接: https://arxiv.org/abs/2410.21936
作者: Krishna Chandra Roy,Qian Chen
关键词-EN: Anomaly-based cyber threat, cyber threat detection, real-time threat detection, real-time threat, detection
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Anomaly-based cyber threat detection using deep learning is on a constant growth in popularity for novel cyber-attack detection and forensics. A robust, efficient, and real-time threat detector in a large-scale operational enterprise network requires high accuracy, high fidelity, and a high throughput model to detect malicious activities. Traditional anomaly-based detection models, however, suffer from high computational overhead and low detection accuracy, making them unsuitable for real-time threat detection. In this work, we propose LogSHIELD, a highly effective graph-based anomaly detection model in host data. We present a real-time threat detection approach using frequency-domain analysis of provenance graphs. To demonstrate the significance of graph-based frequency analysis we proposed two approaches. Approach-I uses a Graph Neural Network (GNN) LogGNN and approach-II performs frequency domain analysis on graph node samples for graph embedding. Both approaches use a statistical clustering algorithm for anomaly detection. The proposed models are evaluated using a large host log dataset consisting of 774M benign logs and 375K malware logs. LogSHIELD explores the provenance graph to extract contextual and causal relationships among logs, exposing abnormal activities. It can detect stealthy and sophisticated attacks with over 98% average AUC and F1 scores. It significantly improves throughput, achieves an average detection latency of 0.13 seconds, and outperforms state-of-the-art models in detection time.

[AI-31] Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

链接: https://arxiv.org/abs/2410.21926
作者: Halil Utku Unlu,Shuaihang Yuan,Congcong Wen,Hao Huang,Anthony Tzes,Yi Fang
关键词-EN: GLIP Vision Language, advancing semantic understanding, zero-shot object goal, Vision Language Model, enhancing the autonomy
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 16 pages, 7 figures, 2 tables

点击查看摘要

Abstract:We introduce an innovative approach to advancing semantic understanding in zero-shot object goal navigation (ZS-OGN), enhancing the autonomy of robots in unfamiliar environments. Traditional reliance on labeled data has been a limitation for robotic adaptability, which we address by employing a dual-component framework that integrates a GLIP Vision Language Model for initial detection and an InstructionBLIP model for validation. This combination not only refines object and environmental recognition but also fortifies the semantic interpretation, pivotal for navigational decision-making. Our method, rigorously tested in both simulated and real-world settings, exhibits marked improvements in navigation precision and reliability.

[AI-32] Semi-Supervised Self-Learning Enhanced Music Emotion Recognition

链接: https://arxiv.org/abs/2410.21897
作者: Yifu Sun,Xulong Zhang,Monan Zhou,Wei Li
关键词-EN: Music emotion recognition, aims to identify, musical piece, Music emotion, MER
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. But currently in the field of MER, the available public datasets have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. Then, the predicted segment-level results are aggregated to obtain the entire song prediction. The most commonly used method is that segment inherits the label of the clip containing it, but music emotion is not constant during the whole clip. Doing so will introduce label noise and make the training overfit easily. To handle the noisy label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.

[AI-33] Bayesian Optimization for Hyperparameters Tuning in Neural Networks

链接: https://arxiv.org/abs/2410.21886
作者: Gabriele Onorato
关键词-EN: image classification tasks, Convolutional Neural Networks, Bayesian Optimization, enhancement of Convolutional, Convolutional Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: Bachelor Thesis in Optimization for Machine Learning, 57 pages

点击查看摘要

Abstract:This study investigates the application of Bayesian Optimization (BO) for the hyperparameter tuning of neural networks, specifically targeting the enhancement of Convolutional Neural Networks (CNN) for image classification tasks. Bayesian Optimization is a derivative-free global optimization method suitable for expensive black-box functions with continuous inputs and limited evaluation budgets. The BO algorithm leverages Gaussian Process regression and acquisition functions like Upper Confidence Bound (UCB) and Expected Improvement (EI) to identify optimal configurations effectively. Using the Ax and BOTorch frameworks, this work demonstrates the efficiency of BO in reducing the number of hyperparameter tuning trials while achieving competitive model performance. Experimental outcomes reveal that BO effectively balances exploration and exploitation, converging rapidly towards optimal settings for CNN architectures. This approach underlines the potential of BO in automating neural network tuning, contributing to improved accuracy and computational efficiency in machine learning pipelines.

[AI-34] Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms

链接: https://arxiv.org/abs/2410.21882
作者: Feifei Zhao,Hui Feng,Haibo Tong,Zhengqiang Han,Enmeng Lu,Yinqian Sun,Yi Zeng
关键词-EN: closely interacts, crucial to ensure, intrinsic altruistic motivation, empathy, altruistic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As AI closely interacts with human society, it is crucial to ensure that its decision-making is safe, altruistic, and aligned with human ethical and moral values. However, existing research on embedding ethical and moral considerations into AI remains insufficient, and previous external constraints based on principles and rules are inadequate to provide AI with long-term stability and generalization capabilities. In contrast, the intrinsic altruistic motivation based on empathy is more willing, spontaneous, and robust. Therefore, this paper is dedicated to autonomously driving intelligent agents to acquire morally behaviors through human-like affective empathy mechanisms. We draw inspiration from the neural mechanism of human brain’s moral intuitive decision-making, and simulate the mirror neuron system to construct a brain-inspired affective empathy-driven altruistic decision-making model. Here, empathy directly impacts dopamine release to form intrinsic altruistic motivation. Based on the principle of moral utilitarianism, we design the moral reward function that integrates intrinsic empathy and extrinsic self-task goals. A comprehensive experimental scenario incorporating empathetic processes, personal objectives, and altruistic goals is developed. The proposed model enables the agent to make consistent moral decisions (prioritizing altruism) by balancing self-interest with the well-being of others. We further introduce inhibitory neurons to regulate different levels of empathy and verify the positive correlation between empathy levels and altruistic preferences, yielding conclusions consistent with findings from psychological behavioral experiments. This work provides a feasible solution for the development of ethical AI by leveraging the intrinsic human-like empathy mechanisms, and contributes to the harmonious coexistence between humans and AI.

[AI-35] Advancing Efficient Brain Tumor Multi-Class Classification – New Insights from the Vision Mamba Model in Transfer Learning

链接: https://arxiv.org/abs/2410.21872
作者: Yinyi Lai,Anbo Cao,Yuan Gao,Jiaqi Shang,Zongyu Li,Jia Guo
关键词-EN: brain tumor classification, patient survival rates, improving patient survival, brain tumor, tumor classification
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Early and accurate diagnosis of brain tumors is crucial for improving patient survival rates. However, the detection and classification of brain tumors are challenging due to their diverse types and complex morphological characteristics. This study investigates the application of pre-trained models for brain tumor classification, with a particular focus on deploying the Mamba model. We fine-tuned several mainstream transfer learning models and applied them to the multi-class classification of brain tumors. By comparing these models to those trained from scratch, we demonstrated the significant advantages of transfer learning, especially in the medical imaging field, where annotated data is often limited. Notably, we introduced the Vision Mamba (Vim), a novel network architecture, and applied it for the first time in brain tumor classification, achieving exceptional classification accuracy. Experimental results indicate that the Vim model achieved 100% classification accuracy on an independent test set, emphasizing its potential for tumor classification tasks. These findings underscore the effectiveness of transfer learning in brain tumor classification and reveal that, compared to existing state-of-the-art models, the Vim model is lightweight, efficient, and highly accurate, offering a new perspective for clinical applications. Furthermore, the framework proposed in this study for brain tumor classification, based on transfer learning and the Vision Mamba model, is broadly applicable to other medical imaging classification problems.

[AI-36] Cross-Entropy Is All You Need To Invert the Data Generating Process

链接: https://arxiv.org/abs/2410.21869
作者: Patrik Reizinger,Alice Bizeul,Attila Juhos,Julia E. Vogt,Randall Balestriero,Wieland Brendel,David Klindt
关键词-EN: modern machine learning, Independent Component Analysis, effectiveness remains elusive, comprehensive theory explaining, remains elusive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.

[AI-37] Learning Infinitesimal Generators of Continuous Symmetries from Data NEURIPS2024

链接: https://arxiv.org/abs/2410.21853
作者: Gyeonghoon Ko,Hyunsu Kim,Juho Lee
关键词-EN: Exploiting symmetry inherent, Exploiting symmetry, significantly improve, improve the sample, sample efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Neurips 2024

点击查看摘要

Abstract:Exploiting symmetry inherent in data can significantly improve the sample efficiency of a learning procedure and the generalization of learned models. When data clearly reveals underlying symmetry, leveraging this symmetry can naturally inform the design of model architectures or learning strategies. Yet, in numerous real-world scenarios, identifying the specific symmetry within a given data distribution often proves ambiguous. To tackle this, some existing works learn symmetry in a data-driven manner, parameterizing and learning expected symmetry through data. However, these methods often rely on explicit knowledge, such as pre-defined Lie groups, which are typically restricted to linear or affine transformations. In this paper, we propose a novel symmetry learning algorithm based on transformations defined with one-parameter groups, continuously parameterized transformations flowing along the directions of vector fields called infinitesimal generators. Our method is built upon minimal inductive biases, encompassing not only commonly utilized symmetries rooted in Lie groups but also extending to symmetries derived from nonlinear generators. To learn these symmetries, we introduce a notion of a validity score that examine whether the transformed data is still valid for the given task. The validity score is designed to be fully differentiable and easily computable, enabling effective searches for transformations that achieve symmetries innate to the data. We apply our method mainly in two domains: image data and partial differential equations, and demonstrate its advantages. Our codes are available at \urlthis https URL.

[AI-38] Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

链接: https://arxiv.org/abs/2410.21845
作者: Jianlan Luo,Charles Xu,Jeffrey Wu,Sergey Levine
关键词-EN: holds great promise, enabling autonomous acquisition, Reinforcement learning, holds great, great promise
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) holds great promise for enabling autonomous acquisition of complex robotic manipulation skills, but realizing this potential in real-world settings has been challenging. We present a human-in-the-loop vision-based RL system that demonstrates impressive performance on a diverse set of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination. Our approach integrates demonstrations and human corrections, efficient RL algorithms, and other system-level design choices to learn policies that achieve near-perfect success rates and fast cycle times within just 1 to 2.5 hours of training. We show that our method significantly outperforms imitation learning baselines and prior RL approaches, with an average 2x improvement in success rate and 1.8x faster execution. Through extensive experiments and analysis, we provide insights into the effectiveness of our approach, demonstrating how it learns robust, adaptive policies for both reactive and predictive control strategies. Our results suggest that RL can indeed learn a wide range of complex vision-based manipulation policies directly in the real world within practical training times. We hope this work will inspire a new generation of learned robotic manipulation techniques, benefiting both industrial applications and research advancements. Videos and code are available at our project website this https URL.

[AI-39] Diffusion as Reasoning: Enhancing Object Goal Navigation with LLM -Biased Diffusion Model

链接: https://arxiv.org/abs/2410.21842
作者: Yiming Ji,Yang Liu,Zhengpu Wang,Boyu Ma,Zongwu Xie,Hong Liu
关键词-EN: Object Goal Navigation, target object, unseen environment, target, requires the agent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Object Goal Navigation (ObjectNav) task requires the agent to navigate to a specified target in an unseen environment. Since the environment layout is unknown, the agent needs to perform semantic reasoning to infer the potential location of the target, based on its accumulated memory of the environment during the navigation process. Diffusion models have been shown to be able to learn the distribution relationships between features in RGB images, and thus generate new realistic this http URL this work, we propose a new approach to solving the ObjectNav task, by training a diffusion model to learn the statistical distribution patterns of objects in semantic maps, and using the map of the explored regions during navigation as the condition to generate the map of the unknown regions, thereby realizing the semantic reasoning of the target object, i.e., diffusion as reasoning (DAR). Meanwhile, we propose the global target bias and local LLM bias methods, where the former can constrain the diffusion model to generate the target object more effectively, and the latter utilizes the common sense knowledge extracted from the LLM to improve the generalization of the reasoning process. Based on the generated map in the unknown region, the agent sets the predicted location of the target as the goal and moves towards it. Experiments on Gibson and MP3D show the effectiveness of our method.

[AI-40] A Fresh Look at Generalized Category Discovery through Non-negative Matrix Factorization

链接: https://arxiv.org/abs/2410.21807
作者: Zhong Ji,Shuo Yang,Jingren Liu,Yanwei Pang,Jungong Han
关键词-EN: Generalized Category Discovery, labeled base data, Non-Negative Generalized Category, Category Discovery, Generalized Category
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 8 figures,Submitted to IEEE Transactions on Image Processing

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to classify both base and novel images using labeled base data. However, current approaches inadequately address the intrinsic optimization of the co-occurrence matrix \barA based on cosine similarity, failing to achieve zero base-novel regions and adequate sparsity in base and novel domains. To address these deficiencies, we propose a Non-Negative Generalized Category Discovery (NN-GCD) framework. It employs Symmetric Non-negative Matrix Factorization (SNMF) as a mathematical medium to prove the equivalence of optimal K-means with optimal SNMF, and the equivalence of SNMF solver with non-negative contrastive learning (NCL) optimization. Utilizing these theoretical equivalences, it reframes the optimization of \barA and K-means clustering as an NCL optimization problem. Moreover, to satisfy the non-negative constraints and make a GCD model converge to a near-optimal region, we propose a GELU activation function and an NMF NCE loss. To transition \barA from a suboptimal state to the desired \barA^* , we introduce a hybrid sparse regularization approach to impose sparsity constraints. Experimental results show NN-GCD outperforms state-of-the-art methods on GCD benchmarks, achieving an average accuracy of 66.1% on the Semantic Shift Benchmark, surpassing prior counterparts by 4.7%.

[AI-41] xt-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.21802
作者: Lu Yu,Haiyang Zhang,Changsheng Xu
关键词-EN: attracted widespread attention, pre-trained vision-language models, text-guided attention, Attention-based Model Constraint, Attention Refinement module
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g. CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness: The Attention Refinement module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model’s robustness. Additionally, the Attention-based Model Constraint module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. The experiments validate that our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets. Our code is available at this https URL.

[AI-42] Robot Policy Learning with Temporal Optimal Transport Reward NEURIPS2024

链接: https://arxiv.org/abs/2410.21795
作者: Yuwei Fu,Haichao Zhang,Di Wu,Wei Xu,Benoit Boulet
关键词-EN: Temporal Optimal Transport, problems in Reinforcement, Toggle, requires tedious hand, tedious hand engineering
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Reward specification is one of the most tricky problems in Reinforcement Learning, which usually requires tedious hand engineering in practice. One promising approach to tackle this challenge is to adopt existing expert video demonstrations for policy learning. Some recent work investigates how to learn robot policies from only a single/few expert video demonstrations. For example, reward labeling via Optimal Transport (OT) has been shown to be an effective strategy to generate a proxy reward by measuring the alignment between the robot trajectory and the expert demonstrations. However, previous work mostly overlooks that the OT reward is invariant to temporal order information, which could bring extra noise to the reward signal. To address this issue, in this paper, we introduce the Temporal Optimal Transport (TemporalOT) reward to incorporate temporal order information for learning a more accurate OT-based proxy reward. Extensive experiments on the Meta-world benchmark tasks validate the efficacy of the proposed method. Code is available at: this https URL Comments: NeurIPS 2024 Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO) Cite as: arXiv:2410.21795 [cs.AI] (or arXiv:2410.21795v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2410.21795 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Haichao Zhang [view email] [v1] Tue, 29 Oct 2024 07:00:47 UTC (2,710 KB) Full-text links: Access Paper: View a PDF of the paper titled Robot Policy Learning with Temporal Optimal Transport Reward, by Yuwei Fu and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.AI prev | next new | recent | 2024-10 Change to browse by: cs cs.RO References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[AI-43] Inverse Attention Agent for Multi-Agent System

链接: https://arxiv.org/abs/2410.21794
作者: Qian Long,Ruoyan Li,Minglu Zhao,Tao Gao,Demetri Terzopoulos
关键词-EN: Multi-Agent Systems, Systems is enabling, Attention, Inverse Attention, continually change
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:A major challenge for Multi-Agent Systems is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind, implemented algorithmically using an attention mechanism and trained in an end-to-end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent’s final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.

[AI-44] Online Mirror Descent for Tchebycheff Scalarization in Multi-Objective Optimization

链接: https://arxiv.org/abs/2410.21764
作者: Meitong Liu,Xiaoyuan Zhang,Chulin Xie,Kate Donahue,Han Zhao
关键词-EN: potentially conflicting, learn under multiple, goal of multi-objective, multi-objective optimization, Tchebycheff scalarization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 27 pages, 7 figures, 2 tables

点击查看摘要

Abstract:The goal of multi-objective optimization (MOO) is to learn under multiple, potentially conflicting, objectives. One widely used technique to tackle MOO is through linear scalarization, where one fixed preference vector is used to combine the objectives into a single scalar value for optimization. However, recent work (Hu et al., 2024) has shown linear scalarization often fails to capture the non-convex regions of the Pareto Front, failing to recover the complete set of Pareto optimal solutions. In light of the above limitations, this paper focuses on Tchebycheff scalarization that optimizes for the worst-case objective. In particular, we propose an online mirror descent algorithm for Tchebycheff scalarization, which we call OMD-TCH. We show that OMD-TCH enjoys a convergence rate of O(\sqrt\log m/T) where m is the number of objectives and T is the number of iteration rounds. We also propose a novel adaptive online-to-batch conversion scheme that significantly improves the practical performance of OMD-TCH while maintaining the same convergence guarantees. We demonstrate the effectiveness of OMD-TCH and the adaptive conversion scheme on both synthetic problems and federated learning tasks under fairness constraints, showing state-of-the-art performance.

[AI-45] Efficient Reprogramming of Memristive Crossbars for DNNs: Weight Sorting and Bit Stucking

链接: https://arxiv.org/abs/2410.21730
作者: Matheus Farias,H. T. Kung
关键词-EN: deep neural networks, neural networks, number of times, deep neural, times required
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 5 pages, 10 figures

点击查看摘要

Abstract:We introduce a novel approach to reduce the number of times required for reprogramming memristors on bit-sliced compute-in-memory crossbars for deep neural networks (DNNs). Our idea addresses the limited non-volatile memory endurance, which restrict the number of times they can be reprogrammed. To reduce reprogramming demands, we employ two techniques: (1) we organize weights into sorted sections to schedule reprogramming of similar crossbars, maximizing memristor state reuse, and (2) we reprogram only a fraction of randomly selected memristors in low-order columns, leveraging their bit-level distribution and recognizing their relatively small impact on model accuracy. We evaluate our approach for state-of-the-art models on the ImageNet-1K dataset. We demonstrate a substantial reduction in crossbar reprogramming by 3.7x for ResNet-50 and 21x for ViT-Base, while maintaining model accuracy within a 1% margin. Comments: 5 pages, 10 figures Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG) Cite as: arXiv:2410.21730 [cs.AR] (or arXiv:2410.21730v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2410.21730 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-46] Generating Realistic Tabular Data with Large Language Models ICDM2024

链接: https://arxiv.org/abs/2410.21717
作者: Dang Nguyen,Sunil Gupta,Kien Do,Thin Nguyen,Svetha Venkatesh
关键词-EN: tabular data generation, image data generation, data generation, tabular data, generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To appear at ICDM 2024

点击查看摘要

Abstract:While most generative models show achievements in image data generation, few are developed for tabular data generation. Recently, due to success of large language models (LLM) in diverse tasks, they have also been used for tabular data generation. However, these methods do not capture the correct correlation between the features and the target variable, hindering their applications in downstream predictive tasks. To address this problem, we propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. First, we propose a novel permutation strategy for the input data in the fine-tuning phase. Second, we propose a feature-conditional sampling approach to generate synthetic samples. Finally, we generate the labels by constructing prompts based on the generated samples to query our fine-tuned LLM. Our extensive experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks. It also produces highly realistic synthetic samples in terms of quality and diversity. More importantly, classifiers trained with our synthetic data can even compete with classifiers trained with the original data on half of the benchmark datasets, which is a significant achievement in tabular data generation.

[AI-47] AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery

链接: https://arxiv.org/abs/2410.21705
作者: Yuxun Qu,Yongqiang Tang,Chenyang Zhang,Wensheng Zhang
关键词-EN: Generalized Category Discovery, Generalized Category, Category Discovery, traditional semi-supervised learning, semi-supervised learning paradigm
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Different from the traditional semi-supervised learning paradigm that is constrained by the close-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories not appearing in the labeled set, and aims to not only classify old categories but also discover new categories in the unlabeled data. Existing studies on GCD typically devote to transferring the general knowledge from the self-supervised pretrained model to the target GCD task via some fine-tuning strategies, such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to make a sound balance between the generalization capacity of pretrained backbone and the adaptability to the GCD task. To fill this gap, in this paper, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce the adapter tuning into the GCD task and provides some key insights expected to enlighten future research. Furthermore, considering the discrepancy of supervision information between the old and new classes, a multi-expert adapter structure equipped with a route assignment constraint is elaborately devised, such that the data from old and new classes are separated into different expert groups. Extensive experiments are conducted on 7 widely-used datasets. The remarkable improvements in performance highlight the effectiveness of our proposals.

[AI-48] How Does Critical Batch Size Scale in Pre-training?

链接: https://arxiv.org/abs/2410.21676
作者: Hanlin Zhang,Depen Morwani,Nikhil Vyas,Jingfeng Wu,Difan Zou,Udaya Ghai,Dean Foster,Sham Kakade
关键词-EN: resources requires careful, requires careful design, resources requires, greater data parallelism, data parallelism leads
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size, concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control on factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.

[AI-49] BF-Meta: Secure Blockchain-enhanced Privacy-preserving Federated Learning for Metaverse

链接: https://arxiv.org/abs/2410.21675
作者: Wenbo Liu,Handi Chen,Edith C.H. Ngai
关键词-EN: economic activities, revolutionary platform, platform for social, social and economic, posing security
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The metaverse, emerging as a revolutionary platform for social and economic activities, provides various virtual services while posing security and privacy challenges. Wearable devices serve as bridges between the real world and the metaverse. To provide intelligent services without revealing users’ privacy in the metaverse, leveraging federated learning (FL) to train models on local wearable devices is a promising solution. However, centralized model aggregation in traditional FL may suffer from external attacks, resulting in a single point of failure. Furthermore, the absence of incentive mechanisms may weaken users’ participation during FL training, leading to degraded performance of the trained model and reduced quality of intelligent services. In this paper, we propose BF-Meta, a secure blockchain-empowered FL framework with decentralized model aggregation, to mitigate the negative influence of malicious users and provide secure virtual services in the metaverse. In addition, we design an incentive mechanism to give feedback to users based on their behaviors. Experiments conducted on five datasets demonstrate the effectiveness and applicability of BF-Meta.

[AI-50] Knowledge-Guided Prompt Learning for Request Quality Assurance in Public Code Review

链接: https://arxiv.org/abs/2410.21673
作者: Lin Li,Xinchun Yu,Xinyu Chen,Peng Liang
关键词-EN: Software Question Answering, public Software Question, Question Answering, Software Question, Public Code Review
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 28 pages, 7 images, 12 tables, Manuscript submitted to a journal (2024)

点击查看摘要

Abstract:Public Code Review (PCR) is an assistant to the internal code review of the development team, in the form of a public Software Question Answering (SQA) community, to help developers access high-quality and efficient review services. Current methods on PCR mainly focus on the reviewer’s perspective, including finding a capable reviewer, predicting comment quality, and recommending/generating review comments. However, it is not well studied that how to satisfy the review necessity requests posted by developers which can increase their visibility, which in turn acts as a prerequisite for better review responses. To this end, we propose a Knowledge-guided Prompt learning for Public Code Review (KP-PCR) to achieve developer-based code review request quality assurance (i.e., predicting request necessity and recommending tags subtask). Specifically, we reformulate the two subtasks via 1) text prompt tuning which converts both of them into a Masked Language Model (MLM) by constructing prompt templates using hard prompt; 2) knowledge and code prefix tuning which introduces external knowledge by soft prompt, and uses data flow diagrams to characterize code snippets. Finally, both of the request necessity prediction and tag recommendation subtasks output predicted results through an answer engineering module. In addition, we further analysis the time complexity of our KP-PCR that has lightweight prefix based the operation of introducing knowledge. Experimental results on the PCR dataset for the period 2011-2023 demonstrate that our KP-PCR outperforms baselines by 8.3%-28.8% in the request necessity prediction and by 0.1%-29.5% in the tag recommendation. The code implementation is released at this https URL.

[AI-51] RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis

链接: https://arxiv.org/abs/2410.21641
作者: Kehan Sui,Jinxu Xiang,Fang Jin
关键词-EN: Singing voice synthesis, produce high-fidelity singing, voice synthesis, aims to produce, music scores
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores, requiring a detailed understanding of notes, pitch, and duration, unlike text-to-speech tasks. Although diffusion models have shown exceptional performance in various generative tasks like image and video creation, their application in SVS is hindered by time complexity and the challenge of capturing acoustic features, particularly during pitch transitions. Some networks learn from the prior distribution and use the compressed latent state as a better start in the diffusion model, but the denoising step doesn’t consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion image network that maintains intricate appearance features from reference images. RDSinger utilizes FastSpeech2 mel-spectrogram as a reference to mitigate denoising step artifacts. Additionally, existing models could be influenced by misleading information on the compressed latent state during pitch transitions. We address this issue by applying Gaussian blur on partial reference mel-spectrogram and adjusting loss weights in these regions. Extensive ablation studies demonstrate the efficiency of our method. Evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods in performance.

[AI-52] Asynchronous Tool Usage for Real-Time Agents

链接: https://arxiv.org/abs/2410.21620
作者: Antonio A. Ginart,Naveen Kodali,Jason Lee,Caiming Xiong,Silvio Savarese,John Emmons
关键词-EN: large language models, strict turn-based fashion, frontier large language, language models, turn-based fashion
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While frontier large language models (LLMs) are capable tool-using agents, current AI systems still operate in a strict turn-based fashion, oblivious to passage of time. This synchronous design forces user queries and tool-use to occur sequentially, preventing the systems from multitasking and reducing interactivity. To address this limitation, we introduce asynchronous AI agents capable of parallel processing and real-time tool-use. Our key contribution is an event-driven finite-state machine architecture for agent execution and prompting, integrated with automatic speech recognition and text-to-speech. Drawing inspiration from the concepts originally developed for real-time operating systems, this work presents both a conceptual framework and practical tools for creating AI agents capable of fluid, multitasking interactions.

[AI-53] Identifying Selections for Unsupervised Subtask Discovery NEURIPS2024

链接: https://arxiv.org/abs/2410.21616
作者: Yiwen Qiu,Yujia Zheng,Kun Zhang
关键词-EN: solving long-horizon tasks, solving long-horizon, intriguing to decompose, decompose the high-level, subtasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: NeurIPS 2024

点击查看摘要

Abstract:When solving long-horizon tasks, it is intriguing to decompose the high-level task into subtasks. Decomposing experiences into reusable subtasks can improve data efficiency, accelerate policy generalization, and in general provide promising solutions to multi-task reinforcement learning and imitation learning problems. However, the concept of subtasks is not sufficiently understood and modeled yet, and existing works often overlook the true structure of the data generation process: subtasks are the results of a \textitselection mechanism on actions, rather than possible underlying confounders or intermediates. Specifically, we provide a theory to identify, and experiments to verify the existence of selection variables in such data. These selections serve as subgoals that indicate subtasks and guide policy. In light of this idea, we develop a sequential non-negative matrix factorization (seq- NMF) method to learn these subgoals and extract meaningful behavior patterns as subtasks. Our empirical results on a challenging Kitchen environment demonstrate that the learned subtasks effectively enhance the generalization to new tasks in multi-task imitation learning scenarios. The codes are provided at this https URL_Selections_for_Unsupervised_Subtask_Discovery/README.md.

[AI-54] ImageNet-RIB Benchmark: Large Pre-Training Datasets Dont Guarantee Robustness after Fine-Tuning

链接: https://arxiv.org/abs/2410.21582
作者: Jaedong Hwang,Brian Cheung,Zhang-Wei Hong,Akhilan Boopathy,Pulkit Agrawal,Ila Fiete
关键词-EN: Highly performant large-scale, Highly performant, performant large-scale pre-trained, fine-tuning, Robustness Inheritance Benchmark
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Highly performant large-scale pre-trained models promise to also provide a valuable foundation for learning specialized tasks, by fine-tuning the model to the desired task. By starting from a good general-purpose model, the goal is to achieve both specialization in the target task and maintain robustness. To assess the robustness of models to out-of-distribution samples after fine-tuning on downstream datasets, we introduce a new robust fine-tuning benchmark, ImageNet-RIB (Robustness Inheritance Benchmark). The benchmark consists of a set of related but distinct specialized (downstream) tasks; pre-trained models are fine-tuned on one task in the set and their robustness is assessed on the rest, iterating across all tasks for fine-tuning and assessment. We find that the continual learning methods, EWC and LwF maintain robustness after fine-tuning though fine-tuning generally does reduce performance on generalization to related downstream tasks across models. Not surprisingly, models pre-trained on large and rich datasets exhibit higher initial robustness across datasets and suffer more pronounced degradation during fine-tuning. The distance between the pre-training and downstream datasets, measured by optimal transport, predicts this performance degradation on the pre-training dataset. However, counterintuitively, model robustness after fine-tuning on related downstream tasks is the worst when the pre-training dataset is the richest and the most diverse. This suggests that starting with the strongest foundation model is not necessarily the best approach for performance on specialist tasks. The benchmark thus offers key insights for developing more resilient fine-tuning strategies and building robust machine learning models. this https URL

[AI-55] A Generative Model Based Honeypot for Industrial OPC UA Communication

链接: https://arxiv.org/abs/2410.21574
作者: Olaf Sassnick,Georg Schäfer,Thomas Rosenstatter,Stefan Huber
关键词-EN: Industrial Operational Technology, Operational Technology, Information Technology, integration with Information, Technology
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is accepted and will be published in Computer Aided Systems Theory - EUROCAST 2024

点击查看摘要

Abstract:Industrial Operational Technology (OT) systems are increasingly targeted by cyber-attacks due to their integration with Information Technology (IT) systems in the Industry 4.0 era. Besides intrusion detection systems, honeypots can effectively detect these attacks. However, creating realistic honeypots for brownfield systems is particularly challenging. This paper introduces a generative model-based honeypot designed to mimic industrial OPC UA communication. Utilizing a Long ShortTerm Memory (LSTM) network, the honeypot learns the characteristics of a highly dynamic mechatronic system from recorded state space trajectories. Our contributions are twofold: first, we present a proof-of concept for a honeypot based on generative machine-learning models, and second, we publish a dataset for a cyclic industrial process. The results demonstrate that a generative model-based honeypot can feasibly replicate a cyclic industrial process via OPC UA communication. In the short-term, the generative model indicates a stable and plausible trajectory generation, while deviations occur over extended periods. The proposed honeypot implementation operates efficiently on constrained hardware, requiring low computational resources. Future work will focus on improving model accuracy, interaction capabilities, and extending the dataset for broader applications.

[AI-56] Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization

链接: https://arxiv.org/abs/2410.21564
作者: Juyoung Yun
关键词-EN: Residual Networks, vanishing gradient problem, deep networks, Networks, proven effective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In deep learning, Residual Networks (ResNets) have proven effective in addressing the vanishing gradient problem, allowing for the successful training of very deep networks. However, skip connections in ResNets can lead to gradient overlap, where gradients from both the learned transformation and the skip connection combine, potentially resulting in overestimated gradients. This overestimation can cause inefficiencies in optimization, as some updates may overshoot optimal regions, affecting weight updates. To address this, we examine Z-score Normalization (ZNorm) as a technique to manage gradient overlap. ZNorm adjusts the gradient scale, standardizing gradients across layers and reducing the negative impact of overlapping gradients. Our experiments demonstrate that ZNorm improves training process, especially in non-convex optimization scenarios common in deep learning, where finding optimal solutions is challenging. These findings suggest that ZNorm can affect the gradient flow, enhancing performance in large-scale data processing where accuracy is critical.

[AI-57] Going Beyond HE and Oncology: How Do Histopathology Foundation Models Perform for Multi-stain IHC and Immunology? NEURIPS2024

链接: https://arxiv.org/abs/2410.21560
作者: Amaya Gallagher-Syed,Elena Pontarini,Myles J. Lewis,Michael R. Barnes,Gregory Slabaugh
关键词-EN: multi-stain autoimmune Immunohistochemistry, autoimmune Immunohistochemistry datasets, Immunohistochemistry datasets, Rheumatoid Arthritis subtyping, Sjogren Disease detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
*备注: Accepted at Workshop on Advancements In Medical Foundation Models (NeurIPS 2024)

点击查看摘要

Abstract:This study evaluates the generalisation capabilities of state-of-the-art histopathology foundation models on out-of-distribution multi-stain autoimmune Immunohistochemistry datasets. We compare 13 feature extractor models, including ImageNet-pretrained networks, and histopathology foundation models trained on both public and proprietary data, on Rheumatoid Arthritis subtyping and Sjogren’s Disease detection tasks. Using a simple Attention-Based Multiple Instance Learning classifier, we assess the transferability of learned representations from cancer HE images to autoimmune IHC images. Contrary to expectations, histopathology-pretrained models did not significantly outperform ImageNet-pretrained models. Furthermore, there was evidence of both autoimmune feature misinterpretation and biased feature importance. Our findings highlight the challenges in transferring knowledge from cancer to autoimmune histopathology and emphasise the need for careful evaluation of AI models across diverse histopathological tasks. The code to run this benchmark is available at this https URL.

[AI-58] Bayesian Regression for Predicting Subscription to Bank Term Deposits in Direct Marketing Campaigns

链接: https://arxiv.org/abs/2410.21539
作者: Muhammad Farhan Tanvir,Md Maruf Hossain,Md Asifuzzaman Jishan
关键词-EN: highly competitive environment, highly competitive, competitive environment, essential to precisely, precisely forecast
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the highly competitive environment of the banking industry, it is essential to precisely forecast the behavior of customers in order to maximize the effectiveness of marketing initiatives and improve financial consequences. The purpose of this research is to examine the efficacy of logit and probit models in predicting term deposit subscriptions using a Portuguese bank’s direct marketing data. There are several demographic, economic, and behavioral characteristics in the dataset that affect the probability of subscribing. To increase model performance and provide an unbiased evaluation, the target variable was balanced, considering the inherent imbalance in the dataset. The two model’s prediction abilities were evaluated using Bayesian techniques and Leave-One-Out Cross-Validation (LOO-CV). The logit model performed better than the probit model in handling this classification problem. The results highlight the relevance of model selection when dealing with complicated decision-making processes in the financial services industry and imbalanced datasets. Findings from this study shed light on how banks can optimize their decision-making processes, improve their client segmentation, and boost their marketing campaigns by utilizing machine learning models.

[AI-59] A Multi-Agent Reinforcement Learning Testbed for Cognitive Radio Applications

链接: https://arxiv.org/abs/2410.21521
作者: Sriniketh Vangaru,Daniel Rosen,Dylan Green,Raphael Rodriguez,Maxwell Wiecek,Amos Johnson,Alyse M. Jones,William C. Headley
关键词-EN: Radio Frequency Reinforcement, Technological trends show, Radio Frequency, Frequency Reinforcement Learning, RFRL Gym
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
*备注: Accepted to IEEE CCNC 2025

点击查看摘要

Abstract:Technological trends show that Radio Frequency Reinforcement Learning (RFRL) will play a prominent role in the wireless communication systems of the future. Applications of RFRL range from military communications jamming to enhancing WiFi networks. Before deploying algorithms for these purposes, they must be trained in a simulation environment to ensure adequate performance. For this reason, we previously created the RFRL Gym: a standardized, accessible tool for the development and testing of reinforcement learning (RL) algorithms in the wireless communications space. This environment leveraged the OpenAI Gym framework and featured customizable simulation scenarios within the RF spectrum. However, the RFRL Gym was limited to training a single RL agent per simulation; this is not ideal, as most real-world RF scenarios will contain multiple intelligent agents in cooperative, competitive, or mixed settings, which is a natural consequence of spectrum congestion. Therefore, through integration with Ray RLlib, multi-agent reinforcement learning (MARL) functionality for training and assessment has been added to the RFRL Gym, making it even more of a robust tool for RF spectrum simulation. This paper provides an overview of the updated RFRL Gym environment. In this work, the general framework of the tool is described relative to comparable existing resources, highlighting the significant additions and refactoring we have applied to the Gym. Afterward, results from testing various RF scenarios in the MARL environment and future additions are discussed.

[AI-60] Sabotage Evaluations for Frontier Models

链接: https://arxiv.org/abs/2410.21514
作者: Joe Benton,Misha Wagner,Eric Christiansen,Cem Anil,Ethan Perez,Jai Srivastav,Esin Durmus,Deep Ganguli,Shauna Kravec,Buck Shlegeris,Jared Kaplan,Holden Karnofsky,Evan Hubinger,Roger Grosse,Samuel R. Bowman,David Duvenaud
关键词-EN: Sufficiently capable models, subvert human oversight, Sufficiently capable, subvert human, human oversight
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization’s activities in any of these ways. We demonstrate these evaluations on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.

[AI-61] owards Multi-dimensional Explanation Alignment for Medical Classification NEURIPS2024

链接: https://arxiv.org/abs/2410.21494
作者: Lijie Hu,Songning Lai,Wenshuo Chen,Hongru Xiao,Hongbin Lin,Lu Yu,Jingfeng Zhang,Di Wang
关键词-EN: medical image analysis, legal implications, Medical Multi-dimensional Interpretable, image analysis, analysis has significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:The lack of interpretability in the field of medical image analysis has significant ethical and legal implications. Existing interpretable methods in this domain encounter several challenges, including dependency on specific models, difficulties in understanding and visualization, as well as issues related to efficiency. To address these limitations, we propose a novel framework called Med-MICN (Medical Multi-dimensional Interpretable Concept Network). Med-MICN provides interpretability alignment for various angles, including neural symbolic reasoning, concept semantics, and saliency maps, which are superior to current interpretable methods. Its advantages include high prediction accuracy, interpretability across multiple dimensions, and automation through an end-to-end concept labeling process that reduces the need for extensive human training effort when working with new datasets. To demonstrate the effectiveness and interpretability of Med-MICN, we apply it to four benchmark datasets and compare it with baselines. The results clearly demonstrate the superior performance and interpretability of our Med-MICN.

[AI-62] rustworthiness of Stochastic Gradient Descent in Distributed Learning

链接: https://arxiv.org/abs/2410.21491
作者: Hongyang Li,Caesar Wu,Mohammed Chadli,Said Mammar,Pascal Bouvry
关键词-EN: leverages multiple nodes, compressed SGD, SGD, leverages multiple, accelerate training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Distributed learning (DL) leverages multiple nodes to accelerate training, enabling the efficient optimization of large-scale models. Stochastic Gradient Descent (SGD), a key optimization algorithm, plays a central role in this process. However, communication bottlenecks often limit scalability and efficiency, leading to the increasing adoption of compressed SGD techniques to alleviate these challenges. Despite addressing communication overheads, compressed SGD introduces trustworthiness concerns, as gradient exchanges among nodes are vulnerable to attacks like gradient inversion (GradInv) and membership inference attacks (MIA). The trustworthiness of compressed SGD remains underexplored, leaving important questions about its reliability unanswered. In this paper, we provide a trustworthiness evaluation of compressed versus uncompressed SGD. Specifically, we conduct empirical studies using GradInv attacks, revealing that compressed SGD demonstrates significantly higher resistance to privacy leakage compared to uncompressed SGD. Moreover, our findings suggest that MIA may not be a reliable metric for assessing privacy risks in machine learning. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2410.21491 [cs.LG] (or arXiv:2410.21491v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.21491 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-63] Enhancing CTR Prediction in Recommendation Domain with Search Query Representation CIKM2024

链接: https://arxiv.org/abs/2410.21487
作者: Yuening Wang,Man Chen,Yaochen Hu,Wei Guo,Yingxue Zhang,Huifeng Guo,Yong Liu,Mark Coates
关键词-EN: recommendation services simultaneously, recommendation domain, meet users’ diverse, recommendation, recommendation services
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by CIKM 2024 Full Research Track

点击查看摘要

Abstract:Many platforms, such as e-commerce websites, offer both search and recommendation services simultaneously to better meet users’ diverse needs. Recommendation services suggest items based on user preferences, while search services allow users to search for items before providing recommendations. Since users and items are often shared between the search and recommendation domains, there is a valuable opportunity to enhance the recommendation domain by leveraging user preferences extracted from the search domain. Existing approaches either overlook the shift in user intention between these domains or fail to capture the significant impact of learning from users’ search queries on understanding their interests. In this paper, we propose a framework that learns from user search query embeddings within the context of user preferences in the recommendation domain. Specifically, user search query sequences from the search domain are used to predict the items users will click at the next time point in the recommendation domain. Additionally, the relationship between queries and items is explored through contrastive learning. To address issues of data sparsity, the diffusion model is incorporated to infer positive items the user will select after searching with certain queries in a denoising manner, which is particularly effective in preventing false positives. Effectively extracting this information, the queries are integrated into click-through rate prediction in the recommendation domain. Experimental analysis demonstrates that our model outperforms state-of-the-art models in the recommendation domain. Comments: Accepted by CIKM 2024 Full Research Track Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2410.21487 [cs.IR] (or arXiv:2410.21487v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.21487 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: CIKM (2024) 2462-2471 Related DOI: https://doi.org/10.1145/3627673.3679849 Focus to learn more DOI(s) linking to related resources

[AI-64] Knowledge Distillation for Real-Time Classification of Early Media in Voice Communications

链接: https://arxiv.org/abs/2410.21478
作者: Kemal Altwlkany,Hadžem Hadžić,Amar Kurić,Emanuel Lacic
关键词-EN: early media exchanged, classification of early, early media, paper investigates, investigates the industrial
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper investigates the industrial setting of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight some limitations when applied to the classification of early media. While most existing approaches leverage convolutional neural networks, we propose a novel approach for low-resource requirements based on gradient-boosted trees. Our approach not only demonstrates a substantial improvement in runtime performance, but also exhibits a comparable accuracy. We show that leveraging knowledge distillation and class aggregation techniques to train a simpler and smaller model accelerates the classification of early media in voice calls. We provide a detailed analysis of the results on a proprietary and publicly available dataset, regarding accuracy and runtime performance. We additionally report a case study of the achieved performance improvements at a regional data center in India.

[AI-65] AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

链接: https://arxiv.org/abs/2410.21471
作者: Yaopei Zeng,Yuanpu Cao,Bochuan Cao,Yurui Chang,Jinghui Chen,Lu Lin
关键词-EN: Recent advances, generate NSFW content, NSFW content, safety concerns, Safe for Work
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.

[AI-66] ACO: Adversarial Camouflage Optimization on Trucks to Fool Object Detectors

链接: https://arxiv.org/abs/2410.21443
作者: Adonisz Dimitriu,Tamás Michaletzky,Viktor Remeli
关键词-EN: Adversarial attacks threaten, machine learning models, defense systems, attacks threaten, threaten the reliability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial attacks threaten the reliability of machine learning models in critical applications like autonomous vehicles and defense systems. As object detectors become more robust with models like YOLOv8, developing effective adversarial methodologies is increasingly challenging. We present Truck Adversarial Camouflage Optimization (TACO), a novel framework that generates adversarial camouflage patterns on 3D vehicle models to deceive state-of-the-art object detectors. Adopting Unreal Engine 5, TACO integrates differentiable rendering with a Photorealistic Rendering Network to optimize adversarial textures targeted at YOLOv8. To ensure the generated textures are both effective in deceiving detectors and visually plausible, we introduce the Convolutional Smooth Loss function, a generalized smooth loss function. Experimental evaluations demonstrate that TACO significantly degrades YOLOv8’s detection performance, achieving an AP@0.5 of 0.0099 on unseen test data. Furthermore, these adversarial patterns exhibit strong transferability to other object detection models such as Faster R-CNN and earlier YOLO versions.

[AI-67] Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding ICRA2025

链接: https://arxiv.org/abs/2410.21415
作者: He Jiang,Yutong Wang,Rishi Veerapaneni,Tanishq Duhan,Guillaume Sartoretti,Jiaoyang Li
关键词-EN: Lifelong Multi-Agent Path, Multi-Agent Path Finding, Path Finding, necessitating frequent re-planning, Lifelong Multi-Agent
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Lifelong Multi-Agent Path Finding (LMAPF) is a variant of MAPF where agents are continually assigned new goals, necessitating frequent re-planning to accommodate these dynamic changes. Recently, this field has embraced learning-based methods, which reactively generate single-step actions based on individual local observations. However, it is still challenging for them to match the performance of the best search-based algorithms, especially in large-scale settings. This work proposes an imitation-learning-based LMAPF solver that introduces a novel communication module and systematic single-step collision resolution and global guidance techniques. Our proposed solver, Scalable Imitation Learning for LMAPF (SILLM), inherits the fast reasoning speed of learning-based methods and the high solution quality of search-based methods with the help of modern GPUs. Across six large-scale maps with up to 10,000 agents and varying obstacle structures, SILLM surpasses the best learning- and search-based baselines, achieving average throughput improvements of 137.7% and 16.0%, respectively. Furthermore, SILLM also beats the winning solution of the 2023 League of Robot Runners, an international LMAPF competition sponsored by Amazon Robotics. Finally, we validated SILLM with 10 real robots and 100 virtual robots in a mockup warehouse environment.

[AI-68] Exploring reinforcement learning for incident response in autonomous military vehicles

链接: https://arxiv.org/abs/2410.21407
作者: Henrik Madsen,Gudmund Grov,Federico Mancini,Magnus Baksaas,Åvald Åslaugson Sommervoll
关键词-EN: conduct advanced operations, conduct advanced, human intervention, fast pace, unmanned ground vehicle
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: DIGILIENCE 2024

点击查看摘要

Abstract:Unmanned vehicles able to conduct advanced operations without human intervention are being developed at a fast pace for many purposes. Not surprisingly, they are also expected to significantly change how military operations can be conducted. To leverage the potential of this new technology in a physically and logically contested environment, security risks are to be assessed and managed accordingly. Research on this topic points to autonomous cyber defence as one of the capabilities that may be needed to accelerate the adoption of these vehicles for military purposes. Here, we pursue this line of investigation by exploring reinforcement learning to train an agent that can autonomously respond to cyber attacks on unmanned vehicles in the context of a military operation. We first developed a simple simulation environment to quickly prototype and test some proof-of-concept agents for an initial evaluation. This agent was then applied to a more realistic simulation environment and finally deployed on an actual unmanned ground vehicle for even more realism. A key contribution of our work is demonstrating that reinforcement learning is a viable approach to train an agent that can be used for autonomous cyber defence on a real unmanned ground vehicle, even when trained in a simple simulation environment.

[AI-69] Unveiling the Role of Expert Guidance: A Comparative Analysis of User-centered Imitation Learning and Traditional Reinforcement Learning

链接: https://arxiv.org/abs/2410.21403
作者: Amr Gomaa,Bilal Mahdy
关键词-EN: human feedback plays, Integration of human, plays a key, key role, role in improving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as CEUR Workshop Proceedings in Proceedings of the 1st International Workshop on Human-in-the-Loop Applied Machine Learning (HITLAML 2023). Awarded Best Paper. this https URL

点击查看摘要

Abstract:Integration of human feedback plays a key role in improving the learning capabilities of intelligent systems. This comparative study delves into the performance, robustness, and limitations of imitation learning compared to traditional reinforcement learning methods within these systems. Recognizing the value of human-in-the-loop feedback, we investigate the influence of expert guidance and suboptimal demonstrations on the learning process. Through extensive experimentation and evaluations conducted in a pre-existing simulation environment using the Unity platform, we meticulously analyze the effectiveness and limitations of these learning approaches. The insights gained from this study contribute to the advancement of human-centered artificial intelligence by highlighting the benefits and challenges associated with the incorporation of human feedback into the learning process. Ultimately, this research promotes the development of models that can effectively address complex real-world problems.

[AI-70] LinFormer: A Linear-based Lightweight Transformer Architecture For Time-Aware MIMO Channel Prediction

链接: https://arxiv.org/abs/2410.21351
作者: Yanliang Jin,Yifan Wu,Yuan Gao,Shunqing Zhang,Shugong Xu,Cheng-Xiang Wang
关键词-EN: mobile networks brings, supporting high-mobility communications, mobile networks, supporting high-mobility, addressing the issue
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The emergence of 6th generation (6G) mobile networks brings new challenges in supporting high-mobility communications, particularly in addressing the issue of channel aging. While existing channel prediction methods offer improved accuracy at the expense of increased computational complexity, limiting their practical application in mobile networks. To address these challenges, we present LinFormer, an innovative channel prediction framework based on a scalable, all-linear, encoder-only Transformer model. Our approach, inspired by natural language processing (NLP) models such as BERT, adapts an encoder-only architecture specifically for channel prediction tasks. We propose replacing the computationally intensive attention mechanism commonly used in Transformers with a time-aware multi-layer perceptron (TMLP), significantly reducing computational demands. The inherent time awareness of TMLP module makes it particularly suitable for channel prediction tasks. We enhance LinFormer’s training process by employing a weighted mean squared error loss (WMSELoss) function and data augmentation techniques, leveraging larger, readily available communication datasets. Our approach achieves a substantial reduction in computational complexity while maintaining high prediction accuracy, making it more suitable for deployment in cost-effective base stations (BS). Comprehensive experiments using both simulated and measured data demonstrate that LinFormer outperforms existing methods across various mobility scenarios, offering a promising solution for future wireless communication systems.

[AI-71] FALCON: Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization system

链接: https://arxiv.org/abs/2410.21349
作者: Zeyuan Li,Yangfan He,Lewei He,Jianhui Wang,Tianyu Shi,Bin Lei,Yuchen Li,Qiuwu Chen
关键词-EN: achieved significant progress, large language models, large language, achieved significant, significant progress
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Recently, large language models (LLMs) have achieved significant progress in automated code generation. Despite their strong instruction-following capabilities, these models frequently struggled to align with user intent in coding scenarios. In particular, they were hampered by datasets that lacked diversity and failed to address specialized tasks or edge cases. Furthermore, challenges in supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) led to failures in generating precise, human-intent-aligned code. To tackle these challenges and improve the code generation performance for automated programming systems, we propose Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization (i.e., FALCON). FALCON is structured into two hierarchical levels. From the global level, long-term memory improves code quality by retaining and applying learned knowledge. At the local level, short-term memory allows for the incorporation of immediate feedback from compilers and AI systems. Additionally, we introduce meta-reinforcement learning with feedback rewards to solve the global-local bi-level optimization problem and enhance the model’s adaptability across diverse code generation tasks. Extensive experiments demonstrate that our technique achieves state-of-the-art performance, leading other reinforcement learning methods by more than 4.5 percentage points on the MBPP benchmark and 6.1 percentage points on the Humaneval benchmark. The open-sourced code is publicly available at this https URL.

[AI-72] owards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach

链接: https://arxiv.org/abs/2410.21346
作者: Firas Bayram,Bestoun S. Ahmed
关键词-EN: Artificial intelligence, Machine Learning Operations, Machine Learning, impacting the daily, daily lives
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI), and especially its sub-field of Machine Learning (ML), are impacting the daily lives of everyone with their ubiquitous applications. In recent years, AI researchers and practitioners have introduced principles and guidelines to build systems that make reliable and trustworthy decisions. From a practical perspective, conventional ML systems process historical data to extract the features that are consequently used to train ML models that perform the desired task. However, in practice, a fundamental challenge arises when the system needs to be operationalized and deployed to evolve and operate in real-life environments continuously. To address this challenge, Machine Learning Operations (MLOps) have emerged as a potential recipe for standardizing ML solutions in deployment. Although MLOps demonstrated great success in streamlining ML processes, thoroughly defining the specifications of robust MLOps approaches remains of great interest to researchers and practitioners. In this paper, we provide a comprehensive overview of the trustworthiness property of MLOps systems. Specifically, we highlight technical practices to achieve robust MLOps systems. In addition, we survey the existing research approaches that address the robustness aspects of ML systems in production. We also review the tools and software available to build MLOps systems and summarize their support to handle the robustness aspects. Finally, we present the open challenges and propose possible future directions and opportunities within this emerging field. The aim of this paper is to provide researchers and practitioners working on practical AI applications with a comprehensive view to adopt robust ML solutions in production environments.

[AI-73] Heterogeneous Interaction Modeling With Reduced Accumulated Error for Multi-Agent Trajectory Prediction

链接: https://arxiv.org/abs/2410.21342
作者: Siyuan Chen,Jiahai Wang
关键词-EN: Dynamical complex systems, including urban traffic, urban traffic systems, complex systems composed, heterogeneous interaction modeling
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 20 pages, accepted by IEEE TNNLS

点击查看摘要

Abstract:Dynamical complex systems composed of interactive heterogeneous agents are prevalent in the world, including urban traffic systems and social networks. Modeling the interactions among agents is the key to understanding and predicting the dynamics of the complex system, e.g., predicting the trajectories of traffic participants in the city. Compared with interaction modeling in homogeneous systems such as pedestrians in a crowded scene, heterogeneous interaction modeling is less explored. Worse still, the error accumulation problem becomes more severe since the interactions are more complex. To tackle the two problems, this paper proposes heterogeneous interaction modeling with reduced accumulated error for multi-agent trajectory prediction. Based on the historical trajectories, our method infers the dynamic interaction graphs among agents, featured by directed interacting relations and interacting effects. A heterogeneous attention mechanism is defined on the interaction graphs for aggregating the influence from heterogeneous neighbors to the target agent. To alleviate the error accumulation problem, this paper analyzes the error sources from the spatial and temporal perspectives, and proposes to introduce the graph entropy and the mixup training strategy for reducing the two types of errors respectively. Our method is examined on three real-world datasets containing heterogeneous agents, and the experimental results validate the superiority of our method.

[AI-74] Retrieval-Retro: Retrieval-based Inorganic Retrosynthesis with Expert Knowledge NEURIPS2024

链接: https://arxiv.org/abs/2410.21341
作者: Heewoong Noh,Namkyeong Lee,Gyoung S. Na,Chanyoung Park
关键词-EN: inorganic retrosynthesis planning, retrosynthesis planning, inorganic retrosynthesis, chemical science, application of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:While inorganic retrosynthesis planning is essential in the field of chemical science, the application of machine learning in this area has been notably less explored compared to organic retrosynthesis planning. In this paper, we propose Retrieval-Retro for inorganic retrosynthesis planning, which implicitly extracts the precursor information of reference materials that are retrieved from the knowledge base regarding domain expertise in the field. Specifically, instead of directly employing the precursor information of reference materials, we propose implicitly extracting it with various attention layers, which enables the model to learn novel synthesis recipes more effectively. Moreover, during retrieval, we consider the thermodynamic relationship between target material and precursors, which is essential domain expertise in identifying the most probable precursor set among various options. Extensive experiments demonstrate the superiority of Retrieval-Retro in retrosynthesis planning, especially in discovering novel synthesis recipes, which is crucial for materials discovery. The source code for Retrieval-Retro is available at this https URL.

[AI-75] Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments

链接: https://arxiv.org/abs/2410.21340
作者: Yuzhe Yang,Yipeng Du,Ahmad Farhan,Claudio Angione,Yue Zhao,Harry Yang,Fielding Johnston,James Buban,Patrick Colangelo
关键词-EN: sophisticated image generation, large language models, incurs substantial costs, substantial costs due, image generation systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The deployment of large-scale models, such as large language models (LLMs) and sophisticated image generation systems, incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to scalability and data security, there is a growing shift towards decentralized systems for deploying such models. In these decentralized environments, efficient inference acceleration becomes crucial to manage computational resources effectively and enhance system responsiveness. In this work, we address the challenge of selecting optimal acceleration methods in decentralized systems by introducing a meta-learning-based framework. This framework automates the selection process by learning from historical performance data of various acceleration techniques across different tasks. Unlike traditional methods that rely on random selection or expert intuition, our approach systematically identifies the best acceleration strategies based on the specific characteristics of each task. We demonstrate that our meta-learning framework not only streamlines the decision-making process but also consistently outperforms conventional methods in terms of efficiency and performance. Our results highlight the potential of meta-learning to revolutionize inference acceleration in decentralized AI systems, offering a path towards more democratic and economically feasible artificial intelligence solutions.

[AI-76] E(3)-invaraint diffusion model for pocket-aware peptide generation

链接: https://arxiv.org/abs/2410.21335
作者: Po-Yu Liang,Jun Bai
关键词-EN: understanding biological processes, frequently desire protein, Biologists frequently desire, variety of reasons, problems in agriculture
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Biologists frequently desire protein inhibitors for a variety of reasons, including use as research tools for understanding biological processes and application to societal problems in agriculture, healthcare, etc. Immunotherapy, for instance, relies on immune checkpoint inhibitors to block checkpoint proteins, preventing their binding with partner proteins and boosting immune cell function against abnormal cells. Inhibitor discovery has long been a tedious process, which in recent years has been accelerated by computational approaches. Advances in artificial intelligence now provide an opportunity to make inhibitor discovery smarter than ever before. While extensive research has been conducted on computer-aided inhibitor discovery, it has mainly focused on either sequence-to-structure mapping, reverse mapping, or bio-activity prediction, making it unrealistic for biologists to utilize such tools. Instead, our work proposes a new method of computer-assisted inhibitor discovery: de novo pocket-aware peptide structure and sequence generation network. Our approach consists of two sequential diffusion models for end-to-end structure generation and sequence prediction. By leveraging angle and dihedral relationships between backbone atoms, we ensure an E(3)-invariant representation of peptide structures. Our results demonstrate that our method achieves comparable performance to state-of-the-art models, highlighting its potential in pocket-aware peptide design. This work offers a new approach for precise drug discovery using receptor-specific peptide generation.

[AI-77] Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

链接: https://arxiv.org/abs/2410.21331
作者: Qi Zhang,Yifei Wang,Jingyi Cui,Xiang Pan,Qi Lei,Stefanie Jegelka,Yisen Wang
关键词-EN: Deep learning models, multiple unrelated semantics, Deep learning, due to polysemanticity, resulting in unclear
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios-including input and label noise, few-shot learning, and out-of-domain generalization-our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis of linking interpretability and robustness. Code is available at \urlthis https URL.

[AI-78] Deconfounding Time Series Forecasting

链接: https://arxiv.org/abs/2410.21328
作者: Wentao Gao,Feiyu Yang,Mengze Hong,Xiaojing Du,Zechen Hu,Xiongren Chen,Ziqi Xu
关键词-EN: drive informed decision-making, informed decision-making, critical task, accurate predictions, predictions can drive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series forecasting is a critical task in various domains, where accurate predictions can drive informed decision-making. Traditional forecasting methods often rely on current observations of variables to predict future outcomes, typically overlooking the influence of latent confounders, unobserved variables that simultaneously affect both the predictors and the target outcomes. This oversight can introduce bias and degrade the performance of predictive models. In this study, we address this challenge by proposing an enhanced forecasting approach that incorporates representations of latent confounders derived from historical data. By integrating these confounders into the predictive process, our method aims to improve the accuracy and robustness of time series forecasts. The proposed approach is demonstrated through its application to climate science data, showing significant improvements over traditional methods that do not account for confounders.

[AI-79] Self-Supervised Learning and Opportunistic Inference for Continuous Monitoring of Freezing of Gait in Parkinsons Disease

链接: https://arxiv.org/abs/2410.21326
作者: Shovito Barua Soumma,Kartik Mangipudi,Daniel Peterson,Shyamal Mehta,Hassan Ghasemzadeh
关键词-EN: Freezing of Gait, progressive neurological disorder, Parkinson disease, making in-home monitoring, life significantly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages

点击查看摘要

Abstract:Parkinson’s disease (PD) is a progressive neurological disorder that impacts the quality of life significantly, making in-home monitoring of motor symptoms such as Freezing of Gait (FoG) critical. However, existing symptom monitoring technologies are power-hungry, rely on extensive amounts of labeled data, and operate in controlled settings. These shortcomings limit real-world deployment of the technology. This work presents LIFT-PD, a computationally-efficient self-supervised learning framework for real-time FoG detection. Our method combines self-supervised pre-training on unlabeled data with a novel differential hopping windowing technique to learn from limited labeled instances. An opportunistic model activation module further minimizes power consumption by selectively activating the deep learning module only during active periods. Extensive experimental results show that LIFT-PD achieves a 7.25% increase in precision and 4.4% improvement in accuracy compared to supervised models while using as low as 40% of the labeled training data used for supervised learning. Additionally, the model activation module reduces inference time by up to 67% compared to continuous inference. LIFT-PD paves the way for practical, energy-efficient, and unobtrusive in-home monitoring of PD patients with minimal labeling requirements.

[AI-80] Just Propagate: Unifying Matrix Factorization Network Embedding and LightGCN for Link Prediction

链接: https://arxiv.org/abs/2410.21325
作者: Haoxin Liu
关键词-EN: Link prediction, Link, fundamental task, link prediction methods, graph analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Link prediction is a fundamental task in graph analysis. Despite the success of various graph-based machine learning models for link prediction, there lacks a general understanding of different models. In this paper, we propose a unified framework for link prediction that covers matrix factorization and representative network embedding and graph neural network methods. Our preliminary methodological and empirical analyses further reveal several key design factors based on our unified framework. We believe our results could deepen our understanding and inspire novel designs for link prediction methods.

[AI-81] Angel or Devil: Discriminating Hard Samples and Anomaly Contaminations for Unsupervised Time Series Anomaly Detection

链接: https://arxiv.org/abs/2410.21322
作者: Ruyi Zhang,Hongzuo Xu,Songlei Jian,Yusong Tan,Haifang Zhou,Rulin Xu
关键词-EN: unsupervised time series, time series anomaly, discrimination between harmful, contaminations’ and beneficial, unsupervised time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Training in unsupervised time series anomaly detection is constantly plagued by the discrimination between harmful anomaly contaminations' and beneficial hard normal samples’. These two samples exhibit analogous loss behavior that conventional loss-based methodologies struggle to differentiate. To tackle this problem, we propose a novel approach that supplements traditional loss behavior with `parameter behavior’, enabling a more granular characterization of anomalous patterns. Parameter behavior is formalized by measuring the parametric response to minute perturbations in input samples. Leveraging the complementary nature of parameter and loss behaviors, we further propose a dual Parameter-Loss Data Augmentation method (termed PLDA), implemented within the reinforcement learning paradigm. During the training phase of anomaly detection, PLDA dynamically augments the training data through an iterative process that simultaneously mitigates anomaly contaminations while amplifying informative hard normal samples. PLDA demonstrates remarkable versatility, which can serve as an additional component that seamlessly integrated with existing anomaly detectors to enhance their detection performance. Extensive experiments on ten datasets show that PLDA significantly improves the performance of four distinct detectors by up to 8%, outperforming three state-of-the-art data augmentation methods.

[AI-82] owards Continuous Skin Sympathetic Nerve Activity Monitoring: Removing Muscle Noise ALT

链接: https://arxiv.org/abs/2410.21319
作者: Farnoush Baghestani,Mahdi Pirayesh Shirazi Nejad,Youngsun Kong,Ki H. Chon
关键词-EN: sympathetic nerve activity, sympathetic nervous system, non-invasive skin sympathetic, skin sympathetic nerve, nerve activity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: 4 pages, 5 figures, 1 table, IEEE-EMBS International Conference on Body Sensor Networks: NextGen Health: Sensor Innovation, AI, and Social Responsibility (IEEE BSN 2024)

点击查看摘要

Abstract:Continuous monitoring of non-invasive skin sympathetic nerve activity (SKNA) holds promise for understanding the sympathetic nervous system (SNS) dynamics in various physiological and pathological conditions. However, muscle noise artifacts present a challenge in accurate SKNA analysis, particularly in real-life scenarios. This study proposes a deep convolutional neural network (CNN) approach to detect and remove muscle noise from SKNA recordings obtained via ECG electrodes. Twelve healthy participants underwent controlled experimental protocols involving cognitive stress induction and voluntary muscle movements, while collecting SKNA data. Power spectral analysis revealed significant muscle noise interference within the SKNA frequency band (500-1000 Hz). A 2D CNN model was trained on the spectrograms of the data segments to classify them into baseline, stress-induced SKNA, and muscle noise-contaminated periods, achieving an average accuracy of 89.85% across all subjects. Our findings underscore the importance of addressing muscle noise for accurate SKNA monitoring, advancing towards wearable SKNA sensors for real-world applications.

[AI-83] Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval

链接: https://arxiv.org/abs/2410.21318
作者: Bin Kang,Bin Chen,Junjie Wang,Yong Xu
关键词-EN: Text-based person retrieval, descriptions as queries, Text-based person, aims to identify, identify the specific
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-based person retrieval aims to identify the specific persons using textual descriptions as queries. Existing ad vanced methods typically depend on vision-language pre trained (VLP) models to facilitate effective cross-modal alignment. However, the inherent constraints of VLP mod-els, which include the global alignment biases and insuffi-cient self-feedback regulation, impede optimal retrieval per formance. In this paper, we propose MeFa, a Multi-Pathway Exploration, Feedback, and Adjustment framework, which deeply explores intrinsic feedback of intra and inter-modal to make targeted adjustment, thereby achieving more precise person-text associations. Specifically, we first design an intra modal reasoning pathway that generates hard negative sam ples for cross-modal data, leveraging feedback from these samples to refine intra-modal reasoning, thereby enhancing sensitivity to subtle discrepancies. Subsequently, we intro duce a cross-modal refinement pathway that utilizes both global information and intermodal feedback to refine local in formation, thus enhancing its global semantic representation. Finally, the discriminative clue correction pathway incorpo rates fine-grained features of secondary similarity as discrim inative clues to further mitigate retrieval failures caused by disparities in these features. Experimental results on three public benchmarks demonstrate that MeFa achieves superior person retrieval performance without necessitating additional data or complex structures.

[AI-84] Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

链接: https://arxiv.org/abs/2410.21316
作者: Avinash Maurya,Jie Ye,M. Mustafa Rafique,Franck Cappello,Bogdan Nicolae
关键词-EN: large language models, GPU memory, large language, rapid adoption, optimizer state
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Transformers and large language models~(LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a ``memory wall’', i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward and update phases generate fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement \proj, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5 \times faster iterations over state-of-the-art approaches using extensive experiments.

[AI-85] owards Robust Out-of-Distribution Generalization: Data Augmentation and Neural Architecture Search Approaches

链接: https://arxiv.org/abs/2410.21313
作者: Haoyue Bai
关键词-EN: recent years, Deep learning, demonstrated with tremendous, tremendous success, success in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Hong Kong University of Science and Technology Thesis

点击查看摘要

Abstract:Deep learning has been demonstrated with tremendous success in recent years. Despite so, its performance in practice often degenerates drastically when encountering out-of-distribution (OoD) data, i.e. training and test data are sampled from different distributions. In this thesis, we study ways toward robust OoD generalization for deep learning, i.e., its performance is not susceptible to distribution shift in the test data. We first propose a novel and effective approach to disentangle the spurious correlation between features that are not essential for recognition. It employs decomposed feature representation by orthogonalizing the two gradients of losses for category and context branches. Furthermore, we perform gradient-based augmentation on context-related features (e.g., styles, backgrounds, or scenes of target objects) to improve the robustness of learned representations. Results show that our approach generalizes well for different distribution shifts. We then study the problem of strengthening neural architecture search in OoD scenarios. We propose to optimize the architecture parameters that minimize the validation loss on synthetic OoD data, under the condition that corresponding network parameters minimize the training loss. Moreover, to obtain a proper validation set, we learn a conditional generator by maximizing their losses computed by different neural architectures. Results show that our approach effectively discovers robust architectures that perform well for OoD generalization. Comments: Hong Kong University of Science and Technology Thesis Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2410.21313 [cs.CV] (or arXiv:2410.21313v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.21313 Focus to learn more arXiv-issued DOI via DataCite

[AI-86] MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

链接: https://arxiv.org/abs/2410.21311
作者: Fengbin Zhu,Ziyang Liu,Xiang Yao Ng,Haohui Wu,Wenjie Wang,Fuli Feng,Chao Wang,Huanbo Luan,Tat Seng Chua
关键词-EN: Large Vision-Language Models, remain insufficiently evaluated, achieved remarkable performance, Large Vision-Language, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks, yet their capabilities in fine-grained visual understanding remain insufficiently evaluated. Existing benchmarks either contain limited fine-grained evaluation samples that are mixed with other data, or are confined to object-level assessments in natural images. To holistically assess LVLMs’ fine-grained visual understanding capabilities, we propose using document images with multi-granularity and multi-modal information to supplement natural images. In this light, we construct MMDocBench, a benchmark with various OCR-free document understanding tasks for the evaluation of fine-grained visual perception and reasoning abilities. MMDocBench defines 15 main tasks with 4,338 QA pairs and 11,353 supporting regions, covering various document images such as research papers, receipts, financial reports, Wikipedia tables, charts, and infographics. Based on MMDocBench, we conduct extensive experiments using 13 open-source and 3 proprietary advanced LVLMs, assessing their strengths and weaknesses across different tasks and document image types. The benchmark, task instructions, and evaluation code will be made publicly available.

[AI-87] VEMOCLAP: A video emotion classification web application

链接: https://arxiv.org/abs/2410.21303
作者: Serkan Sulun,Paula Viana,Matthew E. P. Davies
关键词-EN: Video EMOtion Classifier, introduce VEMOCLAP, EMOtion Classifier, Pretrained features, resulting pretrained features
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注: Accepted to 2024 IEEE International Symposium on Multimedia (ISM), Tokyo, Japan

点击查看摘要

Abstract:We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve our previous work, which exploits open-source pretrained models that work on video frames and audio, and then efficiently fuse the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or YouTube videos. We invite the readers to try our application at this http URL.

[AI-88] Explainable Artificial Intelligent (XAI) for Predicting Asphalt Concrete Stiffness and Rutting Resistance: Integrating Baileys Aggregate Gradation Method

链接: https://arxiv.org/abs/2410.21298
作者: Warat Kongkitkul,Sompote Youwai,Siwipa Khamsoy,Manaswee Feungfung
关键词-EN: wheel track tests, explainable artificial intelligence, employs explainable artificial, artificial intelligence, techniques to analyze
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The link to web app this https URL this https URL

点击查看摘要

Abstract:This study employs explainable artificial intelligence (XAI) techniques to analyze the behavior of asphalt concrete with varying aggregate gradations, focusing on resilience modulus (MR) and dynamic stability (DS) as measured by wheel track tests. The research utilizes a deep learning model with a multi-layer perceptron architecture to predict MR and DS based on aggregate gradation parameters derived from Bailey’s Method, including coarse aggregate ratio (CA), fine aggregate coarse ratio (FAc), and other mix design variables. The model’s performance was validated using k-fold cross-validation, demonstrating superior accuracy compared to alternative machine learning approaches. SHAP (SHapley Additive exPlanations) values were applied to interpret the model’s predictions, providing insights into the relative importance and impact of different gradation characteristics on asphalt concrete performance. Key findings include the identification of critical aggregate size thresholds, particularly the 0.6 mm sieve size, which significantly influences both MR and DS. The study revealed size-dependent performance of aggregates, with coarse aggregates primarily affecting rutting resistance and medium-fine aggregates influencing stiffness. The research also highlighted the importance of aggregate lithology in determining rutting resistance. To facilitate practical application, web-based interfaces were developed for predicting MR and DS, incorporating explainable features to enhance transparency and interpretation of results. This research contributes a data-driven approach to understanding the complex relationships between aggregate gradation and asphalt concrete performance, potentially informing more efficient and performance-oriented mix design processes in the future.

[AI-89] Large-scale Multi-objective Feature Selection: A Multi-phase Search Space Shrinking Approach

链接: https://arxiv.org/abs/2410.21293
作者: Azam Asilian Bidgoli,Shahryar Rahnamayan
关键词-EN: increase computational costs, machine learning, Feature selection, search space, crucial step
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature selection is a crucial step in machine learning, especially for high-dimensional datasets, where irrelevant and redundant features can degrade model performance and increase computational costs. This paper proposes a novel large-scale multi-objective evolutionary algorithm based on the search space shrinking, termed LMSSS, to tackle the challenges of feature selection particularly as a sparse optimization problem. The method includes a shrinking scheme to reduce dimensionality of the search space by eliminating irrelevant features before the main evolutionary process. This is achieved through a ranking-based filtering method that evaluates features based on their correlation with class labels and frequency in an initial, cost-effective evolutionary process. Additionally, a smart crossover scheme based on voting between parent solutions is introduced, giving higher weight to the parent with better classification accuracy. An intelligent mutation process is also designed to target features prematurely excluded from the population, ensuring they are evaluated in combination with other features. These integrated techniques allow the evolutionary process to explore the search space more efficiently and effectively, addressing the sparse and high-dimensional nature of large-scale feature selection problems. The effectiveness of the proposed algorithm is demonstrated through comprehensive experiments on 15 large-scale datasets, showcasing its potential to identify more accurate feature subsets compared to state-of-the-art large-scale feature selection algorithms. These results highlight LMSSS’s capability to improve model performance and computational efficiency, setting a new benchmark in the field.

[AI-90] A Systematic Assessment of OpenAI o1-Preview for Higher Order Thinking in Education

链接: https://arxiv.org/abs/2410.21287
作者: Ehsan Latif,Yifan Zhou,Shuchen Guo,Yizhu Gao,Lehong Shi,Matthew Nayaaba,Gyeonggeon Lee,Liang Zhang,Arne Bewersdorff,Luyang Fang,Xiantong Yang,Huaqin Zhao,Hanqi Jiang,Haoran Lu,Jiaxi Li,Jichao Yu,Weihang You,Zhengliang Liu,Vincent Shung Liu,Hui Wang,Zihao Wu,Jin Lu,Fei Dou,Ping Ma,Ninghao Liu,Tianming Liu,Xiaoming Zhai
关键词-EN: demonstrates capabilities comparable, reasoning, abstract reasoning, thinking, critical thinking
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: An assessment of OpenAI o1-Preview for Higher Order Thinking in Education

点击查看摘要

Abstract:As artificial intelligence (AI) continues to advance, it demonstrates capabilities comparable to human intelligence, with significant potential to transform education and workforce development. This study evaluates OpenAI o1-preview’s ability to perform higher-order cognitive tasks across 14 dimensions, including critical thinking, systems thinking, computational thinking, design thinking, metacognition, data literacy, creative thinking, abstract reasoning, quantitative reasoning, logical reasoning, analogical reasoning, and scientific reasoning. We used validated instruments like the Ennis-Weir Critical Thinking Essay Test and the Biological Systems Thinking Test to compare the o1-preview’s performance with human performance systematically. Our findings reveal that o1-preview outperforms humans in most categories, achieving 150% better results in systems thinking, computational thinking, data literacy, creative thinking, scientific reasoning, and abstract reasoning. However, compared to humans, it underperforms by around 25% in logical reasoning, critical thinking, and quantitative reasoning. In analogical reasoning, both o1-preview and humans achieved perfect scores. Despite these strengths, the o1-preview shows limitations in abstract reasoning, where human psychology students outperform it, highlighting the continued importance of human oversight in tasks requiring high-level abstraction. These results have significant educational implications, suggesting a shift toward developing human skills that complement AI, such as creativity, abstract reasoning, and critical thinking. This study emphasizes the transformative potential of AI in education and calls for a recalibration of educational goals, teaching methods, and curricula to align with an AI-driven world.

[AI-91] OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents

链接: https://arxiv.org/abs/2410.21286
作者: Yuwei Yan,Qingbin Zeng,Zhiheng Zheng,Jingzhe Yuan,Jie Feng,Jun Zhang,Fengli Xu,Yong Li
关键词-EN: complex societal phenomena, Agent-based models, individual behaviors aggregate, Large Language Models, long been employed
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Agent-based models (ABMs) have long been employed to explore how individual behaviors aggregate into complex societal phenomena in urban space. Unlike black-box predictive models, ABMs excel at explaining the micro-macro linkages that drive such emergent behaviors. The recent rise of Large Language Models (LLMs) has led to the development of LLM agents capable of simulating urban activities with unprecedented realism. However, the extreme high computational cost of LLMs presents significant challenges for scaling up the simulations of LLM agents. To address this problem, we propose OpenCity, a scalable simulation platform optimized for both system and prompt efficiencies. Specifically, we propose a LLM request scheduler to reduce communication overhead by parallelizing requests through IO multiplexing. Besides, we deisgn a “group-and-distill” prompt optimization strategy minimizes redundancy by clustering agents with similar static attributes. Through experiments on six global cities, OpenCity achieves a 600-fold acceleration in simulation time per agent, a 70% reduction in LLM requests, and a 50% reduction in token usage. These improvements enable the simulation of 10,000 agents’ daily activities in 1 hour on commodity hardware. Besides, the substantial speedup of OpenCity allows us to establish a urban simulation benchmark for LLM agents for the first time, comparing simulated urban activities with real-world data in 6 major cities around the globe. We believe our OpenCity platform provides a critical infrastructure to harness the power of LLMs for interdisciplinary studies in urban space, fostering the collective efforts of broader research communities. Code repo is available at this https URL.

[AI-92] AI-driven innovation in medicaid: enhancing access cost efficiency and population health management

链接: https://arxiv.org/abs/2410.21284
作者: Balaji Shesharao Ingole,Vishnu Ramineni,Manjunatha Sughaturu Krishnappa,Vivekananda Jayaram
关键词-EN: include rapidly increasing, uneven care accessibility, experiencing critical challenges, rapidly increasing healthcare, increasing healthcare costs
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The U.S. Medicaid program is experiencing critical challenges that include rapidly increasing healthcare costs, uneven care accessibility, and the challenge associated with addressing a varied set of population health needs. This paper investigates the transformative potential of Artificial Intelligence (AI) in reshaping Medicaid by streamlining operations, improving patient results, and lowering costs. We delve into the pivotal role of AI in predictive analytics, care coordination, the detection of fraud, and personalized medicine. By leveraging insights from advanced data models and addressing challenges particular to Medicaid, we put forward AI-driven solutions that prioritize equitable care and improved public health outcomes. This study underscores the urgency of integrating AI into Medicaid to not only improve operational effectiveness but also to create a more accessible and equitable healthcare system for all beneficiaries.

[AI-93] Logic Error Localization in Student Programming Assignments Using Pseudocode and Graph Neural Networks

链接: https://arxiv.org/abs/2410.21282
作者: Zhenyu Xu,Kun Zhang,Victor S. Sheng
关键词-EN: utilizing natural language, define algorithmic behaviors, instruct computer science, computer science students, logic errors
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Pseudocode is extensively used in introductory programming courses to instruct computer science students in algorithm design, utilizing natural language to define algorithmic behaviors. This learning approach enables students to convert pseudocode into source code and execute it to verify their algorithms’ correctness. This process typically introduces two types of errors: syntax errors and logic errors. Syntax errors are often accompanied by compiler feedback, which helps students identify incorrect lines. In contrast, logic errors are more challenging because they do not trigger compiler errors and lack immediate diagnostic feedback, making them harder to detect and correct. To address this challenge, we developed a system designed to localize logic errors within student programming assignments at the line level. Our approach utilizes pseudocode as a scaffold to build a code-pseudocode graph, connecting symbols from the source code to their pseudocode counterparts. We then employ a graph neural network to both localize and suggest corrections for logic errors. Additionally, we have devised a method to efficiently gather logic-error-prone programs during the syntax error correction process and compile these into a dataset that includes single and multiple line logic errors, complete with indices of the erroneous lines. Our experimental results are promising, demonstrating a localization accuracy of 99.2% for logic errors within the top-10 suspected lines, highlighting the effectiveness of our approach in enhancing students’ coding proficiency and error correction skills.

[AI-94] he Social Impact of Generative LLM -Based AI

链接: https://arxiv.org/abs/2410.21281
作者: Yu Xie,Sofia Avila
关键词-EN: Artificial Intelligence, dominate economic production, phase of human, human history, dominate economic
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 34 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Liking it or not, ready or not, we are likely to enter a new phase of human history in which Artificial Intelligence (AI) will dominate economic production and social life – the AI Revolution. Before the actual arrival of the AI Revolution, it is time for us to speculate on how AI will impact the social world. In this article, we focus on the social impact of generative LLM-based AI (GELLMAI), discussing societal factors that contribute to its technological development and its potential roles in enhancing both between-country and within-country social inequality. There are good indications that the US and China will lead the field and will be the main competitors for domination of AI in the world. We conjecture the AI Revolution will likely give rise to a post-knowledge society in which knowledge per se will become less important than in today’s world. Instead, individual relationships and social identity will become more important. So will soft skills.

[AI-95] Comparative Global AI Regulation: Policy Perspectives from the EU China and the US

链接: https://arxiv.org/abs/2410.21279
作者: Jon Chun,Christian Schroeder de Witt,Katherine Elkins
关键词-EN: advancing dual-use technology, rapidly advancing dual-use, dual-use technology, worrisome risks, powerful and rapidly
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 36 pages, 11 figures and tables

点击查看摘要

Abstract:As a powerful and rapidly advancing dual-use technology, AI offers both immense benefits and worrisome risks. In response, governing bodies around the world are developing a range of regulatory AI laws and policies. This paper compares three distinct approaches taken by the EU, China and the US. Within the US, we explore AI regulation at both the federal and state level, with a focus on California’s pending Senate Bill 1047. Each regulatory system reflects distinct cultural, political and economic perspectives. Each also highlights differing regional perspectives on regulatory risk-benefit tradeoffs, with divergent judgments on the balance between safety versus innovation and cooperation versus competition. Finally, differences between regulatory frameworks reflect contrastive stances in regards to trust in centralized authority versus trust in a more decentralized free market of self-interested stakeholders. Taken together, these varied approaches to AI innovation and regulation influence each other, the broader international community, and the future of AI regulation.

[AI-96] Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression NEURIPS2024

链接: https://arxiv.org/abs/2410.19400
作者: Yixiu Mao,Qi Wang,Chen Chen,Yun Qu,Xiangyang Ji
关键词-EN: OOD state correction, offline reinforcement learning, OOD state, OOD, OOD state issue
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.

[AI-97] Grounded GUI Understanding for Vision Based Spatial Intelligent Agent : Exemplified by Virtual Reality Apps

链接: https://arxiv.org/abs/2409.10811
作者: Shuqing Li,Binchang Li,Yepang Liu,Cuiyun Gao,Jianping Zhang,Shing-Chi Cheung,Michael R. Lyu
关键词-EN: offering users immersive, diversified virtual environments, GUI, GUI element, GUI ElemeNT dEtection
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, spatial computing Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to VR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.

[AI-98] Less Cybersickness Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps

链接: https://arxiv.org/abs/2406.09313
作者: Shuqing Li,Cuiyun Gao,Jianping Zhang,Yujia Zhang,Yepang Liu,Jiazhen Gu,Yun Peng,Michael R. Lyu
关键词-EN: Graphical User Interface, Virtual Reality, quality of Virtual, SVI issues, User Interface
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: This work has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, Brazil. DOI: this https URL

点击查看摘要

Abstract:The quality of Virtual Reality (VR) apps is vital, particularly the rendering quality of the VR Graphical User Interface (GUI). Different from traditional 2D apps, VR apps create a 3D digital scene for users, by rendering two distinct 2D images for the user’s left and right eyes, respectively. Stereoscopic visual inconsistency (denoted as “SVI”) issues, however, undermine the rendering process of the user’s brain, leading to user discomfort and even adverse health effects. Such issues commonly exist but remain underexplored. We conduct an empirical analysis on 282 SVI bug reports from 15 VR platforms, summarizing 15 types of manifestations. The empirical analysis reveals that automatically detecting SVI issues is challenging, mainly because: (1) lack of training data; (2) the manifestations of SVI issues are diverse, complicated, and often application-specific; (3) most accessible VR apps are closed-source commercial software. Existing pattern-based supervised classification approaches may be inapplicable or ineffective in detecting the SVI issues. To counter these challenges, we propose an unsupervised black-box testing framework named StereoID to identify the stereoscopic visual inconsistencies, based only on the rendered GUI states. StereoID generates a synthetic right-eye image based on the actual left-eye image and computes distances between the synthetic right-eye image and the actual right-eye image to detect SVI issues. We propose a depth-aware conditional stereo image translator to power the image generation process, which captures the expected perspective shifts between left-eye and right-eye images. We build a large-scale unlabeled VR stereo screenshot dataset with larger than 171K images from 288 real-world VR apps for experiments. After substantial experiments, StereoID demonstrates superior performance for detecting SVI issues in both user reports and wild VR apps.

[AI-99] Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation

链接: https://arxiv.org/abs/2410.22271
作者: Davide Berghi,Philip J. B. Jackson
关键词-EN: Audiovisual Sound Event, Sound Event Localization, Source Distance Estimation, Event Localization, Localization and Detection
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B). Our main model is based on the audio-visual (AV) Conformer, which processes video and audio embeddings extracted with ResNet50 and with an audio encoder pre-trained on SELD, respectively. This model outperformed the audio-visual baseline of the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving the F1 by more than 3x. Our second system performs a temporal ensemble from the outputs of the AV-Conformer. We then extended the model with features for distance estimation, such as direct and reverberant signal components extracted from the omnidirectional audio channel, and depth maps extracted from the video frames. While the new system improved the RDE of our previous model by about 3 percentage points, it achieved a lower F1 score. This may be caused by sound classes that rarely appear in the training set and that the more complex system does not detect, as analysis can determine. To overcome this problem, our fourth and final system consists of an ensemble strategy combining the predictions of the other three. Many opportunities to refine the system and training strategy can be tested in future ablation experiments, and likely achieve incremental performance gains for this audio-visual task.

[AI-100] Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding ICASSP2025

链接: https://arxiv.org/abs/2410.21951
作者: Bohan Li,Hankun Wang,Situo Zhang,Yiwei Guo,Kai Yu
关键词-EN: auto-regressive architecture, accelerate auto-regressive TTS, speech tokens, auto-regressive TTS, Abstract
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 3 figures, 3 tables. Submitted to ICASSP 2025

点击查看摘要

Abstract:The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.

[AI-101] Differentiable Inductive Logic Programming for Fraud Detection

链接: https://arxiv.org/abs/2410.21928
作者: Boris Wolfson,Erman Acar
关键词-EN: Machine Learning prefer, Current trends, Learning prefer explainability, trends in Machine, Fraud Detection
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current trends in Machine Learning prefer explainability even when it comes at the cost of performance. Therefore, explainable AI methods are particularly important in the field of Fraud Detection. This work investigates the applicability of Differentiable Inductive Logic Programming (DILP) as an explainable AI approach to Fraud Detection. Although the scalability of DILP is a well-known issue, we show that with some data curation such as cleaning and adjusting the tabular and numerical data to the expected format of background facts statements, it becomes much more applicable. While in processing it does not provide any significant advantage on rather more traditional methods such as Decision Trees, or more recent ones like Deep Symbolic Classification, it still gives comparable results. We showcase its limitations and points to improve, as well as potential use cases where it can be much more useful compared to traditional methods, such as recursive rule learning.

[AI-102] On the Statistical Complexity of Estimating VENDI Scores from Empirical Data

链接: https://arxiv.org/abs/2410.21719
作者: Azim Ospanov,Farzan Farnia
关键词-EN: machine learning community, VENDI score, Reference-free evaluation metrics, truncated VENDI statistic, VENDI
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reference-free evaluation metrics for generative models have recently been studied in the machine learning community. As a reference-free metric, the VENDI score quantifies the diversity of generative models using matrix-based entropy from information theory. The VENDI score is usually computed through the eigendecomposition of an n \times n kernel matrix for n generated samples. However, due to the high computational cost of eigendecomposition for large n , the score is often computed on sample sizes limited to a few tens of thousands. In this paper, we explore the statistical convergence of the VENDI score and demonstrate that for kernel functions with an infinite feature map dimension, the evaluated score for a limited sample size may not converge to the matrix-based entropy statistic. We introduce an alternative statistic called the t -truncated VENDI statistic. We show that the existing Nyström method and the FKEA approximation method for the VENDI score will both converge to the defined truncated VENDI statistic given a moderate sample size. We perform several numerical experiments to illustrate the concentration of the empirical VENDI score around the truncated VENDI statistic and discuss how this statistic correlates with the visual diversity of image data.

[AI-103] PACE: Physics Informed Uncertainty Aware Climate Emulator

链接: https://arxiv.org/abs/2410.21657
作者: Hira Saleem,Flora Salim,Cormac Purcell
关键词-EN: future climate scenarios, projecting future climate, serve as critical, critical tools, tools for evaluating
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Climate models serve as critical tools for evaluating the effects of climate change and projecting future climate scenarios. However, the reliance on numerical simulations of physical equations renders them computationally intensive and inefficient. While deep learning methodologies have made significant progress in weather forecasting, they are still unstable for climate emulation tasks. Here, we propose PACE, a lightweight 684K parameter Physics Informed Uncertainty Aware Climate Emulator. PACE emulates temperature and precipitation stably for 86 years while only being trained on greenhouse gas emissions data. We incorporate a fundamental physical law of advection-diffusion in PACE accounting for boundary conditions and empirically estimating the diffusion co-efficient and flow velocities from emissions data. PACE has been trained on 15 climate models provided by ClimateSet outperforming baselines across most of the climate models and advancing a new state of the art in a climate diagnostic task.

[AI-104] A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation

链接: https://arxiv.org/abs/2410.21640
作者: Si-Ioi Ng,Lingfeng Xu,Ingo Siegert,Nicholas Cummins,Nina R. Benway,Julie Liss,Visar Berisha
关键词-EN: speech, wide spectrum, clinical, clinical speech, speech-based Artificial Intelligence
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 76 pages, 24 figures

点击查看摘要

Abstract:There has been a surge of interest in leveraging speech as a marker of health for a wide spectrum of conditions. The underlying premise is that any neurological, mental, or physical deficits that impact speech production can be objectively assessed via automated analysis of speech. Recent advances in speech-based Artificial Intelligence (AI) models for diagnosing and tracking mental health, cognitive, and motor disorders often use supervised learning, similar to mainstream speech technologies like recognition and verification. However, clinical speech AI has distinct challenges, including the need for specific elicitation tasks, small available datasets, diverse speech representations, and uncertain diagnostic labels. As a result, application of the standard supervised learning paradigm may lead to models that perform well in controlled settings but fail to generalize in real-world clinical deployments. With translation into real-world clinical scenarios in mind, this tutorial paper provides an overview of the key components required for robust development of clinical speech AI. Specifically, this paper will cover the design of speech elicitation tasks and protocols most appropriate for different clinical conditions, collection of data and verification of hardware, development and validation of speech representations designed to measure clinical constructs of interest, development of reliable and robust clinical prediction models, and ethical and participant considerations for clinical speech AI. The goal is to provide comprehensive guidance on building models whose inputs and outputs link to the more interpretable and clinically meaningful aspects of speech, that can be interrogated and clinically validated on clinical datasets, and that adhere to ethical, privacy, and security considerations by design.

[AI-105] Absorb Escape: Overcoming Single Model Limitations in Generating Genomic Sequences NEURIPS2024

链接: https://arxiv.org/abs/2410.21345
作者: Zehui Li,Yuhao Ni,Guoxuan Xia,William Beardall,Akashaditya Das,Guy-Bart Stan,Yiren Zhao
关键词-EN: Abstract Recent advances, Abstract Recent, Recent advances, DNA sequence design, deep generative methods
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Abstract Recent advances in immunology and synthetic biology have accelerated the development of deep generative methods for DNA sequence design. Two dominant approaches in this field are AutoRegressive (AR) models and Diffusion Models (DMs). However, genomic sequences are functionally heterogeneous, consisting of multiple connected regions (e.g., Promoter Regions, Exons, and Introns) where elements within each region come from the same probability distribution, but the overall sequence is non-homogeneous. This heterogeneous nature presents challenges for a single model to accurately generate genomic sequences. In this paper, we analyze the properties of AR models and DMs in heterogeneous genomic sequence generation, pointing out crucial limitations in both methods: (i) AR models capture the underlying distribution of data by factorizing and learning the transition probability but fail to capture the global property of DNA sequences. (ii) DMs learn to recover the global distribution but tend to produce errors at the base pair level. To overcome the limitations of both approaches, we propose a post-training sampling method, termed Absorb Escape (AE) to perform compositional generation from AR models and DMs. This approach starts with samples generated by DMs and refines the sample quality using an AR model through the alternation of the Absorb and Escape steps. To assess the quality of generated sequences, we conduct extensive experiments on 15 species for conditional and unconditional DNA generation. The experiment results from motif distribution, diversity checks, and genome integration tests unequivocally show that AE outperforms state-of-the-art AR models and DMs in genomic sequence generation.

[AI-106] Evaluating the Posterior Sampling Ability of PlugPlay Diffusion Methods in Sparse-View CT

链接: https://arxiv.org/abs/2410.21301
作者: Liam Moroy,Guillaume Bourmaud,Frédéric Champagnat,Jean-François Giovannelli
关键词-EN: computed tomography, posterior, diffusion models, Abstract, PlugPlay
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:PlugPlay (PnP) diffusion models are state-of-the-art methods in computed tomography (CT) reconstruction. Such methods usually consider applications where the sinogram contains a sufficient amount of information for the posterior distribution to be peaked, and consequently are evaluated using image-to-image metrics such as PSNR/SSIM. Instead, we are interested in reconstructing compressible flow images from sinograms having a small number of projections, which results in a posterior distribution no longer peaked or even multimodal. Thus, in this paper, we aim at evaluating the approximate posterior of PnP diffusion models and introduce two posterior evaluation criteria. We quantitatively evaluate three PnP diffusion methods on three different datasets for several numbers of projections. We surprisingly find that, for each method, the approximate posterior deviates from the true posterior when the number of projections decreases.

[AI-107] pLDDT-Predictor: High-speed Protein Screening Using Transformer and ESM2

链接: https://arxiv.org/abs/2410.21283
作者: Joongwon Chae,Zhenyu Wang,Peiwu Qin
关键词-EN: achieving near-experimental accuracy, Recent advancements, revolutionized structural biology, near-experimental accuracy, biology by achieving
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages main topic, 8 pages including citiation, 4 figures

点击查看摘要

Abstract:Recent advancements in protein structure prediction, particularly AlphaFold2, have revolutionized structural biology by achieving near-experimental accuracy. However, the computational intensity of these models limits their application in high-throughput protein screening. Concurrently, large language models like ESM (Evolutionary Scale Modeling) have demonstrated the potential to extract rich structural information directly from protein sequences. Despite these advances, a significant gap remains in rapidly assessing protein structure quality for large-scale analyses. We introduce pLDDT-Predictor, a high-speed protein screening tool that bridges this gap by leveraging pre-trained ESM2 protein embeddings and a Transformer architecture to accurately predict AlphaFold2s pLDDT (predicted Local Distance Difference Test) scores. Our model addresses the critical need for fast, accurate protein structure quality assessment without the computational burden of full structure prediction. By combining the evolutionary information captured in ESM2 embeddings with the sequence-wide context modeling of Transformers, pLDDT-Predictor achieves a balance between structural insight and computational efficiency. Our experimental results, conducted on a diverse dataset of 1.5 million protein sequences, demonstrate that pLDDT-Predictor can classify more than 90 percent of proteins with a pLDDT score above 70, closely matching AlphaFold2s confidence level.

[AI-108] raderTalk: An LLM Behavioural ABM applied to Simulating Human Bilateral Trading Interactions

链接: https://arxiv.org/abs/2410.21280
作者: Alicia Vidler,Toby Walsh
关键词-EN: Large Language Models, Large Language, generated by Large, Language Models, behaviors generated
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
*备注: 4 pages

点击查看摘要

Abstract:We introduce a novel hybrid approach that augments Agent-Based Models (ABMs) with behaviors generated by Large Language Models (LLMs) to simulate human trading interactions. We call our model TraderTalk. Leveraging LLMs trained on extensive human-authored text, we capture detailed and nuanced representations of bilateral conversations in financial trading. Applying this Generative Agent-Based Model (GABM) to government bond markets, we replicate trading decisions between two stylised virtual humans. Our method addresses both structural challenges, such as coordinating turn-taking between realistic LLM-based agents, and design challenges, including the interpretation of LLM outputs by the agent model. By exploring prompt design opportunistically rather than systematically, we enhance the realism of agent interactions without exhaustive overfitting or model reliance. Our approach successfully replicates trade-to-order volume ratios observed in related asset markets, demonstrating the potential of LLM-augmented ABMs in financial simulations

计算机视觉

[CV-0] Local Policies Enable Zero-shot Long-horizon Manipulation

链接: https://arxiv.org/abs/2410.22332
作者: Murtaza Dalal,Min Liu,Walter Talbott,Chen Chen,Deepak Pathak,Jian Zhang,Ruslan Salakhutdinov
关键词-EN: simulating complex contacts, realistic task distributions, generating realistic task, difficult due, challenges of simulating
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Main paper 7 pages, 3 tables, 3 figures. Appendix 6 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at this https URL

[CV-1] Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier WACV2025

链接: https://arxiv.org/abs/2410.22317
作者: Kai Wang,Fei Yang,Bogdan Raducanu,Joost van de Weijer
关键词-EN: large pre-trained vision-language, pre-trained vision-language models, CLIP model, CLIP, advent of large
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in WACV 2025. Code link: this https URL

点击查看摘要

Abstract:With the advent of large pre-trained vision-language models such as CLIP, prompt learning methods aim to enhance the transferability of the CLIP model. They learn the prompt given few samples from the downstream task given the specific class names as prior knowledge, which we term as semantic-aware classification. However, in many realistic scenarios, we only have access to few samples and knowledge of the class names (e.g., when considering instances of classes). This challenging scenario represents the semantic-agnostic discriminative case. Text-to-Image (T2I) personalization methods aim to adapt T2I models to unseen concepts by learning new tokens and endowing these tokens with the capability of generating the learned concepts. These methods do not require knowledge of class names as a semantic-aware prior. Therefore, in this paper, we first explore Textual Inversion and reveal that the new concept tokens possess both generation and classification capabilities by regarding each category as a single concept. However, learning classifiers from single-concept textual inversion is limited since the learned tokens are suboptimal for the discriminative tasks. To mitigate this issue, we propose Multi-Class textual inversion, which includes a discriminative regularization term for the token updating process. Using this technique, our method MC-TI achieves stronger Semantic-Agnostic Classification while preserving the generation capability of these modifier tokens given only few samples per category. In the experiments, we extensively evaluate MC-TI on 12 datasets covering various scenarios, which demonstrates that MC-TI achieves superior results in terms of both classification and generation outcomes.

[CV-2] Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

链接: https://arxiv.org/abs/2410.22313
作者: Bo Jiang,Shaoyu Chen,Bencheng Liao,Xingyu Zhang,Wei Yin,Qian Zhang,Chang Huang,Wenyu Liu,Xinggang Wang
关键词-EN: rare scenarios due, driving demonstrates strong, struggles in complex, rare scenarios, strong planning capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Project Page: this https URL

点击查看摘要

Abstract:End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM’s planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over model without pre-training. We believe Senna’s cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at this https URL.

[CV-3] Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention NEURIPS2024

链接: https://arxiv.org/abs/2410.22306
作者: Haomeng Zhang,Chiao-An Yang,Raymond A. Yeh
关键词-EN: Grounding involves locating, involves locating, boxes based, point cloud, query phrase
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Multi-object 3D Grounding involves locating 3D boxes based on a given query phrase from a point cloud. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module that enables a variable and learnable number of box proposals. Second, a dynamic camera positioning that extracts features for each proposal. Third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Empirically, experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.

[CV-4] Emotion-Guided Image to Music Generation

链接: https://arxiv.org/abs/2410.22299
作者: Souraja Kundu,Saket Singh,Yuji Iwahori
关键词-EN: social media experiences, including background music, Generating music, enhance various applications, including background
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 2024 6th Asian Digital Image Processing Conference

点击查看摘要

Abstract:Generating music from images can enhance various applications, including background music for photo slideshows, social media experiences, and video creation. This paper presents an emotion-guided image-to-music generation framework that leverages the Valence-Arousal (VA) emotional space to produce music that aligns with the emotional tone of a given image. Unlike previous models that rely on contrastive learning for emotional consistency, the proposed approach directly integrates a VA loss function to enable accurate emotional alignment. The model employs a CNN-Transformer architecture, featuring pre-trained CNN image feature extractors and three Transformer encoders to capture complex, high-level emotional features from MIDI music. Three Transformer decoders refine these features to generate musically and emotionally consistent MIDI sequences. Experimental results on a newly curated emotionally paired image-MIDI dataset demonstrate the proposed model’s superior performance across metrics such as Polyphony Rate, Pitch Entropy, Groove Consistency, and loss convergence.

[CV-5] Motion Graph Unleashed: A Novel Approach to Video Prediction NEURIPS2024

链接: https://arxiv.org/abs/2410.22288
作者: Yiqi Zhong,Luming Liang,Bohan Tang,Ilya Zharkov,Ulrich Neumann
关键词-EN: limited past data, predicts future video, future video frames, video prediction problem, past data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024, 19 pages, 12 figures

点击查看摘要

Abstract:We introduce motion graph, a novel approach to the video prediction problem, which predicts future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes, to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrix that either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by motion graph, exhibiting substantial performance improvements and cost reductions. Experiments on various datasets, including UCF Sports, KITTI and Cityscapes, highlight the strong representative ability of motion graph. Especially on UCF Sports, our method matches and outperforms the SOTA methods with a significant reduction in model size by 78% and a substantial decrease in GPU memory utilization by 47%.

[CV-6] Active Event Alignment for Monocular Distance Estimation

链接: https://arxiv.org/abs/2410.22280
作者: Nan Cai,Pia Bideau
关键词-EN: extracting visual information, visual information, data efficient representation, extracting visual, Event cameras provide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event cameras provide a natural and data efficient representation of visual information, motivating novel computational strategies towards extracting visual information. Inspired by the biological vision system, we propose a behavior driven approach for object-wise distance estimation from event camera data. This behavior-driven method mimics how biological systems, like the human eye, stabilize their view based on object distance: distant objects require minimal compensatory rotation to stay in focus, while nearby objects demand greater adjustments to maintain alignment. This adaptive strategy leverages natural stabilization behaviors to estimate relative distances effectively. Unlike traditional vision algorithms that estimate depth across the entire image, our approach targets local depth estimation within a specific region of interest. By aligning events within a small region, we estimate the angular velocity required to stabilize the image motion. We demonstrate that, under certain assumptions, the compensatory rotational flow is inversely proportional to the object’s distance. The proposed approach achieves new state-of-the-art accuracy in distance estimation - a performance gain of 16% on EVIMO2. EVIMO2 event sequences comprise complex camera motion and substantial variance in depth of static real world scenes.

[CV-7] NCA-Morph: Medical Image Registration with Neural Cellular Automata

链接: https://arxiv.org/abs/2410.22265
作者: Amin Ranem,John Kalkhof,Anirban Mukhopadhyay
关键词-EN: surgical planning, critical process, process that aligns, Neural Cellular Automata, Deep Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image registration is a critical process that aligns various patient scans, facilitating tasks like diagnosis, surgical planning, and tracking. Traditional optimization based methods are slow, prompting the use of Deep Learning (DL) techniques, such as VoxelMorph and Transformer-based strategies, for faster results. However, these DL methods often impose significant resource demands. In response to these challenges, we present NCA-Morph, an innovative approach that seamlessly blends DL with a bio-inspired communication and networking approach, enabled by Neural Cellular Automata (NCAs). NCA-Morph not only harnesses the power of DL for efficient image registration but also builds a network of local communications between cells and respective voxels over time, mimicking the interaction observed in living systems. In our extensive experiments, we subject NCA-Morph to evaluations across three distinct 3D registration tasks, encompassing Brain, Prostate and Hippocampus images from both healthy and diseased patients. The results showcase NCA-Morph’s ability to achieve state-of-the-art performance. Notably, NCA-Morph distinguishes itself as a lightweight architecture with significantly fewer parameters; 60% and 99.7% less than VoxelMorph and TransMorph. This characteristic positions NCA-Morph as an ideal solution for resource-constrained medical applications, such as primary care settings and operating rooms.

[CV-8] owards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

链接: https://arxiv.org/abs/2410.22217
作者: Shenghao Xie,Wenqiang Zu,Mingyang Zhao,Duo Su,Shilong Liu,Ruohua Shi,Guoqi Li,Shanghang Zhang,Lei Ma
关键词-EN: vision foundation models, token prediction paradigm, shown impressive scalability, foundation models, vision foundation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there is a growing interest in extending this success to vision foundation models. In this survey, we review the recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for next generation of vision foundation models, i.e., unifying both understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models, and present a formal definition of autoregression with its advantages. Later, we categorize autoregressive vision foundation models from their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey to comprehensively summarize autoregressive vision foundation models under the trend of unifying understanding and generation. A collection of related resources is available at this https URL.

[CV-9] LiVisSfM: Accurate and Robust Structure-from-Motion with LiDAR and Visual Cues

链接: https://arxiv.org/abs/2410.22213
作者: Hanqing Jiang,Liyang Zhou,Zhuang Zhang,Yihao Yu,Guofeng Zhang
关键词-EN: Inertial Measurement Unit, fully combines LiDAR, pipeline named LiVisSfM, SfM-based reconstruction system, robust LiDAR pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 9 figures, 2 tables

点击查看摘要

Abstract:This paper presents an accurate and robust Structure-from-Motion (SfM) pipeline named LiVisSfM, which is an SfM-based reconstruction system that fully combines LiDAR and visual cues. Unlike most existing LiDAR-inertial odometry (LIO) and LiDAR-inertial-visual odometry (LIVO) methods relying heavily on LiDAR registration coupled with Inertial Measurement Unit (IMU), we propose a LiDAR-visual SfM method which innovatively carries out LiDAR frame registration to LiDAR voxel map in a Point-to-Gaussian residual metrics, combined with a LiDAR-visual BA and explicit loop closure in a bundle optimization way to achieve accurate and robust LiDAR pose estimation without dependence on IMU incorporation. Besides, we propose an incremental voxel updating strategy for efficient voxel map updating during the process of LiDAR frame registration and LiDAR-visual BA optimization. Experiments demonstrate the superior effectiveness of our LiVisSfM framework over state-of-the-art LIO and LIVO works on more accurate and robust LiDAR pose recovery and dense point cloud reconstruction of both public KITTI benchmark and a variety of self-captured dataset.

[CV-10] Active Learning for Vision-Language Models WACV2025

链接: https://arxiv.org/abs/2410.22187
作者: Bardia Safaei,Vishal M. Patel
关键词-EN: Pre-trained vision-language models, computer vision tasks, downstream computer vision, Pre-trained vision-language, demonstrated impressive zero-shot
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in WACV 2025

点击查看摘要

Abstract:Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.

[CV-11] Shining a Light on Hurricane Damage Estimation via Nighttime Light Data: Pre-processing Matters

链接: https://arxiv.org/abs/2410.22150
作者: Nancy Thomas,Saba Rahimi,Annita Vapsi,Cathy Ansell,Elizabeth Christie,Daniel Borrajo,Tucker Balch,Manuela Veloso
关键词-EN: Amidst escalating climate, escalating climate change, inflicting severe socioeconomic, Amidst escalating, severe socioeconomic impacts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Amidst escalating climate change, hurricanes are inflicting severe socioeconomic impacts, marked by heightened economic losses and increased displacement. Previous research utilized nighttime light data to predict the impact of hurricanes on economic losses. However, prior work did not provide a thorough analysis of the impact of combining different techniques for pre-processing nighttime light (NTL) data. Addressing this gap, our research explores a variety of NTL pre-processing techniques, including value thresholding, built masking, and quality filtering and imputation, applied to two distinct datasets, VSC-NTL and VNP46A2, at the zip code level. Experiments evaluate the correlation of the denoised NTL data with economic damages of Category 4-5 hurricanes in Florida. They reveal that the quality masking and imputation technique applied to VNP46A2 show a substantial correlation with economic damage data.

[CV-12] Capacity Control is an Effective Memorization Mitigation Mechanism in Text-Conditional Diffusion Models ICML’24

链接: https://arxiv.org/abs/2410.22149
作者: Raman Dutt,Pedro Sanchez,Ondrej Bohdal,Sotirios A. Tsaftaris,Timothy Hospedales
关键词-EN: controlling model capacity, present compelling evidence, effectively mitigate memorization, controlling model, model capacity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the GenLaw (Generative AI + Law) workshop at ICML’24

点击查看摘要

Abstract:In this work, we present compelling evidence that controlling model capacity during fine-tuning can effectively mitigate memorization in diffusion models. Specifically, we demonstrate that adopting Parameter-Efficient Fine-Tuning (PEFT) within the pre-train fine-tune paradigm significantly reduces memorization compared to traditional full fine-tuning approaches. Our experiments utilize the MIMIC dataset, which comprises image-text pairs of chest X-rays and their corresponding reports. The results, evaluated through a range of memorization and generation quality metrics, indicate that PEFT not only diminishes memorization but also enhances downstream generation quality. Additionally, PEFT methods can be seamlessly combined with existing memorization mitigation techniques for further improvement. The code for our experiments is available at: this https URL

[CV-13] Lighten CARAFE: Dynamic Lightweight Upsampling with Guided Reassemble Kernels ICPR2024

链接: https://arxiv.org/abs/2410.22139
作者: Ruigang Fu,Qingyong Hu,Xiaohu Dong,Yinghui Gao,Biao Li,Ping Zhong
关键词-EN: modern machine vision, modern machine, upsampling, feature upsampling, machine vision models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICPR 2024

点击查看摘要

Abstract:As a fundamental operation in modern machine vision models, feature upsampling has been widely used and investigated in the literatures. An ideal upsampling operation should be lightweight, with low computational complexity. That is, it can not only improve the overall performance but also not affect the model complexity. Content-aware Reassembly of Features (CARAFE) is a well-designed learnable operation to achieve feature upsampling. Albeit encouraging performance achieved, this method requires generating large-scale kernels, which brings a mass of extra redundant parameters, and inherently has limited scalability. To this end, we propose a lightweight upsampling operation, termed Dynamic Lightweight Upsampling (DLU) in this paper. In particular, it first constructs a small-scale source kernel space, and then samples the large-scale kernels from the kernel space by introducing learnable guidance offsets, hence avoiding introducing a large collection of trainable parameters in upsampling. Experiments on several mainstream vision tasks show that our DLU achieves comparable and even better performance to the original CARAFE, but with much lower complexity, e.g., DLU requires 91% fewer parameters and at least 63% fewer FLOPs (Floating Point Operations) than CARAFE in the case of 16x upsampling, but outperforms the CARAFE by 0.3% mAP in object detection. Code is available at this https URL.

[CV-14] PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting

链接: https://arxiv.org/abs/2410.22128
作者: Sunghwan Hong,Jaewoo Jung,Heeseong Shin,Jisang Han,Jiaolong Yang,Chong Luo,Seungryong Kim
关键词-EN: view synthesis, single feed-forward, dense image views, view synthesis capabilities, unposed images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:We consider the problem of novel view synthesis from unposed images in a single feed-forward. Our framework capitalizes on fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, where we further extend it to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this through identifying and addressing unique challenges arising from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignments of 3D Gaussians. We then introduce lightweight, learnable modules to refine depth and pose estimates from the coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices.

[CV-15] 4D-based Robot Navigation Using Relativistic Image Processing AAAI

链接: https://arxiv.org/abs/2410.22087
作者: Simone Müller,Dieter Kranzlmüller
关键词-EN: Machine perception, important prerequisite, prerequisite for safe, locomotion in dynamic, dynamic environments
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: AAAI Fall Symposia 2024

点击查看摘要

Abstract:Machine perception is an important prerequisite for safe interaction and locomotion in dynamic environments. This requires not only the timely perception of surrounding geometries and distances but also the ability to react to changing situations through predefined, learned but also reusable skill endings of a robot so that physical damage or bodily harm can be avoided. In this context, 4D perception offers the possibility of predicting one’s own position and changes in the environment over time. In this paper, we present a 4D-based approach to robot navigation using relativistic image processing. Relativistic image processing handles the temporal-related sensor information in a tensor model within a constructive 4D space. 4D-based navigation expands the causal understanding and the resulting interaction radius of a robot through the use of visual and sensory 4D information.

[CV-16] HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

链接: https://arxiv.org/abs/2410.22079
作者: Zhoujie Xu
关键词-EN: significant challenge, Human pose estimation, high-resolution feature maps, Convolutional Neural Networks, feature maps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review

点击查看摘要

Abstract:Human pose estimation on medium and small scales has long been a significant challenge in this field. Most existing methods focus on restoring high-resolution feature maps by stacking multiple costly deconvolutional layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can lead to information redundancy. Additionally, due to quantization errors, heatmap-based methods have certain disadvantages in accurately locating keypoints of medium and small-scale human figures. In this paper, we propose HRPVT, which utilizes PVT v2 as the backbone to model long-range dependencies. Building on this, we introduce the High-Resolution Pyramid Module (HRPM), designed to generate higher quality high-resolution representations by incorporating the intrinsic inductive biases of Convolutional Neural Networks (CNNs) into the high-resolution feature maps. The integration of HRPM enhances the performance of pure transformer-based models for human pose estimation at medium and small scales. Furthermore, we replace the heatmap-based method with SimCC approach, which eliminates the need for costly upsampling layers, thereby allowing us to allocate more computational resources to HRPM. To accommodate models with varying parameter scales, we have developed two insertion strategies of HRPM, each designed to enhancing the model’s ability to perceive medium and small-scale human poses from two distinct perspectives.

[CV-17] FreeGaussian: Guidance-free Controllable 3D Gaussian Splats with Flow Derivatives

链接: https://arxiv.org/abs/2410.22070
作者: Qizhi Chen,Delin Qu,Yiwen Tang,Haoming Song,Yiting Zhang,Dong Wang,Bin Zhao,Xuelong Li
关键词-EN: Reconstructing controllable Gaussian, challenging task due, inherently insufficient constraints, controllable Gaussian splats, Reconstructing controllable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reconstructing controllable Gaussian splats from monocular video is a challenging task due to its inherently insufficient constraints. Widely adopted approaches supervise complex interactions with additional masks and control signal annotations, limiting their real-world applications. In this paper, we propose an annotation guidance-free method, dubbed FreeGaussian, that mathematically derives dynamic Gaussian motion from optical flow and camera motion using novel dynamic Gaussian constraints. By establishing a connection between 2D flows and 3D Gaussian dynamic control, our method enables self-supervised optimization and continuity of dynamic Gaussian motions from flow priors. Furthermore, we introduce a 3D spherical vector controlling scheme, which represents the state with a 3D Gaussian trajectory, thereby eliminating the need for complex 1D control signal calculations and simplifying controllable Gaussian modeling. Quantitative and qualitative evaluations on extensive experiments demonstrate the state-of-the-art visual performance and control capability of our method. Project page: this https URL.

[CV-18] PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement WACV2025

链接: https://arxiv.org/abs/2410.22059
作者: Shutong Jin,Ruiyu Wang,Kuangyi Chen,Florian T.Pokorny
关键词-EN: diverse object arrangements, predicting diverse object, robotic manipulation due, table tidying, robotic manipulation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV2025

点击查看摘要

Abstract:Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.

[CV-19] Benchmarking Human and Automated Prompting in the Segment Anything Model

链接: https://arxiv.org/abs/2410.22048
作者: Jorge Quesada,Zoe Fowler,Mohammad Alotaibi,Mohit Prabhushankar,Ghassan AlRegib
关键词-EN: Segment Anything Model, visual prompting strategies, automated visual prompting, effective visual prompts, visual prompting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The remarkable capabilities of the Segment Anything Model (SAM) for tackling image segmentation tasks in an intuitive and interactive manner has sparked interest in the design of effective visual prompts. Such interest has led to the creation of automated point prompt selection strategies, typically motivated from a feature extraction perspective. However, there is still very little understanding of how appropriate these automated visual prompting strategies are, particularly when compared to humans, across diverse image domains. Additionally, the performance benefits of including such automated visual prompting strategies within the finetuning process of SAM also remains unexplored, as does the effect of interpretable factors like distance between the prompt points on segmentation performance. To bridge these gaps, we leverage a recently released visual prompting dataset, PointPrompt, and introduce a number of benchmarking tasks that provide an array of opportunities to improve the understanding of the way human prompts differ from automated ones and what underlying factors make for effective visual prompts. We demonstrate that the resulting segmentation scores obtained by humans are approximately 29% higher than those given by automated strategies and identify potential features that are indicative of prompting performance with R^2 scores over 0.5. Additionally, we demonstrate that performance when using automated methods can be improved by up to 68% via a finetuning approach. Overall, our experiments not only showcase the existing gap between human prompts and automated methods, but also highlight potential avenues through which this gap can be leveraged to improve effective visual prompt design. Further details along with the dataset links and codes are available at this https URL

[CV-20] Feature distribution Adaptation Network for Speech Emotion Recognition

链接: https://arxiv.org/abs/2410.22023
作者: Shaokai Li,Yixuan Ji,Peng Song,Haoqin Sun,Wenming Zheng
关键词-EN: speech emotion recognition, emotion recognition problem, transfer learning framework, inductive transfer learning, named feature distribution
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel deep inductive transfer learning framework, named feature distribution adaptation network, to tackle the challenging multi-modal speech emotion recognition problem. Our method aims to use deep transfer learning strategies to align visual and audio feature distributions to obtain consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, the pre-trained ResNet-34 is utilized for feature extraction for facial expression images and acoustic Mel spectrograms, respectively. Then, the cross-attention mechanism is introduced to model the intrinsic similarity relationships of multi-modal features. Finally, the multi-modal feature distribution adaptation is performed efficiently with feed-forward network, which is extended using the local maximum mean discrepancy loss. Experiments are carried out on two benchmark datasets, and the results demonstrate that our model can achieve excellent performance compared with existing this http URL code is available at this https URL.

[CV-21] A Machine Learning-Based Secure Face Verification Scheme and Its Applications to Digital Surveillance

链接: https://arxiv.org/abs/2410.21993
作者: Huan-Chih Wang,Ja-Ling Wu
关键词-EN: well-known image analysis, image analysis application, Face verification, contemporary society, facial images
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: accepted by International Conference on Digital Image and Signal Processing (DISP) 2019

点击查看摘要

Abstract:Face verification is a well-known image analysis application and is widely used to recognize individuals in contemporary society. However, most real-world recognition systems ignore the importance of protecting the identity-sensitive facial images that are used for verification. To address this problem, we investigate how to implement a secure face verification system that protects the facial images from being imitated. In our work, we use the DeepID2 convolutional neural network to extract the features of a facial image and an EM algorithm to solve the facial verification problem. To maintain the privacy of facial images, we apply homomorphic encryption schemes to encrypt the facial data and compute the EM algorithm in the ciphertext domain. We develop three face verification systems for surveillance (or entrance) control of a local community based on three levels of privacy concerns. The associated timing performances are presented to demonstrate their feasibility for practical implementation.

[CV-22] A Survey on RGB 3D and Multimodal Approaches for Unsupervised Industrial Anomaly Detection

链接: https://arxiv.org/abs/2410.21982
作者: Yuxuan Lin,Yang Chang,Xuan Tong,Jiawen Yu,Antonio Liotta,Guofan Huang,Wei Song,Deyu Zeng,Zongze Wu,Yan Wang,Wenqiang Zhang
关键词-EN: Industrial Anomaly Detection, Unsupervised Industrial Anomaly, technology effectively overcomes, Anomaly Detection, multimodal anomaly detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages, 18 figures

点击查看摘要

Abstract:In the advancement of industrial informatization, Unsupervised Industrial Anomaly Detection (UIAD) technology effectively overcomes the scarcity of abnormal samples and significantly enhances the automation and reliability of smart manufacturing. While RGB, 3D, and multimodal anomaly detection have demonstrated comprehensive and robust capabilities within the industrial informatization sector, existing reviews on industrial anomaly detection have not sufficiently classified and discussed methods in 3D and multimodal settings. We focus on 3D UIAD and multimodal UIAD, providing a comprehensive summary of unsupervised industrial anomaly detection in three modal settings. Firstly, we compare our surveys with recent works, introducing commonly used datasets, evaluation metrics, and the definitions of anomaly detection problems. Secondly, we summarize five research paradigms in RGB, 3D and multimodal UIAD and three emerging industrial manufacturing optimization directions in RGB UIAD, and review three multimodal feature fusion strategies in multimodal settings. Finally, we outline the primary challenges currently faced by UIAD in three modal settings, and offer insights into future development directions, aiming to provide researchers with a thorough reference and offer new perspectives for the advancement of industrial informatization. Corresponding resources are available at this https URL.

[CV-23] BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays NEURIPS24

链接: https://arxiv.org/abs/2410.21969
作者: Yang Zhou,Tan Li Hui Faith,Yanyu Xu,Sicong Leng,Xinxing Xu,Yong Liu,Rick Siow Mong Goh
关键词-EN: Medical Vision-Language Pretraining, transferable visual representations, Vision-Language Pretraining, unpaired medical images, MedVLP methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS24 Datasets and Benchmarks Track

点击查看摘要

Abstract:Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This pose great challenges in evaluating how well a MedVLP method generalizes to various clinically-relevant tasks due to the lack of unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematical analysis between MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) Comprehensive datasets covering nine datasets and four medical tasks; 2) Benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) Unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation, respectively. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and found that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code are available at this https URL.

[CV-24] PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference

链接: https://arxiv.org/abs/2410.21966
作者: Kendong Liu,Zhiyu Zhu,Chuanhao Li,Hui Liu,Huanqiang Zeng,Junhui Hou
关键词-EN: human aesthetic standards, align diffusion models, significantly improving, image inpainting, attempt to align
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we make the first attempt to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework, significantly improving the quality and visual appeal of inpainted images. Specifically, instead of directly measuring the divergence with paired images, we train a reward model with the dataset we construct, consisting of nearly 51,000 images annotated with human preferences. Then, we adopt a reinforcement learning process to fine-tune the distribution of a pre-trained diffusion model for image inpainting in the direction of higher reward. Moreover, we theoretically deduce the upper bound on the error of the reward model, which illustrates the potential confidence of reward estimation throughout the reinforcement alignment process, thereby facilitating accurate regularization. Extensive experiments on inpainting comparison and downstream tasks, such as image extension and 3D reconstruction, demonstrate the effectiveness of our approach, showing significant improvements in the alignment of inpainted images with human preference compared with state-of-the-art methods. This research not only advances the field of image inpainting but also provides a framework for incorporating human preference into the iterative refinement of generative models based on modeling reward accuracy, with broad implications for the design of visually driven AI applications. Our code and dataset are publicly available at this https URL.

[CV-25] FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

链接: https://arxiv.org/abs/2410.21964
作者: Dat Nguyen,Marcella Astrid,Enjie Ghorbel,Djamila Aouada
关键词-EN: Vision Transformers, achieved unprecedented effectiveness, Convolution Neural Networks, image classification, achieved unprecedented
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance as compared to Convolution Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit a suboptimal performance when dealing with the detection of facial forgeries. Our analysis reveals that, as compared to CNNs, ViT struggles to model localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning guided by artifact-vulnerable patches and tailored to ViTs is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state-of-the-art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at \urlthis https URL.

[CV-26] Spatio-temporal Transformers for Action Unit Classification with Event Cameras

链接: https://arxiv.org/abs/2410.21958
作者: Luca Cultrera,Federico Becattini,Lorenzo Berlincioni,Claudio Ferrari,Alberto Del Bimbo
关键词-EN: angles to infer, Shifted Patch Tokenization, infer emotion, Traditionally RGB cameras, Event
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review at CVIU. arXiv admin note: substantial text overlap with arXiv:2409.10213

点击查看摘要

Abstract:Face analysis has been studied from different angles to infer emotion, poses, shapes, and landmarks. Traditionally RGB cameras are used, yet for fine-grained tasks standard sensors might not be up to the task due to their latency, making it impossible to record and detect micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been increasingly gaining interest as a possible solution to this and similar high-frame rate tasks. We propose a novel spatiotemporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of an existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams. The dataset is annotated at a video level with facial Action Units and contains streams collected with various possible applications, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space. Our proposed model outperforms baseline methods by effectively capturing spatial and temporal information, crucial for recognizing subtle facial micro-expressions.

[CV-27] ActiveSplat: High-Fidelity Scene Reconstruction through Active Gaussian Splatting

链接: https://arxiv.org/abs/2410.21955
作者: Yuetao Li,Zijia Kuang,Ting Li,Guyue Zhou,Shaohui Zhang,Zike Yan
关键词-EN: leveraging Gaussian splatting, system leveraging Gaussian, Gaussian splatting, leveraging Gaussian, reconstruction system leveraging
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose ActiveSplat, an autonomous high-fidelity reconstruction system leveraging Gaussian splatting. Taking advantage of efficient and realistic rendering, the system establishes a unified framework for online mapping, viewpoint selection, and path planning. The key to ActiveSplat is a hybrid map representation that integrates both dense information about the environment and a sparse abstraction of the workspace. Therefore, the system leverages sparse topology for efficient viewpoint sampling and path planning, while exploiting view-dependent dense prediction for viewpoint selection, facilitating efficient decision-making with promising accuracy and completeness. A hierarchical planning strategy based on the topological map is adopted to mitigate repetitive trajectories and improve local granularity given limited budgets, ensuring high-fidelity reconstruction with photorealistic view synthesis. Extensive experiments and ablation studies validate the efficacy of the proposed method in terms of reconstruction accuracy, data coverage, and exploration efficiency. Project page: this https URL.

[CV-28] Structured Analysis and Comparison of Alphabets in Historical Handwritten Ciphers ECCV24

链接: https://arxiv.org/abs/2410.21913
作者: Martín Méndez,Pau Torras,Adrià Molina,Jialuo Chen,Oriol Ramos-Terrades,Alicia Fornés
关键词-EN: Historical ciphered manuscripts, Historical ciphered, sensitive communications, communications within military, military and diplomatic
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
*备注: Acccepted at ECCV24 Workshop AI4DH

点击查看摘要

Abstract:Historical ciphered manuscripts are documents that were typically used in sensitive communications within military and diplomatic contexts or among members of secret societies. These secret messages were concealed by inventing a method of writing employing symbols from diverse sources such as digits, alchemy signs and Latin or Greek characters. When studying a new, unseen cipher, the automatic search and grouping of ciphers with a similar alphabet can aid the scholar in its transcription and cryptanalysis because it indicates a probability that the underlying cipher is similar. In this study, we address this need by proposing the CSI metric, a novel way of comparing pairs of ciphered documents. We assess their effectiveness in an unsupervised clustering scenario utilising visual features, including SIFT, pre-trained learnt embeddings, and OCR descriptors.

[CV-29] Multi-step feature fusion for natural disaster damage assessment on satellite images

链接: https://arxiv.org/abs/2410.21901
作者: Mateusz Żarski,Jarosław Adam Miszczak
关键词-EN: subsequent recovery operations, undertaking properly targeted, properly targeted rescue, Quick and accurate, disaster recovery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, for associated Github repository: this https URL

点击查看摘要

Abstract:Quick and accurate assessment of the damage state of buildings after natural disasters is crucial for undertaking properly targeted rescue and subsequent recovery operations, which can have a major impact on the safety of victims and the cost of disaster recovery. The quality of such a process can be significantly improved by harnessing the potential of machine learning methods in computer vision. This paper presents a novel damage assessment method using an original multi-step feature fusion network for the classification of the damage state of buildings based on pre- and post-disaster large-scale satellite images. We introduce a novel convolutional neural network (CNN) module that performs feature fusion at multiple network levels between pre- and post-disaster images in the horizontal and vertical directions of CNN network. An additional network element - Fuse Module - was proposed to adapt any CNN model to analyze image pairs in the issue of pair classification. We use, open, large-scale datasets (IDA-BD and xView2) to verify, that the proposed method is suitable to improve on existing state-of-the-art architectures. We report over a 3 percentage point increase in the accuracy of the Vision Transformer model.

[CV-30] Self-Relaxed Joint Training: Sample Selection for Severity Estimation with Ordinal Noisy Labels WACV2025

链接: https://arxiv.org/abs/2410.21885
作者: Shumpei Takezaki,Kiyohito Tanaka,Seiichi Uchida
关键词-EN: medical image diagnosis, Severity level estimation, crucial task, task in medical, image diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV2025

点击查看摘要

Abstract:Severity level estimation is a crucial task in medical image diagnosis. However, accurately assigning severity class labels to individual images is very costly and challenging. Consequently, the attached labels tend to be noisy. In this paper, we propose a new framework for training with ``ordinal’’ noisy labels. Since severity levels have an ordinal relationship, we can leverage this to train a classifier while mitigating the negative effects of noisy labels. Our framework uses two techniques: clean sample selection and dual-network architecture. A technical highlight of our approach is the use of soft labels derived from noisy hard labels. By appropriately using the soft and hard labels in the two techniques, we achieve more accurate sample selection and robust network training. The proposed method outperforms various state-of-the-art methods in experiments using two endoscopic ulcerative colitis (UC) datasets and a retinal Diabetic Retinopathy (DR) dataset. Our codes are available at this https URL.

[CV-31] HRGR: Enhancing Image Manipulation Detection via Hierarchical Region-aware Graph Reasoning

链接: https://arxiv.org/abs/2410.21861
作者: Xudong Wang,Yuezun Li,Huiyu Zhou,Jiaran Zhou,Junyu Dong
关键词-EN: Hierarchical Region-aware Graph, Image manipulation detection, Region-aware Graph Reasoning, Toggle, Image manipulation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image manipulation detection is to identify the authenticity of each pixel in images. One typical approach to uncover manipulation traces is to model image correlations. The previous methods commonly adopt the grids, which are fixed-size squares, as graph nodes to model correlations. However, these grids, being independent of image content, struggle to retain local content coherence, resulting in imprecise detection. To address this issue, we describe a new method named Hierarchical Region-aware Graph Reasoning (HRGR) to enhance image manipulation detection. Unlike existing grid-based methods, we model image correlations based on content-coherence feature regions with irregular shapes, generated by a novel Differentiable Feature Partition strategy. Then we construct a Hierarchical Region-aware Graph based on these regions within and across different feature layers. Subsequently, we describe a structural-agnostic graph reasoning strategy tailored for our graph to enhance the representation of nodes. Our method is fully differentiable and can seamlessly integrate into mainstream networks in an end-to-end manner, without requiring additional supervision. Extensive experiments demonstrate the effectiveness of our method in image manipulation detection, exhibiting its great potential as a plug-and-play component for existing architectures. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.21861 [cs.CV] (or arXiv:2410.21861v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.21861 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yuezun Li [view email] [v1] Tue, 29 Oct 2024 08:51:30 UTC (1,472 KB) Full-text links: Access Paper: View a PDF of the paper titled HRGR: Enhancing Image Manipulation Detection via Hierarchical Region-aware Graph Reasoning, by Xudong Wang and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CV prev | next new | recent | 2024-10 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[CV-32] Micro-Structures Graph-Based Point Cloud Registration for Balancing Efficiency and Accuracy

链接: https://arxiv.org/abs/2410.21857
作者: Rongling Zhang,Li Yan,Pengcheng Wei,Hong Xie,Pinzhuo Wang,Binbing Wang
关键词-EN: optimal rigid transformation, Point Cloud Registration, Point Cloud, global point cloud, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Point Cloud Registration (PCR) is a fundamental and significant issue in photogrammetry and remote sensing, aiming to seek the optimal rigid transformation between sets of points. Achieving efficient and precise PCR poses a considerable challenge. We propose a novel micro-structures graph-based global point cloud registration method. The overall method is comprised of two stages. 1) Coarse registration (CR): We develop a graph incorporating micro-structures, employing an efficient graph-based hierarchical strategy to remove outliers for obtaining the maximal consensus set. We propose a robust GNC-Welsch estimator for optimization derived from a robust estimator to the outlier process in the Lie algebra space, achieving fast and robust alignment. 2) Fine registration (FR): To refine local alignment further, we use the octree approach to adaptive search plane features in the micro-structures. By minimizing the distance from the point-to-plane, we can obtain a more precise local alignment, and the process will also be addressed effectively by being treated as a planar adjustment algorithm combined with Anderson accelerated optimization (PA-AA). After extensive experiments on real data, our proposed method performs well on the 3DMatch and ETH datasets compared to the most advanced methods, achieving higher accuracy metrics and reducing the time cost by at least one-third.

[CV-33] Enhanced Survival Prediction in Head and Neck Cancer Using Convolutional Block Attention and Multimodal Data Fusion ACCV2024

链接: https://arxiv.org/abs/2410.21831
作者: Aiman Farooq,Utkarsh Sharma,Deepak Mishra
关键词-EN: guiding clinical decision-making, optimizing treatment strategies, Accurate survival prediction, Accurate survival, neck cancer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to [ACCV 2024 Workshop]

点击查看摘要

Abstract:Accurate survival prediction in head and neck cancer (HNC) is essential for guiding clinical decision-making and optimizing treatment strategies. Traditional models, such as Cox proportional hazards, have been widely used but are limited in their ability to handle complex multi-modal data. This paper proposes a deep learning-based approach leveraging CT and PET imaging modalities to predict survival outcomes in HNC patients. Our method integrates feature extraction with a Convolutional Block Attention Module (CBAM) and a multi-modal data fusion layer that combines imaging data to generate a compact feature representation. The final prediction is achieved through a fully parametric discrete-time survival model, allowing for flexible hazard functions that overcome the limitations of traditional survival models. We evaluated our approach using the HECKTOR and HEAD-NECK-RADIOMICS- HN1 datasets, demonstrating its superior performance compared to conconventional statistical and machine learning models. The results indicate that our deep learning model significantly improves survival prediction accuracy, offering a robust tool for personalized treatment planning in HNC

[CV-34] Volumetric Conditioning Module to Control Pretrained Diffusion Models for 3D Medical Images WACV2025

链接: https://arxiv.org/abs/2410.21826
作者: Suhyun Ahn,Wonjung Park,Jihoon Cho,Seunghyuck Park,Jinah Park
关键词-EN: Spatial control methods, Spatial control, pretrained diffusion models, control methods, gained attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 18 figures, accepted @ WACV 2025

点击查看摘要

Abstract:Spatial control methods using additional modules on pretrained diffusion models have gained attention for enabling conditional generation in natural images. These methods guide the generation process with new conditions while leveraging the capabilities of large models. They could be beneficial as training strategies in the context of 3D medical imaging, where training a diffusion model from scratch is challenging due to high computational costs and data scarcity. However, the potential application of spatial control methods with additional modules to 3D medical images has not yet been explored. In this paper, we present a tailored spatial control method for 3D medical images with a novel lightweight module, Volumetric Conditioning Module (VCM). Our VCM employs an asymmetric U-Net architecture to effectively encode complex information from various levels of 3D conditions, providing detailed guidance in image synthesis. To examine the applicability of spatial control methods and the effectiveness of VCM for 3D medical data, we conduct experiments under single- and multimodal conditions scenarios across a wide range of dataset sizes, from extremely small datasets with 10 samples to large datasets with 500 samples. The experimental results show that the VCM is effective for conditional generation and efficient in terms of requiring less training data and computational resources. We further investigate the potential applications for our spatial control method through axial super-resolution for medical images. Our code is available at \urlthis https URL

[CV-35] PK-YOLO: Pretrained Knowledge Guided YOLO for Brain Tumor Detection in Multiplanar MRI Slices WACV2025

链接: https://arxiv.org/abs/2410.21822
作者: Ming Kang,Fung Fung Ting,Raphaël C.-W. Phan,Chee-Ming Ting
关键词-EN: Magnetic Resonance Imaging, multiplane Magnetic Resonance, Resonance Imaging, Magnetic Resonance, challenging task due
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Applications (stat.AP)
*备注: Accepted by WACV 2025

点击查看摘要

Abstract:Brain tumor detection in multiplane Magnetic Resonance Imaging (MRI) slices is a challenging task due to the various appearances and relationships in the structure of the multiplane images. In this paper, we propose a new You Only Look Once (YOLO)-based detection model that incorporates Pretrained Knowledge (PK), called PK-YOLO, to improve the performance for brain tumor detection in multiplane MRI slices. To our best knowledge, PK-YOLO is the first pretrained knowledge guided YOLO-based object detector. The main components of the new method are a pretrained pure lightweight convolutional neural network-based backbone via sparse masked modeling, a YOLO architecture with the pretrained backbone, and a regression loss function for improving small object detection. The pretrained backbone allows for feature transferability of object queries on individual plane MRI slices into the model encoders, and the learned domain knowledge base can improve in-domain detection. The improved loss function can further boost detection performance on small-size brain tumors in multiplanar two-dimensional MRI slices. Experimental results show that the proposed PK-YOLO achieves competitive performance on the multiplanar MRI brain tumor detection datasets compared to state-of-the-art YOLO-like and DETR-like object detectors. The code is available at this https URL.

[CV-36] SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection

链接: https://arxiv.org/abs/2410.21813
作者: Jia Wei,Yun Li,Xiaomao Fan,Wenjun Ma,Meiyu Qiu,Hongyu Chen,Wenbin Lei
关键词-EN: highly lethal malignancy, neck region, highly lethal, lethal malignancy, head and neck
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Laryngo-pharyngeal cancer (LPC) is a highly lethal malignancy in the head and neck region. Recent advancements in tumor detection, particularly through dual-branch network architectures, have significantly improved diagnostic accuracy by integrating global and local feature extraction. However, challenges remain in accurately localizing lesions and fully capitalizing on the complementary nature of features within these branches. To address these issues, we propose SAM-Swin, an innovative SAM-driven Dual-Swin Transformer for laryngo-pharyngeal tumor detection. This model leverages the robust segmentation capabilities of the Segment Anything Model 2 (SAM2) to achieve precise lesion segmentation. Meanwhile, we present a multi-scale lesion-aware enhancement module (MS-LAEM) designed to adaptively enhance the learning of nuanced complementary features across various scales, improving the quality of feature extraction and representation. Furthermore, we implement a multi-scale class-aware guidance (CAG) loss that delivers multi-scale targeted supervision, thereby enhancing the model’s capacity to extract class-specific features. To validate our approach, we compiled three LPC datasets from the First Affiliated Hospital (FAHSYSU), the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University, and Nanfang Hospital of Southern Medical University (NHSMU). The FAHSYSU dataset is utilized for internal training, while the SAHSYSU and NHSMU datasets serve for external evaluation. Extensive experiments demonstrate that SAM-Swin outperforms state-of-the-art methods, showcasing its potential for advancing LPC detection and improving patient outcomes. The source code of SAM-Swin is available at the URL of \hrefthis https URLthis https URL.

[CV-37] Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging

链接: https://arxiv.org/abs/2410.21804
作者: Li Shen,Anke Tang,Enneng Yang,Guibing Guo,Yong Luo,Lefei Zhang,Xiaochun Cao,Bo Du,Dacheng Tao
关键词-EN: facilitate knowledge transfer, knowledge transfer, MTL, facilitate knowledge, task arithmetic-based MTL
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interferences. To address this challenge, in this paper, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after finetuning. Then, our WEMoE statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and implementing shared routing across multiple MoE modules, thereby significantly reducing both the trainable parameters, the overall parameter count, and computational overhead of the merged model by WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.

[CV-38] HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion

链接: https://arxiv.org/abs/2410.21789
作者: Yu Zeng,Yang Zhang,Jiachen Liu,Linlin Shen,Kaijun Deng,Weizhao He,Jinbao Wang
关键词-EN: critical image synthesis, hair color, aims to edit, edit hair color, image synthesis task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hair editing is a critical image synthesis task that aims to edit hair color and hairstyle using text descriptions or reference images, while preserving irrelevant attributes (e.g., identity, background, cloth). Many existing methods are based on StyleGAN to address this task. However, due to the limited spatial distribution of StyleGAN, it struggles with multiple hair color editing and facial preservation. Considering the advancements in diffusion models, we utilize Latent Diffusion Models (LDMs) for hairstyle editing. Our approach introduces Multi-stage Hairstyle Blend (MHB), effectively separating control of hair color and hairstyle in diffusion latent space. Additionally, we train a warping module to align the hair color with the target region. To further enhance multi-color hairstyle editing, we fine-tuned a CLIP model using a multi-color hairstyle dataset. Our method not only tackles the complexity of multi-color hairstyles but also addresses the challenge of preserving original colors during diffusion editing. Extensive experiments showcase the superiority of our method in editing multi-color hairstyles while preserving facial attributes given textual descriptions and reference images.

[CV-39] Fast-OMRA: Fast Online Motion Resolution Adaptation for Neural B-Frame Coding

链接: https://arxiv.org/abs/2410.21763
作者: Sang NguyenQuang,Zong-Lin Gao,Kuan-Wei Ho,Xiem HoangVan,Wen-Hsiao Peng
关键词-EN: hierarchical temporal prediction, temporal prediction suffer, domain shift issue, shift issue caused, learned B-frame codecs
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Most learned B-frame codecs with hierarchical temporal prediction suffer from the domain shift issue caused by the discrepancy in the Group-of-Pictures (GOP) size used for training and test. As such, the motion estimation network may fail to predict large motion properly. One effective strategy to mitigate this domain shift issue is to downsample video frames for motion estimation. However, finding the optimal downsampling factor involves a time-consuming rate-distortion optimization process. This work introduces lightweight classifiers to determine the downsampling factor. To strike a good rate-distortion-complexity trade-off, our classifiers observe simple state signals, including only the coding and reference frames, to predict the best downsampling factor. We present two variants that adopt binary and multi-class classifiers, respectively. The binary classifier adopts the Focal Loss for training, classifying between motion estimation at high and low resolutions. Our multi-class classifier is trained with novel soft labels incorporating the knowledge of the rate-distortion costs of different downsampling factors. Both variants operate as add-on modules without the need to re-train the B-frame codec. Experimental results confirm that they achieve comparable coding performance to the brute-force search methods while greatly reducing computational complexity.

[CV-40] IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

链接: https://arxiv.org/abs/2410.21759
作者: Hang Guo,Yawei Li,Tao Dai,Shu-Tao Xia,Luca Benini
关键词-EN: yielded impressive results, impressive results, downstream tasks, tasks has yielded, yielded impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:Fine-tuning large-scale text-to-image diffusion models for various downstream tasks has yielded impressive results. However, the heavy computational burdens of tuning large models prevent personal customization. Recent advances have attempted to employ parameter-efficient fine-tuning (PEFT) techniques to adapt the floating-point (FP) or quantized pre-trained weights. Nonetheless, the adaptation parameters in existing works are still restricted to FP arithmetic, hindering hardware-friendly acceleration. In this work, we propose IntLoRA, to further push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models. By working in the integer arithmetic, our IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting, eliminating additional post-training quantization. Extensive experiments demonstrate that IntLoRA can achieve performance on par with or even superior to the vanilla LoRA, accompanied by significant efficiency improvements. Code is available at \urlthis https URL.

[CV-41] DOFS: A Real-world 3D Deformable Object Dataset with Full Spatial Information for Dynamics Model Learning

链接: https://arxiv.org/abs/2410.21758
作者: Zhen Zhang,Xiangyu Chu,Yunxi Tang,K. W. Samuel Au
关键词-EN: work proposes DOFS, full spatial information, transparent operating plane, proposes DOFS, spatial information
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 5 pages, 6 figures, 2024 CoRL Workshop on Learning Robot Fine and Dexterous Manipulation: Perception and Control

点击查看摘要

Abstract:This work proposes DOFS, a pilot dataset of 3D deformable objects (DOs) (e.g., elasto-plastic objects) with full spatial information (i.e., top, side, and bottom information) using a novel and low-cost data collection platform with a transparent operating plane. The dataset consists of active manipulation action, multi-view RGB-D images, well-registered point clouds, 3D deformed mesh, and 3D occupancy with semantics, using a pinching strategy with a two-parallel-finger gripper. In addition, we trained a neural network with the down-sampled 3D occupancy and action as input to model the dynamics of an elasto-plastic object. Our dataset and all CADs of the data collection system will be released soon on our website.

[CV-42] Memory-Efficient Point Cloud Registration via Overlapping Region Sampling

链接: https://arxiv.org/abs/2410.21753
作者: Tomoyasu Shimada,Kazuhiko Murasaki,Shogo Sato,Toshihiko Nishimura,Taiga Yoshida,Ryuichi Tanida
关键词-EN: graphics processing unit, increased graphics processing, Recent advances, requiring preliminary sampling, learning have improved
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted for IEEE International Conference on Visual Communications and Image Processing 2024 (VCIP2024)

点击查看摘要

Abstract:Recent advances in deep learning have improved 3D point cloud registration but increased graphics processing unit (GPU) memory usage, often requiring preliminary sampling that reduces accuracy. We propose an overlapping region sampling method to reduce memory usage while maintaining accuracy. Our approach estimates the overlapping region and intensively samples from it, using a k-nearest-neighbor (kNN) based point compression mechanism with multi layer perceptron (MLP) and transformer architectures. Evaluations on 3DMatch and 3DLoMatch datasets show our method outperforms other sampling methods in registration recall, especially at lower GPU memory levels. For 3DMatch, we achieve 94% recall with 33% reduced memory usage, with greater advantages in 3DLoMatch. Our method enables efficient large-scale point cloud registration in resource-constrained environments, maintaining high accuracy while significantly reducing memory requirements.

[CV-43] MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

链接: https://arxiv.org/abs/2410.21747
作者: Yuan Wang,Di Huang,Yaqi Zhang,Wanli Ouyang,Jile Jiao,Xuetao Feng,Yan Zhou,Pengfei Wan,Shixiang Tang,Dan Xu
关键词-EN: http URL impressive, URL impressive advances, Generating lifelike human, Large Motion-Language Model, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generating lifelike human motions from descriptive texts has experienced remarkable research focus in the recent years, propelled by the emerging requirements of digital this http URL impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and focus solely on body motion this http URL this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supporting multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs-such as text and single-frame poses-into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM’s vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.

[CV-44] EI-Nexus: Towards Unmediated and Flexible Inter-Modality Local Feature Extraction and Matching for Event-Image Data WACV2025

链接: https://arxiv.org/abs/2410.21743
作者: Zhonghua Yi,Hao Shi,Qi Jiang,Kailun Yang,Ze Wang,Diyang Gu,Yufan Zhang,Kaiwei Wang
关键词-EN: high dynamic range, high temporal resolution, Local Feature Distillation, high temporal, high dynamic
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: Accepted to WACV 2025. The source code and benchmarks will be made publicly available at this https URL

点击查看摘要

Abstract:Event cameras, with high temporal resolution and high dynamic range, have limited research on the inter-modality local feature extraction and matching of event-image data. We propose EI-Nexus, an unmediated and flexible framework that integrates two modality-specific keypoint extractors and a feature matcher. To achieve keypoint extraction across viewpoint and modality changes, we bring Local Feature Distillation (LFD), which transfers the viewpoint consistency from a well-learned image extractor to the event extractor, ensuring robust feature correspondence. Furthermore, with the help of Context Aggregation (CA), a remarkable enhancement is observed in feature matching. We further establish the first two inter-modality feature matching benchmarks, MVSEC-RPE and EC-RPE, to assess relative pose estimation on event-image data. Our approach outperforms traditional methods that rely on explicit modal transformation, offering more unmediated and adaptable feature extraction and matching, achieving better keypoint similarity and state-of-the-art results on the MVSEC-RPE and EC-RPE benchmarks. The source code and benchmarks will be made publicly available at this https URL.

[CV-45] SS3DM: Benchmarking Street-View Surface Reconstruction with a Synthetic 3D Mesh Dataset NEURIPS2024

链接: https://arxiv.org/abs/2410.21739
作者: Yubin Hu,Kairui Wen,Heng Zhou,Xiaoyang Guo,Yongjin Liu
关键词-EN: autonomous driving simulation, Reconstructing accurate, crucial for applications, digital entertainment, entertainment and autonomous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024, Track on Datasets and Benchmarks

点击查看摘要

Abstract:Reconstructing accurate 3D surfaces for street-view scenarios is crucial for applications such as digital entertainment and autonomous driving simulation. However, existing street-view datasets, including KITTI, Waymo, and nuScenes, only offer noisy LiDAR points as ground-truth data for geometric evaluation of reconstructed surfaces. These geometric ground-truths often lack the necessary precision to evaluate surface positions and do not provide data for assessing surface normals. To overcome these challenges, we introduce the SS3DM dataset, comprising precise \textbfSynthetic \textbfStreet-view \textbf3D \textbfMesh models exported from the CARLA simulator. These mesh models facilitate accurate position evaluation and include normal vectors for evaluating surface normal. To simulate the input data in realistic driving scenarios for 3D reconstruction, we virtually drive a vehicle equipped with six RGB cameras and five LiDAR sensors in diverse outdoor scenes. Leveraging this dataset, we establish a benchmark for state-of-the-art surface reconstruction methods, providing a comprehensive evaluation of the associated challenges. For more information, visit our homepage at this https URL. Comments: NeurIPS 2024, Track on Datasets and Benchmarks Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.21739 [cs.CV] (or arXiv:2410.21739v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.21739 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-46] DiffSTR: Controlled Diffusion Models for Scene Text Removal

链接: https://arxiv.org/abs/2410.21721
作者: Sanhita Pathak,Vinay Kaushik,Brejesh Lall
关键词-EN: Scene Text Removal, Text Removal, prevent unauthorized, STR, Removal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 Pages, 6 Figures, 3 Tables

点击查看摘要

Abstract:To prevent unauthorized use of text in images, Scene Text Removal (STR) has become a crucial task. It focuses on automatically removing text and replacing it with a natural, text-less background while preserving significant details such as texture, color, and contrast. Despite its importance in privacy protection, STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows. Most STR approaches estimate a text region mask to train a model, solving for image translation or inpainting to generate a text-free image. Thus, the quality of the generated image depends on the accuracy of the inpainting mask and the generator’s capability. In this work, we leverage the superior capabilities of diffusion models in generating high-quality, consistent images to address the STR problem. We introduce a ControlNet diffusion model, treating STR as an inpainting task. To enhance the model’s robustness, we develop a mask pretraining pipeline to condition our diffusion model. This involves training a masked autoencoder (MAE) using a combination of box masks and coarse stroke masks, and fine-tuning it using masks derived from our novel segmentation-based mask refinement framework. This framework iteratively refines an initial mask and segments it using the SLIC and Hierarchical Feature Selection (HFS) algorithms to produce an accurate final text mask. This improves mask prediction and utilizes rich textural information in natural scene images to provide accurate inpainting masks. Experiments on the SCUT-EnsText and SCUT-Syn datasets demonstrate that our method significantly outperforms existing state-of-the-art techniques.

[CV-47] Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.21708
作者: Ruihao Xia,Yu Liang,Peng-Tao Jiang,Hao Zhang,Bo Li,Yang Tang,Pan Zhou
关键词-EN: unsupervised domain adaptation, domain adaptation methods, segmentation primarily focus, semantic segmentation primarily, abundant visual modalities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities like depth, infrared and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model’s cross-modality capabilities. Specifically, MADM comprises two key complementary components to tackle major challenges. First, due to the large modality gap, using one modal data to generate pseudo labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of latent low-resolution features in diffusion models, MADM introduces the label palette and latent regression which converts one-hot encoded labels into the RGB form by palette and regresses them in the latent space, thus ensuring the pre-trained decoder for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities. We open-source our code and models at this https URL.

[CV-48] Investigating Memorization in Video Diffusion Models

链接: https://arxiv.org/abs/2410.21669
作者: Chen Chen,Enhuai Liu,Daochang Liu,Mubarak Shah,Chang Xu
关键词-EN: potentially generating unauthorized, generating unauthorized copyrighted, Diffusion models, image diffusion models, unauthorized copyrighted content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this gap, we first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way that focuses on privacy preservation and applies to all generation types. We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. Additionally, we curate a dataset of text prompts that are most prone to triggering memorization when used as conditioning in VDMs. By leveraging these prompts, we generate diverse videos from various open-source VDMs, successfully extracting numerous training videos from each tested model. Through the application of our proposed metrics, we systematically analyze memorization across various pretrained VDMs, including text-conditional and unconditional models, on a variety of datasets. Our comprehensive study reveals that memorization is widespread across all tested VDMs, indicating that VDMs can also memorize image training data in addition to video datasets. Finally, we propose efficient and effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs.

[CV-49] Revisiting Multi-Granularity Representation via Group Contrastive Learning for Unsupervised Vehicle Re-identification

链接: https://arxiv.org/abs/2410.21667
作者: Zhigang Chang,Shibao Zheng
关键词-EN: surveillance camera views, disjoint surveillance camera, retrieving vehicle images, Vehicle ReID, vehicle ReID research
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vehicle re-identification (Vehicle ReID) aims at retrieving vehicle images across disjoint surveillance camera views. The majority of vehicle ReID research is heavily reliant upon supervisory labels from specific human-collected datasets for training. When applied to the large-scale real-world scenario, these models will experience dreadful performance declines due to the notable domain discrepancy between the source dataset and the target. To address this challenge, in this paper, we propose an unsupervised vehicle ReID framework (MGR-GCL). It integrates a multi-granularity CNN representation for learning discriminative transferable features and a contrastive learning module responsible for efficient domain adaptation in the unlabeled target domain. Specifically, after training the proposed Multi-Granularity Representation (MGR) on the labeled source dataset, we propose a group contrastive learning module (GCL) to generate pseudo labels for the target dataset, facilitating the domain adaptation process. We conducted extensive experiments and the results demonstrated our superiority against existing state-of-the-art methods.

[CV-50] Exploring Local Memorization in Diffusion Models via Bright Ending Attention

链接: https://arxiv.org/abs/2410.21665
作者: Chen Chen,Daochang Liu,Mubarak Shah,Chang Xu
关键词-EN: diffusion models prone, locating localized memorization, diffusion models, bright ending, locating localized
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:In this paper, we identify and leverage a novel `bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models. BE refers to a distinct cross-attention pattern observed in text-to-image generations using diffusion models. Specifically, memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches. This attention map effectively highlights regions where the generated image replicates training data. Furthermore, driven by our observation that local memorization significantly underperforms in existing tasks of measuring, detecting, and mitigating memorization in diffusion models compared to global memorization, we propose a simple yet effective method to integrate BE and the results of the new localization task into these existing frameworks. This integration effectively improves their performances by narrowing the performance gap caused by local memorization. Our results not only demonstrate the successful execution of the new localization task but also establish new state-of-the-art performance across all existing tasks, underscoring the significance of the BE phenomenon.

[CV-51] Discriminative Pedestrian Features and Gated Channel Attention for Clothes-Changing Person Re-Identification

链接: https://arxiv.org/abs/2410.21663
作者: Yongkang Ding,Rui Mao,Hanyue Zhu,Anqi Wang,Liyan Zhang
关键词-EN: Clothes-Changing Person Re-Identification, Person Re-Identification, Clothes-Changing Person, social life, increasingly significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The article has been accepted by IEEE International Conference on Multimedia and Expo 2024

点击查看摘要

Abstract:In public safety and social life, the task of Clothes-Changing Person Re-Identification (CC-ReID) has become increasingly significant. However, this task faces considerable challenges due to appearance changes caused by clothing alterations. Addressing this issue, this paper proposes an innovative method for disentangled feature extraction, effectively extracting discriminative features from pedestrian images that are invariant to clothing. This method leverages pedestrian parsing techniques to identify and retain features closely associated with individual identity while disregarding the variable nature of clothing attributes. Furthermore, this study introduces a gated channel attention mechanism, which, by adjusting the network’s focus, aids the model in more effectively learning and emphasizing features critical for pedestrian identity recognition. Extensive experiments conducted on two standard CC-ReID datasets validate the effectiveness of the proposed approach, with performance surpassing current leading solutions. The Top-1 accuracy under clothing change scenarios on the PRCC and VC-Clothes datasets reached 64.8% and 83.7%, respectively.

[CV-52] Fingerprints of Super Resolution Networks

链接: https://arxiv.org/abs/2410.21653
作者: Jeremy Vonderfecht,Feng Liu
关键词-EN: deep-learning based image, based image generation, recent studies, studies have demonstrated, demonstrated that deep-learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in Transactions on Machine Learning Research (2022)

点击查看摘要

Abstract:Several recent studies have demonstrated that deep-learning based image generation models, such as GANs, can be uniquely identified, and possibly even reverse-engineered, by the fingerprints they leave on their output images. We extend this research to single image super-resolution (SISR) networks. Compared to previously studied models, SISR networks are a uniquely challenging class of image generation model from which to extract and analyze fingerprints, as they can often generate images that closely match the corresponding ground truth and thus likely leave little flexibility to embed signatures. We take SISR models as examples to investigate if the findings from the previous work on fingerprints of GAN-based networks are valid for general image generation models. We show that SISR networks with a high upscaling factor or trained using adversarial loss leave highly distinctive fingerprints, and that under certain conditions, some SISR network hyperparameters can be reverse-engineered from these fingerprints.

[CV-53] Predicting the Encoding Error of SIRENs

链接: https://arxiv.org/abs/2410.21645
作者: Jeremy Vonderfecht,Feng Liu
关键词-EN: Implicit Neural Representations, Implicit Neural, Neural Representations, neural network size, Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in Transactions on Machine Learning Research (2024)

点击查看摘要

Abstract:Implicit Neural Representations (INRs), which encode signals such as images, videos, and 3D shapes in the weights of neural networks, are becoming increasingly popular. Among their many applications is signal compression, for which there is great interest in achieving the highest possible fidelity to the original signal subject to constraints such as neural network size, training (encoding) and inference (decoding) time. But training INRs can be a computationally expensive process, making it challenging to determine the best possible tradeoff under such constraints. Towards this goal, we present a method which predicts the encoding error that a popular INR network (SIREN) will reach, given its network hyperparameters and the signal to encode. This method is trained on a unique dataset of 300,000 SIRENs, trained across a variety of images and hyperparameters. (Dataset available here: this https URL.) Our predictive method demonstrates the feasibility of this regression problem, and allows users to anticipate the encoding error that a SIREN network will reach in milliseconds instead of minutes or longer. We also provide insights into the behavior of SIREN networks, such as why narrow SIRENs can have very high random variation in encoding error, and how the performance of SIRENs relates to JPEG compression.

[CV-54] On filter design in deep convolutional neural network

链接: https://arxiv.org/abs/2410.21644
作者: Gaurav Hirani,Waleed Abdulla
关键词-EN: convolutional neural network, deep convolutional neural, neural network, deep convolutional, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The deep convolutional neural network (DCNN) in computer vision has given promising results. It is widely applied in many areas, from medicine, agriculture, self-driving car, biometric system, and almost all computer vision-based applications. Filters or weights are the critical elements responsible for learning in DCNN. Backpropagation has been the primary learning algorithm for DCNN and provides promising results, but the size and numbers of the filters remain hyper-parameters. Various studies have been done in the last decade on semi-supervised, self-supervised, and unsupervised methods and their properties. The effects of filter initialization, size-shape selection, and the number of filters on learning and optimization have not been investigated in a separate publication to collate all the options. Such attributes are often treated as hyper-parameters and lack mathematical understanding. Computer vision algorithms have many limitations in real-life applications, and understanding the learning process is essential to have some significant improvement. To the best of our knowledge, no separate investigation has been published discussing the filters; this is our primary motivation. This study focuses on arguments for choosing specific physical parameters of filters, initialization, and learning technic over scattered methods. The promising unsupervised approaches have been evaluated. Additionally, the limitations, current challenges, and future scope have been discussed in this paper.

[CV-55] Neural Experts: Mixture of Experts for Implicit Neural Representations NEURIPS2024

链接: https://arxiv.org/abs/2410.21643
作者: Yizhak Ben-Shabat,Chamin Hewa Koneputugodage,Sameera Ramasinghe,Stephen Gould
关键词-EN: proven effective, Implicit neural, Implicit neural representations, Implicit, reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Implicit neural representations (INRs) have proven effective in various tasks including image, shape, audio, and video reconstruction. These INRs typically learn the implicit field from sampled input points. This is often done using a single network for the entire domain, imposing many global constraints on a single function. In this paper, we propose a mixture of experts (MoE) implicit neural representation approach that enables learning local piece-wise continuous functions that simultaneously learns to subdivide the domain and fit locally. We show that incorporating a mixture of experts architecture into existing INR formulations provides a boost in speed, accuracy, and memory requirements. Additionally, we introduce novel conditioning and pretraining methods for the gating network that improves convergence to the desired solution. We evaluate the effectiveness of our approach on multiple reconstruction tasks, including surface reconstruction, image reconstruction, and audio signal reconstruction and show improved performance compared to non-MoE methods.

[CV-56] Investigation of moving objects through atmospheric turbulence from a non-stationary platform

链接: https://arxiv.org/abs/2410.21639
作者: Nicholas Ferrante,Jerome Gilles,Shibin Parameswaran
关键词-EN: optical flow field, flow field, optical flow, flow field induced, scene impacted
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we extract the optical flow field corresponding to moving objects from an image sequence of a scene impacted by atmospheric turbulence \emphand captured from a moving camera. Our procedure first computes the optical flow field and creates a motion model to compensate for the flow field induced by camera motion. After subtracting the motion model from the optical flow, we proceed with our previous work, Gilles et al~\citegilles2018detection, where a spatial-temporal cartoon+texture inspired decomposition is performed on the motion-compensated flow field in order to separate flows corresponding to atmospheric turbulence and object motion. Finally, the geometric component is processed with the detection and tracking method and is compared against a ground truth. All of the sequences and code used in this work are open source and are available by contacting the authors.

[CV-57] Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis NEURIPS2024

链接: https://arxiv.org/abs/2410.21638
作者: Deepak Sridhar,Abhishek Peri,Rohith Rachala,Nuno Vasconcelos
关键词-EN: Recent advances, enabled breakthroughs, advances in generative, generative modeling, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024 conference. Project Page: this https URL

点击查看摘要

Abstract:Recent advances in generative modeling with diffusion processes (DPs) enabled breakthroughs in image synthesis. Despite impressive image quality, these models have various prompt compliance problems, including low recall in generating multiple objects, difficulty in generating text in images, and meeting constraints like object locations and pose. For fine-grained editing and manipulation, they also require fine-grained semantic or instance maps that are tedious to produce manually. While prompt compliance can be enhanced by addition of loss functions at inference, this is time consuming and does not scale to complex scenes. To overcome these limitations, this work introduces a new family of \textitFactor Graph Diffusion Models (FG-DMs) that models the joint distribution of images and conditioning variables, such as semantic, sketch, depth or normal maps via a factor graph decomposition. This joint structure has several advantages, including support for efficient sampling based prompt compliance schemes, which produce images of high object recall, semi-automated fine-grained editing, text-based editing of conditions with noise inversion, explainability at intermediate levels, ability to produce labeled datasets for the training of downstream models such as segmentation or depth, training with missing data, and continual learning where new conditioning variables can be added with minimal or no modifications to the existing structure. We propose an implementation of FG-DMs by adapting a pre-trained Stable Diffusion (SD) model to implement all FG-DM factors, using only COCO dataset, and show that it is effective in generating images with 15% higher recall than SD while retaining its generalization ability. We introduce an attention distillation loss that encourages consistency among the attention maps of all factors, improving the fidelity of the generated conditions and image.

[CV-58] OFER: Occluded Face Expression Reconstruction

链接: https://arxiv.org/abs/2410.21629
作者: Pratheba Selvaraju,Victoria Fernandez Abrevaya,Timo Bolkart,Rick Akkerman,Tianyu Ding,Faezeh Amjadi,Ilya Zharkov
关键词-EN: inherently ill-posed problem, inherently ill-posed, Reconstructing, ill-posed problem, problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions. In addition to fewer available observations, occlusions introduce an extra source of ambiguity, where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature. In this paper we introduce OFER, a novel approach for single image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces, even under strong occlusions. Specifically, we train two diffusion models to generate the shape and expression coefficients of a face parametric model, conditioned on the input image. This approach captures the multi-modal nature of the problem, generating a distribution of solutions as output. Although this addresses the ambiguity problem, the challenge remains to pick the best matching shape to ensure consistency across diverse expressions. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on the predicted shape accuracy scores to select the best match. We evaluate our method using standard benchmarks and introduce CO-545, a new protocol and dataset designed to assess the accuracy of expressive faces under occlusion. Our results show improved performance over occlusion-based methods, with added ability to generate multiple expressions for a given image.

[CV-59] NYC-Event-VPR: A Large-Scale High-Resolution Event-Based Visual Place Recognition Dataset in Dense Urban Environments

链接: https://arxiv.org/abs/2410.21615
作者: Taiyi Pan,Junyang He,Chao Chen,Yiming Li,Chen Feng
关键词-EN: Visual place recognition, enables autonomous robots, identify previously visited, Visual place, previously visited locations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual place recognition (VPR) enables autonomous robots to identify previously visited locations, which contributes to tasks like simultaneous localization and mapping (SLAM). VPR faces challenges such as accurate image neighbor retrieval and appearance change in scenery. Event cameras, also known as dynamic vision sensors, are a new sensor modality for VPR and offer a promising solution to the challenges with their unique attributes: high temporal resolution (1MHz clock), ultra-low latency (in \mus), and high dynamic range (120dB). These attributes make event cameras less susceptible to motion blur and more robust in variable lighting conditions, making them suitable for addressing VPR challenges. However, the scarcity of event-based VPR datasets, partly due to the novelty and cost of event cameras, hampers their adoption. To fill this data gap, our paper introduces the NYC-Event-VPR dataset to the robotics and computer vision communities, featuring the Prophesee IMX636 HD event sensor (1280x720 resolution), combined with RGB camera and GPS module. It encompasses over 13 hours of geotagged event data, spanning 260 kilometers across New York City, covering diverse lighting and weather conditions, day/night scenarios, and multiple visits to various locations. Furthermore, our paper employs three frameworks to conduct generalization performance assessments, promoting innovation in event-based VPR and its integration into robotics applications.

[CV-60] opological numbers and their use to characterize simple points for 2D binary images

链接: https://arxiv.org/abs/2410.21588
作者: Christophe Lohou
关键词-EN: binary images, topological numbers, efficiently characterize simple, characterize simple points, proposed to efficiently
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
*备注:

点击查看摘要

Abstract:In this paper, we adapt the two topological numbers, which have been proposed to efficiently characterize simple points in specific neighborhoods for 3D binary images, to the case of 2D binary images. Unlike the 3D case, we only use a single neighborhood to define these two topological numbers for the 2D case. Then, we characterize simple points either by using the two topological numbers or by a single topological number linked to another one condition. We compare the characterization of simple points by topological numbers with two other ones based on Hilditch crossing number and Yokoi number. We also highlight the number of possible configurations corresponding to a simple point, which also represents the maximum limit of local configurations that a thinning algorithm operating by parallel deletion of simple (individual) points may delete while preserving topology (limit usually not reachable, depending on the deletion strategy).

[CV-61] MVSDet: Multi-View Indoor 3D Object Detection via Efficient Plane Sweeps NEURIPS2024

链接: https://arxiv.org/abs/2410.21566
作者: Yating Xu,Chen Li,Gim Hee Lee
关键词-EN: multi-view indoor, images for precise, key challenge, challenge of multi-view, information from images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:The key challenge of multi-view indoor 3D object detection is to infer accurate geometry information from images for precise 3D detection. Previous method relies on NeRF for geometry reasoning. However, the geometry extracted from NeRF is generally inaccurate, which leads to sub-optimal detection performance. In this paper, we propose MVSDet which utilizes plane sweep for geometry-aware 3D object detection. To circumvent the requirement for a large number of depth planes for accurate depth prediction, we design a probabilistic sampling and soft weighting mechanism to decide the placement of pixel features on the 3D volume. We select multiple locations that score top in the probability volume for each pixel and use their probability score to indicate the confidence. We further apply recent pixel-aligned Gaussian Splatting to regularize depth prediction and improve detection performance with little computation overhead. Extensive experiments on ScanNet and ARKitScenes datasets are conducted to show the superiority of our model. Our code is available at this https URL.

[CV-62] Empirical curvelet based Fully Convolutional Network for supervised texture image segmentation

链接: https://arxiv.org/abs/2410.21562
作者: Yuan Huang,Fugen Zhou,Jerome Gilles
关键词-EN: supervised texture classification, perform supervised texture, Fully Convolutional Network, perform supervised, Fully Convolutional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a new approach to perform supervised texture classification/segmentation. The proposed idea is to feed a Fully Convolutional Network with specific texture descriptors. These texture features are extracted from images by using an empirical curvelet transform. We propose a method to build a unique empirical curvelet filter bank adapted to a given dictionary of textures. We then show that the output of these filters can be used to build efficient texture descriptors utilized to finally feed deep learning networks. Our approach is finally evaluated on several datasets and compare the results to various state-of-the-art algorithms and show that the proposed method dramatically outperform all existing ones.

[CV-63] Super-resolution in disordered media using neural networks

链接: https://arxiv.org/abs/2410.21556
作者: Alexander Christie,Matan Leibovitch,Miguel Moscoso,Alexei Novikov,George Papanicolaou,Chrysoula Tsogka
关键词-EN: diverse data sets, medium Green functions, strongly scattering media, ambient medium Green, Green functions
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We propose a methodology that exploits large and diverse data sets to accurately estimate the ambient medium’s Green’s functions in strongly scattering media. Given these estimates, obtained with and without the use of neural networks, excellent imaging results are achieved, with a resolution that is better than that of a homogeneous medium. This phenomenon, also known as super-resolution, occurs because the ambient scattering medium effectively enhances the physical imaging aperture.

[CV-64] Detection of moving objects through turbulent media. Decomposition of Oscillatory vs Non-Oscillatory spatio-temporal vector fields

链接: https://arxiv.org/abs/2410.21551
作者: Jerome Gilles,Francis Alvarez,Nicholas B. Ferrante,Margaret Fortman,Lena Tahir,Alex Tarter,Anneke von Seeger
关键词-EN: detected when images, images are impacted, impacted by atmospheric, moving objects, atmospheric turbulence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we investigate how moving objects can be detected when images are impacted by atmospheric turbulence. We present a geometric spatio-temporal point of view to the problem and show that it is possible to distinguish movement due to the turbulence vs. moving objects. To perform this task, we propose an extension of 2D cartoon+texture decomposition algorithms to 3D vector fields. Our algorithm is based on curvelet spaces which permit to better characterize the movement flow geometry. We present experiments on real data which illustrate the efficiency of the proposed method.

[CV-65] ECMamba: Consolidating Selective State Space Model with Retinex Guidance for Efficient Multiple Exposure Correction NEURIPS2024

链接: https://arxiv.org/abs/2410.21535
作者: Wei Dong,Han Zhou,Yulun Zhang,Xiaohong Liu,Jun Chen
关键词-EN: recover proper exposure, proper exposure conditions, Exposure Correction, Exposure Correction Mamba, aims to recover
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Retinex-theory, Mamba, Exposure Correction

点击查看摘要

Abstract:Exposure Correction (EC) aims to recover proper exposure conditions for images captured under over-exposure or under-exposure scenarios. While existing deep learning models have shown promising results, few have fully embedded Retinex theory into their architecture, highlighting a gap in current methodologies. Additionally, the balance between high performance and efficiency remains an under-explored problem for exposure correction task. Inspired by Mamba which demonstrates powerful and highly efficient sequence modeling, we introduce a novel framework based on Mamba for Exposure Correction (ECMamba) with dual pathways, each dedicated to the restoration of reflectance and illumination map, respectively. Specifically, we firstly derive the Retinex theory and we train a Retinex estimator capable of mapping inputs into two intermediary spaces, each approximating the target reflectance and illumination map, respectively. This setup facilitates the refined restoration process of the subsequent Exposure Correction Mamba Module (ECMM). Moreover, we develop a novel 2D Selective State-space layer guided by Retinex information (Retinex-SS2D) as the core operator of ECMM. This architecture incorporates an innovative 2D scanning strategy based on deformable feature aggregation, thereby enhancing both efficiency and effectiveness. Extensive experiment results and comprehensive ablation studies demonstrate the outstanding performance and the importance of each component of our proposed ECMamba. Code is available at this https URL.

[CV-66] Constrained Transformer-Based Porous Media Generation to Spatial Distribution of Rock Properties

链接: https://arxiv.org/abs/2410.21462
作者: Zihan Ren,Sanjay Srinivasan,Dustin Crandall
关键词-EN: Geologic Carbon Storage, Carbon Storage, Geologic Carbon, studying complex subsurface, complex subsurface processes
类目: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
*备注: 24 pages

点击查看摘要

Abstract:Pore-scale modeling of rock images based on information in 3D micro-computed tomography data is crucial for studying complex subsurface processes such as CO2 and brine multiphase flow during Geologic Carbon Storage (GCS). While deep learning models can generate 3D rock microstructures that match static rock properties, they have two key limitations: they don’t account for the spatial distribution of rock properties that can have an important influence on the flow and transport characteristics (such as permeability and relative permeability) of the rock and they generate structures below the representative elementary volume (REV) scale for those transport properties. Addressing these issues is crucial for building a consistent workflow between pore-scale analysis and field-scale modeling. To address these challenges, we propose a two-stage modeling framework that combines a Vector Quantized Variational Autoencoder (VQVAE) and a transformer model for spatial upscaling and arbitrary-size 3D porous media reconstruction in an autoregressive manner. The VQVAE first compresses and quantizes sub-volume training images into low-dimensional tokens, while we train a transformer to spatially assemble these tokens into larger images following specific spatial order. By employing a multi-token generation strategy, our approach preserves both sub-volume integrity and spatial relationships among these sub-image patches. We demonstrate the effectiveness of our multi-token transformer generation approach and validate it using real data from a test well, showcasing its potential to generate models for the porous media at the well scale using only a spatial porosity model. The interpolated representative porous media that reflect field-scale geological properties accurately model transport properties, including permeability and multiphase flow relative permeability of CO2 and brine.

[CV-67] SocialGPT: Prompting LLM s for Social Relation Reasoning via Greedy Segment Optimization NEURIPS2024

链接: https://arxiv.org/abs/2410.21411
作者: Wanhua Li,Zibin Meng,Jiawei Zhou,Donglai Wei,Chuang Gan,Hanspeter Pfister
关键词-EN: identify relation categories, Vision Foundation Models, Large Language Models, relation reasoning aims, aims to identify
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named \name, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. \name introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at this https URL.

[CV-68] Domain Adaptation with a Single Vision-Language Embedding

链接: https://arxiv.org/abs/2410.21361
作者: Mohammad Fahes,Tuan-Hung Vu,Andrei Bursuc,Patrick Pérez,Raoul de Charette
关键词-EN: training time, uncommon conditions, Domain adaptation, extensively investigated, investigated in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in some uncommon conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the zero-shot and one-shot settings.

[CV-69] ArCSEM: Artistic Colorization of SEM Images via Gaussian Splatting ECCV

链接: https://arxiv.org/abs/2410.21310
作者: Takuma Nishimura,Andreea Dogaru,Martin Oeggerli,Bernhard Egger
关键词-EN: Scanning Electron Microscopes, Scanning Electron, Electron Microscopes, capture highly detailed, offering the capability
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: presented and published at AI for Visual Arts Workshop and Challenges (AI4VA) in conjunction with European Conference on Computer Vision (ECCV) 2024, Milano, Italy

点击查看摘要

Abstract:Scanning Electron Microscopes (SEMs) are widely renowned for their ability to analyze the surface structures of microscopic objects, offering the capability to capture highly detailed, yet only grayscale, images. To create more expressive and realistic illustrations, these images are typically manually colorized by an artist with the support of image editing software. This task becomes highly laborious when multiple images of a scanned object require colorization. We propose facilitating this process by using the underlying 3D structure of the microscopic scene to propagate the color information to all the captured images, from as little as one colorized view. We explore several scene representation techniques and achieve high-quality colorized novel view synthesis of a SEM scene. In contrast to prior work, there is no manual intervention or labelling involved in obtaining the 3D representation. This enables an artist to color a single or few views of a sequence and automatically retrieve a fully colored scene or video. Project page: this https URL

[CV-70] A Robust Anchor-based Method for Multi-Camera Pedestrian Localization

链接: https://arxiv.org/abs/2410.21308
作者: Wanyu Zhang,Jiaqi Zhang,Dongdong Ge,Yu Lin,Huiwen Yang,Huikang Liu,Yinyu Ye
关键词-EN: vision-based pedestrian localization, vision-based pedestrian, pedestrian location, paper addresses, addresses the problem
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of vision-based pedestrian localization, which estimates a pedestrian’s location using images and camera parameters. In practice, however, calibrated camera parameters often deviate from the ground truth, leading to inaccuracies in localization. To address this issue, we propose an anchor-based method that leverages fixed-position anchors to reduce the impact of camera parameter errors. We provide a theoretical analysis that demonstrates the robustness of our approach. Experiments conducted on simulated, real-world, and public datasets show that our method significantly improves localization accuracy and remains resilient to noise in camera parameters, compared to methods without anchors.

[CV-71] VideoSAM: A Large Vision Foundation Model for High-Speed Video Segmentation

链接: https://arxiv.org/abs/2410.21304
作者: Chika Maduabuchi,Ericmoore Jossou,Matteo Bucci
关键词-EN: boiling heat transfer, analyzing dynamic physical, dynamic physical processes, High-speed video, industrial applications
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:High-speed video (HSV) segmentation is essential for analyzing dynamic physical processes in scientific and industrial applications, such as boiling heat transfer. Existing models like U-Net struggle with generalization and accurately segmenting complex bubble formations. We present VideoSAM, a specialized adaptation of the Segment Anything Model (SAM), fine-tuned on a diverse HSV dataset for phase detection. Through diverse experiments, VideoSAM demonstrates superior performance across four fluid environments – Water, FC-72, Nitrogen, and Argon – significantly outperforming U-Net in complex segmentation tasks. In addition to introducing VideoSAM, we contribute an open-source HSV segmentation dataset designed for phase detection, enabling future research in this domain. Our findings underscore VideoSAM’s potential to set new standards in robust and accurate HSV segmentation. The code and dataset used in this study are available online at this https URL .

[CV-72] Domain-Adaptive Pre-training of Self-Supervised Foundation Models for Medical Image Classification in Gastrointestinal Endoscopy

链接: https://arxiv.org/abs/2410.21302
作者: Marcel Roth,Micha V. Nowak
关键词-EN: enabling early disease, early disease detection, Video capsule endoscopy, capturing detailed images, Video capsule
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Video capsule endoscopy has transformed gastrointestinal endoscopy (GIE) diagnostics by offering a non-invasive method for capturing detailed images of the gastrointestinal tract, enabling early disease detection. However, its potential is limited by the sheer volume of images generated during the imaging procedure, which can take anywhere from 6-8 hours and often produce up to 1 million images, necessitating automated analysis. Additionally, the variability of these images, combined with the need for expert annotations and the scarcity of large, high-quality labeled datasets, constrains the effectiveness of current medical image analysis models. To address this, we introduce a novel large gastrointestinal endoscopy dataset, called EndoExtend24, created by merging and re-stratifying the train/test splits of ten existing public and private datasets, ensuring no overlap of patient data across splits. EndoExtend24 includes over 226,000 labeled images, as well as dynamic class mappings, which allow unified training across datasets with differing labeling granularity, supporting up to 123 distinct pathological findings. Further, we propose to leverage domain adaptive pre-training of foundation models in computer vision trained with self-supervision on generic image data, to adapt them to the task of GIE medical diagnosis. Specifically, the EVA-02 model, which is based on the vision transformer architecture and was trained on ImageNet-22k with masked image modeling (using EVA-CLIP as a MIM teacher), is pre-trained on the novel EndoExtend24 dataset to achieve domain adaptation, and finally trained on the Capsule Endoscopy 2024 Challenge dataset. Experimental results show promising results on the challenge validation set, with an AUC Macro score of 0.993 and a balanced accuracy of 89.3%.

[CV-73] Contrastive Learning with Auxiliary User Detection for Identifying Activities ICML

链接: https://arxiv.org/abs/2410.21300
作者: Wen Ge,Guanyi Mou,Emmanuel O. Agu,Kyumin Lee
关键词-EN: Human Activity Recognition, far-reaching real-world applications, Human Activity, Activity Recognition, ubiquitous computing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted in ICMLA 2024

点击查看摘要

Abstract:Human Activity Recognition (HAR) is essential in ubiquitous computing, with far-reaching real-world applications. While recent SOTA HAR research has demonstrated impressive performance, some key aspects remain under-explored. Firstly, HAR can be both highly contextualized and personalized. However, prior work has predominantly focused on being Context-Aware (CA) while largely ignoring the necessity of being User-Aware (UA). We argue that addressing the impact of innate user action-performing differences is equally crucial as considering external contextual environment settings in HAR tasks. Secondly, being user-aware makes the model acknowledge user discrepancies but does not necessarily guarantee mitigation of these discrepancies, i.e., unified predictions under the same activities. There is a need for a methodology that explicitly enforces closer (different user, same activity) representations. To bridge this gap, we introduce CLAUDIA, a novel framework designed to address these issues. Specifically, we expand the contextual scope of the CA-HAR task by integrating User Identification (UI) within the CA-HAR framework, jointly predicting both CA-HAR and UI in a new task called User and Context-Aware HAR (UCA-HAR). This approach enriches personalized and contextual understanding by jointly learning user-invariant and user-specific patterns. Inspired by SOTA designs in the visual domain, we introduce a supervised contrastive loss objective on instance-instance pairs to enhance model efficacy and improve learned feature quality. Evaluation across three real-world CA-HAR datasets reveals substantial performance enhancements, with average improvements ranging from 5.8% to 14.1% in Matthew’s Correlation Coefficient and 3.0% to 7.2% in Macro F1 score.

[CV-74] V-3DG: Mastering Text-to-3D Customized Generation with Visual Prompt

链接: https://arxiv.org/abs/2410.21299
作者: Jiahui Yang,Donglin Di,Baorui Ma,Xun Yang,Yongjia Ma,Wenzhang Sun,Wei Chen,Jianxun Cui,Zhou Xue,Meng Wang,Yebin Liu
关键词-EN: recent years, advancements in generative, generative models, models have significantly, significantly expanded
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, advancements in generative models have significantly expanded the capabilities of text-to-3D generation. Many approaches rely on Score Distillation Sampling (SDS) technology. However, SDS struggles to accommodate multi-condition inputs, such as text and visual prompts, in customized generation tasks. To explore the core reasons, we decompose SDS into a difference term and a classifier-free guidance term. Our analysis identifies the core issue as arising from the difference term and the random noise addition during the optimization process, both contributing to deviations from the target mode during distillation. To address this, we propose a novel algorithm, Classifier Score Matching (CSM), which removes the difference term in SDS and uses a deterministic noise addition process to reduce noise during optimization, effectively overcoming the low-quality limitations of SDS in our customized generation framework. Based on CSM, we integrate visual prompt information with an attention fusion mechanism and sampling guidance techniques, forming the Visual Prompt CSM (VPCSM) algorithm. Furthermore, we introduce a Semantic-Geometry Calibration (SGC) module to enhance quality through improved textual information integration. We present our approach as TV-3DG, with extensive experiments demonstrating its capability to achieve stable, high-quality, customized 3D generation. Project page: \urlthis https URL

[CV-75] Guide3D: A Bi-planar X-ray Dataset for 3D Shape Reconstruction ACCV2024

链接: https://arxiv.org/abs/2410.22224
作者: Tudor Jianu,Baoru Huang,Hoan Nguyen,Binod Bhattarai,Tuong Do,Erman Tjiputra,Quang Tran,Pierre Berthet-Rayne,Ngan Le,Sebastiano Fichera,Anh Nguyen
关键词-EN: endovascular tool navigation, Endovascular surgical tool, surgical tool reconstruction, tool navigation, important factor
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ACCV 2024

点击查看摘要

Abstract:Endovascular surgical tool reconstruction represents an important factor in advancing endovascular tool navigation, which is an important step in endovascular surgery. However, the lack of publicly available datasets significantly restricts the development and validation of novel machine learning approaches. Moreover, due to the need for specialized equipment such as biplanar scanners, most of the previous research employs monoplanar fluoroscopic technologies, hence only capturing the data from a single view and significantly limiting the reconstruction accuracy. To bridge this gap, we introduce Guide3D, a bi-planar X-ray dataset for 3D reconstruction. The dataset represents a collection of high resolution bi-planar, manually annotated fluoroscopic videos, captured in real-world settings. Validating our dataset within a simulated environment reflective of clinical settings confirms its applicability for real-world applications. Furthermore, we propose a new benchmark for guidewrite shape prediction, serving as a strong baseline for future work. Guide3D not only addresses an essential need by offering a platform for advancing segmentation and 3D reconstruction techniques but also aids the development of more accurate and efficient endovascular surgery interventions. Our project is available at this https URL.

[CV-76] MAPUNetR: A Hybrid Vision Transformer and U-Net Architecture for Efficient and Interpretable Medical Image Segmentation

链接: https://arxiv.org/abs/2410.22223
作者: Ovais Iqbal Shah,Danish Raza Rizvi,Aqib Nazir Mir
关键词-EN: informing treatment strategies, tracking disease progression, enhancing diagnostic accuracy, Medical image segmentation, pivotal in healthcare
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image segmentation is pivotal in healthcare, enhancing diagnostic accuracy, informing treatment strategies, and tracking disease progression. This process allows clinicians to extract critical information from visual data, enabling personalized patient care. However, developing neural networks for segmentation remains challenging, especially when preserving image resolution, which is essential in detecting subtle details that influence diagnoses. Moreover, the lack of transparency in these deep learning models has slowed their adoption in clinical practice. Efforts in model interpretability are increasingly focused on making these models’ decision-making processes more transparent. In this paper, we introduce MAPUNetR, a novel architecture that synergizes the strengths of transformer models with the proven U-Net framework for medical image segmentation. Our model addresses the resolution preservation challenge and incorporates attention maps highlighting segmented regions, increasing accuracy and interpretability. Evaluated on the BraTS 2020 dataset, MAPUNetR achieved a dice score of 0.88 and a dice coefficient of 0.92 on the ISIC 2018 dataset. Our experiments show that the model maintains stable performance and potential as a powerful tool for medical image segmentation in clinical practice.

[CV-77] DINeuro: Distilling Knowledge from 2D Natural Images via Deformable Tubular Transferring Strategy for 3D Neuron Reconstruction

链接: https://arxiv.org/abs/2410.22078
作者: Yik San Cheng,Runkai Zhao,Heng Wang,Hanchuan Peng,Yui Lo,Yuqian Chen,Lauren J. O’Donnell,Weidong Cai
关键词-EN: Reconstructing neuron morphology, light microscope imaging, analyzing brain networks, microscope imaging data, Reconstructing neuron
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 3 figures, and 2 tables. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Reconstructing neuron morphology from 3D light microscope imaging data is critical to aid neuroscientists in analyzing brain networks and neuroanatomy. With the boost from deep learning techniques, a variety of learning-based segmentation models have been developed to enhance the signal-to-noise ratio of raw neuron images as a pre-processing step in the reconstruction workflow. However, most existing models directly encode the latent representative features of volumetric neuron data but neglect their intrinsic morphological knowledge. To address this limitation, we design a novel framework that distills the prior knowledge from a 2D Vision Transformer pre-trained on extensive 2D natural images to facilitate neuronal morphological learning of our 3D Vision Transformer. To bridge the knowledge gap between the 2D natural image and 3D microscopic morphologic domains, we propose a deformable tubular transferring strategy that adapts the pre-trained 2D natural knowledge to the inherent tubular characteristics of neuronal structure in the latent embedding space. The experimental results on the Janelia dataset of the BigNeuron project demonstrate that our method achieves a segmentation performance improvement of 4.53% in mean Dice and 3.56% in mean 95% Hausdorff distance.

[CV-78] FANCL: Feature-Guided Attention Network with Curriculum Learning for Brain Metastases Segmentation

链接: https://arxiv.org/abs/2410.22057
作者: Zijiang Liu,Xiaoyu Liu,Linhao Qu,Yonghong Shi
关键词-EN: Accurate segmentation, follow-up of patients, diagnosis and follow-up, Accurate, segmentation performance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate segmentation of brain metastases (BMs) in MR image is crucial for the diagnosis and follow-up of patients. Methods based on deep convolutional neural networks (CNNs) have achieved high segmentation performance. However, due to the loss of critical feature information caused by convolutional and pooling operations, CNNs still face great challenges in small BMs segmentation. Besides, BMs are irregular and easily confused with healthy tissues, which makes it difficult for the model to effectively learn tumor structure during training. To address these issues, this paper proposes a novel model called feature-guided attention network with curriculum learning (FANCL). Based on CNNs, FANCL utilizes the input image and its feature to establish the intrinsic connections between metastases of different sizes, which can effectively compensate for the loss of high-level feature from small tumors with the information of large tumors. Furthermore, FANCL applies the voxel-level curriculum learning strategy to help the model gradually learn the structure and details of BMs. And baseline models of varying depths are employed as curriculum-mining networks for organizing the curriculum progression. The evaluation results on the BraTS-METS 2023 dataset indicate that FANCL significantly improves the segmentation performance, confirming the effectiveness of our method.

[CV-79] Analyzing Noise Models and Advanced Filtering Algorithms for Image Enhancement

链接: https://arxiv.org/abs/2410.21946
作者: Sahil Ali Akbar,Ananya Verma
关键词-EN: transmission or capturing, Image Processing, time of transmission, image, Noise
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Noise, an unwanted component in an image, can be the reason for the degradation of Image at the time of transmission or capturing. Noise reduction from images is still a challenging task. Digital Image Processing is a component of Digital signal processing. A wide variety of algorithms can be used in image processing to apply to an image or an input dataset and obtain important outcomes. In image processing research, removing noise from images before further analysis is essential. Post-noise removal of images improves clarity, enabling better interpretation and analysis across medical imaging, satellite imagery, and radar applications. While numerous algorithms exist, each comes with its own assumptions, strengths, and limitations. The paper aims to evaluate the effectiveness of different filtering techniques on images with eight types of noise. It evaluates methodologies like Wiener, Median, Gaussian, Mean, Low pass, High pass, Laplacian and bilateral filtering, using the performance metric Peak signal to noise ratio. It shows us the impact of different filters on noise models by applying a variety of filters to various kinds of noise. Additionally, it also assists us in determining which filtering strategy is most appropriate for a certain noise model based on the circumstances.

[CV-80] CT to PET Translation: A Large-scale Dataset and Domain-Knowledge-Guided Diffusion Approach WACV

链接: https://arxiv.org/abs/2410.21932
作者: Dac Thai Nguyen,Trung Thanh Nguyen,Huu Tien Nguyen,Thanh Trung Nguyen,Huy Hieu Pham,Thanh Hung Nguyen,Thao Nguyen Truong,Phi Le Nguyen
关键词-EN: Positron Emission Tomography, Emission Tomography, Computed Tomography, Positron Emission, PET
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Positron Emission Tomography (PET) and Computed Tomography (CT) are essential for diagnosing, staging, and monitoring various diseases, particularly cancer. Despite their importance, the use of PET/CT systems is limited by the necessity for radioactive materials, the scarcity of PET scanners, and the high cost associated with PET imaging. In contrast, CT scanners are more widely available and significantly less expensive. In response to these challenges, our study addresses the issue of generating PET images from CT images, aiming to reduce both the medical examination cost and the associated health risks for patients. Our contributions are twofold: First, we introduce a conditional diffusion model named CPDM, which, to our knowledge, is one of the initial attempts to employ a diffusion model for translating from CT to PET images. Second, we provide the largest CT-PET dataset to date, comprising 2,028,628 paired CT-PET images, which facilitates the training and evaluation of CT-to-PET translation models. For the CPDM model, we incorporate domain knowledge to develop two conditional maps: the Attention map and the Attenuation map. The former helps the diffusion process focus on areas of interest, while the latter improves PET data correction and ensures accurate diagnostic information. Experimental evaluations across various benchmarks demonstrate that CPDM surpasses existing methods in generating high-quality PET images in terms of multiple metrics. The source code and data samples are available at this https URL.

机器学习

[LG-0] Optimizing Posterior Samples for Bayesian Optimization via Rootfinding

链接: https://arxiv.org/abs/2410.22322
作者: Taiwo A. Adebiyi,Bach Do,Ruda Zhang
关键词-EN: Bayesian optimization devolves, Bayesian optimization, costly objective function, costly objective, global optimization
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian optimization devolves the global optimization of a costly objective function to the global optimization of a sequence of acquisition functions. This inner-loop optimization can be catastrophically difficult if it involves posterior samples, especially in higher dimensions. We introduce an efficient global optimization strategy for posterior samples based on global rootfinding. It provides gradient-based optimizers with judiciously selected starting points, designed to combine exploitation and exploration. The algorithm scales practically linearly to high dimensions. For posterior sample-based acquisition functions such as Gaussian process Thompson sampling (GP-TS) and variants of entropy search, we demonstrate remarkable improvement in both inner- and outer-loop optimization, surprisingly outperforming alternatives like EI and GP-UCB in most cases. We also propose a sample-average formulation of GP-TS, which has a parameter to explicitly control exploitation and can be computed at the cost of one posterior sample. Our implementation is available at this https URL .

[LG-1] Online Detecting LLM -Generated Texts via Sequential Hypothesis Testing by Betting

链接: https://arxiv.org/abs/2410.22318
作者: Can Chen,Jun-Kun Wang
关键词-EN: garnered substantial attention, recent years, Developing algorithms, machine-generated texts, garnered substantial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically concern an offline setting where a dataset containing a mix of real and machine-generated texts is given upfront, and the task is to determine whether each sample in the dataset is from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, or on other forums publish content in a streaming fashion. Therefore, in this online scenario, how to quickly and accurately determine whether the source is an LLM with strong statistical guarantees is crucial for these media or platforms to function effectively and prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we develop an algorithm based on the techniques of sequential hypothesis testing by betting that not only builds upon and complements existing offline detection techniques but also enjoys statistical guarantees, which include a controlled false positive rate and the expected time to correctly identify a source as an LLM. Experiments were conducted to demonstrate the effectiveness of our method.

[LG-2] Convex Formulations for Training Two-Layer ReLU Neural Networks

链接: https://arxiv.org/abs/2410.22311
作者: Karthik Prakhya,Tolga Birdal,Alp Yurtsever
关键词-EN: machine learning models, including neural networks, machine learning, learning models, black-box machine learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Solving non-convex, NP-hard optimization problems is crucial for training machine learning models, including neural networks. However, non-convexity often leads to black-box machine learning models with unclear inner workings. While convex formulations have been used for verifying neural network robustness, their application to training neural networks remains less explored. In response to this challenge, we reformulate the problem of training infinite-width two-layer ReLU networks as a convex completely positive program in a finite-dimensional (lifted) space. Despite the convexity, solving this problem remains NP-hard due to the complete positivity constraint. To overcome this challenge, we introduce a semidefinite relaxation that can be solved in polynomial time. We then experimentally evaluate the tightness of this relaxation, demonstrating its competitive performance in test accuracy across a range of classification tasks.

[LG-3] LLM s are Highly-Constrained Biophysical Sequence Optimizers

链接: https://arxiv.org/abs/2410.22296
作者: Angelica Chen,Samuel D. Stanton,Robert G. Alberstein,Andrew M. Watkins,Richard Bonneau,Vladimir Gligorijevi,Kyunghyun Cho,Nathan C. Frey
关键词-EN: Large language models, recently shown significant, shown significant potential, Large language, molecule design
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Supercedes arXiv:2407.00236v1

点击查看摘要

Abstract:Large language models (LLMs) have recently shown significant potential in various biological tasks such as protein engineering and molecule design. These tasks typically involve black-box discrete sequence optimization, where the challenge lies in generating sequences that are not only biologically feasible but also adhere to hard fine-grained constraints. However, LLMs often struggle with such constraints, especially in biological contexts where verifying candidate solutions is costly and time-consuming. In this study, we explore the possibility of employing LLMs as highly-constrained bilevel optimizers through a methodology we refer to as Language Model Optimization with Margin Expectation (LLOME). This approach combines both offline and online optimization, utilizing limited oracle evaluations to iteratively enhance the sequences generated by the LLM. We additionally propose a novel training objective – Margin-Aligned Expectation (MargE) – that trains the LLM to smoothly interpolate between the reward and reference distributions. Lastly, we introduce a synthetic test suite that bears strong geometric similarity to real biophysical problems and enables rapid evaluation of LLM optimizers without time-consuming lab validation. Our findings reveal that, in comparison to genetic algorithm baselines, LLMs achieve significantly lower regret solutions while requiring fewer test function evaluations. However, we also observe that LLMs exhibit moderate miscalibration, are susceptible to generator collapse, and have difficulty finding the optimal solution when no explicit ground truth rewards are available.

[LG-4] Embedding-based classifiers can detect prompt injection attacks

链接: https://arxiv.org/abs/2410.22284
作者: Md. Ahsan Ayub,Subhabrata Majumdar
关键词-EN: Large Language Models, Large Language, exceptional generative capabilities, Language Models, generative capabilities
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are seeing significant adoption in every type of organization due to their exceptional generative capabilities. However, LLMs are found to be vulnerable to various adversarial attacks, particularly prompt injection attacks, which trick them into producing harmful or inappropriate content. Adversaries execute such attacks by crafting malicious prompts to deceive the LLMs. In this paper, we propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat. We leverage three commonly used embedding models to generate embeddings of malicious and benign prompts and utilize ML classifiers to predict whether an input prompt is malicious. Out of several traditional ML methods, we achieve the best performance with classifiers built using Random Forest and XGBoost. Our classifiers outperform state-of-the-art prompt injection classifiers available in open-source implementations, which use encoder-only neural networks.

[LG-5] Meta-Learning Adaptable Foundation Models

链接: https://arxiv.org/abs/2410.22264
作者: Jacob L. Block,Sundararajan Srinivasan,Liam Collins,Aryan Mokhtari,Sanjay Shakkottai
关键词-EN: highly expressive representations, learn highly expressive, power of foundation, highly expressive, expressive representations
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:The power of foundation models (FMs) lies in their capacity to learn highly expressive representations that can be adapted to a broad spectrum of tasks. However, these pretrained models require multiple stages of fine-tuning to become effective for downstream applications. Conventionally, the model is first retrained on the aggregate of a diverse set of tasks of interest and then adapted to specific low-resource downstream tasks by utilizing a parameter-efficient fine-tuning (PEFT) scheme. While this two-phase procedure seems reasonable, the independence of the retraining and fine-tuning phases causes a major issue, as there is no guarantee the retrained model will achieve good performance post-fine-tuning. To explicitly address this issue, we introduce a meta-learning framework infused with PEFT in this intermediate retraining stage to learn a model that can be easily adapted to unseen tasks. For our theoretical results, we focus on linear models using low-rank adaptations. In this setting, we demonstrate the suboptimality of standard retraining for finding an adaptable set of parameters. Further, we prove that our method recovers the optimally adaptable parameters. We then apply these theoretical insights to retraining the RoBERTa model to predict the continuation of conversations between different personas within the ConvAI2 dataset. Empirically, we observe significant performance benefits using our proposed meta-learning scheme during retraining relative to the conventional approach.

[LG-6] LipKernel: Lipschitz-Bounded Convolutional Neural Networks via Dissipative Layers

链接: https://arxiv.org/abs/2410.22258
作者: Patricia Pauli,Ruigang Wang,Ian Manchester,Frank Allgöwer
关键词-EN: prescribed Lipschitz bound, includes built-in robustness, built-in robustness guarantees, includes built-in, guarantees by enforcing
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel layer-wise parameterization for convolutional neural networks (CNNs) that includes built-in robustness guarantees by enforcing a prescribed Lipschitz bound. Each layer in our parameterization is designed to satisfy a linear matrix inequality (LMI), which in turn implies dissipativity with respect to a specific supply rate. Collectively, these layer-wise LMIs ensure Lipschitz boundedness for the input-output mapping of the neural network, yielding a more expressive parameterization than through spectral bounds or orthogonal layers. Our new method LipKernel directly parameterizes dissipative convolution kernels using a 2-D Roesser-type state space model. This means that the convolutional layers are given in standard form after training and can be evaluated without computational overhead. In numerical experiments, we show that the run-time using our method is orders of magnitude faster than state-of-the-art Lipschitz-bounded networks that parameterize convolutions in the Fourier domain, making our approach particularly attractive for improving robustness of learning-based real-time perception or control in robotics, autonomous vehicles, or automation systems. We focus on CNNs, and in contrast to previous works, our approach accommodates a wide variety of layers typically used in CNNs, including 1-D and 2-D convolutional layers, maximum and average pooling layers, as well as strided and dilated convolutions and zero padding. However, our approach naturally extends beyond CNNs as we can incorporate any layer that is incrementally dissipative.

[LG-7] Hypergraph-based multi-scale spatio-temporal graph convolution network for Time-Series anomaly detection

链接: https://arxiv.org/abs/2410.22256
作者: Hongyi Xu
关键词-EN: cloud service providers, fields including aerospace, Multivariate time series, detection technology plays, anomaly detection
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series anomaly detection technology plays an important role in many fields including aerospace, water treatment, cloud service providers, etc. Excellent anomaly detection models can greatly improve work efficiency and avoid major economic losses. However, with the development of technology, the increasing size and complexity of data, and the lack of labels for relevant abnormal data, it is becoming increasingly challenging to perform effective and accurate anomaly detection in high-dimensional and complex data sets. In this paper, we propose a hypergraph based spatiotemporal graph convolutional neural network model STGCN_Hyper, which explicitly captures high-order, multi-hop correlations between multiple variables through a hypergraph based dynamic graph structure learning module. On this basis, we further use the hypergraph based spatiotemporal graph convolutional network to utilize the learned hypergraph structure to effectively propagate and aggregate one-hop and multi-hop related node information in the convolutional network, thereby obtaining rich spatial information. Furthermore, through the multi-scale TCN dilated convolution module, the STGCN_hyper model can also capture the dependencies of features at different scales in the temporal dimension. An unsupervised anomaly detector based on PCA and GMM is also integrated into the STGCN_hyper model. Through the anomaly score of the detector, the model can detect the anomalies in an unsupervised way. Experimental results on multiple time series datasets show that our model can flexibly learn the multi-scale time series features in the data and the dependencies between features, and outperforms most existing baseline models in terms of precision, recall, F1-score on anomaly detection tasks. Our code is available on: this https URL

[LG-8] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs MICRO MICRO57

链接: https://arxiv.org/abs/2410.22249
作者: Rishabh Jain,Vivek M. Bhasi,Adwait Jog,Anand Sivasubramaniam,Mahmut T. Kandemir,Chita R. Das
关键词-EN: Deep Learning Recommendation, Personalized recommendation, extensively leveraging Deep, Learning Recommendation Models, leveraging Deep Learning
类目: Hardware Architecture (cs.AR); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG); Performance (cs.PF); Software Engineering (cs.SE)
*备注: This work has been accepted in the 57th MICRO ( this https URL ). Please check appendix for details on reproducing our work including codebase and steps

点击查看摘要

Abstract:Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs, while meeting acceptable latencies, continues to remain challenging, making traditional deployments increasingly more GPU-hungry, resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading up to a 3.2x embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline. Comments: This work has been accepted in the 57th MICRO (this https URL). Please check appendix for details on reproducing our work including codebase and steps Subjects: Hardware Architecture (cs.AR); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG); Performance (cs.PF); Software Engineering (cs.SE) Cite as: arXiv:2410.22249 [cs.AR] (or arXiv:2410.22249v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2410.22249 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-9] Abrupt Learning in Transformers: A Case Study on Matrix Completion NEURIPS2024

链接: https://arxiv.org/abs/2410.22244
作者: Pulkit Gopalani,Ekdeep Singh Lubana,Wei Hu
关键词-EN: Recent analysis, Transformers has unveiled, dynamics of Transformers, interesting characteristic, unveiled an interesting
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024 Poster

点击查看摘要

Abstract:Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near–optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model’s predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately predicting the masked entries; (b) the attention heads transition to interpretable patterns relevant to the task; and © the embeddings and hidden states encode information relevant to the problem. We also analyze the training dynamics of individual model components to understand the sudden drop in loss.

[LG-10] Auditing f-Differential Privacy in One Run

链接: https://arxiv.org/abs/2410.22235
作者: Saeed Mahloujifar,Luca Melis,Kamalika Chaudhuri
关键词-EN: implementation of privacy-preserving, auditing procedure, empirical privacy, auditing, Existing auditing mechanisms
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Empirical auditing has emerged as a means of catching some of the flaws in the implementation of privacy-preserving algorithms. Existing auditing mechanisms, however, are either computationally inefficient requiring multiple runs of the machine learning algorithms or suboptimal in calculating an empirical privacy. In this work, we present a tight and efficient auditing procedure and analysis that can effectively assess the privacy of mechanisms. Our approach is efficient; similar to the recent work of Steinke, Nasr, and Jagielski (2023), our auditing procedure leverages the randomness of examples in the input dataset and requires only a single run of the target mechanism. And it is more accurate; we provide a novel analysis that enables us to achieve tight empirical privacy estimates by using the hypothesized f -DP curve of the mechanism, which provides a more accurate measure of privacy than the traditional \epsilon,\delta differential privacy parameters. We use our auditing procure and analysis to obtain empirical privacy, demonstrating that our auditing procedure delivers tighter privacy estimates.

[LG-11] Subgraph Aggregation for Out-of-Distribution Generalization on Graphs

链接: https://arxiv.org/abs/2410.22228
作者: Bowen Liu,Haoyang Li,Shuning Wang,Shuo Nie,Shanghang Zhang
关键词-EN: Graph Neural Networks, Neural Networks, gained significant attention, significant attention due, Graph Neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) generalization in Graph Neural Networks (GNNs) has gained significant attention due to its critical importance in graph-based predictions in real-world scenarios. Existing methods primarily focus on extracting a single causal subgraph from the input graph to achieve generalizable predictions. However, relying on a single subgraph can lead to susceptibility to spurious correlations and is insufficient for learning invariant patterns behind graph data. Moreover, in many real-world applications, such as molecular property prediction, multiple critical subgraphs may influence the target label property. To address these challenges, we propose a novel framework, SubGraph Aggregation (SuGAr), designed to learn a diverse set of subgraphs that are crucial for OOD generalization on graphs. Specifically, SuGAr employs a tailored subgraph sampler and diversity regularizer to extract a diverse set of invariant subgraphs. These invariant subgraphs are then aggregated by averaging their representations, which enriches the subgraph signals and enhances coverage of the underlying causal structures, thereby improving OOD generalization. Extensive experiments on both synthetic and real-world datasets demonstrate that \ours outperforms state-of-the-art methods, achieving up to a 24% improvement in OOD generalization on graphs. To the best of our knowledge, this is the first work to study graph OOD generalization by learning multiple invariant subgraphs.

[LG-12] GRINNs: Godunov-Riemann Informed Neural Networks for Learning Hyperbolic Conservation Laws

链接: https://arxiv.org/abs/2410.22193
作者: Dimitrios G. Patsatzis,Mario di Bernardo,Lucia Russo,Constantinos Siettos
关键词-EN: analysis-informed neural networks, Partial Differential Equations, numerical analysis-informed neural, hyperbolic Partial Differential, conservation laws
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:We present GRINNs: numerical analysis-informed neural networks for the solution of inverse problems of non-linear systems of conservation laws. GRINNs are based on high-resolution Godunov schemes for the solution of the Riemann problem in hyperbolic Partial Differential Equations (PDEs). In contrast to other existing machine learning methods that learn the numerical fluxes of conservative Finite Volume methods, GRINNs learn the physical flux function per se. Due to their structure, GRINNs provide interpretable, conservative schemes, that learn the solution operator on the basis of approximate Riemann solvers that satisfy the Rankine-Hugoniot condition. The performance of GRINNs is assessed via four benchmark problems, namely the Burgers’, the Shallow Water, the Lighthill-Whitham-Richards and the Payne-Whitham traffic flow models. The solution profiles of these PDEs exhibit shock waves, rarefactions and/or contact discontinuities at finite times. We demonstrate that GRINNs provide a very high accuracy both in the smooth and discontinuous regions.

[LG-13] rAge-k: Communication-Efficient Federated Learning Using Age Factor

链接: https://arxiv.org/abs/2410.22192
作者: Matin Mortaheb,Priyanka Kaswan,Sennur Ulukus
关键词-EN: Federated learning, unified machine-learning model, train a unified, unified machine-learning, Federated
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a collaborative approach where multiple clients, coordinated by a parameter server (PS), train a unified machine-learning model. The approach, however, suffers from two key challenges: data heterogeneity and communication overhead. Data heterogeneity refers to inconsistencies in model training arising from heterogeneous data at different clients. Communication overhead arises from the large volumes of parameter updates exchanged between the PS and clients. Existing solutions typically address these challenges separately. This paper introduces a new communication-efficient algorithm that uses the age of information metric to simultaneously tackle both limitations of FL. We introduce age vectors at the PS, which keep track of how often the different model parameters are updated from the clients. The PS uses this to selectively request updates for specific gradient indices from each client. Further, the PS employs age vectors to identify clients with statistically similar data and group them into clusters. The PS combines the age vectors of the clustered clients to efficiently coordinate gradient index updates among clients within a cluster. We evaluate our approach using the MNIST and CIFAR10 datasets in highly non-i.i.d. settings. The experimental results show that our proposed method can expedite training, surpassing other communication-efficient strategies in efficiency.

[LG-14] EconoJax: A Fast Scalable Economic Simulation in Jax

链接: https://arxiv.org/abs/2410.22165
作者: Koen Ponse,Aske Plaat,Niki van Stein,Thomas M. Moerland
关键词-EN: Accurate economic simulations, Accurate economic, experimental runs, require many experimental, reinforcement learning
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); General Economics (econ.GN)
*备注: 8 pages

点击查看摘要

Abstract:Accurate economic simulations often require many experimental runs, particularly when combined with reinforcement learning. Unfortunately, training reinforcement learning agents in multi-agent economic environments can be slow. This paper introduces EconoJax, a fast simulated economy, based on the AI economist. EconoJax, and its training pipeline, are completely written in JAX. This allows EconoJax to scale to large population sizes and perform large experiments, while keeping training times within minutes. Through experiments with populations of 100 agents, we show how real-world economic behavior emerges through training within 15 minutes, in contrast to previous work that required several days. To aid and inspire researchers to build more rich and dynamic economic simulations, we open-source EconoJax on Github at: this https URL.

[LG-15] Learning Successor Features the Simple Way NEURIPS

链接: https://arxiv.org/abs/2410.22133
作者: Raymond Chua,Arna Ghosh,Christos Kaplanis,Blake A. Richards,Doina Precup
关键词-EN: Deep Reinforcement Learning, Deep Reinforcement, exhibit catastrophic forgetting, Reinforcement Learning, non-stationary environments
类目: Machine Learning (cs.LG)
*备注: Main Paper: 10 pages and 8 figures. Accepted at Neural Information Processing Systems (NeurIPS) 2024

点击查看摘要

Abstract:In Deep Reinforcement Learning (RL), it is a challenge to learn representations that do not exhibit catastrophic forgetting or interference in non-stationary environments. Successor Features (SFs) offer a potential solution to this challenge. However, canonical techniques for learning SFs from pixel-level observations often lead to representation collapse, wherein representations degenerate and fail to capture meaningful variations in the data. More recent methods for learning SFs can avoid representation collapse, but they often involve complex losses and multiple learning phases, reducing their efficiency. We introduce a novel, simple method for learning SFs directly from pixels. Our approach uses a combination of a Temporal-difference (TD) loss and a reward prediction loss, which together capture the basic mathematical definition of SFs. We show that our approach matches or outperforms existing SF learning techniques in both 2D (Minigrid), 3D (Miniworld) mazes and Mujoco, for both single and continual learning scenarios. As well, our technique is efficient, and can reach higher levels of performance in less time than other approaches. Our work provides a new, streamlined technique for learning SFs directly from pixel observations, with no pretraining required.

[LG-16] Where Do Large Learning Rates Lead Us? NEURIPS2024

链接: https://arxiv.org/abs/2410.22113
作者: Ildus Sadrtdinov,Maxim Kodryan,Eduard Pokonechny,Ekaterina Lobacheva,Dmitry Vetrov
关键词-EN: large learning rates, neural networks training, starting neural networks, learning rates, generally accepted
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published in NeurIPS 2024. First three authors contributed equally, last two authors share senior authorship

点击查看摘要

Abstract:It is generally accepted that starting neural networks training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold lead to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of reached minima, we observe that using LRs from this optimal range allows for the optimization to locate a basin that only contains high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with too small LRs leads to unstable minima and attempts to learn all features simultaneously, resulting in poor generalization. Conversely, using initial LRs that are too large fails to detect a basin with good solutions and extract meaningful patterns from the data.

[LG-17] Data Generation for Hardware-Friendly Post-Training Quantization

链接: https://arxiv.org/abs/2410.22110
作者: Lior Dikstein,Ariel Lapid,Arnon Netzer,Hai Victor Habi
关键词-EN: Zero-shot quantization, existing data generation, data, data generation methods, data generation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-shot quantization (ZSQ) using synthetic data is a key approach for post-training quantization (PTQ) under privacy and security constraints. However, existing data generation methods often struggle to effectively generate data suitable for hardware-friendly quantization, where all model layers are quantized. We analyze existing data generation methods based on batch normalization (BN) matching and identify several gaps between synthetic and real data: 1) Current generation algorithms do not optimize the entire synthetic dataset simultaneously; 2) Data augmentations applied during training are often overlooked; and 3) A distribution shift occurs in the final model layers due to the absence of BN in those layers. These gaps negatively impact ZSQ performance, particularly in hardware-friendly quantization scenarios. In this work, we propose Data Generation for Hardware-friendly quantization (DGH), a novel method that addresses these gaps. DGH jointly optimizes all generated images, regardless of the image set size or GPU memory constraints. To address data augmentation mismatches, DGH includes a preprocessing stage that mimics the augmentation process and enhances image quality by incorporating natural image priors. Finally, we propose a new distribution-stretching loss that aligns the support of the feature map distribution between real and synthetic data. This loss is applied to the model’s output and can be adapted to various tasks. DGH demonstrates significant improvements in quantization performance across multiple tasks, achieving up to a 30% increase in accuracy for hardware-friendly ZSQ in both classification and object detection, often performing on par with real data.

[LG-18] InLINE: Inner-Layer Information Exchange for Multi-task Learning on Heterogeneous Graphs

链接: https://arxiv.org/abs/2410.22089
作者: Xinyue Feng,Jinquan Hang,Yuequn Zhang,Haotian Wang,Desheng Zhang,Guang Wang
关键词-EN: modeling complex relational, complex relational data, modeling complex, complex relational, real-world scenarios
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heterogeneous graph is an important structure for modeling complex relational data in real-world scenarios and usually involves various node prediction tasks within a single graph. Training these tasks separately may neglect beneficial information sharing, hence a preferred way is to learn several tasks in a same model by Multi-Task Learning (MTL). However, MTL introduces the issue of negative transfer, where the training of different tasks interferes with each other as they may focus on different information from the data, resulting in suboptimal performance. To solve the issue, existing MTL methods use separate backbones for each task, then selectively exchange beneficial features through interactions among the output embeddings from each layer of different backbones, which we refer to as outer-layer exchange. However, the negative transfer in heterogeneous graphs arises not simply from the varying importance of an individual node feature across tasks, but also from the varying importance of inter-relation between two nodes across tasks. These inter-relations are entangled in the output embedding, making it difficult for existing methods to discriminate beneficial information from the embedding. To address this challenge, we propose the Inner-Layer Information Exchange (InLINE) model that facilitate fine-grained information exchanges within each graph layer rather than through output embeddings. Specifically, InLINE consists of (1) Structure Disentangled Experts for layer-wise structure disentanglement, (2) Structure Disentangled Gates for assigning disentangled information to different tasks. Evaluations on two public datasets and a large industry dataset show that our model effectively alleviates the significant performance drop on specific tasks caused by negative transfer, improving Macro F1 by 6.3% on DBLP dataset and AUC by 3.6% on the industry dataset compared to SoA methods.

[LG-19] Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks

链接: https://arxiv.org/abs/2410.22069
作者: Nikolaos Tsilivis,Gal Vardi,Julia Kempe
关键词-EN: includes gradient descent, deep homogeneous neural, homogeneous neural networks, steepest descent algorithms, general family
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the implicit bias of the general family of steepest descent algorithms, which includes gradient descent, sign descent and coordinate descent, in deep homogeneous neural networks. We prove that an algorithm-dependent geometric margin starts increasing once the networks reach perfect training accuracy and characterize the late-stage bias of the algorithms. In particular, we define a generalized notion of stationarity for optimization problems and show that the algorithms progressively reduce a (generalized) Bregman divergence, which quantifies proximity to such stationary points of a margin-maximization problem. We then experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of Adam.

[LG-20] CHORDONOMICON: A Dataset of 666000 Songs and their Chord Progressions

链接: https://arxiv.org/abs/2410.22046
作者: Spyridon Kantarelis,Konstantinos Thomas,Vassilis Lyberatos,Edmund Dervakos,Giorgos Stamou
关键词-EN: encapsulate important information, progressions encapsulate important, Chord progressions encapsulate, Chord progressions, conveyed emotions
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date - created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g. text, graph) and the wealth of information chords convey in given contexts, such as their harmonic function . These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.

[LG-21] On the Robustness of Adversarial Training Against Uncertainty Attacks

链接: https://arxiv.org/abs/2410.21952
作者: Emanuele Ledda,Giovanni Scodeller,Daniele Angioni,Giorgio Piras,Antonio Emanuele Cinà,Giorgio Fumera,Battista Biggio,Fabio Roli
关键词-EN: learning problems, noise inherent, task at hand, hand hinders, hinders the possibility
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In learning problems, the noise inherent to the task at hand hinders the possibility to infer without a certain degree of uncertainty. Quantifying this uncertainty, regardless of its wide use, assumes high relevance for security-sensitive applications. Within these scenarios, it becomes fundamental to guarantee good (i.e., trustworthy) uncertainty measures, which downstream modules can securely employ to drive the final decision-making process. However, an attacker may be interested in forcing the system to produce either (i) highly uncertain outputs jeopardizing the system’s availability or (ii) low uncertainty estimates, making the system accept uncertain samples that would instead require a careful inspection (e.g., human intervention). Therefore, it becomes fundamental to understand how to obtain robust uncertainty estimates against these kinds of attacks. In this work, we reveal both empirically and theoretically that defending against adversarial examples, i.e., carefully perturbed samples that cause misclassification, additionally guarantees a more secure, trustworthy uncertainty estimate under common attack scenarios without the need for an ad-hoc defense strategy. To support our claims, we evaluate multiple adversarial-robust models from the publicly available benchmark RobustBench on the CIFAR-10 and ImageNet datasets.

[LG-22] SCGNet-Stacked Convolution with Gated Recurrent Unit Network for Cyber Network Intrusion Detection and Intrusion Type Classification

链接: https://arxiv.org/abs/2410.21873
作者: Rajana Akter,Shahnure Rabib,Rahul Deb Mohalder,Laboni Paul,Ferdous Bin Ali
关键词-EN: Intrusion detection system, malicious activity, Network-based intrusion detection, Intrusion detection, detection system
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Intrusion detection system (IDS) is a piece of hardware or software that looks for malicious activity or policy violations in a network. It looks for malicious activity or security flaws on a network or system. IDS protects hosts or networks by looking for indications of known attacks or deviations from normal behavior (Network-based intrusion detection system, or NIDS for short). Due to the rapidly increasing amount of network data, traditional intrusion detection systems (IDSs) are far from being able to quickly and efficiently identify complex and varied network attacks, especially those linked to low-frequency attacks. The SCGNet (Stacked Convolution with Gated Recurrent Unit Network) is a novel deep learning architecture that we propose in this study. It exhibits promising results on the NSL-KDD dataset in both task, network attack detection, and attack type classification with 99.76% and 98.92% accuracy, respectively. We have also introduced a general data preprocessing pipeline that is easily applicable to other similar datasets. We have also experimented with conventional machine-learning techniques to evaluate the performance of the data processing pipeline.

[LG-23] Reliable and Compact Graph Fine-tuning via GraphSparse Prompting

链接: https://arxiv.org/abs/2410.21749
作者: Bo Jiang,Hao Wu,Beibei Wang,Jin Tang,Bin Luo
关键词-EN: garnered increasing attention, Graph Sparse, graph, Graph Sparse Feature, Graph Sparse Prompting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, graph prompt learning has garnered increasing attention in adapting pre-trained GNN models for downstream graph learning tasks. However, existing works generally conduct prompting over all graph elements (e.g., nodes, edges, node attributes, etc.), which is suboptimal and obviously redundant. To address this issue, we propose exploiting sparse representation theory for graph prompting and present Graph Sparse Prompting (GSP). GSP aims to adaptively and sparsely select the optimal elements (e.g., certain node attributes) to achieve compact prompting for downstream tasks. Specifically, we propose two kinds of GSP models, termed Graph Sparse Feature Prompting (GSFP) and Graph Sparse multi-Feature Prompting (GSmFP). Both GSFP and GSmFP provide a general scheme for tuning any specific pre-trained GNNs that can achieve attribute selection and compact prompt learning simultaneously. A simple yet effective algorithm has been designed for solving GSFP and GSmFP models. Experiments on 16 widely-used benchmark datasets validate the effectiveness and advantages of the proposed GSFPs.

[LG-24] A Dual Adaptive Assignment Approach for Robust Graph-Based Clustering

链接: https://arxiv.org/abs/2410.21745
作者: Yang Xiang,Li Fan,Tulika Saha,Yushan Pan,Haiyang Zhang,Chengtao Ji
关键词-EN: involves grouping nodes, separate clusters, essential aspect, aspect of network, network analysis
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Graph clustering is an essential aspect of network analysis that involves grouping nodes into separate clusters. Recent developments in deep learning have resulted in advanced deep graph clustering techniques, which have proven effective in many applications. Nonetheless, these methods often encounter difficulties when dealing with the complexities of real-world graphs, particularly in the presence of noisy edges. Additionally, many denoising graph clustering strategies tend to suffer from lower performance compared to their non-denoised counterparts, training instability, and challenges in scaling to large datasets. To tackle these issues, we introduce a new framework called the Dual Adaptive Assignment Approach for Robust Graph-Based Clustering (RDSA). RDSA consists of three key components: (i) a node embedding module that effectively integrates the graph’s topological features and node attributes; (ii) a structure-based soft assignment module that improves graph modularity by utilizing an affinity matrix for node assignments; and (iii) a node-based soft assignment module that identifies community landmarks and refines node assignments to enhance the model’s robustness. We assess RDSA on various real-world datasets, demonstrating its superior performance relative to existing state-of-the-art methods. Our findings indicate that RDSA provides robust clustering across different graph types, excelling in clustering effectiveness and robustness, including adaptability to noise, stability, and scalability.

[LG-25] Sliced-Wasserstein-based Anomaly Detection and Open Dataset for Localized Critical Peak Rebates

链接: https://arxiv.org/abs/2410.21712
作者: Julien Pallage,Bertrand Scherrer,Salma Naccache,Christophe Bélanger,Antoine Lesage-Landry
关键词-EN: unsupervised anomaly, sliced-Wasserstein metric, localized critical peak, critical peak rebate, critical peak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we present a new unsupervised anomaly (outlier) detection (AD) method using the sliced-Wasserstein metric. This filtering technique is conceptually interesting for integration in MLOps pipelines deploying trustworthy machine learning models in critical sectors like energy. Additionally, we open the first dataset showcasing localized critical peak rebate demand response in a northern climate. We demonstrate the capabilities of our method on synthetic datasets as well as standard AD datasets and use it in the making of a first benchmark for our open-source localized critical peak rebate dataset.

[LG-26] Multi-view clustering integrating anchor attribute and structural information

链接: https://arxiv.org/abs/2410.21711
作者: Xuetong Li,Xiao-Dong Zhang
关键词-EN: Multisource data, data has spurred, spurred the development, development of advanced, critically relies
类目: Machine Learning (cs.LG)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Multisource data has spurred the development of advanced clustering algorithms, such as multi-view clustering, which critically relies on constructing similarity matrices. Traditional algorithms typically generate these matrices from sample attributes alone. However, real-world networks often include pairwise directed topological structures critical for clustering. This paper introduces a novel multi-view clustering algorithm, AAS. It utilizes a two-step proximity approach via anchors in each view, integrating attribute and directed structural information. This approach enhances the clarity of category characteristics in the similarity matrices. The anchor structural similarity matrix leverages strongly connected components of directed graphs. The entire process-from similarity matrices construction to clustering - is consolidated into a unified optimization framework. Comparative experiments on the modified Attribute SBM dataset against eight algorithms affirm the effectiveness and superiority of AAS.

[LG-27] Stochastic Approximation with Unbounded Markovian Noise: A General-Purpose Theorem

链接: https://arxiv.org/abs/2410.21704
作者: Shaan Ul Haque,Siva Theja Maguluri
关键词-EN: average-reward Reinforcement Learning, Motivated by engineering, average-reward Reinforcement, unbounded state space, Reinforcement Learning
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent works studied this problem in the actor-critic framework and established finite sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal \mathcalO\left(1/\epsilon^2\right) sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA). Suppose that one constructs a Lyapunov function for a non-linear SA with certain drift condition. Then, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise under suitable conditions. It serves as a black box tool to generalize sample guarantees on SA from i.i.d. or martingale difference case to potentially unbounded Markovian noise. The generality and the mild assumption of the setup enables broad applicability of our theorem. We illustrate its power by studying two more systems: (i) We improve upon the finite-time bounds of Q -learning by tightening the error bounds and also allowing for a larger class of behavior policies. (ii) We establish the first ever finite-time bounds for distributed stochastic optimization of high-dimensional smooth strongly convex function using cyclic block coordinate descent. Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC) Cite as: arXiv:2410.21704 [cs.LG] (or arXiv:2410.21704v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.21704 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-28] On the Role of Depth and Looping for In-Context Learning with Task Diversity

链接: https://arxiv.org/abs/2410.21698
作者: Khashayar Gatmiry,Nikunj Saunshi,Sashank J. Reddi,Stefanie Jegelka,Sanjiv Kumar
关键词-EN: garnered significant attention, intriguing in-context learning, multilayer Transformers, deep Transformer models, Transformers
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers’ abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn multiple tasks in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers ranging from [1, \kappa] , and highlight the importance of depth in this setting. More specifically, (a) we show theoretical lower bounds of \log(\kappa) (or \sqrt\kappa ) linear attention layers in the unrestricted (or restricted) attention setting and, (b) we show that multilayer Transformers can indeed solve such tasks with a number of layers that matches the lower bounds. However, we show that this expressivity of multilayer Transformer comes at the price of robustness. In particular, multilayer Transformers are not robust to even distributional shifts as small as O(e^-L) in Wasserstein distance, where L is the depth of the network. We then demonstrate that Looped Transformers – a special class of multilayer Transformers with weight-sharing – not only exhibit similar expressive power but are also provably robust under mild assumptions. Besides out-of-distribution generalization, we also show that Looped Transformers are the only models that exhibit a monotonic behavior of loss with respect to depth.

[LG-29] Pushing the Limits of All-Atom Geometric Graph Neural Networks: Pre-Training Scaling and Zero-Shot Transfer

链接: https://arxiv.org/abs/2410.21683
作者: Zihan Pengmei,Zhengyuan Shen,Zichen Wang,Marcus Collins,Huzefa Rangwala
关键词-EN: protein mechanism analysis, learning-based molecular dynamics, Constructing transferable descriptors, biological systems finds, systems finds numerous
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 15 pages, 4 figures, supporting information appended

点击查看摘要

Abstract:Constructing transferable descriptors for conformation representation of molecular and biological systems finds numerous applications in drug discovery, learning-based molecular dynamics, and protein mechanism analysis. Geometric graph neural networks (Geom-GNNs) with all-atom information have transformed atomistic simulations by serving as a general learnable geometric descriptors for downstream tasks including prediction of interatomic potential and molecular properties. However, common practices involve supervising Geom-GNNs on specific downstream tasks, which suffer from the lack of high-quality data and inaccurate labels leading to poor generalization and performance degradation on out-of-distribution (OOD) scenarios. In this work, we explored the possibility of using pre-trained Geom-GNNs as transferable and highly effective geometric descriptors for improved generalization. To explore their representation power, we studied the scaling behaviors of Geom-GNNs under self-supervised pre-training, supervised and unsupervised learning setups. We find that the expressive power of different architectures can differ on the pre-training task. Interestingly, Geom-GNNs do not follow the power-law scaling on the pre-training task, and universally lack predictable scaling behavior on the supervised tasks with quantum chemical labels important for screening and design of novel molecules. More importantly, we demonstrate how all-atom graph embedding can be organically combined with other neural architectures to enhance the expressive power. Meanwhile, the low-dimensional projection of the latent space shows excellent agreement with conventional geometrical descriptors.

[LG-30] Revisiting Reliability in Large-Scale Machine Learning Research Clusters

链接: https://arxiv.org/abs/2410.21680
作者: Apostolos Kokolis,Michael Kuchnik,John Hoffman,Adithya Kumar,Parth Malani,Faye Ma,Zachary DeVito,Shubho Sengupta,Kalyan Saladi,Carole-Jean Wu
关键词-EN: operating large-scale machine, large-scale machine learning, continues to grow, fundamental challenge, challenge in operating
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective in understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We hereby introduce a taxonomy of failures and key reliability metrics, analyze 11 months of data from two state-of-the-art ML environments with over 150 million A100 GPU hours and 4 million jobs. Building on our data, we fit a failure model to project Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2410.21680 [cs.DC] (or arXiv:2410.21680v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2410.21680 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-31] Minimum Entropy Coupling with Bottleneck NEURIPS2024

链接: https://arxiv.org/abs/2410.21666
作者: M.Reza Ebrahimi,Jun Chen,Ashish Khisti
关键词-EN: reconstruction distribution diverges, minimum entropy coupling, lossy compression framework, compression framework operating, entropy coupling framework
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) - Spotlight

点击查看摘要

Abstract:This paper investigates a novel lossy compression framework operating under logarithmic loss, designed to handle situations where the reconstruction distribution diverges from the source distribution. This framework is especially relevant for applications that require joint compression and retrieval, and in scenarios involving distributional shifts due to processing. We show that the proposed formulation extends the classical minimum entropy coupling framework by integrating a bottleneck, allowing for a controlled degree of stochasticity in the coupling. We explore the decomposition of the Minimum Entropy Coupling with Bottleneck (MEC-B) into two distinct optimization problems: Entropy-Bounded Information Maximization (EBIM) for the encoder, and Minimum Entropy Coupling (MEC) for the decoder. Through extensive analysis, we provide a greedy algorithm for EBIM with guaranteed performance, and characterize the optimal solution near functional mappings, yielding significant theoretical insights into the structural complexity of this problem. Furthermore, we illustrate the practical application of MEC-B through experiments in Markov Coding Games (MCGs) under rate limits. These games simulate a communication scenario within a Markov Decision Process, where an agent must transmit a compressed message from a sender to a receiver through its actions. Our experiments highlight the trade-offs between MDP rewards and receiver accuracy across various compression rates, showcasing the efficacy of our method compared to conventional compression baseline.

[LG-32] Dimensionality-induced information loss of outliers in deep neural networks ECML KDD2024

链接: https://arxiv.org/abs/2410.21656
作者: Kazuki Uematsu,Kosuke Haruki,Taiji Suzuki,Mitsuhiro Kimura,Takahiro Takimoto,Hideyuki Nakagawa
关键词-EN: deep neural network, OOD samples, neural network, OOD, OOD detection
类目: Machine Learning (cs.LG)
*备注: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in ECML PKDD 2024, and is available online at this https URL

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is a critical issue for the stable and reliable operation of systems using a deep neural network (DNN). Although many OOD detection methods have been proposed, it remains unclear how the differences between in-distribution (ID) and OOD samples are generated by each processing step inside DNNs. We experimentally clarify this issue by investigating the layer dependence of feature representations from multiple perspectives. We find that intrinsic low dimensionalization of DNNs is essential for understanding how OOD samples become more distinct from ID samples as features propagate to deeper layers. Based on these observations, we provide a simple picture that consistently explains various properties of OOD samples. Specifically, low-dimensional weights eliminate most information from OOD samples, resulting in misclassifications due to excessive attention to dataset bias. In addition, we demonstrate the utility of dimensionality by proposing a dimensionality-aware OOD detection method based on alignment of features and weights, which consistently achieves high performance for various datasets with lower computational cost.

[LG-33] Faster Local Solvers for Graph Diffusion Equations NEURIPS2024

链接: https://arxiv.org/abs/2410.21634
作者: Jiahe Bai,Baojian Zhou,Deqing Yang,Yanghua Xiao
关键词-EN: training neural networks, Katz centrality, Personalized PageRank, Heat kernel, crucial for clustering
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Efficient computation of graph diffusion equations (GDEs), such as Personalized PageRank, Katz centrality, and the Heat kernel, is crucial for clustering, training neural networks, and many other graph-related problems. Standard iterative methods require accessing the whole graph per iteration, making them time-consuming for large-scale graphs. While existing local solvers approximate diffusion vectors through heuristic local updates, they often operate sequentially and are typically designed for specific diffusion types, limiting their applicability. Given that diffusion vectors are highly localizable, as measured by the participation ratio, this paper introduces a novel framework for approximately solving GDEs using a local diffusion process. This framework reveals the suboptimality of existing local solvers. Furthermore, our approach effectively localizes standard iterative solvers by designing simple and provably sublinear time algorithms. These new local solvers are highly parallelizable, making them well-suited for implementation on GPUs. We demonstrate the effectiveness of our framework in quickly obtaining approximate diffusion vectors, achieving up to a hundred-fold speed improvement, and its applicability to large-scale dynamic graphs. Our framework could also facilitate more efficient local message-passing mechanisms for GNNs.

[LG-34] Graph Sparsification for Enhanced Conformal Prediction in Graph Neural Networks

链接: https://arxiv.org/abs/2410.21618
作者: Yuntian He,Pranav Maneriker,Anutam Srinivasan,Aditya T. Vadlamani,Srinivasan Parthasarathy
关键词-EN: machine learning tasks, ensures reliable coverage, Conformal Prediction, learning tasks, robust framework
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal Prediction is a robust framework that ensures reliable coverage across machine learning tasks. Although recent studies have applied conformal prediction to graph neural networks, they have largely emphasized post-hoc prediction set generation. Improving conformal prediction during the training stage remains unaddressed. In this work, we tackle this challenge from a denoising perspective by introducing SparGCP, which incorporates graph sparsification and a conformal prediction-specific objective into GNN training. SparGCP employs a parameterized graph sparsification module to filter out task-irrelevant edges, thereby improving conformal prediction efficiency. Extensive experiments on real-world graph datasets demonstrate that SparGCP outperforms existing methods, reducing prediction set sizes by an average of 32% and scaling seamlessly to large networks on commodity GPUs.

[LG-35] CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation

链接: https://arxiv.org/abs/2410.21611
作者: Claudius Krause,Michele Faucci Giannelli,Gregor Kasieczka,Benjamin Nachman,Dalila Salamani,David Shih,Anna Zaborowska,Oz Amram,Kerstin Borras,Matthew R. Buckley,Erik Buhmann,Thorsten Buss,Renato Paulo Da Costa Cardoso,Anthony L. Caterini,Nadezda Chernyavskaya,Federico A.G. Corchia,Jesse C. Cresswell,Sascha Diefenbacher,Etienne Dreyer,Vijay Ekambaram,Engin Eren,Florian Ernst,Luigi Favaro,Matteo Franchini,Frank Gaede,Eilam Gross,Shih-Chieh Hsu,Kristina Jaruskova,Benno Käch,Jayant Kalagnanam,Raghav Kansal,Taewoo Kim,Dmitrii Kobylianskii,Anatolii Korol,William Korcari,Dirk Krücker,Katja Krüger,Marco Letizia,Shu Li,Qibin Liu,Xiulong Liu,Gabriel Loaiza-Ganem,Thandikire Madula,Peter McKeown,Isabell-A. Melzer-Pellmann,Vinicius Mikuni,Nam Nguyen,Ayodele Ore,Sofia Palacios Schweitzer,Ian Pang,Kevin Pedro,Tilman Plehn,Witold Pokorski,Huilin Qu,Piyush Raikwar,John A. Raine,Humberto Reyes-Gonzalez,Lorenzo Rinaldi,Brendan Leigh Ross,Moritz A.W. Scham,Simon Schnake,Chase Shimmin,Eli Shlizerman,Nathalie Soybelman,Mudhakar Srivatsa,Kalliopi Tsolaki,Sofia Vallecorsa,Kyongmin Yeo,Rui Zhang
关键词-EN: Calorimeter Simulation Challenge, Conditional Flow Matching, Simulation Challenge, Generative Adversarial Networks, Fast Calorimeter Simulation
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Instrumentation and Detectors (physics.ins-det)
*备注: 204 pages, 100+ figures, 30+ tables

点击查看摘要

Abstract:We present the results of the “Fast Calorimeter Simulation Challenge 2022” - the CaloChallenge. We study state-of-the-art generative models on four calorimeter shower datasets of increasing dimensionality, ranging from a few hundred voxels to a few tens of thousand voxels. The 31 individual submissions span a wide range of current popular generative architectures, including Variational AutoEncoders (VAEs), Generative Adversarial Networks (GANs), Normalizing Flows, Diffusion models, and models based on Conditional Flow Matching. We compare all submissions in terms of quality of generated calorimeter showers, as well as shower generation time and model size. To assess the quality we use a broad range of different metrics including differences in 1-dimensional histograms of observables, KPD/FPD scores, AUCs of binary classifiers, and the log-posterior of a multiclass classifier. The results of the CaloChallenge provide the most complete and comprehensive survey of cutting-edge approaches to calorimeter fast simulation to date. In addition, our work provides a uniquely detailed perspective on the important problem of how to evaluate generative models. As such, the results presented here should be applicable for other domains that use generative AI and require fast and faithful generation of samples in a large phase space.

[LG-36] he Limits of Transfer Reinforcement Learning with Latent Low-rank Structure

链接: https://arxiv.org/abs/2410.21601
作者: Tyler Sam,Yudong Chen,Christina Lee Yu
关键词-EN: latent low rank, reinforcement learning, large sizes, action space, practice due
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many reinforcement learning (RL) algorithms are too costly to use in practice due to the large sizes S, A of the problem’s state and action space. To resolve this issue, we study transfer RL with latent low rank structure. We consider the problem of transferring a latent low rank representation when the source and target MDPs have transition kernels with Tucker rank (S , d, A ) , (S , S , d), (d, S, A ) , or (d , d , d ) . In each setting, we introduce the transfer-ability coefficient \alpha that measures the difficulty of representational transfer. Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on S, A , or S A in the target MDP regret bound. We complement our positive results with information theoretic lower bounds that show our algorithms (excluding the ( d, d, d ) setting) are minimax-optimal with respect to \alpha .

[LG-37] Deep Trees for (Un)structured Data: Tractability Performance and Interpretability

链接: https://arxiv.org/abs/2410.21595
作者: Dimitris Bertsimas,Lisa Everest,Jiayi Gu,Matthew Peroni,Vasiliki Stoumpou
关键词-EN: Generalized Soft Trees, soft decision trees, Soft Trees, Decision Trees, Trees
类目: Machine Learning (cs.LG)
*备注: Submitted to Machine Learning. Authors are listed in alphabetical order

点击查看摘要

Abstract:Decision Trees have remained a popular machine learning method for tabular datasets, mainly due to their interpretability. However, they lack the expressiveness needed to handle highly nonlinear or unstructured datasets. Motivated by recent advances in tree-based machine learning (ML) techniques and first-order optimization methods, we introduce Generalized Soft Trees (GSTs), which extend soft decision trees (STs) and are capable of processing images directly. We demonstrate their advantages with respect to tractability, performance, and interpretability. We develop a tractable approach to growing GSTs, given by the DeepTree algorithm, which, in addition to new regularization terms, produces high-quality models with far fewer nodes and greater interpretability than traditional soft trees. We test the performance of our GSTs on benchmark tabular and image datasets, including MIMIC-IV, MNIST, Fashion MNIST, CIFAR-10 and Celeb-A. We show that our approach outperforms other popular tree methods (CART, Random Forests, XGBoost) in almost all of the datasets, with Convolutional Trees having a significant edge in the hardest CIFAR-10 and Fashion MNIST datasets. Finally, we explore the interpretability of our GSTs and find that even the most complex GSTs are considerably more interpretable than deep neural networks. Overall, our approach of Generalized Soft Trees provides a tractable method that is high-performing on (un)structured datasets and preserves interpretability more than traditional deep learning methods.

[LG-38] Audio Classification of Low Feature Spectrograms Utilizing Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.21561
作者: Noel Elias
关键词-EN: Modern day audio, Modern day, frequency data representations, temporal frequency data, spectrographic temporal frequency
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Modern day audio signal classification techniques lack the ability to classify low feature audio signals in the form of spectrographic temporal frequency data representations. Additionally, currently utilized techniques rely on full diverse data sets that are often not representative of real-world distributions. This paper derives several first-of-its-kind machine learning methodologies to analyze these low feature audio spectrograms given data distributions that may have normalized, skewed, or even limited training sets. In particular, this paper proposes several novel customized convolutional architectures to extract identifying features using binary, one-class, and siamese approaches to identify the spectrographic signature of a given audio signal. Utilizing these novel convolutional architectures as well as the proposed classification methods, these experiments demonstrate state-of-the-art classification accuracy and improved efficiency than traditional audio classification methods.

[LG-39] A Novel Score-CAM based Denoiser for Spectrographic Signature Extraction without Ground Truth

链接: https://arxiv.org/abs/2410.21557
作者: Noel Elias
关键词-EN: growing area, area of research, data, passive sonar transducers, audio
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Sonar based audio classification techniques are a growing area of research in the field of underwater acoustics. Usually, underwater noise picked up by passive sonar transducers contains all types of signals that travel through the ocean and is transformed into spectrographic images. As a result, the corresponding spectrograms intended to display the temporal-frequency data of a certain object often include the tonal regions of abundant extraneous noise that can effectively interfere with a ‘contact’. So, a majority of spectrographic samples extracted from underwater audio signals are rendered unusable due to their clutter and lack the required indistinguishability between different objects. With limited clean true data for supervised training, creating classification models for these audio signals is severely bottlenecked. This paper derives several new techniques to combat this problem by developing a novel Score-CAM based denoiser to extract an object’s signature from noisy spectrographic data without being given any ground truth data. In particular, this paper proposes a novel generative adversarial network architecture for learning and producing spectrographic training data in similar distributions to low-feature spectrogram inputs. In addition, this paper also a generalizable class activation mapping based denoiser for different distributions of acoustic data, even real-world data distributions. Utilizing these novel architectures and proposed denoising techniques, these experiments demonstrate state-of-the-art noise reduction accuracy and improved classification accuracy than current audio classification standards. As such, this approach has applications not only to audio data but for countless data distributions used all around the world for machine learning. Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS) Cite as: arXiv:2410.21557 [cs.SD] (or arXiv:2410.21557v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2410.21557 Focus to learn more arXiv-issued DOI via DataCite Journalreference: 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 2023, pp. 1-8 Related DOI: https://doi.org/10.1109/IJCNN54540.2023.10191897 Focus to learn more DOI(s) linking to related resources

[LG-40] Exploring the Design Space of Diffusion Bridge Models via Stochasticity Control

链接: https://arxiv.org/abs/2410.21553
作者: Shaorong Zhang,Yuanbin Cheng,Xianghao Kong,Greg Ver Steeg
关键词-EN: models effectively facilitate, bridge models effectively, Diffusion bridge models, Stochasticity-controlled Diffusion Bridge, Diffusion bridge
类目: Machine Learning (cs.LG)
*备注: 23 pages, 9 figures

点击查看摘要

Abstract:Diffusion bridge models effectively facilitate image-to-image (I2I) translation by connecting two distributions. However, existing methods overlook the impact of noise in sampling SDEs, transition kernel, and the base distribution on sampling efficiency, image quality and diversity. To address this gap, we propose the Stochasticity-controlled Diffusion Bridge (SDB), a novel theoretical framework that extends the design space of diffusion bridges, and provides strategies to mitigate singularities during both training and sampling. By controlling stochasticity in the sampling SDEs, our sampler achieves speeds up to 5 times faster than the baseline, while also producing lower FID scores. After training, SDB sets new benchmarks in image quality and sampling efficiency via managing stochasticity within the transition kernel. Furthermore, introducing stochasticity into the base distribution significantly improves image diversity, as quantified by a newly introduced metric.

[LG-41] Personalized Federated Learning with Mixture of Models for Adaptive Prediction and Model Fine-Tuning NEURIPS2024

链接: https://arxiv.org/abs/2410.21547
作者: Pouya M. Ghari,Yanning Shen
关键词-EN: retain data privacy, ensuring that users, orchestrates collaborations, efficacy in distributed, Federated learning
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Federated learning is renowned for its efficacy in distributed model training, ensuring that users, called clients, retain data privacy by not disclosing their data to the central server that orchestrates collaborations. Most previous work on federated learning assumes that clients possess static batches of training data. However, clients may also need to make real-time predictions on streaming data in non-stationary environments. In such dynamic environments, employing pre-trained models may be inefficient, as they struggle to adapt to the constantly evolving data streams. To address this challenge, clients can fine-tune models online, leveraging their observed data to enhance performance. Despite the potential benefits of client participation in federated online model fine-tuning, existing analyses have not conclusively demonstrated its superiority over local model fine-tuning. To bridge this gap, the present paper develops a novel personalized federated learning algorithm, wherein each client constructs a personalized model by combining a locally fine-tuned model with multiple federated models learned by the server over time. Theoretical analysis and experiments on real datasets corroborate the effectiveness of this approach for real-time predictions and federated model fine-tuning.

[LG-42] Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

链接: https://arxiv.org/abs/2410.21523
作者: Hengrui Zhang,Liancheng Fang,Qitian Wu,Philip S. Yu
关键词-EN: data remains underexplored, natural language generation, tabular data remains, tabular data, Diffusion-nested Autoregressive model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autoregressive models are predominant in natural language generation, while their application in tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data type, while the autoregressive model is primarily designed to model discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generation model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To ensure arbitrary generation order, TabDAR resorts to masked transformers with bi-directional attention, which simulate various permutations of column order, hence enabling it to learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable TabDAR to not only freely handle heterogeneous tabular data but also support convenient and flexible unconditional/conditional sampling. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.

[LG-43] Predicting sub-population specific viral evolution

链接: https://arxiv.org/abs/2410.21518
作者: Wenxian Shi,Menghua Wu,Regina Barzilay
关键词-EN: Forecasting the change, disease surveillance, crucial for therapeutic, therapeutic design, design and disease
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.

[LG-44] A Systematic Review of Machine Learning in Sports Betting: Techniques Challenges and Future Directions

链接: https://arxiv.org/abs/2410.21484
作者: René Manassé Galekwa,Jean Marie Tshimula,Etienne Gael Tajeuna,Kyamakya Kyandoghere
关键词-EN: experienced rapid growth, rapid growth, driven largely, online platforms, industry has experienced
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The sports betting industry has experienced rapid growth, driven largely by technological advancements and the proliferation of online platforms. Machine learning (ML) has played a pivotal role in the transformation of this sector by enabling more accurate predictions, dynamic odds-setting, and enhanced risk management for both bookmakers and bettors. This systematic review explores various ML techniques, including support vector machines, random forests, and neural networks, as applied in different sports such as soccer, basketball, tennis, and cricket. These models utilize historical data, in-game statistics, and real-time information to optimize betting strategies and identify value bets, ultimately improving profitability. For bookmakers, ML facilitates dynamic odds adjustment and effective risk management, while bettors leverage data-driven insights to exploit market inefficiencies. This review also underscores the role of ML in fraud detection, where anomaly detection models are used to identify suspicious betting patterns. Despite these advancements, challenges such as data quality, real-time decision-making, and the inherent unpredictability of sports outcomes remain. Ethical concerns related to transparency and fairness are also of significant importance. Future research should focus on developing adaptive models that integrate multimodal data and manage risk in a manner akin to financial portfolios. This review provides a comprehensive examination of the current applications of ML in sports betting, and highlights both the potential and the limitations of these technologies.

[LG-45] A Mathematical Analysis of Neural Operator Behaviors

链接: https://arxiv.org/abs/2410.21481
作者: Vu-Anh Le,Mehmet Dik
关键词-EN: partial differential equations, solving complex partial, complex partial differential, offering useful applications, differential equations
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 24 pages

点击查看摘要

Abstract:Neural operators have emerged as transformative tools for learning mappings between infinite-dimensional function spaces, offering useful applications in solving complex partial differential equations (PDEs). This paper presents a rigorous mathematical framework for analyzing the behaviors of neural operators, with a focus on their stability, convergence, clustering dynamics, universality, and generalization error. By proposing a list of novel theorems, we provide stability bounds in Sobolev spaces and demonstrate clustering in function space via gradient flow interpretation, guiding neural operator design and optimization. Based on these theoretical gurantees, we aim to offer clear and unified guidance in a single setting for the future design of neural operator-based methods.

[LG-46] Inverting Gradient Attacks Naturally Makes Data Poisons: An Availability Attack on Neural Networks

链接: https://arxiv.org/abs/2410.21453
作者: Wassim Bouaziz,El-Mahdi El-Mhamdi,Nicolas Usunier
关键词-EN: machine learning algorithms, data poisoning, data poisoning tamper, data, Gradient attacks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:Gradient attacks and data poisoning tamper with the training of machine learning algorithms to maliciously alter them and have been proven to be equivalent in convex settings. The extent of harm these attacks can produce in non-convex settings is still to be determined. Gradient attacks can affect far less systems than data poisoning but have been argued to be more harmful since they can be arbitrary, whereas data poisoning reduces the attacker’s power to only being able to inject data points to training sets, via e.g. legitimate participation in a collaborative dataset. This raises the question of whether the harm made by gradient attacks can be matched by data poisoning in non-convex settings. In this work, we provide a positive answer in a worst-case scenario and show how data poisoning can mimic a gradient attack to perform an availability attack on (non-convex) neural networks. Through gradient inversion, commonly used to reconstruct data points from actual gradients, we show how reconstructing data points out of malicious gradients can be sufficient to perform a range of attacks. This allows us to show, for the first time, an availability attack on neural networks through data poisoning, that degrades the model’s performances to random-level through a minority (as low as 1%) of poisoned points.

[LG-47] A Temporal Linear Network for Time Series Forecasting

链接: https://arxiv.org/abs/2410.21448
作者: Remi Genet,Hugo Inzirillo
关键词-EN: outperform sophisticated approaches, Recent research, deep learning architectures, complex deep learning, time series forecasting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research has challenged the necessity of complex deep learning architectures for time series forecasting, demonstrating that simple linear models can often outperform sophisticated approaches. Building upon this insight, we introduce a novel architecture the Temporal Linear Net (TLN), that extends the capabilities of linear models while maintaining interpretability and computational efficiency. TLN is designed to effectively capture both temporal and feature-wise dependencies in multivariate time series data. Our approach is a variant of TSMixer that maintains strict linearity throughout its architecture. TSMixer removes activation functions, introduces specialized kernel initializations, and incorporates dilated convolutions to handle various time scales, while preserving the linear nature of the model. Unlike transformer-based models that may lose temporal information due to their permutation-invariant nature, TLN explicitly preserves and leverages the temporal structure of the input data. A key innovation of TLN is its ability to compute an equivalent linear model, offering a level of interpretability not found in more complex architectures such as TSMixer. This feature allows for seamless conversion between the full TLN model and its linear equivalent, facilitating both training flexibility and inference optimization.

[LG-48] Sum-of-squares lower bounds for Non-Gaussian Component Analysis

链接: https://arxiv.org/abs/2410.21426
作者: Ilias Diakonikolas,Sushrut Karmalkar,Shuo Pang,Aaron Potechin
关键词-EN: Non-Gaussian Component Analysis, Component Analysis, standard Gaussian, Non-Gaussian Component, high-dimensional dataset
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Non-Gaussian Component Analysis (NGCA) is the statistical task of finding a non-Gaussian direction in a high-dimensional dataset. Specifically, given i.i.d.\ samples from a distribution P^A_v on \mathbbR^n that behaves like a known distribution A in a hidden direction v and like a standard Gaussian in the orthogonal complement, the goal is to approximate the hidden direction. The standard formulation posits that the first k-1 moments of A match those of the standard Gaussian and the k -th moment differs. Under mild assumptions, this problem has sample complexity O(n) . On the other hand, all known efficient algorithms require \Omega(n^k/2) samples. Prior work developed sharp Statistical Query and low-degree testing lower bounds suggesting an information-computation tradeoff for this problem. Here we study the complexity of NGCA in the Sum-of-Squares (SoS) framework. Our main contribution is the first super-constant degree SoS lower bound for NGCA. Specifically, we show that if the non-Gaussian distribution A matches the first (k-1) moments of \mathcalN(0, 1) and satisfies other mild conditions, then with fewer than n^(1 - \varepsilon)k/2 many samples from the normal distribution, with high probability, degree (\log n)^1\over 2-o_n(1) SoS fails to refute the existence of such a direction v . Our result significantly strengthens prior work by establishing a super-polynomial information-computation tradeoff against a broader family of algorithms. As corollaries, we obtain SoS lower bounds for several problems in robust statistics and the learning of mixture models. Our SoS lower bound proof introduces a novel technique, that we believe may be of broader interest, and a number of refinements over existing methods. Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (stat.ML) MSC classes: 03F20, 68Q17 ACMclasses: F.2.2 Cite as: arXiv:2410.21426 [cs.LG] (or arXiv:2410.21426v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.21426 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-49] Bayesian Collaborative Bandits with Thompson Sampling for Improved Outreach in Maternal Health Program

链接: https://arxiv.org/abs/2410.21405
作者: Arpan Dasgupta,Gagan Jain,Arun Suggala,Karthikeyan Shanmugam,Milind Tambe,Aparna Taneja
关键词-EN: Mobile health, automated health information, automated health, face a critical, optimizing the timing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mobile health (mHealth) programs face a critical challenge in optimizing the timing of automated health information calls to beneficiaries. This challenge has been formulated as a collaborative multi-armed bandit problem, requiring online learning of a low-rank reward matrix. Existing solutions often rely on heuristic combinations of offline matrix completion and exploration strategies. In this work, we propose a principled Bayesian approach using Thompson Sampling for this collaborative bandit problem. Our method leverages prior information through efficient Gibbs sampling for posterior inference over the low-rank matrix factors, enabling faster convergence. We demonstrate significant improvements over state-of-the-art baselines on a real-world dataset from the world’s largest maternal mHealth program. Our approach achieves a 16% reduction in the number of calls compared to existing methods and a 47 % reduction compared to the deployed random policy. This efficiency gain translates to a potential increase in program capacity by 0.5-1.4 million beneficiaries, granting them access to vital ante-natal and post-natal care information. Furthermore, we observe a 7% and 29% improvement in beneficiary retention (an extremely hard metric to impact) compared to state-of-the-art and deployed baselines, respectively. Synthetic simulations further demonstrate the superiority of our approach, particularly in low-data regimes and in effectively utilizing prior information. We also provide a theoretical analysis of our algorithm in a special setting using Eluder dimension.

[LG-50] Batch match and patch: low-rank approximations for score-based variational inference

链接: https://arxiv.org/abs/2410.22292
作者: Chirag Modi,Diana Cai,Lawrence K. Saul
关键词-EN: Black-box variational inference, multivariate Gaussian approximation, Black-box variational, high dimensional problems, scales poorly
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Black-box variational inference (BBVI) scales poorly to high dimensional problems when it is used to estimate a multivariate Gaussian approximation with a full covariance matrix. In this paper, we extend the batch-and-match (BaM) framework for score-based BBVI to problems where it is prohibitively expensive to store such covariance matrices, let alone to estimate them. Unlike classical algorithms for BBVI, which use gradient descent to minimize the reverse Kullback-Leibler divergence, BaM uses more specialized updates to match the scores of the target density and its Gaussian approximation. We extend the updates for BaM by integrating them with a more compact parameterization of full covariance matrices. In particular, borrowing ideas from factor analysis, we add an extra step to each iteration of BaM – a patch – that projects each newly updated covariance matrix into a more efficiently parameterized family of diagonal plus low rank matrices. We evaluate this approach on a variety of synthetic target distributions and real-world problems in high-dimensional inference.

[LG-51] Leveraging Recurrent Neural Networks for Predicting Motor Movements from Primate Motor Cortex Neural Recordings

链接: https://arxiv.org/abs/2410.22283
作者: Yuanxi Wang,Zuowen Wang,Shih-Chii Liu
关键词-EN: efficient deep learning, deep learning solution, decoding motor movements, Gated Recurrent Unit, Autoencoder Gated Recurrent
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:This paper presents an efficient deep learning solution for decoding motor movements from neural recordings in non-human primates. An Autoencoder Gated Recurrent Unit (AEGRU) model was adopted as the model architecture for this task. The autoencoder is only used during the training stage to achieve better generalization. Together with the preprocessing techniques, our model achieved 0.71 R^2 score, surpassing the baseline models in Neurobench and is ranked first for R^2 in the IEEE BioCAS 2024 Grand Challenge on Neural Decoding. Model pruning is also applied leading to a reduction of 41.4% of the multiply-accumulate (MAC) operations with little change in the R^2 score compared to the unpruned model.

[LG-52] Deep Q-Exponential Processes

链接: https://arxiv.org/abs/2410.22119
作者: Zhi Chang,Chukwudi Obite,Shuang Zhou,Shiwei Lan
关键词-EN: deep Gaussian process, deep neural networks, Gaussian process, neural networks, deep Gaussian
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:Motivated by deep neural networks, the deep Gaussian process (DGP) generalizes the standard GP by stacking multiple layers of GPs. Despite the enhanced expressiveness, GP, as an L_2 regularization prior, tends to be over-smooth and sub-optimal for inhomogeneous subjects, such as images with edges. Recently, Q-exponential process (Q-EP) has been proposed as an L_q relaxation to GP and demonstrated with more desirable regularization properties through a parameter q0 with q=2 corresponding to GP. Sharing the similar tractability of posterior and predictive distributions with GP, Q-EP can also be stacked to improve its modeling flexibility. In this paper, we generalize Q-EP to deep Q-EP to enjoy both proper regularization and improved expressiveness. The generalization is realized by introducing shallow Q-EP as a latent variable model and then building a hierarchy of the shallow Q-EP layers. Sparse approximation by inducing points and scalable variational strategy are applied to facilitate the inference. We demonstrate the numerical advantages of the proposed deep Q-EP model by comparing with multiple state-of-the-art deep probabilistic models.

[LG-53] Variational inference for pile-up removal at hadron colliders with diffusion models

链接: https://arxiv.org/abs/2410.22074
作者: Malte Algren,Christopher Pollard,John Andrew Raine,Tobias Golling
关键词-EN: interactions using variational, variational inference, inference with diffusion, diffusion models, called Vipr
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 19 pages, 13 figures

点击查看摘要

Abstract:In this paper, we present a novel method for pile-up removal of pp interactions using variational inference with diffusion models, called Vipr. Instead of using classification methods to identify which particles are from the primary collision, a generative model is trained to predict the constituents of the hard-scatter particle jets with pile-up removed. This results in an estimate of the full posterior over hard-scatter jet constituents, which has not yet been explored in the context of pile-up removal. We evaluate the performance of Vipr in a sample of jets from simulated t\bart events overlain with pile-up contamination. Vipr outperforms SoftDrop in predicting the substructure of the hard-scatter jets over a wide range of pile-up scenarios.

[LG-54] Hamiltonian Monte Carlo on ReLU Neural Networks is Inefficient NEURIPS2024

链接: https://arxiv.org/abs/2410.22065
作者: Vu C. Dinh,Lam Si Tung Ho,Cuong V. Nguyen
关键词-EN: Hamiltonian Monte Carlo, Monte Carlo algorithm, Hamiltonian Monte, Monte Carlo, Bayesian neural network
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Paper published at NeurIPS 2024

点击查看摘要

Abstract:We analyze the error rates of the Hamiltonian Monte Carlo algorithm with leapfrog integrator for Bayesian neural network inference. We show that due to the non-differentiability of activation functions in the ReLU family, leapfrog HMC for networks with these activation functions has a large local error rate of \Omega(\epsilon) rather than the classical error rate of O(\epsilon^3) . This leads to a higher rejection rate of the proposals, making the method inefficient. We then verify our theoretical findings through empirical simulations as well as experiments on a real-world dataset that highlight the inefficiency of HMC inference on ReLU-based neural networks compared to analytical networks.

[LG-55] On uniqueness in structured model learning

链接: https://arxiv.org/abs/2410.22009
作者: Martin Holler,Erion Morina
关键词-EN: partial differential equations, learning physical laws, unknown model components, differential equations, model components
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of uniqueness in learning physical laws for systems of partial differential equations (PDEs). Contrary to most existing approaches, it considers a framework of structured model learning, where existing, approximately correct physical models are augmented with components that are learned from data. The main result of the paper is a uniqueness result that covers a large class of PDEs and a suitable class of neural networks used for approximating the unknown model components. The uniqueness result shows that, in the idealized setting of full, noiseless measurements, a unique identification of the unknown model components is possible as regularization-minimizing solution of the PDE system. Furthermore, the paper provides a convergence result showing that model components learned on the basis of incomplete, noisy measurements approximate the ground truth model component in the limit. These results are possible under specific properties of the approximating neural networks and due to a dedicated choice of regularization. With this, a practical contribution of this analytic paper is to provide a class of model learning frameworks different to standard settings where uniqueness can be expected in the limit of full measurements.

[LG-56] Node Regression on Latent Position Random Graphs via Local Averaging

链接: https://arxiv.org/abs/2410.21987
作者: Martin Gjorgjevski,Nicolas Keriven,Simon Barthelmé,Yohann De Castro
关键词-EN: Latent Position Model, Nadaraya Watson estimator, Latent Position, graph, Latent
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Node regression consists in predicting the value of a graph label at a node, given observations at the other nodes. To gain some insight into the performance of various estimators for this task, we perform a theoretical study in a context where the graph is random. Specifically, we assume that the graph is generated by a Latent Position Model, where each node of the graph has a latent position, and the probability that two nodes are connected depend on the distance between the latent positions of the two nodes. In this context, we begin by studying the simplest possible estimator for graph regression, which consists in averaging the value of the label at all neighboring nodes. We show that in Latent Position Models this estimator tends to a Nadaraya Watson estimator in the latent space, and that its rate of convergence is in fact the same. One issue with this standard estimator is that it averages over a region consisting of all neighbors of a node, and that depending on the graph model this may be too much or too little. An alternative consists in first estimating the true distances between the latent positions, then injecting these estimated distances into a classical Nadaraya Watson estimator. This enables averaging in regions either smaller or larger than the typical graph neighborhood. We show that this method can achieve standard nonparametric rates in certain instances even when the graph neighborhood is too large or too small.

[LG-57] Individualised recovery trajectories of patients with impeded mobility using distance between probability distributions of learnt graphs

链接: https://arxiv.org/abs/2410.21983
作者: Chuqiao Zhang,Crina Grosan,Dalia Chakrabarty
关键词-EN: undergoing physical rehabilitation, cumulative performance attained, Movement Recovery Scores, benefit from feedback, reliable assessment
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Patients who are undergoing physical rehabilitation, benefit from feedback that follows from reliable assessment of their cumulative performance attained at a given time. In this paper, we provide a method for the learning of the recovery trajectory of an individual patient, as they undertake exercises as part of their physical therapy towards recovery of their loss of movement ability, following a critical illness. The difference between the Movement Recovery Scores (MRSs) attained by a patient, when undertaking a given exercise routine on successive instances, is given by a statistical distance/divergence between the (posterior) probabilities of random graphs that are Bayesianly learnt using time series data on locations of 20 of the patient’s joints, recorded on an e-platform as the patient exercises. This allows for the computation of the MRS on every occasion the patient undertakes this exercise, using which, the recovery trajectory is drawn. We learn each graph as a Random Geometric Graph drawn in a probabilistic metric space, and identify the closed-form marginal posterior of any edge of the graph, given the correlation structure of the multivariate time series data on joint locations. On the basis of our recovery learning, we offer recommendations on the optimal exercise routines for patients with given level of mobility impairment.

[LG-58] Online Test of a Neural Network Deep Convection Parameterization in ARP-GEM1

链接: https://arxiv.org/abs/2410.21920
作者: Blanka Balogh,David Saint-Martin,Olivier Geoffroy
关键词-EN: neural network, OASIS coupler, global atmospheric model, neural network emulator, neural network-based parameterization
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, submitted to Artificial Intelligence for the Earth Systems

点击查看摘要

Abstract:In this study, we present the integration of a neural network-based parameterization into the global atmospheric model ARP-GEM1, leveraging the Python interface of the OASIS coupler. This approach facilitates the exchange of fields between the Fortran-based ARP-GEM1 model and a Python component responsible for neural network inference. As a proof-of-concept experiment, we trained a neural network to emulate the deep convection parameterization of ARP-GEM1. Using the flexible Fortran/Python interface, we have successfully replaced ARP-GEM1’s deep convection scheme with a neural network emulator. To assess the performance of the neural network deep convection scheme, we have run a 5-years ARP-GEM1 simulation using the neural network emulator. The evaluation of averaged fields showed good agreement with output from an ARP-GEM1 simulation using the physics-based deep convection scheme. The Python component was deployed on a separate partition from the general circulation model, using GPUs to increase inference speed of the neural network.

[LG-59] Identifiability Analysis of Linear ODE Systems with Hidden Confounders NEURIPS2024

链接: https://arxiv.org/abs/2410.21917
作者: Yuanyuan Wang,Biwei Huang,Wei Huang,Xi Geng,Mingming Gong
关键词-EN: Ordinary Differential Equation, linear Ordinary Differential, Differential Equation, Ordinary Differential, linear Ordinary
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:The identifiability analysis of linear Ordinary Differential Equation (ODE) systems is a necessary prerequisite for making reliable causal inferences about these systems. While identifiability has been well studied in scenarios where the system is fully observable, the conditions for identifiability remain unexplored when latent variables interact with the system. This paper aims to address this gap by presenting a systematic analysis of identifiability in linear ODE systems incorporating hidden confounders. Specifically, we investigate two cases of such systems. In the first case, latent confounders exhibit no causal relationships, yet their evolution adheres to specific functional forms, such as polynomial functions of time t . Subsequently, we extend this analysis to encompass scenarios where hidden confounders exhibit causal dependencies, with the causal structure of latent variables described by a Directed Acyclic Graph (DAG). The second case represents a more intricate variation of the first case, prompting a more comprehensive identifiability analysis. Accordingly, we conduct detailed identifiability analyses of the second system under various observation conditions, including both continuous and discrete observations from single or multiple trajectories. To validate our theoretical results, we perform a series of simulations, which support and substantiate our findings.

[LG-60] Hierarchical mixtures of Unigram models for short text clustering: the role of Beta-Liouville priors

链接: https://arxiv.org/abs/2410.21862
作者: Massimo Bilancia,Samuele Magro
关键词-EN: Multinomial mixture model, mixture model tailored, short text data, Dirichlet prior distribution, Multinomial mixture
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 47 pages, 4 figures. Submitted

点击查看摘要

Abstract:This paper presents a variant of the Multinomial mixture model tailored for the unsupervised classification of short text data. Traditionally, the Multinomial probability vector in this hierarchical model is assigned a Dirichlet prior distribution. Here, however, we explore an alternative prior - the Beta-Liouville distribution - which offers a more flexible correlation structure than the Dirichlet. We examine the theoretical properties of the Beta-Liouville distribution, focusing on its conjugacy with the Multinomial likelihood. This property enables the derivation of update equations for a CAVI (Coordinate Ascent Variational Inference) variational algorithm, facilitating the approximate posterior estimation of model parameters. Additionally, we propose a stochastic variant of the CAVI algorithm that enhances scalability. The paper concludes with data examples that demonstrate effective strategies for setting the Beta-Liouville hyperparameters.

[LG-61] Joint Estimation of Conditional Mean and Covariance for Unbalanced Panels

链接: https://arxiv.org/abs/2410.21858
作者: Damir Filipovic,Paul Schneider
关键词-EN: nonparametric kernel-based estimator, nonparametric kernel-based, covariance matrices, large unbalanced panels, large unbalanced
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel nonparametric kernel-based estimator of cross-sectional conditional mean and covariance matrices for large unbalanced panels. We show its consistency and provide finite-sample guarantees. In an empirical application, we estimate conditional mean and covariance matrices for a large unbalanced panel of monthly stock excess returns given macroeconomic and firm-specific covariates from 1962 to this http URL estimator performs well with respect to statistical measures. It is informative for empirical asset pricing, generating conditional mean-variance efficient portfolios with substantial out-of-sample Sharpe ratios far beyond equal-weighted benchmarks.

[LG-62] Exponentially Consistent Statistical Classification of Continuous Sequences with Distribution Uncertainty

链接: https://arxiv.org/abs/2410.21799
作者: Lina Zhu,Lin Zhou
关键词-EN: training sequences, testing sequence, multiple classification, study multiple classification, aims to determine
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.01161

点击查看摘要

Abstract:In multiple classification, one aims to determine whether a testing sequence is generated from the same distribution as one of the M training sequences or not. Unlike most of existing studies that focus on discrete-valued sequences with perfect distribution match, we study multiple classification for continuous sequences with distribution uncertainty, where the generating distributions of the testing and training sequences deviate even under the true hypothesis. In particular, we propose distribution free tests and prove that the error probabilities of our tests decay exponentially fast for three different test designs: fixed-length, sequential, and two-phase tests. We first consider the simple case without the null hypothesis, where the testing sequence is known to be generated from a distribution close to the generating distribution of one of the training sequences. Subsequently, we generalize our results to a more general case with the null hypothesis by allowing the testing sequence to be generated from a distribution that is vastly different from the generating distributions of all training sequences.

[LG-63] Minimax optimality of deep neural networks on dependent data via PAC-Bayes bounds

链接: https://arxiv.org/abs/2410.21702
作者: Pierre Alquier,William Kengne
关键词-EN: deep neural networks, groundbreaking work, defined by composition, optimality of deep, deep neural
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a groundbreaking work, Schmidt-Hieber (2020) proved the minimax optimality of deep neural networks with ReLu activation for least-square regression estimation over a large class of functions defined by composition. In this paper, we extend these results in many directions. First, we remove the i.i.d. assumption on the observations, to allow some time dependence. The observations are assumed to be a Markov chain with a non-null pseudo-spectral gap. Then, we study a more general class of machine learning problems, which includes least-square and logistic regression as special cases. Leveraging on PAC-Bayes oracle inequalities and a version of Bernstein inequality due to Paulin (2015), we derive upper bounds on the estimation risk for a generalized Bayesian estimator. In the case of least-square regression, this bound matches (up to a logarithmic factor) the lower bound of Schmidt-Hieber (2020). We establish a similar lower bound for classification with the logistic loss, and prove that the proposed DNN estimator is optimal in the minimax sense.

[LG-64] he Effects of Multi-Task Learning on ReLU Neural Network Functions

链接: https://arxiv.org/abs/2410.21696
作者: Julia Nakhleh,Joseph Shenouda,Robert D. Nowak
关键词-EN: shallow ReLU neural, multi-task shallow ReLU, ReLU neural network, neural network, neural network learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the properties of solutions to multi-task shallow ReLU neural network learning problems, wherein the network is trained to fit a dataset with minimal sum of squared weights. Remarkably, the solutions learned for each individual task resemble those obtained by solving a kernel method, revealing a novel connection between neural networks and kernel methods. It is known that single-task neural network training problems are equivalent to minimum norm interpolation problem in a non-Hilbertian Banach space, and that the solutions of such problems are generally non-unique. In contrast, we prove that the solutions to univariate-input, multi-task neural network interpolation problems are almost always unique, and coincide with the solution to a minimum-norm interpolation problem in a Sobolev (Reproducing Kernel) Hilbert Space. We also demonstrate a similar phenomenon in the multivariate-input case; specifically, we show that neural network learning problems with large numbers of diverse tasks are approximately equivalent to an \ell^2 (Hilbert space) minimization problem over a fixed kernel determined by the optimal neurons.

[LG-65] Learning the structure of any Hamiltonian from minimal assumptions

链接: https://arxiv.org/abs/2410.21635
作者: Andrew Zhao
关键词-EN: quantum many-body Hamiltonian, unknown quantum many-body, many-body Hamiltonian, study the problem, unknown quantum
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 44 pages

点击查看摘要

Abstract:We study the problem of learning an unknown quantum many-body Hamiltonian H from black-box queries to its time evolution e^-\mathrmi H t . Prior proposals for solving this task either impose some assumptions on H , such as its interaction structure or locality, or otherwise use an exponential amount of computational postprocessing. In this paper, we present efficient algorithms to learn any n -qubit Hamiltonian, assuming only a bound on the number of Hamiltonian terms, m \leq \mathrmpoly(n) . Our algorithms do not need to know the terms in advance, nor are they restricted to local interactions. We consider two models of control over the time evolution: the first has access to time reversal ( t 0 ), enabling an algorithm that outputs an \epsilon -accurate classical description of H after querying its dynamics for a total of \widetildeO(m/\epsilon) evolution time. The second access model is more conventional, allowing only forward-time evolutions; our algorithm requires \widetildeO(|H|^3/\epsilon^4) evolution time in this setting. Central to our results is the recently introduced concept of a pseudo-Choi state of H . We extend the utility of this learning resource by showing how to use it to learn the Fourier spectrum of H , how to achieve nearly Heisenberg-limited scaling with it, and how to prepare it even under our more restricted access models.

[LG-66] Refined Risk Bounds for Unbounded Losses via Transductive Priors

链接: https://arxiv.org/abs/2410.21621
作者: Jian Qian,Alexander Rakhlin,Nikita Zhivotovskiy
关键词-EN: design vectors, design, problems with hinge, characterized by unbounded, fixed design regression
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We revisit the sequential variants of linear regression with the squared loss, classification problems with hinge loss, and logistic regression, all characterized by unbounded losses in the setup where no assumptions are made on the magnitude of design vectors and the norm of the optimal vector of parameters. The key distinction from existing results lies in our assumption that the set of design vectors is known in advance (though their order is not), a setup sometimes referred to as transductive online learning. While this assumption seems similar to fixed design regression or denoising, we demonstrate that the sequential nature of our algorithms allows us to convert our bounds into statistical ones with random design without making any additional assumptions about the distribution of the design vectors–an impossibility for standard denoising results. Our key tools are based on the exponential weights algorithm with carefully chosen transductive (design-dependent) priors, which exploit the full horizon of the design vectors. Our classification regret bounds have a feature that is only attributed to bounded losses in the literature: they depend solely on the dimension of the parameter space and on the number of rounds, independent of the design vectors or the norm of the optimal solution. For linear regression with squared loss, we further extend our analysis to the sparse case, providing sparsity regret bounds that additionally depend on the magnitude of the response variables. We argue that these improved bounds are specific to the transductive setting and unattainable in the worst-case sequential setup. Our algorithms, in several cases, have polynomial time approximations and reduce to sampling with respect to log-concave measures instead of aggregating over hard-to-construct \varepsilon -covers of classes. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2410.21621 [stat.ML] (or arXiv:2410.21621v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2410.21621 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-67] Accelerated Robust Lower-Field Neonatal MRI with Generative Models

链接: https://arxiv.org/abs/2410.21602
作者: Yamin Arefeen,Brett Levac,Jonathan I. Tamir
关键词-EN: Magnetic Resonance Imaging, Resonance Imaging, Neonatal Magnetic Resonance, enables non-invasive assessment, early life development
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 5 pages, 3 figures, submitted to ISBI 2025

点击查看摘要

Abstract:Neonatal Magnetic Resonance Imaging (MRI) enables non-invasive assessment of potential brain abnormalities during the critical phase of early life development. Recently, interest has developed in lower field (i.e., below 1.5 Tesla) MRI systems that trade-off magnetic field strength for portability and access in the neonatal intensive care unit (NICU). Unfortunately, lower-field neonatal MRI still suffers from long scan times and motion artifacts that can limit its clinical utility for neonates. This work improves motion robustness and accelerates lower field neonatal MRI through diffusion-based generative modeling and signal processing based motion modeling. We first gather a training dataset of clinical neonatal MRI images. Then we train a diffusion-based generative model to learn the statistical distribution of fully-sampled images by applying several signal processing methods to handle the lower signal-to-noise ratio and lower quality of our MRI images. Finally, we present experiments demonstrating the utility of our generative model to improve reconstruction performance across two tasks: accelerated MRI and motion correction.

[LG-68] ATLAS: Adapting Trajectory Lengths and Step-Size for Hamiltonian Monte Carlo

链接: https://arxiv.org/abs/2410.21587
作者: Chirag Modi
关键词-EN: fixed mass matrix, Hamiltonian Monte-Carlo, constant step size, varying curvature, auto-tuned variant
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code available at this https URL

点击查看摘要

Abstract:Hamiltonian Monte-Carlo (HMC) and its auto-tuned variant, the No U-Turn Sampler (NUTS) can struggle to accurately sample distributions with complex geometries, e.g., varying curvature, due to their constant step size for leapfrog integration and fixed mass matrix. In this work, we develop a strategy to locally adapt the step size parameter of HMC at every iteration by evaluating a low-rank approximation of the local Hessian and estimating its largest eigenvalue. We combine it with a strategy to similarly adapt the trajectory length by monitoring the no U-turn condition, resulting in an adaptive sampler, ATLAS: adapting trajectory length and step-size. We further use a delayed rejection framework for making multiple proposals that improves the computational efficiency of ATLAS, and develop an approach for automatically tuning its hyperparameters during warmup. We compare ATLAS with state-of-the-art samplers like NUTS on a suite of synthetic and real world examples, and show that i) unlike NUTS, ATLAS is able to accurately sample difficult distributions with complex geometries, ii) it is computationally competitive to NUTS for simpler distributions, and iii) it is more robust to the tuning of hyperparamters.

[LG-69] Deep Learning Methods for the Noniterative Conditional Expectation G-Formula for Causal Inference from Complex Observational Data

链接: https://arxiv.org/abs/2410.21531
作者: Sophia M Rein,Jing Li,Miguel Hernan,Andrew Beam
关键词-EN: parametric NICE estimator, sustained treatment strategies, assumptions of consistency, NICE g-formula estimator, identifying assumptions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The g-formula can be used to estimate causal effects of sustained treatment strategies using observational data under the identifying assumptions of consistency, positivity, and exchangeability. The non-iterative conditional expectation (NICE) estimator of the g-formula also requires correct estimation of the conditional distribution of the time-varying treatment, confounders, and outcome. Parametric models, which have been traditionally used for this purpose, are subject to model misspecification, which may result in biased causal estimates. Here, we propose a unified deep learning framework for the NICE g-formula estimator that uses multitask recurrent neural networks for estimation of the joint conditional distributions. Using simulated data, we evaluated our model’s bias and compared it with that of the parametric g-formula estimator. We found lower bias in the estimates of the causal effect of sustained treatment strategies on a survival outcome when using the deep learning estimator compared with the parametric NICE estimator in settings with simple and complex temporal dependencies between covariates. These findings suggest that our Deep Learning g-formula estimator may be less sensitive to model misspecification than the classical parametric NICE estimator when estimating the causal effect of sustained treatment strategies from complex observational data.

[LG-70] Diagnosis of Knee Osteoarthritis Using Bioimpedance and Deep Learning

链接: https://arxiv.org/abs/2410.21512
作者: Jamal Al-Nabulsi,Mohammad Al-Sayed Ahmad,Baraa Hasaneiah,Fayhaa AlZoubi
关键词-EN: ultimately improving patient, improving patient outcomes, Diagnosing knee osteoarthritis, early is crucial, joint damage
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diagnosing knee osteoarthritis (OA) early is crucial for managing symptoms and preventing further joint damage, ultimately improving patient outcomes and quality of life. In this paper, a bioimpedance-based diagnostic tool that combines precise hardware and deep learning for effective non-invasive diagnosis is proposed. system features a relay-based circuit and strategically placed electrodes to capture comprehensive bioimpedance data. The data is processed by a neural network model, which has been optimized using convolutional layers, dropout regularization, and the Adam optimizer. This approach achieves a 98% test accuracy, making it a promising tool for detecting knee osteoarthritis musculoskeletal disorders.

[LG-71] Flow Matching for Atmospheric Retrieval of Exoplanets: Where Reliability meets Adaptive Noise Levels

链接: https://arxiv.org/abs/2410.21477
作者: Timothy D. Gebhard,Jonas Wildberger,Maximilian Dax,Annalena Kofler,Daniel Angerhausen,Sascha P. Quanz,Bernhard Schölkopf
关键词-EN: Inferring atmospheric properties, Inferring atmospheric, atmospheric retrieval, understanding their formation, properties of exoplanets
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: Accepted for publication in Astronomy Astrophysics

点击查看摘要

Abstract:Inferring atmospheric properties of exoplanets from observed spectra is key to understanding their formation, evolution, and habitability. Since traditional Bayesian approaches to atmospheric retrieval (e.g., nested sampling) are computationally expensive, a growing number of machine learning (ML) methods such as neural posterior estimation (NPE) have been proposed. We seek to make ML-based atmospheric retrieval (1) more reliable and accurate with verified results, and (2) more flexible with respect to the underlying neural networks and the choice of the assumed noise models. First, we adopt flow matching posterior estimation (FMPE) as a new ML approach to atmospheric retrieval. FMPE maintains many advantages of NPE, but provides greater architectural flexibility and scalability. Second, we use importance sampling (IS) to verify and correct ML results, and to compute an estimate of the Bayesian evidence. Third, we condition our ML models on the assumed noise level of a spectrum (i.e., error bars), thus making them adaptable to different noise models. Both our noise level-conditional FMPE and NPE models perform on par with nested sampling across a range of noise levels when tested on simulated data. FMPE trains about 3 times faster than NPE and yields higher IS efficiencies. IS successfully corrects inaccurate ML results, identifies model failures via low efficiencies, and provides accurate estimates of the Bayesian evidence. FMPE is a powerful alternative to NPE for fast, amortized, and parallelizable atmospheric retrieval. IS can verify results, thus helping to build confidence in ML-based approaches, while also facilitating model comparison via the evidence ratio. Noise level conditioning allows design studies for future instruments to be scaled up, for example, in terms of the range of signal-to-noise ratios.

[LG-72] High-Dimensional Gaussian Process Regression with Soft Kernel Interpolation

链接: https://arxiv.org/abs/2410.21419
作者: Chris Camaño,Daniel Huang
关键词-EN: scalable Gaussian Process, Soft Kernel Interpolation, introduce Soft Kernel, Gaussian Process, introduce Soft
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:We introduce Soft Kernel Interpolation (SoftKI) designed for scalable Gaussian Process (GP) regression on high-dimensional datasets. Inspired by Structured Kernel Interpolation (SKI), which approximates a GP kernel via interpolation from a structured lattice, SoftKI approximates a kernel via softmax interpolation from a smaller number of learned interpolation (i.e, inducing) points. By abandoning the lattice structure used in SKI-based methods, SoftKI separates the cost of forming an approximate GP kernel from the dimensionality of the data, making it well-suited for high-dimensional datasets. We demonstrate the effectiveness of SoftKI across various examples, and demonstrate that its accuracy exceeds that of other scalable GP methods when the data dimensionality is modest (around 10 ).

[LG-73] Model-agnostic basis functions for the 2-point correlation function of dark matter in linear theory

链接: https://arxiv.org/abs/2410.21374
作者: Aseem Paranjape(IUCAA),Ravi K. Sheth(UPenn/ICTP)
关键词-EN: mathcal, boldsymbol, theta, lin, linearly evolved
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 20 pages, 9 figures, to be submitted to JCAP. The implementation of the BiSequential architecture, along with a simple example notebook, is publicly available as part of the MLFundas repository at this https URL

点击查看摘要

Abstract:We consider approximating the linearly evolved 2-point correlation function (2pcf) of dark matter \xi_\rm lin(r;\boldsymbol\theta) in a cosmological model with parameters \boldsymbol\theta as the linear combination \xi_\rm lin(r;\boldsymbol\theta)\approx\sum_i,b_i®,w_i(\boldsymbol\theta) , where the functions \mathcalB=\b_i®\ form a \textitmodel-agnostic basis for the linear 2pcf. This decomposition is important for model-agnostic analyses of the baryon acoustic oscillation (BAO) feature in the nonlinear 2pcf of galaxies that fix \mathcalB and leave the coefficients \w_i\ free. To date, such analyses have made simple but sub-optimal choices for \mathcalB , such as monomials. We develop a machine learning framework for systematically discovering a \textitminimal basis \mathcalB that describes \xi_\rm lin® near the BAO feature in a wide class of cosmological models. We use a custom architecture, denoted \textttBiSequential , for a neural network (NN) that explicitly realizes the separation between r and \boldsymbol\theta above. The optimal NN trained on data in which only \Omega_\rm m,h\ are varied in a \textitflat \Lambda CDM model produces a basis \mathcalB comprising 9 functions capable of describing \xi_\rm lin® to \sim0.6% accuracy in \textitcurved w CDM models varying 7 parameters within \sim5% of their fiducial, flat \Lambda CDM values. Scales such as the peak, linear point and zero-crossing of \xi_\rm lin® are also recovered with very high accuracy. We compare our approach to other compression schemes in the literature, and speculate that \mathcalB may also encompass \xi_\rm lin® in modified gravity models near our fiducial \Lambda CDM model. Using our basis functions in model-agnostic BAO analyses can potentially lead to significant statistical gains.

[LG-74] Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects CIKM2024

链接: https://arxiv.org/abs/2410.21343
作者: Dong Yao,Caizhi Tang,Qing Cui,Longfei Li
关键词-EN: observational data, Data, randomized data, observational, readily obtainable
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, Accepted By CIKM2024

点击查看摘要

Abstract:Data from observational studies (OSs) is widely available and readily obtainable yet frequently contains confounding biases. On the other hand, data derived from randomized controlled trials (RCTs) helps to reduce these biases; however, it is expensive to gather, resulting in a tiny size of randomized data. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data must require \textitcomplete observational data, meaning that both treated subjects and untreated subjects must be included in OSs. This prerequisite confines the applicability of such methods to very specific situations, given that including all subjects, whether treated or untreated, in observational studies is not consistently achievable. In our paper, we propose a resilient approach to \textbfCombine \textbfIncomplete \textbfObservational data and randomized data for HTE estimation, which we abbreviate as \textbfCIO. The CIO is capable of estimating HTEs efficiently regardless of the completeness of the observational data, be it full or partial. Concretely, a confounding bias function is first derived using the pseudo-experimental group from OSs, in conjunction with the pseudo-control group from RCTs, via an effect estimation procedure. This function is subsequently utilized as a corrective residual to rectify the observed outcomes of observational data during the HTE estimation by combining the available observational data and the all randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets.

[LG-75] CloudCast – Total Cloud Cover Nowcasting with Machine Learning

链接: https://arxiv.org/abs/2410.21329
作者: Mikko Partio,Leila Hieta,Anniina Kokkonen
关键词-EN: solar power generation, Cloud cover, Cloud cover plays, total cloud cover, including agriculture
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 27 pages, 12 figures

点击查看摘要

Abstract:Cloud cover plays a critical role in weather prediction and impacts several sectors, including agriculture, solar power generation, and aviation. Despite advancements in numerical weather prediction (NWP) models, forecasting total cloud cover remains challenging due to the small-scale nature of cloud formation processes. In this study, we introduce CloudCast, a convolutional neural network (CNN) based on the U-Net architecture, designed to predict total cloud cover (TCC) up to five hours ahead. Trained on five years of satellite data, CloudCast significantly outperforms traditional NWP models and optical flow methods. Compared to a reference NWP model, CloudCast achieves a 24% lower mean absolute error and reduces multi-category prediction errors by 46%. The model demonstrates strong performance, particularly in capturing the large-scale structure of cloud cover in the first few forecast hours, though later predictions are subject to blurring and underestimation of cloud formation. An ablation study identified the optimal input features and loss functions, with MAE-based models performing the best. CloudCast has been integrated into the Finnish Meteorological Institute’s operational nowcasting system, where it improves cloud cover forecasts used by public and private sector clients. While CloudCast is limited by a relatively short skillful lead time of about three hours, future work aims to extend this through more complex network architectures and higher-resolution data. CloudCast code is available at this https URL.

[LG-76] Achilles Neural Network to Predict the Gold Vs US Dollar Integration with Trading Bot for Automatic Trading

链接: https://arxiv.org/abs/2410.21291
作者: Angel Varela
关键词-EN: big challenge, machine learning world, Long Short Term, Short Term Memory, Gold vs USD
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting the stock market is a big challenge for the machine learning world. It is known how difficult it is to have accurate and consistent predictions with ML models. Some architectures are able to capture the movement of stocks but almost never are able to be launched to the production world. We present Achilles, with a classical architecture of LSTM(Long Short Term Memory) neural network this model is able to predict the Gold vs USD commodity. With the predictions minute-per-minute of this model we implemented a trading bot to run during 23 days of testing excluding weekends. At the end of the testing period we generated 1623.52 in profit with the methodology used. The results of our method demonstrate Machine Learning can successfully be implemented to predict the Gold vs USD commodity.

信息检索

[IR-0] Synthetic Data Generation with Large Language Models for Personalized Community Question Answering

链接: https://arxiv.org/abs/2410.22182
作者: Marco Braga,Pranav Kasela,Alessandro Raganato,Gabriella Pasi
关键词-EN: Large Language Models, Information Retrieval, Large Language, topic studied, Language Models
类目: Information Retrieval (cs.IR)
*备注: Accepted in WI-IAT '24

点击查看摘要

Abstract:Personalization in Information Retrieval (IR) is a topic studied by the research community since a long time. However, there is still a lack of datasets to conduct large-scale evaluations of personalized IR; this is mainly due to the fact that collecting and curating high-quality user-related information requires significant costs and time investment. Furthermore, the creation of datasets for Personalized IR (PIR) tasks is affected by both privacy concerns and the need for accurate user-related data, which are often not publicly available. Recently, researchers have started to explore the use of Large Language Models (LLMs) to generate synthetic datasets, which is a possible solution to generate data for low-resource tasks. In this paper, we investigate the potential of Large Language Models (LLMs) for generating synthetic documents to train an IR system for a Personalized Community Question Answering task. To study the effectiveness of IR models fine-tuned on LLM-generated data, we introduce a new dataset, named Sy-SE-PQA. We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Starting from questions in SE-PQA, we generate synthetic answers using different prompt techniques and LLMs. Our findings suggest that LLMs have high potential in generating data tailored to users’ needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.

[IR-1] SimRec: Mitigating the Cold-Start Problem in Sequential Recommendation by Integrating Item Similarity RECSYS2024

链接: https://arxiv.org/abs/2410.22136
作者: Shaked Brody,Shoval Lagziel
关键词-EN: Sequential recommendation systems, Sequential recommendation, struggle to make, make predictions, action when dealing
类目: Information Retrieval (cs.IR)
*备注: ACM RecSys 2024 Workshop on Context-Aware Recommender Systems

点击查看摘要

Abstract:Sequential recommendation systems often struggle to make predictions or take action when dealing with cold-start items that have limited amount of interactions. In this work, we propose SimRec - a new approach to mitigate the cold-start problem in sequential recommendation systems. SimRec addresses this challenge by leveraging the inherent similarity among items, incorporating item similarities into the training process through a customized loss function. Importantly, this enhancement is attained with identical model architecture and the same amount of trainable parameters, resulting in the same inference time and requiring minimal additional effort. This novel approach results in a robust contextual sequential recommendation model capable of effectively handling rare items, including those that were not explicitly seen during training, thereby enhancing overall recommendation performance. Rigorous evaluations against multiple baselines on diverse datasets showcase SimRec’s superiority, particularly in scenarios involving items occurring less than 10 times in the training data. The experiments reveal an impressive improvement, with SimRec achieving up to 78% higher HR@10 compared to SASRec. Notably, SimRec outperforms strong baselines on sparse datasets while delivering on-par performance on dense datasets. Our code is available at this https URL.

[IR-2] sting Identity of Distributions under Kolmogorov Distance in Polylogarithmic Space

链接: https://arxiv.org/abs/2410.22123
作者: Christian Janos Lebeda,Jakub Tětek
关键词-EN: varepsilon, Suppose, fixed distribution, cs.DS, Abstract
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Suppose we have a sample from a distribution D and we want to test whether D = D^* for a fixed distribution D^* . Specifically, we want to reject with constant probability, if the distance of D from D^* is \geq \varepsilon in a given metric. In the case of continuous distributions, this has been studied thoroughly in the statistics literature. Namely, for the well-studied Kolmogorov metric a test is known that uses the optimal O(1/\varepsilon^2) samples. However, this test naively uses also space O(1/\varepsilon^2) , and previous work improved this to O(1/\varepsilon) . In this paper, we show that much less space suffices – we give an algorithm that uses space O(\log^4 \varepsilon^-1) in the streaming setting while also using an asymptotically optimal number of samples. This is in contrast with the standard total variation distance on discrete distributions for which such space reduction is known to be impossible. Finally, we state 9 related open problems that we hope will spark interest in this and related problems. Subjects: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR) Cite as: arXiv:2410.22123 [cs.DS] (or arXiv:2410.22123v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2410.22123 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-3] Guided Diffusion-based Counterfactual Augmentation for Robust Session-based Recommendation

链接: https://arxiv.org/abs/2410.21892
作者: Muskan Gupta,Priyanka Gupta,Lovekesh Vig
关键词-EN: Session-based recommendation, recommend top-K items, current session, user behaviour, aim to recommend
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Session-based recommendation (SR) models aim to recommend top-K items to a user, based on the user’s behaviour during the current session. Several SR models are proposed in the literature, however,concerns have been raised about their susceptibility to inherent biases in the training data (observed data) such as popularity bias. SR models when trained on the biased training data may encounter performance challenges on out-of-distribution data in real-world scenarios. One way to mitigate popularity bias is counterfactual data augmentation. Compared to prior works that rely on generating data using SR models, we focus on utilizing the capabilities of state-of-the art diffusion models for generating counterfactual data. We propose a guided diffusion-based counterfactual augmentation framework for SR. Through a combination of offline and online experiments on a real-world and simulated dataset, respectively, we show that our approach performs significantly better than the baseline SR models and other state-of-the art augmentation frameworks. More importantly, our framework shows significant improvement on less popular target items, by achieving up to 20% gain in Recall and 13% gain in CTR on real-world and simulated datasets,respectively.

[IR-4] Application of Audio Fingerprinting Techniques for Real-Time Scalable Speech Retrieval and Speech Clusterization

链接: https://arxiv.org/abs/2410.21876
作者: Kemal Altwlkany,Sead Delalić,Adis Alihodžić,Elmedin Selmanović,Damir Hasić
关键词-EN: queried audio sample, recent years, noisy conditions, great advances, advances in recent
类目: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple’s Shazam or Google’s Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.

[IR-5] PerSRV: Personalized Sticker Retrieval with Vision-Language Model

链接: https://arxiv.org/abs/2410.21801
作者: Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Min Zhang
关键词-EN: Instant Messaging, sticker retrieval, sticker, retrieval, daily communication
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Instant Messaging is a popular means for daily communication, allowing users to send text and stickers. As the saying goes, “a picture is worth a thousand words”, so developing an effective sticker retrieval technique is crucial for enhancing user experience. However, existing sticker retrieval methods rely on labeled data to interpret stickers, and general-purpose Vision-Language Models (VLMs) often struggle to capture the unique semantics of stickers. Additionally, relevant-based sticker retrieval methods lack personalization, creating a gap between diverse user expectations and retrieval results. To address these, we propose the Personalized Sticker Retrieval with Vision-Language Model framework, namely PerSRV, structured into offline calculations and online processing modules. The online retrieval part follows the paradigm of relevant recall and personalized ranking, supported by the offline pre-calculation parts, which are sticker semantic understanding, utility evaluation and personalization modules. Firstly, for sticker-level semantic understanding, we supervised fine-tuned LLaVA-1.5-7B to generate human-like sticker semantics, complemented by textual content extracted from figures and historical interaction queries. Secondly, we investigate three crowd-sourcing metrics for sticker utility evaluation. Thirdly, we cluster style centroids based on users’ historical interactions to achieve personal preference modeling. Finally, we evaluate our proposed PerSRV method on a public sticker retrieval dataset from WeChat, containing 543,098 candidates and 12,568 interactions. Experimental results show that PerSRV significantly outperforms existing methods in multi-modal sticker retrieval. Additionally, our fine-tuned VLM delivers notable improvements in sticker semantic understandings.

[IR-6] Can Users Detect Biases or Factual Errors in Generated Responses in Conversational Information-Seeking? SIGIR

链接: https://arxiv.org/abs/2410.21529
作者: Weronika Łajewska,Krisztian Balog,Damiano Spina,Johanne Trippas
关键词-EN: require exploring multiple, exploring multiple facets, Information-seeking dialogues span, range of questions, facets and viewpoints
类目: Information Retrieval (cs.IR)
*备注: Extended version of the paper that appeared in the Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '24)

点击查看摘要

Abstract:Information-seeking dialogues span a wide range of questions, from simple factoid to complex queries that require exploring multiple facets and viewpoints. When performing exploratory searches in unfamiliar domains, users may lack background knowledge and struggle to verify the system-provided information, making them vulnerable to misinformation. We investigate the limitations of response generation in conversational information-seeking systems, highlighting potential inaccuracies, pitfalls, and biases in the responses. The study addresses the problem of query answerability and the challenge of response incompleteness. Our user studies explore how these issues impact user experience, focusing on users’ ability to identify biased, incorrect, or incomplete responses. We design two crowdsourcing tasks to assess user experience with different system response variants, highlighting critical issues to be addressed in future conversational information-seeking research. Our analysis reveals that it is easier for users to detect response incompleteness than query answerability and user satisfaction is mostly associated with response diversity, not factual correctness.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2024-10-30

目录

概览 (2024-10-30)

自然语言处理

人工智能

计算机视觉

机器学习

信息检索

附件下载