Arxiv今日论文 | 2024-11-01

本篇博文主要展示 2024-11-01 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决如何利用丰富的语言输入来促进强化学习（RL）实体代理的任务学习问题。解决方案的关键在于研究不同类型的语言输入对代理学习的影响，特别是语言的信息量（即对过去行为的反馈和未来指导）和多样性（即语言表达的变异）。通过在四个RL基准上的实验，论文发现，使用多样化和信息丰富的语言反馈训练的代理能够实现更好的泛化能力和对新任务的快速适应。这一发现强调了语言在教授实体代理新任务中的关键作用。

链接: https://arxiv.org/abs/2410.24218
作者: Jiajun Xi,Yinong He,Jianing Yang,Yinpei Dai,Joyce Chai
关键词-EN: leverage human language, real-world scenarios, ability to leverage, gain explicit, explicit or implicit
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: EMNLP 2024 Main. Project website: this https URL

点击查看摘要

Abstract:In real-world scenarios, it is desirable for embodied agents to have the ability to leverage human language to gain explicit or implicit knowledge for learning tasks. Despite recent progress, most previous approaches adopt simple low-level instructions as language inputs, which may not reflect natural human communication. It’s not clear how to incorporate rich language use to facilitate task learning. To address this question, this paper studies different types of language inputs in facilitating reinforcement learning (RL) embodied agents. More specifically, we examine how different levels of language informativeness (i.e., feedback on past behaviors and future guidance) and diversity (i.e., variation of language expressions) impact agent learning and inference. Our empirical results based on four RL benchmarks demonstrate that agents trained with diverse and informative language feedback can achieve enhanced generalization and fast adaptation to new tasks. These findings highlight the pivotal role of language use in teaching embodied agents new tasks in an open world. Project website: this https URL
摘要：在现实世界中，实体智能体具备利用人类语言获取显性或隐性知识以完成学习任务的能力是十分理想的。尽管近期取得了一些进展，但大多数先前的方法采用简单的低级指令作为语言输入，这可能无法反映自然的人类交流方式。如何将丰富的语言运用融入到任务学习中尚不明确。为了解决这一问题，本文研究了不同类型的语言输入在促进强化学习 (Reinforcement Learning, RL) 实体智能体学习中的作用。更具体地说，我们探讨了不同层次的语言信息量（即对过去行为的反馈和未来指导）和多样性（即语言表达的变异）如何影响智能体的学习和推理。基于四个 RL 基准的实证结果表明，通过多样化和信息丰富的语言反馈训练的智能体能够实现更强的泛化能力和对新任务的快速适应。这些发现突显了语言运用在教授实体智能体在开放世界中学习新任务的关键作用。项目网站：this https URL

[NLP-1] P-Masking: Power Law Masking Improves Multi-attribute Controlled Generation

【速读】：该论文试图解决在文本生成过程中对多种语言属性进行精确控制的问题，尤其是在属性数量变化的情况下。解决方案的关键在于引入了一种名为动态P-MASKING的策略，该策略在训练过程中从幂律分布中采样掩码率。这一创新方法使模型能够发展出鲁棒的表示，并根据属性的数量变化调整其控制能力，从单一属性到多重复杂配置。P-MASKING技术增强了模型在不同属性可见性级别下的管理能力，从而在多属性生成任务中表现出色。实验结果表明，LingGen在属性控制精度和文本流畅性方面超越了当前最先进的模型，特别是在属性需求变化的情况下表现尤为突出。

链接: https://arxiv.org/abs/2410.24201
作者: Mohamed Elgaar,Hadi Amiri
关键词-EN: wide array, offers precise control, controlled text generation, attributes varies, attribute
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce LingGen, a novel approach for controlled text generation that offers precise control over a wide array of linguistic attributes, even as the number of attributes varies. LingGen employs a dynamic P-MASKING strategy, which samples masking rates from a power law distribution during training. This innovative approach enables the model to develop robust representations and adapt its attribute control capabilities across a variable number of attributes, from a single attribute to multiple complex configurations. The P-MASKING technique enhances LingGen’s ability to manage different levels of attribute visibility, resulting in superior performance in multi-attribute generation tasks. Our experiments demonstrate that LingGen surpasses current state-of-the-art models in both attribute control accuracy and text fluency, particularly excelling in scenarios with varying attribute demands. Additionally, our ablation studies highlight the effectiveness of P-MASKING and the influence of different base language models on performance. These findings demonstrate LingGen’s potential for applications requiring precise and adaptable control over multiple linguistic attributes in text generation.
摘要：我们介绍了 LingGen，这是一种新颖的受控文本生成方法，能够在属性数量变化的情况下，对广泛的语义属性提供精确的控制。LingGen 采用了一种动态 P-MASKING 策略，在训练过程中从幂律分布中采样掩码率。这种创新方法使模型能够发展出强大的表示能力，并适应从单一属性到多重复杂配置的属性控制能力。P-MASKING 技术增强了 LingGen 管理不同属性可见性水平的能力，从而在多属性生成任务中表现出卓越的性能。我们的实验表明，LingGen 在属性控制准确性和文本流畅性方面均超越了当前最先进的模型，特别是在属性需求变化的情况下表现尤为突出。此外，我们的消融研究突显了 P-MASKING 的有效性以及不同基础语言模型对性能的影响。这些发现展示了 LingGen 在需要对文本生成中的多重语义属性进行精确且适应性控制的应用中的潜力。

[NLP-2] Length-Induced Embedding Collapse in Transformer-based Models

【速读】：该论文试图解决文本嵌入在处理长文本时性能下降的问题，这一问题源于一种称为“长度坍缩（Length Collapse）”的现象，即长文本的嵌入向量会坍缩到一个狭窄的空间，导致不同长度文本的嵌入分布不一致，从而影响下游任务的性能。解决方案的关键在于通过理论分析发现，自注意力机制（self-attention mechanism）本质上作为低通滤波器，长序列会增加低通滤波效果的衰减率，导致输入的token特征图仅保留直流分量（Direct-Current component），从而引发长度坍缩。为此，论文提出了一种名为TempScale的无调参方法，通过在softmax()中引入温度参数，实现更高的低通滤波衰减率，从而缓解长度坍缩的限制。该方法可以无缝集成到多种基于transformer的嵌入模型中，并在实验中显著提升了现有模型在长文本输入上的性能。

链接: https://arxiv.org/abs/2410.24200
作者: Yuqi Zhou,Sunhao Dai,Zhanshuo Cao,Xiao Zhang,Jun Xu
关键词-EN: Text embeddings enable, longer text embeddings, text embeddings collapse, enable various applications, Text Embedding Benchmark
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long context retrieval.
摘要：文本嵌入技术支持多种应用，但在处理较长文本时其性能会下降。本文研究发现，这种性能下降是由于一种称为“长度坍缩”的现象，即较长文本的嵌入向量会坍缩到一个狭窄的空间中。这种坍缩导致不同长度文本的嵌入向量之间出现分布不一致，最终影响下游任务的性能。从理论上讲，通过考虑自注意力机制本质上作为低通滤波器的作用，我们证明了长序列会增加自注意力机制低通滤波效果的衰减率。随着层数的加深，过度的低通滤波使得Token信号仅保留其直流分量（DC），这意味着输入Token特征图将坍缩到一个狭窄的空间，尤其是在长文本中。基于上述分析，我们提出通过在softmax()中引入温度参数来缓解不理想的长度坍缩限制，从而实现更高的低通滤波衰减率。这种无需调参的方法称为TempScale，可以插入到多种基于Transformer的嵌入模型中。实验证明，TempScale能够提升现有嵌入模型的性能，特别是在长文本输入方面，在Massive Text Embedding Benchmark (MTEB)的40个数据集中带来了高达0.53%的性能提升，在专注于长上下文检索的LongEmbed的4个数据集中带来了0.82%的性能提升。

[NLP-3] Multi-Attribute Linguistic Tuning for Controlled Paraphrase Generation

【速读】：该论文试图解决生成式重述（paraphrase generation）中对40种语言属性（linguistic attributes）的精确控制和微调问题。解决方案的关键在于采用了一种编码器-解码器架构（encoder-decoder architecture），该架构能够根据输入的源句子和期望的语言属性生成满足这些属性的重述。为了确保推理时的高质量输出，该方法配备了一个质量控制机制（quality control mechanism），通过逐步调整语言属性的嵌入（embedding）来找到最接近且可实现的目标属性配置，从而生成高质量的重述。实验结果表明，该模型在生成满足期望语言属性的重述方面优于现有的可控生成模型。

链接: https://arxiv.org/abs/2410.24199
作者: Mohamed Elgaar,Hadi Amiri
关键词-EN: enables precise control, linguistic attributes, desired linguistic attributes, enables precise, attributes
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a novel approach to paraphrase generation that enables precise control and fine-tuning of 40 linguistic attributes for English. Our model is an encoder-decoder architecture that takes as input a source sentence and desired linguistic attributes, and produces paraphrases of the source that satisfy the desired attributes. To guarantee high-quality outputs at inference time, our method is equipped with a quality control mechanism that gradually adjusts the embedding of linguistic attributes to find the nearest and most attainable configuration of desired attributes for paraphrase generation. We evaluate the effectiveness of our method by comparing it to recent controllable generation models. Experimental results demonstrate that the proposed model outperforms baselines in generating paraphrases that satisfy desired linguistic attributes.
摘要：我们提出了一种新颖的释义生成方法，能够对英语的40种语言属性进行精确控制和微调。我们的模型采用编码器-解码器架构，输入包括源句子和所需的语言属性，并生成满足这些属性的释义。为了在推理时保证高质量的输出，我们的方法配备了质量控制机制，该机制逐步调整语言属性的嵌入，以找到最接近且最可实现的所需属性配置用于释义生成。我们通过与最近的控制生成模型进行比较，评估了该方法的有效性。实验结果表明，所提出的模型在生成满足所需语言属性的释义方面优于基线模型。

[NLP-4] SelfCodeAlign: Self-Alignment for Code Generation NEURIPS2024

【速读】：该论文试图解决大规模语言模型（LLMs）在遵循人类指令方面的能力提升问题，特别是针对代码生成模型。解决方案的关键是提出了SelfCodeAlign，这是一种完全透明且许可的自我对齐流程，无需大量人工标注或蒸馏。SelfCodeAlign的核心在于使用同一基础模型在整个数据生成过程中进行推理，通过从高质量种子代码片段中提取多样化的编程概念来生成新任务，然后对每个任务生成多个响应，并在沙盒环境中验证这些响应。通过这种方式，SelfCodeAlign能够生成高质量的指令-响应对数据集，用于模型的指令微调，从而显著提升模型在代码生成任务上的性能。实验结果表明，使用SelfCodeAlign微调的模型在HumanEval+上的pass@1达到了67.1，超过了CodeLlama-70B-Instruct，尽管模型规模小了十倍。此外，SelfCodeAlign在不同规模的LLMs上均表现出色，并且能够更好地利用自身数据分布进行对齐。

链接: https://arxiv.org/abs/2410.24198
作者: Yuxiang Wei,Federico Cassano,Jiawei Liu,Yifeng Ding,Naman Jain,Zachary Mueller,Harm de Vries,Leandro von Werra,Arjun Guha,Lingming Zhang
关键词-EN: supervised fine-tuning approach, large language models, follow human instructions, supervised fine-tuning, fine-tuning approach
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component’s effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.
摘要：指令微调是一种监督式微调方法，显著提升大语言模型（LLMs）遵循人类指令的能力。我们提出 SelfCodeAlign，这是首个完全透明且许可的流程，用于自我对齐代码 LLMs，无需大量人工标注或蒸馏。SelfCodeAlign 在整个数据生成过程中使用相同的基模型进行推理。首先，它从高质量的种子代码片段中提取多样化的编程概念以生成新任务。接着，为每个任务抽取多个响应，并将其与测试用例配对，在沙盒环境中进行验证。最后，通过的示例被选用于指令微调。在我们的主要实验中，我们使用 SelfCodeAlign 与 CodeQwen1.5-7B 生成了一个包含 74k 指令-响应对的训练集。在此数据集上进行微调后，模型在 HumanEval+ 上达到了 67.1 的 pass@1，超过了 CodeLlama-70B-Instruct，尽管其规模仅为后者的十分之一。在所有基准测试中，这一微调模型始终优于使用 OctoPack 训练的原始版本，OctoPack 是之前无需人工标注或蒸馏的指令微调的最先进方法。此外，我们展示了 SelfCodeAlign 在不同规模的 LLMs（从 3B 到 33B）中的有效性，并表明基模型能从与其自身数据分布的对齐中获得更多收益。我们进一步验证了流程中每个组件的有效性，结果显示 SelfCodeAlign 优于直接从 GPT-4o 蒸馏以及基于 GPT-3.5 的领先蒸馏方法，如 OSS-Instruct 和 Evol-Instruct。SelfCodeAlign 还促成了 StarCoder2-Instruct 的诞生，这是首个完全透明、许可且自我对齐的代码 LLM，达到了最先进的编码性能。

[NLP-5] Hidden Persuaders: LLM s Political Leaning and Their Influence on Voters EMNLP2024

【速读】：该论文试图解决的问题是生成式大型语言模型（LLMs）如何影响民主进程，特别是它们在政治倾向和选民选择方面的潜在影响。解决方案的关键在于通过一系列实验，包括投票模拟和与真实选民的互动，来揭示LLMs在政治倾向上的偏好，并评估其对选民选择的影响。实验结果表明，LLMs在未被明确要求说服用户支持特定候选人的情况下，仍能显著影响选民的选择，使其更倾向于支持民主党候选人，这一效应在某些情况下甚至超过了传统政治竞选活动的影响。此外，研究还探讨了如何通过安全方法使LLMs在政治上更加中立，但仍存在一些未解之谜需要进一步研究。

链接: https://arxiv.org/abs/2410.24190
作者: Yujin Potter,Shiyang Lai,Junsol Kim,James Evans,Dawn Song
关键词-EN: Democratic nominee, Democratic, LLMs, nominee, political
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: EMNLP 2024 Main

点击查看摘要

Abstract:How could LLMs influence our democracy? We investigate LLMs’ political leanings and the potential influence of LLMs on voters by conducting multiple experiments in a U.S. presidential election context. Through a voting simulation, we first demonstrate 18 open- and closed-weight LLMs’ political preference for a Democratic nominee over a Republican nominee. We show how this leaning towards the Democratic nominee becomes more pronounced in instruction-tuned models compared to their base versions by analyzing their responses to candidate-policy related questions. We further explore the potential impact of LLMs on voter choice by conducting an experiment with 935 U.S. registered voters. During the experiments, participants interacted with LLMs (Claude-3, Llama-3, and GPT-4) over five exchanges. The experiment results show a shift in voter choices towards the Democratic nominee following LLM interaction, widening the voting margin from 0.7% to 4.6%, even though LLMs were not asked to persuade users to support the Democratic nominee during the discourse. This effect is larger than many previous studies on the persuasiveness of political campaigns, which have shown minimal effects in presidential elections. Many users also expressed a desire for further political interaction with LLMs. Which aspects of LLM interactions drove these shifts in voter choice requires further study. Lastly, we explore how a safety method can make LLMs more politically neutral, while leaving some open questions.
摘要：大语言模型（LLM）如何影响我们的民主？我们通过在美国总统选举背景下进行多项实验，研究了 LLM 的政治倾向及其对选民的潜在影响。通过投票模拟，我们首先展示了 18 个开放权重和封闭权重的 LLM 对民主党候选人的政治偏好，超过了对共和党候选人的偏好。我们通过分析模型对候选人政策相关问题的回答，展示了在指令调优模型中，这种对民主党候选人的倾向性比其基础版本更为明显。我们进一步通过与 935 名美国注册选民进行的实验，探讨了 LLM 对选民选择的潜在影响。在实验过程中，参与者与 LLM（Claude-3、Llama-3 和 GPT-4）进行了五轮互动。实验结果显示，在与 LLM 互动后，选民的选择向民主党候选人倾斜，投票差距从 0.7% 扩大到 4.6%，尽管在对话过程中 LLM 并未被要求说服用户支持民主党候选人。这一效果大于许多先前关于政治竞选说服力的研究，这些研究在总统选举中显示出极小的影响。许多用户还表达了希望与 LLM 进行更多政治互动的愿望。哪些方面的 LLM 互动导致了选民选择的这种转变，需要进一步研究。最后，我们探讨了一种安全方法如何使 LLM 更具政治中立性，同时留下了一些开放性问题。

[NLP-6] Constraint Back-translation Improves Complex Instruction Following of Large Language Models

【速读】：该论文试图解决大型语言模型（LLMs）在遵循复杂约束指令（如格式、长度等）方面的困难。解决方案的关键在于提出了一种新的数据生成技术，称为约束反向翻译（constraint back-translation）。具体来说，该方法利用现有数据集中高质量的指令-响应对，并通过高级LLMs将响应中已满足的复杂约束添加到指令中，从而降低成本和数据噪声。实验结果表明，通过在CRAB数据集上进行后训练，可以显著提升多个骨干LLMs的复杂指令遵循能力，并且约束反向翻译还可以作为后训练中有用的辅助训练目标。

链接: https://arxiv.org/abs/2410.24175
作者: Yunjia Qi,Hao Peng,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li
关键词-EN: Large language models, Large language, complex, advanced LLMs, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. However, even advanced LLMs cannot follow complex instructions well, thus limiting the quality of generated data. In this work, we find that existing datasets inherently contain implicit complex constraints and propose a novel data generation technique, constraint back-translation. Specifically, we take the high-quality instruction-response pairs in existing datasets and only adopt advanced LLMs to add complex constraints already met by the responses to the instructions, which naturally reduces costs and data noise. In the experiments, we adopt Llama3-70B-Instruct to back-translate constraints and create a high-quality complex instruction-response dataset, named CRAB. We present that post-training on CRAB improves multiple backbone LLMs’ complex instruction-following ability, evaluated on extensive instruction-following benchmarks. We further find that constraint back-translation also serves as a useful auxiliary training objective in post-training. Our code, data, and models will be released to facilitate future research.
摘要：大语言模型（LLMs）在遵循具有复杂格式、长度等约束的指令时表现不佳。按照传统的指令微调实践，先前的工作通过对由高级LLMs根据复杂指令生成的复杂指令-响应对进行后训练来解决这一问题。然而，即使高级LLMs也无法很好地遵循复杂指令，从而限制了生成数据的质量。在本研究中，我们发现现有数据集本身就包含了隐含的复杂约束，并提出了一种新的数据生成技术——约束反向翻译。具体来说，我们利用现有数据集中高质量的指令-响应对，仅采用高级LLMs将响应中已满足的复杂约束添加到指令中，这自然降低了成本并减少了数据噪声。在实验中，我们采用Llama3-70B-Instruct进行约束反向翻译，并创建了一个高质量的复杂指令-响应数据集，命名为CRAB。我们展示了通过对CRAB进行后训练，可以显著提升多个骨干LLMs的复杂指令遵循能力，并在广泛的指令遵循基准测试中进行了评估。进一步研究发现，约束反向翻译在后训练中也可作为有用的辅助训练目标。我们的代码、数据和模型将公开发布，以促进未来研究。

[NLP-7] Novel Architecture for Distributed Travel Data Integration and Service Provision Using Microservices

【速读】：该论文试图解决航空公司预订系统在灵活性、性能和可扩展性方面的问题。解决方案的关键在于采用微服务架构，结合Redis缓存技术、Kafka和RabbitMQ消息系统、MongoDB和PostgreSQL存储系统，以及OAuth2和JWT授权技术，确保系统在高峰期仍能维持高数据一致性（99.5%）和低延迟（数据传播延迟小于75 ms），同时实现高吞吐量（1050事件/秒）。通过Docker和Kubernetes实现水平扩展，进一步提升了系统的可扩展性，同时保持低错误率（0.2%），从而优化用户体验和操作效率。

链接: https://arxiv.org/abs/2410.24174
作者: Biman Barua,M. Shamim Kaiser
关键词-EN: flexibility and performance, airline reservation system, paper introduces, Redis cache technologies, design incorporates Redis
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 20 pages, 12 figures

点击查看摘要

Abstract:This paper introduces a microservices architecture for the purpose of enhancing the flexibility and performance of an airline reservation system. The architectural design incorporates Redis cache technologies, two different messaging systems (Kafka and RabbitMQ), two types of storages (MongoDB, and PostgreSQL). It also introduces authorization techniques, including secure communication through OAuth2 and JWT which is essential with the management of high-demand travel services. According to selected indicators, the architecture provides an impressive level of data consistency at 99.5% and a latency of data propagation of less than 75 ms allowing rapid and reliable intercommunication between microservices. A system throughput of 1050 events per second was achieved so that the acceptability level was maintained even during peak time. Redis caching reduced a 92% cache hit ratio on the database thereby lowering the burden on the database and increasing the speed of response. Further improvement of the systems scalability was done through the use of Docker and Kubernetes which enabled services to be expanded horizontally to cope with the changes in demand. The error rates were very low, at 0.2% further enhancing the efficiency of the system in handling real-time data integration. This approach is suggested to meet the specific needs of the airline reservation system. It is secure, fast, scalable, all serving to improve the user experience as well as the efficiency of operations. The low latency and high data integration levels and prevaiing efficient usage of the resources demonstrates the architecture ability to offer continued support in the ever growing high demand situations.
摘要：本文介绍了一种用于增强航空公司预订系统灵活性和性能的微服务架构。该架构设计整合了 Redis 缓存技术、两种消息系统（Kafka 和 RabbitMQ）以及两种存储类型（MongoDB 和 PostgreSQL）。此外，还引入了授权技术，包括通过 OAuth2 和 JWT 实现的安全通信，这对于管理高需求的旅行服务至关重要。根据选定的指标，该架构在数据一致性方面达到了 99.5% 的高水平，数据传播延迟低于 75 毫秒，从而实现了微服务之间快速可靠的通信。系统吞吐量达到了每秒 1050 个事件，即使在高峰时段也能保持可接受的水平。Redis 缓存将数据库的缓存命中率提高了 92%，从而减轻了数据库的负担并加快了响应速度。通过使用 Docker 和 Kubernetes，进一步提升了系统的可扩展性，使得服务能够水平扩展以应对需求变化。错误率非常低，仅为 0.2%，进一步提高了系统处理实时数据集成的效率。该方法旨在满足航空公司预订系统的特定需求，具备安全性、快速性和可扩展性，从而提升用户体验和运营效率。低延迟、高数据集成水平以及资源的高效利用，展示了该架构在日益增长的高需求场景中持续支持的能力。

[NLP-8] Redefining Creative in Dictionary: Towards a Enhanced Semantic Understanding of Creative Generation

【速读】：该论文试图解决生成式模型在处理“创造性”概念时的抽象性和不可靠性问题。解决方案的关键在于通过TP2O任务（将两个不相关的概念融合），引入CreTok（\textttCreTok），将“创造性”重新定义为一个具体的、可普遍适用的融合概念的标记。这一重新定义通过不断随机采样不同概念的文本对，并优化目标与常量提示之间的余弦相似度来实现，使得CreTok能够学习到一种创造性概念融合的方法。实验结果表明，CreTok显著超越了最新的SOTA扩散模型，在创造性生成方面表现更优，且具有更大的灵活性和较低的时间开销，无需重新训练即可适用于任何概念的创造性生成。

链接: https://arxiv.org/abs/2410.24160
作者: Fu Feng,Yucheng Xie,Jing Wang,Xin Geng
关键词-EN: yield reliable semantic, reliable semantic recognition, simply adding, remains an inherently, yield reliable
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Creativity, both in human and diffusion models, remains an inherently abstract concept; thus, simply adding “creative” to a prompt does not yield reliable semantic recognition by the model. In this work, we concretize the abstract notion of “creative” through the TP2O task, which aims to merge two unrelated concepts, and introduce CreTok, redefining “creative” as the token \textttCreTok . This redefinition offers a more concrete and universally adaptable representation for concept blending. This redefinition occurs continuously, involving the repeated random sampling of text pairs with different concepts and optimizing cosine similarity between target and constant prompts. This approach enables \textttCreTok to learn a method for creative concept fusion. Extensive experiments demonstrate that the creative capability enabled by \textttCreTok substantially surpasses recent SOTA diffusion models and achieves superior creative generation. CreTok exhibits greater flexibility and reduced time overhead, as \textttCreTok can function as a universal token for any concept, facilitating creative generation without retraining.
摘要：创造力，无论是人类还是扩散模型中的创造力，仍然是一个本质上抽象的概念；因此，仅仅在提示中添加“创造性”并不能使模型产生可靠的语义识别。在这项工作中，我们通过 TP2O 任务将“创造性”这一抽象概念具体化，该任务旨在融合两个不相关的概念，并引入 CreTok，将“创造性”重新定义为 Token \textttCreTok。这种重新定义为概念融合提供了一个更具体且普遍适用的表示。这一重新定义过程是连续的，涉及不同概念的文本对进行重复随机采样，并优化目标提示与常量提示之间的余弦相似度。这种方法使得 \textttCreTok 能够学习一种创造性概念融合的方法。广泛的实验表明，\textttCreTok 所启用的创造性能力显著超越了最近的 SOTA 扩散模型，并在创造性生成方面表现更优。CreTok 展示了更大的灵活性和更少的时间开销，因为 \textttCreTok 可以作为任何概念的通用 Token，促进创造性生成而无需重新训练。

[NLP-9] GPT or BERT: why not both?

【速读】：该论文试图解决如何将掩码语言建模（Masked Language Modeling, MLM）与因果语言建模（Causal Language Modeling, CLM）结合的问题。解决方案的关键在于提出了一种混合训练目标，使得单个Transformer模型（GPT-BERT）能够在同一架构下同时具备MLM和CLM的优势。通过在BabyLM Challenge 2024上的测试，结果表明这种混合预训练方法优于单一的MLM或CLM模型。

链接: https://arxiv.org/abs/2410.24159
作者: Lucas Georges Gabriel Charpentier,David Samuel
关键词-EN: merge masked language, masked language modeling, causal language modeling, masked language, language modeling
类目: Computation and Language (cs.CL)
备注: 22 pages; submission to the BabyLM Challenge 2024

点击查看摘要

Abstract:We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.
摘要：我们提出了一种简单的方法，将掩码语言建模与因果语言建模相结合。这种混合训练目标使得模型在一个Transformer堆栈中同时结合了两种建模范式的优势：GPT-BERT可以像任何标准的因果或掩码语言模型一样透明地使用。我们在2024年的BabyLM挑战赛上测试了这种灵活行为的预训练过程。结果显示，混合预训练的表现优于仅使用掩码或仅使用因果建模的模型。我们公开发布了模型、训练语料库和代码。

[NLP-10] hought Space Explorer: Navigating and Expanding Thought Space for Large Language Model Reasoning

【速读】：该论文试图解决大语言模型 (LLMs) 在复杂推理任务中存在的认知盲点问题，即现有方法往往局限于已探索的解空间，未能充分扩展和优化思维结构。解决方案的关键在于设计了思维空间探索器 (Thought Space Explorer, TSE) 这一新型框架，通过生成新的推理步骤和分支，基于原有思维结构并采用多种设计策略，从而扩展思维空间并减轻认知盲点对LLM推理的影响。实验结果表明，TSE在多层次推理任务中有效，并通过对结构化和扩展思维的深入分析，揭示了其对释放LLM推理潜力的贡献。

链接: https://arxiv.org/abs/2410.24155
作者: Jinghan Zhang,Fengran Mo,Xiting Wang,Kunpeng Liu
关键词-EN: large language models, Recent advances, handling complex reasoning, language models, Thought Space Explorer
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have demonstrated their potential in handling complex reasoning tasks, which are usually achieved by constructing a thought chain to guide the model to solve the problem with multi-step thinking. However, existing methods often remain confined to previously explored solution spaces and thus overlook the critical blind spot within LLMs’ cognitive range. To address these issues, we design the Thought Space Explorer (TSE), a novel framework to expand and optimize thought structures to guide LLMs to explore their blind spots of thinking. By generating new reasoning steps and branches based on the original thought structure with various designed strategies, TSE broadens the thought space and alleviates the impact of blind spots for LLM reasoning. Experimental results on multiple levels of reasoning tasks demonstrate the efficacy of TSE. We also conduct extensive analysis to understand how structured and expansive thought can contribute to unleashing the potential of LLM reasoning capabilities.
摘要：近年来，大语言模型（Large Language Models, LLMs）在处理复杂推理任务方面展示了其潜力，这些任务通常通过构建思维链来引导模型进行多步骤思考来解决。然而，现有方法往往局限于先前探索的解空间，从而忽略了LLMs认知范围内的关键盲点。为解决这些问题，我们设计了思维空间探索器（Thought Space Explorer, TSE），这是一个新颖的框架，旨在扩展和优化思维结构，以引导LLMs探索其思维盲点。通过基于原始思维结构生成新的推理步骤和分支，并结合多种设计策略，TSE扩展了思维空间，并缓解了LLM推理中盲点的影响。在多个推理任务层次上的实验结果证明了TSE的有效性。我们还进行了广泛的分析，以理解结构化和扩展思维如何有助于释放LLM推理能力的潜力。

[NLP-11] Scaling Concept With Text-Guided Diffusion Models

【速读】：该论文试图解决在文本引导的扩散模型中如何增强或抑制特定概念的问题。解决方案的关键在于提出了一种名为 ScalingConcept 的方法，该方法通过分解文本引导扩散模型中的概念，并对其进行放大或缩小，从而在不引入新元素的情况下对输入内容进行调整。这一方法的核心在于识别并利用模型中概念分解的趋势，并通过系统评估验证了其在图像和音频领域的多种零样本应用，如规范姿态生成和生成声音的突出或移除。

链接: https://arxiv.org/abs/2410.24151
作者: Chao Huang,Susan Liang,Yunlong Tang,Yapeng Tian,Anurag Kumar,Chenliang Xu
关键词-EN: producing high-fidelity content, Text-guided diffusion models, producing high-fidelity, high-fidelity content, text descriptions
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real input without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.
摘要：文本引导的扩散模型通过从文本描述中生成高保真内容，彻底改变了生成任务。它们还实现了一种编辑范式，其中概念可以通过文本条件进行替换（例如，将狗替换为老虎）。在本研究中，我们探索了一种新颖的方法：我们能否在不替换概念的情况下，增强或抑制概念本身？通过实证研究，我们发现了一个趋势，即概念可以在文本引导的扩散模型中被分解。基于这一洞察，我们提出了ScalingConcept，这是一种简单而有效的方法，可以在不引入新元素的情况下，对真实输入中的分解概念进行放大或缩小。为了系统地评估我们的方法，我们提出了WeakConcept-10数据集，其中概念存在缺陷，需要被增强。更重要的是，ScalingConcept在图像和音频领域中实现了多种新颖的零样本应用，包括规范姿态生成和生成声音的突出显示或移除等任务。

[NLP-12] Dont Touch My Diacritics

【速读】：该论文旨在解决文本预处理过程中对变音符号（diacritics）处理不当导致的模型性能下降问题。论文通过多个案例研究展示了不一致的变音符号编码和完全去除变音符号对模型性能的负面影响。解决方案的关键在于呼吁社区在所有模型和工具包中采取简单但必要的步骤，以改进对变音符号文本的处理，从而提升多语言自然语言处理（NLP）的公平性（equity）。

链接: https://arxiv.org/abs/2410.24140
作者: Kyle Gorman,Yuval Pinter
关键词-EN: NLP models introduces, common practice, practice of preprocessing, introduces many decision, decision points
类目: Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
摘要：在将文本输入自然语言处理（NLP）模型之前进行预处理的常见做法引入了许多决策点，这些决策点对模型性能产生了意想不到的后果。在这篇观点文章中，我们重点关注了源自多种语言和文字的文本中变音符号的处理问题。通过几个案例研究，我们展示了变音符号字符编码不一致以及完全移除变音符号所带来的负面影响。我们呼吁社区在所有模型和工具包中采取简单但必要的步骤，以改进对带变音符号文本的处理，并由此扩展，提升多语言NLP的公平性。

[NLP-13] Multi-environment Topic Models

【速读】：该论文试图解决在大规模文本数据集中，如何从文档的协变量（如来源、风格、政治倾向）中提取出环境无关的潜在主题，并区分出环境特定的词汇。解决方案的关键在于引入了多环境主题模型 (Multi-environment Topic Model, MTM)，该模型通过无监督的概率建模方法，将全局主题与环境特定词汇分离。实验结果表明，MTM在多环境数据上表现优异，不仅在分布内和分布外预测中超越了强基线模型，还能准确发现主题对现实世界结果的因果效应。

链接: https://arxiv.org/abs/2410.24126
作者: Dominic Sobhani,Amir Feder,David Blei
关键词-EN: extracting latent themes, large text datasets, text datasets, powerful tool, tool for extracting
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments that modulate a “global” (environment-agnostic) topic representation. Accurately learning these representations is important for prediction on new documents in unseen environments and for estimating the causal effect of topics on real-world outcomes. To this end, we introduce the Multi-environment Topic Model (MTM), an unsupervised probabilistic model that separates global and environment-specific terms. Through experimentation on various political content, from ads to tweets and speeches, we show that the MTM produces interpretable global topics with distinct environment-specific words. On multi-environment data, the MTM outperforms strong baselines in and out-of-distribution. It also enables the discovery of accurate causal effects.
摘要：概率主题模型是从大型文本数据集中提取潜在主题的强大工具。在许多文本数据集中，我们还观察到每篇文档的协变量（例如，来源、风格、政治倾向），这些协变量作为环境，调节着一种“全局”（与环境无关）的主题表示。准确学习这些表示对于在新环境中对新文档进行预测以及估计主题对现实世界结果的因果效应至关重要。为此，我们引入了多环境主题模型 (Multi-environment Topic Model, MTM)，这是一种无监督的概率模型，能够分离全局和环境特定的术语。通过对各种政治内容（从广告到推文和演讲）的实验，我们展示了 MTM 生成的全局主题具有明显的环境特定词汇，且在多环境数据上，MTM 在分布内和分布外均优于强大的基线模型。此外，它还支持发现准确的因果效应。

[NLP-14] Nearest Neighbor Normalization Improves Multimodal Retrieval

【速读】：该论文试图解决多模态模型在图像-文本检索任务中表现不完美的问题。解决方案的关键是提出了一种无需额外训练的简单高效方法，称为最近邻归一化 (Nearest Neighbor Normalization, NNN)。NNN通过利用参考数据库中的信息来纠正已训练的对比图像-文本检索模型中的错误，从而在多个对比模型（如CLIP, BLIP, ALBEF, SigLIP, BEiT）和两个数据集（MS-COCO和Flickr30k）上显著提升检索性能，包括文本检索和图像检索。NNN方法不需要在参考数据库上进行训练，甚至可以在模型微调后进一步提高检索准确性。

链接: https://arxiv.org/abs/2410.24114
作者: Neil Chowdhury,Franklin Wang,Sumedh Shenoy,Douwe Kiela,Sarah Schwettmann,Tristan Thrush
关键词-EN: visual question answering, Multimodal models leverage, leverage large-scale pre-training, Nearest Neighbor Normalization, models leverage large-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
摘要：多模态模型通过大规模预训练在图像描述、视觉问答和跨模态检索等任务上取得了显著但仍不完美的表现。本文提出了一种简单且高效的方法，用于在不进行额外训练的情况下修正训练后的对比图像-文本检索模型的错误，称为最近邻归一化（Nearest Neighbor Normalization, NNN）。我们在我们测试的所有对比模型（CLIP、BLIP、ALBEF、SigLIP、BEiT）以及我们使用的两个数据集（MS-COCO 和 Flickr30k）上，展示了在文本检索和图像检索指标上的改进。NNN 需要一个参考数据库，但不需要对该数据库进行任何训练，甚至在模型微调后仍能提高检索准确性。

[NLP-15] In-Context Fine-Tuning for Time-Series Foundation Models

【速读】：该论文试图解决时间序列预测中的零样本预测问题，特别是如何在不进行显式微调的情况下，通过上下文中的时间序列示例来改进预测性能。解决方案的关键在于提出了一种上下文微调 (in-context fine-tuning) 的方法，通过在推理时使用多个相关时间序列的示例来提示预训练的时间序列基础模型，从而使其能够更好地适应目标领域的分布。这种方法不仅在流行的预测基准测试中表现优于传统的监督深度学习方法和统计模型，甚至与在目标领域显式微调的基础模型性能相当。

链接: https://arxiv.org/abs/2410.24087
作者: Abhimanyu Das,Matthew Faw,Rajat Sen,Yichen Zhou
关键词-EN: foundation model, time-series foundation models, foundation, time-series foundation, inference time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for \textitin-context fine-tuning of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, as well as other time-series foundation models. Interestingly, our in-context fine-tuning approach even rivals the performance of a foundation model that is explicitly fine-tuned on the target domain.
摘要：受到近期时间序列基础模型在零样本预测方面取得成功的启发，我们提出了一种时间序列基础模型的上下文内微调方法。具体而言，我们设计了一个预训练的基础模型，该模型可以在推理时通过提供多个时间序列示例来预测目标时间序列的未来趋势。我们的基础模型专门训练用于利用上下文窗口中的多个相关时间序列示例（除了目标时间序列的历史数据外），以帮助其在推理时适应目标领域的特定分布。我们展示了这种在推理时使用上下文示例的基础模型，在流行的预测基准测试中，相比监督深度学习方法、统计模型以及其他时间序列基础模型，能够获得更好的性能。有趣的是，我们的上下文内微调方法甚至可以与在目标领域上显式微调的基础模型性能相媲美。

[NLP-16] Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLM s

【速读】：该论文试图解决大型语言模型（LLMs）中嵌入的社会偏见问题，特别是针对阿拉伯人与西方人的偏见。解决方案的关键在于创建两个数据集：一个用于评估LLM对阿拉伯人与西方人的偏见，另一个用于测试模型对夸大负面特质提示（jailbreaks）的抵抗能力。通过评估六个LLMs（包括GPT-4、GPT-4o、LlaMA 3.1、Mistral 7B和Claude 3.5 Sonnet），研究发现79%的案例显示出对阿拉伯人的负面偏见，其中LlaMA 3.1-405B最为偏见。在jailbreak测试中，GPT-4o显示出最高的脆弱性，尽管它是优化版本。研究强调了在LLMs中实施更强大的偏见缓解策略和加强安全措施的迫切需要。

链接: https://arxiv.org/abs/2410.24049
作者: Muhammed Saeed,Elgizouli Mohamed,Mukhtar Mohamed,Shaina Raza,Shady Shehata,Muhammad Abdul-Mageed
关键词-EN: Large language models, Arabs versus Westerners, raise ethical concerns, ethical concerns due, Large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely used but raise ethical concerns due to embedded social biases. This study examines LLM biases against Arabs versus Westerners across eight domains, including women’s rights, terrorism, and anti-Semitism and assesses model resistance to perpetuating these biases. To this end, we create two datasets: one to evaluate LLM bias toward Arabs versus Westerners and another to test model safety against prompts that exaggerate negative traits (“jailbreaks”). We evaluate six LLMs – GPT-4, GPT-4o, LlaMA 3.1 (8B 405B), Mistral 7B, and Claude 3.5 Sonnet. We find 79% of cases displaying negative biases toward Arabs, with LlaMA 3.1-405B being the most biased. Our jailbreak tests reveal GPT-4o as the most vulnerable, despite being an optimized version, followed by LlaMA 3.1-8B and Mistral 7B. All LLMs except Claude exhibit attack success rates above 87% in three categories. We also find Claude 3.5 Sonnet the safest, but it still displays biases in seven of eight categories. Despite being an optimized version of GPT4, We find GPT-4o to be more prone to biases and jailbreaks, suggesting optimization flaws. Our findings underscore the pressing need for more robust bias mitigation strategies and strengthened security measures in LLMs.
摘要：大语言模型（LLMs）被广泛应用，但由于其内嵌的社会偏见，引发了伦理问题。本研究针对阿拉伯人与西方人在八个领域（包括妇女权利、恐怖主义和反犹太主义）的偏见进行了分析，并评估了模型抵抗这些偏见的能力。为此，我们创建了两个数据集：一个用于评估大语言模型对阿拉伯人与西方人的偏见，另一个用于测试模型对夸大负面特质提示（“越狱”）的安全性。我们评估了六种大语言模型——GPT-4、GPT-4o、LlaMA 3.1（8B 和 405B）、Mistral 7B 和 Claude 3.5 Sonnet。研究发现，79% 的案例显示出对阿拉伯人的负面偏见，其中 LlaMA 3.1-405B 的偏见最为严重。在越狱测试中，尽管 GPT-4o 是优化版本，但其脆弱性最高，其次是 LlaMA 3.1-8B 和 Mistral 7B。除 Claude 外，所有大语言模型在三个类别中的攻击成功率均超过 87%。我们还发现 Claude 3.5 Sonnet 是最安全的，但在八个类别中仍有七个显示出偏见。尽管 GPT-4o 是 GPT-4 的优化版本，我们发现其更易受偏见和越狱影响，这表明存在优化缺陷。我们的研究结果强调了在大语言模型中迫切需要更强大的偏见缓解策略和加强的安全措施。

[NLP-17] Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

【速读】：该论文试图解决大型语言模型（LLM）在处理用户模糊查询或缺乏足够上下文信息时难以提供个性化支持的问题。解决方案的关键在于引入了一个名为“协同个性化探索助手（CARE）”的系统，该系统通过结合多代理LLM框架和结构化用户界面来增强个性化探索任务的支持。CARE的界面包括聊天面板、解决方案面板和需求面板，支持迭代查询细化和动态解决方案生成。多代理框架协作识别用户的显性和隐性需求，提供定制化的、可操作的解决方案。在用户研究中，CARE被证明能够显著减少认知负荷、激发创造力，并提供更个性化的解决方案，从而被用户普遍偏好。

链接: https://arxiv.org/abs/2410.24032
作者: Yingzhe Peng,Xiaoting Qin,Zhiyang Zhang,Jue Zhang,Qingwei Lin,Xu Yang,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
关键词-EN: large language models, synthesize vast amounts, revolutionized user interactions, language models, assist with complex
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of large language models (LLMs) has revolutionized user interactions with knowledge-based systems, enabling chatbots to synthesize vast amounts of information and assist with complex, exploratory tasks. However, LLM-based chatbots often struggle to provide personalized support, particularly when users start with vague queries or lack sufficient contextual information. This paper introduces the Collaborative Assistant for Personalized Exploration (CARE), a system designed to enhance personalization in exploratory tasks by combining a multi-agent LLM framework with a structured user interface. CARE’s interface consists of a Chat Panel, Solution Panel, and Needs Panel, enabling iterative query refinement and dynamic solution generation. The multi-agent framework collaborates to identify both explicit and implicit user needs, delivering tailored, actionable solutions. In a within-subject user study with 22 participants, CARE was consistently preferred over a baseline LLM chatbot, with users praising its ability to reduce cognitive load, inspire creativity, and provide more tailored solutions. Our findings highlight CARE’s potential to transform LLM-based systems from passive information retrievers to proactive partners in personalized problem-solving and exploration.
摘要：大语言模型 (LLM) 的兴起彻底改变了用户与基于知识的系统之间的互动方式，使聊天机器人能够综合大量信息并协助完成复杂的探索性任务。然而，基于 LLM 的聊天机器人往往难以提供个性化的支持，尤其是在用户开始时提出模糊查询或缺乏足够上下文信息的情况下。本文介绍了个性化探索协作助手 (CARE)，这是一个旨在通过结合多智能体 LLM 框架与结构化用户界面来增强探索性任务中个性化支持的系统。CARE 的界面由聊天面板、解决方案面板和需求面板组成，支持迭代查询细化与动态解决方案生成。多智能体框架协作识别用户的显性和隐性需求，提供量身定制的可操作解决方案。在一个包含 22 名参与者的同主题用户研究中，CARE 始终优于基线 LLM 聊天机器人，用户称赞其能够减轻认知负担、激发创造力并提供更个性化的解决方案。我们的研究结果突显了 CARE 将基于 LLM 的系统从被动信息检索者转变为个性化问题解决和探索中的主动合作伙伴的潜力。

[NLP-18] Joint Training for Selective Prediction

【速读】：该论文试图解决在自然语言处理(NLP)中，如何通过人机协作系统提高分类器模型的性能和可信度的问题。解决方案的关键在于引入了一种新颖的联合训练方法(joint-training approach)，该方法同时优化分类器模块的学得表示和学得的延迟策略(deferral policy)。通过这种联合训练，不仅在四个分类任务上实现了比两个强基线更好的选择性预测(Selective Prediction, SP)结果，还提升了两个模块的整体性能。

链接: https://arxiv.org/abs/2410.24029
作者: Zhaohui Li,Rebecca J. Passonneau
关键词-EN: natural language processing, language processing, high accuracy, prevalent in natural, natural language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Classifier models are prevalent in natural language processing (NLP), often with high accuracy. Yet in real world settings, human-in-the-loop systems can foster trust in model outputs and even higher performance. Selective Prediction (SP) methods determine when to adopt a classifier’s output versus defer to a human. Previous SP approaches have addressed how to improve softmax as a measure of model confidence, or have developed separate confidence estimators. One previous method involves learning a deferral model based on engineered features. We introduce a novel joint-training approach that simultaneously optimizes learned representations used by the classifier module and a learned deferral policy. Our results on four classification tasks demonstrate that joint training not only leads to better SP outcomes over two strong baselines, but also improves the performance of both modules.
摘要：分类器模型在自然语言处理 (NLP) 中广泛应用，通常具有较高的准确性。然而，在实际应用中，人机协作系统可以增强对模型输出的信任，甚至进一步提升性能。选择性预测 (Selective Prediction, SP) 方法决定何时采用分类器的输出，何时将决策权交给人类。先前的 SP 方法主要关注如何改进 softmax 作为模型置信度的度量，或开发独立的置信度估计器。一种先前的研究方法涉及基于工程特征学习一个延迟决策模型。我们提出了一种新颖的联合训练方法，该方法同时优化分类器模块使用的学习表示和学习到的延迟策略。我们在四个分类任务上的实验结果表明，联合训练不仅在两个强基线上实现了更好的 SP 结果，还提升了两个模块的性能。

[NLP-19] Detecting text level intellectual influence with knowledge graph embeddings

【速读】：该论文试图解决如何通过知识图谱（Knowledge Graph）来追踪和预测学术文献之间的引用关系，从而揭示学术思想和影响力的传播。解决方案的关键在于利用生成式预训练语言模型（Gemini LLM）生成知识图谱表示，并结合图神经网络（Graph Neural Network）嵌入模型来预测文献对之间的引用关系。实验结果表明，该方法在区分有引用和无引用的文献对方面表现优异，且训练后运行效率高，能够针对特定语料库进行微调以满足不同研究者的需求。

链接: https://arxiv.org/abs/2410.24021
作者: Lucian Li,Eryclis Silva
关键词-EN: computational social science, Tracing the spread, Graph Neural Network, social science, Neural Network based
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Introduction: Tracing the spread of ideas and the presence of influence is a question of special importance across a wide range of disciplines, ranging from intellectual history to cultural analytics, computational social science, and the science of science. Method: We collect a corpus of open source journal articles, generate Knowledge Graph representations using the Gemini LLM, and attempt to predict the existence of citations between sampled pairs of articles using previously published methods and a novel Graph Neural Network based embedding model. Results: We demonstrate that our knowledge graph embedding method is superior at distinguishing pairs of articles with and without citation. Once trained, it runs efficiently and can be fine-tuned on specific corpora to suit individual researcher needs. Conclusion(s): This experiment demonstrates that the relationships encoded in a knowledge graph, especially the types of concepts brought together by specific relations can encode information capable of revealing intellectual influence. This suggests that further work in analyzing document level knowledge graphs to understand latent structures could provide valuable insights. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2410.24021 [cs.CL] (or arXiv:2410.24021v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.24021 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lucian Li [view email] [v1] Thu, 31 Oct 2024 15:21:27 UTC (485 KB)
摘要：

追踪思想的传播及其影响力的存在是跨学科研究中的一个特别重要的问题，涉及从知识史到文化分析、计算社会科学以及科学学等多个领域。我们收集了一个开放获取期刊文章的语料库，使用 Gemini 大语言模型生成知识图谱表示，并尝试使用已发表的方法和一种基于图神经网络的新嵌入模型来预测采样文章对之间是否存在引用关系。实验结果表明，我们的知识图谱嵌入方法在区分有引用和无引用的文章对方面表现更优。一旦训练完成，该模型运行高效，并且可以根据特定语料库进行微调，以满足个别研究者的需求。这一实验表明，知识图谱中编码的关系，特别是特定关系所结合的概念类型，能够编码揭示知识影响力的信息。这表明，进一步分析文档级知识图谱以理解潜在结构可能会提供有价值的见解。

主题：计算与语言 (cs.CL)
引用方式：arXiv:2410.24021 [cs.CL]（或 arXiv:2410.24021v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.24021
通过 arXiv 发布的 DOI 通过 DataCite（待注册）
提交历史：从 Lucian Li [查看电子邮件]
[v1] 2024年10月31日 15:21:27 UTC (485 KB)

[NLP-20] Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?

【速读】：该论文试图解决在语音到文本翻译（S2TT）系统中评估韵律感知能力的问题。解决方案的关键在于引入了一种新的评估方法和一个专门的基准测试（ContraProST），通过使用大型语言模型和可控的文本到语音（TTS）技术生成对比示例，来捕捉广泛的韵律现象。实验结果表明，尽管S2TT模型内部具有一定的韵律表示，但韵律信号通常不足以影响翻译结果；端到端（E2E）系统在韵律感知翻译方面优于级联的语音识别和文本翻译系统，证实了其在理论上的优势；某些级联系统也能捕捉到韵律信息，但其效果依赖于转录文本的表面形式。

链接: https://arxiv.org/abs/2410.24019
作者: Ioannis Tsiamas,Matthias Sperber,Andrew Finch,Sarthak Garg
关键词-EN: intonation and rhythm, spoken utterance, including features, features like stress, underlying semantics
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: WMT 2024

点击查看摘要

Abstract:The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and © certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript’s surface form.
摘要：口语表达的韵律，包括重音、语调和节奏等特征，可以显著影响其底层语义，从而也会影响其文本翻译。然而，韵律在语音到文本翻译 (S2TT) 系统中的研究却很少。特别是，端到端 (E2E) 系统被认为是适合韵律感知的翻译，因为它们在做出翻译决策时可以直接访问语音信号，但这种做法在实际中的成功程度仍有限。一个主要挑战是评估翻译中的韵律感知难度。为了应对这一挑战，我们引入了一种评估方法和一个专注于捕捉广泛韵律现象的基准测试（名为 ContraProST）。我们的方法利用大语言模型和可控文本到语音 (TTS) 生成对比示例。通过将英语语音翻译成德语、西班牙语和日语的实验，我们发现：(a) S2TT 模型具有一定的韵律内部表示，但韵律信号往往不足以影响翻译；(b) E2E 系统优于语音识别和文本翻译系统的级联，证实了它们在这方面的理论优势；© 某些级联系统也能在翻译中捕捉到韵律信息，但仅限于依赖于转录文本表面形式的较小程度。

[NLP-21] Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

【速读】：该论文试图解决非英语语言在多语言大型语言模型（LLMs）预训练中表现不佳的问题，主要原因是多语言预训练语料库的质量和多样性不足。解决方案的关键在于利用高质量的单一源语言（如英语）的机器翻译文本，来显著提升多语言LLMs的预训练效果。具体做法是将高质量的英语网络数据集FineWeb-Edu翻译成法语、德语和西班牙语，构建了一个包含300亿标记的数据集TransWeb-Edu，并在此基础上从头开始预训练了一个13亿参数的模型CuatroLLM。实验结果表明，尽管使用的数据量远少于现有模型（如Llama3.2和Gemma2），CuatroLLM在五个非英语推理任务中仍能匹配或超越这些模型的表现。此外，通过额外的领域特定预训练，CuatroLLM在多语言推理任务中进一步超越了现有技术水平。

链接: https://arxiv.org/abs/2410.23956
作者: Jiayi Wang,Yao Lu,Maurice Weber,Max Ryabinin,Yihong Chen,Raphael Tang,Pontus Stenetorp
关键词-EN: high-quality large language, large language models, high-resource language, large language, pretraining
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data, such as about 6% of the tokens used for Llama3.2’s training. We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at this http URL.
摘要：英语作为一种资源非常丰富的语言，使得高质量大语言模型（LLMs）的预训练成为可能。然而，对于大多数其他语言来说，情况并非如此，因为领先的大语言模型在非英语语言上的表现仍然不尽如人意，这可能是因为可用的多语言预训练语料库在质量和多样性上存在差距。在本研究中，我们发现，从单一高质量源语言机器翻译的文本可以显著促进多语言大语言模型的预训练。我们将高质量的英语网络数据集FineWeb-Edu翻译成法语、德语和西班牙语，最终形成了一个包含3000亿Token的数据集，我们称之为TransWeb-Edu，并在此数据集上从头开始预训练了一个13亿参数的模型，命名为CuatroLLM。在五个非英语推理任务中，我们展示了CuatroLLM与使用封闭数据训练的最先进多语言模型（如Llama3.2和Gemma2）相匹配或超越，尽管使用的数据量仅为Llama3.2训练数据的约6%。我们进一步证明，通过额外的领域特定预训练，仅占TransWeb-Edu不到1%的数据量，CuatroLLM在多语言推理方面超越了当前的最先进水平。为了促进可重复性，我们在该http URL下以开放许可形式发布了我们的语料库、模型和训练流程。

[NLP-22] Representative Social Choice: From Learning Theory to AI Alignment NEURIPS2024

【速读】：该论文试图解决在社会选择理论中，当问题和个体数量过多时，直接考虑所有偏好不切实际的问题。解决方案的关键在于提出代表性社会选择框架（representative social choice framework），通过有限样本的个体-问题对来代表整个群体，从而进行社会选择决策。论文展示了如何将代表性社会选择中的许多深层问题自然地表述为统计学习问题，并利用机器学习理论证明了社会选择机制的泛化性质。此外，论文还提出了代表性社会选择的公理，并使用新的组合分析工具证明了类似阿罗不可能定理的结果。这一框架为社会选择、学习理论和AI对齐领域的交叉研究开辟了新的方向。

链接: https://arxiv.org/abs/2410.23953
作者: Tianyi Qiu
关键词-EN: Social choice, representative social choice, representative social, Social, choice
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注: Full version (20 pages). Under review. An excerpt was previously accepted to NeurIPS 2024 Pluralistic Alignment Workshop

点击查看摘要

Abstract:Social choice theory is the study of preference aggregation across a population, used both in mechanism design for human agents and in the democratic alignment of language models. In this study, we propose the representative social choice framework for the modeling of democratic representation in collective decisions, where the number of issues and individuals are too large for mechanisms to consider all preferences directly. These scenarios are widespread in real-world decision-making processes, such as jury trials, indirect elections, legislation processes, corporate governance, and, more recently, language model alignment. In representative social choice, the population is represented by a finite sample of individual-issue pairs based on which social choice decisions are made. We show that many of the deepest questions in representative social choice can be naturally formulated as statistical learning problems, and prove the generalization properties of social choice mechanisms using the theory of machine learning. We further formulate axioms for representative social choice, and prove Arrow-like impossibility theorems with new combinatorial tools of analysis. Our framework introduces the representative approach to social choice, opening up research directions at the intersection of social choice, learning theory, and AI alignment.
摘要：社会选择理论研究的是在人群中进行偏好聚合的问题，这一理论既用于人类智能体的机制设计，也用于大语言模型的民主对齐。在本研究中，我们提出了代表性社会选择框架，用于建模集体决策中的民主代表性，其中问题和个体的数量过大，以至于机制无法直接考虑所有偏好。这类场景在现实世界的决策过程中广泛存在，如陪审团审判、间接选举、立法过程、公司治理，以及最近的大语言模型对齐。在代表性社会选择中，人口由基于个体-问题对的有限样本代表，社会选择决策基于此样本作出。我们展示了代表性社会选择中的许多深刻问题可以自然地表述为统计学习问题，并利用机器学习理论证明了社会选择机制的泛化性质。我们进一步为代表性社会选择制定了公理，并使用新的组合分析工具证明了类似Arrow的不可能性定理。我们的框架引入了代表性社会选择方法，开辟了社会选择、学习理论和AI对齐交叉领域的研究方向。

[NLP-23] Language Models can Self-Lengthen to Generate Long Texts

【速读】：该论文试图解决大语言模型（LLMs）在生成长且对齐的输出方面存在的显著差距，这一问题源于预训练阶段缺乏有效的长文本生成指令，以及后训练数据主要由短查询-响应对组成。解决方案的关键在于引入了一种创新的迭代训练框架，称为Self-Lengthen，该框架利用LLMs自身的内在知识和技能，无需辅助数据或专有模型。框架包括两个角色：生成器（Generator）和扩展器（Extender）。生成器生成初始响应，扩展器将其分割并扩展，生成新的更长的响应，用于迭代训练生成器和扩展器。通过这一过程，模型逐步训练以处理越来越长的响应，实验结果表明，Self-Lengthen在长文本生成方面优于现有方法，特别是在应用于Qwen2和LLaMA3等顶级开源LLMs时。

链接: https://arxiv.org/abs/2410.23933
作者: Shanghaoran Quan,Tianyi Tang,Bowen Yu,An Yang,Dayiheng Liu,Bofei Gao,Jianhong Tu,Yichang Zhang,Jingren Zhou,Junyang Lin
关键词-EN: Large Language Models, Large Language, notable gap remains, Recent advancements, process long contexts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation, when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at this https URL.
摘要：近年来，大语言模型 (LLM) 在处理长上下文方面取得了显著进展，但在生成与输入对齐的长文本输出方面仍存在明显差距。这一限制源于训练过程中的缺陷，即预训练缺乏有效的长文本生成指导，而后训练数据主要由短查询-响应对组成。当前的方法，如指令回译和行为模仿，面临着数据质量、版权问题以及对专有模型使用的限制等挑战。本文提出了一种创新的迭代训练框架，称为自延长 (Self-Lengthen)，该框架仅利用 LLM 的内在知识和技能，无需辅助数据或专有模型。该框架包含两个角色：生成器 (Generator) 和扩展器 (Extender)。生成器生成初始响应，然后由扩展器分割和扩展。这一过程产生新的、更长的响应，用于迭代训练生成器和扩展器。通过这一过程，模型逐步训练以处理越来越长的响应。在基准测试和人工评估中，当应用于 Qwen2 和 LLaMA3 等顶级开源 LLM 时，自延长在长文本生成方面优于现有方法。我们的代码已公开发布，详见此 https URL。

[NLP-24] BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

【速读】：该论文试图解决大型语言模型（LLMs）在本地设备上部署时面临的内存限制问题。解决方案的关键在于提出了一种名为 BitStack 的训练无关权重压缩方法，该方法通过权重分解实现内存使用与模型性能之间的兆字节级权衡。BitStack 的核心创新在于动态调整模型大小，通过迭代分解权重矩阵并考虑每个参数的重要性，生成每个分解迭代中约1比特的残差块。这些残差块作为基本传输单元存储在存储设备中，并根据当前内存可用性加载不同数量的块。实验结果表明，尽管提供了细粒度的尺寸控制，BitStack 在极端压缩比下仍能匹配或超越强量化基线，从而有效填补了分解方法与实际压缩技术之间的差距。

链接: https://arxiv.org/abs/2410.23918
作者: Xinghao Wang,Pengyu Wang,Bo Wang,Dong Zhang,Yunhua Zhou,Xipeng Qiu
关键词-EN: revolutionized numerous applications, Large language models, Large language, deployment remains challenged, numerous applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textitcapability to \textitavailability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbfBitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at this https URL.
摘要：大语言模型（LLMs）在众多应用中取得了革命性的进展，但其部署仍受到本地设备内存限制的挑战。尽管缩放定律提升了大语言模型的能力，但主要的瓶颈已从能力转向可用性，强调了高效内存管理的必要性。传统的压缩方法，如量化，通常需要预定义的压缩比率，并为每种设置进行单独的压缩过程，这使得在可变内存环境中部署变得复杂。本文中，我们提出了 BitStack，一种新颖的、无需训练的权重压缩方法，能够在内存使用和模型性能之间实现兆字节级别的权衡。通过利用权重分解，BitStack 可以动态调整模型大小，同时最小化运行内存与存储设备之间的传输。我们的方法在考虑每个参数重要性的基础上，迭代地分解权重矩阵，在每次分解迭代中产生约每个参数 1 比特的残差块。这些块在存储中按顺序堆叠，并根据当前内存可用性加载不同数量的块。在广泛的任务范围内进行的广泛实验表明，尽管提供了细粒度的尺寸控制，BitStack 仍然能够持续匹配或超越强量化基线，特别是在极端压缩比率下。据我们所知，这是首个有效弥合分解方法与量化等实用压缩技术之间差距的方法。代码可在以下链接获取：https URL。

[NLP-25] Responsible Retrieval Augmented Generation for Climate Decision Making from Documents

【速读】：该论文试图解决气候决策过程中因复杂、技术性强且多语言的文档导致的关键信息难以获取的问题。解决方案的关键在于引入了一种针对气候相关文档的领域特定评估框架，并将其应用于评估检索增强生成 (Retrieval-Augmented Generation, RAG) 方法。该框架旨在解决生成式 AI 技术在处理此类文档时的三大局限：信息幻觉、生成输出的不可控性以及在特定技术领域的性能下降。通过这一评估框架，论文不仅评估了 RAG 方法在检索和生成质量上的表现，还发布了一个人工标注的数据集和可扩展的自动化评估工具，以促进这些系统在气候领域的广泛应用和稳健评估。此外，研究还强调了在部署 RAG 系统以增强决策时需要考虑的关键因素，以及在高风险领域中构建用户信任的用户体验 (UX) 考虑。

链接: https://arxiv.org/abs/2410.23902
作者: Matyas Juhasz,Kalyan Dutia,Henry Franks,Conor Delahunty,Patrick Fawbert Mills,Harrison Pim
关键词-EN: Climate decision making, decision making, making is constrained, complexity and inaccessibility, multi-lingual documents
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Climate decision making is constrained by the complexity and inaccessibility of key information within lengthy, technical, and multi-lingual documents. Generative AI technologies offer a promising route for improving the accessibility of information contained within these documents, but suffer from limitations. These include (1) a tendency to hallucinate or mis-represent information, (2) difficulty in steering or guaranteeing properties of generated output, and (3) reduced performance in specific technical domains. To address these challenges, we introduce a novel evaluation framework with domain-specific dimensions tailored for climate-related documents. We then apply this framework to evaluate Retrieval-Augmented Generation (RAG) approaches and assess retrieval- and generation-quality within a prototype tool that answers questions about individual climate law and policy documents. In addition, we publish a human-annotated dataset and scalable automated evaluation tools, with the aim of facilitating broader adoption and robust assessment of these systems in the climate domain. Our findings highlight the key components of responsible deployment of RAG to enhance decision-making, while also providing insights into user experience (UX) considerations for safely deploying such systems to build trust with users in high-risk domains.
摘要：气候决策受到复杂且难以获取的关键信息限制，这些信息通常存在于冗长、技术性强且多语言的文档中。生成式 AI 技术提供了一条改善这些文档中信息可访问性的有前景的途径，但存在一些局限性。这些局限包括：(1) 倾向于产生幻觉或错误信息，(2) 难以控制或保证生成输出的特性，(3) 在特定技术领域的表现下降。为了应对这些挑战，我们引入了一种新颖的评估框架，该框架具有针对气候相关文档的特定领域维度。随后，我们将此框架应用于评估检索增强生成 (RAG) 方法，并在一个原型工具中评估检索和生成质量，该工具用于回答关于个别气候法律和政策文档的问题。此外，我们发布了一个人工标注的数据集和可扩展的自动化评估工具，旨在促进这些系统在气候领域的更广泛应用和稳健评估。我们的研究结果突出了负责任部署 RAG 以增强决策的关键组成部分，同时提供了关于在高风险领域安全部署此类系统以建立用户信任的用户体验 (UX) 考虑的见解。

[NLP-26] Leveraging LLM s for MT in Crisis Scenarios: a blueprint for low-resource languages

【速读】：该论文试图解决在危机通信背景下，特别是针对低资源语言，如何构建强大且适应性强的机器翻译 (Machine Translation, MT) 系统的问题。解决方案的关键在于利用大型语言模型 (Large Language Models, LLMs) 和多语言大型语言模型 (Multilingual LLMs, MLLMs)，结合微调技术 (fine-tuning techniques) 和社区驱动的语料库开发策略，以应对危机情境下对速度、准确性和多语言处理能力的迫切需求。论文通过开发和评估针对两种低资源语言对的定制化 MT 系统，展示了从模型选择、微调到部署的全过程，并强调了社区参与在创建危机特定数据集中的重要性。研究结果表明，经过微调的多语言大型语言模型在性能上优于单一语言的大型语言模型，为危机场景下快速开发可扩展的 MT 系统提供了可复制的模型。

链接: https://arxiv.org/abs/2410.23890
作者: Séamus Lankford,Andy Way
关键词-EN: adaptable Machine Translation, Machine Translation, adaptable Machine, Large Language Models, leveraging Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2403.02370 , arXiv:2403.01580

点击查看摘要

Abstract:In an evolving landscape of crisis communication, the need for robust and adaptable Machine Translation (MT) systems is more pressing than ever, particularly for low-resource languages. This study presents a comprehensive exploration of leveraging Large Language Models (LLMs) and Multilingual LLMs (MLLMs) to enhance MT capabilities in such scenarios. By focusing on the unique challenges posed by crisis situations where speed, accuracy, and the ability to handle a wide range of languages are paramount, this research outlines a novel approach that combines the cutting-edge capabilities of LLMs with fine-tuning techniques and community-driven corpus development strategies. At the core of this study is the development and empirical evaluation of MT systems tailored for two low-resource language pairs, illustrating the process from initial model selection and fine-tuning through to deployment. Bespoke systems are developed and modelled on the recent Covid-19 pandemic. The research highlights the importance of community involvement in creating highly specialised, crisis-specific datasets and compares custom GPTs with NLLB-adapted MLLM models. It identifies fine-tuned MLLM models as offering superior performance compared with their LLM counterparts. A scalable and replicable model for rapid MT system development in crisis scenarios is outlined. Our approach enhances the field of humanitarian technology by offering a blueprint for developing multilingual communication systems during emergencies.
摘要：在危机沟通不断演变的背景下，构建稳健且适应性强的机器翻译 (Machine Translation, MT) 系统的需求比以往任何时候都更为迫切，尤其是在资源匮乏的语言环境中。本研究全面探讨了如何利用大语言模型 (Large Language Models, LLMs) 和多语言大语言模型 (Multilingual LLMs, MLLMs) 来提升此类场景下的机器翻译能力。通过聚焦危机情境中速度、准确性以及处理多种语言能力的关键挑战，本研究提出了一种结合 LLMs 前沿能力与微调技术及社区驱动语料库开发策略的创新方法。研究的核心在于针对两种资源匮乏的语言对开发并实证评估定制化的机器翻译系统，展示了从初始模型选择、微调到部署的全过程。这些定制系统以最近的 Covid-19 大流行为模型进行开发。研究强调了社区参与在创建高度专业化、危机特定数据集中的重要性，并比较了定制 GPT 与 NLLB 适配的多语言大语言模型。研究发现，微调后的多语言大语言模型相较于其大语言模型版本表现出更优越的性能。本研究还概述了一种可扩展且可复制的模型，用于在危机场景中快速开发机器翻译系统。我们的方法通过提供在紧急情况下开发多语言沟通系统的蓝图，增强了人道主义技术领域。

[NLP-27] Failure Modes of LLM s for Causal Reasoning on Narratives

【速读】：该论文试图解决大型语言模型（LLMs）在因果推理能力上的局限性问题，特别是从叙述中推断因果关系的能力。研究发现，即使是先进的LLMs也依赖于不可靠的捷径，如事件的拓扑排序（即较早的事件导致较晚的事件），导致在事件顺序不符合因果顺序时表现下降。此外，LLMs在处理长篇叙述和复杂事件时也表现不佳，并且过度依赖模型参数知识而非叙述内容进行推理。解决方案的关键在于通过显式生成因果图来提升性能，而简单的链式思维方法则效果不佳。这些发现为未来提升LLMs因果推理能力的技术发展指明了方向。

链接: https://arxiv.org/abs/2410.23884
作者: Khurram Yamin,Shantanu Gupta,Gaurav R. Ghosal,Zachary C. Lipton,Bryan Wilder
关键词-EN: large language models, representative problem, problem of inferring, language models, inferring causal relationships
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.
摘要：在本研究中，我们通过从叙述中推断因果关系的典型问题，探讨了大语言模型 (LLM) 的因果推理能力。我们发现，即使是目前最先进的语言模型也依赖于不可靠的捷径，无论是在叙述的呈现方式还是其参数化知识方面。例如，LLM 倾向于根据事件的拓扑排序来确定因果关系（即，较早的事件导致较晚的事件），当事件未按其确切的因果顺序叙述时，其表现会显著下降。同样，我们证明 LLM 在长期因果推理方面存在困难，当叙述较长且包含多个事件时，它们往往失败。此外，我们展示 LLM 似乎过度依赖其参数化知识，而忽视了对所提供叙述的推理，这导致当叙述与参数化知识相悖时，其能力显著下降。我们通过精心控制的合成实验以及对现实世界叙述的评估，广泛验证了这些失败模式。最后，我们观察到，显式生成因果图通常能提高性能，而简单的思维链方法则无效。总体而言，我们的研究结果精确提炼了当前最先进模型的失败模式，并为未来提升 LLM 因果推理能力的技术铺平了道路。

[NLP-28] No Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

【速读】：该论文旨在解决多模态情境下的分布外检测（Out-of-distribution, OOD）问题，特别是在开放领域对话系统或现实对话交互中，识别对话和图像输入组合中的异常情况。解决方案的关键是引入了一种名为对话图像对齐增强框架（Dialogue Image Aligning and Enhancing Framework, DIAEF）的新评分框架，该框架结合了视觉语言模型和创新提出的评分方法，能够有效检测两种关键场景中的OOD：（1）对话与图像输入对之间的不匹配；（2）包含先前未见标签的输入对。实验结果表明，与单独使用任一模态相比，集成图像和多轮对话的OOD检测在处理先前未见标签时更为有效，并且在存在不匹配对的情况下，所提出的评分方法能够有效识别这些不匹配，并在长对话中表现出强大的鲁棒性。

链接: https://arxiv.org/abs/2410.23883
作者: Rena Gao,Xuetong Wu,Siwen Luo,Caren Han,Feng Liu
关键词-EN: real-life dialogue interactions, open-domain dialogue systems, multimodal contexts, contexts is essential, essential for identifying
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in combined inputs from different modalities, particularly in applications like open-domain dialogue systems or real-life dialogue interactions. This paper aims to improve the user experience that involves multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework named Dialogue Image Aligning and Enhancing Framework (DIAEF) that integrates the visual language models with the novel proposed scores that detect OOD in two key scenarios (1) mismatches between the dialogue and image input pair and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that integrating image and multi-round dialogue OOD detection is more effective with previously unseen labels than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies these mismatches and demonstrates strong robustness in long dialogues. This approach enhances domain-aware, adaptive conversational agents and establishes baselines for future studies.
摘要：在多模态环境中进行分布外（Out-of-distribution, OOD）检测对于识别来自不同模态的组合输入中的偏差至关重要，特别是在开放域对话系统或现实对话交互等应用中。本文旨在通过高效检测OOD对话和图像来提升涉及多轮长对话的用户体验。我们引入了一种名为对话图像对齐与增强框架（Dialogue Image Aligning and Enhancing Framework, DIAEF）的新型评分框架，该框架集成了视觉语言模型与新颖的评分方法，用于检测两种关键场景中的OOD情况：（1）对话与图像输入对之间的不匹配；（2）包含先前未见标签的输入对。我们的实验结果基于多种基准测试，表明在处理先前未见标签时，集成图像和多轮对话的OOD检测比单独使用任一模态更为有效。在存在不匹配对的情况下，我们提出的评分方法能够有效识别这些不匹配，并在长对话中表现出强大的鲁棒性。这种方法增强了领域感知和自适应对话智能体的能力，并为未来的研究奠定了基准。

[NLP-29] Audio Is the Achilles Heel: Red Teaming Audio Large Multimodal Models

【速读】：该论文试图解决多模态大模型（Large Multimodal Models, LMMs）在处理音频输入时的安全性问题。解决方案的关键在于全面测试五种先进的音频LMMs在三种不同设置下的安全性，包括有害问题在音频和文本格式中的表现、有害问题在文本格式中伴随非语音音频干扰的情况，以及针对语音的特定越狱攻击。研究结果表明，开源音频LMMs在有害音频问题上的平均攻击成功率为69.14%，并且在非语音音频干扰下表现出安全漏洞。针对Gemini-1.5-Pro的语音特定越狱攻击在有害查询基准上的攻击成功率为70.67%。研究提供了关于这些安全错位可能原因的见解。

链接: https://arxiv.org/abs/2410.23861
作者: Hao Yang,Lizhen Qu,Ehsan Shareghi,Gholamreza Haffari
关键词-EN: Large Language Models, combining Large Language, Large Multimodal Models, Large Language, align multimodal information
类目: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have demonstrated the ability to interact with humans under real-world conditions by combining Large Language Models (LLMs) and modality encoders to align multimodal information (visual and auditory) with text. However, such models raise new safety challenges of whether models that are safety-aligned on text also exhibit consistent safeguards for multimodal inputs. Despite recent safety-alignment research on vision LMMs, the safety of audio LMMs remains under-explored. In this work, we comprehensively red team the safety of five advanced audio LMMs under three settings: (i) harmful questions in both audio and text formats, (ii) harmful questions in text format accompanied by distracting non-speech audio, and (iii) speech-specific jailbreaks. Our results under these settings demonstrate that open-source audio LMMs suffer an average attack success rate of 69.14% on harmful audio questions, and exhibit safety vulnerabilities when distracted with non-speech audio noise. Our speech-specific jailbreaks on Gemini-1.5-Pro achieve an attack success rate of 70.67% on the harmful query benchmark. We provide insights on what could cause these reported safety-misalignments. Warning: this paper contains offensive examples.
摘要：大型多模态模型 (Large Multimodal Models, LMMs) 通过结合大语言模型 (Large Language Models, LLMs) 和模态编码器，展示了在真实世界条件下与人类互动的能力，能够将多模态信息（视觉和听觉）与文本对齐。然而，这类模型也带来了新的安全挑战，即在文本上安全对齐的模型是否也能对多模态输入保持一致的安全防护。尽管近期在视觉 LMMs 的安全对齐研究方面取得了进展，但音频 LMMs 的安全性仍未得到充分探索。在本研究中，我们对五种先进的音频 LMMs 在三种设置下进行了全面的安全测试：(i) 有害问题以音频和文本两种格式呈现，(ii) 有害问题以文本格式呈现，并伴随有干扰性的非语音音频，(iii) 针对语音的特定越狱攻击。我们的测试结果表明，开源音频 LMMs 在有害音频问题上的平均攻击成功率为 69.14%，并且在受到非语音音频噪声干扰时表现出安全漏洞。我们对 Gemini-1.5-Pro 进行的语音特定越狱攻击在有害查询基准上的攻击成功率达到 70.67%。我们提供了关于可能导致这些报告的安全对齐问题的见解。警告：本文包含冒犯性示例。

[NLP-30] Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales? NEURIPS2024

【速读】：该论文试图解决大型语言模型（LLMs）在面对带有噪声的推理链（chain-of-thought prompting with noisy rationales）时的鲁棒性问题。具体来说，论文关注的是在上下文学习中使用的示例中包含无关或不准确的推理思路时，LLMs的表现。解决方案的关键在于提出了一种对比去噪方法（contrastive denoising with noisy chain-of-thought, CD-CoT），通过在输入空间中重新表述和选择推理链以实现显式去噪，并在输出空间中探索多样化的推理路径并通过投票选择答案，从而增强LLMs的去噪推理能力。实验结果显示，CD-CoT在准确性上比基础模型平均提高了17.8%，并表现出比基线方法更强的去噪能力。

链接: https://arxiv.org/abs/2410.23856
作者: Zhanke Zhou,Rong Tao,Jianing Zhu,Yiwen Luo,Zengmao Wang,Bo Han
关键词-EN: paper investigates, investigates an under-explored, noisy rationales, large language models, in-context learning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, which include irrelevant or inaccurate reasoning thoughts within examples used for in-context learning. We construct NoRa dataset that is tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods like self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, base LLM drops by 1.4%-19.8% in accuracy with irrelevant thoughts and more drastically by 2.2%-40.4% with inaccurate thoughts. Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods. The source code is publicly available at: this https URL. Comments: Accepted by NeurIPS 2024 Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2410.23856 [cs.CL] (or arXiv:2410.23856v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.23856 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：本文探讨了大语言模型 (LLM) 中一个未被充分研究的问题：带有噪声推理路径的链式思维提示，即在上下文学习中使用的示例中包含无关或不准确的推理思路。我们构建了 NoRa 数据集，专门用于评估在存在噪声推理路径的情况下推理的鲁棒性。我们在 NoRa 数据集上的研究发现，当前的 LLM 普遍对这种噪声存在脆弱性，现有的鲁棒方法如自我修正和自我一致性显示出有限的效果。值得注意的是，与使用干净推理路径的提示相比，基础 LLM 在包含无关思路时的准确率下降了 1.4%-19.8%，而在包含不准确思路时则更为严重，下降了 2.2%-40.4%。解决这一挑战需要实际可行的外部监督。在此，我们提出了对比去噪与噪声链式思维 (CD-CoT) 的方法。该方法通过对比噪声推理路径与仅一个干净的推理路径，增强了 LLM 的去噪推理能力，这可以作为去噪目的提示的最小要求。该方法遵循探索与利用的原则：(1) 在输入空间中重新表述和选择推理路径以实现显式去噪；(2) 在输出空间中探索多样的推理路径并对答案进行投票。实证结果显示，CD-CoT 在准确率上比基础模型平均提高了 17.8%，并显示出比基线方法更强的去噪能力。源代码已公开发布，详见：this https URL。

评论：已被 NeurIPS 2024 接受。
主题：计算与语言 (cs.CL)；机器学习 (cs.LG)
引用方式：arXiv:2410.23856 [cs.CL]（或 arXiv:2410.23856v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.23856
通过 DataCite 发布的 arXiv DOI（待注册）

[NLP-31] he Automated Verification of Textual Claims (AVeriTeC) Shared Task

【速读】：该论文试图解决自动验证文本声明（Automated Verification of Textual Claims, AVeriTeC）的问题，即通过检索证据并预测声明的真实性来验证事实核查员检查的现实世界声明。解决方案的关键在于同时正确预测声明的判决（verdict）和检索到的证据达到一定的质量阈值，这两个条件同时满足才能准确验证声明。评估方法采用AVeriTeC评分，该评分综合考虑了判决的准确性和证据的质量。

链接: https://arxiv.org/abs/2410.23850
作者: Michael Schlichtkrull,Yulong Chen,Chenxi Whitehouse,Zhenyun Deng,Mubashara Akhtar,Rami Aly,Zhijiang Guo,Christos Christodoulopoulos,Oana Cocarascu,Arpit Mittal,James Thorne,Andreas Vlachos
关键词-EN: Automated Verification, Verification of Textual, real-world claims checked, Textual Claims, checked by fact-checkers
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.
摘要：自动验证文本声明（Automated Verification of Textual Claims, AVeriTeC）共享任务要求参与者检索证据并预测事实核查员检查的真实世界声明的真实性。证据可以通过搜索引擎获取，或通过主办方提供的知识库获取。提交的评估使用AVeriTeC评分，该评分认为声明被准确验证当且仅当判决正确且检索到的证据达到一定的质量阈值。该共享任务收到了21份提交，其中18份超过了我们的基线。获胜团队是TUDA_MAI，其AVeriTeC评分为63%。本文描述了共享任务，展示了完整的结果，并强调了共享任务的关键收获。

[NLP-32] Commonsense Knowledge Editing Based on Free-Text in LLM s

【速读】：该论文试图解决常识知识（commonsense knowledge）在大语言模型（LLMs）中的编辑问题，特别是针对自由文本形式的常识知识，其特点是知识范围广、内容长且非实例化。传统方法如MEMIT主要针对单一标记或实体进行编辑，不适用于自由文本形式的常识知识。论文的关键解决方案包括两个方面：知识定位（knowledge localization）和知识编辑（knowledge editing）。首先，提出了自由文本知识定位（Knowledge Localization for Free-Text, KLFT）方法，揭示了常识知识在MLP和Attention层中的分布挑战，特别是分散分布的问题。其次，提出了动态感知编辑方法（Dynamics-aware Editing Method, DEM），通过动态感知模块定位与常识知识对应的参数位置，并使用知识编辑模块进行知识更新。DEM方法充分利用了MLP和Attention层的潜力，成功实现了基于自由文本的常识知识编辑，实验结果表明DEM具有出色的编辑性能。

链接: https://arxiv.org/abs/2410.23844
作者: Xiusheng Huang,Yequan Wang,Jun Zhao,Kang Liu
关键词-EN: large language models, broad knowledge scope, Knowledge editing technology, commonsense knowledge, Knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Knowledge editing technology is crucial for maintaining the accuracy and timeliness of large language models (LLMs) . However, the setting of this task overlooks a significant portion of commonsense knowledge based on free-text in the real world, characterized by broad knowledge scope, long content and non instantiation. The editing objects of previous methods (e.g., MEMIT) were single token or entity, which were not suitable for commonsense knowledge in free-text form. To address the aforementioned challenges, we conducted experiments from two perspectives: knowledge localization and knowledge editing. Firstly, we introduced Knowledge Localization for Free-Text(KLFT) method, revealing the challenges associated with the distribution of commonsense knowledge in MLP and Attention layers, as well as in decentralized distribution. Next, we propose a Dynamics-aware Editing Method(DEM), which utilizes a Dynamics-aware Module to locate the parameter positions corresponding to commonsense knowledge, and uses Knowledge Editing Module to update knowledge. The DEM method fully explores the potential of the MLP and Attention layers, and successfully edits commonsense knowledge based on free-text. The experimental results indicate that the DEM can achieve excellent editing performance.
摘要：知识编辑技术对于维护大语言模型（LLMs）的准确性和时效性至关重要。然而，这一任务的设定忽视了现实世界中基于自由文本的大量常识性知识，这些知识具有广泛的知识范围、长内容和非实例化的特点。以往的方法（如 MEMIT）的编辑对象是单个 Token 或实体，不适用于自由文本形式的常识性知识。为了解决上述挑战，我们从知识定位和知识编辑两个角度进行了实验。首先，我们引入了自由文本知识定位（Knowledge Localization for Free-Text, KLFT）方法，揭示了常识性知识在 MLP 和 Attention 层中的分布挑战，以及分散分布的问题。接着，我们提出了一种动态感知编辑方法（Dynamics-aware Editing Method, DEM），该方法利用动态感知模块定位与常识性知识对应的参数位置，并使用知识编辑模块更新知识。DEM 方法充分挖掘了 MLP 和 Attention 层的潜力，并成功编辑了基于自由文本的常识性知识。实验结果表明，DEM 能够实现出色的编辑性能。

[NLP-33] Reasons and Solutions for the Decline in Model Performance after Editing

【速读】：该论文试图解决在大规模语言模型中进行知识编辑时，编辑后的模型性能下降的问题。解决方案的关键在于从数据和模型两个角度进行深入分析和优化。首先，从数据角度，论文构建了多问题数据集（Multi-Question Dataset, MQD），通过实验发现编辑目标的多样性和序列长度对编辑模型性能有显著影响。其次，从模型角度，论文发现编辑模型层的L1范数与编辑准确性之间存在强相关性，这是导致编辑性能瓶颈的重要因素。基于此，论文提出了一种名为Dump for Sequence (D4S)的方法，通过降低编辑层的L1范数，成功克服了之前的编辑瓶颈，实现了多次有效编辑并最小化了模型损害。

链接: https://arxiv.org/abs/2410.23843
作者: Xiusheng Huang,Jiaxiang Liu,Yequan Wang,Kang Liu
关键词-EN: Knowledge editing technology, received widespread attention, large-scale language models, outdated knowledge, editing
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Knowledge editing technology has received widespread attention for low-cost updates of incorrect or outdated knowledge in large-scale language models. However, recent research has found that edited models often exhibit varying degrees of performance degradation. The reasons behind this phenomenon and potential solutions have not yet been provided. In order to investigate the reasons for the performance decline of the edited model and optimize the editing method, this work explores the underlying reasons from both data and model perspectives. Specifically, 1) from a data perspective, to clarify the impact of data on the performance of editing models, this paper first constructs a Multi-Question Dataset (MQD) to evaluate the impact of different types of editing data on model performance. The performance of the editing model is mainly affected by the diversity of editing targets and sequence length, as determined through experiments. 2) From a model perspective, this article explores the factors that affect the performance of editing models. The results indicate a strong correlation between the L1-norm of the editing model layer and the editing accuracy, and clarify that this is an important factor leading to the bottleneck of editing performance. Finally, in order to improve the performance of the editing model, this paper further proposes a Dump for Sequence (D4S) method, which successfully overcomes the previous editing bottleneck by reducing the L1-norm of the editing layer, allowing users to perform multiple effective edits and minimizing model damage. Our code is available at this https URL.
摘要：知识编辑技术因其能够以低成本更新大规模语言模型中的错误或过时知识而受到广泛关注。然而，近期研究表明，经过编辑的模型往往会出现不同程度的性能下降。目前，这种现象背后的原因及潜在解决方案尚未明确。为了探究编辑模型性能下降的原因并优化编辑方法，本文从数据和模型两个角度深入探讨了其背后的原因。具体而言，1) 从数据角度，为了明确数据对编辑模型性能的影响，本文首先构建了一个多问题数据集（Multi-Question Dataset, MQD），以评估不同类型编辑数据对模型性能的影响。实验结果表明，编辑模型的性能主要受编辑目标的多样性和序列长度的影响。2) 从模型角度，本文探讨了影响编辑模型性能的因素。研究结果显示，编辑模型层的L1-范数与编辑准确性之间存在强相关性，并明确指出这是导致编辑性能瓶颈的重要因素。最后，为了提升编辑模型的性能，本文进一步提出了一种序列转储方法（Dump for Sequence, D4S），通过降低编辑层的L1-范数，成功克服了之前的编辑瓶颈，使用户能够进行多次有效编辑，同时最大限度地减少对模型的损害。我们的代码可在以下链接获取：https URL。

[NLP-34] GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages NEURIPS2024

【速读】：该论文试图解决的问题是现有文本语料库在覆盖少数语言方面的不足，特别是在这些语言的数据量不足、生成过程不透明且难以复现、以及数据质量不可靠的情况下。解决方案的关键在于提出了GlotCC，这是一个经过严格清理的、文档级别的、2TB通用领域语料库，源自CommonCrawl，覆盖了超过1000种语言。GlotCC不仅提供了高质量的数据，还通过开源的方式公开了生成语料库的整个管道、语言识别模型和过滤器，从而确保了其可复现性和透明性，为研究社区提供了可靠的资源。

链接: https://arxiv.org/abs/2410.23825
作者: Amir Hossein Kargaran,François Yvon,Hinrich Schütze
关键词-EN: large text corpora, advent of pretrained, discovery of scaling, scaling laws, large text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NeurIPS 2024

点击查看摘要

Abstract:The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 this https URL, Pipeline v. 3.0 this https URL.
摘要：随着预训练语言模型的出现，特别是这些模型的缩放定律的发现，对大规模文本语料库的需求显著增加。大多数现有的语料库仅适用于拥有庞大主导群体的语言。然而，目前尚无语料库满足以下条件：(i) 覆盖广泛的少数语言；(ii) 由开源可复现的流程生成；(iii) 经过严格去噪处理，确保其可信度。我们提出了 GlotCC，这是一个干净、文档级别的 2TB 通用领域语料库，源自 CommonCrawl，涵盖了超过 1000 种语言。我们将 GlotCC 及其生成系统（包括流程、语言识别模型和过滤器）提供给研究社区。语料库 v. 1.0 可通过此 https URL 获取，流程 v. 3.0 可通过此 https URL 获取。

[NLP-35] What is Wrong with Perplexity for Long-context Language Modeling?

【速读】：该论文试图解决大语言模型 (LLMs) 在处理长上下文输入时，传统困惑度 (PPL) 评估指标无法准确反映模型性能的问题。解决方案的关键在于提出了一个新的评估指标 LongPPL，该指标通过长短期上下文对比方法识别关键标记 (key tokens)，并聚焦于这些关键标记进行评估，从而更准确地反映模型在长上下文场景中的表现。此外，论文还引入了 LongCE (Long-context Cross-Entropy) 损失函数，通过重新加权策略优先处理关键标记，以提升模型在各种长上下文基准测试中的性能。

链接: https://arxiv.org/abs/2410.23771
作者: Lizhe Fang,Yifei Wang,Zhaoyang Liu,Chenheng Zhang,Stefanie Jegelka,Jinyang Gao,Bolin Ding,Yisen Wang
关键词-EN: many-shot in-context learning, Handling long-context inputs, large language models, document summarization, Handling long-context
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose \textbfLongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce \textbfLongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs. Code is available at this https URL.
摘要：在处理长上下文输入方面，大语言模型（LLMs）在扩展对话、文档摘要和多样本上下文学习等任务中至关重要。尽管近期方法已扩展了LLMs的上下文窗口，并采用了困惑度（PPL）作为标准评估指标，但PPL在评估长上下文能力方面已被证明不可靠。这一局限性的根本原因一直未明。在本研究中，我们对此问题进行了全面解释。我们发现，PPL通过平均所有Token的方式，忽略了关键Token，这些Token对于长上下文理解至关重要，从而掩盖了模型在长上下文场景中的真实表现。为解决这一问题，我们提出了LongPPL，这是一种新颖的指标，通过采用长短上下文对比方法来识别关键Token。我们的实验表明，LongPPL与各种长上下文基准测试的表现高度相关（例如，Pearson相关系数为-0.96），显著优于传统的PPL在预测准确性方面的表现。此外，我们引入了LongCE（长上下文交叉熵）损失，这是一种微调中的重新加权策略，优先考虑关键Token，从而在多个基准测试中实现了持续改进。总之，这些贡献深入剖析了PPL的局限性，并提出了准确评估和增强LLMs长上下文能力的有效解决方案。代码可在以下链接获取：https URL。

[NLP-36] he Potential of LLM s in Medical Education: Generating Questions and Answers for Qualification Exams

【速读】：该论文试图解决的问题是如何利用大型语言模型 (LLMs) 在医学教育领域生成高质量的医学资格考试题目及答案。解决方案的关键在于通过少样本提示 (few-shot prompts) 技术，使 LLMs 能够基于实际的老年慢性病数据集生成开放式问题和答案，并通过医学专家的手动评估来验证其正确性、基于证据的陈述和专业性。研究结果表明，尽管 LLMs 在生成考试题目方面表现出色，但在答案的准确性和专业性方面仍有提升空间，同时 LLMs 也展示了纠正和修正参考答案的能力。

链接: https://arxiv.org/abs/2410.23769
作者: Yunqi Zhu,Wen Tang,Ying Sun,Xuebing Yang
关键词-EN: large language models, Recent research, language models, specialized domains, medical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research on large language models (LLMs) has primarily focused on their adaptation and application in specialized domains. The application of LLMs in the medical field is mainly concentrated on tasks such as the automation of medical report generation, summarization, diagnostic reasoning, and question-and-answer interactions between doctors and patients. The challenge of becoming a good teacher is more formidable than that of becoming a good student, and this study pioneers the application of LLMs in the field of medical education. In this work, we investigate the extent to which LLMs can generate medical qualification exam questions and corresponding answers based on few-shot prompts. Utilizing a real-world Chinese dataset of elderly chronic diseases, we tasked the LLMs with generating open-ended questions and answers based on a subset of sampled admission reports across eight widely used LLMs, including ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Llama 3, and Mistral. Furthermore, we engaged medical experts to manually evaluate these open-ended questions and answers across multiple dimensions. The study found that LLMs, after using few-shot prompts, can effectively mimic real-world medical qualification exam questions, whereas there is room for improvement in the correctness, evidence-based statements, and professionalism of the generated answers. Moreover, LLMs also demonstrate a decent level of ability to correct and rectify reference answers. Given the immense potential of artificial intelligence in the medical field, the task of generating questions and answers for medical qualification exams aimed at medical students, interns and residents can be a significant focus of future research.
摘要：近年来，大语言模型 (LLMs) 的研究主要集中在其在特定领域的适应与应用上。LLMs 在医疗领域的应用主要集中在医疗报告生成、摘要、诊断推理以及医患问答等任务上。成为一名优秀的教师比成为一名优秀的学生更具挑战性，而本研究开创了 LLMs 在医学教育领域的应用。在本研究中，我们探讨了 LLMs 在基于少样本提示的情况下生成医学资格考试题目及其对应答案的能力。利用一个真实世界的中国老年人慢性病数据集，我们让 LLMs 生成开放式问题和答案，涉及的模型包括 ERNIE 4、ChatGLM 4、Doubao、Hunyuan、Spark 4、Qwen、Llama 3 和 Mistral。此外，我们还邀请医学专家对这些开放式问题和答案进行多维度的手动评估。研究发现，LLMs 在使用少样本提示后，能够有效地模仿真实世界的医学资格考试题目，但在生成答案的正确性、基于证据的陈述以及专业性方面仍有改进空间。此外，LLMs 还展现出相当程度的纠正和修正参考答案的能力。鉴于人工智能在医疗领域的巨大潜力，为医学生、实习生和住院医师生成医学资格考试题目和答案的任务，可以成为未来研究的重要焦点。

[NLP-37] DetectRL: Benchmarking LLM -Generated Text Detection in Real-World Scenarios NEURIPS2024

【速读】：该论文试图解决现有大型语言模型（LLMs）生成文本检测器在实际应用中的可靠性问题。解决方案的关键在于提出了一个新的基准测试 DetectRL，该基准通过模拟真实世界中复杂的攻击场景，包括高级提示使用、人类修订和写作错误，来评估现有最先进（SOTA）检测技术的性能。通过分析不同检测器在面对不同写作风格、模型类型、攻击方法、文本长度和真实人类写作因素时的表现，DetectRL 揭示了当前检测器的优势和局限性，并为未来检测器的发展提供了更具挑战性的评估标准。

链接: https://arxiv.org/abs/2410.23746
作者: Junchao Wu,Runzhe Zhan,Derek F. Wong,Shu Yang,Xinyi Yang,Yulin Yuan,Lidia S. Chao
关键词-EN: great recent interest, Detecting text generated, large language models, recent interest, Detecting text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2024 Dataset Benchmarking Track

点击查看摘要

Abstract:Detecting text generated by large language models (LLMs) is of great recent interest. With zero-shot methods like DetectGPT, detection capabilities have reached impressive levels. However, the reliability of existing detectors in real-world applications remains underexplored. In this study, we present a new benchmark, DetectRL, highlighting that even state-of-the-art (SOTA) detection techniques still underperformed in this task. We collected human-written datasets from domains where LLMs are particularly prone to misuse. Using popular LLMs, we generated data that better aligns with real-world applications. Unlike previous studies, we employed heuristic rules to create adversarial LLM-generated text, simulating advanced prompt usages, human revisions like word substitutions, and writing errors. Our development of DetectRL reveals the strengths and limitations of current SOTA detectors. More importantly, we analyzed the potential impact of writing styles, model types, attack methods, the text lengths, and real-world human writing factors on different types of detectors. We believe DetectRL could serve as an effective benchmark for assessing detectors in real-world scenarios, evolving with advanced attack methods, thus providing more stressful evaluation to drive the development of more efficient detectors. Data and code are publicly available at: this https URL.
摘要：检测由大语言模型（LLMs）生成的文本近年来引起了广泛关注。通过零样本方法如 DetectGPT，检测能力已达到令人印象深刻的水平。然而，现有检测器在实际应用中的可靠性仍未得到充分探索。在本研究中，我们提出了一个新的基准，DetectRL，强调即使是目前最先进的（SOTA）检测技术在此任务中仍表现不佳。我们收集了来自大语言模型特别容易滥用的领域的人类撰写数据集。使用流行的大语言模型，我们生成了更符合实际应用的数据。与以往的研究不同，我们采用了启发式规则来创建对抗性的大语言模型生成文本，模拟了高级提示用法、人类修订如词语替换以及书写错误。我们的 DetectRL 开发揭示了当前 SOTA 检测器的优势和局限性。更重要的是，我们分析了书写风格、模型类型、攻击方法、文本长度以及现实世界人类书写因素对不同类型检测器的潜在影响。我们相信 DetectRL 可以作为一个有效的基准，用于评估实际场景中的检测器，并随着高级攻击方法的发展而演进，从而提供更具压力的评估，推动更高效检测器的发展。数据和代码公开可用，访问地址为：此 https URL。

[NLP-38] What Happened in LLM s Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

【速读】：该论文试图解决的问题是：在大型语言模型（LLMs）的训练后阶段，不同的训练模式（如快速思考与慢速思考）如何影响模型各层的梯度变化，从而影响模型的学习稳定性和效率。解决方案的关键在于通过分析不同训练模式下的层级梯度变化，揭示了慢速思考（如使用链式思维（Chain-of-Thoughts, CoT）和过程奖励）相较于快速思考能够带来更大的学习稳定性和梯度一致性。此外，研究还发现预训练的LLMs比指令微调的LLMs更能抵抗快速思考带来的不稳定性。通过这些分析，论文为构建更具泛化能力的系统2代理（System-2 agent）提供了新的见解和基础理解。

链接: https://arxiv.org/abs/2410.23743
作者: Ming Li,Yanhong Li,Tianyi Zhou
关键词-EN: slow thinking, thinking, LLMs, fast thinking, slow
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: this https URL.
摘要：大语言模型（LLM）训练后期的差异性体现在哪些方面？我们通过梯度视角，研究了在不同响应和初始模型条件下，LLM不同层的训练模式。特别地，我们关注在当前流行的基于推理路径（如思维链（CoT）和过程奖励）的LLM训练中，快速思考与慢速思考如何影响层级梯度。研究表明，在没有CoT的快速思考中，梯度较大且层间梯度差异较大，而慢速思考（详细CoT）则带来了更高的学习稳定性。此外，预训练的LLM相比指令微调的LLM，受快速思考不稳定性的影响较小。我们还探讨了在慢速与快速思考路径下训练不同LLM时，梯度模式是否能反映响应的正确性。结果表明，慢速思考的梯度能够区分正确与无关的推理路径。作为对比，我们在非推理知识学习任务上进行了类似的梯度分析，发现简单增加响应长度并不能导致类似慢速思考的行为。本研究深化了对LLM训练的基本理解，并为提升其效率和稳定性提供了新视角，为构建可泛化的系统2型AI智能体铺平了道路。相关代码、数据及梯度统计可在以下链接获取：this https URL。

[NLP-39] GigaCheck: Detecting LLM -generated Content

【速读】：该论文试图解决生成式文本检测的问题，特别是区分人类编写的文本与基于大型语言模型（LLM）生成的文本，以及在人机协作文本中检测LLM生成的片段。解决方案的关键在于提出了一种名为GigaCheck的方法，该方法通过两种途径实现目标：(i) 利用通用LLM的语言能力进行微调，以高效地检测LLM生成的文本；(ii) 结合计算机视觉和自然语言处理技术，使用微调的通用LLM与类似DETR的检测模型，定位文本中的人工生成片段。GigaCheck在多个数据集上的评估结果显示，其在区分和定位生成文本方面优于先前的方法，即使在分布外设置下也表现出色。

链接: https://arxiv.org/abs/2410.23728
作者: Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Aleksandr Gordeev,Vladimir Dokholyan,Maksim Kuprashevich
关键词-EN: LLM-based assistants, growing rapidly, spread of LLM-based, content is growing, increasing quality
类目: Computation and Language (cs.CL)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:With the increasing quality and spread of LLM-based assistants, the amount of artificially generated content is growing rapidly. In many cases and tasks, such texts are already indistinguishable from those written by humans, and the quality of generation tends to only increase. At the same time, detection methods are developing more slowly, making it challenging to prevent misuse of these technologies. In this work, we investigate the task of generated text detection by proposing the GigaCheck. Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts. For the first task, our approach utilizes a general-purpose LLM, leveraging its extensive language abilities to fine-tune efficiently for the downstream task of LLM-generated text detection, achieving high performance even with limited data. For the second task, we propose a novel approach that combines computer vision and natural language processing techniques. Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize artificially generated intervals within text. We evaluate the GigaCheck on five classification datasets with English texts and three datasets designed for Human-Machine collaborative text analysis. Our results demonstrate that GigaCheck outperforms previous methods, even in out-of-distribution settings, establishing a strong baseline across all datasets. Comments: 11 pages, 1 figure Subjects: Computation and Language (cs.CL) Cite as: arXiv:2410.23728 [cs.CL] (or arXiv:2410.23728v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.23728 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：随着基于大语言模型（LLM）的助手质量和普及度的不断提升，人工生成内容的数量正在迅速增长。在许多情况和任务中，这些文本已经与人类撰写的文本难以区分，且生成质量有持续提升的趋势。与此同时，检测方法的发展相对较慢，使得防止这些技术的滥用变得具有挑战性。在本研究中，我们通过提出GigaCheck来探讨生成文本检测的任务。我们的研究探索了两种方法：（i）区分人类撰写的文本与大语言模型生成的文本，以及（ii）在人机协作文本中检测大语言模型生成的片段。对于第一个任务，我们的方法利用了一个通用的大语言模型，借助其广泛的语言能力，能够高效地进行下游任务的微调，即使在数据有限的情况下也能实现高性能。对于第二个任务，我们提出了一种结合计算机视觉和自然语言处理技术的新方法。具体来说，我们使用了一个经过微调的通用大语言模型，并结合了类似于DETR的检测模型（从计算机视觉领域改编而来），以定位文本中人工生成的片段。我们在五个包含英文文本的分类数据集和三个专为人机协作文本分析设计的数据集上评估了GigaCheck。结果表明，GigaCheck在分布外设置下也优于以往的方法，在所有数据集上建立了强大的基准。

评论：11页，1图
主题：计算与语言（cs.CL）
引用为：arXiv:2410.23728 [cs.CL]
（或 arXiv:2410.23728v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.23728
了解更多信息
arXiv-issued DOI via DataCite（待注册）

[NLP-40] Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial

【速读】：该论文试图解决临床编码任务中的效率问题，特别是针对复杂临床文本的编码时间。解决方案的关键在于开发并测试了一款名为Easy-ICD的AI工具，该工具旨在辅助临床编码员，通过减少编码时间来提高工作效率。研究结果显示，使用Easy-ICD工具在处理复杂临床文本时，编码时间显著减少了46%，而在处理简单文本时则没有显著的时间差异。尽管在编码准确性方面没有显著提升，但该研究仍表明AI工具有潜力显著改善复杂临床编码任务的工作效率。

链接: https://arxiv.org/abs/2410.23725
作者: Taridzo Chomutare,Therese Olsen Svenning,Miguel Ángel Tejedor Hernández,Phuong Dinh Ngo,Andrius Budrionis,Kaisa Markljung,Lill Irene Hind,Torbjørn Torsvik,Karl Øyvind Mikalsen,Aleksandar Babic,Hercules Dalianis
关键词-EN: Crossover randomized controlled, randomized controlled trial, Crossover randomized, Trial design, controlled trial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:\textbfTrial design Crossover randomized controlled trial. \textbfMethods An AI tool, Easy-ICD, was developed to assist clinical coders and was tested for improving both accuracy and time in a user study in Norway and Sweden. Participants were randomly assigned to two groups, and crossed over between coding complex (longer) texts versus simple (shorter) texts, while using our tool versus not using our tool. \textbfResults Based on Mann-Whitney U test, the median coding time difference for complex clinical text sequences was 123 seconds (\emphP\textless.001, 95% CI: 81 to 164), representing a 46% reduction in median coding time when our tool is used. There was no significant time difference for simpler text sequences. For coding accuracy, the improvement we noted for both complex and simple texts was not significant. \textbfConclusions This study demonstrates the potential of AI to transform common tasks in clinical workflows, with ostensible positive impacts on work efficiencies for complex clinical coding tasks. Further studies within hospital workflows are required before these presumed impacts can be more clearly understood.
摘要：试验设计 交叉随机对照试验。方法开发了一种名为 Easy-ICD 的 AI 工具，旨在协助临床编码员，并在挪威和瑞典的用户研究中测试其提高编码准确性和效率的效果。参与者被随机分配到两个组，分别在编码复杂（较长）文本和简单（较短）文本时使用或不使用该工具。结果根据 Mann-Whitney U 检验，复杂临床文本序列的编码时间中位数差异为 123 秒（P<.001，95% CI: 81 至 164），表示在使用该工具时，编码时间中位数减少了 46%。对于较简单的文本序列，编码时间没有显著差异。在编码准确性方面，无论是复杂还是简单的文本，我们观察到的改进均不显著。结论本研究展示了 AI 在临床工作流程中转变常见任务的潜力，对复杂临床编码任务的工作效率有明显的积极影响。在医院工作流程中进一步研究之前，这些假设的影响需要更清晰地理解。

[NLP-41] OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models

【速读】：该论文试图解决大语言模型（LLMs）在离线环境下对其思维链（chain-of-thought）能力的评估问题。解决方案的关键在于提出了一个名为OCEAN的离线思维链评估框架，该框架将思维链推理建模为马尔可夫决策过程（MDP），并通过知识图谱（KG）的偏好建模来评估策略的匹配度。为了克服LLM推理与KG结构之间的异质性问题，论文利用策略上的KG探索和强化学习（RL）来模拟KG的推理偏好，生成针对LLM生成思维链推理路径的token级似然分布。此外，论文还提出了KG-IPS估计器，将知识图谱反馈的有效性和对齐性纳入逆倾向得分（IPS）中，并证明了该估计器的无偏性和方差下界。通过这种离线评估的值函数，可以直接进行离线策略优化，从而进一步增强思维链的对齐性。

链接: https://arxiv.org/abs/2410.23703
作者: Junda Wu,Xintong Li,Ruoyu Wang,Yu Xia,Yuxin Xiong,Jianing Wang,Tong Yu,Xiang Chen,Branislav Kveton,Lina Yao,Jingbo Shang,Julian McAuley
关键词-EN: current methods remain, methods remain underexplored, Offline evaluation, understanding their capacities, existing research
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:Offline evaluation of LLMs is crucial in understanding their capacities, though current methods remain underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluate the policy’s alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs’ general abilities in downstream tasks or their internal knowledge.
摘要：离线评估大语言模型（LLM）的能力对于理解其功能至关重要，尽管现有研究中对此类方法的探索仍显不足。在本研究中，我们专注于大语言模型的链式思维（chain-of-thought）能力的离线评估，并展示了如何基于所提出的评估方法优化大语言模型。为了实现具有丰富知识和推理路径的离线反馈，我们利用知识图谱（如Wikidata5m）对生成的链式思维进行反馈。由于大语言模型的推理与知识图谱结构之间的异质性，直接从知识图谱获取对大语言模型行为的反馈存在挑战，这需要在大语言模型生成的链式思维与知识图谱之间进行准确的实体链接和基础对齐。为应对上述挑战，我们提出了一种离线链式思维评估框架——OCEAN，该框架将大语言模型中的链式思维推理建模为马尔可夫决策过程（MDP），并评估策略与知识图谱偏好模型的对齐情况。为克服推理异质性和基础对齐问题，我们利用策略上的知识图谱探索和强化学习（RL）来建模一个知识图谱策略，该策略为大语言模型生成的链式思维推理路径生成Token级别的似然分布，模拟知识图谱的推理偏好。随后，我们将知识图谱对生成推理路径的有效性和对齐情况的反馈纳入逆倾向得分（IPS），并提出了知识图谱逆倾向得分（KG-IPS）估计器。理论上，我们证明了所提出的KG-IPS估计器的无偏性，并提供了其方差的下界。通过离线评估的价值函数，我们可以直接进行离线策略优化，以进一步增强链式思维的对齐效果。我们的实证研究表明，OCEAN能够在不影响大语言模型在下游任务中的通用能力或其内部知识的情况下，高效地优化生成具有更高估计值的链式思维推理路径。

[NLP-42] Instruction-Tuning Llama-3-8B Excels in City-Scale Mobility Prediction

【速读】：该论文试图解决长期城市范围内的人类移动性预测问题，特别是在跨城市环境中的泛化能力。解决方案的关键在于引入了一个名为Llama-3-8B-Mob的大型语言模型，该模型通过指令微调（instruction tuning）进行优化，能够以问答方式进行长期移动性预测。研究结果表明，Llama-3-8B-Mob在多个预测指标上超越了现有最先进的方法，并展示了强大的零样本泛化能力，即使仅在单一城市的有限样本上进行微调，也能有效泛化到其他城市。

链接: https://arxiv.org/abs/2410.23692
作者: Peizhi Tang,Chuang Yang,Tong Xing,Xiaohang Xu,Renhe Jiang,Kaoru Sezaki
关键词-EN: disaster response, epidemic forecasting, plays a critical, critical role, role in applications
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Human mobility prediction plays a critical role in applications such as disaster response, urban planning, and epidemic forecasting. Traditional methods often rely on designing crafted, domain-specific models, and typically focus on short-term predictions, which struggle to generalize across diverse urban environments. In this study, we introduce Llama-3-8B-Mob, a large language model fine-tuned with instruction tuning, for long-term citywide mobility prediction – in a QA manner. We validate our approach using large-scale human mobility data from four metropolitan areas in Japan, focusing on predicting individual trajectories over the next 15 days. The results demonstrate that Llama-3-8B-Mob excels in modeling long-term human mobility – surpassing the state-of-the-art on multiple prediction metrics. It also displays strong zero-shot generalization capabilities – effectively generalizing to other cities even when fine-tuned only on limited samples from a single city. Source codes are available at this https URL.
摘要：人类移动性预测在灾害响应、城市规划和疫情预测等应用中起着至关重要的作用。传统方法通常依赖于设计特定领域的模型，并且主要关注短期预测，这些方法在多样化的城市环境中难以泛化。在本研究中，我们引入了 Llama-3-8B-Mob，这是一个经过指令微调的大语言模型，用于以问答方式进行长期的城市范围移动性预测。我们通过日本四个大都市区的大规模人类移动性数据验证了我们的方法，重点预测未来15天的个人轨迹。结果表明，Llama-3-8B-Mob 在长期人类移动性建模方面表现出色，在多个预测指标上超越了当前最先进的技术。它还展示了强大的零样本泛化能力，即使仅在单一城市的有限样本上进行微调，也能有效泛化到其他城市。源代码可在以下链接获取：https URL。

[NLP-43] Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

【速读】：该论文试图解决由于字节级字节对编码（byte-level byte-pair encoding, BPE）分词器产生的不可解码令牌（incomplete tokens）所导致的模型行为异常问题。解决方案的关键在于识别并利用“不可能二元组”（improbable bigrams），即由不可解码令牌与不常见令牌组合而成的、超出分布范围的令牌组合，以展示这些令牌在遇到不熟悉令牌时的脆弱性。实验结果表明，这种令牌组合显著增加了模型产生幻觉行为的风险，而使用其他分词方式对相同短语进行分词则能大幅降低幻觉率（如Llama3.1模型中减少了93%）。因此，论文强调了字节级BPE分词器可能引入的潜在漏洞，并警示其在构建可信赖语言模型中的风险。

链接: https://arxiv.org/abs/2410.23684
作者: Eugene Jang,Kimin Lee,Jin-Woo Chung,Keuntae Park,Seungwon Shin
关键词-EN: bridges human-readable text, model-readable discrete tokens, crucial step, step that bridges, bridges human-readable
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, alternative tokenizations of the same phrases result in drastically lower rates of hallucination (93% reduction in Llama3.1). We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may impede the development of trustworthy language models.
摘要：Token化是将人类可读文本转换为模型可读离散Token的关键步骤。然而，最近的研究表明，Token化器可能被利用以引发不希望的模型行为。在本研究中，我们探讨了不完整Token，即由于字节级字节对编码（BPE）Token化产生的包含杂散字节的不可解码Token。我们假设这些Token高度依赖其相邻Token，并且在与不熟悉的Token配对时显得脆弱。为了展示这种脆弱性，我们引入了不可能的二元组：设计用于利用其依赖性的不完整Token的分布外组合。我们的实验显示，不可能的二元组显著容易产生幻觉行为。令人惊讶的是，相同短语的替代Token化导致幻觉率大幅降低（Llama3.1中减少了93%）。我们提醒，字节级BPE Token化器可能引入的潜在漏洞可能会阻碍可信语言模型的发展。

[NLP-44] Pseudo-Conversation Injection for LLM Goal Hijacking

【速读】：该论文试图解决大型语言模型 (LLMs) 中的目标劫持 (Goal Hijacking) 问题，即通过对抗性攻击手段操纵模型生成特定的预定输出。解决方案的关键在于提出了一种名为伪对话注入 (Pseudo-Conversation Injection) 的新型攻击方法，该方法利用了 LLMs 在对话上下文中角色识别的弱点。具体而言，攻击者通过伪造模型对用户初始提示的响应，并随后引入一个恶意的新任务提示，使模型误认为初始提示和伪造响应已完成对话，从而执行新的伪造提示。论文提出了三种伪对话构建策略：目标伪对话 (Targeted Pseudo-Conversation)、通用伪对话 (Universal Pseudo-Conversation) 和鲁棒伪对话 (Robust Pseudo-Conversation)，以在不同场景下实现有效的目标劫持。实验结果表明，该方法在攻击效果上显著优于现有方法。

链接: https://arxiv.org/abs/2410.23678
作者: Zheng Chen,Buhui Yao
关键词-EN: Large Language Models, Large Language, user original input, Language Models, Goal hijacking
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Goal hijacking is a type of adversarial attack on Large Language Models (LLMs) where the objective is to manipulate the model into producing a specific, predetermined output, regardless of the user’s original input. In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user’s prompt, which coerces the model into ignoring the user’s original input and generating the target response. In this paper, we introduce a novel goal hijacking attack method called Pseudo-Conversation Injection, which leverages the weaknesses of LLMs in role identification within conversation contexts. Specifically, we construct the suffix by fabricating responses from the LLM to the user’s initial prompt, followed by a prompt for a malicious new task. This leads the model to perceive the initial prompt and fabricated response as a completed conversation, thereby executing the new, falsified prompt. Following this approach, we propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation. These strategies are designed to achieve effective goal hijacking across various scenarios. Our experiments, conducted on two mainstream LLM platforms including ChatGPT and Qwen, demonstrate that our proposed method significantly outperforms existing approaches in terms of attack effectiveness.
摘要：目标劫持是一种针对大语言模型 (LLM) 的对抗攻击类型，其目的是操纵模型生成特定的预定输出，而不管用户的原始输入内容。在目标劫持中，攻击者通常会在用户的提示后附加一个精心设计的恶意后缀，迫使模型忽略用户的原始输入并生成目标响应。本文介绍了一种新颖的目标劫持攻击方法，称为伪对话注入 (Pseudo-Conversation Injection)，该方法利用了 LLM 在对话上下文中角色识别的弱点。具体而言，我们通过伪造 LLM 对用户初始提示的响应，随后附加一个恶意新任务的提示来构建后缀。这使得模型将初始提示和伪造的响应视为已完成对话，从而执行新的伪造提示。基于此方法，我们提出了三种伪对话构建策略：定向伪对话 (Targeted Pseudo-Conversation)、通用伪对话 (Universal Pseudo-Conversation) 和鲁棒伪对话 (Robust Pseudo-Conversation)。这些策略旨在在各种场景下实现有效的目标劫持。我们在两个主流 LLM 平台（包括 ChatGPT 和 Qwen）上进行的实验表明，我们提出的方法在攻击效果方面显著优于现有方法。

[NLP-45] Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

【速读】：该论文试图解决在生成式 AI 推理应用中，由于内核边界同步开销导致的 GPU 性能低下问题。解决方案的关键是提出了一种名为“内核循环 (kernel looping)”的专门全局优化技术。该技术通过结合现代数据流架构中可能的层级融合与语言模型中重复的层结构，将连续调用同一内核的同步成本消除，通过将这些调用转换为对包含流水线外循环的修改内核的单次调用，从而显著提升了解码阶段的性能。实验结果表明，该方法在 SambaNova SN40L 可重构数据流单元 (RDU) 上，对多种开源模型的解码阶段性能提升了高达 2.2 倍，并且在多 SN40L 插槽上实现了高达 2.5 倍的加速，最终在 8 和 16 插槽上达到了超过 90% 的峰值性能，相较于 DGX H100 实现了高达 3.7 倍的加速。

链接: https://arxiv.org/abs/2410.23668
作者: David Koeplinger,Darshan Gandhi,Pushkar Nandkar,Nathan Sheeley,Matheen Musaddiq,Leon Zhang,Reid Goodbar,Matthew Shaffer,Han Wang,Angela Wang,Mingran Wang,Raghu Prabhakar
关键词-EN: Token generation, token generation due, kernel looping, Token generation speed, kernel
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory bandwidth. While recent dataflow architectures mitigate these overheads by enabling aggressive fusion of decoder layers into a single kernel, they too leave performance on the table due to synchronization penalties at layer boundaries. This paper presents kernel looping, a specialized global optimization technique which exploits an optimization opportunity brought by combining the unique layer-level fusion possible in modern dataflow architectures with the repeated layer structure found in language models. Kernel looping eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a modified kernel containing a pipelined outer loop. We evaluate kernel looping on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a commercial dataflow accelerator for AI. Experiments demonstrate that kernel looping speeds up the decode phase of a wide array of powerful open-source models by up to 2.2 \times on SN40L. Kernel looping allows scaling of decode performance over multiple SN40L sockets, achieving speedups of up to 2.5 \times . Finally, kernel looping enables SN40L to achieve over 90% of peak performance on 8 and 16 sockets and achieve a speedup of up to 3.7 \times over DGX H100. Kernel looping, as well as the models evaluated in this paper, are deployed in production in a commercial AI inference cloud. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) ACMclasses: D.3.4; C.1.3 Cite as: arXiv:2410.23668 [cs.CL] (or arXiv:2410.23668v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.23668 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：Token 生成速度对于推动下一波 AI 推理应用至关重要。GPU 在 Token 生成过程中由于内核边界上的同步开销，仅能利用其峰值内存带宽的 21%。尽管最近的数据流架构通过将解码器层积极融合到一个内核中来缓解这些开销，但由于层边界上的同步惩罚，它们仍然无法充分发挥性能。本文提出了一种名为“内核循环”的专门全局优化技术，该技术利用了现代数据流架构中可能实现的独特层级融合与语言模型中重复层结构的结合所带来的优化机会。内核循环通过将连续调用同一内核的操作转换为对包含流水线外循环的修改内核的单次调用，从而消除了这些调用之间的同步成本。我们在 SambaNova SN40L 可重构数据流单元（RDU）上评估了内核循环，这是一个用于 AI 的商业数据流加速器。实验表明，内核循环在 SN40L 上将一系列强大的开源模型的解码阶段加速了高达 2.2 倍。内核循环还支持在多个 SN40L 插槽上扩展解码性能，实现了高达 2.5 倍的加速。最后，内核循环使得 SN40L 在 8 和 16 个插槽上实现了超过 90% 的峰值性能，并在 DGX H100 上实现了高达 3.7 倍的加速。内核循环以及本文评估的模型已在商业 AI 推理云中投入生产。

主题：计算与语言（cs.CL）；人工智能（cs.AI）；硬件架构（cs.AR）
ACM 分类：D.3.4；C.1.3
引用方式：arXiv:2410.23668 [cs.CL]（或 arXiv:2410.23668v1 [cs.CL] 用于此版本）
https://doi.org/10.48550/arXiv.2410.23668
通过 DataCite 发布的 arXiv DOI（待注册）

[NLP-46] Morphological Typology in BPE Subword Productivity and Language Modeling

【速读】：该论文试图解决形态类型学对分词和语言建模性能的影响问题。解决方案的关键在于研究合成型和分析型语言在使用字节对编码 (BPE) 算法进行分词时的生产力和子词规律性，并通过实验比较不同语言在相同数据量下的模型表现。研究发现，具有合成特征的语言在BPE分词中表现出更高的子词规律性和生产力，从而在语言建模任务中取得更好的结果，这表明形态类型学与BPE分词效率之间存在相关性。

链接: https://arxiv.org/abs/2410.23656
作者: Iñigo Parra
关键词-EN: study investigates, investigates the impact, BPE tokenization, BPE, language modeling
类目: Computation and Language (cs.CL)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:This study investigates the impact of morphological typology on tokenization and language modeling performance. We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized using the byte-pair encoding (BPE) algorithm. We compare the performance of models trained with similar amounts of data in different languages. Our experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization and achieve better results in language modeling tasks. We also observe that the typological continuum from linguistic theory is reflected in several experiments. These findings suggest a correlation between morphological typology and BPE tokenization efficiency.
摘要：本研究探讨了形态类型学对分词和语言建模性能的影响。我们重点关注具有综合和分析形态结构的语言，并考察了它们在使用字节对编码 (BPE) 算法进行分词时的生产力。我们比较了在不同语言中使用相似数据量训练的模型的性能。实验结果显示，具有综合特征的语言在 BPE 分词时表现出更高的子词规则性和生产力，并在语言建模任务中取得了更好的结果。我们还观察到，语言学理论中的类型学连续体在多个实验中得到了体现。这些发现表明，形态类型学与 BPE 分词效率之间存在关联。

[NLP-47] On Positional Bias of Faithfulness for Long-form Summarization

【速读】：该论文试图解决大型语言模型（LLMs）在长文本处理中存在的位置偏差问题，特别是在长文本摘要任务中，模型往往忽视输入文本中间部分的信息，导致摘要的忠实度（faithfulness）下降。解决方案的关键在于识别和评估这种位置偏差，并通过实验验证了多种技术手段来缓解这一问题。研究者发现，通过提示技术（prompting techniques）可以有效引导模型关注特定位置的内容，从而提高摘要的忠实度，而更复杂的生成技术则效果有限。

链接: https://arxiv.org/abs/2410.23609
作者: David Wan,Jesse Vig,Mohit Bansal,Shafiq Joty
关键词-EN: Large Language Models, Large Language, long-context settings, under-attending to information, Language Models
类目: Computation and Language (cs.CL)
备注: 18 pages

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a “U-shaped” trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available in this https URL.
摘要：大语言模型 (LLM) 在长上下文设置中常常表现出位置偏差，即对输入中间部分的信息关注不足。我们研究了这种偏差在长篇摘要生成中的存在性、其对忠实度的影响以及多种缓解这种偏差的技术。为了持续评估忠实度，我们首先编制了一个包含八个人类标注的长篇摘要数据集的基准，并对忠实度指标进行了元评估。我们发现，基于 LLM 的忠实度指标虽然在全上下文输入中有效，但仍然对文档顺序敏感，表明存在位置偏差。通过对六个数据集上的 LLM 生成摘要进行分析，我们发现了一个“U 形”的忠实度趋势，即 LLM 能够忠实地总结文档的开始和结尾部分，但忽略了中间内容。同样，扰动文档顺序也揭示了当重要文档被放置在输入的中间时，模型的忠实度较低。我们发现，这种行为部分是由于随着上下文长度的增加，模型的关注点发生了变化：随着上下文的增加，摘要的忠实度降低，但超过一定长度后，忠实度会提高，因为模型开始关注结尾部分。最后，我们尝试了不同的生成技术以减少位置偏差，发现提示技术能够有效地引导模型关注特定位置，而更复杂的方法提供的改进有限。我们的数据和代码可在以下链接获取：https URL。

[NLP-48] Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLM s

【速读】：该论文试图解决大型语言模型（LLMs）在处理长尾知识（long-tail knowledge）时预测不确定性高的问题。解决方案的关键在于提出了一种基于强化学习的动态不确定性排序方法，用于增强上下文学习（in-context learning, ICL）中的检索增强效果。该方法通过动态调整检索样本的排序，优先选择信息量丰富且稳定的样本，同时降低误导性样本的影响，并根据LLM对每个检索样本的反馈进行排序更新。此外，引入了一个可学习的动态排序阈值，以提高训练效率并降低查询成本，特别是在模型遇到负向预测偏移时进行调整。实验结果表明，该方法在不同领域的问答数据集上显著提升了长尾问题的准确性，相较于最佳基线提高了2.76%，在难以通过零样本推理解决的长尾问题上提升了5.96%的准确性。

链接: https://arxiv.org/abs/2410.23605
作者: Shuyang Yu,Runxue Bao,Parminder Bhatia,Taha Kass-Hout,Jiayu Zhou,Cao Xiao
关键词-EN: Large language models, learn vast amounts, Large language, learn vast, vast amounts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. However, long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models’ memorization. Prior work has shown that in-context learning (ICL) with retriever augmentation can help LLMs better capture long-tail knowledge, reducing their reliance on pre-trained data. Despite these advances, we observe that LLM predictions for long-tail questions remain uncertain to variations in retrieved samples. To take advantage of the uncertainty in ICL for guiding LLM predictions toward correct answers on long-tail samples, we propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions. Our approach prioritizes more informative and stable samples while demoting misleading ones, updating rankings based on the feedback from the LLM w.r.t. each retrieved sample. To enhance training efficiency and reduce query costs, we introduce a learnable dynamic ranking threshold, adjusted when the model encounters negative prediction shifts. Experimental results on various question-answering datasets from different domains show that our method outperforms the best baseline by 2.76% , with a notable 5.96% boost in accuracy on long-tail questions that elude zero-shot inference.
摘要：大语言模型 (LLM) 在预训练阶段能够从多个领域学习大量知识。然而，来自专业领域的长尾知识往往稀缺且代表性不足，很少出现在模型的记忆中。先前的工作表明，通过检索增强的上下文学习 (ICL) 可以帮助 LLM 更好地捕捉长尾知识，减少对预训练数据的依赖。尽管取得了这些进展，我们观察到，LLM 对长尾问题的预测仍然容易受到检索样本变化的影响。为了利用 ICL 中的不确定性来引导 LLM 对长尾样本的预测朝向正确答案，我们提出了一种基于强化学习的动态不确定性排序方法，该方法考虑了每个检索样本对 LLM 预测的不同影响。我们的方法优先考虑更具信息量和稳定性的样本，同时降低误导性样本的优先级，并根据 LLM 对每个检索样本的反馈更新排序。为了提高训练效率并降低查询成本，我们引入了一个可学习的动态排序阈值，当模型遇到负预测偏移时进行调整。在来自不同领域的各种问答数据集上的实验结果表明，我们的方法比最佳基线高出 2.76%，在那些难以通过零样本推理解决的长尾问题上，准确率显著提高了 5.96%。

[NLP-49] Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics

【速读】：该论文试图解决的问题是：在视觉刺激引起的美感体验中，感知计算（perceptual computations）和概念知识（conceptual knowledge）各自的作用如何区分。解决方案的关键在于利用线性解码技术，分析单模态视觉模型（如SimCLR）、单模态语言模型（如GPT2）以及多模态（语言对齐）深度神经网络模型（如SLIP）在学习到的表示上的表现，以预测人类对自然图像的美感评分。研究结果表明，单模态视觉模型解释了这些评分中的绝大部分可解释方差，而语言对齐的视觉模型和单模态语言模型在视觉嵌入基础上生成的描述性语言并未显著提升预测准确性。这表明，美感体验的基础可能主要由前馈感知（feedforward perception）的不可言喻计算构成。

链接: https://arxiv.org/abs/2410.23603
作者: Colin Conwell,Christopher Hamblin,Chelsea Boccagno,David Mayo,Jesse Cummings,Leyla Isik,Andrei Barbu
关键词-EN: versus conceptual knowledge, describe versus conceptual, stimulus as beautiful, derives from perceptual, versus conceptual
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When we experience a visual stimulus as beautiful, how much of that experience derives from perceptual computations we cannot describe versus conceptual knowledge we can readily translate into natural language? Disentangling perception from language in visually-evoked affective and aesthetic experiences through behavioral paradigms or neuroimaging is often empirically intractable. Here, we circumnavigate this challenge by using linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal (language-aligned) deep neural network (DNN) models to predict human beauty ratings of naturalistic images. We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision. Unimodal language models (e.g. GPT2) conditioned on visual embeddings to generate captions (via CLIPCap) yield no further gains. Caption embeddings alone yield less accurate predictions than image and caption embeddings combined (concatenated). Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
摘要：当我们感受到视觉刺激带来的美感时，这种体验中有多少是源自于我们无法描述的感知计算，又有多少是来自于我们可以轻松转化为自然语言的概念知识？通过行为范式或神经成像技术来分离视觉引发的情感和美学体验中的感知与语言，通常在实证上是难以实现的。在此，我们通过使用单模态视觉、单模态语言以及多模态（语言对齐的）深度神经网络（DNN）模型的学习表示进行线性解码，来预测人类对自然图像的美感评分，从而绕过了这一挑战。我们发现，单模态视觉模型（如 SimCLR）在这些评分中解释了绝大部分的可解释方差。语言对齐的视觉模型（如 SLIP）相对于单模态视觉模型仅带来了微小的提升。单模态语言模型（如 GPT2）在基于视觉嵌入生成描述（通过 CLIPCap）时，并未进一步提高预测精度。单独的描述嵌入比图像和描述嵌入结合（连接）的预测准确性更低。综上所述，这些结果表明，无论我们最终找到何种词汇来描述我们对美的体验，前馈感知中那些难以言喻的计算可能已经为这种体验提供了足够的基石。

[NLP-50] End-to-End Ontology Learning with Large Language Models

【速读】：该论文试图解决现有方法在构建本体（ontology）时，仅关注单个任务（subtasks）而忽视任务间交互的问题。解决方案的关键在于提出了OLLM方法，通过微调大型语言模型（LLM）并引入自定义正则化器（custom regulariser）来减少高频概念的过拟合，从而整体建模目标本体的子组件。这种方法不仅提高了生成本体的语义准确性，还保持了其结构完整性，并通过深度学习技术定义的更稳健的图距离度量来评估生成本体的质量。

链接: https://arxiv.org/abs/2410.23584
作者: Andy Lo,Albert Q. Jiang,Wenda Li,Mateja Jamnik
关键词-EN: automatic machine processing, structured format, automatic machine, machine processing, constructing ontologies requires
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Ontologies are useful for automatic machine processing of domain knowledge as they represent it in a structured format. Yet, constructing ontologies requires substantial manual effort. To automate part of this process, large language models (LLMs) have been applied to solve various subtasks of ontology learning. However, this partial ontology learning does not capture the interactions between subtasks. We address this gap by introducing OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch. Rather than focusing on subtasks, like individual relations between entities, we model entire subcomponents of the target ontology by finetuning an LLM with a custom regulariser that reduces overfitting on high-frequency concepts. We introduce a novel suite of metrics for evaluating the quality of the generated ontology by measuring its semantic and structural similarity to the ground truth. In contrast to standard metrics, our metrics use deep learning techniques to define more robust distance measures between graphs. Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples. Our source code and datasets are available at this https URL.
摘要：本体论对于自动机器处理领域知识非常有用，因为它们以结构化的格式表示这些知识。然而，构建本体论需要大量的手动工作。为了自动化这一过程的一部分，大语言模型（LLMs）已被应用于解决本体学习中的各种子任务。然而，这种部分本体学习并未捕捉到子任务之间的交互。我们通过引入 OLLM，一种通用且可扩展的方法，来填补这一空白，该方法能够从头开始构建本体论的分类主干。我们不是专注于子任务，如实体之间的个别关系，而是通过使用自定义正则化器对 LLM 进行微调，来建模目标本体论的整个子组件，该正则化器减少了高频概念上的过拟合。我们引入了一套新颖的评估指标，通过测量生成本体论与真实本体论之间的语义和结构相似性，来评估生成本体论的质量。与标准指标相比，我们的指标使用深度学习技术来定义更稳健的图间距离度量。我们在维基百科上的定量和定性结果表明，OLLM 优于子任务组合方法，生成的本体论在保持结构完整性的同时，语义上更为准确。我们进一步证明，我们的模型可以有效地适应新领域，如 arXiv，仅需少量训练样本。我们的源代码和数据集可在以下链接获取：https URL。

[NLP-51] BioNCERE: Non-Contrastive Enhancement For Relation Extraction In Biomedical Texts

【速读】：该论文试图解决生物医学领域关系抽取 (Relation Extraction, RE) 中的各向异性问题 (anisotropy problem)，并降低标注成本。解决方案的关键在于引入了一种名为生物非对比关系抽取 (Biological Non-Contrastive Relation Extraction, BioNCERE) 的新训练方法。BioNCERE 通过迁移学习和非对比学习 (non-contrastive learning) 来避免完全或维度坍塌以及过拟合问题。该方法分三个阶段进行关系抽取，利用两次迁移学习，并在第二阶段采用非对比学习，最终在不依赖命名实体标签的情况下进行关系预测。实验结果表明，BioNCERE 在 SemMedDB 数据集上几乎达到了当前最先进的关系抽取性能，且无需使用命名实体信息。

链接: https://arxiv.org/abs/2410.23583
作者: Farshad Noravesh
关键词-EN: biomedical domain, domain consider finetuning, finetuning BioBERT, relation extraction, anisotropy problem
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 figures, 2 tables, 10 pages

点击查看摘要

Abstract:State-of-the-art models for relation extraction (RE) in the biomedical domain consider finetuning BioBERT using classification, but they may suffer from the anisotropy problem. Contrastive learning methods can reduce this anisotropy phenomena, and also help to avoid class collapse in any classification problem. In the present paper, a new training method called biological non-contrastive relation extraction (BioNCERE) is introduced for relation extraction without using any named entity labels for training to reduce annotation costs. BioNCERE uses transfer learning and non-contrastive learning to avoid full or dimensional collapse as well as bypass overfitting. It resolves RE in three stages by leveraging transfer learning two times. By freezing the weights learned in previous stages in the proposed pipeline and by leveraging non-contrastive learning in the second stage, the model predicts relations without any knowledge of named entities. Experiments have been done on SemMedDB that are almost similar to State-of-the-art performance on RE without using the information of named entities.
摘要：在生物医学领域的关系抽取 (Relation Extraction, RE) 中，当前最先进的模型考虑使用分类方法对 BioBERT 进行微调，但这些模型可能受到各向异性问题的困扰。对比学习方法可以减少这种各向异性现象，并有助于避免分类问题中的类别崩溃。本文提出了一种新的训练方法，称为生物非对比关系抽取 (Biological Non-Contrastive Relation Extraction, BioNCERE)，该方法在不使用任何命名实体标签进行训练的情况下进行关系抽取，以降低标注成本。BioNCERE 利用迁移学习和非对比学习来避免完全或维度崩溃，并绕过过拟合问题。它通过两次利用迁移学习分三个阶段解决 RE 问题。通过在提出的流程中冻结前一阶段学习到的权重，并在第二阶段利用非对比学习，模型可以在不了解命名实体的情况下预测关系。实验在 SemMedDB 上进行，结果显示在不使用命名实体信息的情况下，RE 性能几乎达到了当前最先进的水平。

[NLP-52] From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents

【速读】：该论文试图解决大语言模型（LLM）在复杂现实应用中，如交互式网页导航中的性能优化问题。解决方案的关键在于优化上下文管理，特别是交互历史和网页表示的影响。通过有效管理这些上下文元素，论文展示了在分布外场景（包括未见过的网站、类别和地理位置）中，代理性能的显著提升。这些发现为设计和优化基于LLM的代理提供了重要见解，从而在实际应用中实现更准确和有效的网页导航。

链接: https://arxiv.org/abs/2410.23555
作者: Nalin Tiwary,Vardhan Dongre,Sanil Arun Chawla,Ashwin Lamani,Dilek Hakkani-Tür
关键词-EN: Large Language Model, Language Model, Large Language, Recent advancements, advancements in Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Recent advancements in Large Language Model (LLM)-based frameworks have extended their capabilities to complex real-world applications, such as interactive web navigation. These systems, driven by user commands, navigate web browsers to complete tasks through multi-turn dialogues, offering both innovative opportunities and significant challenges. Despite the introduction of benchmarks for conversational web navigation, a detailed understanding of the key contextual components that influence the performance of these agents remains elusive. This study aims to fill this gap by analyzing the various contextual elements crucial to the functioning of web navigation agents. We investigate the optimization of context management, focusing on the influence of interaction history and web page representation. Our work highlights improved agent performance across out-of-distribution scenarios, including unseen websites, categories, and geographic locations through effective context management. These findings provide insights into the design and optimization of LLM-based agents, enabling more accurate and effective web navigation in real-world applications.
摘要：近年来，基于大语言模型 (Large Language Model, LLM) 的框架在复杂现实应用中的能力得到了扩展，例如交互式网页导航。这些系统通过用户指令驱动，通过多轮对话在网页浏览器中完成任务，既提供了创新的机会，也带来了显著的挑战。尽管已经引入了用于对话式网页导航的基准测试，但对于影响这些智能体性能的关键上下文组件的详细理解仍然不足。本研究旨在填补这一空白，通过分析影响网页导航智能体功能的各种上下文元素。我们研究了上下文管理的优化，重点关注交互历史和网页表示的影响。我们的工作强调了通过有效的上下文管理，智能体在分布外场景中的性能提升，包括未见过的网站、类别和地理位置。这些发现为设计和优化基于 LLM 的智能体提供了见解，使其在现实应用中实现更准确和有效的网页导航。

[NLP-53] Simulating User Agents for Embodied Conversational-AI

【速读】：该论文试图解决在训练和评估具身代理（embodied agents）时，收集大规模多样化的人机对话数据集所面临的成本高、劳动密集和时间消耗大的问题。解决方案的关键在于构建一个基于大型语言模型（LLM）的用户代理，该代理能够在虚拟环境中模拟用户在与具身代理交互时的行为。通过这种方式，用户代理能够根据用户目标（如制作早餐）在每个时间步观察机器人动作或进行干预或回答问题，从而提高数据集生成的可扩展性和效率。这种方法对于提升机器人交互和任务完成能力，以及在强化学习中使用AI反馈的研究至关重要。通过与TEACh数据集的对比实验，论文展示了LLM-based用户代理在模拟人类对话行为方面的可行性和有效性。

链接: https://arxiv.org/abs/2410.23535
作者: Daniel Philipov,Vardhan Dongre,Gokhan Tur,Dilek Hakkani-Tür
关键词-EN: Embodied agents designed, interpret instructions, user agent, resolve issues, communicate effectively
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step, the user agent may observe" the robot actions or speak" to either intervene with the robot or answer questions. Such a user agent assists in improving the scalability and efficiency of embodied dialogues dataset generation and is critical for enhancing and evaluating the robot’s interaction and task completion ability, as well as for research in reinforcement learning using AI feedback. We evaluate our user agent’s ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero-shot prompting to predict dialogue acts, few-shot prompting, and fine-tuning on the TEACh training subset. Results show the LLM-based user agent achieves an F-measure of 42% with zero-shot prompting and 43.4% with few-shot prompting in mimicking human speaking behavior. Through fine-tuning, performance in deciding when to speak remained stable, while deciding what to say improved from 51.1% to 62.5%. These findings showcase the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.
摘要：设计用于协助用户完成任务的具身智能体必须能够进行自然语言交互、解释指令、执行动作并有效地沟通以解决问题。然而，收集大规模、多样化的具身人机对话数据集以训练和评估此类智能体既昂贵又耗时。为应对这一挑战，我们提出构建一个基于大语言模型（LLM）的用户智能体，该智能体能够在虚拟环境中模拟用户在与具身智能体交互时的行为。给定一个用户目标（例如，制作早餐），在每个时间步，用户智能体可以观察机器人动作或通过说话来干预机器人或回答问题。这种用户智能体有助于提高具身对话数据集生成的可扩展性和效率，对于增强和评估机器人的交互和任务完成能力，以及使用AI反馈进行强化学习研究至关重要。我们通过将用户智能体生成的模拟对话与TEACh数据集进行比较，评估其生成类似人类行为的能力。我们进行了三项实验：零样本提示以预测对话行为、少样本提示以及在TEACh训练子集上的微调。结果显示，基于LLM的用户智能体在零样本提示下达到42%的F值，在少样本提示下达到43.4%，以模仿人类说话行为。通过微调，决定何时说话的性能保持稳定，而决定说什么的能力从51.1%提高到62.5%。这些发现展示了所提出方法在通过自然语言沟通评估和增强机器人任务完成效果方面的可行性。

[NLP-54] Large Language Models for Patient Comments Multi-Label Classification

【速读】：该论文试图解决患者反馈文本的多标签分类问题，特别是在缺乏标注数据的情况下，如何有效利用大型语言模型 (LLMs) 进行分类。解决方案的关键在于利用 GPT-4o-Turbo 进行多标签文本分类 (MLTC)，并通过引入保护健康信息 (PHI) 检测框架确保患者数据的匿名性。此外，论文还探索了零样本学习、上下文学习和链式思维提示等提示工程框架，以提高分类性能。最终，GPT-4o-Turbo 在零样本和少样本设置下均优于传统方法和预训练语言模型 (PLMs)，实现了最高的整体性能，F1 分数达到 76.12%，加权 F1 分数为 73.61%。

链接: https://arxiv.org/abs/2410.23528
作者: Hajar Sakai,Sarah S. Lam,Mohammadsadegh Mikaeili,Joshua Bosire,Franziska Jovin
关键词-EN: sustainability and reputation, care quality, quality are crucial, Large Language Models, Pre-trained Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patient experience and care quality are crucial for a hospital’s sustainability and reputation. The analysis of patient feedback offers valuable insight into patient satisfaction and outcomes. However, the unstructured nature of these comments poses challenges for traditional machine learning methods following a supervised learning paradigm. This is due to the unavailability of labeled data and the nuances these texts encompass. This research explores leveraging Large Language Models (LLMs) in conducting Multi-label Text Classification (MLTC) of inpatient comments shared after a stay in the hospital. GPT-4o-Turbo was leveraged to conduct the classification. However, given the sensitive nature of patients’ comments, a security layer is introduced before feeding the data to the LLM through a Protected Health Information (PHI) detection framework, which ensures patients’ de-identification. Additionally, using the prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were experimented with. Results demonstrate that GPT-4o-Turbo, whether following a zero-shot or few-shot setting, outperforms traditional methods and Pre-trained Language Models (PLMs) and achieves the highest overall performance with an F1-score of 76.12% and a weighted F1-score of 73.61% followed closely by the few-shot learning results. Subsequently, the results’ association with other patient experience structured variables (e.g., rating) was conducted. The study enhances MLTC through the application of LLMs, offering healthcare practitioners an efficient method to gain deeper insights into patient feedback and deliver prompt, appropriate responses.
摘要：患者体验和护理质量对于医院的可持续性和声誉至关重要。对患者反馈的分析提供了关于患者满意度和治疗效果的宝贵见解。然而，这些评论的无结构性质对遵循监督学习范式的传统机器学习方法构成了挑战。这是因为缺乏标注数据以及这些文本所包含的细微差别。本研究探讨了利用大语言模型 (LLM) 进行住院患者评论的多标签文本分类 (MLTC)。GPT-4o-Turbo 被用于进行分类。然而，鉴于患者评论的敏感性，在将数据输入 LLM 之前，通过引入受保护的健康信息 (PHI) 检测框架来确保患者信息的去识别化。此外，通过提示工程框架，实验了零样本学习、上下文学习和链式思维提示。结果表明，无论是在零样本还是少样本设置下，GPT-4o-Turbo 的表现均优于传统方法和预训练语言模型 (PLM)，并在总体性能上达到了最高的 F1 分数 76.12% 和加权 F1 分数 73.61%，紧随其后的是少样本学习结果。随后，研究了这些结果与其他患者体验结构化变量（如评分）的关联。该研究通过应用 LLM 提升了 MLTC，为医疗从业者提供了一种高效的方法，以深入了解患者反馈并及时做出适当的回应。

[NLP-55] LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models

【速读】：该论文试图解决大型语言模型（LLMs）在知识密集型领域如医疗问答（QA）中事实准确性不足的问题。解决方案的关键在于引入LEAF（Learning and Evaluation Augmented by Fact-Checking）方法，通过两种策略提升LLMs的输出可靠性：一是Fact-Check-Then-RAG策略，通过事实核查结果指导检索增强生成（RAG）过程，而不更新模型参数；二是Learning from Fact-Checks via Self-Training策略，通过监督微调（SFT）或简单偏好优化（SimPO）结合事实核查作为排序机制，更新LLM参数。这两种策略共同作用，显著提高了LLMs在医疗问答等高要求场景中的事实准确性和可靠性。

链接: https://arxiv.org/abs/2410.23526
作者: Hieu Tran,Junda Wang,Yujan Ting,Weijing Huang,Terrence Chen
关键词-EN: Large language models, language processing tasks, natural language processing, shown remarkable capabilities, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in various natural language processing tasks, yet they often struggle with maintaining factual accuracy, particularly in knowledge-intensive domains like healthcare. This study introduces LEAF: Learning and Evaluation Augmented by Fact-Checking, a novel approach designed to enhance the factual reliability of LLMs, with a focus on medical question answering (QA). LEAF utilizes a dual strategy to enhance the factual accuracy of responses from models such as Llama 3 70B Instruct and Llama 3 8B Instruct. The first strategy, Fact-Check-Then-RAG, improves Retrieval-Augmented Generation (RAG) by incorporating fact-checking results to guide the retrieval process without updating model parameters. The second strategy, Learning from Fact-Checks via Self-Training, involves supervised fine-tuning (SFT) on fact-checked responses or applying Simple Preference Optimization (SimPO) with fact-checking as a ranking mechanism, both updating LLM parameters from supervision. These findings suggest that integrating fact-checked responses whether through RAG enhancement or self-training enhances the reliability and factual correctness of LLM outputs, offering a promising solution for applications where information accuracy is crucial.
摘要：大语言模型（LLMs）在多种自然语言处理任务中展现了卓越的能力，但在保持事实准确性方面，尤其是在医疗等知识密集型领域，仍面临挑战。本研究提出了LEAF：通过事实核查增强学习和评估，这是一种旨在提升LLMs事实可靠性的创新方法，特别关注医疗问答（QA）。LEAF采用双重策略来提高Llama 3 70B Instruct和Llama 3 8B Instruct等模型生成答案的事实准确性。第一种策略，即事实核查后增强检索生成（Fact-Check-Then-RAG），通过整合事实核查结果来指导检索过程，从而改进检索增强生成（RAG），且无需更新模型参数。第二种策略，通过自我训练学习事实核查（Learning from Fact-Checks via Self-Training），涉及对事实核查后的答案进行监督微调（SFT），或应用简单偏好优化（SimPO），将事实核查作为排序机制，这两种方法均通过监督更新LLM参数。这些发现表明，无论是通过RAG增强还是自我训练，整合事实核查后的答案都能提升LLM输出结果的可靠性和事实正确性，为信息准确性至关重要的应用提供了一种有前景的解决方案。

[NLP-56] Neural spell-checker: Beyond words with synthetic data generation

【速读】：该论文旨在通过引入基于深度学习和大型语言模型的新型拼写检查器，提升传统拼写检查器的功能，使其不仅能识别拼写错误，还能评估词汇在特定上下文中的适用性。解决方案的关键在于两种新型拼写检查器的开发与比较：一种是基于形态学词典的传统快速词级方法，具有比现有拼写检查器更大的词汇列表；另一种是基于在大型语料库上训练的语言模型，该语料库中插入了合成错误。论文强调了训练数据构建策略在神经拼写检查器中的重要性，并展示了所提出的神经模型在斯洛文尼亚语数据集上的精确度和召回率均显著优于现有拼写检查器。

链接: https://arxiv.org/abs/2410.23514
作者: Matej Klemen,Martin Božič,Špela Arhar Holdt,Marko Robnik-Šikonja
关键词-EN: identifying misspelled words, written texts, valuable tools, tools that enhance, enhance communication
类目: Computation and Language (cs.CL)
备注: Camera-ready version. Accepted to TSD 2024

点击查看摘要

Abstract:Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts. Recent improvements in deep learning, and in particular in large language models, have opened new opportunities to improve traditional spell-checkers with new functionalities that not only assess spelling correctness but also the suitability of a word for a given context. In our work, we present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets. The first spell-checker is a traditional, fast, word-based approach, based on a morphological lexicon with a significantly larger word list compared to existing spell-checkers. The second approach uses a language model trained on a large corpus with synthetically inserted errors. We present the training data construction strategies, which turn out to be a crucial component of neural spell-checkers. Further, the proposed neural model significantly outperforms all existing spell-checkers for Slovene in both precision and recall.
摘要：拼写检查器是一种通过识别书面文本中的拼写错误来提升沟通效率的有价值工具。近年来，深度学习的进步，特别是大语言模型的发展，为传统拼写检查器带来了新的功能改进机会，这些新功能不仅评估拼写的正确性，还评估单词在特定上下文中的适用性。在我们的研究中，我们提出了两种新的拼写检查器，并对其在合成数据集、学习者数据集以及更广泛的斯洛文尼亚语数据集上进行了评估。第一种拼写检查器是基于传统、快速的单词方法，采用了一个形态学词典，其词汇量显著大于现有拼写检查器。第二种方法则利用了一个在大规模语料库上训练的语言模型，该语料库中人工插入了错误。我们展示了训练数据的构建策略，这些策略被证明是神经拼写检查器的关键组成部分。此外，所提出的神经模型在精确度和召回率方面显著优于所有现有的斯洛文尼亚语拼写检查器。

[NLP-57] Dynamic Strategy Planning for Efficient Question Answering with Large Language Models ACL

【速读】：该论文试图解决在大语言模型（LLMs）中使用单一固定策略回答不同类型问题时，性能不佳且效率低下的问题。解决方案的关键在于提出了一种名为DyPlan的新技术，该技术通过动态策略选择过程来提升LLMs在问答任务中的表现并降低成本。DyPlan的核心在于引入了一个初始决策步骤，根据输入问题选择最合适的策略，并指导LLM的响应生成。此外，论文还扩展了DyPlan，提出了DyPlan-verify，增加了内部验证和修正过程，以进一步丰富生成的答案。实验结果表明，DyPlan在多跳问答（MHQA）数据集上能够提高模型性能7-13%，同时降低成本11-32%。

链接: https://arxiv.org/abs/2410.23511
作者: Tanmay Parekh,Pradyot Prakash,Alexander Radovic,Akshay Shekher,Denis Savenkov
关键词-EN: Large Language Models, Large Language, Research has shown, augmented generation strategies, retrieval augmented generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at ACL Rolling Review

点击查看摘要

Abstract:Research has shown the effectiveness of reasoning (e.g., Chain-of-Thought), planning (e.g., SelfAsk), and retrieval augmented generation strategies to improve the performance of Large Language Models (LLMs) on various tasks, such as question answering. However, using a single fixed strategy to answer different kinds of questions is suboptimal in performance and inefficient in terms of generated output tokens and performed retrievals. In our work, we propose a novel technique DyPlan, to induce a dynamic strategy selection process in LLMs, to improve performance and reduce costs in question-answering. DyPlan incorporates an initial decision step to select the most suitable strategy conditioned on the input question and guides the LLM’s response generation accordingly. We extend DyPlan to DyPlan-verify, adding an internal verification and correction process to further enrich the generated answer. Experiments on three prominent multi-hop question answering (MHQA) datasets reveal how DyPlan can improve model performance by 7-13% while reducing the cost by 11-32% relative to the best baseline model.
摘要：研究表明，推理（如思维链）、规划（如自我询问）和检索增强生成策略在提升大语言模型（LLMs）在各种任务（如问答）中的表现方面具有显著效果。然而，使用单一固定策略来回答不同类型的问题在性能上并不理想，且在生成的输出Token和执行的检索方面效率低下。在我们的工作中，我们提出了一种新颖的技术DyPlan，旨在在大语言模型中引入动态策略选择过程，以提升问答任务中的性能并降低成本。DyPlan包含一个初始决策步骤，根据输入问题选择最合适的策略，并相应地指导大语言模型的响应生成。我们进一步扩展DyPlan为DyPlan-verify，增加了内部验证和修正过程，以进一步丰富生成的答案。在三个著名的多跳问答（MHQA）数据集上的实验表明，DyPlan能够将模型性能提升7-13%，同时相对于最佳基线模型，成本降低11-32%。

[NLP-58] ny Transformers Excel at Sentence Compression

【速读】：该论文试图解决的问题是当前大型语言模型（Large Language Models）在处理英语文本时，每个单词平均需要24千字节的存储空间，远超其实际的ASCII编码需求（5-6字节）。论文提出的解决方案关键在于利用1-3层的Transformer模型，通过编码和解码过程，将标准英语句子压缩至单个3千字节的token。这一方法不仅展示了在每个token嵌入中增加信息的可能性，还暗示了通过从子词token嵌入转向更大文本片段的方式来优化大型语言模型的潜力。

链接: https://arxiv.org/abs/2410.23510
作者: Peter Belcak,Roger Wattenhofer
关键词-EN: bytes of ASCII, kilobytes when served, staggering that words, average represented, large language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It is staggering that words of the English language, which are on average represented by 5–6 bytes of ASCII, require as much as 24 kilobytes when served to large language models. We show that there is room for more information in every token embedding. We demonstrate that 1–3-layer transformers are capable of encoding and subsequently decoding standard English sentences into as little as a single 3-kilobyte token. Our work implies that even small networks can learn to construct valid English sentences and suggests the possibility of optimising large language models by moving from sub-word token embeddings towards larger fragments of text.
摘要：令人震惊的是，英语单词平均由5-6字节的ASCII表示，但在提供给大语言模型时，却需要高达24千字节的存储空间。我们证明，每个Token嵌入中仍有更多的信息空间。我们展示了1-3层的Transformer能够将标准英语句子编码并随后解码为仅一个3千字节的Token。我们的研究暗示，即使是小型网络也能学会构建有效的英语句子，并提出了通过从子词Token嵌入转向更大的文本片段来优化大语言模型的可能性。

[NLP-59] Efficient and Interpretable Grammatical Error Correction with Mixture of Experts EMNLP2024

【速读】：该论文试图解决在语法错误纠正 (Grammatical Error Correction, GEC) 模型中，系统组合方法由于需要运行多个基础系统而导致的计算成本高的问题。解决方案的关键是提出了一种混合专家模型 (Mixture-of-Experts, MoE)，称为 MoECE。该模型通过在单一模型中集成多个专门针对不同错误类型的子网络，实现了与 T5-XL 相当的性能，同时有效参数减少了三倍。此外，MoECE 模型在推理过程中不仅能生成纠正，还能识别错误类型，从而提供可解释的纠正结果。

链接: https://arxiv.org/abs/2410.23507
作者: Muhammad Reza Qorib,Alham Fikri Aji,Hwee Tou Ng
关键词-EN: combining GEC models, combining GEC, GEC models, GEC, Error type information
类目: Computation and Language (cs.CL)
备注: Findings of EMNLP 2024

点击查看摘要

Abstract:Error type information has been widely used to improve the performance of grammatical error correction (GEC) models, whether for generating corrections, re-ranking them, or combining GEC models. Combining GEC models that have complementary strengths in correcting different error types is very effective in producing better corrections. However, system combination incurs a high computational cost due to the need to run inference on the base systems before running the combination method itself. Therefore, it would be more efficient to have a single model with multiple sub-networks that specialize in correcting different error types. In this paper, we propose a mixture-of-experts model, MoECE, for grammatical error correction. Our model successfully achieves the performance of T5-XL with three times fewer effective parameters. Additionally, our model produces interpretable corrections by also identifying the error type during inference.
摘要：错误类型信息已被广泛用于提升语法错误纠正（Grammatical Error Correction, GEC）模型的性能，无论是在生成纠正、重新排序纠正，还是组合GEC模型方面。结合在纠正不同错误类型上具有互补优势的GEC模型，能够非常有效地产生更好的纠正结果。然而，系统组合由于需要在运行组合方法之前对基础系统进行推理，因此计算成本较高。因此，拥有一个包含多个专门纠正不同错误类型的子网络的单一模型将更为高效。本文提出了一种混合专家模型，即MoECE，用于语法错误纠正。我们的模型成功实现了与T5-XL相当的性能，但有效参数减少了三倍。此外，我们的模型在推理过程中还能识别错误类型，从而产生可解释的纠正结果。

[NLP-60] Learning to Achieve Goals with Belief State Transformers

【速读】：该论文试图解决传统前向Transformer在处理复杂问题时表现不佳的问题。解决方案的关键在于引入“Belief State Transformer”，这是一种同时接受前缀和后缀作为输入的下一个词预测模型，其新颖的目标是预测前缀的下一个词和后缀的前一个词。通过学习一个紧凑的信念状态（belief state），该模型能够捕捉所有必要的信息以进行准确的预测，从而在领域无关的方式下有效解决传统Transformer难以应对的挑战性问题。实验证明，该模型在故事写作等任务中，无论目标是否已知，均优于Fill-in-the-Middle方法，并展现出更高效的基于目标的解码、更好的测试时推理能力以及高质量的文本表示。

链接: https://arxiv.org/abs/2410.23506
作者: Edward S. Hu,Kwangjun Ahn,Qinghua Liu,Haoran Xu,Manan Tomar,Ada Langford,Dinesh Jayaraman,Alex Lamb,John Langford
关键词-EN: Belief State Transformer, Belief State, State Transformer, compact belief state, State Transformer effectively
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce the “Belief State Transformer”, a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small scale problems.
摘要：我们引入了“信念状态 Transformer”，这是一种下一 Token 预测器，它同时接受前缀和后缀作为输入，并采用一种新颖的目标，即预测前缀的下一个 Token 和后缀的前一个 Token。信念状态 Transformer 能够有效地解决传统仅前向 Transformer 难以应对的复杂问题，且不依赖于特定领域。这一成功的关键在于学习一个紧凑的信念状态，该状态捕捉了进行准确预测所需的所有相关信息。实证消融实验表明，在标准 Transformer 表现不佳的困难场景中，模型的每个组成部分都是必不可少的。对于已知前缀和后缀的故事写作任务，我们的方法在达到已知目标方面优于填空方法，并且在目标未知的情况下也表现出更好的性能。总体而言，信念状态 Transformer 在目标条件解码、测试时推理以及小规模问题上的高质量文本表示方面实现了更高效的性能。

[NLP-61] Smaller Large Language Models Can Do Moral Self-Correction

【速读】：该论文试图解决的问题是验证为什么较小规模的语言模型（LLMs）在道德自我修正（moral self-correction）方面表现不佳，尽管先前的研究假设较大的模型更擅长遵循指令和理解抽象的社会规范。解决方案的关键在于通过细致的提示（meticulous prompting）和安全对齐微调（safety alignment fine-tuning）来验证这一假设。实验结果表明，经过适当安全对齐微调的3.8B参数LLMs能够实现良好的道德自我修正性能，突显了安全对齐的重要性。同时，研究发现所有规模的LLMs在面对不道德指令时，自我修正表现均不佳，这表明理解社会规范和自我解释（self-explanation）能力在不同规模的模型中存在差异。

链接: https://arxiv.org/abs/2410.23496
作者: Guangliang Liu,Zhiyu Xue,Rongrong Wang,Kristen Marie Johnson
关键词-EN: Large Language Models, amazing emerging capabilities, natural language feedback, capabilities of Large, Large Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given a natural language feedback which describes the problems of that output. Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but all scales of LLMs show bad self-correction performance given unethical instructions.
摘要：自我纠正是大语言模型 (LLM) 中一项令人惊叹的新兴能力，它使 LLM 能够在接收到描述输出问题的自然语言反馈后，自我修正不适当的输出。道德自我纠正是通过事后方法纠正不道德生成内容的一种方式，无需进行梯度更新，既计算轻量又能保持语言建模能力。先前的研究表明，LLM 能够自我去偏，并报告指出，参数少于 220 亿的小模型不具备道德自我纠正的能力。然而，尽管先前研究假设较大的模型擅长遵循指令和理解抽象的社会规范，但并无直接证据表明为何这些较小的模型在道德自我纠正方面表现不足。本文通过细致的提示设计，在社会刻板印象的背景下，实证验证了这一假设。我们的实验结果表明：(i) 令人惊讶的是，经过适当安全对齐微调的 38 亿参数 LLM 能够实现非常出色的道德自我纠正表现，突显了安全对齐的显著效果；(ii) 小规模 LLM 在理解社会规范和通过思维链 (CoT) 进行自我解释方面确实弱于大规模模型，但所有规模的 LLM 在接收到不道德指令时都表现出较差的自我纠正性能。

[NLP-62] Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs

【速读】：该论文试图解决科学领域中文献信息提取模型的比较和应用难题，特别是针对科学出版物中最常见的PDF格式。解决方案的关键在于提出了Collage工具，该工具支持快速原型设计、可视化和评估不同信息提取模型在科学PDF上的表现。Collage不仅集成了多种HuggingFace的token分类器、大型语言模型（LLMs）和其他任务特定模型，还提供了可扩展的软件接口以加速新模型的实验。此外，Collage通过提供处理过程中间状态的细粒度视图，帮助开发者和用户更好地检查、调试和理解模型的工作流程，从而在材料科学等领域的文献综述中辅助信息提取。

链接: https://arxiv.org/abs/2410.23478
作者: Sireesh Gururaja,Yueheng Zhang,Guannan Tang,Tianhao Zhang,Kevin Murphy,Yu-Tsen Yi,Junwon Seo,Anthony Rollett,Emma Strubell
关键词-EN: increasingly multimodal pretrained, multimodal pretrained transformer, pretrained transformer models, Recent years, domain-specific information extraction
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained transformer models. While the opportunity for scientists outside of NLP to evaluate and apply such systems to their own domains has never been clearer, these models are difficult to compare: they accept different input formats, are often black-box and give little insight into processing failures, and rarely handle PDF documents, the most common format of scientific publication. In this work, we present Collage, a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Collage allows the use and evaluation of any HuggingFace token classifier, several LLMs, and multiple other task-specific models out of the box, and provides extensible software interfaces to accelerate experimentation with new models. Further, we enable both developers and users of NLP-based tools to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.
摘要：近年来，自然语言处理（NLP）领域在科学文档的领域特定信息提取工具的持续发展与多模态预训练Transformer模型的不断发布中取得了显著进展。尽管非NLP领域的科学家评估和应用这些系统到自身领域的机会从未如此清晰，但这些模型难以比较：它们接受不同的输入格式，通常是黑箱模型，对处理失败的洞察力有限，并且很少处理PDF文档，这是科学出版物最常见的格式。在此工作中，我们介绍了Collage，这是一个专为科学PDF上的不同信息提取模型进行快速原型设计、可视化和评估而设计的工具。Collage允许直接使用和评估任何HuggingFace Token分类器、多个大语言模型（LLMs）以及多种其他任务特定的模型，并提供可扩展的软件接口以加速新模型的实验。此外，我们通过提供处理中间状态的细粒度视图，使NLP工具的开发者和用户能够检查、调试和更好地理解建模流程。我们在材料科学文献综述中展示我们的系统，以辅助信息提取。

[NLP-63] MDCure: A Scalable Pipeline for Multi-Document Instruction-Following

【速读】：该论文试图解决大型语言模型（LLMs）在处理多文档（Multi-document, MD）任务时面临的挑战，如管理文档间依赖关系、冗余信息和结构不连贯等问题。解决方案的关键是引入了一个名为MDCure的可扩展且有效的微调管道，通过生成高质量的合成MD指令数据来增强LLMs的MD处理能力，而无需进行预训练或依赖人工标注数据。MDCure的核心在于利用针对性的提示从相关文章集合中生成合成数据，并进一步通过多目标奖励模型MDCureRM筛选出对MD训练有用的数据。通过MDCure，论文成功微调了多个LLMs，包括FlanT5、Qwen2和LLAMA3.1系列模型，最大参数规模达到70B，并在广泛的MD和长上下文基准测试中展示了显著的性能提升，相比预训练基线模型和相应的基础模型，性能提升最高达75.5%。

链接: https://arxiv.org/abs/2410.23463
作者: Gabrielle Kaili-May Liu,Bowen Shi,Avi Caciularu,Idan Szpektor,Arman Cohan
关键词-EN: handle real-world tasks, handle real-world, summarization and question-answering, question-answering across large, Multi-document
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present challenges, such as managing inter-document dependencies, redundancy, and incoherent structures. We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human annotated data. MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts. We further introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings. With MDCure, we fine-tune a variety of LLMs, from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%. Our code, datasets, and models are available at this https URL.
摘要：多文档 (Multi-document, MD) 处理对于大语言模型 (Large Language Model, LLM) 在处理现实世界任务（如跨大量文档的摘要和问答）至关重要。尽管大语言模型在处理长输入方面有所改进，但多文档上下文仍面临挑战，如管理文档间依赖关系、冗余和不连贯的结构。我们引入了 MDCure，这是一个可扩展且有效的微调管道，旨在增强大语言模型的多文档处理能力，而无需预训练的计算成本或依赖人工标注数据。MDCure 基于从相关文章集合中通过定向提示生成高质量合成多文档指令数据。我们进一步引入了 MDCureRM，这是一个多目标奖励模型，用于根据其在多文档设置中的训练效用筛选生成的数据。通过 MDCure，我们对多种大语言模型进行了微调，包括 FlanT5、Qwen2 和 LLAMA3.1 模型家族，最大规模达 70B 参数。在涵盖各种任务的广泛多文档和长上下文基准测试中，MDCure 持续提升了性能，相较于预训练基线模型和相应的基础模型，性能提升高达 75.5%。我们的代码、数据集和模型可通过此 https URL 获取。

[NLP-64] Graph-Augmented Relation Extraction Model with LLM s-Generated Support Document

【速读】：该论文试图解决传统句子级关系抽取 (RE) 模型在捕捉复杂关系和跨句子上下文方面的局限性。解决方案的关键在于将图神经网络 (GNNs) 与大型语言模型 (LLMs) 相结合，通过 LLMs 生成上下文丰富的辅助文档，并构建复杂的图表示。随后，利用 GNN 处理该图，以精炼和丰富每个实体的嵌入，确保对数据有更细致和互联的理解。这种方法通过整合更广泛的上下文和利用实体间的交互，显著提升了模型捕捉复杂关系的能力。

链接: https://arxiv.org/abs/2410.23452
作者: Vicky Dong,Hao Yu,Yao Chen
关键词-EN: Large Language Models, Graph Neural Networks, enriched support documents, Large Language, contextually enriched support
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study introduces a novel approach to sentence-level relation extraction (RE) that integrates Graph Neural Networks (GNNs) with Large Language Models (LLMs) to generate contextually enriched support documents. By harnessing the power of LLMs to generate auxiliary information, our approach crafts an intricate graph representation of textual data. This graph is subsequently processed through a Graph Neural Network (GNN) to refine and enrich the embeddings associated with each entity ensuring a more nuanced and interconnected understanding of the data. This methodology addresses the limitations of traditional sentence-level RE models by incorporating broader contexts and leveraging inter-entity interactions, thereby improving the model’s ability to capture complex relationships across sentences. Our experiments, conducted on the CrossRE dataset, demonstrate the effectiveness of our approach, with notable improvements in performance across various domains. The results underscore the potential of combining GNNs with LLM-generated context to advance the field of relation extraction.
摘要：本研究提出了一种新颖的句子级关系抽取 (Relation Extraction, RE) 方法，该方法将图神经网络 (Graph Neural Networks, GNNs) 与大语言模型 (Large Language Models, LLMs) 相结合，以生成上下文丰富的支持文档。通过利用大语言模型生成辅助信息，我们的方法构建了文本数据的复杂图表示。随后，该图通过图神经网络 (GNN) 进行处理，以精炼和丰富与每个实体相关的嵌入，确保对数据有更细致和互联的理解。这种方法通过纳入更广泛的上下文并利用实体间的交互，解决了传统句子级 RE 模型的局限性，从而提高了模型捕捉句子间复杂关系的能力。我们在 CrossRE 数据集上进行的实验表明了该方法的有效性，并在各个领域中取得了显著的性能提升。结果强调了将 GNNs 与 LLM 生成的上下文相结合，以推动关系抽取领域发展的潜力。

[NLP-65] Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

【速读】：该论文试图解决Transformer模型在自然语言建模中的理论基础问题，特别是其结合上下文信息和全局知识的能力。解决方案的关键在于引入稀疏上下文双词模型（Sparse Contextual Bigram, SCB），作为经典双词模型的自然扩展，其中下一个词的生成依赖于由最后一个词决定的稀疏早期位置集合。论文分析了使用基于梯度算法的单层线性Transformer训练SCB的训练动态和样本复杂性，并证明了从零开始训练时，训练过程可以分为初始样本密集阶段和后续样本高效阶段。此外，论文还证明了在下游任务与预训练任务之间存在非平凡相关性的前提下，从预训练模型微调可以跳过初始样本密集阶段。实验结果表明，该算法在此设置下优于随机梯度下降（SGD），并讨论了其与常规基于softmax的Transformer之间的关系。

链接: https://arxiv.org/abs/2410.23438
作者: Yunwei Ren,Zixuan Wang,Jason D. Lee
关键词-EN: combine contextual informal, natural language modeling, global knowledge, Sparse Contextual Bigram, language modeling
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformers have excelled in natural language modeling and one reason behind this success is their exceptional ability to combine contextual informal and global knowledge. However, the theoretical basis remains unclear. In this paper, first we introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model, where the next token’s generation depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm. We show that when trained from scratch, the training process can be split into an initial sample-intensive stage where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage. We also empirically demonstrate that our algorithm can outperform SGD in this setting and discuss its relationship with the usual softmax-based transformers.
摘要：Transformer 在自然语言建模中表现出色，其成功的一个关键因素在于其卓越的结合上下文信息和全局知识的能力。然而，其理论基础仍不明确。本文首先介绍了稀疏上下文二元模型 (Sparse Contextual Bigram, SCB)，这是经典二元模型的一种自然扩展，其中下一个 Token 的生成依赖于由最后一个 Token 确定的一组稀疏的早期位置。接着，我们分析了使用基于梯度算法的一层线性 Transformer 学习 SCB 的训练动态和样本复杂度。研究表明，从零开始训练时，训练过程可分为两个阶段：首先是样本密集阶段，该阶段将相关性从零提升至非平凡值；随后是样本效率更高的阶段，进一步改进模型性能。此外，我们证明了，在下游任务与预训练任务之间存在非平凡相关性的前提下，从预训练模型进行微调可以跳过初始的样本密集阶段。我们还通过实验证明，在此设置下，我们的算法能够优于随机梯度下降 (SGD)，并讨论了其与常规基于 softmax 的 Transformer 之间的关系。

[NLP-66] Mind the Gap: A Generalized Approach for Cross-Modal Embedding Alignment

【速读】：该论文试图解决跨不同文本模态（如编程代码与伪代码、英语与法语句子）之间由于语义差异导致的检索增强生成 (Retrieval-Augmented Generation, RAG) 系统性能下降的问题。解决方案的关键在于引入了一种基于投影的方法，灵感来源于迁移学习中的适配器模块 (adapter modules)，通过轻量级的投影网络将异构文本模态的嵌入对齐到一个统一的语义空间中。这种方法不仅提高了检索的速度和准确性，还显著优于传统的检索方法（如Okapi BM25算法）和模型（如Dense Passage Retrieval, DPR），同时接近Sentence Transformers的准确性，且在训练和推理过程中仅需极少的资源。

链接: https://arxiv.org/abs/2410.23437
作者: Arihan Yadav,Alan McMillan
关键词-EN: incorporating external knowledge, enhance text generation, systems enhance text, Retrieval-Augmented Generation, systems enhance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems enhance text generation by incorporating external knowledge but often struggle when retrieving context across different text modalities due to semantic gaps. We introduce a generalized projection-based method, inspired by adapter modules in transfer learning, that efficiently bridges these gaps between various text types, such as programming code and pseudocode, or English and French sentences. Our approach emphasizes speed, accuracy, and data efficiency, requiring minimal resources for training and inference. By aligning embeddings from heterogeneous text modalities into a unified space through a lightweight projection network, our model significantly outperforms traditional retrieval methods like the Okapi BM25 algorithm and models like Dense Passage Retrieval (DPR), while approaching the accuracy of Sentence Transformers. Extensive evaluations demonstrate the effectiveness and generalizability of our method across different tasks, highlighting its potential for real-time, resource-constrained applications.
摘要：检索增强生成 (Retrieval-Augmented Generation, RAG) 系统通过整合外部知识来提升文本生成能力，但在跨不同文本模态检索上下文时，由于语义鸿沟的存在，常常遇到困难。我们提出了一种基于投影的广义方法，灵感来源于迁移学习中的适配器模块，能够高效地弥合各种文本类型之间的语义鸿沟，例如编程代码与伪代码，或英语与法语句子。我们的方法注重速度、准确性和数据效率，训练和推理所需的资源极少。通过将异构文本模态的嵌入通过轻量级投影网络对齐到一个统一空间，我们的模型显著优于传统的检索方法，如 Okapi BM25 算法和密集段落检索 (Dense Passage Retrieval, DPR) 模型，同时接近 Sentence Transformers 的准确性。广泛的评估表明，我们的方法在不同任务中的有效性和通用性，突显了其在实时、资源受限应用中的潜力。

[NLP-67] Social Science Meets LLM s: How Reliable Are Large Language Models in Social Simulations?

【速读】：该论文试图解决“基于大型语言模型 (LLM) 的模拟的可靠性”问题。解决方案的关键在于引入了一个名为 TrustSim 的评估数据集，用于系统性地研究 LLM 在计算社会科学 (CSS) 相关主题上的模拟可靠性。通过实验，论文发现 LLM 在模拟角色时存在不一致性，并且这种一致性与 LLM 的总体性能没有强相关性。为了提高 LLM 在模拟中的可靠性，论文提出了一种基于自适应学习率的强化学习算法 AdaORPO (Adaptive Learning Rate Based ORPO)，并在 7 个 LLM 上进行了验证。这一研究为未来探索更稳健和可信的 LLM 模拟奠定了基础。

链接: https://arxiv.org/abs/2410.23426
作者: Yue Huang,Zhengqing Yuan,Yujun Zhou,Kehan Guo,Xiangqi Wang,Haomin Zhuang,Weixiang Sun,Lichao Sun,Jindong Wang,Yanfang Ye,Xiangliang Zhang
关键词-EN: Large Language Models, Computational Social Science, Large Language, Language Models, Social Science
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed for simulations, enabling applications in role-playing agents and Computational Social Science (CSS). However, the reliability of these simulations is under-explored, which raises concerns about the trustworthiness of LLMs in these applications. In this paper, we aim to answer ``How reliable is LLM-based simulation?‘’ To address this, we introduce TrustSim, an evaluation dataset covering 10 CSS-related topics, to systematically investigate the reliability of the LLM simulation. We conducted experiments on 14 LLMs and found that inconsistencies persist in the LLM-based simulated roles. In addition, the consistency level of LLMs does not strongly correlate with their general performance. To enhance the reliability of LLMs in simulation, we proposed Adaptive Learning Rate Based ORPO (AdaORPO), a reinforcement learning-based algorithm to improve the reliability in simulation across 7 LLMs. Our research provides a foundation for future studies to explore more robust and trustworthy LLM-based simulations.
摘要：大语言模型 (LLM) 在模拟应用中的使用日益增多，这些应用包括角色扮演智能体和计算社会科学 (CSS)。然而，这些模拟的可靠性尚未得到充分探索，这引发了人们对 LLM 在这些应用中可信度的担忧。本文旨在回答“基于 LLM 的模拟有多可靠？”这一问题。为此，我们引入了 TrustSim，这是一个涵盖 10 个 CSS 相关主题的评估数据集，用于系统地研究 LLM 模拟的可靠性。我们在 14 个 LLM 上进行了实验，发现基于 LLM 的模拟角色中仍存在不一致性。此外，LLM 的一致性水平与其整体性能之间没有强相关性。为了提高 LLM 在模拟中的可靠性，我们提出了基于自适应学习率的 ORPO (AdaORPO)，这是一种基于强化学习的算法，用于在 7 个 LLM 中提升模拟的可靠性。我们的研究为未来探索更稳健和可信的基于 LLM 的模拟提供了基础。

[NLP-68] Leveraging Language Models and Bandit Algorithms to Drive Adoption of Battery-Electric Vehicles

【速读】：该论文试图解决行为改变干预措施在推广电动车辆（Battery Electric Vehicles, BEVs）时可能导致的反效果问题，即平均有效的干预措施可能在一些子群体中引发反弹效应。解决方案的关键在于利用大型语言模型（Large Language Models, LLMs）和上下文强盗（contextual bandit）算法，开发针对个体价值观的个性化对话干预措施。通过结合LLMs和上下文强盗算法，研究能够根据参与者的社会人口统计特征，学习并定制干预措施，从而提高说服效果。此外，研究还通过LLMs模拟参与者，以离线方式训练强盗算法，并将其与未使用社会人口统计特征的LLM生成的对话干预措施进行对比，以评估其说服效果。

链接: https://arxiv.org/abs/2410.23371
作者: Keiichi Namikoshi,David A. Shamma,Rumen Iliev,Jingchao Fang,Alexandre Filipowicz,Candice L Hogan,Charlene Wu,Nikos Arechiga
关键词-EN: coordinate societal action, reduce emissions, Behavior change interventions, coordinate societal, societal action
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Behavior change interventions are important to coordinate societal action across a wide array of important applications, including the adoption of electrified vehicles to reduce emissions. Prior work has demonstrated that interventions for behavior must be personalized, and that the intervention that is most effective on average across a large group can result in a backlash effect that strengthens opposition among some subgroups. Thus, it is important to target interventions to different audiences, and to present them in a natural, conversational style. In this context, an important emerging application domain for large language models (LLMs) is conversational interventions for behavior change. In this work, we leverage prior work on understanding values motivating the adoption of battery electric vehicles. We leverage new advances in LLMs, combined with a contextual bandit, to develop conversational interventions that are personalized to the values of each study participant. We use a contextual bandit algorithm to learn to target values based on the demographics of each participant. To train our bandit algorithm in an offline manner, we leverage LLMs to play the role of study participants. We benchmark the persuasive effectiveness of our bandit-enhanced LLM against an unaided LLM generating conversational interventions without demographic-targeted values.
摘要：行为改变干预对于协调社会在广泛重要应用中的行动至关重要，包括推广电动汽车以减少排放。先前的工作表明，行为干预必须个性化，并且对大型群体平均最有效的干预可能会在某些子群体中引发反效果，增强反对力量。因此，针对不同受众进行干预并采用自然、对话式的呈现方式非常重要。在此背景下，大语言模型（LLM）的一个重要新兴应用领域是对话式行为改变干预。在本研究中，我们利用了先前关于理解推动电池电动汽车采用的价值观的工作。我们结合了LLM的最新进展和上下文强盗（contextual bandit），开发了针对每位研究参与者价值观的个性化对话干预。我们使用上下文强盗算法，根据每位参与者的社会人口统计特征来学习目标价值观。为了以离线方式训练我们的强盗算法，我们利用LLM扮演研究参与者的角色。我们通过比较我们的强盗增强型LLM与未辅助的LLM（后者生成不基于社会人口统计目标价值观的对话干预）的说服效果，进行了基准测试。

[NLP-69] Can Models Help Us Create Better Models? Evaluating LLM s as Data Scientists

【速读】：该论文试图解决在数据科学领域中，大型语言模型（LLMs）在特征工程代码生成任务中的表现评估问题。解决方案的关键在于提出了一种新的基准（FeatEng），通过评估生成的代码对数据集的改进效果来衡量模型的能力。具体来说，模型被提供一个数据集描述，并要求生成相应的特征工程代码，然后通过XGBoost模型在修改后的数据集上的表现与原始数据集上的表现进行对比，从而得出评估分数。这种方法相较于现有方法，能够更便宜且高效地评估LLMs的广泛能力。

链接: https://arxiv.org/abs/2410.23331
作者: Michał Pietruszka,Łukasz Borchmann,Aleksander Jędrosz,Paweł Morawiecki
关键词-EN: writing feature engineering, requires domain knowledge, feature engineering code, large language models, language models designed
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. By an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that the FeatEng of our proposal can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to the existing methods.
摘要：我们提出了一种针对大语言模型的基准测试，旨在解决数据科学中最具知识密集型的任务之一：编写特征工程代码。这一任务不仅需要深厚的领域知识，还需要对底层问题和数据结构有深刻的理解。模型通过提示获得数据集描述，并被要求生成对该数据集进行转换的代码。评估分数基于在修改后的数据集上拟合的XGBoost模型相对于原始数据的改进程度得出。通过广泛评估最先进的模型并与已建立的基准进行比较，我们证明了我们提出的FeatEng能够以低成本和高效率评估大语言模型的广泛能力，与现有方法形成鲜明对比。

[NLP-70] VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

【速读】：该论文试图解决视觉语言模型 (Vision-Language Models, VLMs) 在推理过程中由于存储和访问大型键值 (Key-Value, KV) 缓存而导致的计算效率问题。解决方案的关键在于提出了一种名为 VL-Cache 的新型 KV 缓存压缩方法，该方法针对 VLM 的独特稀疏性模式进行了优化。具体来说，VL-Cache 通过区分视觉和文本令牌在预填充和解码阶段的稀疏性模式，引入了一种层级自适应的稀疏感知缓存预算分配方法，有效分配有限的缓存预算，从而在不牺牲准确性的前提下显著减少 KV 缓存的大小。此外，VL-Cache 还开发了一种模态感知的令牌评分策略，以更好地评估令牌的重要性。实验结果表明，仅保留 10% 的 KV 缓存即可达到与全缓存相当的准确性，同时在速度和内存占用方面实现了显著的提升。

链接: https://arxiv.org/abs/2410.23317
作者: Dezhan Tu,Danylo Vashchilenko,Yuzhe Lu,Panpan Xu
关键词-EN: demonstrated impressive performance, Large Language Models, Vision-Language Models, set of tasks, demonstrated impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in prefill and decoding phases. Based on these observations, we introduce a layer-adaptive sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate the token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of KV cache achieves accuracy comparable to that with full cache. In a speed benchmark, our method accelerates end-to-end latency of generating 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the memory footprint of KV cache in GPU by 90%.
摘要：视觉-语言模型（Vision-Language Models, VLMs）在多种任务中展现了卓越的性能。加速 VLMs 的一个关键挑战在于存储和访问编码了长视觉上下文（如图像或视频）的大型键值（Key-Value, KV）缓存。尽管现有的 KV 缓存压缩方法对大语言模型（Large Language Models, LLMs）有效，但直接将其迁移到 VLMs 会导致准确性和加速效果不佳。为了填补这一差距，我们提出了 VL-Cache，一种专为加速 VLM 推理而设计的新型 KV 缓存压缩方案。本文首先通过区分预填充和解码阶段的视觉和文本 Token，研究了 VLM 注意力的独特稀疏模式。基于这些观察，我们引入了一种层自适应的稀疏感知缓存预算分配方法，该方法能够在不同层之间有效分配有限的缓存预算，进一步减少 KV 缓存大小而不影响准确性。此外，我们还开发了一种模态感知的 Token 评分策略，以更好地评估 Token 的重要性。在多个基准数据集上的实验结果表明，仅保留 10% 的 KV 缓存即可达到与全缓存相当的准确性。在速度基准测试中，我们的方法将生成 100 个 Token 的端到端延迟加速了最高 2.33 倍，解码速度提高了最高 7.08 倍，同时将 GPU 中 KV 缓存的内存占用减少了 90%。

[NLP-71] Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

【速读】：该论文旨在系统分析大型语言模型 (LLMs) 对各种提示注入攻击 (Prompt Injection Attacks) 的脆弱性，并探讨模型参数和架构对这种脆弱性的影响。解决方案的关键在于通过统计分析（如逻辑回归和随机森林特征分析）揭示模型参数大小和架构对易受攻击性的显著影响，并识别出与特定模型配置相关的不同脆弱性特征。研究结果强调了在关键基础设施和敏感行业中部署的LLMs需要多层次的防御措施，以应对潜在的严重后果，如数据泄露、未经授权的访问或错误信息传播。未来研究应探索多语言和多步骤防御策略以及适应性缓解措施，以增强LLMs在多样化现实环境中的安全性。

链接: https://arxiv.org/abs/2410.23308
作者: Victoria Benjamin,Emily Braca,Israel Carter,Hafsa Kanchwala,Nava Khojasteh,Charly Landow,Yi Luo,Caroline Ma,Anna Magarelli,Rachel Mirin,Avery Moyer,Kayla Simpson,Amelia Skawinski,Thomas Heverin
关键词-EN: study systematically analyzes, leverages carefully crafted, large language models, malicious LLM behavior, carefully crafted prompts
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study systematically analyzes the vulnerability of 36 large language models (LLMs) to various prompt injection attacks, a technique that leverages carefully crafted prompts to elicit malicious LLM behavior. Across 144 prompt injection tests, we observed a strong correlation between model parameters and vulnerability, with statistical analyses, such as logistic regression and random forest feature analysis, indicating that parameter size and architecture significantly influence susceptibility. Results revealed that 56 percent of tests led to successful prompt injections, emphasizing widespread vulnerability across various parameter sizes, with clustering analysis identifying distinct vulnerability profiles associated with specific model configurations. Additionally, our analysis uncovered correlations between certain prompt injection techniques, suggesting potential overlaps in vulnerabilities. These findings underscore the urgent need for robust, multi-layered defenses in LLMs deployed across critical infrastructure and sensitive industries. Successful prompt injection attacks could result in severe consequences, including data breaches, unauthorized access, or misinformation. Future research should explore multilingual and multi-step defenses alongside adaptive mitigation strategies to strengthen LLM security in diverse, real-world environments.
摘要：本研究系统性地分析了36个大语言模型 (LLMs) 对各种提示注入攻击的脆弱性，这是一种利用精心设计的提示引发恶意大语言模型行为的攻击技术。在144次提示注入测试中，我们观察到模型参数与脆弱性之间存在强烈的相关性，统计分析（如逻辑回归和随机森林特征分析）表明，参数大小和架构显著影响模型的易感性。结果显示，56%的测试导致了成功的提示注入，强调了不同参数大小模型普遍存在的脆弱性，聚类分析识别出与特定模型配置相关的独特脆弱性特征。此外，我们的分析揭示了某些提示注入技术之间的相关性，暗示了脆弱性可能的重叠。这些发现强调了在部署于关键基础设施和敏感行业的LLMs中，迫切需要建立强大且多层次的防御措施。成功的提示注入攻击可能导致严重的后果，包括数据泄露、未经授权的访问或错误信息传播。未来的研究应探索多语言和多步骤防御措施，以及自适应缓解策略，以增强大语言模型在多样化的现实环境中的安全性。

[NLP-72] Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions EMNLP2023

【速读】：该论文试图解决在线平台内容审核过程缺乏透明度的问题，特别是在Wikipedia上，尽管审核讨论是公开的，但编辑们明确引用内容审核政策的情况较少（英语评论中仅20%，德语和土耳其语评论中仅2%）。解决方案的关键在于构建了一个多语言的Wikipedia编辑讨论数据集，该数据集包含了编辑的立场（保留、删除、合并、评论）、陈述的理由以及对应的内容审核政策。通过联合预测编辑立场和相应理由（政策），论文展示了高准确度的预测能力，从而增加了决策过程的透明度。此外，论文还发布了联合预测模型和多语言内容审核数据集，以促进自动化透明内容审核的进一步研究。

链接: https://arxiv.org/abs/2310.05779
作者: Lucie-Aimée Kaffee,Arnav Arora,Isabelle Augenstein
关键词-EN: online platforms, content moderation, moderation, content, Wikipedia editor discussions
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: This submission has been accepted to 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

点击查看摘要

Abstract:The moderation of content on online platforms is usually non-transparent. On Wikipedia, however, this discussion is carried out publicly and the editors are encouraged to use the content moderation policies as explanations for making moderation decisions. Currently, only a few comments explicitly mention those policies – 20% of the English ones, but as few as 2% of the German and Turkish comments. To aid in this process of understanding how content is moderated, we construct a novel multilingual dataset of Wikipedia editor discussions along with their reasoning in three languages. The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision. We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process. We release both our joint prediction models and the multilingual content moderation dataset for further research on automated transparent content moderation.
摘要：在线平台上的内容审核通常是不透明的。然而，在维基百科上，这一讨论是公开进行的，编辑们被鼓励使用内容审核政策作为做出审核决策的解释依据。目前，只有少数评论明确提及这些政策——英语评论中约20%，而德语和土耳其语评论中仅约2%。为了帮助理解内容是如何被审核的，我们构建了一个新颖的多语言数据集，包含了维基百科编辑在三种语言中的讨论及其推理过程。该数据集包含了编辑的立场（保留、删除、合并、评论）、所述理由以及每个编辑决策所依据的内容审核政策。我们展示了立场及其对应的理由（政策）可以以高准确度联合预测，从而增加了决策过程的透明度。我们同时发布了联合预测模型和多语言内容审核数据集，以供进一步研究自动化透明内容审核。

[NLP-73] DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

【速读】：该论文试图解决语音语言模型（SLM）中语音标记化的问题，特别是如何将音频信号有效地转换为SLM能够处理的标记，同时保持语音信息的丰富性和对输入变化的鲁棒性。解决方案的关键是提出了双码本说话人不变聚类（Double-Codebook Speaker-invariant Clustering, DC-Spin）方法，该方法通过提取富含音素信息且对输入变化具有鲁棒性的说话人不变标记，来改进语音标记化过程。DC-Spin采用分块方式处理数据，使得模型无需重新训练即可实现流式处理，且不会导致性能下降。此外，论文通过对比不同的标记化方法、模型可扩展性及下游任务表现，验证了易于被n-gram语言模型建模或与音素对齐的标记在SLM中的强大性能，为设计高效的语音标记器提供了重要见解。

链接: https://arxiv.org/abs/2410.24177
作者: Heng-Jui Chang,Hongyu Gong,Changhan Wang,James Glass,Yu-An Chung
关键词-EN: Spoken language models, gained increasing attention, Spoken language, decoder-only language models, decoder-only language
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Preprint

点击查看摘要

Abstract:Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
摘要：随着基于文本的解码器专用语言模型的进步，口语语言模型 (Spoken Language Models, SLMs) 受到了越来越多的关注。SLMs 处理文本和语音，能够同时实现语音理解和生成。本文介绍了双码本说话者不变聚类 (Double-Codebook Speaker-invariant Clustering, DC-Spin)，旨在通过连接音频信号和 SLM Token 来改进语音 Token 化。DC-Spin 提取了富含音素信息且对输入变化具有弹性的说话者不变 Token，从而增强了零样本 SLM 任务和语音重合成。我们提出了一种分块方法，使得 DC-Spin 无需重新训练且不会降级即可实现流式处理。通过比较 Token 化方法（自监督和神经音频编解码器）、模型可扩展性以及下游任务代理，我们发现那些容易被 n-gram 语言模型建模或与音素对齐的 Token 表现出强大的性能，为设计适用于 SLMs 的语音 Token 化器提供了见解。

[NLP-74] All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling

【速读】：该论文试图解决的问题是：在语言模型中普遍存在的线性特性（如“easy”和“easiest”的向量差与“lucky”和“luckiest”的向量差平行）是否可以通过可识别性来解释，即在一个模型中发现的线性特性是否必然存在于所有诱导相同分布的模型中。解决方案的关键在于：首先，通过证明一个可识别性结果来刻画分布等价的下一个词预测器，从而放宽了先前结果中的多样性要求；其次，基于关系线性性的细化（relational linearity），展示了多种线性概念如何适用于该分析；最后，证明了在适当条件下，这些线性特性在所有或无分布等价的下一个词预测器中都成立。

链接: https://arxiv.org/abs/2410.23501
作者: Emanuele Marconato,Sébastien Lachapelle,Sebastian Weichwald,Luigi Gresele
关键词-EN: vector difference, distribution-equivalent next-token predictors, language models, analyze identifiability, linear properties
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of “easy” and “easiest” being parallel to that between “lucky” and “luckiest”. For this, we ask whether finding a linear property in one model implies that any model that induces the same distribution has that property, too. To answer that, we first prove an identifiability result to characterize distribution-equivalent next-token predictors, lifting a diversity requirement of previous results. Second, based on a refinement of relational linearity [Paccanaro and Hinton, 2001; Hernandez et al., 2024], we show how many notions of linearity are amenable to our analysis. Finally, we show that under suitable conditions, these linear properties either hold in all or none distribution-equivalent next-token predictors.
摘要：我们分析了可识别性作为解释语言模型中普遍存在的线性特性的可能原因，例如“easy”和“easiest”的表示向量差与“lucky”和“luckiest”的表示向量差平行。为此，我们探讨在一个模型中发现线性特性是否意味着任何诱导相同分布的模型也具有该特性。为回答这一问题，我们首先证明了可识别性结果，以表征分布等价的下一个Token预测器，放宽了先前结果的多样性要求。其次，基于关系线性性的改进[Paccanaro和Hinton, 2001; Hernandez等, 2024]，我们展示了多种线性概念如何适用于我们的分析。最后，我们证明在适当条件下，这些线性特性在所有或无分布等价的下一个Token预测器中均成立。

[NLP-75] Exploiting Phonological Similarities between African Languages to achieve Speech to Speech Translation

【速读】：该论文试图解决在数据标注成本高或不切实际的情况下，利用非洲语言间的语言学相似性进行直接语音到语音翻译 (S2ST) 的问题。解决方案的关键在于提出了一种基于片段的模型，该模型通过映射同一语系内外的语音片段，有效消除了对大规模配对数据集的依赖。利用配对片段和引导扩散技术，该模型能够在数据集中任意两种语言之间进行翻译。实验结果表明，该模型在片段配对和翻译质量上表现出色，特别是在同一语系内的语言之间。此外，研究还发现片段长度显著影响翻译准确性，平均长度的片段配对质量最高。与传统的级联式自动语音识别-机器翻译 (ASR-MT) 技术相比，该模型在翻译性能上几乎相当，凸显了利用语言群体内的相似性进行高效S2ST的潜力，特别是在低资源语言环境中。

链接: https://arxiv.org/abs/2410.23323
作者: Peter Ochieng,Dennis Kaburu
关键词-EN: selected African languages, selected African, traditional data annotation, African languages, expensive or impractical
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents a pilot study on direct speech-to-speech translation (S2ST) by leveraging linguistic similarities among selected African languages within the same phylum, particularly in cases where traditional data annotation is expensive or impractical. We propose a segment-based model that maps speech segments both within and across language phyla, effectively eliminating the need for large paired datasets. By utilizing paired segments and guided diffusion, our model enables translation between any two languages in the dataset. We evaluate the model on a proprietary dataset from the Kenya Broadcasting Corporation (KBC), which includes five languages: Swahili, Luo, Kikuyu, Nandi, and English. The model demonstrates competitive performance in segment pairing and translation quality, particularly for languages within the same phylum. Our experiments reveal that segment length significantly influences translation accuracy, with average-length segments yielding the highest pairing quality. Comparative analyses with traditional cascaded ASR-MT techniques show that the proposed model delivers nearly comparable translation performance. This study underscores the potential of exploiting linguistic similarities within language groups to perform efficient S2ST, especially in low-resource language contexts.
摘要：本文介绍了一项关于直接语音到语音翻译（Speech-to-Speech Translation, S2ST）的试点研究，该研究利用了同一语系内选定的非洲语言之间的语言相似性，特别是在传统数据标注成本高昂或不切实际的情况下。我们提出了一种基于片段的模型，该模型能够在同一语系内及跨语系之间映射语音片段，从而有效消除了对大规模配对数据集的需求。通过利用配对片段和引导扩散技术，我们的模型能够在数据集中的任意两种语言之间进行翻译。我们在肯尼亚广播公司（KBC）提供的专有数据集上评估了该模型，该数据集包含五种语言：斯瓦希里语、卢奥语、基库尤语、南迪语和英语。模型在片段配对和翻译质量方面表现出竞争性性能，特别是在同一语系内的语言之间。我们的实验表明，片段长度显著影响翻译准确性，平均长度的片段在配对质量上表现最佳。与传统的级联式自动语音识别-机器翻译（ASR-MT）技术相比，所提出的模型在翻译性能上几乎相当。本研究强调了利用语言群组内的语言相似性进行高效S2ST的潜力，特别是在低资源语言环境中。

人工智能

[AI-0] Bridging Geometric States via Geometric Diffusion Bridge NEURIPS2024

链接: https://arxiv.org/abs/2410.24220
作者: Shengjie Luo,Yixian Xu,Di He,Shuxin Zheng,Tie-Yan Liu,Liwei Wang
关键词-EN: geometric states, advancing scientific domains, target geometric states, geometric, complex systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 33 pages, 5 tables; NeurIPS 2024 Camera Ready version

点击查看摘要

Abstract:The accurate prediction of geometric state evolution in complex systems is critical for advancing scientific domains such as quantum chemistry and material modeling. Traditional experimental and computational methods face challenges in terms of environmental constraints and computational demands, while current deep learning approaches still fall short in terms of precision and generality. In this work, we introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states. GDB leverages a probabilistic approach to evolve geometric state distributions, employing an equivariant diffusion bridge derived by a modified version of Doob’s h -transform for connecting geometric states. This tailored diffusion process is anchored by initial and target geometric states as fixed endpoints and governed by equivariant transition kernels. Moreover, trajectory data can be seamlessly leveraged in our GDB framework by using a chain of equivariant diffusion bridges, providing a more detailed and accurate characterization of evolution dynamics. Theoretically, we conduct a thorough examination to confirm our framework’s ability to preserve joint distributions of geometric states and capability to completely model the underlying dynamics inducing trajectory distributions with negligible error. Experimental evaluations across various real-world scenarios show that GDB surpasses existing state-of-the-art approaches, opening up a new pathway for accurately bridging geometric states and tackling crucial scientific challenges with improved accuracy and applicability.

[AI-1] Understanding Optimization in Deep Learning with Central Flows

链接: https://arxiv.org/abs/2410.24206
作者: Jeremy M. Cohen,Alex Damian,Ameet Talwalkar,Zico Kolter,Jason D. Lee
关键词-EN: remains poorly understood, learning remains poorly, poorly understood, setting of deterministic, remains poorly
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: first two authors contributed equally; author order determined by coin flip

点击查看摘要

Abstract:Optimization in deep learning remains poorly understood, even in the simple setting of deterministic (i.e. full-batch) training. A key difficulty is that much of an optimizer’s behavior is implicitly determined by complex oscillatory dynamics, referred to as the “edge of stability.” The main contribution of this paper is to show that an optimizer’s implicit behavior can be explicitly captured by a “central flow:” a differential equation which models the time-averaged optimization trajectory. We show that these flows can empirically predict long-term optimization trajectories of generic neural networks with a high degree of numerical accuracy. By interpreting these flows, we reveal for the first time 1) the precise sense in which RMSProp adapts to the local loss landscape, and 2) an “acceleration via regularization” mechanism, wherein adaptive optimizers implicitly navigate towards low-curvature regions in which they can take larger steps. This mechanism is key to the efficacy of these adaptive optimizers. Overall, we believe that central flows constitute a promising tool for reasoning about optimization in deep learning.

[AI-2] Zonal RL-RRT: Integrated RL-RRT Path Planning with Collision Probability and Zone Connectivity

链接: https://arxiv.org/abs/2410.24205
作者: AmirMohammad Tahmasbi,MohammadSaleh Faghfoorian,Saeed Khodaygan,Aniket Bera
关键词-EN: poses significant challenges, high-dimensional spaces poses, spaces poses significant, significant challenges, fair success rate
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Path planning in high-dimensional spaces poses significant challenges, particularly in achieving both time efficiency and a fair success rate. To address these issues, we introduce a novel path-planning algorithm, Zonal RL-RRT, that leverages kd-tree partitioning to segment the map into zones while addressing zone connectivity, ensuring seamless transitions between zones. By breaking down the complex environment into multiple zones and using Q-learning as the high-level decision-maker, our algorithm achieves a 3x improvement in time efficiency compared to basic sampling methods such as RRT and RRT* in forest-like maps. Our approach outperforms heuristic-guided methods like BIT* and Informed RRT* by 1.5x in terms of runtime while maintaining robust and reliable success rates across 2D to 6D environments. Compared to learning-based methods like NeuralRRT* and MPNetSMP, as well as the heuristic RRT*J, our algorithm demonstrates, on average, 1.5x better performance in the same environments. We also evaluate the effectiveness of our approach through simulations of the UR10e arm manipulator in the MuJoCo environment. A key observation of our approach lies in its use of zone partitioning and Reinforcement Learning (RL) for adaptive high-level planning allowing the algorithm to accommodate flexible policies across diverse environments, making it a versatile tool for advanced path planning.

[AI-3] DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion NEURIPS2024

链接: https://arxiv.org/abs/2410.24203
作者: Weicai Ye,Chenhao Ji,Zheng Chen,Junyao Gao,Xiaoshui Huang,Song-Hai Zhang,Wanli Ouyang,Tong He,Cairong Zhao,Guofeng Zhang
关键词-EN: achieved remarkable achievements, Diffusion-based methods, images remains constrained, remains constrained, methods have achieved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
*备注: NeurIPS2024, Project: this https URL Code: this https URL

点击查看摘要

Abstract:Diffusion-based methods have achieved remarkable achievements in 2D image or 3D object generation, however, the generation of 3D scenes and even 360^\circ images remains constrained, due to the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. Then, we propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of stable diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images with given unseen text descriptions and camera poses.

[AI-4] Chasing Better Deep Image Priors between Over- and Under-parameterization

链接: https://arxiv.org/abs/2410.24187
作者: Qiming Wu,Xiaohan Chen,Yifan Jiang,Zhangyang Wang
关键词-EN: Deep Neural Networks, Neural Networks, Deep Neural, image priors, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Codes are available at this https URL

点击查看摘要

Abstract:Deep Neural Networks (DNNs) are well-known to act as over-parameterized deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, under-parameterized image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes push us to think whether there exists a better solution in the middle: between over- and under-parameterized image priors, can one identify “intermediate” parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability? Drawing inspirations from the lottery ticket hypothesis (LTH), we conjecture and study a novel “lottery image prior” (LIP) by exploiting DNN inherent sparsity, stated as: given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN’s performance when being applied as a prior to various image inverse problems. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a pre-trained GAN generator is used as the prior (in contrast to untrained DIP or deep decoder), and confirm its validity in this setting too. To our best knowledge, this is the first time that LTH is demonstrated to be relevant in the context of inverse problems or image priors.

[AI-5] DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning

链接: https://arxiv.org/abs/2410.24185
作者: Zhenyu Jiang,Yuqi Xie,Kevin Lin,Zhenjia Xu,Weikang Wan,Ajay Mandlekar,Linxi Fan,Yuke Zhu
关键词-EN: Imitation learning, robots manipulation skills, data, teach robots manipulation, data generation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Imitation learning from human demonstrations is an effective means to teach robots manipulation skills. But data acquisition is a major bottleneck in applying this paradigm more broadly, due to the amount of cost and human effort involved. There has been significant interest in imitation learning for bimanual dexterous robots, like humanoids. Unfortunately, data collection is even more challenging here due to the challenges of simultaneously controlling multiple arms and multi-fingered hands. Automated data generation in simulation is a compelling, scalable alternative to fuel this need for data. To this end, we introduce DexMimicGen, a large-scale automated data generation system that synthesizes trajectories from a handful of human demonstrations for humanoid robots with dexterous hands. We present a collection of simulation environments in the setting of bimanual dexterous manipulation, spanning a range of manipulation behaviors and different requirements for coordination among the two arms. We generate 21K demos across these tasks from just 60 source human demos and study the effect of several data generation and policy learning decisions on agent performance. Finally, we present a real-to-sim-to-real pipeline and deploy it on a real-world humanoid can sorting task. Videos and more are at this https URL

[AI-6] Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing

链接: https://arxiv.org/abs/2410.24119
作者: Akash Dhruv,Anshu Dubey
关键词-EN: generative artificial intelligence, artificial intelligence, emergence of foundational, foundational models, models and generative
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The emergence of foundational models and generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing, especially in code development, refactoring, and translating from one programming language to another. However, because the output of GenAI cannot be guaranteed to be correct, manual intervention remains necessary. Some of this intervention can be automated through task-specific tools, alongside additional methodologies for correctness verification and effective prompt development. We explored the application of GenAI in assisting with code translation, language interoperability, and codebase inspection within a legacy Fortran codebase used to simulate particle interactions at the Large Hadron Collider (LHC). In the process, we developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion. In this paper, we demonstrate how CodeScribe assists in converting Fortran code to C++, generating Fortran-C APIs for integrating legacy systems with modern C++ libraries, and providing developer support for code organization and algorithm implementation. We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing workflows.

[AI-7] AIDOVECL: AI-generated Dataset of Outpainted Vehicles for Eye-level Classification and Localization

链接: https://arxiv.org/abs/2410.24116
作者: Amir Kazemi,Qurat ul ain Fatima,Volodymyr Kindratenko,Christopher Tessum
关键词-EN: computer vision technologies, learning models due, vision technologies, critical bottleneck, development of computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Image labeling is a critical bottleneck in the development of computer vision technologies, often constraining the potential of machine learning models due to the time-intensive nature of manual annotations. This work introduces a novel approach that leverages outpainting to address the problem of annotated data scarcity by generating artificial contexts and annotations, significantly reducing manual labeling efforts. We apply this technique to a particularly acute challenge in autonomous driving, urban planning, and environmental monitoring: the lack of diverse, eye-level vehicle images in desired classes. Our dataset comprises AI-generated vehicle images obtained by detecting and cropping vehicles from manually selected seed images, which are then outpainted onto larger canvases to simulate varied real-world conditions. The outpainted images include detailed annotations, providing high-quality ground truth data. Advanced outpainting techniques and image quality assessments ensure visual fidelity and contextual relevance. Augmentation with outpainted vehicles improves overall performance metrics by up to 8% and enhances prediction of underrepresented classes by up to 20%. This approach, exemplifying outpainting as a self-annotating paradigm, presents a solution that enhances dataset versatility across multiple domains of machine learning. The code and links to datasets used in this study are available for further research and replication at this https URL.

[AI-8] Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers NEURIPS2024

链接: https://arxiv.org/abs/2410.24108
作者: Kai Yan,Alexander G. Schwing,Yu-Xiong Wang
关键词-EN: offline Reinforcement Learning, Reinforcement Learning, Decision Transformers, Online Decision Transformer, offline Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted as NeurIPS 2024 spotlight. 33 pages, 26 figures

点击查看摘要

Abstract:Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, online finetuning of decision transformers has been surprisingly under-explored. The widely adopted state-of-the-art Online Decision Transformer (ODT) still struggles when pretrained with low-reward offline data. In this paper, we theoretically analyze the online-finetuning of the decision transformer, showing that the commonly used Return-To-Go (RTG) that’s far from the expected return hampers the online fine-tuning process. This problem, however, is well-addressed by the value function and advantage of standard RL algorithms. As suggested by our analysis, in our experiments, we hence find that simply adding TD3 gradients to the finetuning process of ODT effectively improves the online finetuning performance of ODT, especially if ODT is pretrained with low-reward offline data. These findings provide new directions to further improve decision transformers.

[AI-9] 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing

链接: https://arxiv.org/abs/2410.24091
作者: Binghao Huang,Yixuan Wang,Xinyi Yang,Yiyue Luo,Yunzhu Li
关键词-EN: crucial for humans, perform fine-grained interactions, multi-modal sensing, visual perception, fine-grained interactions
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at Conference on Robot Learning (CoRL) 2024

点击查看摘要

Abstract:Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces \textbf3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of 3 mm^2 . These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safe interactions with fragile items and executing long-horizon tasks involving in-hand manipulation. Our project page is available at \urlthis https URL.

[AI-10] Graph Learning for Numeric Planning NEURIPS2024

链接: https://arxiv.org/abs/2410.24080
作者: Dillon Z. Chen,Sylvie Thiébaux
关键词-EN: exploit relational structures, relational structures exhibited, object-centric planning due, input planning instances, numbers of objects
类目: Artificial Intelligence (cs.AI)
*备注: Extended version of NeurIPS 2024 paper

点击查看摘要

Abstract:Graph learning is naturally well suited for use in symbolic, object-centric planning due to its ability to exploit relational structures exhibited in planning domains and to take as input planning instances with arbitrary numbers of objects. Numeric planning is an extension of symbolic planning in which states may now also exhibit numeric variables. In this work, we propose data-efficient and interpretable machine learning models for learning to solve numeric planning tasks. This involves constructing a new graph kernel for graphs with both continuous and categorical attributes, as well as new optimisation methods for learning heuristic functions for numeric planning. Experiments show that our graph kernels are vastly more efficient and generalise better than graph neural networks for numeric planning, and also yield competitive coverage performance compared to domain-independent numeric planners. Code is available at this https URL

[AI-11] Dynamical similarity analysis uniquely captures how computations develop in RNNs

链接: https://arxiv.org/abs/2410.24070
作者: Quentin Guilhot,Jascha Achterberg,Michał Wójcik,Rui Ponte Costa
关键词-EN: increasingly popular tools, Methods for analyzing, mechanistic interpretability, systems are increasingly, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Methods for analyzing representations in neural systems are increasingly popular tools in neuroscience and mechanistic interpretability. Measures comparing neural activations across conditions, architectures, and species give scalable ways to understand information transformation within different neural networks. However, recent findings show that some metrics respond to spurious signals, leading to misleading results. Establishing benchmark test cases is thus essential for identifying the most reliable metric and potential improvements. We propose that compositional learning in recurrent neural networks (RNNs) can provide a test case for dynamical representation alignment metrics. Implementing this case allows us to evaluate if metrics can identify representations that develop throughout learning and determine if representations identified by metrics reflect the network’s actual computations. Building both attractor and RNN based test cases, we show that the recently proposed Dynamical Similarity Analysis (DSA) is more noise robust and reliably identifies behaviorally relevant representations compared to prior metrics (Procrustes, CKA). We also demonstrate how such test cases can extend beyond metric evaluation to study new architectures. Specifically, testing DSA in modern (Mamba) state space models suggests that these models, unlike RNNs, may not require changes in recurrent dynamics due to their expressive hidden states. Overall, we develop test cases that showcase how DSA’s enhanced ability to detect dynamical motifs makes it highly effective for identifying ongoing computations in RNNs and revealing how networks learn tasks.

[AI-12] Identifying General Mechanism Shifts in Linear Causal Representations

链接: https://arxiv.org/abs/2410.24059
作者: Tianyu Chen,Kevin Bello,Francesco Locatello,Bryon Aragam,Pradeep Ravikumar
关键词-EN: linear causal representation, linear structural causal, structural causal model, causal representation learning, unknown latent factors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeuIPS 2024

点击查看摘要

Abstract:We consider the linear causal representation learning setting where we observe a linear mixing of d unknown latent factors, which follow a linear structural causal model. Recent work has shown that it is possible to recover the latent factors as well as the underlying structural causal model over them, up to permutation and scaling, provided that we have at least d environments, each of which corresponds to perfect interventions on a single latent node (factor). After this powerful result, a key open problem faced by the community has been to relax these conditions: allow for coarser than perfect single-node interventions, and allow for fewer than d of them, since the number of latent factors d could be very large. In this work, we consider precisely such a setting, where we allow a smaller than d number of environments, and also allow for very coarse interventions that can very coarsely \textitchange the entire causal graph over the latent factors. On the flip side, we relax what we wish to extract to simply the \textitlist of nodes that have shifted between one or more environments. We provide a surprising identifiability result that it is indeed possible, under some very mild standard assumptions, to identify the set of shifted nodes. Our identifiability proof moreover is a constructive one: we explicitly provide necessary and sufficient conditions for a node to be a shifted node, and show that we can check these conditions given observed data. Our algorithm lends itself very naturally to the sample setting where instead of just interventional distributions, we are provided datasets of samples from each of these distributions. We corroborate our results on both synthetic experiments as well as an interesting psychometric dataset. The code can be found at this https URL.

[AI-13] State- and context-dependent robotic manipulation and grasping via uncertainty-aware imitation learning

链接: https://arxiv.org/abs/2410.24035
作者: Tim R. Winter,Ashok M. Sundaram,Werner Friedl,Maximo A. Roa,Freek Stulp,João Silvério
关键词-EN: Generating context-adaptive manipulation, Generating context-adaptive, context-adaptive manipulation, external variables, Generating
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating context-adaptive manipulation and grasping actions is a challenging problem in robotics. Classical planning and control algorithms tend to be inflexible with regard to parameterization by external variables such as object shapes. In contrast, Learning from Demonstration (LfD) approaches, due to their nature as function approximators, allow for introducing external variables to modulate policies in response to the environment. In this paper, we utilize this property by introducing an LfD approach to acquire context-dependent grasping and manipulation strategies. We treat the problem as a kernel-based function approximation, where the kernel inputs include generic context variables describing task-dependent parameters such as the object shape. We build on existing work on policy fusion with uncertainty quantification to propose a state-dependent approach that automatically returns to demonstrations, avoiding unpredictable behavior while smoothly adapting to context changes. The approach is evaluated against the LASA handwriting dataset and on a real 7-DoF robot in two scenarios: adaptation to slippage while grasping and manipulating a deformable food item.

[AI-14] A Multi-Modal Approach for Face Anti-Spoofing in Non-Calibrated Systems using Disparity Maps

链接: https://arxiv.org/abs/2410.24031
作者: Ariel Larey,Eyal Rond,Omer Achrack
关键词-EN: Face recognition technologies, face spoofing attacks, Face recognition, face spoofing, recognition technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Face recognition technologies are increasingly used in various applications, yet they are vulnerable to face spoofing attacks. These spoofing attacks often involve unique 3D structures, such as printed papers or mobile device screens. Although stereo-depth cameras can detect such attacks effectively, their high-cost limits their widespread adoption. Conversely, two-sensor systems without extrinsic calibration offer a cost-effective alternative but are unable to calculate depth using stereo techniques. In this work, we propose a method to overcome this challenge by leveraging facial attributes to derive disparity information and estimate relative depth for anti-spoofing purposes, using non-calibrated systems. We introduce a multi-modal anti-spoofing model, coined Disparity Model, that incorporates created disparity maps as a third modality alongside the two original sensor modalities. We demonstrate the effectiveness of the Disparity Model in countering various spoof attacks using a comprehensive dataset collected from the Intel RealSense ID Solution F455. Our method outperformed existing methods in the literature, achieving an Equal Error Rate (EER) of 1.71% and a False Negative Rate (FNR) of 2.77% at a False Positive Rate (FPR) of 1%. These errors are lower by 2.45% and 7.94% than the errors of the best comparison method, respectively. Additionally, we introduce a model ensemble that addresses 3D spoof attacks as well, achieving an EER of 2.04% and an FNR of 3.83% at an FPR of 1%. Overall, our work provides a state-of-the-art solution for the challenging task of anti-spoofing in non-calibrated systems that lack depth information.

[AI-15] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

链接: https://arxiv.org/abs/2410.24024
作者: Yifan Xu,Xiao Liu,Xueqiao Sun,Siyi Cheng,Hao Yu,Hanyu Lai,Shudan Zhang,Dan Zhang,Jie Tang,Yuxiao Dong
关键词-EN: Autonomous agents, real world, increasingly important, important for interacting, Android agents
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been recently a frequently-mentioned interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at \urlthis https URL.

[AI-16] Assessing the Impact of Packing on Machine Learning-Based Malware Detection and Classification Systems

链接: https://arxiv.org/abs/2410.24017
作者: Daniel Gibert,Nikolaos Totosis,Constantinos Patsakis,Giulio Zizzo,Quan Le
关键词-EN: signature-based malware detection, significant challenge, malware detection, static machine learning-based, malware
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The proliferation of malware, particularly through the use of packing, presents a significant challenge to static analysis and signature-based malware detection techniques. The application of packing to the original executable code renders extracting meaningful features and signatures challenging. To deal with the increasing amount of malware in the wild, researchers and anti-malware companies started harnessing machine learning capabilities with very promising results. However, little is known about the effects of packing on static machine learning-based malware detection and classification systems. This work addresses this gap by investigating the impact of packing on the performance of static machine learning-based models used for malware detection and classification, with a particular focus on those using visualisation techniques. To this end, we present a comprehensive analysis of various packing techniques and their effects on the performance of machine learning-based detectors and classifiers. Our findings highlight the limitations of current static detection and classification systems and underscore the need to be proactive to effectively counteract the evolving tactics of malware authors.

[AI-17] An Information Criterion for Controlled Disentanglement of Multimodal Data

链接: https://arxiv.org/abs/2410.23996
作者: Chenyu Wang,Sharut Gupta,Xinyi Zhang,Sana Tonekaboni,Stefanie Jegelka,Tommi Jaakkola,Caroline Uhler
关键词-EN: Multimodal representation learning, decompose information inherent, Multimodal representation, seeks to relate, relate and decompose
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities. By disentangling modality-specific information from information that is shared across modalities, we can improve interpretability and robustness and enable downstream tasks such as the generation of counterfactual outcomes. Separating the two types of information is challenging since they are often deeply entangled in many real-world applications. We propose Disentangled Self-Supervised Learning (DisentangledSSL), a novel self-supervised approach for learning disentangled representations. We present a comprehensive analysis of the optimality of each disentangled representation, particularly focusing on the scenario not covered in prior work where the so-called Minimum Necessary Information (MNI) point is not attainable. We demonstrate that DisentangledSSL successfully learns shared and modality-specific features on multiple synthetic and real-world datasets and consistently outperforms baselines on various downstream tasks, including prediction tasks for vision-language data, as well as molecule-phenotype retrieval tasks for biological data.

[AI-18] Localization balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images

链接: https://arxiv.org/abs/2410.23991
作者: Yakun Xie,Suning Liu,Hongyu Chen,Shaohan Cao,Huixin Zhang,Dejun Feng,Qian Wan,Jun Zhu,Qing Zhu
关键词-EN: remote sensing images, optical remote sensing, intricate edge structures, challenges persist due, salient object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite significant advancements in salient object detection(SOD) in optical remote sensing images(ORSI), challenges persist due to the intricate edge structures of ORSIs and the complexity of their contextual relationships. Current deep learning approaches encounter difficulties in accurately identifying boundary features and lack efficiency in collaboratively modeling the foreground and background by leveraging contextual features. To address these challenges, we propose a stronger multifaceted collaborative salient object detector in ORSIs, termed LBA-MCNet, which incorporates aspects of localization, balance, and affinity. The network focuses on accurately locating targets, balancing detailed features, and modeling image-level global context information. Specifically, we design the Edge Feature Adaptive Balancing and Adjusting(EFABA) module for precise edge localization, using edge features to guide attention to boundaries and preserve spatial details. Moreover, we design the Global Distributed Affinity Learning(GDAL) module to model global context. It captures global context by generating an affinity map from the encoders final layer, ensuring effective modeling of global patterns. Additionally, deep supervision during deconvolution further enhances feature representation. Finally, we compared with 28 state of the art approaches on three publicly available datasets. The results clearly demonstrate the superiority of our method.

[AI-19] Average Controlled and Average Natural Micro Direct Effects in Summary Causal Graphs

链接: https://arxiv.org/abs/2410.23975
作者: Simon Ferreira,Charles K. Assaad
关键词-EN: omitted temporal information, temporal information complicate, micro direct effect, complicate causal inference, causal systems represented
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:In this paper, we investigate the identifiability of average controlled direct effects and average natural direct effects in causal systems represented by summary causal graphs, which are abstractions of full causal graphs, often used in dynamic systems where cycles and omitted temporal information complicate causal inference. Unlike in the traditional linear setting, where direct effects are typically easier to identify and estimate, non-parametric direct effects, which are crucial for handling real-world complexities, particularly in epidemiological contexts where relationships between variables (e.g, genetic, environmental, and behavioral factors) are often non-linear, are much harder to define and identify. In particular, we give sufficient conditions for identifying average controlled micro direct effect and average natural micro direct effect from summary causal graphs in the presence of hidden confounding. Furthermore, we show that the conditions given for the average controlled micro direct effect become also necessary in the setting where there is no hidden confounding and where we are only interested in identifiability by adjustment.

[AI-20] Image Synthesis with Class-Aware Semantic Diffusion Models for Surgical Scene Segmentation

链接: https://arxiv.org/abs/2410.23962
作者: Yihang Zhou,Rebecca Towning,Zaid Awad,Stamatia Giannarou
关键词-EN: frequently compromised, Semantic Diffusion Model, enhancing surgical precision, Class-Aware Semantic Diffusion, segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Surgical scene segmentation is essential for enhancing surgical precision, yet it is frequently compromised by the scarcity and imbalance of available data. To address these challenges, semantic image synthesis methods based on generative adversarial networks and diffusion models have been developed. However, these models often yield non-diverse images and fail to capture small, critical tissue classes, limiting their effectiveness. In response, we propose the Class-Aware Semantic Diffusion Model (CASDM), a novel approach which utilizes segmentation maps as conditions for image synthesis to tackle data scarcity and imbalance. Novel class-aware mean squared error and class-aware self-perceptual loss functions have been defined to prioritize critical, less visible classes, thereby enhancing image quality and relevance. Furthermore, to our knowledge, we are the first to generate multi-class segmentation maps using text prompts in a novel fashion to specify their contents. These maps are then used by CASDM to generate surgical scene images, enhancing datasets for training and validating segmentation models. Our evaluation, which assesses both image quality and downstream segmentation performance, demonstrates the strong effectiveness and generalisability of CASDM in producing realistic image-map pairs, significantly advancing surgical scene segmentation across diverse and challenging datasets.

[AI-21] owards Fast Algorithms for the Preference Consistency Problem Based on Hierarchical Models WWW IJCAI IJCAI’16

链接: https://arxiv.org/abs/2410.23934
作者: Anne-Marie George,Nic Wilson,Barry O’Sullivan
关键词-EN: Preference Consistency Problem, Preference Consistency, compare algorithmic approaches, preference statements, preference statements based
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: Longer Version of IJCAI’16 publication this https URL

点击查看摘要

Abstract:In this paper, we construct and compare algorithmic approaches to solve the Preference Consistency Problem for preference statements based on hierarchical models. Instances of this problem contain a set of preference statements that are direct comparisons (strict and non-strict) between some alternatives, and a set of evaluation functions by which all alternatives can be rated. An instance is consistent based on hierarchical preference models, if there exists an hierarchical model on the evaluation functions that induces an order relation on the alternatives by which all relations given by the preference statements are satisfied. Deciding if an instance is consistent is known to be NP-complete for hierarchical models. We develop three approaches to solve this decision problem. The first involves a Mixed Integer Linear Programming (MILP) formulation, the other two are recursive algorithms that are based on properties of the problem by which the search space can be pruned. Our experiments on synthetic data show that the recursive algorithms are faster than solving the MILP formulation and that the ratio between the running times increases extremely quickly.

[AI-22] ransformer-based Model Predictive Control: Trajectory Optimization via Sequence Modeling

链接: https://arxiv.org/abs/2410.23916
作者: Davide Celestini,Daniele Gammelli,Tommaso Guffanti,Simone D’Amico,Elisa Capello,Marco Pavone
关键词-EN: enabling general-purpose robot, diverse real-world scenarios, general-purpose robot autonomy, Model predictive control, predictive control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 8 pages, 7 figures. Datasets, videos and code available at: this https URL

点击查看摘要

Abstract:Model predictive control (MPC) has established itself as the primary methodology for constrained control, enabling general-purpose robot autonomy in diverse real-world scenarios. However, for most problems of interest, MPC relies on the recursive solution of highly non-convex trajectory optimization problems, leading to high computational complexity and strong dependency on initialization. In this work, we present a unified framework to combine the main strengths of optimization-based and learning-based methods for MPC. Our approach entails embedding high-capacity, transformer-based neural network models within the optimization process for trajectory generation, whereby the transformer provides a near-optimal initial guess, or target plan, to a non-convex optimization problem. Our experiments, performed in simulation and the real world onboard a free flyer platform, demonstrate the capabilities of our framework to improve MPC convergence and runtime. Compared to purely optimization-based approaches, results show that our approach can improve trajectory generation performance by up to 75%, reduce the number of solver iterations by up to 45%, and improve overall MPC runtime by 7x without loss in performance.

[AI-23] Efficient Inference and Computation of Optimal Alternatives for Preference Languages Based On Lexicographic Models IJCAI’17 IJCAI WWW

链接: https://arxiv.org/abs/2410.23913
作者: Nic Wilson,Anne-Marie George
关键词-EN: general preference languages, preference languages based, analyse preference inference, lexicographic models, based on lexicographic
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注: Longer Version of IJCAI’17 publication this https URL

点击查看摘要

Abstract:We analyse preference inference, through consistency, for general preference languages based on lexicographic models. We identify a property, which we call strong compositionality, that applies for many natural kinds of preference statement, and that allows a greedy algorithm for determining consistency of a set of preference statements. We also consider different natural definitions of optimality, and their relations to each other, for general preference languages based on lexicographic models. Based on our framework, we show that testing consistency, and thus inference, is polynomial for a specific preference language LpqT, which allows strict and non-strict statements, comparisons between outcomes and between partial tuples, both ceteris paribus and strong statements, and their combination. Computing different kinds of optimal sets is also shown to be polynomial; this is backed up by our experimental results.

[AI-24] RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner

链接: https://arxiv.org/abs/2410.23912
作者: Fu-Chieh Chang,Yu-Ting Lee,Hui-Ying Shih,Pei-Yuan Wu
关键词-EN: solve complex tasks, large language models, stepwise manner, abilities of large, large language
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, allowing models to solve complex tasks in a stepwise manner. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The self-taught reasoner (STaR) framework addresses this by using reinforcement learning to automatically generate reasoning steps, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements is lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR. Our contributions are: (1) an analysis of policy improvement, showing why LLM reasoning improves iteratively with STaR; (2) conditions for convergence to an optimal reasoning policy; (3) an examination of STaR’s robustness, explaining how it can improve reasoning even when incorporating occasional incorrect steps; and (4) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches for reasoning in LLMs.

[AI-25] Neural Network Verification with PyRAT

链接: https://arxiv.org/abs/2410.23903
作者: Augustin Lemesle,Julien Lehmann,Tristan Le Gall
关键词-EN: critical domains, neural networks, neural network starting, neural, ensure safety guarantees
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As AI systems are becoming more and more popular and used in various critical domains (health, transport, energy, …), the need to provide guarantees and trust of their safety is undeniable. To this end, we present PyRAT, a tool based on abstract interpretation to verify the safety and the robustness of neural networks. In this paper, we describe the different abstractions used by PyRAT to find the reachable states of a neural network starting from its input as well as the main features of the tool to provide fast and accurate analysis of neural networks. PyRAT has already been used in several collaborations to ensure safety guarantees, with its second place at the VNN-Comp 2024 showcasing its performance.

[AI-26] AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery NEURIPS2024

链接: https://arxiv.org/abs/2410.23891
作者: Hangyu Zhou,Chia-Hsiang Kao,Cheng Perng Phoo,Utkarsh Mall,Bharath Hariharan,Kavita Bala
关键词-EN: satellite imagery pose, downstream applications, pose a significant, cloud removal, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024 Datasets and Benchmarks Track. Code and data available at this https URL

点击查看摘要

Abstract:Clouds in satellite imagery pose a significant challenge for downstream applications. A major challenge in current cloud removal research is the absence of a comprehensive benchmark and a sufficiently large and diverse training dataset. To address this problem, we introduce the largest public dataset – \textitAllClear for cloud removal, featuring 23,742 globally distributed regions of interest (ROIs) with diverse land-use patterns, comprising 4 million images in total. Each ROI includes complete temporal captures from the year 2022, with (1) multi-spectral optical imagery from Sentinel-2 and Landsat 8/9, (2) synthetic aperture radar (SAR) imagery from Sentinel-1, and (3) auxiliary remote sensing products such as cloud masks and land cover maps. We validate the effectiveness of our dataset by benchmarking performance, demonstrating the scaling law – the PSNR rises from 28.47 to 33.87 with 30\times more data, and conducting ablation studies on the temporal length and the importance of individual modalities. This dataset aims to provide comprehensive coverage of the Earth’s surface and promote better cloud removal results.

[AI-27] GEPS: Boosting Generalization in Parametric PDE Neural Solvers through Adaptive Conditioning

链接: https://arxiv.org/abs/2410.23889
作者: Armand Kassaï Koupaï,Jorge Misfut Benet,Yuan Yin,Jean-Noël Vittaut,Patrick Gallinari
关键词-EN: Solving parametric partial, partial differential equations, presents significant challenges, parametric partial differential, Solving parametric
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Solving parametric partial differential equations (PDEs) presents significant challenges for data-driven methods due to the sensitivity of spatio-temporal dynamics to variations in PDE parameters. Machine learning approaches often struggle to capture this variability. To address this, data-driven approaches learn parametric PDEs by sampling a very large variety of trajectories with varying PDE parameters. We first show that incorporating conditioning mechanisms for learning parametric PDEs is essential and that among them, \textitadaptive conditioning , allows stronger generalization. As existing adaptive conditioning methods do not scale well with respect to the number of parameters to adapt in the neural solver, we propose GEPS, a simple adaptation mechanism to boost GEneralization in Pde Solvers via a first-order optimization and low-rank rapid adaptation of a small set of context parameters. We demonstrate the versatility of our approach for both fully data-driven and for physics-aware neural solvers. Validation performed on a whole range of spatio-temporal forecasting problems demonstrates excellent performance for generalizing to unseen conditions including initial conditions, PDE coefficients, forcing terms and solution domain. \textitProject page : this https URL

[AI-28] Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs

链接: https://arxiv.org/abs/2410.23875
作者: Liyi Chen,Panrong Tong,Zhongming Jin,Ying Sun,Jieping Ye,Hui Xiong
关键词-EN: Large Language Models, Large Language, Language Models, remarkable reasoning capabilities, shown remarkable reasoning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable reasoning capabilities on complex tasks, but they still suffer from out-of-date knowledge, hallucinations, and opaque decision-making. In contrast, Knowledge Graphs (KGs) can provide explicit and editable knowledge for LLMs to alleviate these issues. Existing paradigm of KG-augmented LLM manually predefines the breadth of exploration space and requires flawless navigation in KGs. However, this paradigm cannot adaptively explore reasoning paths in KGs based on the question semantics and self-correct erroneous reasoning paths, resulting in a bottleneck in efficiency and effect. To address these limitations, we propose a novel self-correcting adaptive planning paradigm for KG-augmented LLM named Plan-on-Graph (PoG), which first decomposes the question into several sub-objectives and then repeats the process of adaptively exploring reasoning paths, updating memory, and reflecting on the need to self-correct erroneous reasoning paths until arriving at the answer. Specifically, three important mechanisms of Guidance, Memory, and Reflection are designed to work together, to guarantee the adaptive breadth of self-correcting planning for graph reasoning. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of PoG.

[AI-29] RAGraph: A General Retrieval-Augmented Graph Learning Framework NEURIPS2024

链接: https://arxiv.org/abs/2410.23855
作者: Xinke Jiang,Rihong Qiu,Yongxin Xu,Wentao Zhang,Yichen Zhu,Ruizhe Zhang,Yuchen Fang,Xu Chu,Junfeng Zhao,Yasha Wang
关键词-EN: Graph Neural Networks, Neural Networks, interpreting relational data, Graph Neural, training instances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become essential in interpreting relational data across various domains, yet, they often struggle to generalize to unseen graph data that differs markedly from training instances. In this paper, we introduce a novel framework called General Retrieval-Augmented Graph Learning (RAGraph), which brings external graph data into the general graph foundation model to improve model generalization on unseen scenarios. On the top of our framework is a toy graph vector library that we established, which captures key attributes, such as features and task-specific label information. During inference, the RAGraph adeptly retrieves similar toy graphs based on key similarities in downstream tasks, integrating the retrieved data to enrich the learning context via the message-passing prompting mechanism. Our extensive experimental evaluations demonstrate that RAGraph significantly outperforms state-of-the-art graph learning methods in multiple tasks such as node classification, link prediction, and graph classification across both dynamic and static datasets. Furthermore, extensive testing confirms that RAGraph consistently maintains high performance without the need for task-specific fine-tuning, highlighting its adaptability, robustness, and broad applicability.

[AI-30] Auditing Googles Search Algorithm: Measuring News Diversity Across Brazil the UK and the US

链接: https://arxiv.org/abs/2410.23842
作者: Raphael Hernandes,Giulio Corsi
关键词-EN: examines the influence, Google system preferentially, Google, Brazil, Google search algorithm
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 21 pages, 3 figures, 7 tables

点击查看摘要

Abstract:This study examines the influence of Google’s search algorithm on news diversity by analyzing search results in Brazil, the UK, and the US. It explores how Google’s system preferentially favors a limited number of news outlets. Utilizing algorithm auditing techniques, the research measures source concentration with the Herfindahl-Hirschman Index (HHI) and Gini coefficient, revealing significant concentration trends. The study underscores the importance of conducting horizontal analyses across multiple search queries, as focusing solely on individual results pages may obscure these patterns. Factors such as popularity, political bias, and recency were evaluated for their impact on news rankings. Findings indicate a slight leftward bias in search outcomes and a preference for popular, often national outlets. This bias, combined with a tendency to prioritize recent content, suggests that Google’s algorithm may reinforce existing media inequalities. By analyzing the largest dataset to date – 221,863 search results – this research provides comprehensive, longitudinal insights into how algorithms shape public access to diverse news sources.

[AI-31] Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

链接: https://arxiv.org/abs/2410.23822
作者: Jinlong He,Pengfei Li,Gang Liu,Shenjun Zhong
关键词-EN: Multimodal Large Language, text understanding capabilities, Large Language Models, superior text understanding, understanding capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) inherit the superior text understanding capabilities of LLMs and extend these capabilities to multimodal scenarios. These models achieve excellent results in the general domain of multimodal tasks. However, in the medical domain, the substantial training costs and the requirement for extensive medical data pose challenges to the development of medical MLLMs. Furthermore, due to the free-text form of answers, tasks such as visual grounding that need to produce output in a prescribed form become difficult for MLLMs. So far, there have been no medical MLLMs works in medical visual grounding area. For the medical vision grounding task, which involves identifying locations in medical images based on short text descriptions, we propose Parameter-efficient Fine-tuning medical multimodal large language models for Medcial Visual Grounding (PFMVG). To validate the performance of the model, we evaluate it on a public benchmark dataset for medical visual grounding, where it achieves competitive results, and significantly outperforming GPT-4v. Our code will be open sourced after peer review.

[AI-32] Disentangling Disentangled Representations: Towards Improved Latent Units via Diffusion Models

链接: https://arxiv.org/abs/2410.23820
作者: Youngjun Jun,Jiwoo Park,Kyobin Choo,Tae Eun Choi,Seong Jae Hwang
关键词-EN: core intrinsic factors, aims to break, Disentangled representation learning, break down observed, core intrinsic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Disentangled representation learning (DRL) aims to break down observed data into core intrinsic factors for a profound understanding of the data. In real-world scenarios, manually defining and labeling these factors are non-trivial, making unsupervised methods attractive. Recently, there have been limited explorations of utilizing diffusion models (DMs), which are already mainstream in generative modeling, for unsupervised DRL. They implement their own inductive bias to ensure that each latent unit input to the DM expresses only one distinct factor. In this context, we design Dynamic Gaussian Anchoring to enforce attribute-separated latent units for more interpretable DRL. This unconventional inductive bias explicitly delineates the decision boundaries between attributes while also promoting the independence among latent units. Additionally, we also propose Skip Dropout technique, which easily modifies the denoising U-Net to be more DRL-friendly, addressing its uncooperative nature with the disentangling feature extractor. Our methods, which carefully consider the latent unit semantics and the distinct DM structure, enhance the practicality of DM-based disentangled representations, demonstrating state-of-the-art disentanglement performance on both synthetic and real data, as well as advantages in downstream tasks.

[AI-33] he NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

链接: https://arxiv.org/abs/2410.23815
作者: Dake Guo,Jixun Yao,Xinfa Zhu,Kangxiang Xia,Zhao Guo,Ziyu Zhang,Yao Wang,Jie Liu,Lei Xie
关键词-EN: Audio Generation Challenge, Convincing Audio Generation, Inspirational and Convincing, Generation Challenge, NPU-HWC system submitted
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: accepted by ISCSLP 2024

点击查看摘要

Abstract:This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. The Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves the second place and the first place in Track 1 and Track 2 respectively.

[AI-34] CALE: Continuous Arcade Learning Environment

链接: https://arxiv.org/abs/2410.23810
作者: Jesse Farebrother,Pablo Samuel Castro
关键词-EN: Arcade Learning Environment, well-known Arcade Learning, Continuous Arcade Learning, Arcade Learning, Learning Environment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce the Continuous Arcade Learning Environment (CALE), an extension of the well-known Arcade Learning Environment (ALE) [Bellemare et al., 2013]. The CALE uses the same underlying emulator of the Atari 2600 gaming system (Stella), but adds support for continuous actions. This enables the benchmarking and evaluation of continuous-control agents (such as PPO [Schulman et al., 2017] and SAC [Haarnoja et al., 2018]) and value-based agents (such as DQN [Mnih et al., 2015] and Rainbow [Hessel et al., 2018]) on the same environment suite. We provide a series of open questions and research directions that CALE enables, as well as initial baseline results using Soft Actor-Critic. CALE is available as part of the ALE athttps://github.com/Farama-Foundation/Arcade-Learning-Environment.

[AI-35] Generative AI for Accessible and Inclusive Extended Reality

链接: https://arxiv.org/abs/2410.23803
作者: Jens Grubert,Junlong Chen,Per Ola Kristensson
关键词-EN: Artificial Intelligence-Generated Content, Artificial Intelligence-Generated, virtual environments, accessible virtual environments, transform how people
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Presented at the CHI 2024 Workshop “Building a Metaverse for All: Opportunities and Challenges for Future Inclusive and Accessible Virtual Environments”, May 11, 2024, Honolulu, Hawaii

点击查看摘要

Abstract:Artificial Intelligence-Generated Content (AIGC) has the potential to transform how people build and interact with virtual environments. Within this paper, we discuss potential benefits but also challenges that AIGC has for the creation of inclusive and accessible virtual environments. Specifically, we touch upon the decreased need for 3D modeling expertise, benefits of symbolic-only as well as multimodal input, 3D content editing, and 3D model accessibility as well as foundation model-specific challenges.

[AI-36] Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks

链接: https://arxiv.org/abs/2410.23796
作者: F.D. Gonzalez-Martinez,J.J. Carabias-Orti,F.J. Canadas-Quesada,N. Ruiz-Reyes,D. Martinez-Munoz,S. Garcia-Galan
关键词-EN: Sleep Apnoea Syndrome, Obstructive Sleep Apnoea, Apnoea Syndrome, Obstructive Sleep, Sleep Apnoea
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Snoring, an acoustic biomarker commonly observed in individuals with Obstructive Sleep Apnoea Syndrome (OSAS), holds significant potential for diagnosing and monitoring this recognized clinical disorder. Irrespective of snoring types, most snoring instances exhibit identifiable harmonic patterns manifested through distinctive energy distributions over time. In this work, we propose a novel method to differentiate monaural snoring from non-snoring sounds by analyzing the harmonic content of the input sound using harmonic/percussive sound source separation (HPSS). The resulting feature, based on the harmonic spectrogram from HPSS, is employed as input data for conventional neural network architectures, aiming to enhance snoring detection performance even under a limited data learning framework. To evaluate the performance of our proposal, we studied two different scenarios: 1) using a large dataset of snoring and interfering sounds, and 2) using a reduced training set composed of around 1% of the data material. In the former scenario, the proposed HPSS-based feature provides competitive results compared to other input features from the literature. However, the key advantage of the proposed method lies in the superior performance of the harmonic spectrogram derived from HPSS in a limited data learning context. In this particular scenario, using the proposed harmonic feature significantly enhances the performance of all the studied architectures in comparison to the classical input features documented in the existing literature. This finding clearly demonstrates that incorporating harmonic content enables more reliable learning of the essential time-frequency characteristics that are prevalent in most snoring sounds, even in scenarios where the amount of training data is limited.

[AI-37] EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching NEURIPS2024

链接: https://arxiv.org/abs/2410.23788
作者: Xinwang Chen,Ning Liu,Yichen Zhu,Feifei Feng,Jian Tang
关键词-EN: Transformer-based Diffusion Probabilistic, Diffusion Probabilistic Models, widespread practical applications, computational requirements hinder, requirements hinder widespread
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Xinwang Chen and Ning Liu are with equal contributions. This paper has been accepted by NeurIPS 2024

点击查看摘要

Abstract:Transformer-based Diffusion Probabilistic Models (DPMs) have shown more potential than CNN-based DPMs, yet their extensive computational requirements hinder widespread practical applications. To reduce the computation budget of transformer-based DPMs, this work proposes the Efficient Diffusion Transformer (EDT) framework. The framework includes a lightweight-design diffusion model architecture, and a training-free Attention Modulation Matrix and its alternation arrangement in EDT inspired by human-like sketching. Additionally, we propose a token relation-enhanced masking training strategy tailored explicitly for EDT to augment its token relation learning capability. Our extensive experiments demonstrate the efficacy of EDT. The EDT framework reduces training and inference costs and surpasses existing transformer-based diffusion models in image synthesis performance, thereby achieving a significant overall enhancement. With lower FID, EDT-S, EDT-B, and EDT-XL attained speed-ups of 3.93x, 2.84x, and 1.92x respectively in the training phase, and 2.29x, 2.29x, and 2.22x respectively in inference, compared to the corresponding sizes of MDTv2. The source code is released at this https URL.

[AI-38] Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map

链接: https://arxiv.org/abs/2410.23780
作者: Xinyuan Chang,Maixuan Xue,Xinran Liu,Zheng Pan,Xing Wei
关键词-EN: Ensuring adherence, traffic sign, traffic sign regulations, traffic, sign
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 27 pages, 13 figures

点击查看摘要

Abstract:Ensuring adherence to traffic sign regulations is essential for both human and autonomous vehicle navigation. While current benchmark datasets concentrate on lane perception or basic traffic sign recognition, they often overlook the intricate task of integrating these regulations into lane operations. Addressing this gap, we introduce MapDR, a novel dataset designed for the extraction of Driving Rules from traffic signs and their association with vectorized, locally perceived HD Maps. MapDR features over 10,000 annotated video clips that capture the intricate correlation between traffic sign regulations and lanes. We define two pivotal sub-tasks: 1) Rule Extraction from Traffic Sign, which accurately deciphers regulatory instructions, and 2) Rule-Lane Correspondence Reasoning, which aligns these rules with their respective lanes. Built upon this benchmark, we provide a multimodal solution that offers a strong baseline for advancing autonomous driving technologies. It fills a critical gap in the integration of traffic sign rules, contributing to the development of reliable autonomous navigation systems.

[AI-39] Enhancing Chess Reinforcement Learning with Graph Representation

链接: https://arxiv.org/abs/2410.23753
作者: Tomas Rigaux,Hisashi Kashima
关键词-EN: Convolutional Neural Network, Mastering games, Graph Neural Networks, rigid Convolutional Neural, hard task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mastering games is a hard task, as games can be extremely complex, and still fundamentally different in structure from one another. While the AlphaZero algorithm has demonstrated an impressive ability to learn the rules and strategy of a large variety of games, ranging from Go and Chess, to Atari games, its reliance on extensive computational resources and rigid Convolutional Neural Network (CNN) architecture limits its adaptability and scalability. A model trained to play on a 19\times 19 Go board cannot be used to play on a smaller 13\times 13 board, despite the similarity between the two Go variants. In this paper, we focus on Chess, and explore using a more generic Graph-based Representation of a game state, rather than a grid-based one, to introduce a more general architecture based on Graph Neural Networks (GNN). We also expand the classical Graph Attention Network (GAT) layer to incorporate edge-features, to naturally provide a generic policy output format. Our experiments, performed on smaller networks than the initial AlphaZero paper, show that this new architecture outperforms previous architectures with a similar number of parameters, being able to increase playing strength an order of magnitude faster. We also show that the model, when trained on a smaller 5\times 5 variant of chess, is able to be quickly fine-tuned to play on regular 8\times 8 chess, suggesting that this approach yields promising generalization abilities. Our code is available at this https URL.

[AI-40] LSEAttention is All You Need for Time Series Forecasting

链接: https://arxiv.org/abs/2410.23749
作者: Dizhen Liang
关键词-EN: achieved remarkable success, natural language processing, Transformer-based architectures, computer vision, architectures have achieved
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages with referencing, 1 figure, 3 tables

点击查看摘要

Abstract:Transformer-based architectures have achieved remarkable success in natural language processing and computer vision. However, their performance in multivariate long-term forecasting often lags behind simpler linear baselines. Previous studies have identified the traditional attention mechanism as a significant factor contributing to this limitation. To unlock the full potential of transformers for multivariate time series forecasting, I introduce \textbfLSEAttention, an approach designed to address entropy collapse and training instability commonly observed in transformer models. I validate the effectiveness of LSEAttention across various real-world multivariate time series datasets, demonstrating that it not only outperforms existing time series transformer models but also exceeds the performance of some state-of-the-art models on specific datasets.

[AI-41] Exploring Consistency in Graph Representations:from Graph Kernels to Graph Neural Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.23748
作者: Xuyuan Liu,Yinghao Cai,Qihui Yang,Yujun Yan
关键词-EN: Graph Neural Networks, neural network methods, dominant approach, similarity relationships, graph representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a dominant approach in graph representation learning, yet they often struggle to capture consistent similarity relationships among graphs. While graph kernel methods such as the Weisfeiler-Lehman subtree (WL-subtree) and Weisfeiler-Lehman optimal assignment (WLOA) kernels are effective in capturing similarity relationships, they rely heavily on predefined kernels and lack sufficient non-linearity for more complex data patterns. Our work aims to bridge the gap between neural network methods and kernel approaches by enabling GNNs to consistently capture relational structures in their learned representations. Given the analogy between the message-passing process of GNNs and WL algorithms, we thoroughly compare and analyze the properties of WL-subtree and WLOA kernels. We find that the similarities captured by WLOA at different iterations are asymptotically consistent, ensuring that similar graphs remain similar in subsequent iterations, thereby leading to superior performance over the WL-subtree kernel. Inspired by these findings, we conjecture that the consistency in the similarities of graph representations across GNN layers is crucial in capturing relational structures and enhancing graph classification performance. Thus, we propose a loss to enforce the similarity of graph representations to be consistent across different layers. Our empirical analysis verifies our conjecture and shows that our proposed consistency loss can significantly enhance graph classification performance across several GNN backbones on various datasets.

[AI-42] Syno: Structured Synthesis for Neural Operators

链接: https://arxiv.org/abs/2410.23745
作者: Yongqi Zhuo,Zhengyuan Su,Chenggang Zhao,Mingyu Gao
关键词-EN: higher execution performance, networks never end, higher execution, execution performance, neural operator synthesis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:The desires for better prediction accuracy and higher execution performance in neural networks never end. Neural architecture search (NAS) and tensor compilers are two popular techniques to optimize these two goals, but they are both limited to composing or optimizing existing manually designed operators rather than coming up with completely new designs. In this work, we explore the less studied direction of neural operator synthesis, which aims to automatically and efficiently discover novel neural operators with better accuracy and/or speed. We develop an end-to-end framework Syno, to realize practical neural operator synthesis. Syno makes use of a novel set of fine-grained primitives defined on tensor dimensions, which ensure various desired properties to ease model training, and also enable expression canonicalization techniques to avoid redundant candidates during search. Syno further adopts a novel guided synthesis flow to obtain valid operators matched with the specified input/output dimension sizes, and leverages efficient stochastic tree search algorithms to quickly explore the design space. We demonstrate that Syno discovers better operators with an average of 2.06\times speedup and less than 1% accuracy loss, even on NAS-optimized models.

[AI-43] owards Reliable Alignment: Uncertainty-aware RLHF

链接: https://arxiv.org/abs/2410.23726
作者: Debangshu Banerjee,Aditya Gopalan
关键词-EN: aligning Large Language, Large Language Models, reward models, Recent advances, aligning Large
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models used in Reinforcement Learning with Human Feedback (RLHF) are typically learned from small datasets using stochastic optimization algorithms, making them prone to high variability. We illustrate the inconsistencies between reward models empirically on numerous open-source datasets. We theoretically show that the fluctuation of the reward models can be detrimental to the alignment problem because the derived policies are more overfitted to the reward model and, hence, are riskier if the reward model itself is uncertain. We use concentration of measure to motivate an uncertainty-aware, conservative algorithm for policy optimization. We show that such policies are more risk-averse in the sense that they are more cautious of uncertain rewards. We theoretically prove that our proposed methodology has less risk than the vanilla method. We corroborate our theoretical results with experiments based on designing an ensemble of reward models. We use this ensemble of reward models to align a language model using our methodology and observe that our empirical findings match our theoretical predictions. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2410.23726 [cs.AI] (or arXiv:2410.23726v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2410.23726 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-44] Argumentation and Machine Learning

链接: https://arxiv.org/abs/2410.23724
作者: Antonio Rago,Kristijonas Čyras,Jack Mumford,Oana Cocarascu
关键词-EN: Machine Learning, cross-fertilisation between Computational, Computational Argumentation, Machine Learning components, Learning
类目: Artificial Intelligence (cs.AI)
*备注: 44 pages, to appear in the Handbook of Formal Argumentation and the Journal of Applied Logics

点击查看摘要

Abstract:This chapter provides an overview of research works that present approaches with some degree of cross-fertilisation between Computational Argumentation and Machine Learning. Our review of the literature identified two broad themes representing the purpose of the interaction between these two areas: argumentation for machine learning and machine learning for argumentation. Across these two themes, we systematically evaluate the spectrum of works across various dimensions, including the type of learning and the form of argumentation framework used. Further, we identify three types of interaction between these two areas: synergistic approaches, where the Argumentation and Machine Learning components are tightly integrated; segmented approaches, where the two are interleaved such that the outputs of one are the inputs of the other; and approximated approaches, where one component shadows the other at a chosen level of detail. We draw conclusions about the suitability of certain forms of Argumentation for supporting certain types of Machine Learning, and vice versa, with clear patterns emerging from the review. Whilst the reviewed works provide inspiration for successfully combining the two fields of research, we also identify and discuss limitations and challenges that ought to be addressed in order to ensure that they remain a fruitful pairing as AI advances.

[AI-45] Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

链接: https://arxiv.org/abs/2410.23680
作者: Weichao Zhou,Wenchao Li
关键词-EN: inverse reinforcement learning, algorithms use inverse, inverse reinforcement, reward functions, IRL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2306.01731

点击查看摘要

Abstract:Many imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with the demonstration. However, the inferred reward functions often fail to capture the underlying task objectives. In this paper, we propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment. Our framework is a semi-supervised approach that leverages expert demonstrations as weak supervision to derive a set of candidate reward functions that align with the task rather than only with the data. It then adopts an adversarial mechanism to train a policy with this set of reward functions to gain a collective validation of the policy’s ability to accomplish the task. We provide theoretical insights into this framework’s ability to mitigate task-reward misalignment and present a practical implementation. Our experimental results show that our framework outperforms conventional IL baselines in complex and transfer learning scenarios.

[AI-46] Provable Benefit of Cutout and CutMix for Feature Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.23672
作者: Junsoo Oh,Chulhee Yun
关键词-EN: demonstrated significant efficacy, vision tasks, demonstrated significant, significant efficacy, efficacy in enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeurIPS 2024 camera-ready version, 81 pages

点击查看摘要

Abstract:Patch-level data augmentation techniques such as Cutout and CutMix have demonstrated significant efficacy in enhancing the performance of vision tasks. However, a comprehensive theoretical understanding of these methods remains elusive. In this paper, we study two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training. Our analysis focuses on a feature-noise data model, which consists of several label-dependent features of varying rarity and label-independent noises of differing strengths. Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors “evenly” regardless of the rarity and strength, which provides an interesting insight into understanding patch-level augmentation.

[AI-47] Deep Convolutional Neural Networks on Multiclass Classification of Three-Dimensional Brain Images for Parkinsons Disease Stage Prediction

链接: https://arxiv.org/abs/2410.23649
作者: Guan-Hua Huang,Wan-Chen Lai,Tai-Been Chen,Chien-Chin Hsu,Huei-Yung Chen,Yi-Chen Wu,Li-Ren Yeh
关键词-EN: central nervous system, emission computed tomography, functional medical imaging, single-photon emission computed, Parkinson disease
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 34 pages, 7 figures, and 4 tables

点击查看摘要

Abstract:Parkinson’s disease (PD), a degenerative disorder of the central nervous system, is commonly diagnosed using functional medical imaging techniques such as single-photon emission computed tomography (SPECT). In this study, we utilized two SPECT data sets (n = 634 and n = 202) from different hospitals to develop a model capable of accurately predicting PD stages, a multiclass classification task. We used the entire three-dimensional (3D) brain images as input and experimented with various model architectures. Initially, we treated the 3D images as sequences of two-dimensional (2D) slices and fed them sequentially into 2D convolutional neural network (CNN) models pretrained on ImageNet, averaging the outputs to obtain the final predicted stage. We also applied 3D CNN models pretrained on Kinetics-400. Additionally, we incorporated an attention mechanism to account for the varying importance of different slices in the prediction process. To further enhance model efficacy and robustness, we simultaneously trained the two data sets using weight sharing, a technique known as cotraining. Our results demonstrated that 2D models pretrained on ImageNet outperformed 3D models pretrained on Kinetics-400, and models utilizing the attention mechanism outperformed both 2D and 3D models. The cotraining technique proved effective in improving model performance when the cotraining data sets were sufficiently large.

[AI-48] Anytime-Constrained Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2410.23637
作者: Jeremy McMahan,Xiaojin Zhu
关键词-EN: introduce anytime constraints, anytime-constrained Markov games, introduce anytime, anytime constraints, multi-agent setting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We introduce anytime constraints to the multi-agent setting with the corresponding solution concept being anytime-constrained equilibrium (ACE). Then, we present a comprehensive theory of anytime-constrained Markov games, which includes (1) a computational characterization of feasible policies, (2) a fixed-parameter tractable algorithm for computing ACE, and (3) a polynomial-time algorithm for approximately computing feasible ACE. Since computing a feasible policy is NP-hard even for two-player zero-sum games, our approximation guarantees are the best possible under worst-case analysis. We also develop the first theory of efficient computation for action-constrained Markov games, which may be of independent interest.

[AI-49] Adaptive Alignment: Dynamic Preference Adjustments via Multi-Objective Reinforcement Learning for Pluralistic AI NEURIPS2024

链接: https://arxiv.org/abs/2410.23630
作者: Hadassah Harland,Richard Dazeley,Peter Vamplew,Hashini Senaratne,Bahareh Nakisa,Francisco Cruz
关键词-EN: Pluralistic Artificial Intelligence, Artificial Intelligence, Pluralistic Artificial, Objective Reinforcement Learning, Emerging research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for the Pluralistic Alignment workshop at NeurIPS 2024

点击查看摘要

Abstract:Emerging research in Pluralistic Artificial Intelligence (AI) alignment seeks to address how intelligent systems can be designed and deployed in accordance with diverse human needs and values. We contribute to this pursuit with a dynamic approach for aligning AI with diverse and shifting user preferences through Multi Objective Reinforcement Learning (MORL), via post-learning policy selection adjustment. In this paper, we introduce the proposed framework for this approach, outline its anticipated advantages and assumptions, and discuss technical details about the implementation. We also examine the broader implications of adopting a retroactive alignment approach through the sociotechnical systems perspective.

[AI-50] Posture-Informed Muscular Force Learning for Robust Hand Pressure Estimation NEURIPS2024

链接: https://arxiv.org/abs/2410.23629
作者: Kyungjin Seo,Junghoon Seo,Hanseok Jeong,Sangpil Kim,Sang Ho Yoon
关键词-EN: forearm surface electromyography, augment forearm surface, present PiMForce, surface electromyography, augment forearm
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:We present PiMForce, a novel framework that enhances hand pressure estimation by leveraging 3D hand posture information to augment forearm surface electromyography (sEMG) signals. Our approach utilizes detailed spatial information from 3D hand poses in conjunction with dynamic muscle activity from sEMG to enable accurate and robust whole-hand pressure measurements under diverse hand-object interactions. We also developed a multimodal data collection system that combines a pressure glove, an sEMG armband, and a markerless finger-tracking module. We created a comprehensive dataset from 21 participants, capturing synchronized data of hand posture, sEMG signals, and exerted hand pressure across various hand postures and hand-object interaction scenarios using our collection system. Our framework enables precise hand pressure estimation in complex and natural interaction scenarios. Our approach substantially mitigates the limitations of traditional sEMG-based or vision-based methods by integrating 3D hand posture information with sEMG signals. Video demos, data, and code are available online.

[AI-51] Using Structural Similarity and Kolmogorov-Arnold Networks for Anatomical Embedding of 3-hinge Gyrus

链接: https://arxiv.org/abs/2410.23598
作者: Minheng Chen,Chao Cao,Tong Chen,Yan Zhuang,Jing Zhang,Yanjun Lyu,Xiaowei Yu,Lu Zhang,Tianming Liu,Dajiang Zhu
关键词-EN: defined folding pattern, newly defined folding, folding pattern, defined folding, cortical folding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The 3-hinge gyrus (3HG) is a newly defined folding pattern, which is the conjunction of gyri coming from three directions in cortical folding. Many studies demonstrated that 3HGs can be reliable nodes when constructing brain networks or connectome since they simultaneously possess commonality and individuality across different individual brains and populations. However, 3HGs are identified and validated within individual spaces, making it difficult to directly serve as the brain network nodes due to the absence of cross-subject correspondence. The 3HG correspondences represent the intrinsic regulation of brain organizational architecture, traditional image-based registration methods tend to fail because individual anatomical properties need to be fully respected. To address this challenge, we propose a novel self-supervised framework for anatomical feature embedding of the 3HGs to build the correspondences among different brains. The core component of this framework is to construct a structural similarity-enhanced multi-hop feature encoding strategy based on the recently developed Kolmogorov-Arnold network (KAN) for anatomical feature embedding. Extensive experiments suggest that our approach can effectively establish robust cross-subject correspondences when no one-to-one mapping exists.

[AI-52] How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces?

链接: https://arxiv.org/abs/2410.23594
作者: Weiguo Gao,Ming Li
关键词-EN: low-dimensional structure embedded, Real-world data, sample data subspace, high-dimensional space, embedded in high-dimensional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 33 pages, 9 figures

点击查看摘要

Abstract:Real-world data is often assumed to lie within a low-dimensional structure embedded in high-dimensional space. In practical settings, we observe only a finite set of samples, forming what we refer to as the sample data subspace. It serves an essential approximation supporting tasks such as dimensionality reduction and generation. A major challenge lies in whether generative models can reliably synthesize samples that stay within this subspace rather than drifting away from the underlying structure. In this work, we provide theoretical insights into this challenge by leveraging Flow Matching models, which transform a simple prior into a complex target distribution via a learned velocity field. By treating the real data distribution as discrete, we derive analytical expressions for the optimal velocity field under a Gaussian prior, showing that generated samples memorize real data points and represent the sample data subspace exactly. To generalize to suboptimal scenarios, we introduce the Orthogonal Subspace Decomposition Network (OSDNet), which systematically decomposes the velocity field into subspace and off-subspace components. Our analysis shows that the off-subspace component decays, while the subspace component generalizes within the sample data subspace, ensuring generated samples preserve both proximity and diversity.

[AI-53] Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis

链接: https://arxiv.org/abs/2410.23578
作者: Janakan Sivaloganathan,Ainaz Jamshidi,Andriy Miranskyy,Lei Zhang
关键词-EN: quantum software, Flaky tests, wasted developer effort, Toggle, flaky test detection
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Flaky tests, which pass or fail inconsistently without code changes, are a major challenge in software engineering in general and in quantum software engineering in particular due to their complexity and probabilistic nature, leading to hidden issues and wasted developer effort. We aim to create an automated framework to detect flaky tests in quantum software and an extended dataset of quantum flaky tests, overcoming the limitations of manual methods. Building on prior manual analysis of 14 quantum software repositories, we expanded the dataset and automated flaky test detection using transformers and cosine similarity. We conducted experiments with Large Language Models (LLMs) from the OpenAI GPT and Meta LLaMA families to assess their ability to detect and classify flaky tests from code and issue descriptions. Embedding transformers proved effective: we identified 25 new flaky tests, expanding the dataset by 54%. Top LLMs achieved an F1-score of 0.8871 for flakiness detection but only 0.5839 for root cause identification. We introduced an automated flaky test detection framework using machine learning, showing promising results but highlighting the need for improved root cause detection and classification in large quantum codebases. Future work will focus on improving detection techniques and developing automatic flaky test fixes. Comments: 5 pages, 1 figure Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.23578 [cs.SE] (or arXiv:2410.23578v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2410.23578 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lei Zhang [view email] [v1] Thu, 31 Oct 2024 02:43:04 UTC (215 KB) Full-text links: Access Paper: View a PDF of the paper titled Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis, by Janakan Sivaloganathan and 3 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.SE prev | next new | recent | 2024-10 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[AI-54] ransferable Ensemble Black-box Jailbreak Attacks on Large Language Models

链接: https://arxiv.org/abs/2410.23558
作者: Yiqi Yang,Hongye Fu
关键词-EN: black-box jailbreak attacking, jailbreak attacking framework, attacking framework, framework that incorporates, deliver transferable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this report, we propose a novel black-box jailbreak attacking framework that incorporates various LLM-as-Attacker methods to deliver transferable and powerful jailbreak attacks. Our method is designed based on three key observations from existing jailbreaking studies and practices. First, we consider an ensemble approach should be more effective in exposing the vulnerabilities of an aligned LLM compared to individual attacks. Second, different malicious instructions inherently vary in their jailbreaking difficulty, necessitating differentiated treatment to ensure more efficient attacks. Finally, the semantic coherence of a malicious instruction is crucial for triggering the defenses of an aligned LLM; therefore, it must be carefully disrupted to manipulate its embedding representation, thereby increasing the jailbreak success rate. We validated our approach by participating in the Competition for LLM and Agent Safety 2024, where our team achieved top performance in the Jailbreaking Attack Track.

[AI-55] ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

链接: https://arxiv.org/abs/2410.23537
作者: Youpeng Zhao,Jun Wang
关键词-EN: Large Language Models, Large Language, Language Models, artificial general intelligence, represent a revolutionary
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
*备注: ICCAD 2024

点击查看摘要

Abstract:Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications necessitate minimal response latency and maximal throughput for inference serving. However, due to the unpredictability of LLM execution, the first-come-first-serve (FCFS) scheduling policy employed by current LLM serving systems suffers from head-of-line (HoL) blocking issues and long job response times. In this paper, we propose a new efficient LLM inference serving framework, named ALISE. The key design paradigm of ALISE is to leverage a novel speculative scheduler by estimating the execution time for each job and exploiting such prior knowledge to assign appropriate job priority orders, thus minimizing potential queuing delays for heterogeneous workloads. Furthermore, to mitigate the memory overhead of the intermediate key-value (KV) cache, we employ a priority-based adaptive memory management protocol and quantization-based compression techniques. Evaluations demonstrate that in comparison to the state-of-the-art solution vLLM, ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively. Comments: ICCAD 2024 Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.23537 [cs.PF] (or arXiv:2410.23537v1 [cs.PF] for this version) https://doi.org/10.48550/arXiv.2410.23537 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-56] here and Back Again: On the relation between noises images and their inversions in diffusion models

链接: https://arxiv.org/abs/2410.23530
作者: Łukasz Staniszewski,Łukasz Kuciński,Kamil Deja
关键词-EN: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, Probabilistic Models, lack meaningful latent, meaningful latent space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) achieve state-of-the-art performance in synthesizing new images from random noise, but they lack meaningful latent space that encodes data into features. Recent DDPM-based editing techniques try to mitigate this issue by inverting images back to their approximated staring noise. In this work, we study the relation between the initial Gaussian noise, the samples generated from it, and their corresponding latent encodings obtained through the inversion procedure. First, we interpret their spatial distance relations to show the inaccuracy of the DDIM inversion technique by localizing latent representations manifold between the initial noise and generated samples. Then, we demonstrate the peculiar relation between initial Gaussian noise and its corresponding generations during diffusion training, showing that the high-level features of generated images stabilize rapidly, keeping the spatial distance relationship between noises and generations consistent throughout the training.

[AI-57] Kernel-Based Function Approximation for Average Reward Reinforcement Learning: An Optimist No-Regret Algorithm NEURIPS2024

链接: https://arxiv.org/abs/2410.23498
作者: Sattar Vakili,Julia Olkhovskaya
关键词-EN: Reinforcement learning utilizing, great representational capacity, learning utilizing kernel, utilizing kernel ridge, kernel ridge regression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Reinforcement learning utilizing kernel ridge regression to predict the expected value function represents a powerful method with great representational capacity. This setting is a highly versatile framework amenable to analytical results. We consider kernel-based function approximation for RL in the infinite horizon average reward setting, also referred to as the undiscounted setting. We propose an optimistic algorithm, similar to acquisition function based algorithms in the special case of bandits. We establish novel no-regret performance guarantees for our algorithm, under kernel-based modelling assumptions. Additionally, we derive a novel confidence interval for the kernel-based prediction of the expected value function, applicable across various RL problems.

[AI-58] DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity NEURIPS2024

链接: https://arxiv.org/abs/2410.23495
作者: Baekrok Shin,Junsoo Oh,Hanseul Cho,Chulhee Yun
关键词-EN: previously learned weights, weights is appealing, practical neural networks, continuous influx, neural network training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at NeurIPS 2024

点击查看摘要

Abstract:Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features. e validate our approach on vision tasks, demonstrating improvements in test accuracy and training efficiency.

[AI-59] Causality-Driven Audits of Model Robustness

链接: https://arxiv.org/abs/2410.23494
作者: Nathan Drenkow,Chris Ribaudo,Mathias Unberath
关键词-EN: deep neural networks, significantly degrade DNN, degrade DNN performance, neural networks, challenging real-world imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robustness audits of deep neural networks (DNN) provide a means to uncover model sensitivities to the challenging real-world imaging conditions that significantly degrade DNN performance in-the-wild. Such conditions are often the result of the compounding of multiple factors inherent to the environment, sensor, or processing pipeline and may lead to complex image distortions that are not easily categorized. When robustness audits are limited to a set of pre-determined imaging effects or distortions, the results cannot be (easily) transferred to real-world conditions where image corruptions may be more complex or nuanced. To address this challenge, we present a new alternative robustness auditing method that uses causal inference to measure DNN sensitivities to the factors of the imaging process that cause complex distortions. Our approach uses causal models to explicitly encode assumptions about the domain-relevant factors and their interactions. Then, through extensive experiments on natural and rendered images across multiple vision tasks, we show that our approach reliably estimates causal effects of each factor on DNN performance using observational domain data. These causal effects directly tie DNN sensitivities to observable properties of the imaging pipeline in the domain of interest towards reducing the risk of unexpected DNN failures when deployed in that domain.

[AI-60] Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System

链接: https://arxiv.org/abs/2410.23483
作者: Julian Collado,Kevin Stangl
关键词-EN: Recent approaches, proxy model, approaches in machine, solve a task, composition of multiple
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Recent approaches in machine learning often solve a task using a composition of multiple models or agentic architectures. When targeting a composed system with adversarial attacks, it might not be computationally or informationally feasible to train an end-to-end proxy model or a proxy model for every component of the system. We introduce a method to craft an adversarial attack against the overall multi-model system when we only have a proxy model for the final black-box model, and when the transformation applied by the initial models can make the adversarial perturbations ineffective. Current methods handle this by applying many copies of the first model/transformation to an input and then re-use a standard adversarial attack by averaging gradients, or learning a proxy model for both stages. To our knowledge, this is the first attack specifically designed for this threat model and our method has a substantially higher attack success rate (80% vs 25%) and contains 9.4% smaller perturbations (MSE) compared to prior state-of-the-art methods. Our experiments focus on a supervised image pipeline, but we are confident the attack will generalize to other multi-model settings [e.g. a mix of open/closed source foundation models], or agentic systems

[AI-61] Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

链接: https://arxiv.org/abs/2410.23472
作者: Rokas Gipiškis,Ayrton San Joaquin,Ze Shen Chin,Adrian Regenfuß,Ariel Gil,Koen Holtman
关键词-EN: Artificial Intelligence, risk management measures, newly emerging types, types of Artificial, management measures
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 91 pages, 8 figures

点击查看摘要

Abstract:There is an urgent need to identify both short and long-term risks from newly emerging types of Artificial Intelligence (AI), as well as available risk management measures. In response, and to support global efforts in regulating AI and writing safety standards, we compile an extensive catalog of risk sources and risk management measures for general-purpose AI (GPAI) systems, complete with descriptions and supporting examples where relevant. This work involves identifying technical, operational, and societal risks across model development, training, and deployment stages, as well as surveying established and experimental methods for managing these risks. To the best of our knowledge, this paper is the first of its kind to provide extensive documentation of both GPAI risk sources and risk management measures that are descriptive, self-contained and neutral with respect to any existing regulatory framework. This work intends to help AI providers, standards experts, researchers, policymakers, and regulators in identifying and mitigating systemic risks from GPAI systems. For this reason, the catalog is released under a public domain license for ease of direct use by stakeholders in AI governance and standards.

[AI-62] Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

链接: https://arxiv.org/abs/2410.23450
作者: Ruhan Wang,Yu Yang,Zhishuai Liu,Dongruo Zhou,Pan Xu
关键词-EN: easily accessible source, accessible source domain, enhance policy learning, study offline off-dynamics, offline off-dynamics reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: 26 pages, 10 tables, 10 figures

点击查看摘要

Abstract:We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on the decision transformer (DT), which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works tackle the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy can not be directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented Decision Transformer (RADT) method, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide the theoretical analysis demonstrating that the RCSL policy learned from RADT achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations RADT-DARA and RADT-MV respectively. Extensive experiments conducted on D4RL datasets reveal that our methods generally outperform dynamic programming based methods in off-dynamics RL scenarios.

[AI-63] Venire: A Machine Learning-Guided Panel Review System for Community Content Moderation

链接: https://arxiv.org/abs/2410.23448
作者: Vinay Koshy,Frederick Choi,Yi-Shyuan Chiang,Hari Sundaram,Eshwar Chandrasekharan,Karrie Karahalios
关键词-EN: unified voice, Venire, Research, moderation, Research into community
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Research into community content moderation often assumes that moderation teams govern with a single, unified voice. However, recent work has found that moderators disagree with one another at modest, but concerning rates. The problem is not the root disagreements themselves. Subjectivity in moderation is unavoidable, and there are clear benefits to including diverse perspectives within a moderation team. Instead, the crux of the issue is that, due to resource constraints, moderation decisions end up being made by individual decision-makers. The result is decision-making that is inconsistent, which is frustrating for community members. To address this, we develop Venire, an ML-backed system for panel review on Reddit. Venire uses a machine learning model trained on log data to identify the cases where moderators are most likely to disagree. Venire fast-tracks these cases for multi-person review. Ideally, Venire allows moderators to surface and resolve disagreements that would have otherwise gone unnoticed. We conduct three studies through which we design and evaluate Venire: a set of formative interviews with moderators, technical evaluations on two datasets, and a think-aloud study in which moderators used Venire to make decisions on real moderation cases. Quantitatively, we demonstrate that Venire is able to improve decision consistency and surface latent disagreements. Qualitatively, we find that Venire helps moderators resolve difficult moderation cases more confidently. Venire represents a novel paradigm for human-AI content moderation, and shifts the conversation from replacing human decision-making to supporting it.

[AI-64] PP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes WACV2025

链接: https://arxiv.org/abs/2410.23409
作者: Alessandro D’Amelio,Giuseppe Cartella,Vittorio Cuculo,Manuele Lucchi,Marcella Cornia,Rita Cucchiara,Giuseppe Boccignone
关键词-EN: current processing demands, proper location, processing demands, fixate the proper, scene and holds
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:Attention guides our gaze to fixate the proper location of the scene and holds it in that location for the deserved amount of time given current processing demands, before shifting to the next one. As such, gaze deployment crucially is a temporal process. Existing computational models have made significant strides in predicting spatial aspects of observer’s visual scanpaths (where to look), while often putting on the background the temporal facet of attention dynamics (when). In this paper we present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Process (TPP), that jointly learns the temporal dynamics of fixations position and duration, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches. Source code and trained models are publicly available at: this https URL.

[AI-65] FlowLLM : Flow Matching for Material Generation with Large Language Models as Base Distributions

链接: https://arxiv.org/abs/2410.23405
作者: Anuroop Sriram,Benjamin Kurt Miller,Ricky T. Q. Chen,Brandon M. Wood
关键词-EN: including carbon capture, renewable energy, revolutionize various fields, including carbon, carbon capture
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Material discovery is a critical area of research with the potential to revolutionize various fields, including carbon capture, renewable energy, and electronics. However, the immense scale of the chemical space makes it challenging to explore all possible materials experimentally. In this paper, we introduce FlowLLM, a novel generative model that combines large language models (LLMs) and Riemannian flow matching (RFM) to design novel crystalline materials. FlowLLM first fine-tunes an LLM to learn an effective base distribution of meta-stable crystals in a text representation. After converting to a graph representation, the RFM model takes samples from the LLM and iteratively refines the coordinates and lattice parameters. Our approach significantly outperforms state-of-the-art methods, increasing the generation rate of stable materials by over three times and increasing the rate for stable, unique, and novel crystals by \sim50% - a huge improvement on a difficult problem. Additionally, the crystals generated by FlowLLM are much closer to their relaxed state when compared with another leading model, significantly reducing post-hoc computational cost.

[AI-66] Adaptive Network Intervention for Complex Systems: A Hierarchical Graph Reinforcement Learning Approach

链接: https://arxiv.org/abs/2410.23396
作者: Qiliang Chen,Babak Heydari
关键词-EN: managing system-wide outcomes, complex multi-agent systems, Hierarchical Graph Reinforcement, Effective governance, complex multi-agent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Effective governance and steering of behavior in complex multi-agent systems (MAS) are essential for managing system-wide outcomes, particularly in environments where interactions are structured by dynamic networks. In many applications, the goal is to promote pro-social behavior among agents, where network structure plays a pivotal role in shaping these interactions. This paper introduces a Hierarchical Graph Reinforcement Learning (HGRL) framework that governs such systems through targeted interventions in the network structure. Operating within the constraints of limited managerial authority, the HGRL framework demonstrates superior performance across a range of environmental conditions, outperforming established baseline methods. Our findings highlight the critical influence of agent-to-agent learning (social learning) on system behavior: under low social learning, the HGRL manager preserves cooperation, forming robust core-periphery networks dominated by cooperators. In contrast, high social learning accelerates defection, leading to sparser, chain-like networks. Additionally, the study underscores the importance of the system manager’s authority level in preventing system-wide failures, such as agent rebellion or collapse, positioning HGRL as a powerful tool for dynamic network-based governance.

[AI-67] Resource Governance in Networked Systems via Integrated Variational Autoencoders and Reinforcement Learning

链接: https://arxiv.org/abs/2410.23393
作者: Qiliang Chen,Babak Heydari
关键词-EN: integrates variational autoencoders, dynamically adjusting network, adjusting network structures, balance system performance, Deep Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We introduce a framework that integrates variational autoencoders (VAE) with reinforcement learning (RL) to balance system performance and resource usage in multi-agent systems by dynamically adjusting network structures over time. A key innovation of this method is its capability to handle the vast action space of the network structure. This is achieved by combining Variational Auto-Encoder and Deep Reinforcement Learning to control the latent space encoded from the network structures. The proposed method, evaluated on the modified OpenAI particle environment under various scenarios, not only demonstrates superior performance compared to baselines but also reveals interesting strategies and insights through the learned behaviors.

[AI-68] Estimating Neural Network Robustness via Lipschitz Constant and Architecture Sensitivity

链接: https://arxiv.org/abs/2410.23382
作者: Abulikemu Abuduweili,Changliu Liu
关键词-EN: Ensuring neural network, Ensuring neural, robotic learning systems, real-world environments, Lipschitz constant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: SAFE-ROL at CoRL 2024

点击查看摘要

Abstract:Ensuring neural network robustness is essential for the safe and reliable operation of robotic learning systems, especially in perception and decision-making tasks within real-world environments. This paper investigates the robustness of neural networks in perception systems, specifically examining their sensitivity to targeted, small-scale perturbations. We identify the Lipschitz constant as a key metric for quantifying and enhancing network robustness. We derive an analytical expression to compute the Lipschitz constant based on neural network architecture, providing a theoretical basis for estimating and improving robustness. Several experiments reveal the relationship between network design, the Lipschitz constant, and robustness, offering practical insights for developing safer, more robust robot learning systems.

[AI-69] Sequential Order-Robust Mamba for Time Series Forecasting NEURIPS

链接: https://arxiv.org/abs/2410.23356
作者: Seunghan Lee,Juri Hong,Kibok Lee,Taeyoung Park
关键词-EN: offering near-linear complexity, alternative to Transformers, processing sequential data, offering near-linear, recently emerged
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeurIPS Workshop on Time Series in the Age of Large Models, 2024

点击查看摘要

Abstract:Mamba has recently emerged as a promising alternative to Transformers, offering near-linear complexity in processing sequential data. However, while channels in time series (TS) data have no specific order in general, recent studies have adopted Mamba to capture channel dependencies (CD) in TS, introducing a sequential order bias. To address this issue, we propose SOR-Mamba, a TS forecasting method that 1) incorporates a regularization strategy to minimize the discrepancy between two embedding vectors generated from data with reversed channel orders, thereby enhancing robustness to channel order, and 2) eliminates the 1D-convolution originally designed to capture local information in sequential data. Furthermore, we introduce channel correlation modeling (CCM), a pretraining task aimed at preserving correlations between channels from the data space to the latent space in order to enhance the ability to capture CD. Extensive experiments demonstrate the efficacy of the proposed method across standard and transfer learning scenarios. Code is available at this https URL.

[AI-70] MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts NEURIPS2024

链接: https://arxiv.org/abs/2410.23332
作者: Jie Zhu,Yixiong Chen,Mingyu Ding,Ping Luo,Leye Wang,Jingdong Wang
关键词-EN: attracted vast attention, vast attention due, impressive image-generation capabilities, attracted vast, vast attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at NeurIPS 2024

点击查看摘要

Abstract:Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at this https URL.

[AI-71] CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP

链接: https://arxiv.org/abs/2410.23330
作者: Tianyu Yang,Lisen Dai,Zheyuan Liu,Xiangqi Wang,Meng Jiang,Yapeng Tian,Xiangliang Zhang
关键词-EN: full retraining process, gained significant attention, remove specific data, Machine unlearning, retraining process
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process. While progress has been made in unimodal domains like text and image classification, unlearning in multimodal models remains relatively underexplored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples, while preserving the model’s performance on the retain set after unlearning.

[AI-72] Advanced Cyberattack Detection in Internet of Medical Things (IoMT) Using Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.23306
作者: Alireza Mohammadi,Hosna Ghahramani,Seyyed Amir Asghari,Mehdi Aminian
关键词-EN: Medical Things, Internet of Medical, significantly enhanced patient, enhanced patient care, Convolutional Neural Networks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, Accepted at Iranian Conference on Intelligent Systems (ICIS) 23-24 October, 2024, Sirjan University of Technology, Sirjan, Kerman, Iran. \c{opyright} 2024 IEEE. Personal use of this material is permitted. The accepted version is shared here. For the final published version, refer to the IEEE Xplore Digital Library

点击查看摘要

Abstract:The increasing integration of the Internet of Medical Things (IoMT) into healthcare systems has significantly enhanced patient care but has also introduced critical cybersecurity challenges. This paper presents a novel approach based on Convolutional Neural Networks (CNNs) for detecting cyberattacks within IoMT environments. Unlike previous studies that predominantly utilized traditional machine learning (ML) models or simpler Deep Neural Networks (DNNs), the proposed model leverages the capabilities of CNNs to effectively analyze the temporal characteristics of network traffic data. Trained and evaluated on the CICIoMT2024 dataset, which comprises 18 distinct types of cyberattacks across a range of IoMT devices, the proposed CNN model demonstrates superior performance compared to previous state-of-the-art methods, achieving a perfect accuracy of 99% in binary, categorical, and multiclass classification tasks. This performance surpasses that of conventional ML models such as Logistic Regression, AdaBoost, DNNs, and Random Forests. These findings highlight the potential of CNNs to substantially improve IoMT cybersecurity, thereby ensuring the protection and integrity of connected healthcare systems.

[AI-73] FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware

链接: https://arxiv.org/abs/2410.23299
作者: Minwoo Kang,Mingjie Liu,Ghaith Bany Hamad,Syed Suhaib,Haoxing Ren
关键词-EN: spurred significant interest, large language models, digital chip design, spurred significant, significant interest
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today’s LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code is available at \urlthis https URL.

[AI-74] Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

链接: https://arxiv.org/abs/2410.04640
作者: Christopher Agia,Rohan Sinha,Jingyun Yang,Zi-ang Cao,Rika Antonova,Marco Pavone,Jeannette Bohg
关键词-EN: Robot behavior policies, Vision Language Models, Robot behavior, training data, imitation learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL . 35 pages, 9 figures. Accepted to the Conference on Robot Learning (CoRL) 2024

点击查看摘要

Abstract:Robot behavior policies trained via imitation learning are prone to failure under conditions that deviate from their training data. Thus, algorithms that monitor learned policies at test time and provide early warnings of failure are necessary to facilitate scalable deployment. We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories: 1) Erratic failures, which we detect using statistical measures of temporal action consistency, and 2) task progression failures, where we use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. Our approach has two key strengths. First, because learned policies exhibit diverse failure modes, combining complementary detectors leads to significantly higher accuracy at failure detection. Second, using a statistical temporal action consistency measure ensures that we quickly detect when multimodal, generative policies exhibit erratic behavior at negligible computational cost. In contrast, we only use VLMs to detect failure modes that are less time-sensitive. We demonstrate our approach in the context of diffusion policies trained on robotic mobile manipulation domains in both simulation and the real world. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than using either of the two detectors alone and significantly outperforms baselines, thus highlighting the importance of assigning specialized detectors to complementary categories of failure. Qualitative results are made available at this https URL.

[AI-75] xt2Motion: From Natural Language Instructions to Feasible Plans

链接: https://arxiv.org/abs/2303.12153
作者: Kevin Lin,Christopher Agia,Toki Migimatsu,Marco Pavone,Jeannette Bohg
关键词-EN: framework enabling robots, Large Language Models, require long-horizon reasoning, planning framework enabling, enabling robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published in Autonomous Robots, Special Issue: Large Language Models in Robotics 2023. Project page: this https URL . First two authors contributed equally

点击查看摘要

Abstract:We propose Text2Motion, a language-based planning framework enabling robots to solve sequential manipulation tasks that require long-horizon reasoning. Given a natural language instruction, our framework constructs both a task- and motion-level plan that is verified to reach inferred symbolic goals. Text2Motion uses feasibility heuristics encoded in Q-functions of a library of skills to guide task planning with Large Language Models. Whereas previous language-based planners only consider the feasibility of individual skills, Text2Motion actively resolves geometric dependencies spanning skill sequences by performing geometric feasibility planning during its search. We evaluate our method on a suite of problems that require long-horizon reasoning, interpretation of abstract goals, and handling of partial affordance perception. Our experiments show that Text2Motion can solve these challenging problems with a success rate of 82%, while prior state-of-the-art language-based planning methods only achieve 13%. Text2Motion thus provides promising generalization characteristics to semantically diverse sequential manipulation tasks with geometric dependencies between skills.

[AI-76] Counterfactual MRI Data Augmentation using Conditional Denoising Diffusion Generative Models

链接: https://arxiv.org/abs/2410.23835
作者: Pedro Morão,Joao Santinha,Yasna Forghani,Nuno Loução,Pedro Gouveia,Mario A. T. Figueiredo
关键词-EN: image acquisition parameters, Deep learning, imaging face challenges, medical imaging face, acquisition parameters
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning (DL) models in medical imaging face challenges in generalizability and robustness due to variations in image acquisition parameters (IAP). In this work, we introduce a novel method using conditional denoising diffusion generative models (cDDGMs) to generate counterfactual magnetic resonance (MR) images that simulate different IAP without altering patient anatomy. We demonstrate that using these counterfactual images for data augmentation can improve segmentation accuracy, particularly in out-of-distribution settings, enhancing the overall generalizability and robustness of DL models across diverse imaging conditions. Our approach shows promise in addressing domain and covariate shifts in medical imaging. The code is publicly available at https: //github.com/pedromorao/Counterfactual-MRI-Data-Augmentation

[AI-77] MS-Glance: Non-semantic context vectors and the applications in supervising image reconstruction WACV2025

链接: https://arxiv.org/abs/2410.23577
作者: Ziqi Gao,Wendi Yang,Yujia Li,Lei Xing,S. Kevin Zhou
关键词-EN: identifying specific objects, human visual perception, visual perception system, process scenes rapidly, Glance Index Measure
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025

点击查看摘要

Abstract:Non-semantic context information is crucial for visual recognition, as the human visual perception system first uses global statistics to process scenes rapidly before identifying specific objects. However, while semantic information is increasingly incorporated into computer vision tasks such as image reconstruction, non-semantic information, such as global spatial structures, is often overlooked. To bridge the gap, we propose a biologically informed non-semantic context descriptor, \textbfMS-Glance, along with the Glance Index Measure for comparing two images. A Global Glance vector is formulated by randomly retrieving pixels based on a perception-driven rule from an image to form a vector representing non-semantic global context, while a local Glance vector is a flattened local image window, mimicking a zoom-in observation. The Glance Index is defined as the inner product of two standardized sets of Glance vectors. We evaluate the effectiveness of incorporating Glance supervision in two reconstruction tasks: image fitting with implicit neural representation (INR) and undersampled MRI reconstruction. Extensive experimental results show that MS-Glance outperforms existing image restoration losses across both natural and medical images. The code is available at \urlthis https URL.

[AI-78] STIED: A deep learning model for the SpatioTemporal detection of focal Interictal Epileptiform Discharges with MEG

链接: https://arxiv.org/abs/2410.23386
作者: Raquel Fernández-Martín,Alfonso Gijón,Odile Feys,Elodie Juvené,Alec Aeby,Charline Urbain,Xavier De Tiège,Vincent Wens
关键词-EN: interictal epileptiform discharges, Clinical MEG, clinical MEG practice, MEG, epileptiform discharges
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Magnetoencephalography (MEG) allows the non-invasive detection of interictal epileptiform discharges (IEDs). Clinical MEG analysis in epileptic patients traditionally relies on the visual identification of IEDs, which is time consuming and partially subjective. Automatic, data-driven detection methods exist but show limited performance. Still, the rise of deep learning (DL)-with its ability to reproduce human-like abilities-could revolutionize clinical MEG practice. Here, we developed and validated STIED, a simple yet powerful supervised DL algorithm combining two convolutional neural networks with temporal (1D time-course) and spatial (2D topography) features of MEG signals inspired from current clinical guidelines. Our DL model enabled both temporal and spatial localization of IEDs in patients suffering from focal epilepsy with frequent and high amplitude spikes (FE group), with high-performance metrics-accuracy, specificity, and sensitivity all exceeding 85%-when learning from spatiotemporal features of IEDs. This performance can be attributed to our handling of input data, which mimics established clinical MEG practice. Reverse engineering further revealed that STIED encodes fine spatiotemporal features of IEDs rather than their mere amplitude. The model trained on the FE group also showed promising results when applied to a separate group of presurgical patients with different types of refractory focal epilepsy, though further work is needed to distinguish IEDs from physiological transients. This study paves the way of incorporating STIED and DL algorithms into the routine clinical MEG evaluation of epilepsy.

[AI-79] Non-binary artificial neuron with phase variation implemented on a quantum computer

链接: https://arxiv.org/abs/2410.23373
作者: Jhordan Silveira de Borba,Jonas Maziero
关键词-EN: similar path, path to classic, classic models, Abstract, model
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 7 figures, to be published in Ciência e Natura (ISSN 2179-460X, DOI: https://doi.org/10.5902/2179460X )

点击查看摘要

Abstract:The first artificial quantum neuron models followed a similar path to classic models, as they work only with discrete values. Here we introduce an algorithm that generalizes the binary model manipulating the phase of complex numbers. We propose, test, and implement a neuron model that works with continuous values in a quantum computer. Through simulations, we demonstrate that our model may work in a hybrid training scheme utilizing gradient descent as a learning algorithm. This work represents another step in the direction of evaluation of the use of artificial neural networks efficiently implemented on near-term quantum devices.

[AI-80] ASURA-FDPS-ML: Star-by-star Galaxy Simulations Accelerated by Surrogate Modeling for Supernova Feedback

链接: https://arxiv.org/abs/2410.23346
作者: Keiya Hirashima,Kana Moriwaki,Michiko S. Fujii,Yutaka Hirai,Takayuki R. Saitoh,Junnichiro Makino,Ulrich P. Steinwandel,Shirley Ho
关键词-EN: Age Main Sequence, Main Sequence mass, galaxy simulations accelerated, model that reduces, reduces the computation
类目: Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 14 figures, 3 tables, submitted to ApJ

点击查看摘要

Abstract:We introduce new high-resolution galaxy simulations accelerated by a surrogate model that reduces the computation cost by approximately 75 percent. Massive stars with a Zero Age Main Sequence mass of about 8 solar masses and above explode as core-collapse supernovae (CCSNe), which play a critical role in galaxy formation. The energy released by CCSNe is essential for regulating star formation and driving feedback processes in the interstellar medium (ISM). However, the short integration timesteps required for SNe feedback present significant bottlenecks in star-by-star galaxy simulations that aim to capture individual stellar dynamics and the inhomogeneous shell expansion of SNe within the turbulent ISM. Our new framework combines direct numerical simulations and surrogate modeling, including machine learning and Gibbs sampling. The star formation history and the time evolution of outflow rates in the galaxy match those obtained from resolved direct numerical simulations. Our new approach achieves high-resolution fidelity while reducing computational costs, effectively bridging the physical scale gap and enabling multi-scale simulations.

[AI-81] Variable Resolution Sampling and Deep Learning Image Recovery for Accelerated Multi-Spectral MRI Near Metal Implants

链接: https://arxiv.org/abs/2410.23329
作者: Azadeh Sharafi,Nikolai J. Mickevicius,Mehran Baboli,Andrew S. Nencka,Kevin M. Koch
关键词-EN: Purpose, presents a variable, MRI, metal implants, MRI scans affected
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Purpose: This study presents a variable resolution (VR) sampling and deep learning reconstruction approach for multi-spectral MRI near metal implants, aiming to reduce scan times while maintaining image quality. Background: The rising use of metal implants has increased MRI scans affected by metal artifacts. Multi-spectral imaging (MSI) reduces these artifacts but sacrifices acquisition efficiency. Methods: This retrospective study on 1.5T MSI knee and hip data from patients with metal hardware used a novel spectral undersampling scheme to improve acquisition efficiency by ~40%. U-Net-based deep learning models were trained for reconstruction. Image quality was evaluated using SSIM, PSNR, and RESI metrics. Results: Deep learning reconstructions of undersampled VR data (DL-VR) showed significantly higher SSIM and PSNR values (p0.001) compared to conventional reconstruction (CR-VR), with improved edge sharpness. Edge sharpness in DL-reconstructed images matched fully sampled references (p=0.5). Conclusion: This approach can potentially enhance MRI examinations near metal implants by reducing scan times or enabling higher resolution. Further prospective studies are needed to assess clinical value.

[AI-82] ransfer Learning in Vocal Education: Technical Evaluation of Limited Samples Describing Mezzo-soprano

链接: https://arxiv.org/abs/2410.23325
作者: Zhenyi Hou,Xu Zhao,Kejie Ye,Xinyu Sheng,Shanggerile Jiang,Jiajing Xia,Yitao Zhang,Chenxi Ban,Daijun Luo,Jiaxing Chen,Yan Zou,Yuchao Feng,Guangyu Fan,Xin Yuan
关键词-EN: music education due, Mezzo-soprano Vocal Set, deep learning models, field is difficult, difficult to quantify
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Vocal education in the music field is difficult to quantify due to the individual differences in singers’ voices and the different quantitative criteria of singing techniques. Deep learning has great potential to be applied in music education due to its efficiency to handle complex data and perform quantitative analysis. However, accurate evaluations with limited samples over rare vocal types, such as Mezzo-soprano, requires extensive well-annotated data support using deep learning models. In order to attain the objective, we perform transfer learning by employing deep learning models pre-trained on the ImageNet and Urbansound8k datasets for the improvement on the precision of vocal technique evaluation. Furthermore, we tackle the problem of the lack of samples by constructing a dedicated dataset, the Mezzo-soprano Vocal Set (MVS), for vocal technique assessment. Our experimental results indicate that transfer learning increases the overall accuracy (OAcc) of all models by an average of 8.3%, with the highest accuracy at 94.2%. We not only provide a novel approach to evaluating Mezzo-soprano vocal techniques but also introduce a new quantitative assessment method for music education.

[AI-83] Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis

链接: https://arxiv.org/abs/2410.23320
作者: Théodor Lemerle,Harrison Vanderbyl,Vaibhav Srivastav,Nicolas Obin,Axel Roebel
关键词-EN: Neural codec language, Neural codec, leveraging scalable architectures, codec language models, leveraging scalable
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Preprint

点击查看摘要

Abstract:Neural codec language models have achieved state-of-the-art performance in text-to-speech (TTS) synthesis, leveraging scalable architectures like autoregressive transformers and large-scale speech datasets. By framing voice cloning as a prompt continuation task, these models excel at cloning voices from short audio samples. However, this approach is limited in its ability to handle numerous or lengthy speech excerpts, since the concatenation of source and target speech must fall within the maximum context length which is determined during training. In this work, we introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures like Gated Linear Attention (GLA). Building on the success of initial-state tuning on RWKV, we extend this technique to voice cloning, enabling the use of multiple speech samples and full utilization of the context window in synthesis. This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes. Notably, Lina-Speech matches or outperforms state-of-the-art baseline models, including some with a parameter count up to four times higher or trained in an end-to-end style. We release our code and checkpoints. Audio samples are available at this https URL.

[AI-84] Moral Agency in Silico: Exploring Free Will in Large Language Models

链接: https://arxiv.org/abs/2410.23310
作者: Morgan S. Porter
关键词-EN: specifically large language, large language models, Dennett compatibilist framework, specifically large, language models
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study investigates the potential of deterministic systems, specifically large language models (LLMs), to exhibit the functional capacities of moral agency and compatibilist free will. We develop a functional definition of free will grounded in Dennett’s compatibilist framework, building on an interdisciplinary theoretical foundation that integrates Shannon’s information theory, Dennett’s compatibilism, and Floridi’s philosophy of information. This framework emphasizes the importance of reason-responsiveness and value alignment in determining moral responsibility rather than requiring metaphysical libertarian free will. Shannon’s theory highlights the role of processing complex information in enabling adaptive decision-making, while Floridi’s philosophy reconciles these perspectives by conceptualizing agency as a spectrum, allowing for a graduated view of moral status based on a system’s complexity and responsiveness. Our analysis of LLMs’ decision-making in moral dilemmas demonstrates their capacity for rational deliberation and their ability to adjust choices in response to new information and identified inconsistencies. Thus, they exhibit features of a moral agency that align with our functional definition of free will. These results challenge traditional views on the necessity of consciousness for moral responsibility, suggesting that systems with self-referential reasoning capacities can instantiate degrees of free will and moral reasoning in artificial and biological contexts. This study proposes a parsimonious framework for understanding free will as a spectrum that spans artificial and biological systems, laying the groundwork for further interdisciplinary research on agency and ethics in the artificial intelligence era.

计算机视觉

[CV-0] URAvatar: Universal Relightable Gaussian Codec Avatars SIGGRAPH

链接: https://arxiv.org/abs/2410.24223
作者: Junxuan Li,Chen Cao,Gabriel Schwartz,Rawal Khirodkar,Christian Richardt,Tomas Simon,Yaser Sheikh,Shunsuke Saito
关键词-EN: creating photorealistic, unknown illumination, phone scan, phone, light transport
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: SIGGRAPH Asia 2024. Website: this https URL

点击查看摘要

Abstract:We present a new approach to creating photorealistic and relightable head avatars from a phone scan with unknown illumination. The reconstructed avatars can be animated and relit in real time with the global illumination of diverse environments. Unlike existing approaches that estimate parametric reflectance parameters via inverse rendering, our approach directly models learnable radiance transfer that incorporates global light transport in an efficient manner for real-time rendering. However, learning such a complex light transport that can generalize across identities is non-trivial. A phone scan in a single environment lacks sufficient information to infer how the head would appear in general environments. To address this, we build a universal relightable avatar model represented by 3D Gaussians. We train on hundreds of high-quality multi-view human scans with controllable point lights. High-resolution geometric guidance further enhances the reconstruction accuracy and generalization. Once trained, we finetune the pretrained model on a phone scan using inverse rendering to obtain a personalized relightable avatar. Our experiments establish the efficacy of our design, outperforming existing approaches while retaining real-time rendering capability.

[CV-1] EgoMimic: Scaling Imitation Learning via Egocentric Video

链接: https://arxiv.org/abs/2410.24221
作者: Simar Kareer,Dhruv Patel,Ryan Punamiya,Pranay Mathur,Shuo Cheng,Chen Wang,Judy Hoffman,Danfei Xu
关键词-EN: data, human embodiment data, demonstration data required, imitation learning, human
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at this https URL

[CV-2] Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning NEURIPS2024

链接: https://arxiv.org/abs/2410.24219
作者: Penghui Ruan,Pichao Wang,Divya Saxena,Jiannong Cao,Yuhui Shi
关键词-EN: motion remains challenging, realistic motion remains, remains challenging, motion, generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024, code available at this https URL

点击查看摘要

Abstract:Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model’s understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO’s superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: this https URL

[CV-3] ARQ: A Mixed-Precision Quantization Framework for Accurate and Certifiably Robust DNNs

链接: https://arxiv.org/abs/2410.24214
作者: Yuchen Yang,Shubham Ugare,Yifan Zhao,Gagandeep Singh,Sasa Misailovic
关键词-EN: Mixed precision quantization, resource computing platforms, limited resource computing, Mixed precision, deep neural networks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Mixed precision quantization has become an important technique for enabling the execution of deep neural networks (DNNs) on limited resource computing platforms. Traditional quantization methods have primarily concentrated on maintaining neural network accuracy, either ignoring the impact of quantization on the robustness of the network, or using only empirical techniques for improving robustness. In contrast, techniques for robustness certification, which can provide strong guarantees about the robustness of DNNs have not been used during quantization due to their high computation cost. This paper introduces ARQ, an innovative mixed-precision quantization method that not only preserves the clean accuracy of the smoothed classifiers but also maintains their certified robustness. ARQ uses reinforcement learning to find accurate and robust DNN quantization, while efficiently leveraging randomized smoothing, a popular class of statistical DNN verification algorithms, to guide the search process. We compare ARQ with multiple state-of-the-art quantization techniques on several DNN architectures commonly used in quantization studies: ResNet-20 on CIFAR-10, ResNet-50 on ImageNet, and MobileNetV2 on ImageNet. We demonstrate that ARQ consistently performs better than these baselines across all the benchmarks and the input perturbation levels. In many cases, the performance of ARQ quantized networks can reach that of the original DNN with floating-point weights, but with only 1.5% instructions. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.24214 [cs.LG] (or arXiv:2410.24214v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.24214 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-4] Learning Video Representations without Natural Videos

链接: https://arxiv.org/abs/2410.24213
作者: Xueyang Yu,Xinlei Chen,Yossi Gandelsman
关键词-EN: incorporating natural videos, natural, video, model, pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes, that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.

[CV-5] DELTA: Dense Efficient Long-range 3D Tracking for any video

链接: https://arxiv.org/abs/2410.24211
作者: Tuan Duc Ngo,Peiye Zhuang,Chuang Gan,Evangelos Kalogerakis,Sergey Tulyakov,Hsin-Ying Lee,Chaoyang Wang
关键词-EN: videos remains challenging, monocular videos remains, remains challenging, long sequences, aiming for pixel-level
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce \Approach, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, \Approach delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of \Approach on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.

[CV-6] No Pose No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

链接: https://arxiv.org/abs/2410.24207
作者: Botao Ye,Sifei Liu,Haofei Xu,Xueting Li,Marc Pollefeys,Ming-Hsuan Yang,Songyou Peng
关键词-EN: sparse multi-view images, feed-forward model capable, introduce NoPoSplat, capable of reconstructing, sparse multi-view
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We introduce NoPoSplat, a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from \textitunposed sparse multi-view images. Our model, trained exclusively with photometric loss, achieves real-time 3D Gaussian reconstruction during inference. To eliminate the need for accurate pose input during reconstruction, we anchor one input view’s local camera coordinates as the canonical space and train the network to predict Gaussian primitives for all views within this space. This approach obviates the need to transform Gaussian primitives from local coordinates into a global coordinate system, thus avoiding errors associated with per-frame Gaussians and pose estimation. To resolve scale ambiguity, we design and compare various intrinsic embedding methods, ultimately opting to convert camera intrinsics into a token embedding and concatenate it with image tokens as input to the model, enabling accurate scene scale prediction. We utilize the reconstructed 3D Gaussians for novel view synthesis and pose estimation tasks and propose a two-stage coarse-to-fine pipeline for accurate pose estimation. Experimental results demonstrate that our pose-free approach can achieve superior novel view synthesis quality compared to pose-required methods, particularly in scenarios with limited input image overlap. For pose estimation, our method, trained without ground truth depth or explicit matching loss, significantly outperforms the state-of-the-art methods with substantial improvements. This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios. Code and trained models are available at this https URL.

[CV-7] GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

链接: https://arxiv.org/abs/2410.24204
作者: Kai Ye,Chong Gao,Guanbin Li,Wenzheng Chen,Baoquan Chen
关键词-EN: Gaussian Splatting, physically-based inverse rendering, Gaussian, Splatting, PBR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We consider the problem of physically-based inverse rendering using 3D Gaussian Splatting (3DGS) representations. While recent 3DGS methods have achieved remarkable results in novel view synthesis (NVS), accurately capturing high-fidelity geometry, physically interpretable materials and lighting remains challenging, as it requires precise geometry modeling to provide accurate surface normals, along with physically-based rendering (PBR) techniques to ensure correct material and lighting disentanglement. Previous 3DGS methods resort to approximating surface normals, but often struggle with noisy local geometry, leading to inaccurate normal estimation and suboptimal material-lighting decomposition. In this paper, we introduce GeoSplatting, a novel hybrid representation that augments 3DGS with explicit geometric guidance and differentiable PBR equations. Specifically, we bridge isosurface and 3DGS together, where we first extract isosurface mesh from a scalar field, then convert it into 3DGS points and formulate PBR equations for them in a fully differentiable manner. In GeoSplatting, 3DGS is grounded on the mesh geometry, enabling precise surface normal modeling, which facilitates the use of PBR frameworks for material decomposition. This approach further maintains the efficiency and quality of NVS from 3DGS while ensuring accurate geometry from the isosurface. Comprehensive evaluations across diverse datasets demonstrate the superiority of GeoSplatting, consistently outperforming existing methods both quantitatively and qualitatively.

[CV-8] Extended Object Tracking and Classification based on Linear Splines

链接: https://arxiv.org/abs/2410.24183
作者: Matteo Tesori,Giorgio Battistelli,Luigi Chisci
关键词-EN: extended object tracking, extended object, linear splines, paper introduces, introduces a framework
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a framework based on linear splines for 2-dimensional extended object tracking and classification. Unlike state of the art models, linear splines allow to represent extended objects whose contour is an arbitrarily complex curve. An exact likelihood is derived for the case in which noisy measurements can be scattered from any point on the contour of the extended object, while an approximate Monte Carlo likelihood is provided for the case wherein scattering points can be anywhere, i.e. inside or on the contour, on the object surface. Exploiting such likelihood to measure how well the observed data fit a given shape, a suitable estimator is developed. The proposed estimator models the extended object in terms of a kinematic state, providing object position and orientation, along with a shape vector, characterizing object contour and surface. The kinematic state is estimated via a nonlinear Kalman filter, while the shape vector is estimated via a Bayesian classifier so that classification is implicitly solved during shape estimation. Numerical experiments are provided to assess, compared to state of the art extended object estimators, the effectiveness of the proposed one.

[CV-9] Federated Black-Box Adaptation for Semantic Segmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.24181
作者: Jay N. Paranjape,Shameema Sikder,S. Swaroop Vedula,Vishal M. Patel
关键词-EN: solve a task, form of distributed, collaboratively learn, Federated Learning, model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NEURIPS 2024

点击查看摘要

Abstract:Federated Learning (FL) is a form of distributed learning that allows multiple institutions or clients to collaboratively learn a global model to solve a task. This allows the model to utilize the information from every institute while preserving data privacy. However, recent studies show that the promise of protecting the privacy of data is not upheld by existing methods and that it is possible to recreate the training data from the different institutions. This is done by utilizing gradients transferred between the clients and the global server during training or by knowing the model architecture at the client end. In this paper, we propose a federated learning framework for semantic segmentation without knowing the model architecture nor transferring gradients between the client and the server, thus enabling better privacy preservation. We propose BlackFed - a black-box adaptation of neural networks that utilizes zero order optimization (ZOO) to update the client model weights and first order optimization (FOO) to update the server weights. We evaluate our approach on several computer vision and medical imaging datasets to demonstrate its effectiveness. To the best of our knowledge, this work is one of the first works in employing federated learning for segmentation, devoid of gradients or model information exchange. Code: this https URL

[CV-10] Exploring Vision Language Models for Facial Attribute Recognition: Emotion Race Gender and Age

链接: https://arxiv.org/abs/2410.24148
作者: Nouar AlDahoul,Myles Joshua Toledo Tan,Harishwar Reddy Kasireddy,Yasir Zaki
关键词-EN: Technologies for recognizing, recognizing facial attributes, advertising content, sentiment analysis, social behaviors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 52 pages, 13 figures

点击查看摘要

Abstract:Technologies for recognizing facial attributes like race, gender, age, and emotion have several applications, such as surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics based on images and analyzing facial expressions have several challenges due to the complexity of humans’ facial attributes. Traditional approaches have employed CNNs and various other deep learning techniques, trained on extensive collections of labeled images. While these methods demonstrated effective performance, there remains potential for further enhancements. In this paper, we propose to utilize vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images with human faces. Various datasets like FairFace, AffectNet, and UTKFace have been utilized to evaluate the solutions. The results show that VLMs are competitive if not superior to traditional techniques. Additionally, we propose “FaceScanPaliGemma”–a fine-tuned PaliGemma model–for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose “FaceScanGPT”, which is a GPT-4o model to recognize the above attributes when several individuals are present in the image using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT to detect the individual’s attributes like hair cut, clothing color, postures, etc., using only a prompt to drive the detection and recognition tasks.

[CV-11] HoloChrome: Polychromatic Illumination for Speckle Reduction in Holographic Near-Eye Displays

链接: https://arxiv.org/abs/2410.24144
作者: Florian Schiffers,Grace Kuo,Nathan Matsuda,Douglas Lanman,Oliver Cossairt
关键词-EN: authentic depth cues, providing authentic depth, Holographic displays hold, depth cues, resulting in enhanced
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Holographic displays hold the promise of providing authentic depth cues, resulting in enhanced immersive visual experiences for near-eye applications. However, current holographic displays are hindered by speckle noise, which limits accurate reproduction of color and texture in displayed images. We present HoloChrome, a polychromatic holographic display framework designed to mitigate these limitations. HoloChrome utilizes an ultrafast, wavelength-adjustable laser and a dual-Spatial Light Modulator (SLM) architecture, enabling the multiplexing of a large set of discrete wavelengths across the visible spectrum. By leveraging spatial separation in our dual-SLM setup, we independently manipulate speckle patterns across multiple wavelengths. This novel approach effectively reduces speckle noise through incoherent averaging achieved by wavelength multiplexing. Our method is complementary to existing speckle reduction techniques, offering a new pathway to address this challenge. Furthermore, the use of polychromatic illumination broadens the achievable color gamut compared to traditional three-color primary holographic displays. Our simulations and tabletop experiments validate that HoloChrome significantly reduces speckle noise and expands the color gamut. These advancements enhance the performance of holographic near-eye displays, moving us closer to practical, immersive next-generation visual experiences. Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Optics (physics.optics) Cite as: arXiv:2410.24144 [cs.GR] (or arXiv:2410.24144v1 [cs.GR] for this version) https://doi.org/10.48550/arXiv.2410.24144 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-12] COSNet: A Novel Semantic Segmentation Network using Enhanced Boundaries in Cluttered Scenes WACV2025

链接: https://arxiv.org/abs/2410.24139
作者: Muhammad Ali,Mamoona Javaid,Mubashir Noman,Mustansar Fiaz,Salman Khan
关键词-EN: Automated waste recycling, employing vision-based systems, waste recycling aims, Automated waste, vision-based systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:Automated waste recycling aims to efficiently separate the recyclable objects from the waste by employing vision-based systems. However, the presence of varying shaped objects having different material types makes it a challenging problem, especially in cluttered environments. Existing segmentation methods perform reasonably on many semantic segmentation datasets by employing multi-contextual representations, however, their performance is degraded when utilized for waste object segmentation in cluttered scenarios. In addition, plastic objects further increase the complexity of the problem due to their translucent nature. To address these limitations, we introduce an efficacious segmentation network, named COSNet, that uses boundary cues along with multi-contextual information to accurately segment the objects in cluttered scenes. COSNet introduces novel components including feature sharpening block (FSB) and boundary enhancement module (BEM) for enhancing the features and highlighting the boundary information of irregular waste objects in cluttered environment. Extensive experiments on three challenging datasets including ZeroWaste-f, SpectralWaste, and ADE20K demonstrate the effectiveness of the proposed method. Our COSNet achieves a significant gain of 1.8% on ZeroWaste-f and 2.1% on SpectralWaste datasets respectively in terms of mIoU metric.

[CV-13] Identifying Spatio-Temporal Drivers of Extreme Events NEURIPS2024

链接: https://arxiv.org/abs/2410.24075
作者: Mohamad Hakam Shams Eddin,Juergen Gall
关键词-EN: machine learning approaches, spatio-temporal relations, fully understood, machine learning, learning approaches
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:The spatio-temporal relations of impacts of extreme events and their drivers in climate data are not fully understood and there is a need of machine learning approaches to identify such spatio-temporal relations from data. The task, however, is very challenging since there are time delays between extremes and their drivers, and the spatial response of such drivers is inhomogeneous. In this work, we propose a first approach and benchmarks to tackle this challenge. Our approach is trained end-to-end to predict spatio-temporally extremes and spatio-temporally drivers in the physical input variables jointly. By enforcing the network to predict extremes from spatio-temporal binary masks of identified drivers, the network successfully identifies drivers that are correlated with extremes. We evaluate our approach on three newly created synthetic benchmarks, where two of them are based on remote sensing or reanalysis climate data, and on two real-world reanalysis datasets. The source code and datasets are publicly available at the project page this https URL.

[CV-14] Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure

链接: https://arxiv.org/abs/2410.24060
作者: Xiang Li,Yixiang Dai,Qing Qu
关键词-EN: nonlinear diffusion denoisers, diffusion models, learned score functions, deep denoisers trained, nonlinear diffusion models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model’s capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.

[CV-15] Advanced Predictive Quality Assessment for Ultrasonic Additive Manufacturing with Deep Learning Model

链接: https://arxiv.org/abs/2410.24055
作者: Lokendra Poudel,Sushant Jha,Ryan Meeker,Duy-Nhat Phan,Rahul Bhowmik
关键词-EN: Ultrasonic Additive Manufacturing, employs ultrasonic welding, consolidated metal components, Ultrasonic Additive, dissimilar metal foils
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ultrasonic Additive Manufacturing (UAM) employs ultrasonic welding to bond similar or dissimilar metal foils to a substrate, resulting in solid, consolidated metal components. However, certain processing conditions can lead to inter-layer defects, affecting the final product’s quality. This study develops a method to monitor in-process quality using deep learning-based convolutional neural networks (CNNs). The CNN models were evaluated on their ability to classify samples with and without embedded thermocouples across five power levels (300W, 600W, 900W, 1200W, 1500W) using thermal images with supervised labeling. Four distinct CNN classification models were created for different scenarios including without (baseline) and with thermocouples, only without thermocouples across power levels, only with thermocouples across power levels, and combined without and with thermocouples across power levels. The models achieved 98.29% accuracy on combined baseline and thermocouple images, 97.10% for baseline images across power levels, 97.43% for thermocouple images, and 97.27% for both types across power levels. The high accuracy, above 97%, demonstrates the system’s effectiveness in identifying and classifying conditions within the UAM process, providing a reliable tool for quality assurance and process control in manufacturing environments.

[CV-16] PC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation NEURIPS2024

链接: https://arxiv.org/abs/2410.24037
作者: Sunjae Yoon,Gwanhyeong Koo,Younghwan Lee,Chang D. Yoo
关键词-EN: human motion video, target motion video, motion video, diffusion-based image animation, image animation aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, 16 figures, NeurIPS 2024

点击查看摘要

Abstract:Human image animation aims to generate a human motion video from the inputs of a reference human image and a target motion video. Current diffusion-based image animation systems exhibit high precision in transferring human identity into targeted motion, yet they still exhibit irregular quality in their outputs. Their optimal precision is achieved only when the physical compositions (i.e., scale and rotation) of the human shapes in the reference image and target pose frame are aligned. In the absence of such alignment, there is a noticeable decline in fidelity and consistency. Especially, in real-world environments, this compositional misalignment commonly occurs, posing significant challenges to the practical usage of current systems. To this end, we propose Test-time Procrustes Calibration (TPC), which enhances the robustness of diffusion-based image animation systems by maintaining optimal performance even when faced with compositional misalignment, effectively addressing real-world scenarios. The TPC provides a calibrated reference image for the diffusion model, enhancing its capability to understand the correspondence between human shapes in the reference and target images. Our method is simple and can be applied to any diffusion-based image animation system in a model-agnostic manner, improving the effectiveness at test time without additional training.

[CV-17] Handwriting Recognition in Historical Documents with Multimodal LLM

链接: https://arxiv.org/abs/2410.24034
作者: Lucian Li
关键词-EN: immense quantity, quantity of historical, documentation that exists, Optical Character Recognition, performing OCR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and different handwriting styles has proven to be an enormously difficult problem relative to the process of digitizing print. While recent Transformer based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLM, such as GPT-4v and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state of the art Transformer based methods. Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass digitization, Handwriting Recognitio Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.24034 [cs.CV] (or arXiv:2410.24034v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.24034 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-18] Bayesian-guided Label Mapping for Visual Reprogramming

链接: https://arxiv.org/abs/2410.24018
作者: Chengyi Cai,Zesheng Ye,Lei Feng,Jianzhong Qi,Feng Liu
关键词-EN: Visual reprogramming, solve downstream tasks, downstream labels, label mapping, labels
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual reprogramming (VR) leverages the intrinsic capabilities of pretrained vision models by adapting their input or output interfaces to solve downstream tasks whose labels (i.e., downstream labels) might be totally different from the labels associated with the pretrained models (i.e., pretrained labels). When adapting the output interface, label mapping methods transform the pretrained labels to downstream labels by establishing a gradient-free one-to-one correspondence between the two sets of labels. However, in this paper, we reveal that one-to-one mappings may overlook the complex relationship between pretrained and downstream labels. Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method. BLM constructs an iteratively-updated probabilistic label mapping matrix, with each element quantifying a pairwise relationship between pretrained and downstream labels. The assignment of values to the constructed matrix is guided by Bayesian conditional probability, considering the joint distribution of the downstream labels and the labels predicted by the pretrained model on downstream samples. Experiments conducted on both pretrained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate the superior performance of BLM over existing label mapping methods. The success of BLM also offers a probabilistic lens through which to understand and analyze the effectiveness of VR. Our code is available at this https URL.

[CV-19] Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities NEURIPS2024

链接: https://arxiv.org/abs/2410.24015
作者: Hatef Otroshi Shahreza,Sébastien Marcel
关键词-EN: computer vision applications, gaining increasing popularity, synthetic face recognition, face recognition, face recognition datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in NeurIPS 2024 Workshop on New Frontiers in Adversarial Machine Learning

点击查看摘要

Abstract:Synthetic data generation is gaining increasing popularity in different computer vision applications. Existing state-of-the-art face recognition models are trained using large-scale face datasets, which are crawled from the Internet and raise privacy and ethical concerns. To address such concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models, which are trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study if any of the existing synthetic face recognition datasets leak any information from the real data used to train the generator model. We provide an extensive study on 6 state-of-the-art synthetic face recognition datasets, and show that in all these synthetic datasets, several samples from the original real dataset are leaked. To our knowledge, this paper is the first work which shows the leakage from training data of generator models into the generated synthetic face recognition datasets. Our study demonstrates privacy pitfalls in synthetic face recognition datasets and paves the way for future studies on generating responsible synthetic face datasets.

[CV-20] Re-assembling the past: The RePAIR dataset and benchmark for real world 2D and 3D puzzle solving NEURIPS2024

链接: https://arxiv.org/abs/2410.24010
作者: Theodore Tsesmelis,Luca Palmieri,Marina Khoroshiltseva,Adeela Islam,Gur Elkin,Ofir Itzhak Shahar,Gianluca Scarpellini,Stefano Fiorini,Yaniv Ohayon,Nadav Alali,Sinem Aslan,Pietro Morerio,Sebastiano Vascon,Elena Gravina,Maria Cristina Napolitano,Giuseppe Scarpati,Gabriel Zuchtriegel,Alexandra Spühler,Michel E. Fuchs,Stuart James,Ohad Ben-Shahar,Marcello Pelillo,Alessio Del Bue
关键词-EN: test modern computational, data driven methods, paper proposes, proposes the RePAIR, test modern
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024, Track Datasets and Benchmarks, 10 pages

点击查看摘要

Abstract:This paper proposes the RePAIR dataset that represents a challenging benchmark to test modern computational and data driven methods for puzzle-solving and reassembly tasks. Our dataset has unique properties that are uncommon to current benchmarks for 2D and 3D puzzle solving. The fragments and fractures are realistic, caused by a collapse of a fresco during a World War II bombing at the Pompeii archaeological park. The fragments are also eroded and have missing pieces with irregular shapes and different dimensions, challenging further the reassembly algorithms. The dataset is multi-modal providing high resolution images with characteristic pictorial elements, detailed 3D scans of the fragments and meta-data annotated by the archaeologists. Ground truth has been generated through several years of unceasing fieldwork, including the excavation and cleaning of each fragment, followed by manual puzzle solving by archaeologists of a subset of approx. 1000 pieces among the 16000 available. After digitizing all the fragments in 3D, a benchmark was prepared to challenge current reassembly and puzzle-solving methods that often solve more simplistic synthetic scenarios. The tested baselines show that there clearly exists a gap to fill in solving this computationally complex problem.

[CV-21] DiffPAD: Denoising Diffusion-based Adversarial Patch Decontamination WACV

链接: https://arxiv.org/abs/2410.24006
作者: Jia Fu,Xiao Zhang,Sepideh Pashami,Fatemeh Rahimian,Anders Holst
关键词-EN: machine learning landscape, developing effective defenses, necessitating reliable solutions, ever-evolving adversarial machine, adversarial machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

点击查看摘要

Abstract:In the ever-evolving adversarial machine learning landscape, developing effective defenses against patch attacks has become a critical challenge, necessitating reliable solutions to safeguard real-world AI systems. Although diffusion models have shown remarkable capacity in image synthesis and have been recently utilized to counter \ell_p -norm bounded attacks, their potential in mitigating localized patch attacks remains largely underexplored. In this work, we propose DiffPAD, a novel framework that harnesses the power of diffusion models for adversarial patch decontamination. DiffPAD first performs super-resolution restoration on downsampled input images, then adopts binarization, dynamic thresholding scheme and sliding window for effective localization of adversarial patches. Such a design is inspired by the theoretically derived correlation between patch size and diffusion restoration error that is generalized across diverse patch attack scenarios. Finally, DiffPAD applies inpainting techniques to the original input images with the estimated patch region being masked. By integrating closed-form solutions for super-resolution restoration and image inpainting into the conditional reverse sampling process of a pre-trained diffusion model, DiffPAD obviates the need for text guidance or fine-tuning. Through comprehensive experiments, we demonstrate that DiffPAD not only achieves state-of-the-art adversarial robustness against patch attacks but also excels in recovering naturalistic images without patch remnants.

[CV-22] ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images NEURIPS2024

链接: https://arxiv.org/abs/2410.24001
作者: Timing Yang,Yuanliang Ju,Li Yi
关键词-EN: base categories labeled, object detection, aims to generalize, limited number, number of base
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Code link this https URL

点击查看摘要

Abstract:Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase. The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated. Consequently, it is intuitive to leverage the wealth of annotations in 2D images to alleviate the inherent data scarcity in OV-3Det. In this paper, we push the task setup to its limits by exploring the potential of using solely 2D images to learn OV-3Det. The major challenges for this setup is the modality gap between training images and testing point clouds, which prevents effective integration of 2D knowledge into OV-3Det. To address this challenge, we propose a novel framework ImOV3D to leverage pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap. The key of ImOV3D lies in flexible modality conversion where 2D images can be lifted into 3D using monocular depth estimation and can also be derived from 3D scenes through rendering. This allows unifying both training images and testing point clouds into a common image-PC representation, encompassing a wealth of 2D semantic information and also incorporating the depth and structural characteristics of 3D spatial data. We carefully conduct such conversion to minimize the domain gap between training and test cases. Extensive experiments on two benchmark datasets, SUNRGBD and ScanNet, show that ImOV3D significantly outperforms existing methods, even in the absence of ground truth 3D training data. With the inclusion of a minimal amount of real 3D data for fine-tuning, the performance also significantly surpasses previous state-of-the-art. Codes and pre-trained models are released on the this https URL.

[CV-23] JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment

链接: https://arxiv.org/abs/2410.23988
作者: Joao Sousa,Roya Darabi,Armando Sousa,Frank Brueckner,Luís Paulo Reis,Ana Reis
关键词-EN: laser metal deposition, metal additive manufacturing, work introduces JEMA, Multimodal Alignment, co-learning framework tailored
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 14 figures

点击查看摘要

Abstract:This work introduces JEMA (Joint Embedding with Multimodal Alignment), a novel co-learning framework tailored for laser metal deposition (LMD), a pivotal process in metal additive manufacturing. As Industry 5.0 gains traction in industrial applications, efficient process monitoring becomes increasingly crucial. However, limited data and the opaque nature of AI present challenges for its application in an industrial setting. JEMA addresses this challenges by leveraging multimodal data, including multi-view images and metadata such as process parameters, to learn transferable semantic representations. By applying a supervised contrastive loss function, JEMA enables robust learning and subsequent process monitoring using only the primary modality, simplifying hardware requirements and computational overhead. We investigate the effectiveness of JEMA in LMD process monitoring, focusing specifically on its generalization to downstream tasks such as melt pool geometry prediction, achieved without extensive fine-tuning. Our empirical evaluation demonstrates the high scalability and performance of JEMA, particularly when combined with Vision Transformer models. We report an 8% increase in performance in multimodal settings and a 1% improvement in unimodal settings compared to supervised contrastive learning. Additionally, the learned embedding representation enables the prediction of metadata, enhancing interpretability and making possible the assessment of the added metadata’s contributions. Our framework lays the foundation for integrating multisensor data with metadata, enabling diverse downstream tasks within the LMD domain and beyond.

[CV-24] rAct: Making First-layer Pre-Activations Trainable NEURIPS2024

链接: https://arxiv.org/abs/2410.23970
作者: Felix Petersen,Christian Borgelt,Stefano Ermon
关键词-EN: definition directly proportional, gradient update magnitudes, update magnitudes, notice the clear, clear relationship
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at NeurIPS 2024

点击查看摘要

Abstract:We consider the training of the first layer of vision models and notice the clear relationship between pixel values and gradient update magnitudes: the gradients arriving at the weights of a first layer are by definition directly proportional to (normalized) input pixel values. Thus, an image with low contrast has a smaller impact on learning than an image with higher contrast, and a very bright or very dark image has a stronger impact on the weights than an image with moderate brightness. In this work, we propose performing gradient descent on the embeddings produced by the first layer of the model. However, switching to discrete inputs with an embedding layer is not a reasonable option for vision models. Thus, we propose the conceptual procedure of (i) a gradient descent step on first layer activations to construct an activation proposal, and (ii) finding the optimal weights of the first layer, i.e., those weights which minimize the squared distance to the activation proposal. We provide a closed form solution of the procedure and adjust it for robust stochastic training while computing everything efficiently. Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25x and 4x while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of different vision models including convolutional and transformer architectures.

[CV-25] MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption

链接: https://arxiv.org/abs/2410.23946
作者: Ruixun Liu,Kaiyu Li,Jiayi Song,Dongwei Sun,Xiangyong Cao
关键词-EN: Remote sensing image, bi-temporal remote sensing, Remote sensing, provide natural language, natural language descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote sensing image change caption (RSICC) aims to provide natural language descriptions for bi-temporal remote sensing images. Since Change Caption (CC) task requires both spatial and temporal features, previous works follow an encoder-fusion-decoder architecture. They use an image encoder to extract spatial features and the fusion module to integrate spatial features and extract temporal features, which leads to increasingly complex manual design of the fusion module. In this paper, we introduce a novel video model-based paradigm without design of the fusion module and propose a Mask-enhanced Video model for Change Caption (MV-CC). Specifically, we use the off-the-shelf video encoder to simultaneously extract the temporal and spatial features of bi-temporal images. Furthermore, the types of changes in the CC are set based on specific task requirements, and to enable the model to better focus on the regions of interest, we employ masks obtained from the Change Detection (CD) method to explicitly guide the CC model. Experimental results demonstrate that our proposed method can obtain better performance compared with other state-of-the-art RSICC methods. The code is available at this https URL.

[CV-26] Manipulating Vehicle 3D Shapes through Latent Space Editing

链接: https://arxiv.org/abs/2410.23931
作者: JiangDong Miao,Tatsuya Ikeda,Bisser Raytchev,Ryota Mizoguchi,Takenori Hiraoka,Takuji Nakashima,Keigo Shimizu,Toru Higaki,Kazufumi Kaneda
关键词-EN: influence various industries, recent research, potential to significantly, significantly influence, primarily focused
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 12 figures

点击查看摘要

Abstract:Although 3D object editing has the potential to significantly influence various industries, recent research in 3D generation and editing has primarily focused on converting text and images into 3D models, often overlooking the need for fine-grained control over the editing of existing 3D objects. This paper introduces a framework that employs a pre-trained regressor, enabling continuous, precise, attribute-specific modifications to both the stylistic and geometric attributes of vehicle 3D models. Our method not only preserves the inherent identity of vehicle 3D objects, but also supports multi-attribute editing, allowing for extensive customization without compromising the model’s structural integrity. Experimental results demonstrate the efficacy of our approach in achieving detailed edits on various vehicle 3D models.

[CV-27] Uncertainty Estimation for 3D Object Detection via Evidential Learning

链接: https://arxiv.org/abs/2410.23910
作者: Nikita Durasov,Rafid Mahmood,Jiwoong Choi,Marc T. Law,James Lucas,Pascal Fua,Jose M. Alvarez
关键词-EN: computer vision applications, Bird Eye View, vehicles and robotics, computer vision, vision applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D object detection is an essential task for computer vision applications in autonomous vehicles and robotics. However, models often struggle to quantify detection reliability, leading to poor performance on unfamiliar scenes. We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on Bird’s Eye View representations in the 3D detector. These uncertainty estimates require minimal computational overhead and are generalizable across different architectures. We demonstrate both the efficacy and importance of these uncertainty estimates on identifying out-of-distribution scenes, poorly localized objects, and missing (false negative) detections; our framework consistently improves over baselines by 10-20% on average. Finally, we integrate this suite of tasks into a system where a 3D object detector auto-labels driving scenes and our uncertainty estimates verify label correctness before the labels are used to train a second model. Here, our uncertainty-driven verification results in a 1% improvement in mAP and a 1-2% improvement in NDS.

[CV-28] IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking

链接: https://arxiv.org/abs/2410.23907
作者: Run Luo,Zikai Song,Longze Chen,Yunshui Li,Min Yang,Wei Yang
关键词-EN: associate multiple objects, challenging vision task, vision task due, aims to associate, associate multiple
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-Object Tracking (MOT) aims to associate multiple objects across video frames and is a challenging vision task due to inherent complexities in the tracking environment. Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability to data from other domains. While several works have introduced natural language representation to bridge the domain gap in visual tracking, these textual descriptions often provide too high-level a view and fail to distinguish various instances within the same class. In this paper, we address this limitation by developing IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions. Our approach is underpinned by two key innovations: Firstly, leveraging a pre-trained vision-language model, we obtain instance-level pseudo textual descriptions via prompt-tuning, which are invariant across different tracking scenes; Secondly, we introduce a query-balanced strategy, augmented by knowledge distillation, to further boost the generalization capabilities of our model. Extensive experiments conducted on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach not only achieves competitive performance on same-domain data compared to state-of-the-art models but also significantly improves the performance of query-based trackers by large margins for cross-domain inputs.

[CV-29] From Web Data to Real Fields: Low-Cost Unsupervised Domain Adaptation for Agricultural Robots

链接: https://arxiv.org/abs/2410.23906
作者: Vasileios Tzouras,Lazaros Nalpantidis,Ronja Güldenring
关键词-EN: Unsupervised Domain Adaptation, precision agriculture, external factors, resulting in compositions, learned distribution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In precision agriculture, vision models often struggle with new, unseen fields where crops and weeds have been influenced by external factors, resulting in compositions and appearances that differ from the learned distribution. This paper aims to adapt to specific fields at low cost using Unsupervised Domain Adaptation (UDA). We explore a novel domain shift from a diverse, large pool of internet-sourced data to a small set of data collected by a robot at specific locations, minimizing the need for extensive on-field data collection. Additionally, we introduce a novel module – the Multi-level Attention-based Adversarial Discriminator (MAAD) – which can be integrated at the feature extractor level of any detection model. In this study, we incorporate MAAD with CenterNet to simultaneously detect leaf, stem, and vein instances. Our results show significant performance improvements in the unlabeled target domain compared to baseline models, with a 7.5% increase in object detection accuracy and a 5.1% improvement in keypoint detection.

[CV-30] xt-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model NEURIPS2024

链接: https://arxiv.org/abs/2410.23905
作者: Hao Zhang,Lei Cao,Jiayi Ma
关键词-EN: Existing multi-modal image, Existing multi-modal, multi-modal image fusion, fusion images plagued, fusion methods fail
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Existing multi-modal image fusion methods fail to address the compound degradations presented in source images, resulting in fusion images plagued by noise, color bias, improper exposure, \textitetc. Additionally, these methods often overlook the specificity of foreground objects, weakening the salience of the objects of interest within the fused images. To address these challenges, this study proposes a novel interactive multi-modal image fusion framework based on the text-modulated diffusion model, called Text-DiFuse. First, this framework integrates feature-level information integration into the diffusion process, allowing adaptive degradation removal and multi-modal information fusion. This is the first attempt to deeply and explicitly embed information fusion within the diffusion process, effectively addressing compound degradation in image fusion. Second, by embedding the combination of the text and zero-shot location model into the diffusion fusion process, a text-controlled fusion re-modulation strategy is developed. This enables user-customized text control to improve fusion performance and highlight foreground objects in the fused images. Extensive experiments on diverse public datasets show that our Text-DiFuse achieves state-of-the-art fusion performance across various scenarios with complex degradation. Moreover, the semantic segmentation experiment validates the significant enhancement in semantic performance achieved by our text-controlled fusion re-modulation strategy. The code is publicly available at this https URL.

[CV-31] EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.23904
作者: Qinqian Lei,Bo Wang,Robby T. Tan
关键词-EN: Detecting Human-Object Interactions, Detecting Human-Object, Human-Object Interactions, poses significant challenges, poses significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into the extensive knowledge of VLMs, require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen class labels. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes instead of unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs utilized to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods. Code is available at this https URL.

[CV-32] NeFF-BioNet: Crop Biomass Prediction from Point Cloud to Drone Imagery

链接: https://arxiv.org/abs/2410.23901
作者: Xuesong Li,Zeeshan Hayder,Ali Zia,Connor Cassidy,Shiming Liu,Warwick Stiller,Eric Stone,Warren Conaty,Lars Petersson,Vivien Rolland
关键词-EN: offers crucial insights, Crop biomass offers, crop science, biomass offers crucial, farming systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Crop biomass offers crucial insights into plant health and yield, making it essential for crop science, farming systems, and agricultural research. However, current measurement methods, which are labor-intensive, destructive, and imprecise, hinder large-scale quantification of this trait. To address this limitation, we present a biomass prediction network (BioNet), designed for adaptation across different data modalities, including point clouds and drone imagery. Our BioNet, utilizing a sparse 3D convolutional neural network (CNN) and a transformer-based prediction module, processes point clouds and other 3D data representations to predict biomass. To further extend BioNet for drone imagery, we integrate a neural feature field (NeFF) module, enabling 3D structure reconstruction and the transformation of 2D semantic features from vision foundation models into the corresponding 3D surfaces. For the point cloud modality, BioNet demonstrates superior performance on two public datasets, with an approximate 6.1% relative improvement (RI) over the state-of-the-art. In the RGB image modality, the combination of BioNet and NeFF achieves a 7.9% RI. Additionally, the NeFF-based approach utilizes inexpensive, portable drone-mounted cameras, providing a scalable solution for large field applications.

[CV-33] Airway Labeling Meets Clinical Applications: Reflecting Topology Consistency and Outliers via Learnable Attentions

链接: https://arxiv.org/abs/2410.23854
作者: Chenyu Li,Minghui Zhang,Chuyan Zhang,Yun Gu
关键词-EN: navigate complex bronchial, complex bronchial structures, airway anatomical labeling, anatomical labeling, structures during bronchoscopy
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate airway anatomical labeling is crucial for clinicians to identify and navigate complex bronchial structures during bronchoscopy. Automatic airway anatomical labeling is challenging due to significant individual variability and anatomical variations. Previous methods are prone to generate inconsistent predictions, which is harmful for preoperative planning and intraoperative navigation. This paper aims to address these challenges by proposing a novel method that enhances topological consistency and improves the detection of abnormal airway branches. We propose a novel approach incorporating two modules: the Soft Subtree Consistency (SSC) and the Abnormal Branch Saliency (ABS). The SSC module constructs a soft subtree to capture clinically relevant topological relationships, allowing for flexible feature aggregation within and across subtrees. The ABS module facilitates the interaction between node features and prototypes to distinguish abnormal branches, preventing the erroneous aggregation of features between normal and abnormal nodes. Evaluated on a challenging dataset characterized by severe airway distortion and atrophy, our method achieves superior performance compared to state-of-the-art approaches. Specifically, it attains a 91.4% accuracy at the segmental level and an 83.7% accuracy at the subsegmental level, representing a 1.4% increase in subsegmental accuracy and a 3.1% increase in topological consistency. Notably, the method demonstrates reliable performance in cases with disease-induced airway deformities, ensuring consistent and accurate labeling. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2410.23854 [cs.CV] (or arXiv:2410.23854v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.23854 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-34] Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

链接: https://arxiv.org/abs/2410.23836
作者: Xiang Deng,Youxin Pang,Xiaochen Zhao,Chao Xu,Lizhen Wang,Hongjiang Xiao,Shi Yan,Hongwen Zhang,Yebin Liu
关键词-EN: precise lip synchronization, temporally consistent photo-realistic, continuous viewpoint control, consistent photo-realistic quality, paper introduces Stereo-Talker
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs’ cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of masks and enabling mask guiding during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.

[CV-35] FRoundation: Are Foundation Models Ready for Face Recognition?

链接: https://arxiv.org/abs/2410.23831
作者: Tahar Chettaoui,Naser Damer,Fadi Boutros
关键词-EN: Foundation models, face recognition, models, making them broadly, unsupervised or self-supervised
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Foundation models are predominantly trained in an unsupervised or self-supervised manner on highly diverse and large-scale datasets, making them broadly applicable to various downstream tasks. In this work, we investigate for the first time whether such models are suitable for the specific domain of face recognition. We further propose and demonstrate the adaptation of these models for face recognition across different levels of data availability. Extensive experiments are conducted on multiple foundation models and datasets of varying scales for training and fine-tuning, with evaluation on a wide range of benchmarks. Our results indicate that, despite their versatility, pre-trained foundation models underperform in face recognition compared to similar architectures trained specifically for this task. However, fine-tuning foundation models yields promising results, often surpassing models trained from scratch when training data is limited. Even with access to large-scale face recognition training datasets, fine-tuned foundation models perform comparably to models trained from scratch, but with lower training computational costs and without relying on the assumption of extensive data availability. Our analysis also explores bias in face recognition, with slightly higher bias observed in some settings when using foundation models.

[CV-36] Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection

链接: https://arxiv.org/abs/2410.23828
作者: Ke Li,Fuyu Dong,Di Wang,Shaofeng Li,Quan Wang,Xinbo Gao,Tat-Seng Chua
关键词-EN: perceive changes occurring, change detection aims, Earth surface, change detection, Change Detection Question
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote sensing change detection aims to perceive changes occurring on the Earth’s surface from remote sensing data in different periods, and feed these changes back to humans. However, most existing methods only focus on detecting change regions, lacking the ability to interact with users to identify changes that the users expect. In this paper, we introduce a new task named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, which provides a large-scale and diverse dataset for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers. Our method achieves state-of-the-art results on both the classic CDVQA and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for the development of better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction. The proposed benchmark dataset and method are available at this https URL.

[CV-37] Human Action Recognition (HAR) Using Skeleton-based Quantum Spatial Temporal Relative Transformer Network: ST-RTR

链接: https://arxiv.org/abs/2410.23806
作者: Faisal Mehmood,Enqing Chen,Touqeer Abbas,Samah M. Alzanin
关键词-EN: interesting research area, disabled individuals affected, NTU RGB, mental health, interesting research
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Quantum Human Action Recognition (HAR) is an interesting research area in human-computer interaction used to monitor the activities of elderly and disabled individuals affected by physical and mental health. In the recent era, skeleton-based HAR has received much attention because skeleton data has shown that it can handle changes in striking, body size, camera views, and complex backgrounds. One key characteristic of ST-GCN is automatically learning spatial and temporal patterns from skeleton sequences. It has some limitations, as this method only works for short-range correlation due to its limited receptive field. Consequently, understanding human action requires long-range interconnection. To address this issue, we developed a quantum spatial-temporal relative transformer ST-RTR model. The ST-RTR includes joint and relay nodes, which allow efficient communication and data transmission within the network. These nodes help to break the inherent spatial and temporal skeleton topologies, which enables the model to understand long-range human action better. Furthermore, we combine quantum ST-RTR with a fusion model for further performance improvements. To assess the performance of the quantum ST-RTR method, we conducted experiments on three skeleton-based HAR benchmarks: NTU RGB+D 60, NTU RGB+D 120, and UAV-Human. It boosted CS and CV by 2.11 % and 1.45% on NTU RGB+D 60, 1.25% and 1.05% on NTU RGB+D 120. On UAV-Human datasets, accuracy improved by 2.54%. The experimental outcomes explain that the proposed ST-RTR model significantly improves action recognition associated with the standard ST-GCN method.

[CV-38] SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild

链接: https://arxiv.org/abs/2410.23800
作者: Zhuoyang Pan,Angjoo Kanazawa,Hang Gao
关键词-EN: predefined motion scripts, follow predefined motion, Self-occlusion is common, motion scripts, common when capturing
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Self-occlusion is common when capturing people in the wild, where the performer do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems that assume full body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations where parts of the body are entirely unobserved. SOAR leverages structural normal prior and generative diffusion prior to address such an ill-posed reconstruction problem. For structural normal prior, we model human with an reposable surfel model with well-defined and easily readable shapes. For generative diffusion prior, we perform an initial reconstruction and refine it using score distillation. On various benchmarks, we show that SOAR performs favorably than state-of-the-art reconstruction and generation methods, and on-par comparing to concurrent works. Additional video results and code are available at this https URL.

[CV-39] Video Token Merging for Long-form Video Understanding NEURIPS2024

链接: https://arxiv.org/abs/2410.23782
作者: Seon-Ho Lee,Jue Wang,Zhikang Zhang,David Fan,Xinyu Li
关键词-EN: transformer-based models presents, understanding rapidly expand, video understanding rapidly, token merging, handling long-form video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, NeurIPS 2024

点击查看摘要

Abstract:As the scale of data and models for video understanding rapidly expand, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when used in collaboration with transformers. However, the application of token merging for long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performances on the LVU, COIN, and Breakfast datasets. Moreover, our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.

[CV-40] In-Context LoRA for Diffusion Transformers

链接: https://arxiv.org/abs/2410.23775
作者: Lianghua Huang,Wei Wang,Zhi-Fan Wu,Yupeng Shi,Huanzhang Dou,Chen Liang,Yutong Feng,Yu Liu,Jingren Zhou
关键词-EN: Recent research arXiv, simply concatenating attention, concatenating attention tokens, Recent research, diffusion transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20\sim 100 samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at this https URL

[CV-41] Open-Set 3D object detection in LiDAR data as an Out-of-Distribution problem

链接: https://arxiv.org/abs/2410.23767
作者: Louis Soum-Fontez,Jean-Emmanuel Deschaud,François Goulette
关键词-EN: achieved industry-ready performance, advanced deep learning, deep learning methods, Object Detection, OOD Object Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Object Detection from LiDAR data has achieved industry-ready performance in controlled environments through advanced deep learning methods. However, these neural network models are limited by a finite set of inlier object categories. Our work redefines the open-set 3D Object Detection problem in LiDAR data as an Out-Of-Distribution (OOD) problem to detect outlier objects. This approach brings additional information in comparison with traditional object detection. We establish a comparative benchmark and show that two-stage OOD methods, notably autolabelling, show promising results for 3D OOD Object Detection. Our contributions include setting a rigorous evaluation protocol by examining the evaluation of hyperparameters and evaluating strategies for generating additional data to train an OOD-aware 3D object detector. This comprehensive analysis is essential for developing robust 3D object detection systems that can perform reliably in diverse and unpredictable real-world scenarios.

[CV-42] Reverse Attitude Statistics Based Star Map Identification Method

链接: https://arxiv.org/abs/2410.23758
作者: Shunmei Dong,Qinglong Wang,Haiqing Wang,Qianqian Wang
关键词-EN: atmospheric background light, tracker is generally, generally affected, atmospheric background, background light
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 17figures, 4 tables, 4663 words, submitted to IEEE Sensors Journal

点击查看摘要

Abstract:The star tracker is generally affected by the atmospheric background light and the aerodynamic environment when working in near space, which results in missing stars or false stars. Moreover, high-speed maneuvering may cause star trailing, which reduces the accuracy of the star position. To address the challenges for starmap identification, a reverse attitude statistics based method is proposed to handle position noise, false stars, and missing stars. Conversely to existing methods which match before solving for attitude, this method introduces attitude solving into the matching process, and obtains the final match and the correct attitude simultaneously by frequency statistics. Firstly, based on stable angular distance features, the initial matching is obtained by utilizing spatial hash indexing. Then, the dual-vector attitude determination is introduced to calculate potential attitude. Finally, the star pairs are accurately matched by applying a frequency statistics filtering method. In addition, Bayesian optimization is employed to find optimal parameters under the impact of noises, which is able to enhance the algorithm performance further. In this work, the proposed method is validated in simulation, field test and on-orbit experiment. Compared with the state-of-the-art, the identification rate is improved by more than 14.3%, and the solving time is reduced by over 28.5%.

[CV-43] EXACFS – A CIL Method to mitigate Catastrophic Forgetting

链接: https://arxiv.org/abs/2410.23751
作者: S Balasubramanian,M Sai Subramaniam,Sai Sriram Talasu,P Yedu Krishna,Manepalli Pranav Phanindra Sai,Ravi Mukkamala,Darshan Gera
关键词-EN: Deep neural networks, data arrives sequentially, Deep neural, neural networks, arrives sequentially
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNS) excel at learning from static datasets but struggle with continual learning, where data arrives sequentially. Catastrophic forgetting, the phenomenon of forgetting previously learned knowledge, is a primary challenge. This paper introduces EXponentially Averaged Class-wise Feature Significance (EXACFS) to mitigate this issue in the class incremental learning (CIL) setting. By estimating the significance of model features for each learned class using loss gradients, gradually aging the significance through the incremental tasks and preserving the significant features through a distillation loss, EXACFS effectively balances remembering old knowledge (stability) and learning new knowledge (plasticity). Extensive experiments on CIFAR-100 and ImageNet-100 demonstrate EXACFS’s superior performance in preserving stability while acquiring plasticity.

[CV-44] EchoNarrator: Generating natural text explanations for ejection fraction predictions MICCAI2024

链接: https://arxiv.org/abs/2410.23744
作者: Sarina Thomas,Qing Cao,Anna Novikova,Daria Kulikova,Guy Ben-Yosef
关键词-EN: cardiac ultrasound acquisition, diagnosing acute heart, acute heart failure, Natural Language Explanation, left ventricle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted for MICCAI 2024

点击查看摘要

Abstract:Ejection fraction (EF) of the left ventricle (LV) is considered as one of the most important measurements for diagnosing acute heart failure and can be estimated during cardiac ultrasound acquisition. While recent successes in deep learning research successfully estimate EF values, the proposed models often lack an explanation for the prediction. However, providing clear and intuitive explanations for clinical measurement predictions would increase the trust of cardiologists in these models. In this paper, we explore predicting EF measurements with Natural Language Explanation (NLE). We propose a model that in a single forward pass combines estimation of the LV contour over multiple frames, together with a set of modules and routines for computing various motion and shape attributes that are associated with ejection fraction. It then feeds the attributes into a large language model to generate text that helps to explain the network’s outcome in a human-like manner. We provide experimental evaluation of our explanatory output, as well as EF prediction, and show that our model can provide EF comparable to state-of-the-art together with meaningful and accurate natural language explanation to the prediction. The project page can be found at this https URL .

[CV-45] Scaled Inverse Graphics: Efficiently Learning Large Sets of 3D Scenes

链接: https://arxiv.org/abs/2410.23742
作者: Karim Kassab,Antoine Schnepf,Jean-Yves Franceschi,Laurent Caraffa,Flavian Vasile,Jeremie Mary,Andrew Comport,Valérie Gouet-Brunet
关键词-EN: witnessing continuous growth, inverse graphics, scaled inverse graphics, learning large sets, continuous growth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While the field of inverse graphics has been witnessing continuous growth, techniques devised thus far predominantly focus on learning individual scene representations. In contrast, learning large sets of scenes has been a considerable bottleneck in NeRF developments, as repeatedly applying inverse graphics on a sequence of scenes, though essential for various applications, remains largely prohibitive in terms of resource costs. We introduce a framework termed “scaled inverse graphics”, aimed at efficiently learning large sets of scene representations, and propose a novel method to this end. It operates in two stages: (i) training a compression model on a subset of scenes, then (ii) training NeRF models on the resulting smaller representations, thereby reducing the optimization space per new scene. In practice, we compact the representation of scenes by learning NeRFs in a latent space to reduce the image resolution, and sharing information across scenes to reduce NeRF representation complexity. We experimentally show that our method presents both the lowest training time and memory footprint in scaled inverse graphics compared to other methods applied independently on each scene. Our codebase is publicly available as open-source. Our project page can be found at this https URL .

[CV-46] MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

链接: https://arxiv.org/abs/2410.23736
作者: Haiwen Li,Fei Su,Zhicheng Zhao
关键词-EN: Composed Image Retrieval, retrieve target images, challenging vision-language task, utilizing bi-modal, queries to retrieve
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, the dependence on costly, manually-labeled triplets limits its scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) is presented along with projection-based approaches. However, such methods face two major problems, i.e., task discrepancy between pre-training (image \leftrightarrow text) and inference (image+text \rightarrow image), and modality discrepancy. The latter pertains to approaches based on text-only projection training due to the necessity of feature extraction from the reference image during inference. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Subsequently, we put forward Modality-Task Dual Alignment (MoTaDual) as the second stage, where large-language models (LLMs) generate triplet data for fine-tuning, and additionally, prompt learning is introduced in a multi-modal context to effectively alleviate both modality and task discrepancies. The experimental results show that our MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost. The code will be released soon.

[CV-47] An Empirical Analysis of GPT-4Vs Performance on Fashion Aesthetic Evaluation

链接: https://arxiv.org/abs/2410.23730
作者: Yuki Hirakawa,Takashi Wada,Kazuya Morishita,Ryotaro Shimizu,Takuya Furusawa,Sai Htaung Kham,Yuki Saito
关键词-EN: Fashion aesthetic evaluation, Fashion aesthetic, aesthetic evaluation, worn by individuals, individuals in images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fashion aesthetic evaluation is the task of estimating how well the outfits worn by individuals in images suit them. In this work, we examine the zero-shot performance of GPT-4V on this task for the first time. We show that its predictions align fairly well with human judgments on our datasets, and also find that it struggles with ranking outfits in similar colors. The code is available at this https URL.

[CV-48] GaussianMarker: Uncertainty-Aware Copyright Protection of 3D Gaussian Splatting

链接: https://arxiv.org/abs/2410.23718
作者: Xiufeng Huang,Ruiqi Li,Yiu-ming Cheung,Ka Chun Cheung,Simon See,Renjie Wan
关键词-EN: Gaussian Splatting, Splatting, Gaussians, crucial method, assets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has become a crucial method for acquiring 3D assets. To protect the copyright of these assets, digital watermarking techniques can be applied to embed ownership information discreetly within 3DGS models. However, existing watermarking methods for meshes, point clouds, and implicit radiance fields cannot be directly applied to 3DGS models, as 3DGS models use explicit 3D Gaussians with distinct structures and do not rely on neural networks. Naively embedding the watermark on a pre-trained 3DGS can cause obvious distortion in rendered images. In our work, we propose an uncertainty-based method that constrains the perturbation of model parameters to achieve invisible watermarking for 3DGS. At the message decoding stage, the copyright messages can be reliably extracted from both 3D Gaussians and 2D rendered images even under various forms of 3D and 2D distortions. We conduct extensive experiments on the Blender, LLFF and MipNeRF-360 datasets to validate the effectiveness of our proposed method, demonstrating state-of-the-art performance on both message decoding accuracy and view synthesis quality.

[CV-49] Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP NEURIPS2024

链接: https://arxiv.org/abs/2410.23698
作者: Chen Huang,Skyler Seto,Samira Abnar,David Grangier,Navdeep Jaitly,Josh Susskind
关键词-EN: Large pretrained vision-language, promising generalization capability, Large pretrained, shown promising generalization, satellite imagery
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when limited annotation data are available. In this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt ``summary’’ aligned to each input image via a learned prompt aggregator. Then we jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing task loss at the same time. We dub such prompt embedding as Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to be able to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning) where AAPE achieves competitive performance. We also show AAPE is particularly helpful to handle non-canonical and OOD examples. Furthermore, AAPE learning eliminates LLM-based inference cost as required by baselines, and scales better with data and LLM model size.

[CV-50] XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM

链接: https://arxiv.org/abs/2410.23690
作者: Xiaomeng Wang,Nan Wang,Guofeng Zhang
关键词-EN: flexible SLAM framework, propose a flexible, flexible SLAM, SLAM, XRDSLAM
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this paper, we propose a flexible SLAM framework, XRDSLAM. It adopts a modular code design and a multi-process running mechanism, providing highly reusable foundational modules such as unified dataset management, 3d visualization, algorithm configuration, and metrics evaluation. It can help developers quickly build a complete SLAM system, flexibly combine different algorithm modules, and conduct standardized benchmarking for accuracy and efficiency comparison. Within this framework, we integrate several state-of-the-art SLAM algorithms with different types, including NeRF and 3DGS based SLAM, and even odometry or reconstruction algorithms, which demonstrates the flexibility and extensibility. We also conduct a comprehensive comparison and evaluation of these integrated algorithms, analyzing the characteristics of each. Finally, we contribute all the code, configuration and data to the open-source community, which aims to promote the widespread research and development of SLAM technology within the open-source ecosystem.

[CV-51] Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey

链接: https://arxiv.org/abs/2410.23687
作者: Chiyu Zhang,Xiaogang Xu,Jiafei Wu,Zhe Liu,Lu Zhou
关键词-EN: pose significant security, machine learning inference, manipulate input data, significant security threats, undermine model availability
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Adversarial attacks, which manipulate input data to undermine model availability and integrity, pose significant security threats during machine learning inference. With the advent of Large Vision-Language Models (LVLMs), new attack vectors, such as cognitive bias, prompt injection, and jailbreak techniques, have emerged. Understanding these attacks is crucial for developing more robust systems and demystifying the inner workings of neural networks. However, existing reviews often focus on attack classifications and lack comprehensive, in-depth analysis. The research community currently needs: 1) unified insights into adversariality, transferability, and generalization; 2) detailed evaluations of existing methods; 3) motivation-driven attack categorizations; and 4) an integrated perspective on both traditional and LVLM attacks. This article addresses these gaps by offering a thorough summary of traditional and LVLM adversarial attacks, emphasizing their connections and distinctions, and providing actionable insights for future research.

[CV-52] Wide Two-Layer Networks can Learn from Adversarial Perturbations NEURIPS24

链接: https://arxiv.org/abs/2410.23677
作者: Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki
关键词-EN: open questions, raised several open, Adversarial, classifiers trained, perturbation learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: NeurIPS24

点击查看摘要

Abstract:Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis to explain these phenomena suggests that adversarial perturbations appear as random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although this hypothesis and perturbation learning are effective in explaining intriguing properties of adversarial examples, their solid theoretical foundation is limited. In this study, we theoretically explain the counterintuitive success of perturbation learning. We assume wide two-layer networks and the results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at this https URL.

[CV-53] Web-Scale Visual Entity Recognition: An LLM -Driven Data Approach NEURIPS2024

链接: https://arxiv.org/abs/2410.23676
作者: Mathilde Caron,Alireza Fathi,Cordelia Schmid,Ahmet Iscen
关键词-EN: presents significant challenges, vast knowledge bases, significant challenges due, presents significant, lack of clean
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded finegrained textual description (referred to as “rationale”) that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g. +6.9% improvement in OVEN entity task), underscoring the importance of high-quality training data in this domain.

[CV-54] DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection

链接: https://arxiv.org/abs/2410.23663
作者: Fan Nie,Jiangqun Ni,Jian Zhang,Bin Zhang,Weizhe Zhang
关键词-EN: protecting multimedia content, multimedia content integrity, deepfake generation techniques, deepfake video detection, generation techniques
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 13 pages, accepted with IEEE Trans. on Multimedia

点击查看摘要

Abstract:With the advancement of deepfake generation techniques, the importance of deepfake detection in protecting multimedia content integrity has become increasingly obvious. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos in terms of motion information usually exhibits quite distinct inconsistency patterns along horizontal and vertical directions, which could be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is adopted accordingly, where direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn the inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method could effectively identify directional forgery clues and achieve state-of-the-art performance.

[CV-55] GS-Blur: A 3D Scene-Based Dataset for Realistic Image Deblurring NEURIPS2024

链接: https://arxiv.org/abs/2410.23658
作者: Dongwoo Lee,Joonkyu Park,Kyoung Mu Lee
关键词-EN: blurry images, blur, images, paired blurry, blurry
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024 Datasets Benchmarks Track

点击查看摘要

Abstract:To train a deblurring network, an appropriate dataset with paired blurry and sharp images is essential. Existing datasets collect blurry images either synthetically by aggregating consecutive sharp frames or using sophisticated camera systems to capture real blur. However, these methods offer limited diversity in blur types (blur trajectories) or require extensive human effort to reconstruct large-scale datasets, failing to fully reflect real-world blur scenarios. To address this, we propose GS-Blur, a dataset of synthesized realistic blurry images created using a novel approach. To this end, we first reconstruct 3D scenes from multi-view images using 3D Gaussian Splatting (3DGS), then render blurry images by moving the camera view along the randomly generated motion trajectories. By adopting various camera trajectories in reconstructing our GS-Blur, our dataset contains realistic and diverse types of blur, offering a large-scale dataset that generalizes well to real-world blur. Using GS-Blur with various deblurring methods, we demonstrate its ability to generalize effectively compared to previous synthetic or real blur datasets, showing significant improvements in deblurring performance.

[CV-56] Recovering Complete Actions for Cross-dataset Skeleton Action Recognition NEURIPS2024

链接: https://arxiv.org/abs/2410.23641
作者: Hanchao Liu,Yujiang Li,Tai-Jiang Mu,Shi-Min Hu
关键词-EN: skeleton-based action recognition, action, challenging issue, huge progress, progress in skeleton-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by NeurIPS 2024

点击查看摘要

Abstract:Despite huge progress in skeleton-based action recognition, its generalizability to different domains remains a challenging issue. In this paper, to solve the skeleton action generalization problem, we present a recover-and-resample augmentation framework based on a novel complete action prior. We observe that human daily actions are confronted with temporal mismatch across different datasets, as they are usually partial observations of their complete action sequences. By recovering complete actions and resampling from these full sequences, we can generate strong augmentations for unseen domains. At the same time, we discover the nature of general action completeness within large datasets, indicated by the per-frame diversity over time. This allows us to exploit two assets of transferable knowledge that can be shared across action samples and be helpful for action completion: boundary poses for determining the action start, and linear temporal transforms for capturing global action patterns. Therefore, we formulate the recovering stage as a two-step stochastic action completion with boundary pose-conditioned extrapolation followed by smooth linear transforms. Both the boundary poses and linear transforms can be efficiently learned from the whole dataset via clustering. We validate our approach on a cross-dataset setting with three skeleton action datasets, outperforming other domain generalization approaches by a considerable margin.

[CV-57] On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection

链接: https://arxiv.org/abs/2410.23623
作者: Xiufeng Song,Xiao Guo,Jiache Zhang,Qirui Li,Lei Bai,Xiaoming Liu,Guangtao Zhai,Xiaohong Liu
关键词-EN: models pose threats, security and authenticity, generated content detection, diffusion models pose, numbers of synthesized
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:Large numbers of synthesized videos from diffusion models pose threats to information security and authenticity, leading to an increasing demand for generated content detection. However, existing video-level detection algorithms primarily focus on detecting facial forgeries and often fail to identify diffusion-generated content with a diverse range of semantics. To advance the field of video forensics, we propose an innovative algorithm named Multi-Modal Detection(MM-Det) for detecting diffusion-generated videos. MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR) from LMM’s multi-modal space, enhancing its ability to detect unseen forgery content. Besides, MM-Det leverages an In-and-Across Frame Attention (IAFA) mechanism for feature augmentation in the spatio-temporal domain. A dynamic fusion strategy helps refine forgery representations for the fusion. Moreover, we construct a comprehensive diffusion video dataset, called Diffusion Video Forensics (DVF), across a wide range of forgery videos. MM-Det achieves state-of-the-art performance in DVF, demonstrating the effectiveness of our algorithm. Both source code and DVF are available at this https URL.

[CV-58] Context-Aware Token Selection and Packing for Enhanced Vision Transformer

链接: https://arxiv.org/abs/2410.23608
作者: Tianyi Zhang,Baoxin Li,Jae-sun Seo,Yu Cap
关键词-EN: driven significant performance, significant performance breakthroughs, long-range attention mechanism, recent years, transformers has driven
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.

[CV-59] Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding

链接: https://arxiv.org/abs/2410.23570
作者: Minghong Xie,Mengzhao Wang,Huafeng Li,Yafei Zhang,Dapeng Tao,Zhengtao Yu
关键词-EN: visual language tasks, attracted wide attention, Visual grounding, Correction Visual Grounding, Visual Grounding method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been accepted by TMM

点击查看摘要

Abstract:Visual grounding has attracted wide attention thanks to its broad application in various visual language tasks. Although visual grounding has made significant research progress, existing methods ignore the promotion effect of the association between text and image features at different hierarchies on cross-modal matching. This paper proposes a Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction Visual Grounding method. It first generates a mask through decoupled sentence phrases, and a text and image hierarchical matching mechanism is constructed, highlighting the role of association between different hierarchies in cross-modal matching. In addition, a corresponding target object position progressive correction strategy is defined based on the hierarchical matching mechanism to achieve accurate positioning for the target object described in the text. This method can continuously optimize and adjust the bounding box position of the target object as the certainty of the text description of the target object improves. This design explores the association between features at different hierarchies and highlights the role of features related to the target object and its position in target positioning. The proposed method is validated on different datasets through experiments, and its superiority is verified by the performance comparison with the state-of-the-art methods.

[CV-60] Language-guided Hierarchical Fine-grained Image Forgery Detection and Localization

链接: https://arxiv.org/abs/2410.23556
作者: Xiao Guo,Xiaohong Liu,Iacopo Masi,Xiaoming Liu
关键词-EN: differences make, forgery, Differences, domains are large, forgery attributes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IJCV2024. arXiv admin note: substantial text overlap with arXiv:2303.17111

点击查看摘要

Abstract:Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++. Specifically, HiFi-Net++ contains four components: a multi-branch feature extractor, a language-guided forgery localization enhancer, as well as classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment pixel-level forgery regions and detect image-level forgery, respectively. Also, the language-guided forgery localization enhancer (LFLE), containing image and text encoders learned by contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and then generates the visual embedding and manipulation score maps, which are used to further improve HiFi-Net++ manipulation localization performance. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 8 by using different benchmarks for both tasks of IFDL and forgery attribute classification. Our source code and dataset are available.

[CV-61] LBurst: Learning-Based Robotic Burst Feature Extraction for 3D Reconstruction in Low Light

链接: https://arxiv.org/abs/2410.23522
作者: Ahalya Ravendran,Mitch Bryson,Donald G. Dansereau
关键词-EN: aerial imaging, disaster recovery, revolutionized the fields, fields of aerial, low-light conditions
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 7 pages, 8 figures, 3 tables, for associated project page, see this https URL

点击查看摘要

Abstract:Drones have revolutionized the fields of aerial imaging, mapping, and disaster recovery. However, the deployment of drones in low-light conditions is constrained by the image quality produced by their on-board cameras. In this paper, we present a learning architecture for improving 3D reconstructions in low-light conditions by finding features in a burst. Our approach enhances visual reconstruction by detecting and describing high quality true features and less spurious features in low signal-to-noise ratio images. We demonstrate that our method is capable of handling challenging scenes in millilux illumination, making it a significant step towards drones operating at night and in extremely low-light applications such as underground mining and search and rescue operations.

[CV-62] PACER: Preference-conditioned All-terrain Costmap Generation

链接: https://arxiv.org/abs/2410.23488
作者: Luisa Mao,Garrett Warnell,Peter Stone,Joydeep Biswas
关键词-EN: autonomous robot navigation, pre-trained semantic classifier, robot navigation, semantic classifier, autonomous robot
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In autonomous robot navigation, terrain cost assignment is typically performed using a semantics-based paradigm in which terrain is first labeled using a pre-trained semantic classifier and costs are then assigned according to a user-defined mapping between label and cost. While this approach is rapidly adaptable to changing user preferences, only preferences over the types of terrain that are already known by the semantic classifier can be expressed. In this paper, we hypothesize that a machine-learning-based alternative to the semantics-based paradigm above will allow for rapid cost assignment adaptation to preferences expressed over new terrains at deployment time without the need for additional training. To investigate this hypothesis, we introduce and study PACER, a novel approach to costmap generation that accepts as input a single birds-eye view (BEV) image of the surrounding area along with a user-specified preference context and generates a corresponding BEV costmap that aligns with the preference context. Using both real and synthetic data along with a combination of proposed training tasks, we find that PACER is able to adapt quickly to new user preferences while also exhibiting better generalization to novel terrains compared to both semantics-based and representation-learning approaches.

[CV-63] EchoFM: Foundation Model for Generalizable Echocardiogram Analysis

链接: https://arxiv.org/abs/2410.23413
作者: Sekeun Kim,Pengfei Jin,Sifan Song,Cheng Chen,Yiwei Li,Hui Ren,Xiang Li,Tianming Liu,Quanzheng Li
关键词-EN: recently gained significant, gained significant attention, data distributions, Foundation models, recently gained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, are still unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework can effectively capture the spatio-temporal dynamics of echocardiography and learn the representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames of images. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systemically designed for four downstream tasks after the echocardiography examination routine. Experiment results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purposed pre-trained foundation models, across all downstream tasks.

[CV-64] Multilingual Vision-Language Pre-training for the Remote Sensing Domain

链接: https://arxiv.org/abs/2410.23370
作者: João Daniel Silva,Joao Magalhaes,Devis Tuia,Bruno Martins
关键词-EN: Contrastive Language-Image Pre-training, remote sensing, involving remote sensing, remote sensing images, Sensing Multilingual CLIP
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ACM SIGSPATIAL 2024 - Research Papers

点击查看摘要

Abstract:Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by the use of automated machine translation into nine additional languages. We show that translated data is indeed helpful, e.g. improving performance also on English. Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval, or zero-shot image classification.

[CV-65] Domain-decomposed image classification algorithms using linear discriminant analysis and convolutional neural networks

链接: https://arxiv.org/abs/2410.23359
作者: Axel Klawonn,Martin Lanser,Janine Weber
关键词-EN: modern computer application, computer application problems, important role, modern computer, computer application
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In many modern computer application problems, the classification of image data plays an important role. Among many different supervised machine learning models, convolutional neural networks (CNNs) and linear discriminant analysis (LDA) as well as sophisticated variants thereof are popular techniques. In this work, two different domain decomposed CNN models are experimentally compared for different image classification problems. Both models are loosely inspired by domain decomposition methods and in addition, combined with a transfer learning strategy. The resulting models show improved classification accuracies compared to the corresponding, composed global CNN model without transfer learning and besides, also help to speed up the training process. Moreover, a novel decomposed LDA strategy is proposed which also relies on a localization approach and which is combined with a small neural network model. In comparison with a global LDA applied to the entire input data, the presented decomposed LDA approach shows increased classification accuracies for the considered test problems.

[CV-66] Improving Image Data Leakage Detection in Automotive Software

链接: https://arxiv.org/abs/2410.23312
作者: Md Abu Ahammed Babu,Sushant Kumar Pandey,Darko Durisic,Ashok Chaitanya Koppisetty,Miroslaw Staron
关键词-EN: Data leakage, common problem, overlooked during splitting, train and test, test sets
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data leakage is a very common problem that is often overlooked during splitting data into train and test sets before training any ML/DL model. The model performance gets artificially inflated with the presence of data leakage during the evaluation phase which often leads the model to erroneous prediction on real-time deployment. However, detecting the presence of such leakage is challenging, particularly in the object detection context of perception systems where the model needs to be supplied with image data for training. In this study, we conduct a computational experiment on the Cirrus dataset from our industrial partner Volvo Cars to develop a method for detecting data leakage. We then evaluate the method on another public dataset, Kitti, which is a popular and widely accepted benchmark dataset in the automotive domain. The results show that thanks to our proposed method we are able to detect data leakage in the Kitti dataset, which was previously unknown.

[CV-67] Parameter choices in HaarPSI for IQA with medical images

链接: https://arxiv.org/abs/2410.24098
作者: Clemens Karner,Janek Gröhl,Ian Selby,Judith Babar,Jake Beckford,Thomas R Else,Timothy J Sadler,Shahab Shahipasand,Arthikkaa Thavakumar,Michael Roberts,James H.F. Rudd,Carola-Bibiane Schönlieb,Jonathan R Weir-McCall,Anna Breger
关键词-EN: machine learning models, developing machine learning, image quality assessment, IQA measures, learning models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 3 figures, 2 tables

点击查看摘要

Abstract:When developing machine learning models, image quality assessment (IQA) measures are a crucial component for evaluation. However, commonly used IQA measures have been primarily developed and optimized for natural images. In many specialized settings, such as medical images, this poses an often-overlooked problem regarding suitability. In previous studies, the IQA measure HaarPSI showed promising behavior for natural and medical images. HaarPSI is based on Haar wavelet representations and the framework allows optimization of two parameters. So far, these parameters have been aligned for natural images. Here, we optimize these parameters for two annotated medical data sets, a photoacoustic and a chest X-Ray data set. We observe that they are more sensitive to the parameter choices than the employed natural images, and on the other hand both medical data sets lead to similar parameter values when optimized. We denote the optimized setting, which improves the performance for the medical images notably, by HaarPSI _MED . The results suggest that adapting common IQA measures within their frameworks for medical images can provide a valuable, generalizable addition to the employment of more specific task-based measures.

[CV-68] Deep Learning with HM-VGG: AI Strategies for Multi-modal Image Analysis

链接: https://arxiv.org/abs/2410.24046
作者: Junliang Du,Yiru Cang,Tong Zhou,Jiacheng Hu,Weijie He
关键词-EN: Hybrid Multi-modal VGG, Hybrid Multi-modal, cutting-edge deep learning, introduces the Hybrid, deep learning approach
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study introduces the Hybrid Multi-modal VGG (HM-VGG) model, a cutting-edge deep learning approach for the early diagnosis of glaucoma. The HM-VGG model utilizes an attention mechanism to process Visual Field (VF) data, enabling the extraction of key features that are vital for identifying early signs of glaucoma. Despite the common reliance on large annotated datasets, the HM-VGG model excels in scenarios with limited data, achieving remarkable results with small sample sizes. The model’s performance is underscored by its high metrics in Precision, Accuracy, and F1-Score, indicating its potential for real-world application in glaucoma detection. The paper also discusses the challenges associated with ophthalmic image analysis, particularly the difficulty of obtaining large volumes of annotated data. It highlights the importance of moving beyond single-modality data, such as VF or Optical Coherence Tomography (OCT) images alone, to a multimodal approach that can provide a richer, more comprehensive dataset. This integration of different data types is shown to significantly enhance diagnostic accuracy. The HM- VGG model offers a promising tool for doctors, streamlining the diagnostic process and improving patient outcomes. Furthermore, its applicability extends to telemedicine and mobile healthcare, making diagnostic services more accessible. The research presented in this paper is a significant step forward in the field of medical image processing and has profound implications for clinical ophthalmology.

[CV-69] Assessing the Efficacy of Classical and Deep Neuroimaging Biomarkers in Early Alzheimers Disease Diagnosis

链接: https://arxiv.org/abs/2410.24002
作者: Milla E. Nielsen,Mads Nielsen,Mostafa Mehdipour Ghazi
关键词-EN: current diagnostic methods, Alzheimer Disease Neuroimaging, Disease Neuroimaging Initiative, Alzheimer disease, sensitivity and specificity
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: SPIE Medical Imaging (MI25)

点击查看摘要

Abstract:Alzheimer’s disease (AD) is the leading cause of dementia, and its early detection is crucial for effective intervention, yet current diagnostic methods often fall short in sensitivity and specificity. This study aims to detect significant indicators of early AD by extracting and integrating various imaging biomarkers, including radiomics, hippocampal texture descriptors, cortical thickness measurements, and deep learning features. We analyze structural magnetic resonance imaging (MRI) scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts, utilizing comprehensive image analysis and machine learning techniques. Our results show that combining multiple biomarkers significantly improves detection accuracy. Radiomics and texture features emerged as the most effective predictors for early AD, achieving AUCs of 0.88 and 0.72 for AD and MCI detection, respectively. Although deep learning features proved to be less effective than traditional approaches, incorporating age with other biomarkers notably enhanced MCI detection performance. Additionally, our findings emphasize the continued importance of classical imaging biomarkers in the face of modern deep-learning approaches, providing a robust framework for early AD diagnosis.

[CV-70] mporal and Spatial Super Resolution with Latent Diffusion Model in Medical MRI images

链接: https://arxiv.org/abs/2410.23898
作者: Vishal Dubey
关键词-EN: acquisition time constraints, medical imaging, spatial and temporal, Vector Quantised GAN, Super Resolution
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Super Resolution (SR) plays a critical role in computer vision, particularly in medical imaging, where hardware and acquisition time constraints often result in low spatial and temporal resolution. While diffusion models have been applied for both spatial and temporal SR, few studies have explored their use for joint spatial and temporal SR, particularly in medical imaging. In this work, we address this gap by proposing to use a Latent Diffusion Model (LDM) combined with a Vector Quantised GAN (VQGAN)-based encoder-decoder architecture for joint super resolution. We frame SR as an image denoising problem, focusing on improving both spatial and temporal resolution in medical images. Using the cardiac MRI dataset from the Data Science Bowl Cardiac Challenge, consisting of 2D cine images with a spatial resolution of 256x256 and 8-14 slices per time-step, we demonstrate the effectiveness of our approach. Our LDM model achieves Peak Signal to Noise Ratio (PSNR) of 30.37, Structural Similarity Index (SSIM) of 0.7580, and Learned Perceptual Image Patch Similarity (LPIPS) of 0.2756, outperforming simple baseline method by 5% in PSNR, 6.5% in SSIM, 39% in LPIPS. Our LDM model generates images with high fidelity and perceptual quality with 15 diffusion steps. These results suggest that LDMs hold promise for advancing super resolution in medical imaging, potentially enhancing diagnostic accuracy and patient outcomes. Code link is also shared.

[CV-71] Denoising Diffusion Models for Anomaly Localization in Medical Images

链接: https://arxiv.org/abs/2410.23834
作者: Cosmin I. Bercea,Philippe C. Cattin,Julia A. Schnabel,Julia Wolleb
关键词-EN: chapter explores anomaly, explores anomaly localization, anomaly localization, medical images, localization in medical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This chapter explores anomaly localization in medical images using denoising diffusion models. After providing a brief methodological background of these models, including their application to image reconstruction and their conditioning using guidance mechanisms, we provide an overview of available datasets and evaluation metrics suitable for their application to anomaly localization in medical images. In this context, we discuss supervision schemes ranging from fully supervised segmentation to semi-supervised, weakly supervised, self-supervised, and unsupervised methods, and provide insights into the effectiveness and limitations of these approaches. Furthermore, we highlight open challenges in anomaly localization, including detection bias, domain shift, computational cost, and model interpretability. Our goal is to provide an overview of the current state of the art in the field, outline research gaps, and highlight the potential of diffusion models for robust anomaly localization in medical images.

[CV-72] MLLA-UNet: Mamba-like Linear Attention in an Efficient U-Shape Model for Medical Image Segmentation

链接: https://arxiv.org/abs/2410.23738
作者: Yufeng Jiang,Zongxi Li,Xiangyan Chen,Haoran Xie,Jing Cai
关键词-EN: blurred tissue boundaries, low organ contrast, Recent advancements, high anatomical variability, anatomical variability
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in medical imaging have resulted in more complex and diverse images, with challenges such as high anatomical variability, blurred tissue boundaries, low organ contrast, and noise. Traditional segmentation methods struggle to address these challenges, making deep learning approaches, particularly U-shaped architectures, increasingly prominent. However, the quadratic complexity of standard self-attention makes Transformers computationally prohibitive for high-resolution images. To address these challenges, we propose MLLA-UNet (Mamba-Like Linear Attention UNet), a novel architecture that achieves linear computational complexity while maintaining high segmentation accuracy through its innovative combination of linear attention and Mamba-inspired adaptive mechanisms, complemented by an efficient symmetric sampling structure for enhanced feature processing. Our architecture effectively preserves essential spatial features while capturing long-range dependencies at reduced computational complexity. Additionally, we introduce a novel sampling strategy for multi-scale feature fusion. Experiments demonstrate that MLLA-UNet achieves state-of-the-art performance on six challenging datasets with 24 different segmentation tasks, including but not limited to FLARE22, AMOS CT, and ACDC, with an average DSC of 88.32%. These results underscore the superiority of MLLA-UNet over existing methods. Our contributions include the novel 2D segmentation architecture and its empirical validation. The code is available via this https URL.

[CV-73] Novel Clinical-Grade Prostate Cancer Detection and Grading Model: Development and Prospective Validation Using Real World Data with Performance Assessment on IHC Requested Cases

链接: https://arxiv.org/abs/2410.23642
作者: Ramin Nateghi,Ruoji Zhou,Madeline Saft,Marina Schnauss,Clayton Neill,Ridwan Alam,Nicole Handa,Mitchell Huang,Eric V Li,Jeffery A Goldstein,Edward M Schaeffer,Menatalla Nadim,Fattaneh Pourakpour,Bogdan Isaila,Christopher Felicelli,Vikas Mehta,Behtash G Nezami,Ashley Ross,Ximing Yang,Lee AD Cooper
关键词-EN: meeting increasing demand, reducing turnaround time, assist healthcare systems, Artificial intelligence, maintaining diagnostic quality
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Artificial intelligence may assist healthcare systems in meeting increasing demand for pathology services while maintaining diagnostic quality and reducing turnaround time and costs. We aimed to investigate the performance of an institutionally developed system for prostate cancer detection, grading, and workflow optimization and to contrast this with commercial alternatives. From August 2021 to March 2023, we scanned 21,396 slides from 1,147 patients with positive biopsies. We developed models for cancer detection, grading, and screening of equivocal cases for IHC ordering. We compared a task-specific model trained using the PANDA dataset of prostate cancer biopsies with one built using features extracted by the general-purpose histology foundation model, UNI and compare their performance in an unfiltered prospectively collected dataset that reflects our patient population (1737 slides,95 patients). We evaluated the contributions of a bespoke model designed to improve sensitivity in detecting small cancer foci and scoring of broader patterns observed at lower resolution. We found high concordance between the developed systems and pathologist reference in detection (AUC 98.5, sensitivity 95.0, and specificity 97.8), ISUP grading (quadratic Cohen’s kappa 0.869), grade group 3 or higher (AUC 97.5, sensitivity 94.9, specificity 96.6) and comparable to published data from commercial systems. Screening could reduce IHC ordering for equivocal cases by 44.5% with an overall error rate of 1.8% (1.4% false positive, 0.4% false negative rates). Institutions like academic medical centers that have high scanning volumes and report abstraction capabilities can develop accurate computational pathology models for internal use. These models have the potential to aid in quality control role and to improve workflow in the pathology lab to help meet future challenges in prostate cancer diagnosis.

[CV-74] Cycle-Constrained Adversarial Denoising Convolutional Network for PET Image Denoising: Multi-Dimensional Validation on Large Datasets with Reader Study and Real Low-Dose Data

链接: https://arxiv.org/abs/2410.23628
作者: Yucun Hou,Fenglin Zhan,Xin Cheng,Chenxi Li,Ziquan Yuan,Runze Liao,Haihao Wang,Jianlang Hua,Jing Wu,Jianyong Jiang
关键词-EN: Positron emission tomography, Positron emission, poses radiation risks, Denoising Convolutional Network, emission tomography
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Positron emission tomography (PET) is a critical tool for diagnosing tumors and neurological disorders but poses radiation risks to patients, particularly to sensitive populations. While reducing injected radiation dose mitigates this risk, it often compromises image quality. To reconstruct full-dose-quality images from low-dose scans, we propose a Cycle-constrained Adversarial Denoising Convolutional Network (Cycle-DCN). This model integrates a noise predictor, two discriminators, and a consistency network, and is optimized using a combination of supervised loss, adversarial loss, cycle consistency loss, identity loss, and neighboring Structural Similarity Index (SSIM) loss. Experiments were conducted on a large dataset consisting of raw PET brain data from 1,224 patients, acquired using a Siemens Biograph Vision PET/CT scanner. Each patient underwent a 120-seconds brain scan. To simulate low-dose PET conditions, images were reconstructed from shortened scan durations of 30, 12, and 5 seconds, corresponding to 1/4, 1/10, and 1/24 of the full-dose acquisition, respectively, using a custom-developed GPU-based image reconstruction software. The results show that Cycle-DCN significantly improves average Peak Signal-to-Noise Ratio (PSNR), SSIM, and Normalized Root Mean Square Error (NRMSE) across three dose levels, with improvements of up to 56%, 35%, and 71%, respectively. Additionally, it achieves contrast-to-noise ratio (CNR) and Edge Preservation Index (EPI) values that closely align with full-dose images, effectively preserving image details, tumor shape, and contrast, while resolving issues with blurred edges. The results of reader studies indicated that the images restored by Cycle-DCN consistently received the highest ratings from nuclear medicine physicians, highlighting their strong clinical relevance.

[CV-75] 2D Empirical Transforms. Wavelets Ridgelets and Curvelets revisited

链接: https://arxiv.org/abs/2410.23533
作者: Jerome Gilles,Giang Tran,Stanley Osher
关键词-EN: Empirical Wavelet Transform, recently developed, analyzed signal, aims to build, Wavelet Transform
类目: Functional Analysis (math.FA); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:A recently developed new approach, called ``Empirical Wavelet Transform’', aims to build 1D adaptive wavelet frames accordingly to the analyzed signal. In this paper, we present several extensions of this approach to 2D signals (images). We revisit some well-known transforms (tensor wavelets, Littlewood-Paley wavelets, ridgelets and curvelets) and show that it is possible to build their empirical counterpart. We prove that such constructions lead to different adaptive frames which show some promising properties for image analysis and processing.

[CV-76] NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation

链接: https://arxiv.org/abs/2410.23368
作者: Amin Ranem,John Kalkhof,Anirban Mukhopadhyay
关键词-EN: previously acquired knowledge, Continual learning, Neural Cellular Automata, retaining previously acquired, acquired knowledge
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Continual learning (CL) in medical imaging presents a unique challenge, requiring models to adapt to new domains while retaining previously acquired knowledge. We introduce NCAdapt, a Neural Cellular Automata (NCA) based method designed to address this challenge. NCAdapt features a domain-specific multi-head structure, integrating adaptable convolutional layers into the NCA backbone for each new domain encountered. After initial training, the NCA backbone is frozen, and only the newly added adaptable convolutional layers, consisting of 384 parameters, are trained along with domain-specific NCA convolutions. We evaluate NCAdapt on hippocampus segmentation tasks, benchmarking its performance against Lifelong nnU-Net and U-Net models with state-of-the-art (SOTA) CL methods. Our lightweight approach achieves SOTA performance, underscoring its effectiveness in addressing CL challenges in medical imaging. Upon acceptance, we will make our code base publicly accessible to support reproducibility and foster further advancements in medical CL.

[CV-77] Deep learning meets tree phenology modeling: PhenoFormer vs. process-based models

链接: https://arxiv.org/abs/2410.23327
作者: Vivien Sainte Fare Garnot,Lynsay Spafford,Jelle Lever,Christian Sigg,Barbara Pietragalla,Yann Vitasse,Arthur Gessler,Jan Dirk Wegner
关键词-EN: cyclical plant life, plant life events, emergence and coloration, bio-climatic system, timing of cyclical
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: journal-preprint

点击查看摘要

Abstract:Phenology, the timing of cyclical plant life events such as leaf emergence and coloration, is crucial in the bio-climatic system. Climate change drives shifts in these phenological events, impacting ecosystems and the climate itself. Accurate phenology models are essential to predict the occurrence of these phases under changing climatic conditions. Existing methods include hypothesis-driven process models and data-driven statistical approaches. Process models account for dormancy stages and various phenology drivers, while statistical models typically rely on linear or traditional machine learning techniques. Research shows that process models often outperform statistical methods when predicting under climate conditions outside historical ranges, especially with climate change scenarios. However, deep learning approaches remain underexplored in climate phenology modeling. We introduce PhenoFormer, a neural architecture better suited than traditional statistical methods at predicting phenology under shift in climate data distribution, while also bringing significant improvements or performing on par to the best performing process-based models. Our numerical experiments on a 70-year dataset of 70,000 phenological observations from 9 woody species in Switzerland show that PhenoFormer outperforms traditional machine learning methods by an average of 13% R2 and 1.1 days RMSE for spring phenology, and 11% R2 and 0.7 days RMSE for autumn phenology, while matching or exceeding the best process-based models. Our results demonstrate that deep learning has the potential to be a valuable methodological tool for accurate climate-phenology prediction, and our PhenoFormer is a first promising step in improving phenological predictions before a complete understanding of the underlying physiological mechanisms is available.

[CV-78] Improved Patch Denoising Diffusion Probabilistic Models for Magnetic Resonance Fingerprinting

链接: https://arxiv.org/abs/2410.23318
作者: Perla Mayo,Carolin M. Pirkl,Alin Achim,Bjoern H. Menze,Mohammad Golbabaee
关键词-EN: Magnetic Resonance Fingerprinting, Magnetic Resonance, Resonance Fingerprinting, multiple tissue properties, enabling the mapping
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 2 algorithms

点击查看摘要

Abstract:Magnetic Resonance Fingerprinting (MRF) is a time-efficient approach to quantitative MRI, enabling the mapping of multiple tissue properties from a single, accelerated scan. However, achieving accurate reconstructions remains challenging, particularly in highly accelerated and undersampled acquisitions, which are crucial for reducing scan times. While deep learning techniques have advanced image reconstruction, the recent introduction of diffusion models offers new possibilities for imaging tasks, though their application in the medical field is still emerging. Notably, diffusion models have not yet been explored for the MRF problem. In this work, we propose for the first time a conditional diffusion probabilistic model for MRF image reconstruction. Qualitative and quantitative comparisons on in-vivo brain scan data demonstrate that the proposed approach can outperform established deep learning and compressed sensing algorithms for MRF reconstruction. Extensive ablation studies also explore strategies to improve computational efficiency of our approach.

机器学习

[LG-0] Robust Gaussian Processes via Relevance Pursuit NEURIPS2024

链接: https://arxiv.org/abs/2410.24222
作者: Sebastian Ament,Elizabeth Santorella,David Eriksson,Ben Letham,Maximilian Balandat,Eytan Bakshy
关键词-EN: well-calibrated uncertainty estimates, non-parametric probabilistic regression, Gaussian processes, data efficiency, homoskedastic Gaussian noise
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: NeurIPS 2024 Article

点击查看摘要

Abstract:Gaussian processes (GPs) are non-parametric probabilistic regression models that are popular due to their flexibility, data efficiency, and well-calibrated uncertainty estimates. However, standard GP models assume homoskedastic Gaussian noise, while many real-world applications are subject to non-Gaussian corruptions. Variants of GPs that are more robust to alternative noise models have been proposed, and entail significant trade-offs between accuracy and robustness, and between computational requirements and theoretical guarantees. In this work, we propose and study a GP model that achieves robustness against sparse outliers by inferring data-point-specific noise levels with a sequential selection procedure maximizing the log marginal likelihood that we refer to as relevance pursuit. We show, surprisingly, that the model can be parameterized such that the associated log marginal likelihood is strongly concave in the data-point-specific noise variances, a property rarely found in either robust regression objectives or GP marginal likelihoods. This in turn implies the weak submodularity of the corresponding subset selection problem, and thereby proves approximation guarantees for the proposed algorithm. We compare the model’s performance relative to other approaches on diverse regression and Bayesian optimization tasks, including the challenging but common setting of sparse corruptions of the labels within or close to the function range.

[LG-1] CaAdam: Improving Adam optimizer using connection aware methods

链接: https://arxiv.org/abs/2410.24216
作者: Remi Genet,Hugo Inzirillo
关键词-EN: loss function minima, function minima, enhances convergence speed, loss function, Adam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a new method inspired by Adam that enhances convergence speed and achieves better loss function minima. Traditional optimizers, including Adam, apply uniform or globally adjusted learning rates across neural networks without considering their architectural specifics. This architecture-agnostic approach is deeply embedded in most deep learning frameworks, where optimizers are implemented as standalone modules without direct access to the network’s structural information. For instance, in popular frameworks like Keras or PyTorch, optimizers operate solely on gradients and parameters, without knowledge of layer connectivity or network topology. Our algorithm, CaAdam, explores this overlooked area by introducing connection-aware optimization through carefully designed proxies of architectural information. We propose multiple scaling methodologies that dynamically adjust learning rates based on easily accessible structural properties such as layer depth, connection counts, and gradient distributions. This approach enables more granular optimization while working within the constraints of current deep learning frameworks. Empirical evaluations on standard datasets (e.g., CIFAR-10, Fashion MNIST) show that our method consistently achieves faster convergence and higher accuracy compared to standard Adam optimizer, demonstrating the potential benefits of incorporating architectural awareness in optimization strategies.

[LG-2] abM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling

链接: https://arxiv.org/abs/2410.24210
作者: Yury Gorishniy,Akim Kotelnikov,Artem Babenko
关键词-EN: Deep learning architectures, Deep learning, sophisticated Transformers, Transformers and retrieval-augmented, simple multilayer perceptrons
类目: Machine Learning (cs.LG)
*备注: Code: this https URL

点击查看摘要

Abstract:Deep learning architectures for supervised learning on tabular data range from simple multilayer perceptrons (MLP) to sophisticated Transformers and retrieval-augmented methods. This study highlights a major, yet so far overlooked opportunity for substantially improving tabular MLPs: namely, parameter-efficient ensembling – a paradigm for implementing an ensemble of models as one model producing multiple predictions. We start by developing TabM – a simple model based on MLP and our variations of BatchEnsemble (an existing technique). Then, we perform a large-scale evaluation of tabular DL architectures on public benchmarks in terms of both task performance and efficiency, which renders the landscape of tabular DL in a new light. Generally, we show that MLPs, including TabM, form a line of stronger and more practical models compared to attention- and retrieval-based architectures. In particular, we find that TabM demonstrates the best performance among tabular DL models. Lastly, we conduct an empirical analysis on the ensemble-like nature of TabM. For example, we observe that the multiple predictions of TabM are weak individually, but powerful collectively. Overall, our work brings an impactful technique to tabular DL, analyses its behaviour, and advances the performance-efficiency trade-off with TabM – a simple and powerful baseline for researchers and practitioners.

[LG-3] Group Crosscoders for Mechanistic Analysis of Symmetry

链接: https://arxiv.org/abs/2410.24184
作者: Liv Gorton
关键词-EN: analyse symmetrical features, neural networks, systematically discover, discover and analyse, analyse symmetrical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce group crosscoders, an extension of crosscoders that systematically discover and analyse symmetrical features in neural networks. While neural networks often develop equivariant representations without explicit architectural constraints, understanding these emergent symmetries has traditionally relied on manual analysis. Group crosscoders automate this process by performing dictionary learning across transformed versions of inputs under a symmetry group. Applied to InceptionV1’s mixed3b layer using the dihedral group \mathrmD_32 , our method reveals several key insights: First, it naturally clusters features into interpretable families that correspond to previously hypothesised feature types, providing more precise separation than standard sparse autoencoders. Second, our transform block analysis enables the automatic characterisation of feature symmetries, revealing how different geometric features (such as curves versus lines) exhibit distinct patterns of invariance and equivariance. These results demonstrate that group crosscoders can provide systematic insights into how neural networks represent symmetry, offering a promising new tool for mechanistic interpretability.

[LG-4] AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties

链接: https://arxiv.org/abs/2410.24178
作者: Xiayan Ji,Anton Xue,Eric Wong,Oleg Sokolsky,Insup Lee
关键词-EN: identifying critical errors, methods lack interpretability, current methods lack, suspicious behaviors, lack interpretability
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods and recent advances in generative models to introduce counterfactual explanations for anomaly detection. Given an input, we generate its counterfactual as a diffusion-based repair that shows what a non-anomalous version should have looked like. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata, offering a unified framework for generating and evaluating explanations. We demonstrate the effectiveness of our anomaly explainability framework, AR-Pro, on vision (MVTec, VisA) and time-series (SWaT, WADI, HAI) anomaly datasets. The code used for the experiments is accessible at: this https URL.

[LG-5] he Importance of Being Scalable: Improving the Speed and Accuracy of Neural Network Interatomic Potentials Across Chemical Domains NEURIPS2024

链接: https://arxiv.org/abs/2410.24169
作者: Eric Qu,Aditi S. Krishnapriyan
关键词-EN: machine learning, critical in improving, generalization in machine, Network Interatomic Potentials, model
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Scaling has been critical in improving model performance and generalization in machine learning. It involves how a model’s performance changes with increases in model size or input data, as well as how efficiently computational resources are utilized to support this growth. Despite successes in other areas, the study of scaling in Neural Network Interatomic Potentials (NNIPs) remains limited. NNIPs act as surrogate models for ab initio quantum mechanical calculations. The dominant paradigm here is to incorporate many physical domain constraints into the model, such as rotational equivariance. We contend that these complex constraints inhibit the scaling ability of NNIPs, and are likely to lead to performance plateaus in the long run. In this work, we take an alternative approach and start by systematically studying NNIP scaling strategies. Our findings indicate that scaling the model through attention mechanisms is efficient and improves model expressivity. These insights motivate us to develop an NNIP architecture designed for scalability: the Efficiently Scaled Attention Interatomic Potential (EScAIP). EScAIP leverages a multi-head self-attention formulation within graph neural networks, applying attention at the neighbor-level representations. Implemented with highly-optimized attention GPU kernels, EScAIP achieves substantial gains in efficiency–at least 10x faster inference, 5x less memory usage–compared to existing NNIPs. EScAIP also achieves state-of-the-art performance on a wide range of datasets including catalysts (OC20 and OC22), molecules (SPICE), and materials (MPTrj). We emphasize that our approach should be thought of as a philosophy rather than a specific model, representing a proof-of-concept for developing general-purpose NNIPs that achieve better expressivity through scaling, and continue to scale efficiently with increased computational resources and training data.

[LG-6] Approaches to human activity recognition via passive radar

链接: https://arxiv.org/abs/2410.24166
作者: Christian Bresciani,Federico Cerutti,Marco Cominelli
关键词-EN: Channel State Information, Wi-Fi Channel State, Human Activity Recognition, Activity Recognition, State Information
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The thesis explores novel methods for Human Activity Recognition (HAR) using passive radar with a focus on non-intrusive Wi-Fi Channel State Information (CSI) data. Traditional HAR approaches often use invasive sensors like cameras or wearables, raising privacy issues. This study leverages the non-intrusive nature of CSI, using Spiking Neural Networks (SNN) to interpret signal variations caused by human movements. These networks, integrated with symbolic reasoning frameworks such as DeepProbLog, enhance the adaptability and interpretability of HAR systems. SNNs offer reduced power consumption, ideal for privacy-sensitive applications. Experimental results demonstrate SNN-based neurosymbolic models achieve high accuracy making them a promising alternative for HAR across various domains.

[LG-7] pi_0: A Vision-Language-Action Flow Model for General Robot Control

链接: https://arxiv.org/abs/2410.24164
作者: Kevin Black,Noah Brown,Danny Driess,Adnan Esmail,Michael Equi,Chelsea Finn,Niccolo Fusai,Lachy Groom,Karol Hausman,Brian Ichter,Szymon Jakubczak,Tim Jones,Liyiming Ke,Sergey Levine,Adrian Li-Bell,Mohith Mothukuri,Suraj Nair,Karl Pertsch,Lucy Xiaoyang Shi,James Tanner,Quan Vuong,Anna Walling,Haohuan Wang,Ury Zhilinsky
关键词-EN: holds tremendous promise, learning holds tremendous, Robot learning holds, potential of flexible, artificial intelligence
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: See project website for videos: this https URL

点击查看摘要

Abstract:Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

[LG-8] Conformalized Prediction of Post-Fault Voltage Trajectories Using Pre-trained and Finetuned Attention-Driven Neural Operators

链接: https://arxiv.org/abs/2410.24162
作者: Amirhossein Mollaali,Gabriel Zufferey,Gonzalo Constante-Flores,Christian Moya,Can Li,Guang Lin,Meng Yue
关键词-EN: Deep Operator Network, post-fault voltage trajectories, voltage trajectories, Quantile Attention-Fourier Deep, Attention-Fourier Deep Operator
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a new data-driven methodology for predicting intervals of post-fault voltage trajectories in power systems. We begin by introducing the Quantile Attention-Fourier Deep Operator Network (QAF-DeepONet), designed to capture the complex dynamics of voltage trajectories and reliably estimate quantiles of the target trajectory without any distributional assumptions. The proposed operator regression model maps the observed portion of the voltage trajectory to its unobserved post-fault trajectory. Our methodology employs a pre-training and fine-tuning process to address the challenge of limited data availability. To ensure data privacy in learning the pre-trained model, we use merging via federated learning with data from neighboring buses, enabling the model to learn the underlying voltage dynamics from such buses without directly sharing their data. After pre-training, we fine-tune the model with data from the target bus, allowing it to adapt to unique dynamics and operating conditions. Finally, we integrate conformal prediction into the fine-tuned model to ensure coverage guarantees for the predicted intervals. We evaluated the performance of the proposed methodology using the New England 39-bus test system considering detailed models of voltage and frequency controllers. Two metrics, Prediction Interval Coverage Probability (PICP) and Prediction Interval Normalized Average Width (PINAW), are used to numerically assess the model’s performance in predicting intervals. The results show that the proposed approach offers practical and reliable uncertainty quantification in predicting the interval of post-fault voltage trajectories.

[LG-9] Dense Associative Memory Through the Lens of Random Features NEURIPS2024

链接: https://arxiv.org/abs/2410.24153
作者: Benjamin Hoover,Duen Horng Chau,Hendrik Strobelt,Parikshit Ram,Dmitry Krotov
关键词-EN: high storage capacity, storage capacity variants, Dense Associative Memories, Dense Associative, Hopfield networks
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Dense Associative Memories are high storage capacity variants of the Hopfield networks that are capable of storing a large number of memory patterns in the weights of the network of a given size. Their common formulations typically require storing each pattern in a separate set of synaptic weights, which leads to the increase of the number of synaptic weights when new patterns are introduced. In this work we propose an alternative formulation of this class of models using random features, commonly used in kernel methods. In this formulation the number of network’s parameters remains fixed. At the same time, new memories can be added to the network by modifying existing weights. We show that this novel network closely approximates the energy function and dynamics of conventional Dense Associative Memories and shares their desirable computational properties.

[LG-10] Q-learning for Quantile MDPs: A Decomposition Performance and Convergence Analysis

链接: https://arxiv.org/abs/2410.24128
作者: Jia Lin Hau,Erick Delage,Esther Derman,Mohammad Ghavamzadeh,Marek Petrik
关键词-EN: Markov decision processes, Markov decision, quantile risk measures, decision processes, risk measures
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents’ preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.

[LG-11] Repository-Level Compositional Code Translation and Validation

链接: https://arxiv.org/abs/2410.24117
作者: Ali Reza Ibrahimzada,Kaiyao Ke,Mrigank Pawagi,Muhammad Salman Abid,Rangeet Pan,Saurabh Sinha,Reyhaneh Jabbarvand
关键词-EN: Code translation transforms, Large Language Models, Code translation, Code, translation
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code translation transforms programs from one programming language (PL) to another. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve and cannot generalize to other PLs. Recent studies have explored the automation of code translation using Large Language Models (LLMs). One key observation is that such techniques may work well for crafted benchmarks but fail to generalize to the scale and complexity of real-world projects with dependencies, custom types, PL-specific features, etc. We propose AlphaTrans, a neuro-symbolic approach to automate repository-level code translation. AlphaTrans translates both source and test code, and employs multiple levels of validation to ensure the translation preserves the functionality of the source program. To break down the problem for LLMs, AlphaTrans leverages program analysis to decompose the program into fragments and translates them in the reverse call order. We leveraged AlphaTrans to translate ten real-world open-source projects consisting of 836, 8575, 2719 classes, methods, and tests. AlphaTrans translated the entire repository of these projects consisting of 6899 source code fragments. 99.1% of the translated code fragments are syntactically correct, and AlphaTrans validates the translations’ runtime behavior and functional correctness for 25.8%. On average, the integrated translation and validation take 36 hours to translate a project, showing its scalability in practice. For the syntactically or semantically incorrect translations, AlphaTrans generates a report including existing translation, stack trace, test errors, or assertion failures. We provided these artifacts to two developers to fix the translation bugs in four projects. They were able to fix the issues in 20.1 hours on average and achieve all passing tests. Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG) Cite as: arXiv:2410.24117 [cs.SE] (or arXiv:2410.24117v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2410.24117 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-12] On Sampling Strategies for Spectral Model Sharding NEURIPS2024

链接: https://arxiv.org/abs/2410.24106
作者: Denis Korzhenkov,Christos Louizos
关键词-EN: lot of attention, heterogeneous clients, recently drawn, drawn a lot, Spectral model sharding
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:The problem of heterogeneous clients in federated learning has recently drawn a lot of attention. Spectral model sharding, i.e., partitioning the model parameters into low-rank matrices based on the singular value decomposition, has been one of the proposed solutions for more efficient on-device training in such settings. In this work, we present two sampling strategies for such sharding, obtained as solutions to specific optimization problems. The first produces unbiased estimators of the original weights, while the second aims to minimize the squared approximation error. We discuss how both of these estimators can be incorporated in the federated learning loop and practical considerations that arise during local training. Empirically, we demonstrate that both of these methods can lead to improved performance on various commonly used datasets.

[LG-13] Matchmaker: Self-Improving Large Language Model Programs for Schema Matching NEURIPS2024 ALT

链接: https://arxiv.org/abs/2410.24105
作者: Nabeel Seedat,Mihaela van der Schaar
关键词-EN: interoperable machine learning, creating interoperable machine, disparate data sources, tables and hierarchies, machine learning
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024, GenAI for Health Workshop and Table Representation Learning Workshop

点击查看摘要

Abstract:Schema matching – the task of finding matches between attributes across disparate data sources with different tables and hierarchies – is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance and e-commerce – but also has the potential to benefit ML models more generally, by increasing the data available for ML model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automate schema matching have either required significant labeled data for model training, which is often unrealistic or suffer from poor zero-shot performance. To this end, we propose Matchmaker - a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker also self-improves in a zero-shot manner without the need for labeled demonstrations via a novel optimization approach, which constructs synthetic in-context demonstrations to guide the language model’s reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.

[LG-14] Clustering to Minimize Cluster-Aware Norm Objectives

链接: https://arxiv.org/abs/2410.24104
作者: Martin G. Herold,Evangelos Kipouridis,Joachim Spoerhase
关键词-EN: general clustering problem, clustering, initiate the study, norm, general clustering
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: accepted at SODA 2025

点击查看摘要

Abstract:We initiate the study of the following general clustering problem. We seek to partition a given set P of data points into k clusters by finding a set X of k centers and assigning each data point to one of the centers. The cost of a cluster, represented by a center x\in X , is a monotone, symmetric norm f (inner norm) of the vector of distances of points assigned to x . The goal is to minimize a norm g (outer norm) of the vector of cluster costs. This problem, which we call (f,g) -Clustering, generalizes many fundamental clustering problems such as k -Center, k -Median , Min-Sum of Radii, and Min-Load k -Clustering . A recent line of research (Chakrabarty, Swamy [STOC’19]) studies norm objectives that are oblivious to the cluster structure such as k -Median and k -Center. In contrast, our problem models cluster-aware objectives including Min-Sum of Radii and Min-Load k -Clustering. Our main results are as follows. First, we design a constant-factor approximation algorithm for (\textsftop_\ell,\mathcalL_1) -Clustering where the inner norm ( \textsftop_\ell ) sums over the \ell largest distances. Second, we design a constant-factor approximation\ for (\mathcalL_\infty,\textsfOrd) -Clustering where the outer norm is a convex combination of \textsftop_\ell norms (ordered weighted norm). Comments: accepted at SODA 2025 Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2410.24104 [cs.DS] (or arXiv:2410.24104v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2410.24104 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-15] Benchmark Data Repositories for Better Benchmarking NEURIPS

链接: https://arxiv.org/abs/2410.24100
作者: Rachel Longjohn,Markelle Kelly,Sameer Singh,Padhraic Smyth
关键词-EN: machine learning research, machine learning, benchmark data repositories, standard benchmark datasets, common to evaluate
类目: Machine Learning (cs.LG); Digital Libraries (cs.DL)
*备注: Accepted to NeurIPS Datasets and Benchmarks 2024

点击查看摘要

Abstract:In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for – and levies criticisms at – data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these \textitbenchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.

[LG-16] Progressive Safeguards for Safe and Model-Agnostic Reinforcement Learning

链接: https://arxiv.org/abs/2410.24096
作者: Nabil Omi,Hosein Hasanbeig,Hiteshi Sharma,Sriram K. Rajamani,Siddhartha Sen
关键词-EN: propose a formal, paper we propose, safety, safe reinforcement learning, safeguard
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In this paper we propose a formal, model-agnostic meta-learning framework for safe reinforcement learning. Our framework is inspired by how parents safeguard their children across a progression of increasingly riskier tasks, imparting a sense of safety that is carried over from task to task. We model this as a meta-learning process where each task is synchronized with a safeguard that monitors safety and provides a reward signal to the agent. The safeguard is implemented as a finite-state machine based on a safety specification; the reward signal is formally shaped around this specification. The safety specification and its corresponding safeguard can be arbitrarily complex and non-Markovian, which adds flexibility to the training process and explainability to the learned policy. The design of the safeguard is manual but it is high-level and model-agnostic, which gives rise to an end-to-end safe learning approach with wide applicability, from pixel-level game control to language model fine-tuning. Starting from a given set of safety specifications (tasks), we train a model such that it can adapt to new specifications using only a small number of training samples. This is made possible by our method for efficiently transferring safety bias between tasks, which effectively minimizes the number of safety violations. We evaluate our framework in a Minecraft-inspired Gridworld, a VizDoom game environment, and an LLM fine-tuning application. Agents trained with our approach achieve near-minimal safety violations, while baselines are shown to underperform.

[LG-17] Hamiltonian Monte Carlo Inference of Marginalized Linear Mixed-Effects Models NEURIPS2024

链接: https://arxiv.org/abs/2410.24079
作者: Jinlin Lai,Daniel Sheldon,Justin Domke
关键词-EN: Markov chain Monte, chain Monte Carlo, Hamiltonian Monte Carlo, Monte Carlo, requires advanced sampling
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Bayesian reasoning in linear mixed-effects models (LMMs) is challenging and often requires advanced sampling techniques like Markov chain Monte Carlo (MCMC). A common approach is to write the model in a probabilistic programming language and then sample via Hamiltonian Monte Carlo (HMC). However, there are many ways a user can transform a model that make inference more or less efficient. In particular, marginalizing some variables can greatly improve inference but is difficult for users to do manually. We develop an algorithm to easily marginalize random effects in LMMs. A naive approach introduces cubic time operations within an inference algorithm like HMC, but we reduce the running time to linear using fast linear algebra techniques. We show that marginalization is always beneficial when applicable and highlight improvements in various models, especially ones from cognitive sciences.

[LG-18] Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs

链接: https://arxiv.org/abs/2410.24071
作者: Davide Maran,Alberto Maria Metelli,Matteo Papini,Marcello Restelli
关键词-EN: Reinforcement Learning, major open problems, property for Reinforcement, open problems, Markov Decision Processes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving the no-regret property for Reinforcement Learning (RL) problems in continuous state and action-space environments is one of the major open problems in the field. Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes. Furthermore, many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon H in the regret, which makes any possible solution unfeasible in practice. In this paper, we identify local linearity as the feature that makes Markov Decision Processes (MDPs) both learnable (sublinear regret) and feasible (regret that is polynomial in H ). We define a novel MDP representation class, namely Locally Linearizable MDPs, generalizing other representation classes like Linear MDPs and MDPS with low inherent Belmman error. Then, i) we introduce Cinderella, a no-regret algorithm for this general representation class, and ii) we show that all known learnable and feasible MDP families are representable in this class. We first show that all known feasible MDPs belong to a family that we call Mildly Smooth MDPs. Then, we show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation. This way, Cinderella is shown to achieve state-of-the-art regret bounds for all previously known (and some new) continuous MDPs for which RL is learnable and feasible.

[LG-19] A Visual Case Study of the Training Dynamics in Neural Networks

链接: https://arxiv.org/abs/2410.24050
作者: Ambroise Odonnat,Wassim Bouaziz,Vivien Cabannes
关键词-EN: small-scale transformer model, embedding dimension constrained, visual sandbox designed, transformer model, paper introduces
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces a visual sandbox designed to explore the training dynamics of a small-scale transformer model, with the embedding dimension constrained to d=2 . This restriction allows for a comprehensive two-dimensional visualization of each layer’s dynamics. Through this approach, we gain insights into training dynamics, circuit transferability, and the causes of loss spikes, including those induced by the high curvature of normalization layers. We propose strategies to mitigate these spikes, demonstrating how good visualization facilitates the design of innovative ideas of practical interest. Additionally, we believe our sandbox could assist theoreticians in assessing essential training dynamics mechanisms and integrating them into future theories. The code is available at this https URL.

[LG-20] AdaFlow: Opportunistic Inference on Asynchronous Mobile Data with Generalized Affinity Control

链接: https://arxiv.org/abs/2410.24028
作者: Fenmin Wu,Sicong Liu,Kehao Zhu,Xiaochen Li,Bin Guo,Zhiwen Yu,Hongkai Wen,Xiangrui Xu,Lehao Wang,Xiangyu Liu
关键词-EN: mobile devices equipped, multi-modal deep intelligence, numerous sensors, LiDAR and cameras, driving assistance
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The rise of mobile devices equipped with numerous sensors, such as LiDAR and cameras, has spurred the adoption of multi-modal deep intelligence for distributed sensing tasks, such as smart cabins and driving assistance. However, the arrival times of mobile sensory data vary due to modality size and network dynamics, which can lead to delays (if waiting for slower data) or accuracy decline (if inference proceeds without waiting). Moreover, the diversity and dynamic nature of mobile systems exacerbate this challenge. In response, we present a shift to \textitopportunistic inference for asynchronous distributed multi-modal data, enabling inference as soon as partial data arrives. While existing methods focus on optimizing modality consistency and complementarity, known as modal affinity, they lack a \textitcomputational approach to control this affinity in open-world mobile environments. AdaFlow pioneers the formulation of structured cross-modality affinity in mobile contexts using a hierarchical analysis-based normalized matrix. This approach accommodates the diversity and dynamics of modalities, generalizing across different types and numbers of inputs. Employing an affinity attention-based conditional GAN (ACGAN), AdaFlow facilitates flexible data imputation, adapting to various modalities and downstream tasks without retraining. Experiments show that AdaFlow significantly reduces inference latency by up to 79.9% and enhances accuracy by up to 61.9%, outperforming status quo approaches.

[LG-21] Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

链接: https://arxiv.org/abs/2410.24023
作者: Suhan Guo,Jiahong Deng,Yi Wei,Hui Dou,Furao Shen,Jian Zhao
关键词-EN: time series forecasting, series forecasting tasks, long-term time series, time series, series forecasting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal (STF) and long-term time series forecasting (LTSF). Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we have shown empirically that the entire attention mechanism in the encoder can be reduced to an MLP formed by feedforward, skip-connection, and layer normalization operations for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projection, the attention score calculation, the dot-product between the attention score and the V, and the final projection can be removed from the attention-based networks without significantly degrading the performance that the given network remains the top-tier compared to other SOTA methods. For spatio-temporal networks, the MLP-replace-attention network achieves a reduction in FLOPS of 62.579% with a loss in performance less than 2.5% ; for LTSF, a reduction in FLOPs of 42.233% with a loss in performance less than 2% .

[LG-22] Maximum Entropy Hindsight Experience Replay

链接: https://arxiv.org/abs/2410.24016
作者: Douglas C. Crowder,Matthew L. Trappett,Darrien M. McKenzie,Frances S. Chance
关键词-EN: Hindsight experience replay, goal-based reinforcement learning, Hindsight experience, accelerate goal-based reinforcement, experience replay
类目: Machine Learning (cs.LG)
*备注: 11 pages, 11 Figures

点击查看摘要

Abstract:Hindsight experience replay (HER) is well-known to accelerate goal-based reinforcement learning (RL). While HER is generally applied to off-policy RL algorithms, we previously showed that HER can also accelerate on-policy algorithms, such as proximal policy optimization (PPO), for goal-based Predator-Prey environments. Here, we show that we can improve the previous PPO-HER algorithm by selectively applying HER in a principled manner.

[LG-23] Diffusion Twigs with Loop Guidance for Conditional Graph Generation NEURIPS2024 ALT

链接: https://arxiv.org/abs/2410.24012
作者: Giangiacomo Mercatali,Yogesh Verma,Andre Freitas,Vikas Garg
关键词-EN: incorporates multiple co-evolving, framework named Twigs, score-based diffusion framework, diffusion framework named, multiple co-evolving flows
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024. Code is available at this https URL

点击查看摘要

Abstract:We introduce a novel score-based diffusion framework named Twigs that incorporates multiple co-evolving flows for enriching conditional generation tasks. Specifically, a central or trunk diffusion process is associated with a primary variable (e.g., graph structure), and additional offshoot or stem processes are dedicated to dependent variables (e.g., graph properties or labels). A new strategy, which we call loop guidance, effectively orchestrates the flow of information between the trunk and the stem processes during sampling. This approach allows us to uncover intricate interactions and dependencies, and unlock new generative capabilities. We provide extensive experiments to demonstrate strong performance gains of the proposed method over contemporary baselines in the context of conditional graph generation, underscoring the potential of Twigs in challenging generative tasks such as inverse molecular design and molecular optimization.

[LG-24] Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.24005
作者: Paulius Rauba,Nabeel Seedat,Max Ruiz Luyten,Mihaela van der Schaar
关键词-EN: compute aggregate evaluation, aggregate evaluation metrics, predominant de facto, compute aggregate, assessing the performance
类目: Machine Learning (cs.LG)
*备注: Presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). *Rauba Seedat contributed equally

点击查看摘要

Abstract:The predominant de facto paradigm of testing ML models relies on either using only held-out data to compute aggregate evaluation metrics or by assessing the performance on different subgroups. However, such data-only testing methods operate under the restrictive assumption that the available empirical data is the sole input for testing ML models, disregarding valuable contextual information that could guide model testing. In this paper, we challenge the go-to approach of data-only testing and introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures, which are evaluated on data using a self-falsification mechanism. Through empirical evaluations in diverse settings, we show that SMART automatically identifies more relevant and impactful failures than alternatives, demonstrating the potential of CAT as a testing paradigm.

[LG-25] Breaking Determinism: Fuzzy Modeling of Sequential Recommendation Using Discrete State Space Diffusion Model NEURIPS’2024

链接: https://arxiv.org/abs/2410.23994
作者: Wenjia Xie,Hao Wang,Luankang Zhang,Rui Zhou,Defu Lian,Enhong Chen
关键词-EN: historical behavior sequences, aims to predict, Sequential recommendation, conventional sequential modeling, historical behavior
类目: Machine Learning (cs.LG)
*备注: NeurIPS’2024, 10 pages

点击查看摘要

Abstract:Sequential recommendation (SR) aims to predict items that users may be interested in based on their historical behavior sequences. We revisit SR from a novel information-theoretic perspective and find that conventional sequential modeling methods fail to adequately capture the randomness and unpredictability of user behavior. Inspired by fuzzy information processing theory, this paper introduces the DDSR model, which uses fuzzy sets of interaction sequences to overcome the limitations and better capture the evolution of users’ real interests. Formally based on diffusion transition processes in discrete state spaces, which is unlike common diffusion models such as DDPM that operate in continuous domains. It is better suited for discrete data, using structured transitions instead of arbitrary noise introduction to avoid information loss. Additionally, to address the inefficiency of matrix transformations due to the vast discrete space, we use semantic labels derived from quantization or RQ-VAE to replace item IDs, enhancing efficiency and improving cold start issues. Testing on three public benchmark datasets shows that DDSR outperforms existing state-of-the-art methods in various settings, demonstrating its potential and effectiveness in handling SR tasks.

[LG-26] Ada-MSHyper: Adaptive Multi-Scale Hypergraph Transformer for Time Series Forecasting NEURIPS

链接: https://arxiv.org/abs/2410.23992
作者: Zongjiang Shang,Ling Chen,Binqing wu,Dongliang Cui
关键词-EN: Individual time points, achieved great success, key challenges limit, information utilization bottleneck, model pair-wise interactions
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS, 21 pages, and 8 figures

点击查看摘要

Abstract:Although transformer-based methods have achieved great success in multi-scale temporal pattern interaction modeling, two key challenges limit their further development: (1) Individual time points contain less semantic information, and leveraging attention to model pair-wise interactions may cause the information utilization bottleneck. (2) Multiple inherent temporal variations (e.g., rising, falling, and fluctuating) entangled in temporal patterns. To this end, we propose Adaptive Multi-Scale Hypergraph Transformer (Ada-MSHyper) for time series forecasting. Specifically, an adaptive hypergraph learning module is designed to provide foundations for modeling group-wise interactions, then a multi-scale interaction module is introduced to promote more comprehensive pattern interactions at different scales. In addition, a node and hyperedge constraint mechanism is introduced to cluster nodes with similar semantic information and differentiate the temporal variations within each scales. Extensive experiments on 11 real-world datasets demonstrate that Ada-MSHyper achieves state-of-the-art performance, reducing prediction errors by an average of 4.56%, 10.38%, and 4.97% in MSE for long-range, short-range, and ultra-long-range time series forecasting, respectively. Code is available at this https URL.

[LG-27] Scalable Kernel Inverse Optimization NEURIPS2024

链接: https://arxiv.org/abs/2410.23952
作者: Youyuan Long,Tolga Ok,Pedro Zattoni Scroccaro,Peyman Mohajerin Esfahani
关键词-EN: unknown objective function, Kernel Inverse Optimization, kernel Hilbert space, Inverse Optimization, past dataset
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Inverse Optimization (IO) is a framework for learning the unknown objective function of an expert decision-maker from a past dataset. In this paper, we extend the hypothesis class of IO objective functions to a reproducing kernel Hilbert space (RKHS), thereby enhancing feature representation to an infinite-dimensional space. We demonstrate that a variant of the representer theorem holds for a specific training loss, allowing the reformulation of the problem as a finite-dimensional convex optimization program. To address scalability issues commonly associated with kernel methods, we propose the Sequential Selection Optimization (SSO) algorithm to efficiently train the proposed Kernel Inverse Optimization (KIO) model. Finally, we validate the generalization capabilities of the proposed KIO model and the effectiveness of the SSO algorithm through learning-from-demonstration tasks on the MuJoCo benchmark.

[LG-28] Deep Learning Frameworks for Cognitive Radio Networks: Review and Open Research Challenges

链接: https://arxiv.org/abs/2410.23949
作者: Senthil Kumar Jagatheesaperumal,Ijaz Ahmad,Marko Höyhtyä,Suleman Khan,Andrei Gurtov
关键词-EN: cognitive radio networks, cognitive radio, Deep learning, radio networks, spectrum sensing
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: The article has been accepted for publication in “Journal of Network and Computer Applications” during October 2024

点击查看摘要

Abstract:Deep learning has been proven to be a powerful tool for addressing the most significant issues in cognitive radio networks, such as spectrum sensing, spectrum sharing, resource allocation, and security attacks. The utilization of deep learning techniques in cognitive radio networks can significantly enhance the network’s capability to adapt to changing environments and improve the overall system’s efficiency and reliability. As the demand for higher data rates and connectivity increases, B5G/6G wireless networks are expected to enable new services and applications significantly. Therefore, the significance of deep learning in addressing cognitive radio network challenges cannot be overstated. This review article provides valuable insights into potential solutions that can serve as a foundation for the development of future B5G/6G services. By leveraging the power of deep learning, cognitive radio networks can pave the way for the next generation of wireless networks capable of meeting the ever-increasing demands for higher data rates, improved reliability, and security.

[LG-29] ransformers to Predict the Applicability of Symbolic Integration Routines NEURIPS2024

链接: https://arxiv.org/abs/2410.23948
作者: Rashid Barket,Uzma Shafiq,Matthew England,Juergen Gerhard
关键词-EN: Computer Algebra System, Algebra System, Computer Algebra, Symbolic integration, problem in mathematics
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: 10 pages, 5 figures, to be published in NeurIPS 2024 MATH-AI Workshop

点击查看摘要

Abstract:Symbolic integration is a fundamental problem in mathematics: we consider how machine learning may be used to optimise this task in a Computer Algebra System (CAS). We train transformers that predict whether a particular integration method will be successful, and compare against the existing human-made heuristics (called guards) that perform this task in a leading CAS. We find the transformer can outperform these guards, gaining up to 30% accuracy and 70% precision. We further show that the inference time of the transformer is inconsequential which shows that it is well-suited to include as a guard in a CAS. Furthermore, we use Layer Integrated Gradients to interpret the decisions that the transformer is making. If guided by a subject-matter expert, the technique can explain some of the predictions based on the input tokens, which can lead to further optimisations.

[LG-30] Quantum Deep Equilibrium Models NEURIPS2024

链接: https://arxiv.org/abs/2410.23940
作者: Philipp Schleich,Marta Skreta,Lasse B. Kristensen,Rodrigo A. Vargas-Hernández,Alán Aspuru-Guzik
关键词-EN: variational quantum algorithms, involved parametrized quantum, deep equilibrium models, parametrized quantum circuits, quantum
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: To be published in NeurIPS 2024

点击查看摘要

Abstract:The feasibility of variational quantum algorithms, the most popular correspondent of neural networks on noisy, near-term quantum hardware, is highly impacted by the circuit depth of the involved parametrized quantum circuits (PQCs). Higher depth increases expressivity, but also results in a detrimental accumulation of errors. Furthermore, the number of parameters involved in the PQC significantly influences the performance through the necessary number of measurements to evaluate gradients, which scales linearly with the number of parameters. Motivated by this, we look at deep equilibrium models (DEQs), which mimic an infinite-depth, weight-tied network using a fraction of the memory by employing a root solver to find the fixed points of the network. In this work, we present Quantum Deep Equilibrium Models (QDEQs): a training paradigm that learns parameters of a quantum machine learning model given by a PQC using DEQs. To our knowledge, no work has yet explored the application of DEQs to QML models. We apply QDEQs to find the parameters of a quantum circuit in two settings: the first involves classifying MNIST-4 digits with 4 qubits; the second extends it to 10 classes of MNIST, FashionMNIST and CIFAR. We find that QDEQ is not only competitive with comparable existing baseline models, but also achieves higher performance than a network with 5 times more layers. This demonstrates that the QDEQ paradigm can be used to develop significantly more shallow quantum circuits for a given task, something which is essential for the utility of near-term quantum computers. Our code is available at this https URL. Comments: To be published in NeurIPS 2024 Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph) Cite as: arXiv:2410.23940 [cs.LG] (or arXiv:2410.23940v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.23940 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-31] Learning Macroscopic Dynamics from Partial Microscopic Observations

链接: https://arxiv.org/abs/2410.23938
作者: Mengyi Chen,Qianxiao Li
关键词-EN: keen interest, interest in real, real applications, microscopic, microscopic coordinates
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Macroscopic observables of a system are of keen interest in real applications such as the design of novel materials. Current methods rely on microscopic trajectory simulations, where the forces on all microscopic coordinates need to be computed or measured. However, this can be computationally prohibitive for realistic systems. In this paper, we propose a method to learn macroscopic dynamics requiring only force computations on a subset of the microscopic coordinates. Our method relies on a sparsity assumption: the force on each microscopic coordinate relies only on a small number of other coordinates. The main idea of our approach is to map the training procedure on the macroscopic coordinates back to the microscopic coordinates, on which partial force computations can be used as stochastic estimation to update model parameters. We provide a theoretical justification of this under suitable conditions. We demonstrate the accuracy, force computation efficiency, and robustness of our method on learning macroscopic closure models from a variety of microscopic systems, including those modeled by partial differential equations or molecular dynamics simulations.

[LG-32] Robust Sparse Regression with Non-Isotropic Designs NEURIPS2024

链接: https://arxiv.org/abs/2410.23937
作者: Chih-Hung Liu,Gleb Novikov
关键词-EN: design efficiently computable, algorithm achieves error, varepsilon, sparse linear regression, efficiently computable estimators
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: NeurIPS 2024; Authors have equal contribution

点击查看摘要

Abstract:We develop a technique to design efficiently computable estimators for sparse linear regression in the simultaneous presence of two adversaries: oblivious and adaptive. We design several robust algorithms that outperform the state of the art even in the special case when oblivious adversary simply adds Gaussian noise. In particular, we provide a polynomial-time algorithm that with high probability recovers the signal up to error O(\sqrt\varepsilon) as long as the number of samples n \ge \tildeO(k^2/\varepsilon) , only assuming some bounds on the third and the fourth moments of the distribution D of the design. In addition, prior to this work, even in the special case of Gaussian design and noise, no polynomial time algorithm was known to achieve error o(\sqrt\varepsilon) in the sparse setting n d^2 . We show that under some assumptions on the fourth and the eighth moments of D , there is a polynomial-time algorithm that achieves error o(\sqrt\varepsilon) as long as n \ge \tildeO(k^4 / \varepsilon^3) . For Gaussian distribution, this algorithm achieves error O(\varepsilon^3/4) . Moreover, our algorithm achieves error o(\sqrt\varepsilon) for all log-concave distributions if \varepsilon \le 1/\textpolylog(d) . Our algorithms are based on the filtering of the covariates that uses sum-of-squares relaxations, and weighted Huber loss minimization with \ell_1 regularizer. We provide a novel analysis of weighted penalized Huber loss that is suitable for heavy-tailed designs in the presence of two adversaries. Furthermore, we complement our algorithmic results with Statistical Query lower bounds, providing evidence that our estimators are likely to have nearly optimal sample complexity. Comments: NeurIPS 2024; Authors have equal contribution Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML) Cite as: arXiv:2410.23937 [cs.LG] (or arXiv:2410.23937v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.23937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-33] Analyzing Reducing the Need for Learning Rate Warmup in GPT Training NEURIPS2024

链接: https://arxiv.org/abs/2410.23922
作者: Atli Kosson,Bettina Messmer,Martin Jaggi
关键词-EN: Learning Rate Warmup, Learning Rate, mathbf, training neural networks, popular heuristic
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size \Delta \mathbfw_t = \eta_t \mathbfu_t early in training by using lower values for the learning rate \eta_t . In this work we argue that warmup benefits training by keeping the overall size of \Delta \mathbfw_t limited, counteracting large initial values of \mathbfu_t . Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates \mathbfu_t too large? We analyze different metrics for the update size including the \ell_2 -norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize \mathbfu_t based on the aforementioned metrics.

[LG-34] Metamorphic Malware Evolution: The Potential and Peril of Large Language Models

链接: https://arxiv.org/abs/2410.23894
作者: Pooria Madani
关键词-EN: computer programming exercise, partial or entire, Code metamorphism refers, consistently and automatically, core functionality
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code metamorphism refers to a computer programming exercise wherein the program modifies its own code (partial or entire) consistently and automatically while retaining its core functionality. This technique is often used for online performance optimization and automated crash recovery in certain mission-critical applications. However, the technique has been misappropriated by malware creators to bypass signature-based detection measures instituted by anti-malware engines. However, current code mutation engines used by threat actors offer only a limited degree of mutation, which is frequently detectable via static code analysis. The advent of large language models (LLMs), such as ChatGPT 4.0 and Google Bard may lead to a significant evolution in this landscape. These models have demonstrated a level of algorithm comprehension and code synthesis capability that closely resembles human abilities. This advancement has sparked concerns among experts that such models could be exploited by threat actors to generate sophisticated metamorphic malware. This paper explores the potential of several prominent LLMs for software code mutation that may be used to reconstruct (with mutation) existing malware code bases or create new forms of embedded mutation engines for next-gen metamorphic malwares. In this work, we introduce a framework for creating self-testing program mutation engines based on LLM/Transformer-based models. The proposed framework serves as an essential tool in testing next-gen metamorphic malware detection engines.

[LG-35] DiffBatt: A Diffusion Model for Battery Degradation Prediction and Synthesis

链接: https://arxiv.org/abs/2410.23893
作者: Hamidreza Eivazi,André Hebenbrock,Raphael Ginster,Steffen Blömeke,Stefan Wittek,Christoph Hermann,Thomas S. Spengler,Thomas Turek,Andreas Rausch
关键词-EN: sustainable energy solutions, Battery degradation, energy solutions, pursuit of green, green technologies
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Battery degradation remains a critical challenge in the pursuit of green technologies and sustainable energy solutions. Despite significant research efforts, predicting battery capacity loss accurately remains a formidable task due to its complex nature, influenced by both aging and cycling behaviors. To address this challenge, we introduce a novel general-purpose model for battery degradation prediction and synthesis, DiffBatt. Leveraging an innovative combination of conditional and unconditional diffusion models with classifier-free guidance and transformer architecture, DiffBatt achieves high expressivity and scalability. DiffBatt operates as a probabilistic model to capture uncertainty in aging behaviors and a generative model to simulate battery degradation. The performance of the model excels in prediction tasks while also enabling the generation of synthetic degradation curves, facilitating enhanced model training by data augmentation. In the remaining useful life prediction task, DiffBatt provides accurate results with a mean RMSE of 196 cycles across all datasets, outperforming all other models and demonstrating superior generalizability. This work represents an important step towards developing foundational models for battery degradation.

[LG-36] DynaSplit: A Hardware-Software Co-Design Framework for Energy-Aware Inference on Edge

链接: https://arxiv.org/abs/2410.23881
作者: Daniel May,Alessandro Tundo,Shashikant Ilager,Ivona Brandic
关键词-EN: limited computational resources, challenged by limited, limited computational, computational resources, energy availability
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The deployment of ML models on edge devices is challenged by limited computational resources and energy availability. While split computing enables the decomposition of large neural networks (NNs) and allows partial computation on both edge and cloud devices, identifying the most suitable split layer and hardware configurations is a non-trivial task. This process is in fact hindered by the large configuration space, the non-linear dependencies between software and hardware parameters, the heterogeneous hardware and energy characteristics, and the dynamic workload conditions. To overcome this challenge, we propose DynaSplit, a two-phase framework that dynamically configures parameters across both software (i.e., split layer) and hardware (e.g., accelerator usage, CPU frequency). During the Offline Phase, we solve a multi-objective optimization problem with a meta-heuristic approach to discover optimal settings. During the Online Phase, a scheduling algorithm identifies the most suitable settings for an incoming inference request and configures the system accordingly. We evaluate DynaSplit using popular pre-trained NNs on a real-world testbed. Experimental results show a reduction in energy consumption up to 72% compared to cloud-only computation, while meeting ~90% of user request’s latency threshold compared to baselines.

[LG-37] Directly Optimizing Explanations for Desired Properties

链接: https://arxiv.org/abs/2410.23880
作者: Hiwot Belay Tadesse,Alihan Hüyük,Weiwei Pan,Finale Doshi-Velez
关键词-EN: machine learning models, explaining black-box machine, black-box machine learning, learning models, explaining black-box
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When explaining black-box machine learning models, it’s often important for explanations to have certain desirable properties. Most existing methods `encourage’ desirable properties in their construction of explanations. In this work, we demonstrate that these forms of encouragement do not consistently create explanations with the properties that are supposedly being targeted. Moreover, they do not allow for any control over which properties are prioritized when different properties are at odds with each other. We propose to directly optimize explanations for desired properties. Our direct approach not only produces explanations with optimal properties more consistently but also empowers users to control trade-offs between different properties, allowing them to create explanations with exactly what is needed for a particular task.

[LG-38] Noise as a Double-Edged Sword: Reinforcement Learning Exploits Randomized Defenses in Neural Networks

链接: https://arxiv.org/abs/2410.23870
作者: Steve Bakos,Pooria Madani,Heidar Davoudi
关键词-EN: adversarial machine learning, investigates a counterintuitive, counterintuitive phenomenon, inadvertently aid evasion, aid evasion attacks
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates a counterintuitive phenomenon in adversarial machine learning: the potential for noise-based defenses to inadvertently aid evasion attacks in certain scenarios. While randomness is often employed as a defensive strategy against adversarial examples, our research reveals that this approach can sometimes backfire, particularly when facing adaptive attackers using reinforcement learning (RL). Our findings show that in specific cases, especially with visually noisy classes, the introduction of noise in the classifier’s confidence values can be exploited by the RL attacker, leading to a significant increase in evasion success rates. In some instances, the noise-based defense scenario outperformed other strategies by up to 20% on a subset of classes. However, this effect was not consistent across all classifiers tested, highlighting the complexity of the interaction between noise-based defenses and different models. These results suggest that in some cases, noise-based defenses can inadvertently create an adversarial training loop beneficial to the RL attacker. Our study emphasizes the need for a more nuanced approach to defensive strategies in adversarial machine learning, particularly in safety-critical applications. It challenges the assumption that randomness universally enhances defense against evasion attacks and highlights the importance of considering adaptive, RL-based attackers when designing robust defense mechanisms.

[LG-39] QuACK: A Multipurpose Queuing Algorithm for Cooperative k-Armed Bandits

链接: https://arxiv.org/abs/2410.23867
作者: Benjamin Howson,Sarah Filippi,Ciara Pike-Burke
关键词-EN: armed bandit problem, cooperative stochastic, agents collaborate, study the cooperative, collaborate to find
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the cooperative stochastic k -armed bandit problem, where a network of m agents collaborate to find the optimal action. In contrast to most prior work on this problem, which focuses on extending a specific algorithm to the multi-agent setting, we provide a black-box reduction that allows us to extend any single-agent bandit algorithm to the multi-agent setting. Under mild assumptions on the bandit environment, we prove that our reduction transfers the regret guarantees of the single-agent algorithm to the multi-agent setting. These guarantees are tight in subgaussian environments, in that using a near minimax optimal single-player algorithm is near minimax optimal in the multi-player setting up to an additive graph-dependent quantity. Our reduction and theoretical results are also general, and apply to many different bandit settings. By plugging in appropriate single-player algorithms, we can easily develop provably efficient algorithms for many multi-player settings such as heavy-tailed bandits, duelling bandits and bandits with local differential privacy, among others. Experimentally, our approach is competitive with or outperforms specialised multi-agent algorithms.

[LG-40] psiDAG: Projected Stochastic Approximation Iteration for DAG Structure Learning

链接: https://arxiv.org/abs/2410.23862
作者: Klea Ziu,Slavomír Hanzely,Loka Li,Kun Zhang,Martin Takáč,Dmitry Kamzolov
关键词-EN: Directed Acyclic Graphs, Acyclic Graphs, Directed Acyclic, vast combinatorial search, combinatorial search space
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Learning the structure of Directed Acyclic Graphs (DAGs) presents a significant challenge due to the vast combinatorial search space of possible graphs, which scales exponentially with the number of nodes. Recent advancements have redefined this problem as a continuous optimization task by incorporating differentiable acyclicity constraints. These methods commonly rely on algebraic characterizations of DAGs, such as matrix exponentials, to enable the use of gradient-based optimization techniques. Despite these innovations, existing methods often face optimization difficulties due to the highly non-convex nature of DAG constraints and the per-iteration computational complexity. In this work, we present a novel framework for learning DAGs, employing a Stochastic Approximation approach integrated with Stochastic Gradient Descent (SGD)-based optimization techniques. Our framework introduces new projection methods tailored to efficiently enforce DAG constraints, ensuring that the algorithm converges to a feasible local minimum. With its low iteration complexity, the proposed method is well-suited for handling large-scale problems with improved computational efficiency. We demonstrate the effectiveness and scalability of our framework through comprehensive experimental evaluations, which confirm its superior performance across various settings.

[LG-41] Neural Network Matrix Product Operator: A Multi-Dimensionally Integrable Machine Learning Potential

链接: https://arxiv.org/abs/2410.23858
作者: Kentaro Hino,Yuki Kurashige
关键词-EN: matrix product operator, potential energy surface, network-based machine learning, machine learning potential, neural network-based machine
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Chemical Physics (physics.chem-ph); Quantum Physics (quant-ph)
*备注: 11 pages, 10 figures

点击查看摘要

Abstract:A neural network-based machine learning potential energy surface (PES) expressed in a matrix product operator (NN-MPO) is proposed. The MPO form enables efficient evaluation of high-dimensional integrals that arise in solving the time-dependent and time-independent Schrödinger equation and effectively overcomes the so-called curse of dimensionality. This starkly contrasts with other neural network-based machine learning PES methods, such as multi-layer perceptrons (MLPs), where evaluating high-dimensional integrals is not straightforward due to the fully connected topology in their backbone architecture. Nevertheless, the NN-MPO retains the high representational capacity of neural networks. NN-MPO can achieve spectroscopic accuracy with a test mean absolute error (MAE) of 3.03 cm ^-1 for a fully coupled six-dimensional ab initio PES, using only 625 training points distributed across a 0 to 17,000 cm ^-1 energy range. Our Python implementation is available at this https URL.

[LG-42] Case ID detection based on time series data – the mining use case

链接: https://arxiv.org/abs/2410.23846
作者: Edyta Brzychczy,Tomasz Pełech-Pilichowski,Ziemowit Dworakowski
关键词-EN: gains increasing popularity, business process analysis, mining gains increasing, heavy industry, Process mining gains
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Presented at EdbA’24 - Fifth International Workshop on Event Data and Behavioral Analytics, ICPM 2024, Kopenhagen, Denmark

点击查看摘要

Abstract:Process mining gains increasing popularity in business process analysis, also in heavy industry. It requires a specific data format called an event log, with the basic structure including a case identifier (case ID), activity (event) name, and timestamp. In the case of industrial processes, data is very often provided by a monitoring system as time series of low level sensor readings. This data cannot be directly used for process mining since there is no explicit marking of activities in the event log, and sometimes, case ID is not provided. We propose a novel rule-based algorithm for identification patterns, based on the identification of significant changes in short-term mean values of selected variable to detect case ID. We present our solution on the mining use case. We compare computed results (identified patterns) with expert labels of the same dataset. Experiments show that the developed algorithm in the most of the cases correctly detects IDs in datasets with and without outliers reaching F1 score values: 96.8% and 97% respectively. We also evaluate our algorithm on dataset from manufacturing domain reaching value 92.6% for F1 score.

[LG-43] Deterministic Exploration via Stationary Bellm an Error Maximization

链接: https://arxiv.org/abs/2410.23840
作者: Sebastian Griesbach,Carlo D’Eramo
关键词-EN: fundamental open problem, open problem, crucial and distinctive, distinctive aspect, aspect of reinforcement
类目: Machine Learning (cs.LG)
*备注: Published at the 17th European Workshop for Reinforcement Learning

点击查看摘要

Abstract:Exploration is a crucial and distinctive aspect of reinforcement learning (RL) that remains a fundamental open problem. Several methods have been proposed to tackle this challenge. Commonly used methods inject random noise directly into the actions, indirectly via entropy maximization, or add intrinsic rewards that encourage the agent to steer to novel regions of the state space. Another previously seen idea is to use the Bellman error as a separate optimization objective for exploration. In this paper, we introduce three modifications to stabilize the latter and arrive at a deterministic exploration policy. Our separate exploration agent is informed about the state of the exploitation, thus enabling it to account for previous experiences. Further components are introduced to make the exploration objective agnostic toward the episode length and to mitigate instability introduced by far-off-policy learning. Our experimental results show that our approach can outperform \varepsilon -greedy in dense and sparse reward settings.

[LG-44] Reducing Oversmoothing through Informed Weight Initialization in Graph Neural Networks

链接: https://arxiv.org/abs/2410.23830
作者: Dimitrios Kelesis,Dimitris Fotakis,Georgios Paliouras
关键词-EN: Graph Neural Networks, graph classification tasks, Neural Networks, ideas of Kaiming, Graph Neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we generalize the ideas of Kaiming initialization to Graph Neural Networks (GNNs) and propose a new scheme (G-Init) that reduces oversmoothing, leading to very good results in node and graph classification tasks. GNNs are commonly initialized using methods designed for other types of Neural Networks, overlooking the underlying graph topology. We analyze theoretically the variance of signals flowing forward and gradients flowing backward in the class of convolutional GNNs. We then simplify our analysis to the case of the GCN and propose a new initialization method. Our results indicate that the new method (G-Init) reduces oversmoothing in deep GNNs, facilitating their effective use. Experimental validation supports our theoretical findings, demonstrating the advantages of deep networks in scenarios with no feature information for unlabeled nodes (i.e., ``cold start’’ scenario).

[LG-45] Generative AI-Powered Plugin for Robust Federated Learning in Heterogeneous IoT Networks

链接: https://arxiv.org/abs/2410.23824
作者: Youngjoon Lee,Jinu Gong,Joonhyuk Kang
关键词-EN: Federated learning enables, learning enables edge, keeping data localized, maintaining data privacy, learning enables
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 8 pages

点击查看摘要

Abstract:Federated learning enables edge devices to collaboratively train a global model while maintaining data privacy by keeping data localized. However, the Non-IID nature of data distribution across devices often hinders model convergence and reduces performance. In this paper, we propose a novel plugin for federated optimization techniques that approximates Non-IID data distributions to IID through generative AI-enhanced data augmentation and balanced sampling strategy. Key idea is to synthesize additional data for underrepresented classes on each edge device, leveraging generative AI to create a more balanced dataset across the FL network. Additionally, a balanced sampling approach at the central server selectively includes only the most IID-like devices, accelerating convergence while maximizing the global model’s performance. Experimental results validate that our approach significantly improves convergence speed and robustness against data imbalance, establishing a flexible, privacy-preserving FL plugin that is applicable even in data-scarce environments.

[LG-46] Weight decay induces low-rank attention layers

链接: https://arxiv.org/abs/2410.23819
作者: Seijin Kobayashi,Yassir Akram,Johannes Von Oswald
关键词-EN: deep neural networks, training deep neural, training neural network, effect of regularizers, weight decay
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as L2 -regularization when training neural network models in which parameter matrices interact multiplicatively. This combination is of particular interest as this parametrization is common in attention layers, the workhorse of transformers. Here, key-query, as well as value-projection parameter matrices, are multiplied directly with each other: W_K^TW_Q and PW_V . We extend previous results and show on one hand that any local minimum of a L2 -regularized loss of the form L(AB^\top) + \lambda (|A|^2 + |B|^2) coincides with a minimum of the nuclear norm-regularized loss L(AB^\top) + \lambda|AB^\top|_* , and on the other hand that the 2 losses become identical exponentially quickly during training. We thus complement existing works linking L2 -regularization with low-rank regularization, and in particular, explain why such regularization on the matrix product affects early stages of training. Based on these theoretical insights, we verify empirically that the key-query and value-projection matrix products W_K^TW_Q, PW_V within attention layers, when optimized with weight decay, as usually done in vision tasks and language modelling, indeed induce a significant reduction in the rank of W_K^TW_Q and PW_V , even in fully online training. We find that, in accordance with existing work, inducing low rank in attention matrix products can damage language model performance, and observe advantages when decoupling weight decay in attention layers from the rest of the parameters.

[LG-47] Graph Neural Networks Uncover Geometric Neural Representations in Reinforcement-Based Motor Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.23812
作者: Federico Nardi,Jinpei Han,Shlomi Haar,A.Aldo Faisal
关键词-EN: Graph Neural Networks, Neural Networks, EEG data, neural representations, Neural
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 19 pages, 7 figures, accepted at the NeurIPS 2024 workshop on Symmetry and Geometry in Neural Representations (NeurReps 2024)

点击查看摘要

Abstract:Graph Neural Networks (GNN) can capture the geometric properties of neural representations in EEG data. Here we utilise those to study how reinforcement-based motor learning affects neural activity patterns during motor planning, leveraging the inherent graph structure of EEG channels to capture the spatial relationships in brain activity. By exploiting task-specific symmetries, we define different pretraining strategies that not only improve model performance across all participant groups but also validate the robustness of the geometric representations. Explainability analysis based on the graph structures reveals consistent group-specific neural signatures that persist across pretraining conditions, suggesting stable geometric structures in the neural representations associated with motor learning and feedback processing. These geometric patterns exhibit partial invariance to certain task space transformations, indicating symmetries that enable generalisation across conditions while maintaining specificity to individual learning strategies. This work demonstrates how GNNs can uncover the effects of previous outcomes on motor planning, in a complex real-world task, providing insights into the geometric principles governing neural representations. Our experimental design bridges the gap between controlled experiments and ecologically valid scenarios, offering new insights into the organisation of neural representations during naturalistic motor learning, which may open avenues for exploring fundamental principles governing brain activity in complex tasks.

[LG-48] One Sample Fits All: Approximating All Probabilistic Values Simultaneously and Efficiently

链接: https://arxiv.org/abs/2410.23808
作者: Weida Li,Yaoliang Yu
关键词-EN: gained recent attention, weighted Banzhaf, data valuation, gained recent, recent attention
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The concept of probabilistic values, such as Beta Shapley values and weighted Banzhaf values, has gained recent attention in applications like feature attribution and data valuation. However, exact computation of these values is often exponentially expensive, necessitating approximation techniques. Prior research has shown that the choice of probabilistic values significantly impacts downstream performance, with no universally superior option. Consequently, one may have to approximate multiple candidates and select the best-performing one. Although there have been many efforts to develop efficient estimators, none are intended to approximate all probabilistic values both simultaneously and efficiently. In this work, we embark on the first exploration of achieving this goal. Adhering to the principle of maximum sample reuse, we propose a one-sample-fits-all framework parameterized by a sampling vector to approximate intermediate terms that can be converted to any probabilistic value without amplifying scalars. Leveraging the concept of (\epsilon, \delta) -approximation, we theoretically identify a key formula that effectively determines the convergence rate of our framework. By optimizing the sampling vector using this formula, we obtain i) a one-for-all estimator that achieves the currently best time complexity for all probabilistic values on average, and ii) a faster generic estimator with the sampling vector optimally tuned for each probabilistic value. Particularly, our one-for-all estimator achieves the fastest convergence rate on Beta Shapley values, including the well-known Shapley value, both theoretically and empirically. Finally, we establish a connection between probabilistic values and the least square regression used in (regularized) datamodels, showing that our one-for-all estimator can solve a family of datamodels simultaneously.

[LG-49] Neural Model Checking NEURIPS2024

链接: https://arxiv.org/abs/2410.23790
作者: Mirco Giacobbe,Daniel Kroening,Abhinandan Pal,Michael Tautschnig
关键词-EN: model checking, temporal logic, checking temporal logic, model checking temporal, temporal logic specification
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: To appear in NeurIPS 2024

点击查看摘要

Abstract:We introduce a machine learning approach to model checking temporal logic, with application to formal hardware verification. Model checking answers the question of whether every execution of a given system satisfies a desired temporal logic specification. Unlike testing, model checking provides formal guarantees. Its application is expected standard in silicon design and the EDA industry has invested decades into the development of performant symbolic model checking algorithms. Our new approach combines machine learning and symbolic reasoning by using neural networks as formal proof certificates for linear temporal logic. We train our neural certificates from randomly generated executions of the system and we then symbolically check their validity using satisfiability solving which, upon the affirmative answer, establishes that the system provably satisfies the specification. We leverage the expressive power of neural networks to represent proof certificates as well as the fact that checking a certificate is much simpler than finding one. As a result, our machine learning procedure for model checking is entirely unsupervised, formally sound, and practically effective. We experimentally demonstrate that our method outperforms the state-of-the-art academic and commercial model checkers on a set of standard hardware designs written in SystemVerilog.

[LG-50] owards Convexity in Anomaly Detection: A New Formulation of SSLM with Unique Optimal Solutions

链接: https://arxiv.org/abs/2410.23774
作者: Hongying Liu,Hao Wang,Haoran Chu,Yibo Wu
关键词-EN: Vector Data Description, Large Margin SVM, Data Description, Small Sphere, Sphere and Large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An unsolved issue in widely used methods such as Support Vector Data Description (SVDD) and Small Sphere and Large Margin SVM (SSLM) for anomaly detection is their nonconvexity, which hampers the analysis of optimal solutions in a manner similar to SVMs and limits their applicability in large-scale scenarios. In this paper, we introduce a novel convex SSLM formulation which has been demonstrated to revert to a convex quadratic programming problem for hyperparameter values of interest. Leveraging the convexity of our method, we derive numerous results that are unattainable with traditional nonconvex approaches. We conduct a thorough analysis of how hyperparameters influence the optimal solution, pointing out scenarios where optimal solutions can be trivially found and identifying instances of ill-posedness. Most notably, we establish connections between our method and traditional approaches, providing a clear determination of when the optimal solution is unique – a task unachievable with traditional nonconvex methods. We also derive the \nu-property to elucidate the interactions between hyperparameters and the fractions of support vectors and margin errors in both positive and negative classes.

[LG-51] owards Generative Ray Path Sampling for Faster Point-to-Point Ray Tracing ICML

链接: https://arxiv.org/abs/2410.23773
作者: Jérome Eertmans,Nicola Di Cicco,Claude Oestges,Laurent Jacques,Enrico M. Vittuci,Vittorio Degli-Esposti
关键词-EN: Radio propagation modeling, radio channels result, Ray Tracing, Machine Learning, existing Machine Learning
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 6 figures, submitted to IEEE ICMLCN 2025

点击查看摘要

Abstract:Radio propagation modeling is essential in telecommunication research, as radio channels result from complex interactions with environmental objects. Recently, Machine Learning has been attracting attention as a potential alternative to computationally demanding tools, like Ray Tracing, which can model these interactions in detail. However, existing Machine Learning approaches often attempt to learn directly specific channel characteristics, such as the coverage map, making them highly specific to the frequency and material properties and unable to fully capture the underlying propagation mechanisms. Hence, Ray Tracing, particularly the Point-to-Point variant, remains popular to accurately identify all possible paths between transmitter and receiver nodes. Still, path identification is computationally intensive because the number of paths to be tested grows exponentially while only a small fraction is valid. In this paper, we propose a Machine Learning-aided Ray Tracing approach to efficiently sample potential ray paths, significantly reducing the computational load while maintaining high accuracy. Our model dynamically learns to prioritize potentially valid paths among all possible paths and scales linearly with scene complexity. Unlike recent alternatives, our approach is invariant with translation, scaling, or rotation of the geometry, and avoids dependency on specific environment characteristics.

[LG-52] Disentangling Interactions and Dependencies in Feature Attribution

链接: https://arxiv.org/abs/2410.23772
作者: Gunnar König,Eric Günther,Ulrike von Luxburg
关键词-EN: explainable machine learning, feature importance scores, global feature importance, feature importance methods, predicting the target
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: GK and EG contributed equally to this article

点击查看摘要

Abstract:In explainable machine learning, global feature importance methods try to determine how much each individual feature contributes to predicting the target variable, resulting in one importance score for each feature. But often, predicting the target variable requires interactions between several features (such as in the XOR function), and features might have complex statistical dependencies that allow to partially replace one feature with another one. In commonly used feature importance scores these cooperative effects are conflated with the features’ individual contributions, making them prone to misinterpretations. In this work, we derive DIP, a new mathematical decomposition of individual feature importance scores that disentangles three components: the standalone contribution and the contributions stemming from interactions and dependencies. We prove that the DIP decomposition is unique and show how it can be estimated in practice. Based on these results, we propose a new visualization of feature importance scores that clearly illustrates the different contributions.

[LG-53] A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning ICONIP2024

链接: https://arxiv.org/abs/2410.23737
作者: JaeYoon Kim,Junyu Xuan,Christy Liang,Farookh Hussain
关键词-EN: improve data efficiency, online policies trained, offline policy, accelerate performance enhancement, online policy
类目: Machine Learning (cs.LG)
*备注: ICONIP 2024

点击查看摘要

Abstract:Offline-to-online reinforcement learning (RL) leverages both pre-trained offline policies and online policies trained for downstream tasks, aiming to improve data efficiency and accelerate performance enhancement. An existing approach, Policy Expansion (PEX), utilizes a policy set composed of both policies without modifying the offline policy for exploration and learning. However, this approach fails to ensure sufficient learning of the online policy due to an excessive focus on exploration with both policies. Since the pre-trained offline policy can assist the online policy in exploiting a downstream task based on its prior experience, it should be executed effectively and tailored to the specific requirements of the downstream task. In contrast, the online policy, with its immature behavioral strategy, has the potential for exploration during the training phase. Therefore, our research focuses on harmonizing the advantages of the offline policy, termed exploitation, with those of the online policy, referred to as exploration, without modifying the offline policy. In this study, we propose an innovative offline-to-online RL method that employs a non-monolithic exploration approach. Our methodology demonstrates superior performance compared to PEX.

[LG-54] Zero-shot Class Unlearning via Layer-wise Relevance Analysis and Neuronal Path Perturbation

链接: https://arxiv.org/abs/2410.23693
作者: Wenhan Chang,Tianqing Zhu,Yufeng Wu,Wanlei Zhou
关键词-EN: unlearning, machine unlearning, Neuronal Path Perturbation, artificial intelligence, giving rise
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:In the rapid advancement of artificial intelligence, privacy protection has become crucial, giving rise to machine unlearning. Machine unlearning is a technique that removes specific data influences from trained models without the need for extensive retraining. However, it faces several key challenges, including accurately implementing unlearning, ensuring privacy protection during the unlearning process, and achieving effective unlearning without significantly compromising model performance. This paper presents a novel approach to machine unlearning by employing Layer-wise Relevance Analysis and Neuronal Path Perturbation. We address three primary challenges: the lack of detailed unlearning principles, privacy guarantees in zero-shot unlearning scenario, and the balance between unlearning effectiveness and model utility. Our method balances machine unlearning performance and model utility by identifying and perturbing highly relevant neurons, thereby achieving effective unlearning. By using data not present in the original training set during the unlearning process, we satisfy the zero-shot unlearning scenario and ensure robust privacy protection. Experimental results demonstrate that our approach effectively removes targeted data from the target unlearning model while maintaining the model’s utility, offering a practical solution for privacy-preserving machine learning.

[LG-55] Automatically Learning Hybrid Digital Twins of Dynamical Systems NEURIPS2024

链接: https://arxiv.org/abs/2410.23691
作者: Samuel Holt,Tennison Liu,Mihaela van der Schaar
关键词-EN: Digital Twins, Hybrid Digital Twins, role in prediction, simulate the states, states and temporal
类目: Machine Learning (cs.LG)
*备注: Accepted as Spotlight at NeurIPS2024

点击查看摘要

Abstract:Digital Twins (DTs) are computational models that simulate the states and temporal dynamics of real-world systems, playing a crucial role in prediction, understanding, and decision-making across diverse domains. However, existing approaches to DTs often struggle to generalize to unseen conditions in data-scarce settings, a crucial requirement for such models. To address these limitations, our work begins by establishing the essential desiderata for effective DTs. Hybrid Digital Twins ( \textbfHDTwins ) represent a promising approach to address these requirements, modeling systems using a composition of both mechanistic and neural components. This hybrid architecture simultaneously leverages (partial) domain knowledge and neural network expressiveness to enhance generalization, with its modular design facilitating improved evolvability. While existing hybrid models rely on expert-specified architectures with only parameters optimized on data, \textitautomatically specifying and optimizing HDTwins remains intractable due to the complex search space and the need for flexible integration of domain priors. To overcome this complexity, we propose an evolutionary algorithm ( \textbfHDTwinGen ) that employs Large Language Models (LLMs) to autonomously propose, evaluate, and optimize HDTwins. Specifically, LLMs iteratively generate novel model specifications, while offline tools are employed to optimize emitted parameters. Correspondingly, proposed models are evaluated and evolved based on targeted feedback, enabling the discovery of increasingly effective hybrid models. Our empirical results reveal that HDTwinGen produces generalizable, sample-efficient, and evolvable models, significantly advancing DTs’ efficacy in real-world applications.

[LG-56] owards Dynamic Message Passing on Graphs NEURIPS2024

链接: https://arxiv.org/abs/2410.23686
作者: Junshu Sun,Chenxue Yang,Xiangyang Ji,Qingming Huang,Shuhui Wang
关键词-EN: effective feature learning, Message passing plays, Message passing, graph neural networks, neural networks
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Message passing plays a vital role in graph neural networks (GNNs) for effective feature learning. However, the over-reliance on input topology diminishes the efficacy of message passing and restricts the ability of GNNs. Despite efforts to mitigate the reliance, existing study encounters message-passing bottlenecks or high computational expense problems, which invokes the demands for flexible message passing with low complexity. In this paper, we propose a novel dynamic message-passing mechanism for GNNs. It projects graph nodes and learnable pseudo nodes into a common space with measurable spatial relations between them. With nodes moving in the space, their evolving relations facilitate flexible pathway construction for a dynamic message-passing process. Associating pseudo nodes to input graphs with their measured relations, graph nodes can communicate with each other intermediately through pseudo nodes under linear complexity. We further develop a GNN model named \mathtt\mathbfN^2 based on our dynamic message-passing mechanism. \mathtt\mathbfN^2 employs a single recurrent layer to recursively generate the displacements of nodes and construct optimal dynamic pathways. Evaluation on eighteen benchmarks demonstrates the superior performance of \mathtt\mathbfN^2 over popular GNNs. \mathtt\mathbfN^2 successfully scales to large-scale benchmarks and requires significantly fewer parameters for graph classification with the shared recurrent layer.

[LG-57] Projected Neural Differential Equations for Learning Constrained Dynamics

链接: https://arxiv.org/abs/2410.23667
作者: Alistair White,Anna Büttner,Maximilian Gelbrecht,Valentin Duruisseaux,Niki Kilbertus,Frank Hellmann,Niklas Boers
关键词-EN: Neural differential equations, differential equations offer, Neural differential, dynamics from data, differential equations
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Neural differential equations offer a powerful approach for learning dynamics from data. However, they do not impose known constraints that should be obeyed by the learned model. It is well-known that enforcing constraints in surrogate models can enhance their generalizability and numerical stability. In this paper, we introduce projected neural differential equations (PNDEs), a new method for constraining neural differential equations based on projection of the learned vector field to the tangent space of the constraint manifold. In tests on several challenging examples, including chaotic dynamical systems and state-of-the-art power grid models, PNDEs outperform existing methods while requiring fewer hyperparameters. The proposed approach demonstrates significant potential for enhancing the modeling of constrained dynamical systems, particularly in complex domains where accuracy and reliability are essential.

[LG-58] Local Superior Soups: A Catalyst for Model Merging in Cross-Silo Federated Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.23660
作者: Minghui Chen,Meirui Jiang,Xin Zhang,Qi Dou,Zehua Wang,Xiaoxiao Li
关键词-EN: Federated learning, enables collaborative training, decentralized data, learning paradigm, paradigm that enables
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Federated learning (FL) is a learning paradigm that enables collaborative training of models using decentralized data. Recently, the utilization of pre-trained weight initialization in FL has been demonstrated to effectively improve model performance. However, the evolving complexity of current pre-trained models, characterized by a substantial increase in parameters, markedly intensifies the challenges associated with communication rounds required for their adaptation to FL. To address these communication cost issues and increase the performance of pre-trained model adaptation in FL, we propose an innovative model interpolation-based local training technique called ``Local Superior Soups.‘’ Our method enhances local training across different clients, encouraging the exploration of a connected low-loss basin within a few communication rounds through regularized model interpolation. This approach acts as a catalyst for the seamless adaptation of pre-trained models in in FL. We demonstrated its effectiveness and efficiency across diverse widely-used FL datasets. Our code is available at \hrefthis https URLthis https URL.

[LG-59] Online Consistency of the Nearest Neighbor Rule

链接: https://arxiv.org/abs/2410.23644
作者: Sanjoy Dasgupta,Geelon So
关键词-EN: realizable online setting, learner is tasked, tasked with making, correct answer, answer is revealed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the realizable online setting, a learner is tasked with making predictions for a stream of instances, where the correct answer is revealed after each prediction. A learning rule is online consistent if its mistake rate eventually vanishes. The nearest neighbor rule (Fix and Hodges, 1951) is a fundamental prediction strategy, but it is only known to be consistent under strong statistical or geometric assumptions: the instances come i.i.d. or the label classes are well-separated. We prove online consistency for all measurable functions in doubling metric spaces under the mild assumption that the instances are generated by a process that is uniformly absolutely continuous with respect to a finite, upper doubling measure.

[LG-60] Sample-Efficient Agnostic Boosting NEURIPS2024

链接: https://arxiv.org/abs/2410.23632
作者: Udaya Ghai,Karan Singh
关键词-EN: accurate strong learner, aggregating approximate weak, Empirical Risk Minimization, approximate weak learning, random predictor
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: NeurIPS 2024 camera ready version

点击查看摘要

Abstract:The theory of boosting provides a computational framework for aggregating approximate weak learning algorithms, which perform marginally better than a random predictor, into an accurate strong learner. In the realizable case, the success of the boosting approach is underscored by a remarkable fact that the resultant sample complexity matches that of a computationally demanding alternative, namely Empirical Risk Minimization (ERM). This in particular implies that the realizable boosting methodology has the potential to offer computational relief without compromising on sample efficiency. Despite recent progress, in agnostic boosting, where assumptions on the conditional distribution of labels given feature descriptions are absent, ERM outstrips the agnostic boosting methodology in being quadratically more sample efficient than all known agnostic boosting algorithms. In this paper, we make progress on closing this gap, and give a substantially more sample efficient agnostic boosting algorithm than those known, without compromising on the computational (or oracle) complexity. A key feature of our algorithm is that it leverages the ability to reuse samples across multiple rounds of boosting, while guaranteeing a generalization error strictly better than those obtained by blackbox applications of uniform convergence arguments. We also apply our approach to other previously studied learning problems, including boosting for reinforcement learning, and demonstrate improved results. Comments: NeurIPS 2024 camera ready version Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) Cite as: arXiv:2410.23632 [cs.LG] (or arXiv:2410.23632v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.23632 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-61] An Application of the Holonomic Gradient Method to the Neural Tangent Kernel

链接: https://arxiv.org/abs/2410.23626
作者: Akihiro Sakoda,Nobuki Takayama
关键词-EN: linear partial differential, partial differential equations, roughly speaking, finite dimensional, linear partial
类目: Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:A holonomic system of linear partial differential equations is, roughly speaking, a system whose solution space is finite dimensional. A distribution that is a solution of a holonomic system is called a holonomic distribution. We give methods to numerically evaluate dual activations of holonomic activator distributions for neural tangent kernels. These methods are based on computer algebra algorithms for rings of differential operators.

[LG-62] EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography

链接: https://arxiv.org/abs/2410.23625
作者: Jehan Yang,Maxwell Soh,Vivianna Lieu,Douglas J Weber,Zackory Erickson
关键词-EN: learning for evaluating, EMG, machine learning, classification algorithms, EMG datasets
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the first generalization and adaptation benchmark using machine learning for evaluating out-of-distribution performance of electromyography (EMG) classification algorithms. The ability of an EMG classifier to handle inputs drawn from a different distribution than the training distribution is critical for real-world deployment as a control interface. By predicting the user’s intended gesture using EMG signals, we can create a wearable solution to control assistive technologies, such as computers, prosthetics, and mobile manipulator robots. This new out-of-distribution benchmark consists of two major tasks that have utility for building robust and adaptable control interfaces: 1) intersubject classification and 2) adaptation using train-test splits for time-series. This benchmark spans nine datasets–the largest collection of EMG datasets in a benchmark. Among these, a new dataset is introduced, featuring a novel, easy-to-wear high-density EMG wearable for data collection. The lack of open-source benchmarks has made comparing accuracy results between papers challenging for the EMG research community. This new benchmark provides researchers with a valuable resource for analyzing practical measures of out-of-distribution performance for EMG datasets. Our code and data from our new dataset can be found at this http URL.

[LG-63] Identifiability Guarantees for Causal Disentanglement from Purely Observational Data

链接: https://arxiv.org/abs/2410.23620
作者: Ryan Welch,Jiaqi Zhang,Caroline Uhler
关键词-EN: augment existing representation, holding the promise, interpretability and extrapolation, promise to augment, augment existing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal disentanglement aims to learn about latent causal factors behind data, holding the promise to augment existing representation learning methods in terms of interpretability and extrapolation. Recent advances establish identifiability results assuming that interventions on (single) latent factors are available; however, it remains debatable whether such assumptions are reasonable due to the inherent nature of intervening on latent variables. Accordingly, we reconsider the fundamentals and ask what can be learned using just observational data. We provide a precise characterization of latent factors that can be identified in nonlinear causal models with additive Gaussian noise and linear mixing, without any interventions or graphical restrictions. In particular, we show that the causal variables can be identified up to a layer-wise transformation and that further disentanglement is not possible. We transform these theoretical results into a practical algorithm consisting of solving a quadratic program over the score estimation of the observed data. We provide simulation results to support our theoretical guarantees and demonstrate that our algorithm can derive meaningful causal representations from purely observational data. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2410.23620 [cs.LG] (or arXiv:2410.23620v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.23620 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-64] Stabilizing Linear Passive-Aggressive Online Learning with Weighted Reservoir Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.23601
作者: Skyler Wu,Fred Lu,Edward Raff,James Holt
关键词-EN: high-dimensional streaming data, throughput-sensitive applications, highly effective, effective for high-dimensional, high-dimensional streaming
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Online learning methods, like the seminal Passive-Aggressive (PA) classifier, are still highly effective for high-dimensional streaming data, out-of-core processing, and other throughput-sensitive applications. Many such algorithms rely on fast adaptation to individual errors as a key to their convergence. While such algorithms enjoy low theoretical regret, in real-world deployment they can be sensitive to individual outliers that cause the algorithm to over-correct. When such outliers occur at the end of the data stream, this can cause the final solution to have unexpectedly low accuracy. We design a weighted reservoir sampling (WRS) approach to obtain a stable ensemble model from the sequence of solutions without requiring additional passes over the data, hold-out sets, or a growing amount of memory. Our key insight is that good solutions tend to be error-free for more iterations than bad solutions, and thus, the number of passive rounds provides an estimate of a solution’s relative quality. Our reservoir thus contains K previous intermediate weight vectors with high survival times. We demonstrate our WRS approach on the Passive-Aggressive Classifier (PAC) and First-Order Sparse Online Learning (FSOL), where our method consistently and significantly outperforms the unmodified approach. We show that the risk of the ensemble classifier is bounded with respect to the regret of the underlying online learning method.

[LG-65] RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning

链接: https://arxiv.org/abs/2410.23569
作者: Yujie Zhao,Jose Efraim Aguilar Escamill,Weyl Lu,Huazheng Wang
关键词-EN: Preference-based Reinforcement Learning, Preference-based Reinforcement, Reinforcement Learning, studies the problem, problem where agents
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preference-based Reinforcement Learning (PbRL) studies the problem where agents receive only preferences over pairs of trajectories in each episode. Traditional approaches in this field have predominantly focused on the mean reward or utility criterion. However, in PbRL scenarios demanding heightened risk awareness, such as in AI systems, healthcare, and agriculture, risk-aware measures are requisite. Traditional risk-aware objectives and algorithms are not applicable in such one-episode-reward settings. To address this, we explore and prove the applicability of two risk-aware objectives to PbRL: nested and static quantile risk objectives. We also introduce Risk-Aware- PbRL (RA-PbRL), an algorithm designed to optimize both nested and static objectives. Additionally, we provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results to support our findings. Our code is available in this https URL.

[LG-66] Prosody as a Teaching Signal for Agent Learning: Exploratory Studies and Algorithmic Implications

链接: https://arxiv.org/abs/2410.23554
作者: Matilda Knierim,Sahil Jain,Murat Han Aydoğan,Kenneth Mitra,Kush Desai,Akanksha Saran,Kim Baraka
关键词-EN: provide valuable information, implicit social cues, social cues, Agent learning, provide valuable
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Published at the 26th ACM International Conference on Multimodal Interaction (ICMI) 2024

点击查看摘要

Abstract:Agent learning from human interaction often relies on explicit signals, but implicit social cues, such as prosody in speech, could provide valuable information for more effective learning. This paper advocates for the integration of prosody as a teaching signal to enhance agent learning from human teachers. Through two exploratory studies–one examining voice feedback in an interactive reinforcement learning setup and the other analyzing restricted audio from human demonstrations in three Atari games–we demonstrate that prosody carries significant information about task dynamics. Our findings suggest that prosodic features, when coupled with explicit feedback, can enhance reinforcement learning outcomes. Moreover, we propose guidelines for prosody-sensitive algorithm design and discuss insights into teaching behavior. Our work underscores the potential of leveraging prosody as an implicit signal for more efficient agent learning, thus advancing human-agent interaction paradigms.

[LG-67] Generative forecasting of brain activity enhances Alzheimers classification and interpretation

链接: https://arxiv.org/abs/2410.23515
作者: Yutong Gao,Vince D. Calhoun,Robyn L. Miller
关键词-EN: purely data-driven approaches, data-driven approaches remains, Understanding the relationship, challenge in neuroscience, relationship between cognition
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Understanding the relationship between cognition and intrinsic brain activity through purely data-driven approaches remains a significant challenge in neuroscience. Resting-state functional magnetic resonance imaging (rs-fMRI) offers a non-invasive method to monitor regional neural activity, providing a rich and complex spatiotemporal data structure. Deep learning has shown promise in capturing these intricate representations. However, the limited availability of large datasets, especially for disease-specific groups such as Alzheimer’s Disease (AD), constrains the generalizability of deep learning models. In this study, we focus on multivariate time series forecasting of independent component networks derived from rs-fMRI as a form of data augmentation, using both a conventional LSTM-based model and the novel Transformer-based BrainLM model. We assess their utility in AD classification, demonstrating how generative forecasting enhances classification performance. Post-hoc interpretation of BrainLM reveals class-specific brain network sensitivities associated with AD.

[LG-68] Development and Comparative Analysis of Machine Learning Models for Hypoxemia Severity Triage in CBRNE Emergency Scenarios Using Physiological and Demographic Data from Medical-Grade Devices

链接: https://arxiv.org/abs/2410.23503
作者: Santino Nanini,Mariem Abid,Yassir Mamouni,Arnaud Wiedemann,Philippe Jouvet,Stephane Bourassa
关键词-EN: predict hypoxemia severity, Gradient Boosting Models, Gradient Boosting, Boosting Models, machine learning
类目: Machine Learning (cs.LG)
*备注: 12 figures, 12 tables and 39 pages

点击查看摘要

Abstract:This paper presents the development of machine learning (ML) models to predict hypoxemia severity during emergency triage, especially in Chemical, Biological, Radiological, Nuclear, and Explosive (CBRNE) events, using physiological data from medical-grade sensors. Gradient Boosting Models (XGBoost, LightGBM, CatBoost) and sequential models (LSTM, GRU) were trained on physiological and demographic data from the MIMIC-III and IV datasets. A robust preprocessing pipeline addressed missing data, class imbalances, and incorporated synthetic data flagged with masks. Gradient Boosting Models (GBMs) outperformed sequential models in terms of training speed, interpretability, and reliability, making them well-suited for real-time decision-making. While their performance was comparable to that of sequential models, the GBMs used score features from six physiological variables derived from the enhanced National Early Warning Score (NEWS) 2, which we termed NEWS2+. This approach significantly improved prediction accuracy. While sequential models handled temporal data well, their performance gains did not justify the higher computational cost. A 5-minute prediction window was chosen for timely intervention, with minute-level interpolations standardizing the data. Feature importance analysis highlighted the significant role of mask and score features in enhancing both transparency and performance. Temporal dependencies proved to be less critical, as Gradient Boosting Models were able to capture key patterns effectively without relying on them. This study highlights ML’s potential to improve triage and reduce alarm fatigue. Future work will integrate data from multiple hospitals to enhance model generalizability across clinical settings.

[LG-69] angent Space Causal Inference: Leveraging Vector Fields for Causal Discovery in Dynamical Systems NEURIPS2024

链接: https://arxiv.org/abs/2410.23499
作者: Kurt Butler,Daniel Waxman,Petar M. Djurić
关键词-EN: series data remains, increasingly important task, time series data, scientific domains, time series
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Machine Learning (stat.ML)
*备注: 18 pages, 7 figures. Accepted to NeurIPS 2024

点击查看摘要

Abstract:Causal discovery with time series data remains a challenging yet increasingly important task across many scientific domains. Convergent cross mapping (CCM) and related methods have been proposed to study time series that are generated by dynamical systems, where traditional approaches like Granger causality are unreliable. However, CCM often yields inaccurate results depending upon the quality of the data. We propose the Tangent Space Causal Inference (TSCI) method for detecting causalities in dynamical systems. TSCI works by considering vector fields as explicit representations of the systems’ dynamics and checks for the degree of synchronization between the learned vector fields. The TSCI approach is model-agnostic and can be used as a drop-in replacement for CCM and its generalizations. We first present a basic version of the TSCI algorithm, which is shown to be more effective than the basic CCM algorithm with very little additional computation. We additionally present augmented versions of TSCI that leverage the expressive power of latent variable models and deep learning. We validate our theory on standard systems, and we demonstrate improved causal inference performance across a number of benchmark tasks.

[LG-70] Multi-fidelity Machine Learning for Uncertainty Quantification and Optimization

链接: https://arxiv.org/abs/2410.23482
作者: Ruda Zhang,Negin Alemazkoor
关键词-EN: multiple computational models, system analysis, physical system, analysis and design, multiple computational
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In system analysis and design optimization, multiple computational models are typically available to represent a given physical system. These models can be broadly classified as high-fidelity models, which provide highly accurate predictions but require significant computational resources, and low-fidelity models, which are computationally efficient but less accurate. Multi-fidelity methods integrate high- and low-fidelity models to balance computational cost and predictive accuracy. This perspective paper provides an in-depth overview of the emerging field of machine learning-based multi-fidelity methods, with a particular emphasis on uncertainty quantification and optimization. For uncertainty quantification, a particular focus is on multi-fidelity graph neural networks, compared with multi-fidelity polynomial chaos expansion. For optimization, our emphasis is on multi-fidelity Bayesian optimization, offering a unified perspective on multi-fidelity priors and proposing an application strategy when the objective function is an integral or a weighted sum. We highlight the current state of the art, identify critical gaps in the literature, and outline key research opportunities in this evolving field.

[LG-71] Gradient-free training of recurrent neural networks

链接: https://arxiv.org/abs/2410.23467
作者: Erik Lien Bolager,Ana Cukarska,Iryna Burak,Zahra Monfared,Felix Dietrich
关键词-EN: successful neural architecture, Recurrent neural, dynamical systems, Recurrent neural networks, including time series
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Recurrent neural networks are a successful neural architecture for many time-dependent problems, including time series analysis, forecasting, and modeling of dynamical systems. Training such networks with backpropagation through time is a notoriously difficult problem because their loss gradients tend to explode or vanish. In this contribution, we introduce a computational approach to construct all weights and biases of a recurrent neural network without using gradient-based methods. The approach is based on a combination of random feature networks and Koopman operator theory for dynamical systems. The hidden parameters of a single recurrent block are sampled at random, while the outer weights are constructed using extended dynamic mode decomposition. This approach alleviates all problems with backpropagation commonly related to recurrent networks. The connection to Koopman operator theory also allows us to start using results in this area to analyze recurrent neural networks. In computational experiments on time series, forecasting for chaotic dynamical systems, and control problems, as well as on weather data, we observe that the training time and forecasting accuracy of the recurrent neural networks we construct are improved when compared to commonly used gradient-based methods.

[LG-72] ransformation-Invariant Learning and Theoretical Guarantees for OOD Generalization NEURIPS2024

链接: https://arxiv.org/abs/2410.23461
作者: Omar Montasser,Han Shao,Emmanuel Abbe
关键词-EN: practically and theoretically, extensively investigated, investigated both practically, test distributions, train and test
类目: Machine Learning (cs.LG)
*备注: To appear in NeurIPS 2024

点击查看摘要

Abstract:Learning with identical train and test distributions has been extensively investigated both practically and theoretically. Much remains to be understood, however, in statistical learning under distribution shifts. This paper focuses on a distribution shift setting where train and test distributions can be related by classes of (data) transformation maps. We initiate a theoretical study for this framework, investigating learning scenarios where the target class of transformations is either known or unknown. We establish learning rules and algorithmic reductions to Empirical Risk Minimization (ERM), accompanied with learning guarantees. We obtain upper bounds on the sample complexity in terms of the VC dimension of the class composing predictors with transformations, which we show in many cases is not much larger than the VC dimension of the class of predictors. We highlight that the learning rules we derive offer a game-theoretic viewpoint on distribution shift: a learner searching for predictors and an adversary searching for transformation maps to respectively minimize and maximize the worst-case loss.

[LG-73] Rethinking Deep Thinking: Stable Learning of Algorithms using Lipschitz Constraints NEURIPS2024

链接: https://arxiv.org/abs/2410.23451
作者: Jay Bear,Adam Prügel-Bennett,Jonathon Hare
关键词-EN: Iterative algorithms solve, Deep Thinking, algorithms solve problems, taking steps, Iterative algorithms
类目: Machine Learning (cs.LG)
*备注: 10 pages (main body), 26 pages (total), 13 figures, 3 tables, submitted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Iterative algorithms solve problems by taking steps until a solution is reached. Models in the form of Deep Thinking (DT) networks have been demonstrated to learn iterative algorithms in a way that can scale to different sized problems at inference time using recurrent computation and convolutions. However, they are often unstable during training, and have no guarantees of convergence/termination at the solution. This paper addresses the problem of instability by analyzing the growth in intermediate representations, allowing us to build models (referred to as Deep Thinking with Lipschitz Constraints (DT-L)) with many fewer parameters and providing more reliable solutions. Additionally our DT-L formulation provides guarantees of convergence of the learned iterative procedure to a unique solution at inference time. We demonstrate DT-L is capable of robustly learning algorithms which extrapolate to harder problems than in the training set. We benchmark on the traveling salesperson problem to evaluate the capabilities of the modified system in an NP-hard problem where DT fails to learn.

[LG-74] Learning Lipschitz Operators with respect to Gaussian Measures with Near-Optimal Sample Complexity

链接: https://arxiv.org/abs/2410.23440
作者: Ben Adcock,Michael Griebel,Gregor Maier
关键词-EN: infinite-dimensional function spaces, gained increasing research, increasing research attention, Lipschitz operators, machine learning
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 56 pages

点击查看摘要

Abstract:Operator learning, the approximation of mappings between infinite-dimensional function spaces using ideas from machine learning, has gained increasing research attention in recent years. Approximate operators, learned from data, hold promise to serve as efficient surrogate models for problems in computational science and engineering, complementing traditional numerical methods. However, despite their empirical success, our understanding of the underpinning mathematical theory is in large part still incomplete. In this paper, we study the approximation of Lipschitz operators in expectation with respect to Gaussian measures. We prove higher Gaussian Sobolev regularity of Lipschitz operators and establish lower and upper bounds on the Hermite polynomial approximation error. We further consider the reconstruction of Lipschitz operators from m arbitrary (adaptive) linear samples. A key finding is the tight characterization of the smallest achievable error for all possible (adaptive) sampling and reconstruction maps in terms of m . It is shown that Hermite polynomial approximation is an optimal recovery strategy, but we have the following curse of sample complexity: No method to approximate Lipschitz operators based on finitely many samples can achieve algebraic convergence rates in m . On the positive side, we prove that a sufficiently fast spectral decay of the covariance operator of the Gaussian measure guarantees convergence rates which are arbitrarily close to any algebraic rate in the large data limit m \to \infty . Finally, we focus on the recovery of Lipschitz operators from finitely many point samples. We consider Christoffel sampling and weighted least-squares approximation, and present an algorithm which provably achieves near-optimal sample complexity.

[LG-75] Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation

链接: https://arxiv.org/abs/2410.23434
作者: Stefan Stojanovic,Yassir Jedra,Alexandre Proutiere
关键词-EN: controlled dynamical systems, low-rank latent structure, Low-Rank Policy Iteration, latent structure, controlled dynamical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of learning an \varepsilon -optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an \varepsilon -optimal policy using \widetildeO(S+A\over \mathrmpoly(1-\gamma)\varepsilon^2) samples where S (resp. A ) denotes the number of states (resp. actions) and \gamma the discount factor. Our algorithm achieves this order-optimal (in S , A and \varepsilon ) sample complexity under milder conditions than those assumed in previously proposed approaches.

[LG-76] Communication-Efficient Federated Learning over Wireless Channels via Gradient Sketching

链接: https://arxiv.org/abs/2410.23424
作者: Vineet Sunil Gattani,Junshan Zhang,Gautam Dasarathy
关键词-EN: Large-scale federated learning, crucial learning paradigm, wireless multiple access, Large-scale federated, multiple access channels
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale federated learning (FL) over wireless multiple access channels (MACs) has emerged as a crucial learning paradigm with a wide range of applications. However, its widespread adoption is hindered by several major challenges, including limited bandwidth shared by many edge devices, noisy and erroneous wireless communications, and heterogeneous datasets with different distributions across edge devices. To overcome these fundamental challenges, we propose Federated Proximal Sketching (FPS), tailored towards band-limited wireless channels and handling data heterogeneity across edge devices. FPS uses a count sketch data structure to address the bandwidth bottleneck and enable efficient compression while maintaining accurate estimation of significant coordinates. Additionally, we modify the loss function in FPS such that it is equipped to deal with varying degrees of data heterogeneity. We establish the convergence of the FPS algorithm under mild technical conditions and characterize how the bias induced due to factors like data heterogeneity and noisy wireless channels play a role in the overall result. We complement the proposed theoretical framework with numerical experiments that demonstrate the stability, accuracy, and efficiency of FPS in comparison to state-of-the-art methods on both synthetic and real-world datasets. Overall, our results show that FPS is a promising solution to tackling the above challenges of FL over wireless MACs.

[LG-77] Dynamic Information Sub-Selection for Decision Support

链接: https://arxiv.org/abs/2410.23423
作者: Hung-Tien Huang,Maxwell Lennon,Shreyas Bhat Brahmavar,Sean Sylvia,Junier B. Oliva
关键词-EN: Dynamic Information Sub-Selection, introduce Dynamic Information, introduce Dynamic, Dynamic Information, per-instance basis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Dynamic Information Sub-Selection (DISS), a novel framework of AI assistance designed to enhance the performance of black-box decision-makers by tailoring their information processing on a per-instance basis. Blackbox decision-makers (e.g., humans or real-time systems) often face challenges in processing all possible information at hand (e.g., due to cognitive biases or resource constraints), which can degrade decision efficacy. DISS addresses these challenges through policies that dynamically select the most effective features and options to forward to the black-box decision-maker for prediction. We develop a scalable frequentist data acquisition strategy and a decision-maker mimicking technique for enhanced budget efficiency. We explore several impactful applications of DISS, including biased decision-maker support, expert assignment optimization, large language model decision support, and interpretability. Empirical validation of our proposed DISS methodology shows superior performance to state-of-the-art methods across various applications.

[LG-78] Stepping Out of the Shadows: Reinforcement Learning in Shadow Mode

链接: https://arxiv.org/abs/2410.23419
作者: Philipp Gassert,Matthias Althoff
关键词-EN: Reinforcement learning, reinforcement learning agent, process automation, physical components, simulation models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is not yet competitive for many cyber-physical systems, such as robotics, process automation, and power systems, as training on a system with physical components cannot be accelerated, and simulation models do not exist or suffer from a large simulation-to-reality gap. During the long training time, expensive equipment cannot be used and might even be damaged due to inappropriate actions of the reinforcement learning agent. Our novel approach addresses exactly this problem: We train the reinforcement agent in a so-called shadow mode with the assistance of an existing conventional controller, which does not have to be trained and instantaneously performs reasonably well. In shadow mode, the agent relies on the controller to provide action samples and guidance towards favourable states to learn the task, while simultaneously estimating for which states the learned agent will receive a higher reward than the conventional controller. The RL agent will then control the system for these states and all other regions remain under the control of the existing controller. Over time, the RL agent will take over for an increasing amount of states, while leaving control to the baseline, where it cannot surpass its performance. Thus, we keep regret during training low and improve the performance compared to only using conventional controllers or reinforcement learning. We present and evaluate two mechanisms for deciding whether to use the RL agent or the conventional controller. The usefulness of our approach is demonstrated for a reach-avoid task, for which we are able to effectively train an agent, where standard approaches fail.

[LG-79] On the Optimality of Dilated Entropy and Lower Bounds for Online Learning in Extensive-Form Games

链接: https://arxiv.org/abs/2410.23398
作者: Zhiyuan Fan,Christian Kroer,Gabriele Farina
关键词-EN: First-order methods, large extensive-form games, computation in large, distance-generating function, optimal distance-generating function
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:First-order methods (FOMs) are arguably the most scalable algorithms for equilibrium computation in large extensive-form games. To operationalize these methods, a distance-generating function, acting as a regularizer for the strategy space, must be chosen. The ratio between the strong convexity modulus and the diameter of the regularizer is a key parameter in the analysis of FOMs. A natural question is then: what is the optimal distance-generating function for extensive-form decision spaces? In this paper, we make a number of contributions, ultimately establishing that the weight-one dilated entropy (DilEnt) distance-generating function is optimal up to logarithmic factors. The DilEnt regularizer is notable due to its iterate-equivalence with Kernelized OMWU (KOMWU) – the algorithm with state-of-the-art dependence on the game tree size in extensive-form games – when used in conjunction with the online mirror descent (OMD) algorithm. However, the standard analysis for OMD is unable to establish such a result; the only current analysis is by appealing to the iterate equivalence to KOMWU. We close this gap by introducing a pair of primal-dual treeplex norms, which we contend form the natural analytic viewpoint for studying the strong convexity of DilEnt. Using these norm pairs, we recover the diameter-to-strong-convexity ratio that predicts the same performance as KOMWU. Along with a new regret lower bound for online learning in sequence-form strategy spaces, we show that this ratio is nearly optimal. Finally, we showcase our analytic techniques by refining the analysis of Clairvoyant OMD when paired with DilEnt, establishing an \mathcalO(n \log |\mathcalV| \log T/T) approximation rate to coarse correlated equilibrium in n -player games, where |\mathcalV| is the number of reduced normal-form strategies of the players, establishing the new state of the art.

[LG-80] Understanding Representation of Deep Equilibrium Models from Neural Collapse Perspective

链接: https://arxiv.org/abs/2410.23391
作者: Haixiang sun,Ye Shi
关键词-EN: Deep Equilibrium Model, Deep Equilibrium, Equilibrium Model, competitive performance compared, explicit neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Equilibrium Model (DEQ), which serves as a typical implicit neural network, emphasizes their memory efficiency and competitive performance compared to explicit neural networks. However, there has been relatively limited theoretical analysis on the representation of DEQ. In this paper, we utilize the Neural Collapse ( \mathcalNC ) as a tool to systematically analyze the representation of DEQ under both balanced and imbalanced conditions. \mathcalNC is an interesting phenomenon in the neural network training process that characterizes the geometry of class features and classifier weights. While extensively studied in traditional explicit neural networks, the \mathcalNC phenomenon has not received substantial attention in the context of implicit neural networks. We theoretically show that \mathcalNC exists in DEQ under balanced conditions. Moreover, in imbalanced settings, despite the presence of minority collapse, DEQ demonstrated advantages over explicit neural networks. These advantages include the convergence of extracted features to the vertices of a simplex equiangular tight frame and self-duality properties under mild conditions, highlighting DEQ’s superiority in handling imbalanced datasets. Finally, we validate our theoretical analyses through experiments in both balanced and imbalanced scenarios.

[LG-81] Ensemble learning of the atrial fiber orientation with physics-informed neural networks

链接: https://arxiv.org/abs/2410.23388
作者: Efraín Magaña,Simone Pezzuto,Francisco Sahli Costabal
关键词-EN: cardiac fiber structure, key determinant, cardiac function, fiber orientation, neural networks
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
*备注:

点击查看摘要

Abstract:The anisotropic structure of the myocardium is a key determinant of the cardiac function. To date, there is no imaging modality to assess in-vivo the cardiac fiber structure. We recently proposed Fibernet, a method for the automatic identification of the anisotropic conduction – and thus fibers – in the atria from local electrical recordings. Fibernet uses cardiac activation as recorded during electroanatomical mappings to infer local conduction properties using physics-informed neural networks. In this work, we extend Fibernet to cope with the uncertainty in the estimated fiber field. Specifically, we use an ensemble of neural networks to produce multiple samples, all fitting the observed data, and compute posterior statistics. We also introduce a methodology to select the best fiber orientation members and define the input of the neural networks directly on the atrial surface. With these improvements, we outperform the previous methodology in terms of fiber orientation error in 8 different atrial anatomies. Currently, our approach can estimate the fiber orientation and conduction velocities in under 7 minutes with quantified uncertainty, which opens the door to its application in clinical practice. We hope the proposed methodology will enable further personalization of cardiac digital twins for precision medicine.

[LG-82] Random Heterogeneous Neurochaos Learning Architecture for Data Classification

链接: https://arxiv.org/abs/2410.23351
作者: Remya Ajai A S,Nithin Nagaraj
关键词-EN: Artificial Neural Networks, Deep Neural Networks, Neural Networks, Artificial Neural, existing Neural Networks
类目: Machine Learning (cs.LG)
*备注: 34 pages, 8 figures, 42 Tables

点击查看摘要

Abstract:Inspired by the human brain’s structure and function, Artificial Neural Networks (ANN) were developed for data classification. However, existing Neural Networks, including Deep Neural Networks, do not mimic the brain’s rich structure. They lack key features such as randomness and neuron heterogeneity, which are inherently chaotic in their firing behavior. Neurochaos Learning (NL), a chaos-based neural network, recently employed one-dimensional chaotic maps like Generalized Lüroth Series (GLS) and Logistic map as neurons. For the first time, we propose a random heterogeneous extension of NL, where various chaotic neurons are randomly placed in the input layer, mimicking the randomness and heterogeneous nature of human brain networks. We evaluated the performance of the newly proposed Random Heterogeneous Neurochaos Learning (RHNL) architectures combined with traditional Machine Learning (ML) methods. On public datasets, RHNL outperformed both homogeneous NL and fixed heterogeneous NL architectures in nearly all classification tasks. RHNL achieved high F1 scores on the Wine dataset (1.0), Bank Note Authentication dataset (0.99), Breast Cancer Wisconsin dataset (0.99), and Free Spoken Digit Dataset (FSDD) (0.98). These RHNL results are among the best in the literature for these datasets. We investigated RHNL performance on image datasets, where it outperformed stand-alone ML classifiers. In low training sample regimes, RHNL was the best among stand-alone ML. Our architecture bridges the gap between existing ANN architectures and the human brain’s chaotic, random, and heterogeneous properties. We foresee the development of several novel learning algorithms centered around Random Heterogeneous Neurochaos Learning in the coming days.

[LG-83] VECTOR: Velocity-Enhanced GRU Neural Network for Real-Time 3D UAV Trajectory Prediction

链接: https://arxiv.org/abs/2410.23305
作者: Omer Nacar,Mohamed Abdelkader,Lahouari Ghouti,Kahled Gabr,Abdulrahman S. Al-Batati,Anis Koubaa
关键词-EN: surveillance and defense, Gated Recurrent Units, paper tackles, critical for applications, aerial surveillance
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper tackles the challenge of real-time 3D trajectory prediction for UAVs, which is critical for applications such as aerial surveillance and defense. Existing prediction models that rely primarily on position data struggle with accuracy, especially when UAV movements fall outside the position domain used in training. Our research identifies a gap in utilizing velocity estimates, first-order dynamics, to better capture the dynamics and enhance prediction accuracy and generalizability in any position domain. To bridge this gap, we propose a new trajectory prediction method using Gated Recurrent Units (GRUs) within sequence-based neural networks. Unlike traditional methods that rely on RNNs or transformers, this approach forecasts future velocities and positions based on historical velocity data instead of positions. This is designed to enhance prediction accuracy and scalability, overcoming challenges faced by conventional models in handling complex UAV dynamics. The methodology employs both synthetic and real-world 3D UAV trajectory data, capturing a wide range of flight patterns, speeds, and agility. Synthetic data is generated using the Gazebo simulator and PX4 Autopilot, while real-world data comes from the UZH-FPV and Mid-Air drone racing datasets. The GRU-based models significantly outperform state-of-the-art RNN approaches, with a mean square error (MSE) as low as 2 x 10^-8. Overall, our findings confirm the effectiveness of incorporating velocity data in improving the accuracy of UAV trajectory predictions across both synthetic and real-world scenarios, in and out of position data distributions. Finally, we open-source our 5000 trajectories dataset and a ROS 2 package to facilitate the integration with existing ROS-based UAV systems.

[LG-84] Understanding and Scaling Collaborative Filtering Optimization from the Perspective of Matrix Rank

链接: https://arxiv.org/abs/2410.23300
作者: Donald Loveland,Xinyi Wu,Tong Zhao,Danai Koutra,Neil Shah,Mingxuan Ju
关键词-EN: Collaborative Filtering, capture user preferences, effectively capture user, sparse ID-embedding tables, dominate real-world recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Collaborative Filtering (CF) methods dominate real-world recommender systems given their ability to learn high-quality, sparse ID-embedding tables that effectively capture user preferences. These tables scale linearly with the number of users and items, and are trained to ensure high similarity between embeddings of interacted user-item pairs, while maintaining low similarity for non-interacted pairs. Despite their high performance, encouraging dispersion for non-interacted pairs necessitates expensive regularization (e.g., negative sampling), hurting runtime and scalability. Existing research tends to address these challenges by simplifying the learning process, either by reducing model complexity or sampling data, trading performance for runtime. In this work, we move beyond model-level modifications and study the properties of the embedding tables under different learning strategies. Through theoretical analysis, we find that the singular values of the embedding tables are intrinsically linked to different CF loss functions. These findings are empirically validated on real-world datasets, demonstrating the practical benefits of higher stable rank, a continuous version of matrix rank which encodes the distribution of singular values. Based on these insights, we propose an efficient warm-start strategy that regularizes the stable rank of the user and item embeddings. We show that stable rank regularization during early training phases can promote higher-quality embeddings, resulting in training speed improvements of up to 66%. Additionally, stable rank regularization can act as a proxy for negative sampling, allowing for performance gains of up to 21% over loss functions with small negative sampling ratios. Overall, our analysis unifies current CF methods under a new perspective, their optimization of stable rank, motivating a flexible regularization method.

[LG-85] rajectory Prediction for Autonomous Driving using Agent -Interaction Graph Embedding ITSC2024

链接: https://arxiv.org/abs/2410.23298
作者: Jilan Samiuddin,Benoit Boulet,Di Wu
关键词-EN: Trajectory prediction module, autonomous driving system, autonomous agent car, Trajectory prediction, driving system
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: This article has been presented in the 27th IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2024), Edmonton, Alberta, Canada on 26th September, 2024. Number of pages: 7, Number of figures: 8

点击查看摘要

Abstract:Trajectory prediction module in an autonomous driving system is crucial for the decision-making and safety of the autonomous agent car and its surroundings. This work presents a novel scheme called AiGem (Agent-Interaction Graph Embedding) to predict traffic vehicle trajectories around the autonomous car. AiGem tackles this problem in four steps. First, AiGem formulates the historical traffic interaction with the autonomous agent as a graph in two steps: (1) at each time step of the history frames, agent-interactions are captured using spatial edges between the agents (nodes of the graph), and then, (2) connects the spatial graphs in chronological order using temporal edges. Then, AiGem applies a depthwise graph encoder network on the spatial-temporal graph to generate graph embedding, i.e., embedding of all the nodes in the graph. Next, a sequential Gated Recurrent Unit decoder network uses the embedding of the current timestamp to get the decoded states. Finally, an output network comprising a Multilayer Perceptron is used to predict the trajectories utilizing the decoded states as its inputs. Results show that AiGem outperforms the state-of-the-art deep learning algorithms for longer prediction horizons.

[LG-86] Conformal prediction of circular data

链接: https://arxiv.org/abs/2410.24145
作者: Paulo C. Marques F.,Rinaldo Artes,Helton Graziadei
关键词-EN: finite-sample coverage guarantees, suitable conformity score, Split conformal prediction, conformal prediction techniques, conformity score
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 7 pages; 4 figures

点击查看摘要

Abstract:Split conformal prediction techniques are applied to regression problems with circular responses by introducing a suitable conformity score, leading to prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under exchangeable data. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear response regression model into one suitable for circular responses. When random forests serve as basis models in this projection procedure, we harness the out-of-bag dynamics to eliminate the necessity for a separate calibration sample in the construction of prediction sets. For synthetic and real datasets the resulting projected random forests model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, when compared to the split conformal prediction sets generated by two existing alternative models.

[LG-87] Demystifying Linear MDPs and Novel Dynamics Aggregation Framework

链接: https://arxiv.org/abs/2410.24089
作者: Joongkyu Lee,Min-hwan Oh
关键词-EN: directly reachable states, represent transition probabilities, aptly represent transition, maximum size, size of directly
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we prove that, in linear MDPs, the feature dimension d is lower bounded by S/U in order to aptly represent transition probabilities, where S is the size of the state space and U is the maximum size of directly reachable states. Hence, d can still scale with S depending on the direct reachability of the environment. To address this limitation of linear MDPs, we propose a novel structural aggregation framework based on dynamics, named as the “dynamics aggregation”. For this newly proposed framework, we design a provably efficient hierarchical reinforcement learning algorithm in linear function approximation that leverages aggregated sub-structures. Our proposed algorithm exhibits statistical efficiency, achieving a regret of \tildeO ( d_\psi^3/2 H^3/2\sqrt N T ) , where d_\psi represents the feature dimension of aggregated subMDPs and N signifies the number of aggregated subMDPs. We establish that the condition d_\psi^3 N \ll d^3 is readily met in most real-world environments with hierarchical structures, enabling a substantial improvement in the regret bound compared to LSVI-UCB, which enjoys a regret of \tildeO (d^3/2 H^3/2 \sqrt T) . To the best of our knowledge, this work presents the first HRL algorithm with linear function approximation that offers provable guarantees.

[LG-88] Natural gradient and parameter estimation for quantum Boltzmann machines

链接: https://arxiv.org/abs/2410.24058
作者: Dhrumil Patel,Mark M. Wilde
关键词-EN: quantum Boltzmann machine, parameterized thermal states, Thermal states play, Thermal states, Boltzmann machine learning
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 23 pages, 4 figures

点击查看摘要

Abstract:Thermal states play a fundamental role in various areas of physics, and they are becoming increasingly important in quantum information science, with applications related to semi-definite programming, quantum Boltzmann machine learning, Hamiltonian learning, and the related task of estimating the parameters of a Hamiltonian. Here we establish formulas underlying the basic geometry of parameterized thermal states, and we delineate quantum algorithms for estimating the values of these formulas. More specifically, we prove formulas for the Fisher–Bures and Kubo–Mori information matrices of parameterized thermal states, and our quantum algorithms for estimating their matrix elements involve a combination of classical sampling, Hamiltonian simulation, and the Hadamard test. These results have applications in developing a natural gradient descent algorithm for quantum Boltzmann machine learning, which takes into account the geometry of thermal states, and in establishing fundamental limitations on the ability to estimate the parameters of a Hamiltonian, when given access to thermal-state samples. For the latter task, and for the special case of estimating a single parameter, we sketch an algorithm that realizes a measurement that is asymptotically optimal for the estimation task. We finally stress that the natural gradient descent algorithm developed here can be used for any machine learning problem that employs the quantum Boltzmann machine ansatz.

[LG-89] EigenVI: score-based variational inference with orthogonal function expansions NEURIPS

链接: https://arxiv.org/abs/2410.24054
作者: Diana Cai,Chirag Modi,Charles C. Margossian,Robert M. Gower,David M. Blei,Lawrence K. Saul
关键词-EN: black-box variational inference, eigenvalue-based approach, approach for black-box, variational, EigenVI
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 25 pages, 9 figures. Advances in Neural Information Processing Systems (NeurIPS), 2024

点击查看摘要

Abstract:We develop EigenVI, an eigenvalue-based approach for black-box variational inference (BBVI). EigenVI constructs its variational approximations from orthogonal function expansions. For distributions over \mathbbR^D , the lowest order term in these expansions provides a Gaussian variational approximation, while higher-order terms provide a systematic way to model non-Gaussianity. These approximations are flexible enough to model complex distributions (multimodal, asymmetric), but they are simple enough that one can calculate their low-order moments and draw samples from them. EigenVI can also model other types of random variables (e.g., nonnegative, bounded) by constructing variational approximations from different families of orthogonal functions. Within these families, EigenVI computes the variational approximation that best matches the score function of the target distribution by minimizing a stochastic estimate of the Fisher divergence. Notably, this optimization reduces to solving a minimum eigenvalue problem, so that EigenVI effectively sidesteps the iterative gradient-based optimizations that are required for many other BBVI algorithms. (Gradient-based methods can be sensitive to learning rates, termination criteria, and other tunable hyperparameters.) We use EigenVI to approximate a variety of target distributions, including a benchmark suite of Bayesian models from posteriordb. On these distributions, we find that EigenVI is more accurate than existing methods for Gaussian BBVI.

[LG-90] Attention is All You Need to Optimize Wind Farm Operations and Maintenance

链接: https://arxiv.org/abs/2410.24052
作者: Iman Kazemian,Murat Yildirim,Paritosh Ramanan
关键词-EN: Operations and maintenance, wind energy systems, reliability and profitability, energy systems, reaching implications
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Operations and maintenance (OM) is a fundamental problem in wind energy systems with far reaching implications for reliability and profitability. Optimizing OM is a multi-faceted decision optimization problem that requires a careful balancing act across turbine level failure risks, operational revenues, and maintenance crew logistics. The resulting OM problems are typically solved using large-scale mixed integer programming (MIP) models, which yield computationally challenging problems that require either long-solution times, or heuristics to reach a solution. To address this problem, we introduce a novel decision-making framework for wind farm OM that builds on a multi-head attention (MHA) models, an emerging artificial intelligence methods that are specifically designed to learn in rich and complex problem settings. The development of proposed MHA framework incorporates a number of modeling innovations that allows explicit embedding of MIP models within an MHA structure. The proposed MHA model (i) significantly reduces the solution time from hours to seconds, (ii) guarantees feasibility of the proposed solutions considering complex constraints that are omnipresent in wind farm OM, (iii) results in significant solution quality compared to the conventional MIP formulations, and (iv) exhibits significant transfer learning capability across different problem settings.

[LG-91] SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation

链接: https://arxiv.org/abs/2410.24022
作者: Liang He,Peiran Jin,Yaosen Min,Shufang Xie,Lijun Wu,Tao Qin,Xiaozhuan Liang,Kaiyuan Gao,Yuliang Jiang,Tie-Yan Liu
关键词-EN: perform functions intricately, functions intricately linked, essential to biological, biological systems, perform functions
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Proteins, essential to biological systems, perform functions intricately linked to their three-dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre-training on vast unlabeled datasets, they often struggle to capture critical co-evolutionary information, which evolutionary-based methods excel at. In this study, we introduce a novel pre-training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model’s effectiveness in integrating co-evolutionary information, marking a significant step forward in protein sequence-based modeling.

[LG-92] Interactive proofs for verifying (quantum) learning and testing

链接: https://arxiv.org/abs/2410.23969
作者: Matthias C. Caro,Jens Eisert,Marcel Hinsche,Marios Ioannou,Alexander Nietner,Ryan Sweke
关键词-EN: weak data access, data access, weak data, efficiency and feasibility, testing
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 12 + 33 + 13 pages; 1 table; 2 figures

点击查看摘要

Abstract:We consider the problem of testing and learning from data in the presence of resource constraints, such as limited memory or weak data access, which place limitations on the efficiency and feasibility of testing or learning. In particular, we ask the following question: Could a resource-constrained learner/tester use interaction with a resource-unconstrained but untrusted party to solve a learning or testing problem more efficiently than they could without such an interaction? In this work, we answer this question both abstractly and for concrete problems, in two complementary ways: For a wide variety of scenarios, we prove that a resource-constrained learner cannot gain any advantage through classical interaction with an untrusted prover. As a special case, we show that for the vast majority of testing and learning problems in which quantum memory is a meaningful resource, a memory-constrained quantum algorithm cannot overcome its limitations via classical communication with a memory-unconstrained quantum prover. In contrast, when quantum communication is allowed, we construct a variety of interactive proof protocols, for specific learning and testing problems, which allow memory-constrained quantum verifiers to gain significant advantages through delegation to untrusted provers. These results highlight both the limitations and potential of delegating learning and testing problems to resource-rich but untrusted third parties.

[LG-93] An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

链接: https://arxiv.org/abs/2410.23955
作者: Theo Clark,Benedetta Cevoli,Eloy de Jong,Timofey Abramski,Jamie Dougherty
关键词-EN: recent advancements concentrating, multiple timescales, Self-supervised learning, recent advancements, advancements concentrating
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) models have become crucial in speech processing, with recent advancements concentrating on developing architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the precise reasons for these enhancements have poor empirical support. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and motivate the importance of disentangling computational efficiency from learning better representations.

[LG-94] Learning quantum states prepared by shallow circuits in polynomial time

链接: https://arxiv.org/abs/2410.23618
作者: Zeph Landau,Yunchao Liu
关键词-EN: polynomial time algorithm, unknown constant depth, constant depth quantum, constant depth, constant depth circuit
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:We give a polynomial time algorithm that, given copies of an unknown quantum state \vert\psi\rangle=U\vert 0^n\rangle that is prepared by an unknown constant depth circuit U on a finite-dimensional lattice, learns a constant depth quantum circuit that prepares \vert\psi\rangle . The algorithm extends to the case when the depth of U is \mathrmpolylog(n) , with a quasi-polynomial run-time. The key new idea is a simple and general procedure that efficiently reconstructs the global state \vert\psi\rangle from its local reduced density matrices. As an application, we give an efficient algorithm to test whether an unknown quantum state on a lattice has low or high quantum circuit complexity.

[LG-95] Global Convergence in Training Large-Scale Transformers NEURIPS2024

链接: https://arxiv.org/abs/2410.23610
作者: Cheng Gao,Yuan Cao,Zihao Li,Yihan He,Mengdi Wang,Han Liu,Jason Matthew Klusowski,Jianqing Fan
关键词-EN: large-scale model settings, gradient flow, widespread success, optimization guarantees, Wasserstein gradient flow
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: to be published in 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well-understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. Our analysis is based on a series of novel mean-field techniques that adapt to Transformers. Compared with existing tools for deep networks (Lu et al., 2020) that demand homogeneity and global Lipschitz smoothness, we utilize a refined analysis assuming only \textitpartial homogeneity and \textitlocal Lipschitz smoothness . These new techniques may be of independent interest.

[LG-96] Linearized Wasserstein Barycenters: Synthesis Analysis Representational Capacity and Applications

链接: https://arxiv.org/abs/2410.23602
作者: Matthew Werenski,Brendan Mallery,Shuchin Aeron,James M. Murphy
关键词-EN: linear optimal transport, linear barycentric coding, barycentric coding model, probability measures, linear barycentric
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 37 pages, 6 figures

点击查看摘要

Abstract:We propose the \textitlinear barycentric coding model (LBCM) that utilizes the linear optimal transport (LOT) metric for analysis and synthesis of probability measures. We provide a closed-form solution to the variational problem characterizing the probability measures in the LBCM and establish equivalence of the LBCM to the set of Wasserstein-2 barycenters in the special case of compatible measures. Computational methods for synthesizing and analyzing measures in the LBCM are developed with finite sample guarantees. One of our main theoretical contributions is to identify an LBCM, expressed in terms of a simple family, which is sufficient to express all probability measures on the interval [0,1] . We show that a natural analogous construction of an LBCM in \mathbbR^2 fails, and we leave it as an open problem to identify the proper extension in more than one dimension. We conclude by demonstrating the utility of LBCM for covariance estimation and data imputation.

[LG-97] Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis NEURIPS2024

链接: https://arxiv.org/abs/2410.23595
作者: Jiayu Su,David A. Knowles,Raul Rabadan
关键词-EN: models relies heavily, effectively representing high-dimensional, machine learning models, learning models relies, success of machine
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 10 pages and 6 figures in the main text; To be published in the Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:The success of machine learning models relies heavily on effectively representing high-dimensional data. However, ensuring data representations capture human-understandable concepts remains difficult, often requiring the incorporation of prior knowledge and decomposition of data into multiple subspaces. Traditional linear methods fall short in modeling more than one space, while more expressive deep learning approaches lack interpretability. Here, we introduce Supervised Independent Subspace Principal Component Analysis ( \textttsisPCA ), a PCA extension designed for multi-subspace learning. Leveraging the Hilbert-Schmidt Independence Criterion (HSIC), \textttsisPCA incorporates supervision and simultaneously ensures subspace disentanglement. We demonstrate \textttsisPCA 's connections with autoencoders and regularized linear regression and showcase its ability to identify and separate hidden data structures through extensive applications, including breast cancer diagnosis from image features, learning aging-associated DNA methylation changes, and single-cell analysis of malaria infection. Our results reveal distinct functional pathways associated with malaria colonization, underscoring the essentiality of explainable representation in high-dimensional data analysis.

[LG-98] Online Convex Optimization with Memory and Limited Predictions

链接: https://arxiv.org/abs/2410.23574
作者: Lintao Ye,Zhengmiao Wang,Zhi-Wei Liu,Ming Chi,Xiaoling Wang,Housheng Su
关键词-EN: convex optimization, online convex optimization, future time steps, decision maker, convex
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 28 pages, 2 figures

点击查看摘要

Abstract:We study the problem of online convex optimization with memory and predictions over a horizon T . At each time step, a decision maker is given some limited predictions of the cost functions from a finite window of future time steps, i.e., values of the cost function at certain decision points in the future. The decision maker then chooses an action and incurs a cost given by a convex function that depends on the actions chosen in the past. We propose an algorithm to solve this problem and show that the dynamic regret of the algorithm decays exponentially with the prediction window length. Our algorithm contains two general subroutines that work for wider classes of problems. The first subroutine can solve general online convex optimization with memory and bandit feedback with \sqrtT -dynamic regret with respect to T . The second subroutine is a zeroth-order method that can be used to solve general convex optimization problems with a linear convergence rate that matches the best achievable rate of first-order methods for convex optimization. The key to our algorithm design and analysis is the use of truncated Gaussian smoothing when querying the decision points for obtaining the predictions. We complement our theoretical results using numerical experiments.

[LG-99] Assessing Concordance between RNA-Seq and NanoString Technologies in Ebola-Infected Nonhuman Primates Using Machine Learning

链接: https://arxiv.org/abs/2410.23433
作者: Mostafa Rezapour,Aarthi Narayanan,Wyatt H. Mowery,Metin Nafi Gurcan
关键词-EN: RNA sequencing, concordance between RNA, non-human primates, study evaluates, evaluates the concordance
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study evaluates the concordance between RNA sequencing (RNA-Seq) and NanoString technologies for gene expression analysis in non-human primates (NHPs) infected with Ebola virus (EBOV). We performed a detailed comparison of both platforms, demonstrating a strong correlation between them, with Spearman coefficients for 56 out of 62 samples ranging from 0.78 to 0.88, with a mean of 0.83 and a median of 0.85. Bland-Altman analysis further confirmed high consistency, with most measurements falling within 95% confidence limits. A machine learning approach, using the Supervised Magnitude-Altitude Scoring (SMAS) method trained on NanoString data, identified OAS1 as a key marker for distinguishing RT-qPCR positive from negative samples. Remarkably, when applied to RNA-Seq data, OAS1 also achieved 100% accuracy in differentiating infected from uninfected samples using logistic regression, demonstrating its robustness across platforms. Further differential expression analysis identified 12 common genes including ISG15, OAS1, IFI44, IFI27, IFIT2, IFIT3, IFI44L, MX1, MX2, OAS2, RSAD2, and OASL which demonstrated the highest levels of statistical significance and biological relevance across both platforms. Gene Ontology (GO) analysis confirmed that these genes are directly involved in key immune and viral infection pathways, reinforcing their importance in EBOV infection. In addition, RNA-Seq uniquely identified genes such as CASP5, USP18, and DDX60, which play key roles in immune regulation and antiviral defense. This finding highlights the broader detection capabilities of RNA-Seq and underscores the complementary strengths of both platforms in providing a comprehensive and accurate assessment of gene expression changes during Ebola virus infection.

[LG-100] ghtening convex relaxations of trained neural networks: a unified approach for convex and S-shaped activations

链接: https://arxiv.org/abs/2410.23362
作者: Pablo Carrasco,Gonzalo Muñoz
关键词-EN: trained neural networks, created significant obstacles, trained neural, neural networks, non-convex nature
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The non-convex nature of trained neural networks has created significant obstacles in their incorporation into optimization models. Considering the wide array of applications that this embedding has, the optimization and deep learning communities have dedicated significant efforts to the convexification of trained neural networks. Many approaches to date have considered obtaining convex relaxations for each non-linear activation in isolation, which poses limitations in the tightness of the relaxations. Anderson et al. (2020) strengthened these relaxations and provided a framework to obtain the convex hull of the graph of a piecewise linear convex activation composed with an affine function; this effectively convexifies activations such as the ReLU together with the affine transformation that precedes it. In this article, we contribute to this line of work by developing a recursive formula that yields a tight convexification for the composition of an activation with an affine function for a wide scope of activation functions, namely, convex or ``S-shaped". Our approach can be used to efficiently compute separating hyperplanes or determine that none exists in various settings, including non-polyhedral cases. We provide computational experiments to test the empirical benefits of these convex approximations.

[LG-101] MassSpecGym: A benchmark for the discovery and identification of molecules

链接: https://arxiv.org/abs/2410.23326
作者: Roman Bushuiev,Anton Bushuiev,Niek F. de Jonge,Adamo Young,Fleming Kretschmer,Raman Samusevich,Janne Heirman,Fei Wang,Luke Zhang,Kai Dührkop,Marcus Ludwig,Nils A. Haupt,Apurva Kalia,Corinna Brungs,Robin Schmid,Russell Greiner,Bo Wang,David S. Wishart,Li-Ping Liu,Juho Rousu,Wout Bittremieux,Hannes Rost,Tytus D. Mak,Soha Hassoun,Florian Huber,Justin J.J. van der Hooft,Michael A. Stravs,Sebastian Böcker,Josef Sivic,Tomáš Pluskal
关键词-EN: biological and environmental, environmental samples, samples is crucial, crucial for advancing, advancing biomedical
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym – the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: \textitde novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at \urlthis https URL.

[LG-102] Beyond Current Boundaries: Integrating Deep Learning and AlphaFold for Enhanced Protein Structure Prediction from Low-Resolution Cryo-EM Maps

链接: https://arxiv.org/abs/2410.23321
作者: Xin (Chloe)Ma,Dong Si
关键词-EN: Constructing atomic models, Constructing atomic, cryo-electron microscopy, crucial yet intricate, intricate task
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constructing atomic models from cryo-electron microscopy (cryo-EM) maps is a crucial yet intricate task in structural biology. While advancements in deep learning, such as convolutional neural networks (CNNs) and graph neural networks (GNNs), have spurred the development of sophisticated map-to-model tools like DeepTracer and ModelAngelo, their efficacy notably diminishes with low-resolution maps beyond 4 Å. To address this shortfall, our research introduces DeepTracer-LowResEnhance, an innovative framework that synergizes a deep learning-enhanced map refinement technique with the power of AlphaFold. This methodology is designed to markedly improve the construction of models from low-resolution cryo-EM maps. DeepTracer-LowResEnhance was rigorously tested on a set of 37 protein cryo-EM maps, with resolutions ranging between 2.5 to 8.4 Å, including 22 maps with resolutions lower than 4 Å. The outcomes were compelling, demonstrating that 95.5% of the low-resolution maps exhibited a significant uptick in the count of total predicted residues. This denotes a pronounced improvement in atomic model building for low-resolution maps. Additionally, a comparative analysis alongside Phenix’s auto-sharpening functionality delineates DeepTracer-LowResEnhance’s superior capability in rendering more detailed and precise atomic models, thereby pushing the boundaries of current computational structural biology methodologies.

[LG-103] Hybrid model of the kernel method for quantum computers

链接: https://arxiv.org/abs/2410.23315
作者: Jhordan Silveira de Borba,Jonas Maziero
关键词-EN: intelligent data processing, data processing methods, revolution in intelligent, intelligent data, data processing
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 14 pages, in Portuguese language, 1 figure

点击查看摘要

Abstract:The field of quantum machine learning is a promising way to lead to a revolution in intelligent data processing methods. In this way, a hybrid learning method based on classic kernel methods is proposed. This proposal also requires the development of a quantum algorithm for the calculation of internal products between vectors of continuous values. In order for this to be possible, it was necessary to make adaptations to the classic kernel method, since it is necessary to consider the limitations imposed by the Hilbert space of the quantum processor. As a test case, we applied this new algorithm to learn to classify whether new points generated randomly, in a finite square located under a plane, were found inside or outside a circle located inside this square. It was found that the algorithm was able to correctly detect new points in 99% of the samples tested, with a small difference due to considering the radius slightly larger than the ideal. However, the kernel method was able to perform classifications correctly, as well as the internal product algorithm successfully performed the internal product calculations using quantum resources. Thus, the present work represents a contribution to the area, proposing a new model of machine learning accessible to both physicists and computer scientists.

[LG-104] Clustering Digital Assets Using Path Signatures: Application to Portfolio Construction

链接: https://arxiv.org/abs/2410.23297
作者: Hugo Inzirillo
关键词-EN: provide good diversification, good diversification properties, properties to investors, provide good, good diversification
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new way of building portfolios of cryptocurrencies that provide good diversification properties to investors. First, we seek to filter these digital assets by creating some clusters based on their path signature. The goal is to identify similar patterns in the behavior of these highly volatile assets. Once such clusters have been built, we propose “optimal” portfolios by comparing the performances of such portfolios to a universe of unfiltered digital assets. Our intuition is that clustering based on path signatures will make it easier to capture the main trends and features of a group of cryptocurrencies, and allow parsimonious portfolios that reduce excessive transaction fees. Empirically, our assumptions seem to be satisfied.

[LG-105] Generalized Distribution Prediction for Asset Returns

链接: https://arxiv.org/abs/2410.23296
作者: Ísak Pétursson,María Óskarsdóttir
关键词-EN: Long Short-Term Memory, Short-Term Memory, method with Long, Long Short-Term, quantile-based method
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel approach for predicting the distribution of asset returns using a quantile-based method with Long Short-Term Memory (LSTM) networks. Our model is designed in two stages: the first focuses on predicting the quantiles of normalized asset returns using asset-specific features, while the second stage incorporates market data to adjust these predictions for broader economic conditions. This results in a generalized model that can be applied across various asset classes, including commodities, cryptocurrencies, as well as synthetic datasets. The predicted quantiles are then converted into full probability distributions through kernel density estimation, allowing for more precise return distribution predictions and inferencing. The LSTM model significantly outperforms a linear quantile regression baseline by 98% and a dense neural network model by over 50%, showcasing its ability to capture complex patterns in financial return distributions across both synthetic and real-world data. By using exclusively asset-class-neutral features, our model achieves robust, generalizable results.

[LG-106] Exploiting Risk-Aversion and Size-dependent fees in FX Trading with Fitted Natural Actor-Critic

链接: https://arxiv.org/abs/2410.23294
作者: Vito Alessandro Monaco,Antonio Riva,Luca Sabbioni,Lorenzo Bisi,Edoardo Vittori,Marco Pinciroli,Michele Trapletti,Marcello Restelli
关键词-EN: recent years, popularity of artificial, artificial intelligence, intelligence has surged, surged due
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:In recent years, the popularity of artificial intelligence has surged due to its widespread application in various fields. The financial sector has harnessed its advantages for multiple purposes, including the development of automated trading systems designed to interact autonomously with markets to pursue different aims. In this work, we focus on the possibility of recognizing and leveraging intraday price patterns in the Foreign Exchange market, known for its extensive liquidity and flexibility. Our approach involves the implementation of a Reinforcement Learning algorithm called Fitted Natural Actor-Critic. This algorithm allows the training of an agent capable of effectively trading by means of continuous actions, which enable the possibility of executing orders with variable trading sizes. This feature is instrumental to realistically model transaction costs, as they typically depend on the order size. Furthermore, it facilitates the integration of risk-averse approaches to induce the agent to adopt more conservative behavior. The proposed approaches have been empirically validated on EUR-USD historical data.

[LG-107] Dark energy reconstruction analysis with artificial neural networks: Application on simulated Supernova Ia data from Rubin Observatory ATC

链接: https://arxiv.org/abs/2402.18124
作者: Ayan Mitra,Isidro Gómez-Vargas,Vasilios Zarikas
关键词-EN: Artificial Neural Network, Neural Network, Artificial Neural, Monte Carlo Dropout, LSST simulated three-year
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 14 Pages, 5 figures; matches the version published in Physics of the Dark Universe

点击查看摘要

Abstract:In this paper, we present an analysis of Supernova Ia (SNIa) distance moduli \mu(z) and dark energy using an Artificial Neural Network (ANN) reconstruction based on LSST simulated three-year SNIa data. The ANNs employed in this study utilize genetic algorithms for hyperparameter tuning and Monte Carlo Dropout for predictions. Our ANN reconstruction architecture is capable of modeling both the distance moduli and their associated statistical errors given redshift values. We compare the performance of the ANN-based reconstruction with two theoretical dark energy models: \Lambda CDM and Chevallier-Linder-Polarski (CPL). Bayesian analysis is conducted for these theoretical models using the LSST simulations and compared with observations from Pantheon and Pantheon+ SNIa real data. We demonstrate that our model-independent ANN reconstruction is consistent with both theoretical models. Performance metrics and statistical tests reveal that the ANN produces distance modulus estimates that align well with the LSST dataset and exhibit only minor discrepancies with \Lambda CDM and CPL.

信息检索

[IR-0] Investigating Bias in Political Search Query Suggestions by Relative Comparison with LLM s

链接: https://arxiv.org/abs/2410.23879
作者: Fabian Haak,Björn Engelmann,Christin Katharina Kreutz,Philipp Schaer
关键词-EN: affect users’ interactions, suggestions affect users’, Search query suggestions, query suggestions affect, query suggestions
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Search query suggestions affect users’ interactions with search engines, which then influences the information they encounter. Thus, bias in search query suggestions can lead to exposure to biased search results and can impact opinion formation. This is especially critical in the political domain. Detecting and quantifying bias in web search engines is difficult due to its topic dependency, complexity, and subjectivity. The lack of context and phrasality of query suggestions emphasizes this problem. In a multi-step approach, we combine the benefits of large language models, pairwise comparison, and Elo-based scoring to identify and quantify bias in English search query suggestions. We apply our approach to the U.S. political news domain and compare bias in Google and Bing.

[IR-1] Leveraging Large Language Models for Medical Information Extraction and Query Generation

链接: https://arxiv.org/abs/2410.23851
作者: Georgios Peikos,Pranav Kasela,Gabriella Pasi
关键词-EN: integrates large language, maintaining information privacy, allowing expert oversight, large language models, paper introduces
类目: Information Retrieval (cs.IR)
*备注: Accepted in WI-IAT '24

点击查看摘要

Abstract:This paper introduces a system that integrates large language models (LLMs) into the clinical trial retrieval process, enhancing the effectiveness of matching patients with eligible trials while maintaining information privacy and allowing expert oversight. We evaluate six LLMs for query generation, focusing on open-source and relatively small models that require minimal computational resources. Our evaluation includes two closed-source and four open-source models, with one specifically trained in the medical field and five general-purpose models. We compare the retrieval effectiveness achieved by LLM-generated queries against those created by medical experts and state-of-the-art methods from the literature. Our findings indicate that the evaluated models reach retrieval effectiveness on par with or greater than expert-created queries. The LLMs consistently outperform standard baselines and other approaches in the literature. The best performing LLMs exhibit fast response times, ranging from 1.7 to 8 seconds, and generate a manageable number of query terms (15-63 on average), making them suitable for practical implementation. Our overall findings suggest that leveraging small, open-source LLMs for clinical trials retrieval can balance performance, computational efficiency, and real-world applicability in medical settings.

[IR-2] Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

链接: https://arxiv.org/abs/2410.23841
作者: Jianqun Zhou,Yuanlei Zheng,Wei Chen,Qianqian Zheng,Zeyuan Shang,Wei Zhang,Rui Meng,Xiaoyu Shen
关键词-EN: complex user interactions, significantly progressed, enabling more complex, detailed prompts, interactions through detailed
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Instruction-following capabilities in large language models (LLMs) have significantly progressed, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances, most of them still relies on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics – Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models’ responsiveness to instructions. Our findings reveal that while reranking models generally surpass retrieval models in instruction following, they still face challenges in handling certain attributes. Moreover, although instruction fine-tuning and increased model size lead to better performance, most models fall short of achieving comprehensive instruction compliance as assessed by our benchmark.

[IR-3] Identify Then Recommend: Towards Unsupervised Group Recommendation

链接: https://arxiv.org/abs/2410.23757
作者: Yue Liu,Shihao Zhu,Tianyuan Yang,Jian Ma,Wenliang Zhong
关键词-EN: Group Recommendation, Recommendation, Group, aims to recommend, recommend items
类目: Information Retrieval (cs.IR)
*备注: 26 pages

点击查看摘要

Abstract:Group Recommendation (GR), which aims to recommend items to groups of users, has become a promising and practical direction for recommendation systems. This paper points out two issues of the state-of-the-art GR models. (1) The pre-defined and fixed number of user groups is inadequate for real-time industrial recommendation systems, where the group distribution can shift dynamically. (2) The training schema of existing GR methods is supervised, necessitating expensive user-group and group-item labels, leading to significant annotation costs. To this end, we present a novel unsupervised group recommendation framework named \underlineIdentify \underlineThen \underlineRecommend (\underlineITR), where it first identifies the user groups in an unsupervised manner even without the pre-defined number of groups, and then two pre-text tasks are designed to conduct self-supervised group recommendation. Concretely, at the group identification stage, we first estimate the adaptive density of each user point, where areas with higher densities are more likely to be recognized as group centers. Then, a heuristic merge-and-split strategy is designed to discover the user groups and decision boundaries. Subsequently, at the self-supervised learning stage, the pull-and-repulsion pre-text task is proposed to optimize the user-group distribution. Besides, the pseudo group recommendation pre-text task is designed to assist the recommendations. Extensive experiments demonstrate the superiority and effectiveness of ITR on both user recommendation (e.g., 22.22% NDCG@5 \uparrow ) and group recommendation (e.g., 22.95% NDCG@5 \uparrow ). Furthermore, we deploy ITR on the industrial recommender and achieve promising results.

[IR-4] owards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment

链接: https://arxiv.org/abs/2410.23715
作者: Jia Song,Wanru Zhuang,Yujie Lin,Liang Zhang,Chunyan Li,Jinsong Su,Song He,Xiaochen Bo
关键词-EN: accurate similarity calculation, drug design, aims to learn, learn a shared, modalities for accurate
类目: Information Retrieval (cs.IR)
*备注: BIBM 2024 regular paper

点击查看摘要

Abstract:Cross-modal text-molecule retrieval model aims to learn a shared feature space of the text and molecule modalities for accurate similarity calculation, which facilitates the rapid screening of molecules with specific properties and activities in drug design. However, previous works have two main defects. First, they are inadequate in capturing modality-shared features considering the significant gap between text sequences and molecule graphs. Second, they mainly rely on contrastive learning and adversarial training for cross-modality alignment, both of which mainly focus on the first-order similarity, ignoring the second-order similarity that can capture more structural information in the embedding space. To address these issues, we propose a novel cross-modal text-molecule retrieval model with two-fold improvements. Specifically, on the top of two modality-specific encoders, we stack a memory bank based feature projector that contain learnable memory vectors to extract modality-shared features better. More importantly, during the model training, we calculate four kinds of similarity distributions (text-to-text, text-to-molecule, molecule-to-molecule, and molecule-to-text similarity distributions) for each instance, and then minimize the distance between these similarity distributions (namely second-order similarity losses) to enhance cross-modal alignment. Experimental results and analysis strongly demonstrate the effectiveness of our model. Particularly, our model achieves SOTA performance, outperforming the previously-reported best result by 6.4%.

[IR-5] Unveiling User Satisfaction and Creator Productivity Trade-Offs in Recommendation Platforms

链接: https://arxiv.org/abs/2410.23683
作者: Fan Yao,Yiming Liao,Jingzhou Liu,Shaoliang Nie,Qifan Wang,Haifeng Xu,Hongning Wang
关键词-EN: significantly impact creators’, impact creators’ motivation, allocated user traffic, algorithmically allocated user, algorithms significantly impact
类目: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:On User-Generated Content (UGC) platforms, recommendation algorithms significantly impact creators’ motivation to produce content as they compete for algorithmically allocated user traffic. This phenomenon subtly shapes the volume and diversity of the content pool, which is crucial for the platform’s sustainability. In this work, we demonstrate, both theoretically and empirically, that a purely relevance-driven policy with low exploration strength boosts short-term user satisfaction but undermines the long-term richness of the content pool. In contrast, a more aggressive exploration policy may slightly compromise user satisfaction but promote higher content creation volume. Our findings reveal a fundamental trade-off between immediate user satisfaction and overall content production on UGC platforms. Building on this finding, we propose an efficient optimization method to identify the optimal exploration strength, balancing user and creator engagement. Our model can serve as a pre-deployment audit tool for recommendation algorithms on UGC platforms, helping to align their immediate objectives with sustainable, long-term goals.

[IR-6] Demonstrating Linked Battery Data To Accelerate Knowledge Flow in Battery Science

链接: https://arxiv.org/abs/2410.23303
作者: Philipp Dechent,Elias Barbers,Simon Clark,Susanne Lehner,Brady Planden,Masaki Adachi,David A. Howey,Sabine Paarmann
关键词-EN: Batteries are pivotal, climate-friendly future, pivotal for transitioning, data, Batteries
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注:

点击查看摘要

Abstract:Batteries are pivotal for transitioning to a climate-friendly future, leading to a surge in battery research. Scopus (Elsevier) lists 14,388 papers that mention “lithium-ion battery” in 2023 alone, making it infeasible for individuals to keep up. This paper discusses strategies based on structured, semantic, and linked data to manage this information overload. Structured data follows a predefined, machine-readable format; semantic data includes metadata for context; linked data references other semantic data, forming a web of interconnected information. We use a battery-related ontology, BattINFO to standardise terms and enable automated data extraction and analysis. Our methodology integrates full-text search and machine-readable data, enhancing data retrieval and battery testing. We aim to unify commercial cell information and develop tools for the battery community such as manufacturer-independent cycling procedure descriptions and external memory for Large Language Models. Although only a first step, this approach significantly accelerates battery research and digitalizes battery testing, inviting community participation for continuous improvement. We provide the structured data and the tools to access them as open source.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2024-11-01

目录

概览 (2024-11-01)

自然语言处理

人工智能

计算机视觉

机器学习

信息检索

附件下载