Arxiv今日论文 | 2024-10-29

本篇博文主要展示 2024-10-29 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决多模态输入输出的问题，即如何构建一个能够处理和生成文本、音频、图像和视频的通用模型。解决方案的关键在于开发了一个端到端的自回归多模态模型GPT-4o，该模型能够统一处理不同类型的输入和输出，并且在视觉和音频理解方面相比现有模型有显著提升。GPT-4o不仅在处理速度和成本上优于GPT-4 Turbo，还在非英语语言的文本处理上表现出色。此外，论文还强调了模型的安全性和对社会潜在影响的评估，通过发布GPT-4o系统卡来详细展示其能力、局限性和安全措施。

链接: https://arxiv.org/abs/2410.21276
作者: OpenAI:Aaron Hurst,Adam Lerer,Adam P. Goucher,Adam Perelman,Aditya Ramesh,Aidan Clark,AJ Ostrow,Akila Welihinda,Alan Hayes,Alec Radford,Aleksander Mądry,Alex Baker-Whitcomb,Alex Beutel,Alex Borzunov,Alex Carney,Alex Chow,Alex Kirillov,Alex Nichol,Alex Paino,Alex Renzin,Alex Tachard Passos,Alexander Kirillov,Alexi Christakis,Alexis Conneau,Ali Kamali,Allan Jabri,Allison Moyer,Allison Tam,Amadou Crookes,Amin Tootoochian,Amin Tootoonchian,Ananya Kumar,Andrea Vallone,Andrej Karpathy,Andrew Braunstein,Andrew Cann,Andrew Codispoti,Andrew Galu,Andrew Kondrich,Andrew Tulloch,Andrey Mishchenko,Angela Baek,Angela Jiang,Antoine Pelisse,Antonia Woodford,Anuj Gosalia,Arka Dhar,Ashley Pantuliano,Avi Nayak,Avital Oliver,Barret Zoph,Behrooz Ghorbani,Ben Leimberger,Ben Rossen,Ben Sokolowsky,Ben Wang,Benjamin Zweig,Beth Hoover,Blake Samic,Bob McGrew,Bobby Spero,Bogo Giertler,Bowen Cheng,Brad Lightcap,Brandon Walkin,Brendan Quinn,Brian Guarraci,Brian Hsu,Bright Kellogg,Brydon Eastman,Camillo Lugaresi,Carroll Wainwright,Cary Bassin,Cary Hudson,Casey Chu,Chad Nelson,Chak Li,Chan Jun Shern,Channing Conger,Charlotte Barette,Chelsea Voss,Chen Ding,Cheng Lu,Chong Zhang,Chris Beaumont,Chris Hallacy,Chris Koch,Christian Gibson,Christina Kim,Christine Choi,Christine McLeavey,Christopher Hesse,Claudia Fischer,Clemens Winter,Coley Czarnecki,Colin Jarvis,Colin Wei,Constantin Koumouzelis,Dane Sherburn
关键词-EN: autoregressive omni model, generates any combination, autoregressive omni, text, System Card
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It’s trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o’s capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we’ve implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o’s text and vision capabilities.
摘要：GPT-4o 是一种自回归的通用模型，能够接受文本、音频、图像和视频的任意组合作为输入，并生成文本、音频和图像的任意组合作为输出。该模型在文本、视觉和音频领域进行了端到端的训练，意味着所有输入和输出均由同一神经网络处理。GPT-4o 能够在短短 232 毫秒内响应音频输入，平均响应时间为 320 毫秒，这与人类在对话中的响应时间相当。在英文文本和代码方面，GPT-4o 与 GPT-4 Turbo 性能相当，但在非英语语言的文本处理上有了显著提升，同时在 API 使用上速度更快且成本降低了 50%。相较于现有模型，GPT-4o 在视觉和音频理解方面表现尤为出色。秉承我们构建安全 AI 的承诺，并遵循我们自愿向白宫作出的承诺，我们发布了 GPT-4o 系统卡片，其中包括我们的准备框架评估。在该系统卡片中，我们详细介绍了 GPT-4o 在多个类别中的能力、局限性和安全性评估，重点评估了语音到语音的能力，同时也评估了文本和图像能力，并实施了多项措施以确保模型的安全性和一致性。我们还包含了第三方对危险能力的评估，以及对 GPT-4o 文本和视觉能力潜在社会影响的讨论。

[NLP-1] Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics

【速读】：该论文试图解决的问题是：大型语言模型（LLMs）在解决推理任务时，是通过学习稳健且可泛化的算法，还是通过记忆训练数据来实现的。解决方案的关键在于，通过因果分析识别出模型中解释基本算术逻辑行为的主要子集（电路），并深入研究其功能。研究发现，这些电路中的重要神经元实现了简单的启发式方法，每个启发式方法识别特定的数值输入模式并输出相应答案。论文假设这些启发式神经元的组合是生成正确算术答案的机制，并通过实验验证了这一假设。最终，研究结果表明，LLMs在执行算术任务时既不依赖稳健的算法，也不是通过记忆训练数据，而是依赖于“启发式方法的集合”（“bag of heuristics”）。

链接: https://arxiv.org/abs/2410.21272
作者: Yaniv Nikankin,Anja Reusch,Aaron Mueller,Yonatan Belinkov
关键词-EN: solve reasoning tasks, large language models, memorize training data, learning robust generalizable, large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a representative task. Using causal analysis, we identify a subset of the model (a circuit) that explains most of the model’s behavior for basic arithmetic logic and examine its functionality. By zooming in on the level of individual circuit neurons, we discover a sparse set of important neurons that implement simple heuristics. Each heuristic identifies a numerical input pattern and outputs corresponding answers. We hypothesize that the combination of these heuristic neurons is the mechanism used to produce correct arithmetic answers. To test this, we categorize each neuron into several heuristic types-such as neurons that activate when an operand falls within a certain range-and find that the unordered combination of these heuristic types is the mechanism that explains most of the model’s accuracy on arithmetic prompts. Finally, we demonstrate that this mechanism appears as the main source of arithmetic accuracy early in training. Overall, our experimental results across several LLMs show that LLMs perform arithmetic using neither robust algorithms nor memorization; rather, they rely on a “bag of heuristics”.
摘要：大语言模型 (LLMs) 是通过学习稳健且可泛化的算法来解决推理任务，还是仅仅通过记忆训练数据来实现？为了探讨这一问题，我们以算术推理作为代表性任务进行研究。通过因果分析，我们识别出模型中解释大部分基本算术逻辑行为的一个子集（即电路），并考察其功能。通过深入到单个电路神经元的层面，我们发现了一组稀疏但重要的神经元，它们实现了简单的启发式规则。每个启发式规则识别特定的数值输入模式，并输出相应的答案。我们假设这些启发式神经元的组合是生成正确算术答案的机制。为了验证这一点，我们将每个神经元分类为几种启发式类型——例如，当操作数落入某个特定范围时激活的神经元——并发现这些启发式类型的无序组合是解释模型在算术提示上大部分准确性的机制。最后，我们证明这种机制在训练初期就成为算术准确性的主要来源。总体而言，我们在多个大语言模型上的实验结果表明，LLMs 进行算术运算既不依赖于稳健的算法，也不是通过记忆，而是依赖于“启发式规则的集合”。

[NLP-2] EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

【速读】：该论文试图解决模型压缩过程中引入的误差问题，特别是如何在用户定制需求（如任务类型、压缩比率）下，通过引入残差低秩路径来补偿这些误差，从而在不局限于特定压缩格式的情况下，灵活调整模型的整体容量。解决方案的关键是提出了一种无需训练的特征空间低秩近似方法（Training-free Eigenspace Low-Rank Approximation, EoRA）。EoRA通过直接最小化压缩引起的误差，利用输入激活的特征空间和特征值来优先重建高重要性的误差成分，从而在少量校准数据的情况下实现快速优化。此外，EoRA可以无缝集成到微调和量化过程中，进一步提升效果和效率，在多种任务（如语言生成、常识推理和数学推理）中显著优于先前的方法。

链接: https://arxiv.org/abs/2410.21271
作者: Shih-Yang Liu,Huck Yang,Chein-Yi Wang,Nai Chit Fung,Hongxu Yin,Charbel Sakr,Saurav Muralidharan,Kwang-Ting Cheng,Jan Kautz,Yu-Chiang Frank Wang,Pavlo Molchanov,Min-Hung Chen
关键词-EN: customized compensation problem, specific compression formats, introduce residual low-rank, customized compensation, compensation problem
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs in various capacity and efficiency requirements.
摘要：在本研究中，我们将模型压缩问题重新表述为定制补偿问题：给定一个压缩模型，我们的目标是通过引入残差低秩路径来补偿用户定制需求（如任务类型、压缩比率）下的压缩误差，从而在不受特定压缩格式限制的情况下，提高整体容量的调整灵活性。然而，简单地应用奇异值分解（SVD）来推导残差路径会导致低秩表示容量的次优利用。为此，我们提出了无需训练的特征空间低秩近似（Eigenspace Low-Rank Approximation, EoRA）方法，该方法直接最小化压缩引起的误差，无需基于梯度的训练，通过少量校准数据即可在几分钟内实现快速优化。EoRA将压缩误差投影到输入激活的特征空间中，利用特征值有效优先重建高重要性误差成分。此外，EoRA可以无缝集成到微调（fine-tuning）和量化（quantization）过程中，进一步提高效果和效率。在补偿压缩后的LLaMA2/3模型在各种任务（如语言生成、常识推理和数学推理任务）中的误差时，EoRA始终优于先前的方法（例如，在ARC-Easy/ARC-Challenge和MathQA任务中，补偿LLaMA3-8B模型量化至4比特并剪枝至2:4稀疏度时，分别提升了31.31%/12.88%和9.69%）。EoRA提供了一种可扩展、无需训练的解决方案来补偿压缩误差，使其成为在各种容量和效率需求下部署大语言模型的强大工具。

[NLP-3] Are BabyLMs Second Language Learners?

【速读】：该论文试图解决的是如何在2024年版的BabyLM Challenge中优化语言模型的学习效率和效果。解决方案的关键在于从第二语言（L2）学习的角度出发，强调显式语言信息的教学，如语法概念、词汇定义和表达方式的多样性。具体方法包括利用Wiktionary数据、由大型语言模型（LLM）生成的语法示例或从语法书中提取的示例，以及句子改写数据。研究发现，显式的词汇意义信息（如Wiktionary数据）对模型性能提升有限，而语法信息能带来小幅提升。最具影响力的数据成分是句子改写数据，论文中表现最佳的两个模型分别基于改写数据与BabyLM预训练数据混合训练，以及仅使用改写数据进行训练。

链接: https://arxiv.org/abs/2410.21254
作者: Lukas Edman,Lisa Bylinina,Faeze Ghorbanpour,Alexander Fraser
关键词-EN: paper describes, describes a linguistically-motivated, Warstadt, linguistically-motivated approach, data
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.
摘要：本文介绍了一种基于语言学动机的方法，用于应对2024年版BabyLM挑战赛（Warstadt et al. 2023）。我们没有采用第一语言学习（L1）范式，而是从第二语言（L2）学习的角度出发。在L2学习中，更强调学习显性的语言信息，如语法概念、词汇定义或表达同一意义的多种方式。这使得L2学习可能更为高效和简洁。我们通过使用来自Wiktionary的数据、由大语言模型生成的或从语法书中获取的语法示例，以及释义数据来近似这一过程。我们发现，关于词汇意义的显性信息（在我们的案例中，即Wiktionary）并未提升模型性能，而语法信息则能带来小幅改进。最具影响力的数据成分是句子释义，我们表现最好的两个模型分别基于以下数据训练：1）释义数据与BabyLM预训练数据集的混合，以及2）仅使用释义数据。

[NLP-4] LongReward: Improving Long-context Large Language Models with AI Feedback

【速读】：该论文试图解决长上下文大语言模型（LLMs）在监督微调（SFT）过程中，由于合成数据质量不佳导致的长上下文性能下降问题。解决方案的关键在于提出了一种名为LongReward的新方法，该方法利用现成的LLM从四个关键维度（helpfulness, logicality, faithfulness, completeness）对长上下文模型响应进行评估，并生成相应的奖励信号。通过结合LongReward与离线强化学习算法DPO，论文展示了该方法能够有效提升长上下文SFT模型的性能，同时增强模型遵循短指令的能力。此外，研究还发现，长上下文DPO与传统短上下文DPO可以结合使用，而不会相互影响性能。

链接: https://arxiv.org/abs/2410.21252
作者: Jiajie Zhang,Zhongni Hou,Xin Lv,Shulin Cao,Zhenyu Hou,Yilin Niu,Lei Hou,Yuxiao Dong,Ling Feng,Juanzi Li
关键词-EN: large language models, developing long-context large, long-context large language, supervised fine-tuning, inherent limitations
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models’ capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward and offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models’ long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one’s performance.
摘要：尽管在开发长上下文大语言模型（LLM）方面取得了显著进展，但用于监督微调（SFT）的LLM合成数据的质量往往会影响SFT模型的长上下文性能，并导致固有的局限性。原则上，通过适当的奖励信号进行强化学习（RL）可以进一步提升模型的能力。然而，如何在长上下文场景中获取可靠的奖励信号仍未得到探索。为此，我们提出了LongReward，这是一种利用现成的LLM为长上下文模型响应提供奖励的新方法，从四个具有人类价值维度的方面进行评估：有用性、逻辑性、忠实性和完整性，每个维度都经过精心设计的评估流程。通过结合LongReward和离线RL算法DPO，我们能够有效提升长上下文SFT模型。我们的实验表明，LongReward不仅显著提高了模型的长上下文性能，还增强了其遵循简短指令的能力。我们还发现，结合LongReward的长上下文DPO和传统的短上下文DPO可以同时使用，而不会相互影响性能。

[NLP-5] Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback

【速读】：该论文试图解决在没有相关性监督的情况下构建有效的密集检索系统的问题。解决方案的关键在于引入了一种名为“从相关性反馈中提取真实文档嵌入 (Real Document Embeddings from Relevance Feedback, ReDE-RF)”的方法。ReDE-RF 通过将假设文档生成重新构想为相关性估计任务，利用大型语言模型 (LLM) 来选择用于最近邻搜索的文档，从而避免了依赖 LLM 的领域特定知识，并显著减少了生成假设文档所需的计算量。具体来说，LLM 只需输出一个单一的标记来进行相关性判断，从而大幅提高了搜索的延迟性能。实验结果表明，ReDE-RF 在多个低资源检索数据集上持续超越了现有的零样本密集检索方法，并在每查询延迟方面取得了显著改进。

链接: https://arxiv.org/abs/2410.21242
作者: Nour Jedidi,Yung-Sung Chuang,Leslie Shing,James Glass
关键词-EN: Building effective dense, systems remains difficult, Building effective, Large Language Model, retrieval systems remains
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Building effective dense retrieval systems remains difficult when relevance supervision is not available. Recent work has looked to overcome this challenge by using a Large Language Model (LLM) to generate hypothetical documents that can be used to find the closest real document. However, this approach relies solely on the LLM to have domain-specific knowledge relevant to the query, which may not be practical. Furthermore, generating hypothetical documents can be inefficient as it requires the LLM to generate a large number of tokens for each query. To address these challenges, we introduce Real Document Embeddings from Relevance Feedback (ReDE-RF). Inspired by relevance feedback, ReDE-RF proposes to re-frame hypothetical document generation as a relevance estimation task, using an LLM to select which documents should be used for nearest neighbor search. Through this re-framing, the LLM no longer needs domain-specific knowledge but only needs to judge what is relevant. Additionally, relevance estimation only requires the LLM to output a single token, thereby improving search latency. Our experiments show that ReDE-RF consistently surpasses state-of-the-art zero-shot dense retrieval methods across a wide range of low-resource retrieval datasets while also making significant improvements in latency per-query.
摘要：在没有相关性监督的情况下，构建有效的密集检索系统仍然具有挑战性。最近的研究试图通过使用大语言模型 (LLM) 生成假设文档来克服这一难题，这些假设文档可用于找到最接近的真实文档。然而，这种方法完全依赖于 LLM 具备与查询相关的领域特定知识，这在实际应用中可能并不现实。此外，生成假设文档的效率较低，因为每次查询都需要 LLM 生成大量 Token。为了解决这些问题，我们提出了基于相关性反馈的实际文档嵌入 (ReDE-RF)。受相关性反馈的启发，ReDE-RF 提出将假设文档生成重新构建为相关性估计任务，利用 LLM 选择应用于最近邻搜索的文档。通过这种重新构建，LLM 不再需要领域特定知识，而只需判断什么是相关的。此外，相关性估计仅需要 LLM 输出一个 Token，从而提高了搜索延迟。我们的实验表明，ReDE-RF 在广泛的低资源检索数据集上持续超越最先进的零样本密集检索方法，同时在每次查询的延迟方面也取得了显著改进。

[NLP-6] Flaming-hot Initiation with Regular Execution Sampling for Large Language Models

【速读】：该论文试图解决在大语言模型 (LLMs) 开发中，如何高效地获取多样化、高质量数据以提升推理相关任务（如数学或代码生成）的性能问题。解决方案的关键是引入了一种名为“Flaming-hot Initiation with Regular Execution (FIRE) 采样”的方法，该方法通过在推理阶段和训练对齐阶段提高生成质量，从而有效提升模型性能。FIRE 采样的核心在于促进生成内容的多样性，并通过在响应的不同位置应用该方法来分析其影响。

链接: https://arxiv.org/abs/2410.21236
作者: Weizhe Chen,Zhicheng Zhang,Guanlin Liu,Renjie Zheng,Wenlei Shi,Chen Dun,Zheng Wu,Xing Jin,Lin Yan
关键词-EN: large language models, demonstrated remarkable capabilities, release of ChatGPT, large language, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.
摘要：自 ChatGPT 发布以来，大语言模型 (Large Language Model, LLM) 在各个领域展示了显著的能力。开发这些通用能力的一个关键挑战是如何高效地获取多样化、高质量的数据。这在涉及沙盒检查器的推理相关任务中尤为重要，例如数学或代码任务，其目标是以更高的概率生成特定问题的正确解决方案。在本研究中，我们引入了 Flaming-hot Initiation with Regular Execution (FIRE) 采样方法，这是一种简单但非常有效的方法，用于高效地找到优质响应。我们的实证研究发现，FIRE 采样不仅提升了推理阶段的生成质量，还有助于对齐阶段的训练。此外，我们探讨了 FIRE 采样如何通过促进多样性来提高性能，并分析了在响应的不同位置应用 FIRE 的影响。

[NLP-7] LoRA vs Full Fine-tuning: An Illusion of Equivalence

【速读】：该论文试图解决的问题是：尽管低秩适应（Low-Rank Adaptation, LoRA）在下游任务中能够以极少的可训练参数达到与全量微调（full fine-tuning）相似的性能，但这两种方法是否真的学习到了等价的模型解决方案。论文通过分析模型权重矩阵的谱特性，发现全量微调和LoRA生成的权重矩阵在奇异值分解（Singular Value Decomposition, SVD）结构上存在显著差异，且在适应任务分布之外的测试中表现出不同的泛化行为。关键解决方案在于识别并解释了LoRA模型中出现的“入侵维度”（intruder dimensions），这些维度在全量微调中不存在，且会导致模型在预训练分布上的表现变差，以及在多任务连续适应中的鲁棒性降低。论文进一步提出，通过增加LoRA的秩并稳定秩，可以更接近全量微调的效果，从而减少入侵维度的影响。

链接: https://arxiv.org/abs/2410.21228
作者: Reece Shuttleworth,Jacob Andreas,Antonio Torralba,Pratyusha Sharma
关键词-EN: adapting pre-trained large, pre-trained large language, large language models, models, full fine-tuning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, \emphare their learned solutions really equivalent? We study how different fine-tuning methods change pre-trained models by analyzing the model’s weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task’s distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emphintruder dimensions. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.
摘要：微调是使预训练的大语言模型适应下游任务的关键范式。最近，像低秩适应 (LoRA) 这样的方法已被证明在大幅减少可训练参数数量的同时，能够在各种任务上与完全微调的模型性能相匹配。即使在两种方法都能学习到同样准确模型的场景下，它们所学到的解决方案真的等价吗？我们通过分析模型权重矩阵的谱特性，研究了不同微调方法如何改变预训练模型。我们发现，完全微调和 LoRA 产生的权重矩阵在奇异值分解上展现出截然不同的结构；此外，在适应任务分布之外进行测试时，微调后的模型表现出不同的泛化行为。更具体地说，我们首先展示了使用 LoRA 训练的权重矩阵具有新的、高排序的奇异向量，我们称之为“入侵维度”。在完全微调过程中不会出现这些入侵维度。其次，我们展示了尽管在目标任务上 LoRA 模型与完全微调的模型性能相似，但具有入侵维度的 LoRA 模型在预训练分布上的表现更差，并且在连续适应多个任务时适应性较弱。更高秩、秩稳定的 LoRA 模型在性能上与完全微调非常接近，即使在相同任务上与低秩 LoRA 模型表现相当。这些结果表明，即使在大语言模型微调分布上表现相同，使用 LoRA 和完全微调更新的模型访问了参数空间的不同部分。最后，我们探讨了为什么入侵维度会出现在 LoRA 微调的模型中，为什么它们是不理想的，以及如何最小化它们的影响。

[NLP-8] HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

【速读】：该论文试图解决的问题是现有位置编码（Positional Encodings, PEs）在大型语言模型（LLMs）中表现出的长程衰减（long-term decay）特性，这一特性与LLMs在处理需要从任意位置精确检索上下文信息的任务时存在矛盾。论文通过实证分析发现，模型在全局范围内形成了U形注意力模式，而非长程衰减模式，这表明长程衰减原则在LLMs时代已经过时。解决方案的关键在于提出了一种新型的高频旋转位置编码（High-frequency rotary Position Encoding, HoPE），该编码通过替换旋转位置编码（Rotary Position Encoding, RoPE）中的特定组件为位置无关的高频信号，打破了长程衰减的理论原则。HoPE的两大优势在于：1) 消除了限制自发性注意力优化和模型外推性能的矛盾因素；2) 优化了表示位置和语义的组件，从而增强了模型的上下文感知和外推能力。

链接: https://arxiv.org/abs/2410.21216
作者: Yuhan Chen,Ang Lv,Jian Luan,Bin Wang,Wei Liu
关键词-EN: long-standing inductive opinion, current position carry, long-term decay, exhibit long-term decay, inductive opinion
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and found that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE’s expressiveness and this http URL by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are are optimized. These enhances model’s context awareness and extrapolation, as validated by extensive experiments.
摘要：许多位置编码 (Positional Encoding, PE) 的设计基于一个根深蒂固且长期存在的归纳观点，即距离当前位置越远的 Token 携带的相关信息越少，因此表现出长程衰减的特性。我们认为，在当前大语言模型 (Large Language Model, LLM) 的时代，这种长程衰减的观点已经过时，因为 LLM 现在被应用于需要从任意位置精确检索上下文信息的任务中。首先，我们对多种位置编码进行了实证分析，结果表明模型在局部范围内自然地学习到了仅具有局部衰减模式的注意力，而在全局范围内则形成了 U 形模式，这与长程衰减的原则相矛盾。此外，我们对旋转位置编码 (Rotary Position Encoding, RoPE) 进行了详细分析，发现这种 U 形注意力是由某些学习到的组件引起的，这些组件也是限制 RoPE 表达能力的关键因素。基于这些洞察，我们提出了高频旋转位置编码 (High-frequency rotary Position Encoding, HoPE)。HoPE 用与位置无关的组件替换了 RoPE 中的特定组件，仅保留高频信号，这在理论上打破了长程衰减的原则。HoPE 实现了两大优势：(1) 去除了由长程衰减带来的限制，消除了限制自发性注意力优化和模型外推性能的矛盾因素；(2) 优化了表示位置和语义的组件，从而增强了模型的上下文感知能力和外推性能，这一点通过广泛的实验得到了验证。

[NLP-9] BongLLaMA: LLaMA for Bangla Language

【速读】：该论文试图解决孟加拉语（Bangla）作为低资源语言在自然语言处理（BLP）任务中表现不佳的问题。解决方案的关键在于引入了一个名为BongLLaMA的开源大型语言模型，该模型专门针对大量孟加拉语文本和指令调优数据集进行了微调。通过数据增强技术和详细的微调过程，BongLLaMA在孟加拉语处理任务中展示了显著的性能提升，并有望成为孟加拉语语言模型的新基准，从而促进未来针对这一广泛使用但资源匮乏的语言的基准测试研究。

链接: https://arxiv.org/abs/2410.21200
作者: Abdullah Khan Zehady,Safi Al Mamun,Naymul Islam,Santu Karmaker
关键词-EN: million people worldwide, million native speakers, Bangla Language Processing, million native, million people
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:Bangla (or “Bengali”) is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a “low-resource” language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet “low-resource” language. All BongLLaMA models are available for public use at this https URL.
摘要：孟加拉语（或称“孟加拉语”）是一种由约2.4亿母语使用者和全球约3亿人使用的语言。尽管它是全球第五大使用语言，但孟加拉语仍被视为“低资源”语言，现有的预训练语言模型在孟加拉语语言处理（BLP）任务中往往表现不佳。本研究通过引入BongLLaMA（即孟加拉语-LLaMA）来填补这一空白，这是一个专门针对大量孟加拉语文本和指令微调数据集进行微调的开源大语言模型。我们介绍了我们的方法论、数据增强技术、微调细节以及全面的基准测试结果，展示了BongLLaMA在BLP任务中的实用性。我们相信BongLLaMA将成为孟加拉语语言模型的新标准基线，从而促进针对这一广泛使用但资源匮乏的语言的未来基准测试研究。所有BongLLaMA模型均可在此https URL公开使用。

[NLP-10] Belief in the Machine: Investigating Epistemological Blind Spots of Language Models

【速读】：该论文试图解决现代语言模型（LMs）在区分事实、信念和知识方面的能力问题，特别是在医疗、法律和新闻等关键领域中，这些模型的错误判断可能导致严重后果。解决方案的关键在于系统性地评估LMs在处理这些基本认知挑战时的表现，特别是通过使用新的数据集KaBLE进行测试。研究发现，尽管LMs在处理事实场景时表现较好（86%准确率），但在处理虚假场景和信念相关任务时表现显著下降。此外，LMs在识别和确认个人信念方面存在困难，尤其是在这些信念与事实数据相矛盾时。研究还揭示了LMs在处理第一人称和第三人称信念时的显著偏差，以及对知识本质（即知识必然要求真实性）的理解不足。这些发现强调了在广泛部署LMs于关键领域之前，需要在这些方面进行改进和提升。

链接: https://arxiv.org/abs/2410.21195
作者: Mirac Suzgun,Tayfun Gur,Federico Bianchi,Daniel E. Ho,Thomas Icard,Dan Jurafsky,James Zou
关键词-EN: language models, differentiate between fact, reliable decision-making, integral to fields, essential for reliable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: this https URL

点击查看摘要

Abstract:As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person’s beliefs is critical. Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (80.7%) compared to first-person tasks (54.4%). Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth. Fifth, LMs rely on linguistic cues for fact-checking and sometimes bypass the deeper reasoning. These findings highlight significant concerns about current LMs’ ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.
摘要：随着语言模型 (Language Models, LMs) 在医疗、法律和新闻等领域的应用日益广泛，它们区分事实、信念和知识的能力对于可靠的决策制定至关重要。未能把握这些区别可能会在医疗诊断、法律判决和假新闻传播等领域带来严重后果。尽管如此，当前的文献大多关注如心智理论等更为复杂的问题，而忽略了更为基本的认识论挑战。本研究系统地评估了现代语言模型（包括 GPT-4、Claude-3 和 Llama-3）的认识论推理能力，使用了一个新的数据集 KaBLE，该数据集包含 13,000 个问题，涵盖 13 项任务。我们的研究结果揭示了几个关键的局限性。首先，尽管语言模型在事实场景中达到了 86% 的准确率，但在虚假场景中，特别是在与信念相关的任务中，其表现显著下降。其次，语言模型在识别和确认个人信念方面存在困难，尤其是当这些信念与事实数据相矛盾时，这引发了在医疗和咨询等应用中的担忧，因为在这些领域中，与个人的信念互动至关重要。第三，我们发现语言模型在处理第一人称与第三人称信念时存在显著偏差，在第三人称任务中的表现（80.7%）优于第一人称任务（54.4%）。第四，语言模型缺乏对知识的事实性本质的深刻理解，即知识本质上需要真实性。第五，语言模型依赖于语言线索进行事实核查，有时会绕过更深层次的推理。这些发现突显了当前语言模型在推理事实、信念和知识方面的显著问题，并强调了在关键领域广泛部署之前，需要在这些方面取得进展。

[NLP-11] Document Parsing Unveiled: Techniques Challenges and Prospects for Structured Information Extraction

【速读】：该论文试图解决文档解析（Document Parsing）中的关键问题，即将非结构化和半结构化文档（如合同、学术论文和发票）转换为结构化、机器可读数据的过程。解决方案的关键在于采用从模块化流水线系统到端到端模型（End-to-End Models）的多种方法，特别是利用大型视觉-语言模型（Large Vision-Language Models）来驱动文档解析。核心组件包括布局检测（Layout Detection）、内容提取（Content Extraction，包括文本、表格和数学表达式）以及多模态数据集成（Multi-Modal Data Integration）。论文还强调了在处理复杂布局、整合多个模块和识别高密度文本时面临的挑战，并指出开发更大、更多样化的数据集的重要性，以及未来的研究方向。

链接: https://arxiv.org/abs/2410.21169
作者: Qintong Zhang,Victor Shea-Jay Huang,Bin Wang,Junyuan Zhang,Zhengren Wang,Hao Liang,Shawn Wang,Matthieu Lin,Wentao Zhang,Conghui He
关键词-EN: Document parsing, documents-such as contracts, essential for converting, semi-structured documents-such, Document parsing extract
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document parsing is essential for converting unstructured and semi-structured documents-such as contracts, academic papers, and invoices-into structured, machine-readable data. Document parsing extract reliable structured data from unstructured inputs, providing huge convenience for numerous applications. Especially with recent achievements in Large Language Models, document parsing plays an indispensable role in both knowledge base construction and training data generation. This survey presents a comprehensive review of the current state of document parsing, covering key methodologies, from modular pipeline systems to end-to-end models driven by large vision-language models. Core components such as layout detection, content extraction (including text, tables, and mathematical expressions), and multi-modal data integration are examined in detail. Additionally, this paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts, integrating multiple modules, and recognizing high-density text. It emphasizes the importance of developing larger and more diverse datasets and outlines future research directions.
摘要：文档解析是将非结构化和半结构化文档（如合同、学术论文和发票）转换为结构化、机器可读数据的关键步骤。文档解析能够从非结构化输入中提取可靠的结构化数据，为众多应用提供了极大的便利。特别是在大语言模型（Large Language Models）的最新进展下，文档解析在知识库构建和训练数据生成中扮演着不可或缺的角色。本综述全面回顾了当前文档解析的现状，涵盖了从模块化流水线系统到由大型视觉-语言模型驱动的端到端模型的关键方法。详细探讨了布局检测、内容提取（包括文本、表格和数学表达式）以及多模态数据集成等核心组件。此外，本文还讨论了模块化文档解析系统和视觉-语言模型在处理复杂布局、集成多个模块以及识别高密度文本时面临的挑战。强调了开发更大规模和更多样化数据集的重要性，并概述了未来的研究方向。

[NLP-12] M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

【速读】：该论文试图解决现有代码大型语言模型（LLMs）在多语言场景下的通用代码智能能力评估问题。现有基准数据集通常仅涵盖有限数量的编程语言（5种），并且缺乏对不同完成场景的细粒度评估。解决方案的关键在于提出了一个大规模多语言的代码完成基准（M2RC-EVAL），涵盖18种编程语言，并提供了两种细粒度注释（bucket-level和semantic-level），这些注释基于解析的抽象语法树（AST）生成。此外，论文还构建了一个大规模多语言指令语料库（M2RC-INSTRUCT），以提升现有代码LLMs的代码完成能力。综合实验结果验证了M2RC-EVAL和M2RC-INSTRUCT的有效性。

链接: https://arxiv.org/abs/2410.21157
作者: Jiaheng Liu,Ken Deng,Congnan Liu,Jian Yang,Shukai Liu,He Zhu,Peng Zhao,Linzheng Chai,Yanan Wu,Ke Jin,Ge Zhang,Zekun Wang,Guoan Zhang,Bangyu Xiang,Wenbo Su,Bo Zheng
关键词-EN: Repository-level code completion, drawn great attention, Large Language Models, Repository-level code, code completion
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 19 pages

点击查看摘要

Abstract:Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (5), which cannot evaluate the general code intelligence abilities across different languages for existing code Large Language Models (LLMs). Besides, the existing benchmarks usually report overall average scores of different languages, where the fine-grained abilities in different completion scenarios are ignored. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), and two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios are provided, where we obtain these annotations based on the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpora M2RC- INSTRUCT dataset to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
摘要：在软件工程领域，代码库级别的代码补全技术引起了广泛关注，并引入了多个基准数据集。然而，现有的代码库级别代码补全基准通常仅关注有限数量的编程语言（5种），这无法全面评估现有代码大语言模型（LLMs）在不同语言间的通用代码智能能力。此外，现有基准通常报告不同语言的总体平均分数，忽略了不同补全场景中的细粒度能力。因此，为了促进多语言场景下代码LLMs的研究，我们提出了一种涵盖18种编程语言的大规模多语言代码库级别代码补全基准（称为M2RC-EVAL），并提供了两种类型的细粒度注释（即桶级别和语义级别），这些注释基于解析的抽象语法树获得。此外，我们还构建了一个大规模多语言指令语料库M2RC-INSTRUCT数据集，以提升现有代码LLMs的代码库级别代码补全能力。综合实验结果证明了我们的M2RC-EVAL和M2RC-INSTRUCT的有效性。

[NLP-13] SciER: An Entity and Relation Extraction Dataset for Datasets Methods and Tasks in Scientific Documents EMNLP2024

【速读】：该论文试图解决科学信息提取 (Scientific Information Extraction, SciIE) 领域中由于数据标注复杂性和成本高昂，导致现有数据集仅限于摘要部分标注，从而丢失了全文上下文中多样化的实体提及和关系的问题。解决方案的关键在于发布了一个新的实体和关系提取数据集，该数据集包含了106篇手动标注的全文科学出版物，涵盖超过24,000个实体和12,000个关系。该数据集不仅提供了细粒度的关系标签集，以捕捉全文中的实体复杂交互，还提供了一个分布外的测试集，以进行更现实的评估。通过引入这一全面的数据集，论文鼓励开发创新的模型，以应对SciIE领域的挑战，推动该领域的发展。

链接: https://arxiv.org/abs/2410.21155
作者: Qi Zhang,Zhijia Chen,Huitong Pan,Cornelia Caragea,Longin Jan Latecki,Eduard Dragut
关键词-EN: converting unstructured knowledge, structured data, critical for converting, converting unstructured, unstructured knowledge
类目: Computation and Language (cs.CL)
备注: EMNLP2024 Main

点击查看摘要

Abstract:Scientific information extraction (SciIE) is critical for converting unstructured knowledge from scholarly articles into structured data (entities and relations). Several datasets have been proposed for training and validating SciIE models. However, due to the high complexity and cost of annotating scientific texts, those datasets restrict their annotations to specific parts of paper, such as abstracts, resulting in the loss of diverse entity mentions and relations in context. In this paper, we release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations. To capture the intricate use and interactions among entities in full texts, our dataset contains a fine-grained tag set for relations. Additionally, we provide an out-of-distribution test set to offer a more realistic evaluation. We conduct comprehensive experiments, including state-of-the-art supervised models and our proposed LLM-based baselines, and highlight the challenges presented by our dataset, encouraging the development of innovative models to further the field of SciIE.
摘要：科学信息提取（Scientific Information Extraction, SciIE）对于将学术文章中的非结构化知识转化为结构化数据（实体和关系）至关重要。已有多个数据集被提出用于训练和验证 SciIE 模型。然而，由于标注科学文本的高复杂性和成本，这些数据集通常仅限于标注文章的特定部分，如摘要，从而导致上下文中多样化的实体提及和关系信息的丢失。本文中，我们发布了一个新的实体和关系提取数据集，该数据集专注于科学文章中与数据集、方法和任务相关的实体。我们的数据集包含 106 篇手工标注的全文科学出版物，涵盖超过 24,000 个实体和 12,000 个关系。为了捕捉全文中的实体复杂使用和交互情况，我们的数据集提供了细粒度的关系标签集。此外，我们还提供了一个分布外测试集，以提供更真实的评估。我们进行了全面的实验，包括最先进的监督模型和我们提出的基于大语言模型的基线模型，并强调了我们数据集带来的挑战，鼓励开发创新模型以进一步推动 SciIE 领域的发展。

[NLP-14] Palisade – Prompt Injection Detection Framework

【速读】：该论文试图解决大语言模型 (LLMs) 在面对恶意提示注入攻击时的脆弱性问题。解决方案的关键在于提出了一种基于自然语言处理 (NLP) 的多层输入筛选框架，通过规则基础层、机器学习分类器层和辅助LLM层的三层筛选机制，提高提示注入检测的准确性和优化性能。该框架通过减少假阴性（即漏检恶意提示）来增强整体检测精度，尽管可能会增加假阳性（即误报），但优先考虑了系统的安全性。

链接: https://arxiv.org/abs/2410.21146
作者: Sahasra Kokkula,Somanathan R,Nandavardhan R,Aashishkumar,G Divya
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, Language Models LLMs, generate human language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of Large Language Models LLMs marks a milestone in Artificial Intelligence, altering how machines comprehend and generate human language. However, LLMs are vulnerable to malicious prompt injection attacks, where crafted inputs manipulate the models behavior in unintended ways, compromising system integrity and causing incorrect outcomes. Conventional detection methods rely on static, rule-based approaches, which often fail against sophisticated threats like abnormal token sequences and alias substitutions, leading to limited adaptability and higher rates of false positives and false this http URL paper proposes a novel NLP based approach for prompt injection detection, emphasizing accuracy and optimization through a layered input screening process. In this framework, prompts are filtered through three distinct layers rule-based, ML classifier, and companion LLM before reaching the target model, thereby minimizing the risk of malicious this http URL show the ML classifier achieves the highest accuracy among individual layers, yet the multi-layer framework enhances overall detection accuracy by reducing false negatives. Although this increases false positives, it minimizes the risk of overlooking genuine injected prompts, thus prioritizing this http URL multi-layered detection approach highlights LLM vulnerabilities and provides a comprehensive framework for future research, promoting secure interactions between humans and AI systems.
摘要：大语言模型 (LLM) 的出现标志着人工智能领域的一个重要里程碑，改变了机器理解和生成人类语言的方式。然而，LLM 容易受到恶意提示注入攻击，即通过精心设计的输入操纵模型的行为，导致系统完整性受损并产生错误结果。传统的检测方法依赖于静态的基于规则的方法，这些方法往往无法应对复杂的威胁，如异常 Token 序列和别名替换，导致适应性有限，误报率和漏报率较高。本文提出了一种基于自然语言处理 (NLP) 的新型提示注入检测方法，通过分层输入筛选过程强调准确性和优化。在该框架中，提示在到达目标模型之前，需经过三个不同的层次过滤：基于规则的过滤、机器学习 (ML) 分类器和辅助 LLM。这种方法最小化了恶意提示的风险。研究表明，ML 分类器在单个层次中实现了最高的准确性，但多层框架通过减少漏报率提高了整体检测准确性。尽管这增加了误报率，但它最小化了忽略真实注入提示的风险，从而优先考虑了系统的安全性。这种多层次检测方法突显了 LLM 的脆弱性，并为未来的研究提供了一个全面的框架，促进了人类与 AI 系统之间的安全交互。

[NLP-15] uOttawa at LegalLens-2024: Transformer-based Classification Experiments

【速读】：该论文试图解决在非结构化文本数据中检测法律违规行为并将其与可能受影响的个人关联的问题。解决方案的关键在于采用了两种主要方法：针对法律命名实体识别（L-NER）子任务，使用了spaCy库；针对法律自然语言推理（L-NLI）子任务，采用了结合RoBERTa和CNN的混合模型。这两种方法的有效性在实验中得到了验证，分别在L-NER和L-NLI子任务中取得了86.3%和88.25%的准确率，展示了transformer模型在法律领域复杂任务中的应用潜力。

链接: https://arxiv.org/abs/2410.21139
作者: Nima Meghdadi,Diana Inkpen
关键词-EN: potentially affected individuals, unstructured textual data, detecting legal violations, Named Entity Recognition, Natural Language Inference
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents the methods used for LegalLens-2024 shared task, which focused on detecting legal violations within unstructured textual data and associating these violations with potentially affected individuals. The shared task included two subtasks: A) Legal Named Entity Recognition (L-NER) and B) Legal Natural Language Inference (L-NLI). For subtask A, we utilized the spaCy library, while for subtask B, we employed a combined model incorporating RoBERTa and CNN. Our results were 86.3% in the L-NER subtask and 88.25% in the L-NLI subtask. Overall, our paper demonstrates the effectiveness of transformer models in addressing complex tasks in the legal domain. The source code for our implementation is publicly available at this https URL
摘要：本文介绍了用于 LegalLens-2024 共享任务的方法，该任务专注于检测非结构化文本数据中的法律违规行为，并将这些违规行为与可能受影响的个人关联起来。共享任务包括两个子任务：A) 法律命名实体识别 (L-NER) 和 B) 法律自然语言推理 (L-NLI)。对于子任务 A，我们使用了 spaCy 库，而对于子任务 B，我们采用了一个结合了 RoBERTa 和 CNN 的混合模型。我们在 L-NER 子任务中的结果为 86.3%，在 L-NLI 子任务中的结果为 88.25%。总体而言，本文展示了 Transformer 模型在解决法律领域复杂任务中的有效性。我们的实现代码已在以下网址公开：https URL。

[NLP-16] owards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments AAAI-2025

【速读】：该论文试图解决的是现有可解释AI技术中，反事实解释（Counterfactual Explanations）的评估方法缺乏以用户为中心的研究，且现有评估指标未能充分捕捉人类视角的问题。解决方案的关键在于开发了一套包含30个反事实场景的多样化评估集，并通过206名受访者的评分收集了8个评估指标的数据。随后，论文对不同的大型语言模型（LLMs）进行了微调，以预测这些评估指标下的平均或个体人类判断。通过这种方法，微调后的模型在零样本评估中达到了63%的准确率，在三分类预测中达到了85%的准确率，从而提供了更好的可比性和可扩展性，用于评估不同的反事实解释框架。

链接: https://arxiv.org/abs/2410.21131
作者: Marharyta Domnich,Julius Valja,Rasmus Moorits Veski,Giacomo Magnifico,Kadi Tulver,Eduard Barbu,Raul Vicente
关键词-EN: maintaining transparency demands, learning models evolve, machine learning models, maintaining transparency, explainable AI techniques
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been submitted in August and is currently under review to AAAI-2025

点击查看摘要

Abstract:As machine learning models evolve, maintaining transparency demands more human-centric explainable AI techniques. Counterfactual explanations, with roots in human reasoning, identify the minimal input changes needed to obtain a given output and, hence, are crucial for supporting decision-making. Despite their importance, the evaluation of these explanations often lacks grounding in user studies and remains fragmented, with existing metrics not fully capturing human perspectives. To address this challenge, we developed a diverse set of 30 counterfactual scenarios and collected ratings across 8 evaluation metrics from 206 respondents. Subsequently, we fine-tuned different Large Language Models (LLMs) to predict average or individual human judgment across these metrics. Our methodology allowed LLMs to achieve an accuracy of up to 63% in zero-shot evaluations and 85% (over a 3-classes prediction) with fine-tuning across all metrics. The fine-tuned models predicting human ratings offer better comparability and scalability in evaluating different counterfactual explanation frameworks.
摘要：随着机器学习模型的不断演进，保持透明性需要更多以人为中心的可解释 AI 技术。反事实解释（Counterfactual Explanations）源自人类推理，能够识别获得特定输出所需的最小输入变化，因此在支持决策制定方面至关重要。尽管其重要性不言而喻，但这些解释的评估往往缺乏基于用户研究的依据，且评估标准分散，现有指标未能充分捕捉人类视角。为应对这一挑战，我们设计了一套包含 30 种反事实情景的多样化集合，并从 206 名受访者中收集了针对 8 个评估指标的评分。随后，我们对不同的大语言模型（Large Language Models, LLMs）进行了微调，以预测这些指标下的平均或个体人类判断。我们的方法使 LLMs 在零样本评估中达到了高达 63% 的准确率，并在所有指标上通过微调实现了 85% 的准确率（超过三类预测）。这些经过微调的模型在预测人类评分方面提供了更好的可比性和可扩展性，从而有助于评估不同的反事实解释框架。

[NLP-17] Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model

【速读】：该论文试图解决酶工程中突变效应预测的问题，特别是通过深度学习方法提高预测的准确性和效率。解决方案的关键在于开发了一种检索增强的蛋白质语言模型 (ProtREM)，该模型能够综合分析蛋白质序列和局部结构相互作用中的本征属性，以及从检索到的同源序列中提取的进化属性。通过这种综合分析，ProtREM 在超过200万突变体的开放基准测试 (ProteinGym) 中展示了最先进的性能，并成功预测了突变对酶稳定性、结合亲和力和催化活性的影响，为生物学家提供了一个辅助工具，用于优化现有酶的性能。

链接: https://arxiv.org/abs/2410.21127
作者: Yang Tan,Ruilin Wang,Banghao Wu,Liang Hong,Bingxin Zhou
关键词-EN: Enzyme engineering enables, enhancing catalytic activity, engineering enables, enables the modification, modification of wild-type
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 25 pages, 10 figures, 8 tables

点击查看摘要

Abstract:Enzyme engineering enables the modification of wild-type proteins to meet industrial and research demands by enhancing catalytic activity, stability, binding affinities, and other properties. The emergence of deep learning methods for protein modeling has demonstrated superior results at lower costs compared to traditional approaches such as directed evolution and rational design. In mutation effect prediction, the key to pre-training deep learning models lies in accurately interpreting the complex relationships among protein sequence, structure, and function. This study introduces a retrieval-enhanced protein language model for comprehensive analysis of native properties from sequence and local structural interactions, as well as evolutionary properties from retrieved homologous sequences. The state-of-the-art performance of the proposed ProtREM is validated on over 2 million mutants across 217 assays from an open benchmark (ProteinGym). We also conducted post-hoc analyses of the model’s ability to improve the stability and binding affinity of a VHH antibody. Additionally, we designed 10 new mutants on a DNA polymerase and conducted wet-lab experiments to evaluate their enhanced activity at higher temperatures. Both in silico and experimental evaluations confirmed that our method provides reliable predictions of mutation effects, offering an auxiliary tool for biologists aiming to evolve existing enzymes. The implementation is publicly available at this https URL.
摘要：酶工程通过增强催化活性、稳定性、结合亲和力等特性，使野生型蛋白质得以改造以满足工业和研究需求。深度学习方法在蛋白质建模中的应用，相较于传统的定向进化和理性设计方法，在降低成本的同时展现了更优越的结果。在突变效应预测中，预训练深度学习模型的关键在于准确解读蛋白质序列、结构与功能之间的复杂关系。本研究引入了一种检索增强的蛋白质语言模型，用于综合分析从序列和局部结构相互作用中提取的固有特性，以及从检索到的同源序列中提取的进化特性。所提出的ProtREM在ProteinGym这一公开基准上的217项实验中，对超过200万个突变体进行了验证，展示了其最先进的性能。我们还对模型在提高VHH抗体稳定性和结合亲和力方面的能力进行了事后分析。此外，我们在DNA聚合酶上设计了10个新的突变体，并通过湿实验评估了它们在高温下的活性增强效果。无论是计算评估还是实验验证，均证实了我们的方法在突变效应预测上的可靠性，为致力于进化现有酶的生物学家提供了一个辅助工具。该实现的代码已公开，详见此https链接。

[NLP-18] Current State-of-the-Art of Bias Detection and Mitigation in Machine Translation for African and European Languages: a Review

【速读】：该论文试图解决自然语言处理（Natural Language Processing, NLP）和机器翻译（Machine Translation）中的偏见检测与缓解问题，特别是在欧洲和非洲语言中的应用。解决方案的关键在于分析当前最先进的技术，并指出大多数研究集中在少数语言上，从而强调未来研究应关注较少被研究的语言，以促进研究领域的多样性。

链接: https://arxiv.org/abs/2410.21126
作者: Catherine Ikae,Mascha Kurpicz-Briki
关键词-EN: Studying bias detection, Studying bias, natural language processing, highly relevant, bias detection
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Studying bias detection and mitigation methods in natural language processing and the particular case of machine translation is highly relevant, as societal stereotypes might be reflected or reinforced by these systems. In this paper, we analyze the state-of-the-art with a particular focus on European and African languages. We show how the majority of the work in this field concentrates on few languages, and that there is potential for future research to cover also the less investigated languages to contribute to more diversity in the research field.
摘要：研究自然语言处理中的偏见检测与缓解方法，特别是机器翻译中的偏见问题，具有极高的相关性，因为这些系统可能反映或强化社会刻板印象。本文中，我们重点分析了当前最先进的技术，特别关注欧洲和非洲语言。我们展示了该领域的大部分工作集中在少数语言上，并指出未来研究有潜力覆盖较少研究的语言，从而为研究领域带来更多样性。

[NLP-19] Zero-Shot Action Recognition in Surveillance Videos

【速读】：该论文试图解决公共空间监控中由于人力资源短缺带来的挑战，特别是在监控视频理解任务中，由于数据集有限和监控环境复杂（如视角、低质量等），导致基于核心计算机视觉模型的AI系统需要大量微调的问题。解决方案的关键在于利用大视觉语言模型（Large Vision-Language Models, LVLMs），如VideoLLaMA2，以及改进的采样方法，即自反式采样（Self-Reflective Sampling, Self-ReS）。这些技术显著提升了零样本（zero-shot）和少样本（few-shot）泛化能力，实验结果表明，VideoLLaMA2在UCF-Crime数据集上的零样本性能比基线提升了20%，而Self-ReS进一步将零样本动作识别性能提升至44.6%。

链接: https://arxiv.org/abs/2410.21113
作者: Joao Pereira,Vasco Lopes,David Semedo,Joao Neves
关键词-EN: public spaces presents, spaces presents significant, presents significant challenges, human resources, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems heavily rely on core computer vision models that require extensive finetuning, which is particularly difficult in surveillance settings due to limited datasets and difficult setting (viewpoint, low quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.
摘要：随着公共场所监控需求的不断增长，由于人力资源的短缺，这一领域面临着重大挑战。当前基于人工智能的视频监控系统主要依赖于需要大量微调的核心计算机视觉模型，这在监控环境中尤为困难，因为数据集有限且环境复杂（如视角、低质量等）。在本研究中，我们提出利用大视觉-语言模型 (Large Vision-Language Models, LVLMs) 来应对监控中的视频理解任务，这些模型以其强大的零样本和少样本泛化能力而著称。具体而言，我们探索了最先进的 LVLM——VideoLLaMA2，以及一种改进的 Token 级采样方法——自反射采样 (Self-Reflective Sampling, Self-ReS)。我们在 UCF-Crime 数据集上的实验表明，VideoLLaMA2 在零样本性能上实现了显著提升，比基线提高了 20%。Self-ReS 进一步将零样本动作识别性能提升至 44.6%。这些结果突显了 LVLMs 结合改进采样技术在推进多样化监控视频分析中的潜力。

[NLP-20] Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

【速读】：该论文试图解决大型语言模型 (Large Language Model, LLM) 安全性问题，特别是通过红队测试 (red team testing) 发现的模型安全漏洞。解决方案的关键在于提出了一种改进的迁移攻击方法 (transfer attack method)，通过在本地使用良性数据蒸馏 (benign data distillation) 训练目标黑箱模型的镜像模型 (mirror model)，来指导恶意提示的构建。这种方法增强了攻击的隐蔽性，因为它在搜索阶段不涉及向目标模型提交可识别的恶意指令，从而避免了被内容审核者拦截的风险。实验结果显示，该方法在GPT-3.5 Turbo模型上达到了92%的最大攻击成功率，或在平衡值为80%的情况下，平均每样本仅产生1.5次可检测的越狱查询，突显了现有防御机制的不足。

链接: https://arxiv.org/abs/2410.21083
作者: Honglin Mu,Han He,Yuxin Zhou,Yunlong Feng,Yang Xu,Libo Qin,Xiaoming Shi,Zeming Liu,Xudong Han,Qi Shi,Qingfu Zhu,Wanxiang Che
关键词-EN: Large language model, numerous studies employing, studies employing red, employing red team, red team testing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) safety is a critical issue, with numerous studies employing red team testing to enhance model security. Among these, jailbreak methods explore potential vulnerabilities by crafting malicious prompts that induce model outputs contrary to safety alignments. Existing black-box jailbreak methods often rely on model feedback, repeatedly submitting queries with detectable malicious instructions during the attack search process. Although these approaches are effective, the attacks may be intercepted by content moderators during the search process. We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation. This method offers enhanced stealth, as it does not involve submitting identifiable malicious instructions to the target model during the search phase. Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo on a subset of AdvBench. These results underscore the need for more robust defense mechanisms.
摘要：大语言模型（LLM）的安全性是一个关键问题，众多研究通过红队测试来提升模型的安全性。其中，越狱方法通过设计恶意提示来探索潜在的漏洞，诱导模型输出与安全对齐相悖的内容。现有的黑盒越狱方法通常依赖于模型反馈，在攻击搜索过程中反复提交带有可检测恶意指令的查询。尽管这些方法有效，但在搜索过程中攻击可能会被内容审核者拦截。我们提出了一种改进的迁移攻击方法，通过良性数据蒸馏在本地训练目标黑盒模型的镜像模型，从而指导恶意提示的构建。这种方法增强了隐蔽性，因为在搜索阶段不涉及向目标模型提交可识别的恶意指令。我们的方法在AdvBench的一个子集上对GPT-3.5 Turbo实现了最高92%的攻击成功率，或者在每样本平均1.5次可检测越狱查询的情况下达到80%的平衡值。这些结果强调了需要更强大的防御机制。

[NLP-21] CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models

【速读】：该论文试图解决大型语言模型（LLMs）在机器翻译中处理上下文依赖术语（如新词或领域特定词汇）时出现的翻译不一致和错误问题。解决方案的关键在于提出了一种名为CRAT的新型多代理翻译框架，该框架结合了检索增强生成（RAG）和因果增强的自省机制。CRAT框架通过多个专门代理协同工作，包括未知术语识别代理、知识图谱（KG）构建代理、因果增强的判断代理和翻译代理，实现了对关键术语的自动、精确和一致的处理。具体来说，未知术语识别代理检测上下文中的未知术语，知识图谱构建代理提取相关内部知识并从外部源检索双语信息，因果增强的判断代理验证信息的准确性，翻译代理则将精炼后的信息整合到最终输出中。这种自动化流程显著提高了翻译的准确性，特别是在处理上下文敏感术语和新兴词汇方面。

链接: https://arxiv.org/abs/2410.21067
作者: Meiqi Chen,Fandong Meng,Yingxue Zhang,Yan Zhang,Jie Zhou
关键词-EN: Large language models, shown great promise, contextually dependent terms, Large language, domain-specific words
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain-specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval-Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality-enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.
摘要：大语言模型（Large Language Models, LLMs）在机器翻译领域展现了巨大的潜力，但它们在处理依赖上下文的术语，如新词或领域特定词汇时仍面临挑战。这导致了翻译中的不一致性和错误，难以解决。现有的解决方案通常依赖于手动识别这些术语，这在面对语言的复杂性和不断演变的特性时显得不切实际。尽管检索增强生成（Retrieval-Augmented Generation, RAG）可以提供一定帮助，但其应用于翻译时受到信息过载导致的幻觉问题的限制。本文提出了一种名为CRAT的新型多智能体翻译框架，该框架结合了RAG和因果增强的自省机制来应对这些挑战。该框架包含多个专门化的智能体：未知术语识别智能体（Unknown Terms Identification agent）在上下文中检测未知术语，知识图谱构建智能体（Knowledge Graph Constructor agent）提取这些术语的相关内部知识并从外部资源检索双语信息，因果增强的判断智能体（Causality-enhanced Judge agent）验证信息的准确性，翻译智能体（Translator agent）将精炼后的信息整合到最终输出中。这一自动化过程使得在翻译过程中能够更精确和一致地处理关键术语。我们的实验结果表明，CRAT显著提高了翻译的准确性，特别是在处理上下文敏感术语和新出现的词汇方面。

[NLP-22] Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

【速读】：该论文试图解决现有主题建模方法在处理短文本数据时存在的两个主要问题：一是假设每个文档只有一个主题，二是难以高效处理大规模、噪声较多的短文本数据集。论文提出的解决方案是引入一种名为语义成分分析 (Semantic Component Analysis, SCA) 的新型主题建模技术。SCA 的关键在于通过在基于聚类的主题建模框架中引入分解步骤，能够发现短文本中的多个、细致的语义成分，从而克服了传统方法的局限性。实验结果表明，SCA 在多个 Twitter 数据集上的表现与当前最先进的方法 BERTopic 相当，同时在语义成分的数量和噪声控制方面表现更优，并且具有跨语言的适用性，包括对少数语言的支持。

链接: https://arxiv.org/abs/2410.21054
作者: Florian Eichin,Carolin Schuster,Georg Groh,Michael A. Hedderich
关键词-EN: Semantic Component Analysis, Topic modeling, efficiently for large, topic modeling framework, existing approaches
类目: Computation and Language (cs.CL)
备注: 5 pages, 3 figures, code: this https URL

点击查看摘要

Abstract:Topic modeling is a key method in text analysis, but existing approaches are limited by assuming one topic per document or fail to scale efficiently for large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts which we accomplish by introducing a decomposition step to the clustering-based topic modeling framework. Evaluated on multiple Twitter datasets, SCA matches the state-of-the-art method BERTopic in coherence and diversity, while uncovering at least double the semantic components and maintaining a noise rate close to zero while staying scalable and effective across languages, including an underrepresented one.
摘要：主题建模是文本分析中的关键方法，但现有方法在假设每篇文档只有一个主题或无法高效扩展以处理大规模、嘈杂的短文本数据集方面存在局限性。我们引入了语义成分分析 (Semantic Component Analysis, SCA)，这是一种新颖的主题建模技术，通过在基于聚类的主题建模框架中引入分解步骤，克服了这些局限性，能够在短文本中发现多个细微的语义成分，而不仅仅是一个单一主题。在多个 Twitter 数据集上的评估结果显示，SCA 在一致性和多样性方面与最先进的方法 BERTopic 相当，同时揭示的语义成分数量至少是 BERTopic 的两倍，并且在保持接近零的噪声率的同时，在包括未充分代表的语言在内的多种语言中保持了可扩展性和有效性。

[NLP-23] Sorting Out the Bad Seeds: Automatic Classification of Cryptocurrency Abuse Reports

【速读】：该论文试图解决加密货币滥用报告的自动分类问题，特别是当前分类方法依赖于报告者选择或分析师手动分类，存在报告者经验不足和分类效率低下的问题。解决方案的关键在于利用大型语言模型 (LLM) 对报告文本进行解释，并自动分配到预定义的19种滥用类型分类体系中。通过收集29万份来自BitcoinAbuse和BBB’s ScamTracker的报告，构建真实数据集进行评估，论文提出的基于LLM的分类器在精确度、召回率和F1分数上均显著优于传统的监督机器学习分类器，分别为0.92、0.87和0.89，而基线分类器的F1分数仅为0.55。该解决方案不仅提高了分类准确性，还为细粒度的滥用类型提供了财务损失统计，并为加密货币分析平台生成了标记地址。

链接: https://arxiv.org/abs/2410.21041
作者: Gibran Gomez,Kevin van Liebergen,Davide Sanvito,Giuseppe Siracusano,Roberto Gonzalez,Juan Caballero
关键词-EN: Abuse, victims have suffered, abuse types, cryptocurrency abuse reports, abuse victims
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Abuse reporting services collect reports about abuse victims have suffered. Accurate classification of the submitted reports is fundamental to analyzing the prevalence and financial impact of different abuse types (e.g., sextortion, investment, romance). Current classification approaches are problematic because they require the reporter to select the abuse type from a list, assuming the reporter has the necessary experience for the classification, which we show is frequently not the case, or require manual classification by analysts, which does not scale. To address these issues, this paper presents a novel approach to classify cryptocurrency abuse reports automatically. We first build a taxonomy of 19 frequently reported abuse types. Given as input the textual description written by the reporter, our classifier leverages a large language model (LLM) to interpret the text and assign it an abuse type in our taxonomy. We collect 290K cryptocurrency abuse reports from two popular reporting services: BitcoinAbuse and BBB’s ScamTracker. We build ground truth datasets for 20K of those reports and use them to evaluate three designs for our LLM-based classifier and four LLMs, as well as a supervised ML classifier used as a baseline. Our LLM-based classifier achieves a precision of 0.92, a recall of 0.87, and an F1 score of 0.89, compared to an F1 score of 0.55 for the baseline. We demonstrate our classifier in two applications: providing financial loss statistics for fine-grained abuse types and generating tagged addresses for cryptocurrency analysis platforms.
摘要：滥用报告服务收集关于受害者遭受的滥用行为的报告。准确分类提交的报告对于分析不同滥用类型（例如，性勒索、投资诈骗、情感诈骗）的普遍性和经济影响至关重要。当前的分类方法存在问题，因为它们要求报告者从列表中选择滥用类型，假设报告者具备必要的分类经验，但我们表明这种情况经常并非如此，或者需要分析师手动分类，这无法扩展。为了解决这些问题，本文提出了一种自动分类加密货币滥用报告的新方法。我们首先构建了一个包含19种常见滥用类型的分类体系。给定报告者撰写的文本描述作为输入，我们的分类器利用大语言模型（LLM）来解释文本并将其分配到我们的分类体系中的一个滥用类型。我们从两个流行的报告服务——BitcoinAbuse和BBB的ScamTracker——收集了29万份加密货币滥用报告。我们为其中的2万份报告构建了真实数据集，并使用它们来评估我们基于LLM的分类器的三个设计方案和四个LLM，以及作为基线的监督式机器学习分类器。我们的基于LLM的分类器达到了0.92的精确度、0.87的召回率和0.89的F1分数，而基线的F1分数为0.55。我们展示了我们的分类器在两个应用中的效果：为细粒度的滥用类型提供财务损失统计数据，并为加密货币分析平台生成标记地址。

[NLP-24] Beyond Autoregression: Fast LLM s via Self-Distillation Through Time

【速读】：该论文试图解决自回归大型语言模型（AR LLMs）在生成文本时存在的显著延迟问题。解决方案的关键在于采用扩散语言模型（diffusion language models），通过一种新颖的离散扩散模型蒸馏方法，显著减少了推理步骤的数量（减少32-64倍），从而实现了一次性生成至少32个token的能力。这种方法不仅在文本质量和LAMBADA自然语言理解基准上超越了AR模型，而且在实际应用中，即使不使用缓存，生成token的速度也比使用KV缓存的AR模型快8倍，预计在加入缓存后性能还将进一步提升。此外，该方法在高达860M参数的扩散语言模型上也展示了其有效性。

链接: https://arxiv.org/abs/2410.21035
作者: Justin Deschenaux,Caglar Gulcehre
关键词-EN: demonstrated significant success, Large Language Models, Large Language, contemporary autoregressive LLMs, numerous tasks
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.
摘要：自回归 (Autoregressive, AR) 大语言模型 (Large Language Models, LLMs) 在众多任务中展示了显著的成功。然而，AR 建模范式存在一定的局限性；例如，当代自回归 LLMs 被训练为一次生成一个 Token，这可能导致明显的延迟。最近的进展表明，通过在推理过程中利用更多的计算资源，搜索和重复采样可以提升各种应用（如定理证明、代码生成和对齐）的性能。在本研究中，我们展示了扩散语言模型能够同时生成至少 32 个 Token，同时在文本质量和 LAMBADA 自然语言理解基准上超越 AR 模型的表现。这一成果是通过一种新颖的离散扩散模型蒸馏方法实现的，该方法将推理步骤的数量减少了 32-64 倍。实际上，即使不使用缓存，我们的模型生成 Token 的速度也比使用 KV 缓存的 AR 模型快 8 倍，并且我们预计随着缓存的引入，性能将进一步提高。此外，我们展示了我们的方法在高达 860M 参数的扩散语言模型中的有效性。

[NLP-25] ransferable Post-training via Inverse Value Learning

【速读】：该论文试图解决现有算法在处理大规模数据集和大型基础模型时面临的计算需求和实施挑战。解决方案的关键在于提出了一种在logits层级进行后训练调整的方法，通过引入一个独立的神经网络（即价值网络）来建模这些变化。该价值网络在小型基础模型上进行训练后，可以无缝集成到其他预训练模型中，在推理阶段实现类似的能力提升。论文系统地研究了预训练权重和连接方案的最佳实践，并展示了价值网络在同一模型家族内不同参数大小的预训练模型、连续预训练模型以及不同词汇表的模型家族间的广泛适用性。在某些情况下，价值网络的性能可与全参数微调相媲美，同时探索了增强价值模型迁移性和防止过拟合的方法。

链接: https://arxiv.org/abs/2410.21027
作者: Xinyu Lu,Xueru Wen,Yaojie Lu,Bowen Yu,Hongyu Lin,Haiyang Yu,Le Sun,Xianpei Han,Yongbin Li
关键词-EN: processes utilize increasingly, utilize increasingly large, increasingly large datasets, post-training processes utilize, escalating significantly
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As post-training processes utilize increasingly large datasets and base models continue to grow in size, the computational demands and implementation challenges of existing algorithms are escalating significantly. In this paper, we propose modeling the changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference, enables them to achieve similar capability enhancements. We systematically investigate the best practices for this paradigm in terms of pre-training weights and connection schemes. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes within the same family, models undergoing continuous pre-training within the same family, and models with different vocabularies across families. In certain cases, it can achieve performance comparable to full-parameter fine-tuning. Furthermore, we explore methods to enhance the transferability of the value model and prevent overfitting to the base model used during training.
摘要：随着后训练过程所使用的数据集规模不断扩大，基础模型的大小也在持续增长，现有算法的计算需求和实现挑战显著增加。本文提出，在后训练阶段通过一个独立的神经网络（即价值网络）来建模logits级别的变化。在利用演示数据对一个小型基础模型进行训练后，该网络可以无缝集成到其他预训练模型中，在推理过程中使其获得类似的性能提升。我们系统地研究了这种范式在预训练权重和连接方案方面的最佳实践。实验表明，生成的价值网络在同一族内不同参数大小的预训练模型之间、同一族内进行连续预训练的模型之间，以及不同族之间具有不同词汇表的模型之间，都具有广泛的迁移性。在某些情况下，其性能可与全参数微调相媲美。此外，我们还探讨了增强价值模型迁移性的方法，并防止其对训练时使用的基础模型过度拟合。

[NLP-26] Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers

【速读】：该论文试图解决基于Transformer的神经网络在学习不规则屈折变化模式时的行为问题。解决方案的关键在于将不规则模式应用于屈折变化任务，并将其建模为字符序列到序列的学习问题。具体研究对象是西班牙语中的不规则动词，特别是那些在第一人称单数直陈式中词干与虚拟式模式不规则匹配的动词（L-shaped verbs）。研究通过对比不同输入频率条件下的模型表现，探讨了频率在学习过程中的作用，并使用事后分析方法揭示了模型学习这种不规则模式的能力及其可能的错误来源。

链接: https://arxiv.org/abs/2410.21013
作者: Akhilesh Kakolu Ramarao,Kevin Tang,Dinah Baer-Henney
关键词-EN: present paper evaluates, transformer-based neural network, irregular inflectional paradigm, present paper, paper evaluates
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The present paper evaluates the learning behaviour of a transformer-based neural network with regard to an irregular inflectional paradigm. We apply the paradigm cell filling problem to irregular patterns. We approach this problem using the morphological reinflection task and model it as a character sequence-to-sequence learning problem. The test case under investigation are irregular verbs in Spanish. Besides many regular verbs in Spanish L-shaped verbs the first person singular indicative stem irregularly matches the subjunctive paradigm, while other indicative forms remain unaltered. We examine the role of frequency during learning and compare models under differing input frequency conditions. We train the model on a corpus of Spanish with a realistic distribution of regular and irregular verbs to compare it with models trained on input with augmented distributions of (ir)regular words. We explore how the neural models learn this L-shaped pattern using post-hoc analyses. Our experiments show that, across frequency conditions, the models are surprisingly capable of learning the irregular pattern. Furthermore, our post-hoc analyses reveal the possible sources of errors. All code and data are available at \urlthis https URL under MIT license.
摘要：本文评估了基于 Transformer 的神经网络在学习不规则屈折范式时的行为。我们将范式单元填充问题应用于不规则模式。通过形态学重屈折任务，我们将该问题建模为字符序列到序列的学习问题。研究对象为西班牙语中的不规则动词。除了许多规则动词外，西班牙语中的 L 形动词在第一人称单数直陈式词干与虚拟式范式不规则地匹配，而其他直陈式形式保持不变。我们考察了学习过程中频率的作用，并比较了在不同输入频率条件下模型的表现。我们使用具有现实分布的规则和不规则动词的西班牙语文本语料库进行模型训练，并与使用增强分布的（不）规则词输入训练的模型进行比较。通过事后分析，我们探讨了神经网络如何学习这种 L 形模式。实验结果表明，在各种频率条件下，模型都能出乎意料地学习到不规则模式。此外，事后分析揭示了可能的错误来源。所有代码和数据均在 MIT 许可证下提供，链接为 \urlthis https URL。

[NLP-27] FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval

【速读】：该论文试图解决大型语言模型 (Large Language Models, LLMs) 在多事实检索任务中表现不佳的问题，特别是在生成过程中难以同时检索多个事实的“中间迷失”现象 (lost-in-the-middle)。解决方案的关键是引入了一种名为“查找所有关键文本” (Find All Crucial Texts, FACT) 的迭代检索方法，通过多轮重写来逐步精炼上下文，从而使模型能够逐步捕捉到容易被单次检索忽略的关键事实。实验结果表明，FACT 显著提升了多事实检索任务的性能，尽管在通用问答场景中的改进相对不明显。

链接: https://arxiv.org/abs/2410.21012
作者: Jinlin Wang,Suyuchen Wang,Ziwen Xia,Sirui Hong,Yun Zhu,Bang Liu,Chenglin Wu
关键词-EN: Large Language Models, Large Language, retrieving single facts, Language Models, proficient at retrieving
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:Large Language Models (LLMs) are proficient at retrieving single facts from extended contexts, yet they struggle with tasks requiring the simultaneous retrieval of multiple facts, especially during generation. This paper identifies a novel “lost-in-the-middle” phenomenon, where LLMs progressively lose track of critical information throughout the generation process, resulting in incomplete or inaccurate retrieval. To address this challenge, we introduce Find All Crucial Texts (FACT), an iterative retrieval method that refines context through successive rounds of rewriting. This approach enables models to capture essential facts incrementally, which are often overlooked in single-pass retrieval. Experiments demonstrate that FACT substantially enhances multi-fact retrieval performance across various tasks, though improvements are less notable in general-purpose QA scenarios. Our findings shed light on the limitations of LLMs in multi-fact retrieval and underscore the need for more resilient long-context retrieval strategies.
摘要：大语言模型 (LLM) 在从扩展上下文中检索单一事实方面表现出色，但在需要同时检索多个事实的任务中，尤其是在生成过程中，它们的表现却不尽如人意。本文识别了一种新的“中间迷失”现象，即在大语言模型的生成过程中，关键信息逐渐丢失，导致检索结果不完整或不准确。为应对这一挑战，我们提出了“查找所有关键文本” (Find All Crucial Texts, FACT)，这是一种通过多轮重写来逐步精炼上下文的迭代检索方法。该方法使模型能够逐步捕捉到在单次检索中容易被忽略的重要事实。实验表明，FACT 在各种任务中显著提升了多事实检索的性能，尽管在通用问答场景中的改进效果不甚显著。我们的研究揭示了大语言模型在多事实检索中的局限性，并强调了开发更为稳健的长上下文检索策略的必要性。

[NLP-28] Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPTs Political Biases

【速读】：该论文试图解决的问题是探究ChatGPT（特别是GPT-3.5和GPT-4）的政治偏见和人格特质，并分析它们模拟政治观点（如自由派或保守派立场）的能力。解决方案的关键在于使用Political Compass Test和Big Five Personality Test对模型进行100次测试，通过计算平均值、标准差和进行显著性检验来分析GPT-3.5和GPT-4之间的差异。研究发现，两个模型均表现出进步和自由主义的政治偏见，GPT-4的偏见略微但可忽略不计地较弱。此外，GPT-4在模拟指定政治观点方面表现出显著能力，准确反映了所有四个测试实例中的指定象限。在人格特质方面，GPT-3.5显示出高度突出的开放性和宜人性特质，而GPT-4则表现出较低的特质显著性，但神经质得分较高。研究还发现，测试顺序影响ChatGPT的响应和观察到的相关性，表明存在一种上下文记忆效应。

链接: https://arxiv.org/abs/2410.21008
作者: Erik Weber,Jérôme Rutinowski,Niklas Jost,Markus Pauly
关键词-EN: Political Compass Test, political, Big Five Personality, Political Compass, Personality Test
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work investigates the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5 to GPT-4. In addition, the ability of the models to emulate political viewpoints (e.g., liberal or conservative positions) is analyzed. The Political Compass Test and the Big Five Personality Test were employed 100 times for each scenario, providing statistically significant results and an insight into the results correlations. The responses were analyzed by computing averages, standard deviations, and performing significance tests to investigate differences between GPT-3.5 and GPT-4. Correlations were found for traits that have been shown to be interdependent in human studies. Both models showed a progressive and libertarian political bias, with GPT-4’s biases being slightly, but negligibly, less pronounced. Specifically, on the Political Compass, GPT-3.5 scored -6.59 on the economic axis and -6.07 on the social axis, whereas GPT-4 scored -5.40 and -4.73. In contrast to GPT-3.5, GPT-4 showed a remarkable capacity to emulate assigned political viewpoints, accurately reflecting the assigned quadrant (libertarian-left, libertarian-right, authoritarian-left, authoritarian-right) in all four tested instances. On the Big Five Personality Test, GPT-3.5 showed highly pronounced Openness and Agreeableness traits (O: 85.9%, A: 84.6%). Such pronounced traits correlate with libertarian views in human studies. While GPT-4 overall exhibited less pronounced Big Five personality traits, it did show a notably higher Neuroticism score. Assigned political orientations influenced Openness, Agreeableness, and Conscientiousness, again reflecting interdependencies observed in human studies. Finally, we observed that test sequencing affected ChatGPT’s responses and the observed correlations, indicating a form of contextual memory.
摘要：本研究探讨了ChatGPT的政治偏见和人格特质，特别是比较了GPT-3.5与GPT-4之间的差异。此外，还分析了模型模拟政治观点（如自由派或保守派立场）的能力。通过100次使用政治罗盘测试和五大人格测试，我们获得了具有统计显著性的结果，并深入了解了结果之间的相关性。通过对响应进行平均值、标准差计算和显著性测试，我们研究了GPT-3.5和GPT-4之间的差异。研究发现，某些特质在人类研究中已被证明是相互依赖的。两个模型均显示出进步和自由主义的政治偏见，但GPT-4的偏见略微但不显著地较弱。具体而言，在政治罗盘上，GPT-3.5在经济轴上得分为-6.59，在社会轴上得分为-6.07，而GPT-4的得分分别为-5.40和-4.73。与GPT-3.5相比，GPT-4在模拟指定政治观点方面表现出显著能力，准确反映了所有四个测试实例中的指定象限（自由左派、自由右派、权威左派、权威右派）。在五大人格测试中，GPT-3.5显示出高度显著的开放性和宜人性特质（O: 85.9%, A: 84.6%）。这些显著特质与人类研究中的自由主义观点相关。尽管GPT-4总体上表现出较不显著的五大人格特质，但其神经质得分显著较高。指定的政治倾向影响了开放性、宜人性和尽责性，再次反映了人类研究中观察到的相互依赖性。最后，我们观察到测试顺序影响了ChatGPT的响应和观察到的相关性，表明存在一种上下文记忆形式。

[NLP-29] DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning NEURIPS2024

【速读】：该论文试图解决当前AI生成文本检测方法在泛化能力和对分布外数据（OOD）的适应性方面存在的瓶颈问题。解决方案的关键在于通过多任务辅助和多层次对比学习框架（DeTeCtive）来区分不同作者的写作风格，而不仅仅是将文本简单分类为人写或AI生成。DeTeCtive结合了密集信息检索管道，能够增强各种文本编码器在检测AI生成文本方面的能力，并在多个基准测试中实现了最先进的性能，特别是在OOD零样本评估中显著优于现有方法。此外，该方法还具备无需训练的增量适应（TFIA）能力，进一步提升了其在OOD检测场景中的有效性。

链接: https://arxiv.org/abs/2410.20964
作者: Xun Guo,Shan Zhang,Yongxin He,Ting Zhang,Wanquan Feng,Haibin Huang,Chongyang Ma
关键词-EN: binary classification paradigms, manual feature crafting, supervised binary classification, Current techniques, AI-generated text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear in NeurIPS 2024. Code is available at this https URL

点击查看摘要

Abstract:Current techniques for detecting AI-generated text are largely confined to manual feature crafting and supervised binary classification paradigms. These methodologies typically lead to performance bottlenecks and unsatisfactory generalizability. Consequently, these methods are often inapplicable for out-of-distribution (OOD) data and newly emerged large language models (LLMs). In this paper, we revisit the task of AI-generated text detection. We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors, rather than simply classifying the text into human-written or AI-generated text. To this end, we propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework. DeTeCtive is designed to facilitate the learning of distinct writing styles, combined with a dense information retrieval pipeline for AI-generated text detection. Our method is compatible with a range of text encoders. Extensive experiments demonstrate that our method enhances the ability of various text encoders in detecting AI-generated text across multiple benchmarks and achieves state-of-the-art results. Notably, in OOD zero-shot evaluation, our method outperforms existing approaches by a large margin. Moreover, we find our method boasts a Training-Free Incremental Adaptation (TFIA) capability towards OOD data, further enhancing its efficacy in OOD detection scenarios. We will open-source our code and models in hopes that our work will spark new thoughts in the field of AI-generated text detection, ensuring safe application of LLMs and enhancing compliance. Our code is available at this https URL.
摘要：当前用于检测 AI 生成文本的技术主要局限于手工特征设计和监督二分类范式。这些方法通常导致性能瓶颈和泛化能力不足，因此往往不适用于分布外 (OOD) 数据和新兴的大语言模型 (LLM)。本文重新审视了 AI 生成文本检测任务，并提出关键在于区分不同作者的写作风格，而非简单地将文本分类为人类撰写或 AI 生成。为此，我们提出了 DeTeCtive，一个多任务辅助、多层次对比学习框架。DeTeCtive 旨在促进对不同写作风格的学习，并结合密集信息检索管道用于 AI 生成文本检测。我们的方法兼容多种文本编码器。大量实验表明，我们的方法提升了各种文本编码器在多个基准上检测 AI 生成文本的能力，并取得了最先进的结果。特别地，在 OOD 零样本评估中，我们的方法显著优于现有方法。此外，我们发现该方法具备无需训练的增量适应 (TFIA) 能力，进一步增强了其在 OOD 检测场景中的有效性。我们将在 https URL 上开源代码和模型，希望我们的工作能激发 AI 生成文本检测领域的新思路，确保 LLM 的安全应用并增强合规性。

[NLP-30] Instruction-Tuned LLM s Succeed in Document-Level MT Without Fine-Tuning – But BLEU Turns a Blind Eye

【速读】：该论文试图解决文档级机器翻译 (document-level machine translation, docMT) 的质量评估问题，特别是现有基于BLEU评分的评估方法在捕捉文档级翻译质量方面的不足。解决方案的关键在于直接利用指令调优的大型语言模型 (Large Language Models, LLMs) 进行文档级翻译，并通过GPT-4作为评判者来评估翻译的连贯性、准确性和流畅性，而不是依赖于传统的基于n-gram的BLEU评分。这种方法不仅展示了LLMs在利用文档上下文进行翻译方面的有效性，还强调了BLEU评分在评估docMT时的局限性。

链接: https://arxiv.org/abs/2410.20941
作者: Yirong Sun,Dawei Zhu,Yanjun Chen,Erjia Xiao,Xinghao Chen,Xiaoyu Shen
关键词-EN: Large language models, NLP tasks, Large language, including machine translation, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation. Code and data are available at this https URL
摘要：大语言模型 (LLMs) 在包括机器翻译 (MT) 在内的多种自然语言处理 (NLP) 任务中表现出色，但大多数研究集中在句子级别的翻译上。本研究探讨了指令微调的 LLMs 在文档级别翻译 (docMT) 中的固有能力。与以往需要专门技术的研究不同，我们通过直接提示 LLMs 一次性翻译整个文档来评估其性能。我们的结果表明，这种方法相较于单独翻译句子，即使在未进行文档级别微调的情况下，也能提高翻译质量。然而，这种优势并未在 BLEU 评分中体现出来，因为 BLEU 评分往往偏向于基于句子的翻译。我们提出使用 LLM-as-a-judge 评估范式，其中 GPT-4 用于以比基于 n-gram 的指标更细致的方式评估文档的连贯性、准确性和流畅性。总体而言，我们的研究表明，指令微调的 LLMs 能够有效利用文档上下文进行翻译。但我们提醒，不应使用 BLEU 评分来评估 docMT，因为它们往往提供误导性的结果，无法捕捉文档级别翻译的质量。代码和数据可在以下链接获取：https URL。

[NLP-31] Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

【速读】：该论文试图解决在生成对抗样本以测试文本分类算法（特别是检测低可信度内容，如宣传、虚假声明、谣言和极端偏见新闻）的鲁棒性时，如何在攻击者查询次数受限的情况下有效生成对抗样本的问题。解决方案的关键在于TREPAT方法，该方法首先利用大型语言模型生成初始的重述，这些重述受启发于意义保留的自然语言处理任务（如文本简化 (text simplification) 和风格迁移 (style transfer)）。随后，这些修改被分解为小变化，并通过束搜索 (beam search) 过程逐步应用，直到受害分类器改变其决策。这种方法在长输入文本（如新闻文章）的场景中表现尤为优越，因为穷举搜索在这种情况下不可行。

链接: https://arxiv.org/abs/2410.20940
作者: Piotr Przybyła
关键词-EN: classification algorithms detecting, algorithms detecting low-credibility, detecting low-credibility content, text classification algorithms, including propaganda
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in case of long input text (news articles), where exhaustive search is not feasible.
摘要：我们研究了生成对抗样本以测试检测低可信度内容（包括宣传、虚假声明、谣言和极端偏见新闻）的文本分类算法鲁棒性的挑战。我们专注于通过设置攻击者允许尝试的查询次数的现实限制来模拟内容审核。在我们的解决方案（TREPAT）中，初始的重新表述由大语言模型生成，这些模型的提示受到意义保留的自然语言处理任务（如文本简化与风格转换）的启发。随后，这些修改被分解为小变化，并通过束搜索过程应用，直到受害分类器改变其决策。评估结果证实了我们的方法在受限场景中的优越性，特别是在长输入文本（新闻文章）的情况下，穷举搜索是不可行的。

[NLP-32] Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency NEURIPS2024 ACL

【速读】：该论文试图解决生成式 AI (Generative AI) 在自动形式化（Autoformalization）任务中，尤其是数学领域，从自然语言描述转换为形式语言时，pass@1 和 pass@k 准确率之间存在显著差距的问题。解决方案的关键在于引入了一种新的框架，通过两种互补的自一致性方法——符号等价性（symbolic equivalence）和语义一致性（semantic consistency）——来评分并选择最佳的 k 个自动形式化候选结果。具体来说，符号等价性利用自动定理证明器识别候选结果之间的逻辑同质性，而语义一致性则通过将候选结果非形式化并计算原始文本与非形式化文本的嵌入相似度来评估原始意义的保留程度。实验结果表明，该方法显著提升了自动形式化的准确性，在 MATH 和 miniF2F 数据集上实现了高达 0.22-1.35 倍的相对改进。

链接: https://arxiv.org/abs/2410.20936
作者: Zenan Li,Yifan Wu,Zhaoyu Li,Xinming Wei,Xian Zhang,Fan Yang,Xiaoxing Ma
关键词-EN: automatically translating natural, translating natural language, natural language descriptions, poses a significant, task of automatically
类目: Computation and Language (cs.CL)
备注: Published as a conference paper at NeurIPS 2024. Code is available at [this https URL]( this https URL )

点击查看摘要

Abstract:Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Elaborately, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving up to 0.22-1.35x relative improvements across various LLMs and baseline methods.
摘要：自动形式化（Autoformalization），即将自然语言描述自动翻译成形式化语言的任务，在多个领域，尤其是数学领域，构成了重大挑战。近期大语言模型（LLM）的进展揭示了其在形式化竞赛级数学问题方面的巨大潜力。然而，我们观察到LLM生成的形式化结果中，pass@1和pass@k准确率之间存在显著差异。为解决这一差距，我们提出了一种新型框架，该框架基于两种互补的自一致性方法——符号等价性和语义一致性，对k个自动形式化候选结果进行评分和选择最佳结果。具体而言，符号等价性通过自动定理证明器识别自动形式化候选之间的逻辑同质性，而语义一致性则通过非形式化候选并计算原始文本与非形式化文本嵌入之间的相似度来评估原始意义的保留情况。我们在MATH和miniF2F数据集上的广泛实验表明，我们的方法显著提升了自动形式化的准确性，在各种LLM和基线方法上实现了高达0.22-1.35倍的相对改进。

[NLP-33] Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

【速读】：该论文试图解决长序列文本处理中的长程依赖问题，特别是在基于注意力机制的模型中，全注意力机制在处理长程依赖时存在建模能力与计算效率之间的不匹配。解决方案的关键在于提出了一种名为“张量化注意力 (Tensorized Attention)”的方法，通过将长输入序列张量化为紧凑的张量表示，并在每个变换维度上应用注意力机制，从而扩展了注意力机制的感受野。这种方法不仅提高了内存和时间效率，还能够在预训练的大型语言模型 (LLMs) 中实现更高效的适应。实验结果表明，张量化注意力机制能够在处理长达128k长度的序列时，相比全注意力机制与FlashAttention-2，实现11倍的加速。

链接: https://arxiv.org/abs/2410.20926
作者: Aosong Feng,Rex Ying,Leandros Tassiulas
关键词-EN: textual data grows, processing extended textual, extended textual data, maintain computational efficiency, handle long-range dependencies
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based model is the mismatch between the limited-range modeling power of full attention and the long-range token dependency in the input sequence. In this work, we propose to scale up the attention receptive field by tensorizing long input sequences into compact tensor representations followed by attention on each transformed dimension. The resulting Tensorized Attention can be adopted as efficient transformer backbones to extend input context length with improved memory and time efficiency. We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process, and is equivalent to Kronecker decomposition of full attention. Extensive experiments show that tensorized attention can be used to adapt pretrained LLMs with improved efficiency. Notably, Llama-8B with tensorization is trained under 32,768 context length and can steadily extrapolate to 128k length during inference with 11\times speedup, compared to full attention with FlashAttention-2.
摘要：随着处理扩展文本数据的需求不断增长，处理长距离依赖关系并保持计算效率的能力变得比以往任何时候都更为关键。在使用基于注意力机制的模型进行长序列建模时，一个关键问题是全注意力机制的有限范围建模能力与输入序列中的长距离Token依赖关系之间的不匹配。在本研究中，我们提出通过将长输入序列张量化为紧凑的张量表示，然后对每个变换后的维度进行注意力计算，从而扩展注意力接收域。由此产生的张量化注意力可以作为高效的Transformer骨干网络，以提高内存和时间效率的方式扩展输入上下文长度。我们证明，所提出的注意力张量化将Token依赖关系编码为多跳注意力过程，并且等价于全注意力的Kronecker分解。广泛的实验表明，张量化注意力可以用于适应预训练的大语言模型，并提高其效率。值得注意的是，经过张量化的Llama-8B在32,768上下文长度下进行训练，并在推理过程中能够稳定地外推至128k长度，相较于使用FlashAttention-2的全注意力机制，速度提升了11倍。

[NLP-34] NeuGPT: Unified multi-modal Neural GPT

【速读】：该论文试图解决神经记录研究领域中数据类型隔离的问题，即不同类型的神经信号（如EEG、MEG、ECoG、SEEG、fMRI和fNIRS）通常被单独分析，缺乏跨模态的整合与交互。解决方案的关键在于开发了一个名为NeuGPT的多模态语言生成模型，该模型借鉴了自然语言处理（NLP）、计算机视觉和语音处理领域中预训练大模型的成功经验，能够处理多种神经记录数据，并与语音和文本数据进行交互。NeuGPT主要关注脑-文本解码，显著提升了BLEU-1和ROUGE-1F的性能，同时还能模拟脑信号，从而作为一种新型的神经接口。

链接: https://arxiv.org/abs/2410.20916
作者: Yiqian Yang,Yiqun Duan,Hyejeong Jo,Qiang Zhang,Renjing Xu,Oiwi Parker Jones,Xuming Hu,Chin-teng Lin,Hui Xiong
关键词-EN: groundbreaking multi-modal language, multi-modal language generation, language generation model, generation model designed, paper introduces NeuGPT
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces NeuGPT, a groundbreaking multi-modal language generation model designed to harmonize the fragmented landscape of neural recording research. Traditionally, studies in the field have been compartmentalized by signal type, with EEG, MEG, ECoG, SEEG, fMRI, and fNIRS data being analyzed in isolation. Recognizing the untapped potential for cross-pollination and the adaptability of neural signals across varying experimental conditions, we set out to develop a unified model capable of interfacing with multiple modalities. Drawing inspiration from the success of pre-trained large models in NLP, computer vision, and speech processing, NeuGPT is architected to process a diverse array of neural recordings and interact with speech and text data. Our model mainly focus on brain-to-text decoding, improving SOTA from 6.94 to 12.92 on BLEU-1 and 6.93 to 13.06 on ROUGE-1F. It can also simulate brain signals, thereby serving as a novel neural interface. Code is available at \hrefthis https URLNeuSpeech/NeuGPT (this https URL) .
摘要：本文介绍了 NeuGPT，这是一个开创性的多模态语言生成模型，旨在整合神经记录研究中碎片化的领域。传统上，该领域的研究按信号类型被划分为多个独立的部分，包括 EEG、MEG、ECoG、SEEG、fMRI 和 fNIRS 数据的单独分析。我们认识到跨领域交叉融合的未开发潜力以及神经信号在不同实验条件下的适应性，因此着手开发一个能够与多种模态接口的统一模型。借鉴了自然语言处理、计算机视觉和语音处理领域中预训练大模型的成功经验，NeuGPT 被设计用于处理多样化的神经记录，并与语音和文本数据进行交互。我们的模型主要关注脑到文本的解码，将 BLEU-1 的 SOTA 从 6.94 提升至 12.92，ROUGE-1F 从 6.93 提升至 13.06。此外，它还能模拟脑信号，从而作为一种新颖的神经接口。代码可在 \hrefthis https URLNeuSpeech/NeuGPT (this https URL) 获取。

[NLP-35] AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

【速读】：该论文试图解决在特定数据集上选择合适的检索增强生成 (RAG) 模块的问题。解决方案的关键在于提出了 AutoRAG 框架，该框架能够自动识别并优化适合特定数据集的 RAG 模块组合，从而提升模型在该数据集上的性能。AutoRAG 通过探索和近似最优的 RAG 模块组合，为不同数据集提供了定制化的解决方案。

链接: https://arxiv.org/abs/2410.20878
作者: Dongkyu Kim,Byoungwook Kim,Donggeon Han,Matouš Eibich
关键词-EN: Large Language Models, Large Language, Language Models, Retrieval-Augmented Generation, RAG modules
类目: Computation and Language (cs.CL)
备注: 20 pages

点击查看摘要

Abstract:Using LLMs (Large Language Models) in conjunction with external documents has made RAG (Retrieval-Augmented Generation) an essential technology. Numerous techniques and modules for RAG are being researched, but their performance can vary across different datasets. Finding RAG modules that perform well on specific datasets is challenging. In this paper, we propose the AutoRAG framework, which automatically identifies suitable RAG modules for a given dataset. AutoRAG explores and approximates the optimal combination of RAG modules for the dataset. Additionally, we share the results of optimizing a dataset using AutoRAG. All experimental results and data are publicly available and can be accessed through our GitHub repository this https URL .
摘要：将大语言模型 (LLM) 与外部文档结合使用，使得检索增强生成 (RAG) 成为一项关键技术。目前，针对 RAG 的研究涵盖了多种技术和模块，但这些技术在不同数据集上的表现存在差异。为特定数据集找到性能优异的 RAG 模块是一项挑战。本文提出了 AutoRAG 框架，该框架能够自动识别适用于给定数据集的 RAG 模块。AutoRAG 通过探索和近似，为数据集找到最优的 RAG 模块组合。此外，我们还分享了使用 AutoRAG 优化数据集的结果。所有实验结果和数据均已公开，可通过我们的 GitHub 仓库访问，链接为 this https URL。

[NLP-36] Reward Modeling with Weak Supervision for Language Models

【速读】：该论文试图解决在强化学习从人类反馈 (Reinforcement Learning from Human Feedback, RLHF) 过程中，依赖大量手动标注数据的高成本问题。解决方案的关键在于引入弱监督 (weak supervision) 作为一种扩展 RLHF 数据集的策略，通过利用噪声或不精确的数据标注来减少对昂贵手动标注数据的依赖。具体来说，研究者通过分析 RLHF 数据集，识别与响应偏好相关的启发式规则，编写简单的标注函数，并校准标签模型以弱标注未标注数据。这种方法在较小数据集上显著提升了奖励模型的性能，但在较大、原本已标注的数据集上效果有所下降。此外，利用大型语言模型 (LLM) 生成并弱标注响应数据，为扩展偏好数据提供了有前景的方法。

链接: https://arxiv.org/abs/2410.20869
作者: Ben Hauptvogel,Malte Ostendorff,Georg Rehm,Sebastian Möller
关键词-EN: Recent advancements, large language models, user intentions, reinforcement learning, advancements in large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained using responses preferences determined by human labelers or AI systems, which then refines the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation show that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate and then weakly label responses offers a promising method for extending preference data.
摘要：近年来，大语言模型 (Large Language Models, LLMs) 的进步推动了其在多种任务中的广泛应用，其中基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 是其训练过程中至关重要的部分，旨在使模型的响应与用户意图相一致。在 RLHF 过程中，通过人类标注者或 AI 系统确定的响应偏好来训练奖励模型，然后通过强化学习对大语言模型进行优化。本文提出了一种弱监督策略，以扩展 RLHF 数据集并提升奖励模型的性能。弱监督利用噪声或不精确的数据标注，减少了对昂贵的手动标注数据的依赖。通过分析 RLHF 数据集以识别与响应偏好相关的启发式规则，我们编写了简单的标注函数，并校准了标注模型以对未标注数据进行弱标注。我们的评估显示，尽管弱监督显著提升了较小数据集的奖励模型性能，但随着数据集规模的增大，其效果逐渐减弱。此外，利用大语言模型生成响应并进行弱标注，为扩展偏好数据提供了一种有前景的方法。

[NLP-37] A Simple Yet Effective Corpus Construction Framework for Indonesian Grammatical Error Correction

【速读】：该论文试图解决低资源语言（如印尼语）在语法错误纠正（GEC）领域缺乏高质量评估语料库的问题。解决方案的关键在于提出了一种构建GEC语料库的框架，并针对印尼语进行了具体实施。论文通过利用现有的预训练大型语言模型（LLMs），如GPT-3.5-Turbo和GPT-4，来简化和加速语料库的标注过程，从而克服了现有印尼语评估语料库的局限性。研究结果表明，这种方法在提升低资源语言环境下LLMs性能方面具有显著潜力。

链接: https://arxiv.org/abs/2410.20838
作者: Nankai Lin,Meiyu Zeng,Wentao Huang,Shengyi Jiang,Lixian Xiao,Aimin Yang
关键词-EN: English and Chinese, grammatical error correction, error correction, grammatical error, concentrated on universal
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Currently, the majority of research in grammatical error correction (GEC) is concentrated on universal languages, such as English and Chinese. Many low-resource languages lack accessible evaluation corpora. How to efficiently construct high-quality evaluation corpora for GEC in low-resource languages has become a significant challenge. To fill these gaps, in this paper, we present a framework for constructing GEC corpora. Specifically, we focus on Indonesian as our research language and construct an evaluation corpus for Indonesian GEC using the proposed framework, addressing the limitations of existing evaluation corpora in Indonesian. Furthermore, we investigate the feasibility of utilizing existing large language models (LLMs), such as GPT-3.5-Turbo and GPT-4, to streamline corpus annotation efforts in GEC tasks. The results demonstrate significant potential for enhancing the performance of LLMs in low-resource language settings. Our code and corpus can be obtained from this https URL.
摘要：目前，语法错误修正 (Grammatical Error Correction, GEC) 领域的研究主要集中在通用语言，如英语和中文。许多低资源语言缺乏可用的评估语料库。如何高效构建高质量的低资源语言 GEC 评估语料库已成为一个重大挑战。为了填补这些空白，本文提出了一种构建 GEC 语料库的框架。具体而言，我们以印尼语为研究对象，利用该框架构建了印尼语 GEC 的评估语料库，解决了现有印尼语评估语料库的局限性。此外，我们还探讨了利用现有的大语言模型 (Large Language Models, LLMs)，如 GPT-3.5-Turbo 和 GPT-4，来简化 GEC 任务中语料库标注工作的可行性。结果表明，在低资源语言环境下，LLMs 的性能提升具有显著潜力。我们的代码和语料库可通过此 https URL 获取。

[NLP-38] LLM s are Biased Evaluators But Not Biased for Retrieval Augmented Generation

【速读】：该论文试图解决的问题是：在大语言模型（LLMs）的评估任务中，尤其是在检索增强生成（RAG）框架下，模型是否存在对自身生成内容的偏好，以及这种偏好如何影响事实导向任务的准确性。解决方案的关键在于通过模拟RAG框架的两个关键阶段来研究这一问题：第一阶段是模拟点对点重排序过程，评估人类撰写和模型生成段落的适用性；第二阶段是进行成对阅读理解测试，模拟生成过程。研究结果表明，在RAG框架下，模型并未表现出显著的自我偏好效应，而是事实准确性显著影响了LLMs的输出，即使在缺乏先验知识的情况下也是如此。这一发现为开发更稳健和无偏的LLM系统提供了重要见解。

链接: https://arxiv.org/abs/2410.20833
作者: Yen-Shan Chen,Jing Jin,Peng-Ting Kuo,Chao-Wei Huang,Yun-Nung Chen
关键词-EN: large language models, favoring self-generated content, Recent studies, language models, self-generated content
类目: Computation and Language (cs.CL)
备注: 15 pages, 14 tables, 5 figures

点击查看摘要

Abstract:Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks-where keyword extraction and factual accuracy take precedence over stylistic elements-remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, we access the suitability of human-authored versus model-generated passages, emulating the pointwise reranking process. The second phase involves conducting pairwise reading comprehension tests to simulate the generation process. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs’ output, even in the absence of prior knowledge. Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based system, offering insights that may inform the development of more robust and unbiased LLM systems.
摘要：最近的研究表明，大语言模型 (LLMs) 在评估任务中表现出显著的偏见，特别是在优先评分和偏爱自生成内容方面。然而，这种偏见在事实导向任务中的表现程度，尤其是在检索增强生成 (RAG) 框架中——其中关键词提取和事实准确性优先于风格元素——尚不清楚。我们的研究通过模拟 RAG 框架的两个关键阶段来填补这一知识空白。在第一阶段，我们评估了人类撰写与模型生成段落的适用性，模拟了逐点重排序过程。第二阶段涉及进行成对阅读理解测试，以模拟生成过程。与先前研究中显示的评分任务中的自我偏好相反，我们的结果显示在 RAG 框架中没有显著的自我偏好效应。相反，我们观察到事实准确性显著影响 LLMs 的输出，即使在缺乏先验知识的情况下也是如此。我们的研究为关于 LLM 偏见及其对基于 RAG 系统影响的持续讨论做出了贡献，提供了可能指导开发更稳健和无偏见 LLM 系统的见解。

[NLP-39] he Zenos Paradox of `Low-Resource Languages EMNLP2024

【速读】：该论文试图解决自然语言处理 (NLP) 领域中对“低资源语言”定义不一致的问题。解决方案的关键在于通过定性分析150篇提及“低资源”关键词的论文，揭示了影响语言“低资源性”的多个相互作用的因素，并提出了明确的定义和考虑这些因素的重要性。论文建议在研究中明确界定“低资源语言”的术语，并为评估语言的低资源性提供多维度的参考框架。

链接: https://arxiv.org/abs/2410.20817
作者: Hellina Hailu Nigatu,Atnafu Lambebo Tonja,Benjamin Rosman,Thamar Solorio,Monojit Choudhury
关键词-EN: Natural Language Processing, studied in Natural, languages commonly studied, Language Processing, Natural Language
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024

点击查看摘要

Abstract:The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low vs high-resourced. However, there is limited consensus on what exactly qualifies as a low-resource language.' To understand how NLP papers define and study low resource’ languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword low-resource.' Based on our analysis, we show how several interacting axes contribute to low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.
摘要：自然语言处理 (NLP) 领域中常用语言的资源差异通常通过将语言区分为低资源与高资源来体现。然而，对于何为“低资源语言”，学术界尚未达成一致的定义。为了探究 NLP 论文如何定义和研究“低资源”语言，我们定性地分析了来自 ACL Anthology 和流行语音处理会议的 150 篇提及“低资源”关键词的论文。基于我们的分析，我们展示了多个相互作用的维度如何共同影响语言的“低资源性”，并解释了为何这使得追踪每种语言的进展变得困难。我们希望我们的工作能够 (1) 促使论文在使用相关术语时给出明确的定义，以及 (2) 为在将某种语言归类为低资源时需要考虑的不同维度提供基础。

[NLP-40] NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates NEURIPS2024

【速读】：该论文试图解决大型语言模型（LLMs）在处理实时信息（如新事实和新术语）时的局限性问题，主要原因是这些模型在其开发过程中存在知识截止点。现有的基准测试主要关注过时内容和有限领域，难以实时更新且未充分探索新术语。论文提出的解决方案是引入一个自适应基准测试，名为NewTerm，用于实时评估新术语。关键在于设计了一种高度自动化的构建方法，以最小化人工干预，确保高质量基准的构建，并实现对实时信息的灵活更新。实证结果表明，新术语导致LLMs的性能下降超过20%，即使模型更新其知识截止点，也无法完全覆盖更远的新术语。论文还分析了哪些类型的新术语更具挑战性，并探讨了LLMs在新术语处理上的困难，为未来研究铺平了道路。

链接: https://arxiv.org/abs/2410.20814
作者: Hexuan Deng,Wenxiang Jiao,Xuebo Liu,Min Zhang,Zhaopeng Tu
关键词-EN: large language models, large language, language models, development process, remarkable abilities
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Despite their remarkable abilities in various tasks, large language models (LLMs) still struggle with real-time information (e.g., new facts and terms) due to the knowledge cutoff in their development process. However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually. The benchmark and codes can be found at this https URL.
摘要：尽管大语言模型 (LLMs) 在多种任务中展现出卓越的能力，但由于开发过程中的知识截断，它们在处理实时信息（例如新的事实和术语）方面仍面临挑战。现有的基准测试主要关注过时的内容和有限的领域，难以实现实时更新，且对新术语的探索不足。为解决这一问题，我们提出了一种自适应基准测试，名为 NewTerm，用于实时评估新术语。我们设计了一种高度自动化的构建方法，以最小化人工干预，确保高质量基准测试的构建，并允许灵活更新实时信息。在多种大语言模型上的实证结果表明，新术语导致性能下降超过 20%。此外，尽管大语言模型的知识截断更新可以涵盖部分新术语，但无法推广到更远的新术语。我们还分析了哪些类型的术语更具挑战性，以及大语言模型为何在新术语上表现不佳，为未来的研究铺平了道路。最后，我们构建了 NewTerm 2022 和 2023，以评估每年更新的新术语，并将继续每年更新。基准测试及其代码可在以下链接找到：https URL。

[NLP-41] Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation

【速读】：该论文试图解决在复杂决策领域（如国际象棋）中，基于深度学习的专家模型虽然决策准确但难以解释的问题。解决方案的关键在于引入概念引导的国际象棋评论生成（Concept-guided Chess Commentary generation, CCC）和基于GPT的国际象棋评论评估（GPT-based Chess Commentary Evaluation, GCC-Eval）。CCC通过优先级概念解释，将专家模型的决策优势与大型语言模型（LLMs）的语言流畅性相结合，生成准确、信息丰富且流畅的评论。GCC-Eval则利用专家知识评估评论的信息量和语言质量，确保生成的评论既准确又易于理解。

链接: https://arxiv.org/abs/2410.20811
作者: Jaechang Kim,Jinmin Goh,Inseok Hwang,Jaewoong Cho,Jungseul Ok
关键词-EN: Deep learning-based expert, reached superhuman performance, Deep learning-based, chess commentary, reached superhuman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, it is under-explored to explain or comment on given decisions although it is important for human education and model explainability. The outputs of expert models are accurate, but yet difficult to interpret for humans. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.
摘要：基于深度学习的专家模型在象棋和围棋等决策领域已达到超人性能。然而，尽管对于人类教育和模型可解释性来说解释或评论给定决策非常重要，但这一领域尚未得到充分探索。专家模型的输出虽然准确，但对人类来说难以解读。另一方面，大语言模型（LLMs）虽然能生成流畅的评论，但由于其有限的决策能力，容易产生幻觉。为了弥合专家模型与大语言模型之间的这一差距，我们以国际象棋评论为例，探讨通过语言解释复杂决策过程，并解决评论的生成与评估问题。我们引入了概念引导的国际象棋评论生成（CCC）用于生成评论，以及基于GPT的国际象棋评论评估（GCC-Eval）用于评估评论。CCC通过基于优先级和概念的解释，将专家模型的决策优势与大语言模型的语言流畅性相结合。GCC-Eval利用专家知识，根据信息量和语言质量评估国际象棋评论。实验结果，经过人类评判和GCC-Eval验证，表明CCC生成的评论既准确、信息丰富又流畅。

[NLP-42] Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

【速读】：该论文试图解决在预训练大型语言模型（LLMs）时，如何通过重述自然文本数据来扩展和优化训练数据的问题。解决方案的关键在于开发和优化一个重述管道，该管道能够生成高质量的合成重述数据，并将其与原始数据集结合使用。通过在C4数据集和CulturaX的多语言子集（包括英语、德语、意大利语和西班牙语）上进行实验，研究团队展示了这种重述管道在单语言和多语言设置下，能够显著提升标准评估基准的性能。此外，研究还探讨了基础数据集和LLM选择对重述效果的影响，以及模型大小与预训练后性能之间的关系。研究结果表明，随着数据质量的提高，重述带来的性能提升逐渐减少，且不同模型家族之间的性能差异大于不同模型大小之间的差异，这强调了在选择LLM进行大规模数据重述前进行详细测试的必要性。

链接: https://arxiv.org/abs/2410.20796
作者: Michael Pieler,Marco Bellagente,Hannah Teufel,Duy Phung,Nathan Cooper,Jonathan Tow,Paulo Rocha,Reshinth Adithyan,Zaid Alyafeai,Nikhil Pinnaparaju,Maksym Zhuravinskyi,Carlos Riquelme
关键词-EN: Recently published work, Recently published, rephrasing natural text, natural text data, synthetically rephrased data
类目: Computation and Language (cs.CL)
备注: 21 pages, 4 figures, 12 tables

点击查看摘要

Abstract:Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline to the English, German, Italian, and Spanish Oscar subsets of CulturaX. Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup. In addition, we provide a detailed study of our pipeline, investigating the choice of the base dataset and LLM for the rephrasing, as well as the relationship between the model size and the performance after pre-training. By exploring data with different perceived quality levels, we show that gains decrease with higher quality. Furthermore, we find the difference in performance between model families to be bigger than between different model sizes. This highlights the necessity for detailed tests before choosing an LLM to rephrase large amounts of data. Moreover, we investigate the effect of pre-training with synthetic data on supervised fine-tuning. Here, we find increasing but inconclusive results that highly depend on the used benchmark. These results (again) highlight the need for better benchmarking setups. In summary, we show that rephrasing multilingual and low-quality data is a very promising direction to extend LLM pre-training data.
摘要：最近发表的研究表明，在预训练大语言模型 (LLM) 时，将原始数据集与合成重述的数据集结合使用，可以取得显著的成果。我们在前人工作的基础上，通过复现 C4 数据集上的现有结果，并将其扩展到我们优化的重述流程中，应用于 CulturaX 的英语、德语、意大利语和西班牙语 Oscar 子集。我们的流程在单语和多语设置的标准评估基准上均提升了性能。此外，我们详细研究了我们的流程，探讨了重述过程中基础数据集和大语言模型 (LLM) 的选择，以及模型规模与预训练后性能之间的关系。通过探索不同感知质量水平的数据，我们发现随着数据质量的提高，增益逐渐减少。此外，我们发现不同模型家族之间的性能差异大于不同模型规模之间的差异。这突显了在选择用于重述大量数据的大语言模型 (LLM) 之前进行详细测试的必要性。此外，我们还研究了使用合成数据进行预训练对监督微调的影响。在此，我们发现结果虽有增加但并不确定，这高度依赖于所使用的基准。这些结果再次强调了改进基准测试设置的必要性。总之，我们展示了重述多语言和低质量数据是扩展大语言模型 (LLM) 预训练数据的非常有前景的方向。

[NLP-43] Deep Learning for Medical Text Processing: BERT Model Fine-Tuning and Comparative Study

【速读】：该论文试图解决当前医疗信息爆炸带来的挑战，即如何从海量的医疗文献中快速提取关键信息并生成准确、连贯的摘要。解决方案的关键在于基于BERT模型进行微调与优化，开发出高效的摘要生成系统。通过实验对比Seq-Seq、Attention、Transformer和BERT等模型，发现改进后的BERT模型在Rouge和Recall指标上具有显著优势，并展示了知识蒸馏技术进一步增强模型性能的潜力。该系统在实际应用中表现出强大的通用性和效率，为医疗文献的快速筛选和分析提供了可靠的工具。

链接: https://arxiv.org/abs/2410.20792
作者: Jiacheng Hu,Yiru Cang,Guiran Liu,Meiqi Wang,Weijie He,Runyuan Bao
关键词-EN: generation method based, summary generation method, BERT model, literature summary generation, medical literature summary
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper proposes a medical literature summary generation method based on the BERT model to address the challenges brought by the current explosion of medical information. By fine-tuning and optimizing the BERT model, we develop an efficient summary generation system that can quickly extract key information from medical literature and generate coherent, accurate summaries. In the experiment, we compared various models, including Seq-Seq, Attention, Transformer, and BERT, and demonstrated that the improved BERT model offers significant advantages in the Rouge and Recall metrics. Furthermore, the results of this study highlight the potential of knowledge distillation techniques to further enhance model performance. The system has demonstrated strong versatility and efficiency in practical applications, offering a reliable tool for the rapid screening and analysis of medical literature.
摘要：本文提出了一种基于 BERT 模型的医学文献摘要生成方法，以应对当前医学信息爆炸带来的挑战。通过对 BERT 模型进行微调和优化，我们开发了一个高效的摘要生成系统，能够快速从医学文献中提取关键信息并生成连贯、准确的摘要。在实验中，我们对比了多种模型，包括 Seq-Seq、Attention、Transformer 和 BERT，结果表明改进后的 BERT 模型在 Rouge 和 Recall 指标上具有显著优势。此外，本研究的结果还突显了知识蒸馏技术进一步增强模型性能的潜力。该系统在实际应用中展现了强大的通用性和效率，为医学文献的快速筛选和分析提供了一个可靠的工具。

[NLP-44] SCULPT: Systematic Tuning of Long Prompts

【速读】：该论文试图解决优化长且非结构化提示（long, unstructured prompts）的问题，特别是在大型语言模型（large language models）中解决复杂任务时。解决方案的关键是引入了一个名为SCULPT（Systematic Tuning of Long Prompts）的新框架。SCULPT通过将提示结构化并采用迭代式actor-critic机制来系统地优化长提示。为了增强鲁棒性和泛化能力，SCULPT利用了两种互补的反馈机制：初步评估（Preliminary Assessment）和错误评估（Error Assessment）。通过聚合这些反馈机制的反馈，SCULPT避免了过拟合，并确保了性能的持续改进。实验结果表明，SCULPT在处理错误和未对齐提示时显著提高了准确性和鲁棒性，并持续优于现有方法，成为优化长提示的可扩展解决方案。

链接: https://arxiv.org/abs/2410.20788
作者: Shanu Kumar,Akhila Yesantarao Venkata,Shubhanshu Khandelwal,Bishal Santra,Parag Agrawal,Manish Gupta
关键词-EN: large language models, solving complex tasks, large language, language models, models become increasingly
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models become increasingly central to solving complex tasks, the challenge of optimizing long, unstructured prompts has become critical. Existing optimization techniques often struggle to effectively handle such prompts, leading to suboptimal performance. We introduce SCULPT (Systematic Tuning of Long Prompts), a novel framework that systematically refines long prompts by structuring them hierarchically and applying an iterative actor-critic mechanism. To enhance robustness and generalizability, SCULPT utilizes two complementary feedback mechanisms: Preliminary Assessment, which assesses the prompt’s structure before execution, and Error Assessment, which diagnoses and addresses errors post-execution. By aggregating feedback from these mechanisms, SCULPT avoids overfitting and ensures consistent improvements in performance. Our experimental results demonstrate significant accuracy gains and enhanced robustness, particularly in handling erroneous and misaligned prompts. SCULPT consistently outperforms existing approaches, establishing itself as a scalable solution for optimizing long prompts across diverse and real-world tasks.
摘要：随着大语言模型在解决复杂任务中的作用日益突出，优化长且非结构化的提示词已成为一个关键挑战。现有的优化技术往往难以有效处理此类提示词，导致性能不佳。我们提出了 SCULPT（Systematic Tuning of Long Prompts），这是一个新颖的框架，通过将长提示词分层结构化并应用迭代式 Actor-Critic 机制，系统地对其进行优化。为了增强鲁棒性和泛化能力，SCULPT 采用了两种互补的反馈机制：初步评估（Preliminary Assessment），在执行前评估提示词的结构；错误评估（Error Assessment），在执行后诊断并解决错误。通过聚合这些反馈机制的信息，SCULPT 避免了过拟合，并确保了性能的持续改进。我们的实验结果显示，在处理错误和未对齐的提示词时，SCULPT 显著提高了准确性并增强了鲁棒性。SCULPT 持续优于现有的方法，成为优化跨多样和真实世界任务长提示词的可扩展解决方案。

[NLP-45] Graph-based Uncertainty Metrics for Long-form Language Model Outputs NEURIPS2024

【速读】：该论文试图解决长文本生成式大语言模型 (Large Language Models, LLMs) 在生成过程中出现的幻觉问题，以及在长文本生成中进行细粒度不确定性估计的挑战。解决方案的关键在于提出了“图不确定性” (Graph Uncertainty) 方法，通过将LLM生成内容与其中的声明构建为二分图 (bipartite graph)，并利用图中心性度量 (graph centrality metrics) 来估计声明级别的不确定性。具体来说，该方法不仅将现有的基于自一致性 (self-consistency) 的不确定性估计方法视为使用度中心性 (degree centrality) 作为不确定性度量，还展示了更复杂的中心性度量如接近中心性 (closeness centrality) 在声明级别不确定性估计中的持续改进。此外，论文还提出了不确定性感知的解码技术，结合图结构和不确定性估计来提高LLM生成内容的事实性，从而在各种长文本生成场景中实现平均6.8%的AUPRC相对提升，并在事实性和生成响应的信息量方面提供一致的2-4%的改进。

链接: https://arxiv.org/abs/2410.20783
作者: Mingjian Jiang,Yangjun Ruan,Prasanna Sattigeri,Salim Roukos,Tatsunori Hashimoto
关键词-EN: Large Language Models, Language Models, Large Language, Recent advancements, advancements in Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted as a Spotlight paper at NeurIPS 2024

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty – which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
摘要：近年来，大语言模型 (LLM) 的进步显著提升了文本生成能力，但这些系统仍存在幻觉现象，且对长文本生成进行细粒度的不确定性估计仍然具有挑战性。在本研究中，我们提出了图不确定性 (Graph Uncertainty) —— 将 LLM 生成内容与其中的声明之间的关系表示为二部图，并利用一系列图中心性度量来估计声明级别的不确定性。在这种视角下，基于自一致性概念的现有不确定性估计方法可以视为使用度中心性作为不确定性度量，而我们展示了更复杂的替代方案，如接近中心性，在声明级别的不确定性估计中提供了持续的改进。此外，我们提出了不确定性感知的解码技术，该技术结合了图结构和不确定性估计，通过保留最可靠的声明来提高 LLM 生成内容的事实性。与现有方法相比，我们的基于图的不确定性度量在各种长文本生成设置下实现了平均 6.8% 的 AUPRC 相对提升，并且我们的端到端系统在事实性方面比现有解码技术提供了持续的 2-4% 的提升，同时显著提高了生成响应的信息量。

[NLP-46] Decoding Reading Goals from Eye Movements

【速读】：该论文试图解决的问题是：读者在阅读文本时的不同目标（如信息寻求和普通阅读）是否可以通过其眼球运动的模式来解码。解决方案的关键在于：利用大规模的眼动追踪数据，应用多种先进的模型（包括不同架构和数据表示策略的模型），并引入一个新的模型集成方法，系统地评估这些模型在不同泛化水平（新文本项、新参与者以及两者的组合）上的表现。研究发现，眼球运动包含了对这一任务非常有价值的信息，并通过错误分析揭示了影响任务难度的关键文本项和参与者眼球运动的特性。

链接: https://arxiv.org/abs/2410.20779
作者: Omer Shubi,Cfir Avraham Hadar,Yevgeni Berzak
关键词-EN: eye movements, Readers, eye, movements, reading
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Readers can have different goals with respect to the text they are reading. Can these goals be decoded from the pattern of their eye movements over the text? In this work, we examine for the first time whether it is possible to decode two types of reading goals that are common in daily life: information seeking and ordinary reading. Using large scale eye-tracking data, we apply to this task a wide range of state-of-the-art models for eye movements and text that cover different architectural and data representation strategies, and further introduce a new model ensemble. We systematically evaluate these models at three levels of generalization: new textual item, new participant, and the combination of both. We find that eye movements contain highly valuable signals for this task. We further perform an error analysis which builds on prior empirical findings on differences between ordinary reading and information seeking and leverages rich textual annotations. This analysis reveals key properties of textual items and participant eye movements that contribute to the difficulty of the task.
摘要：读者在阅读文本时可能具有不同的目标。这些目标能否从他们的眼动模式中解码出来？在本研究中，我们首次探讨了是否可以从眼动模式中解码两种日常生活中常见的阅读目标：信息搜索和普通阅读。我们利用大规模眼动追踪数据，应用了一系列涵盖不同架构和数据表示策略的先进眼动和文本模型，并进一步引入了一种新的模型集成方法。我们系统地评估了这些模型在三种泛化水平上的表现：新的文本项目、新的参与者，以及两者的结合。结果发现，眼动数据在此任务中包含了高度有价值的信号。我们还进行了错误分析，该分析基于先前关于普通阅读和信息搜索之间差异的实证发现，并利用了丰富的文本注释。这一分析揭示了影响任务难度的文本项目和参与者眼动模式的关键属性。

[NLP-47] KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation NEURIPS

【速读】：该论文试图解决大型语言模型（LLMs）在下游任务中高计算和内存需求的问题。解决方案的关键在于提出了一种结合低秩适应（LoRA）和知识蒸馏（KD）的新型微调方法，称为KD-LoRA。这种方法在保持与全量微调（FFT）和LoRA相当性能的同时，显著减少了资源需求，包括模型大小、GPU内存使用和推理时间。具体来说，KD-LoRA在GLUE基准测试中保留了LoRA 98%的性能，同时模型大小减少了40%，GPU内存使用减少了30%，推理时间也减少了30%。

链接: https://arxiv.org/abs/2410.20777
作者: Rambod Azimi,Rishav Rishav,Marek Teichmann,Samira Ebrahimi Kahou
关键词-EN: Large language models, Large language, downstream tasks, demonstrated remarkable performance, demonstrated remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024)

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various downstream tasks. However, the high computational and memory requirements of LLMs are a major bottleneck. To address this, parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) have been proposed to reduce computational costs while ensuring minimal loss in performance. Additionally, knowledge distillation (KD) has been a popular choice for obtaining compact student models from teacher models. In this work, we present KD-LoRA, a novel fine-tuning method that combines LoRA with KD. Our results demonstrate that KD-LoRA achieves performance comparable to full fine-tuning (FFT) and LoRA while significantly reducing resource requirements. Specifically, KD-LoRA retains 98% of LoRA’s performance on the GLUE benchmark, while being 40% more compact. Additionally, KD-LoRA reduces GPU memory usage by 30% compared to LoRA, while decreasing inference time by 30% compared to both FFT and LoRA. We evaluate KD-LoRA across three encoder-only models: BERT, RoBERTa, and DeBERTaV3. Code is available at this https URL.
摘要：大语言模型（LLMs）在各种下游任务中展现了卓越的性能。然而，LLMs 的高计算和内存需求是其主要瓶颈。为解决这一问题，参数高效微调（PEFT）方法，如低秩适应（LoRA），已被提出以降低计算成本，同时确保性能损失最小。此外，知识蒸馏（KD）已成为从教师模型中获取紧凑学生模型的流行选择。在本研究中，我们提出了 KD-LoRA，一种结合 LoRA 与 KD 的新型微调方法。我们的结果表明，KD-LoRA 在性能上可与全量微调（FFT）和 LoRA 相媲美，同时显著减少了资源需求。具体而言，KD-LoRA 在 GLUE 基准测试中保留了 LoRA 98% 的性能，同时模型尺寸缩小了 40%。此外，与 LoRA 相比，KD-LoRA 减少了 30% 的 GPU 内存使用，并且在推理时间上比 FFT 和 LoRA 减少了 30%。我们在三种仅编码器模型上评估了 KD-LoRA：BERT、RoBERTa 和 DeBERTaV3。代码可在以下链接获取：https URL。

[NLP-48] Are LLM -Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM -based Evaluation

【速读】：该论文试图解决的问题是：在大型语言模型（LLMs）生成的输出中使用认知标记（epistemic markers）是否会导致意外的负面后果。解决方案的关键在于提出了一个名为EMBER的基准测试，用于评估LLM-judges在单项和成对评估设置中对认知标记的鲁棒性。研究结果表明，所有测试的LLM-judges，包括GPT-4o，在存在认知标记的情况下都表现出显著的鲁棒性不足，特别是对表达不确定性的标记存在更强的负面偏见。这表明LLM-judges受到这些标记的影响，而不仅仅是关注内容的正确性。

链接: https://arxiv.org/abs/2410.20774
作者: Dongryeol Lee,Yerin Hwang,Yongil Kim,Joonsuk Park,Kyomin Jung
关键词-EN: large language models, train large language, epistemic markers, principle of honesty, language models
类目: Computation and Language (cs.CL)
备注: 21 pages, 6 figures, 15 tables

点击查看摘要

Abstract:In line with the principle of honesty, there has been a growing effort to train large language models (LLMs) to generate outputs containing epistemic markers. However, evaluation in the presence of epistemic markers has been largely overlooked, raising a critical question: Could the use of epistemic markers in LLM-generated outputs lead to unintended negative consequences? To address this, we present EMBER, a benchmark designed to assess the robustness of LLM-judges to epistemic markers in both single and pairwise evaluation settings. Our findings, based on evaluations using EMBER, reveal that all tested LLM-judges, including GPT-4o, show a notable lack of robustness in the presence of epistemic markers. Specifically, we observe a negative bias toward epistemic markers, with a stronger bias against markers expressing uncertainty. This suggests that LLM-judges are influenced by the presence of these markers and do not focus solely on the correctness of the content.
摘要：基于诚实原则，越来越多的努力被投入到训练大语言模型 (LLM) 以生成包含认知标记的输出。然而，在存在认知标记的情况下进行评估的工作却被大大忽视，这引发了一个关键问题：在 LLM 生成的输出中使用认知标记是否会导致意外的负面后果？为了解决这一问题，我们提出了 EMBER，这是一个旨在评估 LLM 评判器在单一和成对评估设置中对认知标记的鲁棒性的基准。基于使用 EMBER 进行的评估，我们的研究结果表明，包括 GPT-4o 在内的所有测试的 LLM 评判器在存在认知标记的情况下都表现出显著的鲁棒性不足。具体而言，我们观察到对认知标记的负面偏见，尤其是对表达不确定性的标记表现出更强的偏见。这表明 LLM 评判器受到这些标记存在的影响，并未完全专注于内容的正确性。

[NLP-49] MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

【速读】：该论文试图解决现有基于子词分词（subword tokenization）模型的几个主要问题，包括对字符级噪声（如拼写错误）的敏感性和不同语言及文字间压缩率的不一致性。尽管字符或字节级模型（如ByT5）尝试解决这些问题，但它们尚未广泛采用，因为处理原始字节流而不进行分词会导致序列长度显著增加，从而使训练和推理效率低下。论文提出的解决方案是引入MrT5（MergeT5），这是ByT5的一个更高效变体，其在编码器中集成了一个token删除机制，以动态缩短输入序列长度。具体来说，在经过固定数量的编码器层处理后，一个学习到的删除门（delete gate）决定哪些token将被移除，哪些将被保留以供后续层处理。MrT5通过将删除token中的关键信息合并到更紧凑的序列中，利用剩余token的上下文信息，从而在保持性能的同时显著减少序列长度，提高推理效率。

链接: https://arxiv.org/abs/2410.20771
作者: Julie Kallini,Shikhar Murty,Christopher D. Manning,Christopher Potts,Róbert Csordás
关键词-EN: inconsistent compression rates, rely on subword, noise like spelling, spelling errors, errors and inconsistent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption – processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learnt delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively ``merges’’ critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance. When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages, with significant additional improvements following multilingual training. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI and character-level tasks while reducing sequence lengths by up to 80%. Our approach presents a solution to the practical limitations of existing byte-level models.
摘要：依赖于子词 Tokenization 的模型存在显著的缺陷，例如对拼写错误等字符级噪声的敏感性，以及不同语言和文字之间压缩率的不一致性。尽管像 ByT5 这样的字符级或字节级模型试图解决这些问题，但它们并未得到广泛采用——不进行 Tokenization 直接处理原始字节流会导致序列长度显著增加，从而使得训练和推理效率低下。本文提出了 MrT5 (MergeT5)，这是 ByT5 的一个更高效的变体，在其编码器中集成了一个 Token 删除机制，以动态缩短输入序列长度。在经过固定数量的编码器层处理后，一个学习到的删除门控机制决定哪些 Token 将被移除，哪些将被保留用于后续层。MrT5 有效地将已删除 Token 中的关键信息“合并”到更紧凑的序列中，利用剩余 Token 的上下文信息。在持续预训练实验中，我们发现 MrT5 能够在对性能影响最小的情况下显著提升推理运行时间。在英语文本上训练时，MrT5 展示了其删除特征在多种语言间零样本迁移的能力，并且在多语言训练后获得了显著的额外改进。此外，MrT5 在下游评估任务（如 XNLI 和字符级任务）中表现出与 ByT5 相当的准确性，同时将序列长度减少了高达 80%。我们的方法为现有字节级模型的实际局限性提供了一个解决方案。

[NLP-50] A Static and Dynamic Attention Framework for Multi Turn Dialogue Generation

【速读】：该论文试图解决开放域多轮对话生成中对话历史上下文语义建模的问题。传统基于循环神经网络（RNN）的层次编码方法在处理长对话历史时面临梯度消失问题，导致难以有效捕捉对话的上下文信息。论文提出的解决方案之关键是采用静态和动态注意力机制（static and dynamic attention-based approach）来建模对话历史，从而生成多轮对话响应。静态注意力用于捕捉对话历史中的全局信息，而动态注意力则根据当前输入动态调整注意力分布，以更好地捕捉局部和上下文相关的信息。实验结果表明，该方法在Ubuntu和Opensubtitles数据集上的自动和人工评估指标上均表现出色，验证了其在多轮对话生成中的有效性。

链接: https://arxiv.org/abs/2410.20766
作者: Wei-Nan Zhang,Yiming Cui,Kaiyan Zhang,Yifa Wang,Qingfu Zhu,Lingzhi Li,Ting Liu
关键词-EN: multi turn dialogue, domain multi turn, open domain multi, open domain dialogue, open domain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: published as a journal paper at ACM Transactions on Information Systems 2023. 30 pages, 6 figures

点击查看摘要

Abstract:Recently, research on open domain dialogue systems have attracted extensive interests of academic and industrial researchers. The goal of an open domain dialogue system is to imitate humans in conversations. Previous works on single turn conversation generation have greatly promoted the research of open domain dialogue systems. However, understanding multiple single turn conversations is not equal to the understanding of multi turn dialogue due to the coherent and context dependent properties of human dialogue. Therefore, in open domain multi turn dialogue generation, it is essential to modeling the contextual semantics of the dialogue history, rather than only according to the last utterance. Previous research had verified the effectiveness of the hierarchical recurrent encoder-decoder framework on open domain multi turn dialogue generation. However, using RNN-based model to hierarchically encoding the utterances to obtain the representation of dialogue history still face the problem of a vanishing gradient. To address this issue, in this paper, we proposed a static and dynamic attention-based approach to model the dialogue history and then generate open domain multi turn dialogue responses. Experimental results on Ubuntu and Opensubtitles datasets verify the effectiveness of the proposed static and dynamic attention-based approach on automatic and human evaluation metrics in various experimental settings. Meanwhile, we also empirically verify the performance of combining the static and dynamic attentions on open domain multi turn dialogue generation.
摘要：近年来，开放域对话系统的研究吸引了学术界和工业界的广泛关注。开放域对话系统的目标是在对话中模仿人类的行为。以往关于单轮对话生成的工作极大地推动了开放域对话系统的研究。然而，理解多个单轮对话并不等同于理解多轮对话，因为人类对话具有连贯性和上下文依赖性。因此，在开放域多轮对话生成中，建模对话历史的上下文语义至关重要，而不仅仅是依赖于最后一轮的表达。以往的研究已经验证了层次化循环编码器-解码器框架在开放域多轮对话生成中的有效性。然而，使用基于RNN的模型来层次化编码话语以获取对话历史的表示仍然面临梯度消失的问题。为了解决这一问题，本文提出了一种基于静态和动态注意力的方法来建模对话历史，并生成开放域多轮对话的响应。在Ubuntu和Opensubtitles数据集上的实验结果验证了所提出的静态和动态注意力方法在各种实验设置下的自动评估和人工评估指标上的有效性。同时，我们还通过实验验证了结合静态和动态注意力在开放域多轮对话生成中的性能。

[NLP-51] Evaluating LLM s for Targeted Concept Simplification forDomain-Specific Texts EMNLP2024

【速读】：该论文试图解决成人读者在阅读不熟悉领域的复杂文本（如科学文章）时遇到的困难，特别是由于缺乏上下文和对困难概念的不熟悉所导致的理解障碍。解决方案的关键在于提出了“目标概念简化”（targeted concept simplification）任务，即通过重写文本以帮助读者理解包含不熟悉概念的文本。论文还引入了WikiDomains数据集，包含22k个来自13个学术领域的定义，每个定义都配有一个困难概念。通过对比开源和商业大型语言模型（LLMs）以及简单的词典基线，论文评估了这些模型在概念简化任务中的表现，并发现人类评委更倾向于对困难概念的解释而非概念短语的简化。此外，研究还揭示了自动化指标与人类评估之间的低相关性（约0.2），这为个性化阅读理解支持的研究提供了丰富的探索空间。

链接: https://arxiv.org/abs/2410.20763
作者: Sumit Asthana,Hannah Rashkin,Elizabeth Clark,Fantine Huot,Mirella Lapata
关键词-EN: application of NLP, scientific articles, NLP models, NLP, text
类目: Computation and Language (cs.CL)
备注: to appear in proceedings of EMNLP 2024

点击查看摘要

Abstract:One useful application of NLP models is to support people in reading complex text from unfamiliar domains (e.g., scientific articles). Simplifying the entire text makes it understandable but sometimes removes important details. On the contrary, helping adult readers understand difficult concepts in context can enhance their vocabulary and knowledge. In a preliminary human study, we first identify that lack of context and unfamiliarity with difficult concepts is a major reason for adult readers’ difficulty with domain-specific text. We then introduce “targeted concept simplification,” a simplification task for rewriting text to help readers comprehend text containing unfamiliar concepts. We also introduce WikiDomains, a new dataset of 22k definitions from 13 academic domains paired with a difficult concept within each definition. We benchmark the performance of open-source and commercial LLMs and a simple dictionary baseline on this task across human judgments of ease of understanding and meaning preservation. Interestingly, our human judges preferred explanations about the difficult concept more than simplification of the concept phrase. Further, no single model achieved superior performance across all quality dimensions, and automated metrics also show low correlations with human evaluations of concept simplification ( \sim0.2 ), opening up rich avenues for research on personalized human reading comprehension support.
摘要：自然语言处理模型的一个有用应用是帮助人们在阅读来自不熟悉领域的复杂文本（例如科学文章）时提供支持。简化整个文本虽然使其易于理解，但有时会丢失重要细节。相反，帮助成年读者在上下文中理解困难概念可以增强他们的词汇量和知识。在一项初步的人类研究中，我们首先确定缺乏上下文和与困难概念的不熟悉性是成年读者难以理解特定领域文本的主要原因。然后，我们引入了“目标概念简化”，这是一种重写文本以帮助读者理解包含不熟悉概念的文本的简化任务。我们还引入了WikiDomains，这是一个包含22,000个定义的新数据集，涵盖13个学术领域，每个定义都配有一个困难概念。我们在此任务上对开源和商业大语言模型以及一个简单的词典基线进行了基准测试，评估了人类对理解难易程度和意义保留的判断。有趣的是，我们的人类评判者更喜欢关于困难概念的解释，而不是概念短语的简化。此外，没有单一模型在所有质量维度上表现出色，自动化指标也显示出与人类对概念简化的评估（约0.2）的低相关性，这为个性化人类阅读理解支持的研究开辟了丰富的途径。

[NLP-52] PlantimesRAG: Planning-guided Retrieval Augmented Generation

【速读】：该论文试图解决现有检索增强生成 (Retrieval Augmented Generation, RAG) 框架中存在的效率低下和生成内容不可靠的问题。解决方案的关键在于引入了一种新的框架——规划引导的检索增强生成 (Planning-guided Retrieval Augmented Generation, Plan × RAG)，该框架通过将推理计划形式化为有向无环图 (Directed Acyclic Graph, DAG)，将查询分解为相互关联的原子子查询，从而实现并行化的检索和生成，显著提高了效率。此外，Plan × RAG 采用冻结的语言模型 (Language Models, LMs) 作为即插即用的专家，生成高质量答案，同时通过结构化的子查询分解，有效减少了幻觉现象并增强了归因性，从而确保了生成内容的可靠性和可追溯性。

链接: https://arxiv.org/abs/2410.20753
作者: Prakhar Verma,Sukruta Prakash Midigeshi,Gaurav Sinha,Arno Solin,Nagarajan Natarajan,Amit Sharma
关键词-EN: Planning-guided Retrieval Augmented, introduce Planning-guided Retrieval, Retrieval Augmented Generation, introduce Planning-guided, Planning-guided Retrieval
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, preprint

点击查看摘要

Abstract:We introduce Planning-guided Retrieval Augmented Generation (Plan \times RAG), a novel framework that augments the \emphretrieve-then-reason paradigm of existing RAG frameworks to \emphplan-then-retrieve. Plan \times RAG formulates a reasoning plan as a directed acyclic graph (DAG), decomposing queries into interrelated atomic sub-queries. Answer generation follows the DAG structure, allowing significant gains in efficiency through parallelized retrieval and generation. While state-of-the-art RAG solutions require extensive data generation and fine-tuning of language models (LMs), Plan \times RAG incorporates frozen LMs as plug-and-play experts to generate high-quality answers. Compared to existing RAG solutions, Plan \times RAG demonstrates significant improvements in reducing hallucinations and bolstering attribution due to its structured sub-query decomposition. Overall, Plan \times RAG offers a new perspective on integrating external knowledge in LMs while ensuring attribution by design, contributing towards more reliable LM-based systems.
摘要：我们提出了规划引导的检索增强生成框架（Planning-guided Retrieval Augmented Generation，Plan × RAG），这是一种新颖的框架，它将现有RAG框架的“检索-推理”范式扩展为“规划-检索”。Plan × RAG将推理计划形式化为有向无环图（DAG），将查询分解为相互关联的原子子查询。答案生成遵循DAG结构，通过并行化的检索和生成显著提高了效率。尽管最先进的RAG解决方案需要大量的数据生成和语言模型（LMs）的微调，但Plan × RAG采用冻结的LMs作为即插即用的专家，以生成高质量的答案。与现有的RAG解决方案相比，Plan × RAG在减少幻觉和增强归因方面表现出显著的改进，这得益于其结构化的子查询分解。总体而言，Plan × RAG为在LMs中整合外部知识提供了新的视角，并通过设计确保了归因，有助于构建更可靠的基于LM的系统。

[NLP-53] Matryoshka: Learning to Drive Black-Box LLM s with LLM s

【速读】：该论文试图解决黑箱大型语言模型（LLMs）在推理、规划和个性化等能力上的不透明性问题，这些模型的固有黑箱特性阻碍了其能力的进一步提升。解决方案的关键在于引入了一个轻量级的白箱LLM控制器，称为Matryoshika，它通过将复杂任务分解为一系列中间输出来指导大规模黑箱LLM生成器。具体来说，Matryoshika将黑箱LLM视为一个环境，并通过提供中间指导提示来驱动黑箱LLM，使其输出在迭代交互中与偏好对齐，从而实现可控的多轮生成和自我改进。通过这种控制器-生成器框架，Matryoshika提供了一种透明且实用的解决方案，通过可控的多轮生成来改进黑箱LLMs，而不依赖于模型参数的额外训练。

链接: https://arxiv.org/abs/2410.20749
作者: Changhao Li,Yuchen Zhuang,Rushi Qiang,Haotian Sun,Hanjun Dai,Chao Zhang,Bo Dai
关键词-EN: impressive generative abilities, inherent opacity hinders, black-box large language, large language models, black-box LLM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation or in-context learning, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshika, a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with Matryoshika serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. Matryoshika is trained to pivot the outputs of the black-box LLM aligning with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on three diverse tasks demonstrate that Matryoshika effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks, including reasoning, planning, and personalization. By leveraging this pioneering controller-generator framework to mitigate dependence on model parameters, Matryoshika provides a transparent and practical solution for improving black-box LLMs through controllable multi-turn generation using white-box LLMs.
摘要：尽管黑箱大语言模型（Large Language Models, LLMs）在生成能力方面表现出色，但其固有的不透明性阻碍了在推理、规划和个性化等能力上的进一步发展。现有的研究工作旨在通过领域特定适应或上下文学习来增强LLM的能力，这些方法需要对可访问的模型参数进行额外训练，这对于黑箱LLM来说是一个不可行的选择。为了应对这一挑战，我们引入了Matryoshika，一个轻量级的白箱LLM控制器，通过将复杂任务分解为一系列中间输出来指导大规模黑箱LLM生成器。具体而言，我们将黑箱LLM视为一个环境，Matryoshika作为策略，通过提示提供中间指导以驱动黑箱LLM。Matryoshika经过训练，能够在迭代交互中调整黑箱LLM的输出，使其与偏好对齐，从而实现可控的多轮生成和在优化中间指导方面的自我改进。在三个不同任务上的实证评估表明，Matryoshika在复杂的长周期任务中，包括推理、规划和个性化，有效地增强了黑箱LLM的能力。通过利用这一开创性的控制器-生成器框架来减少对模型参数的依赖，Matryoshika为通过可控的多轮生成使用白箱LLM改进黑箱LLM提供了一个透明且实用的解决方案。

[NLP-54] ElectionSim: Massive Population Election Simulation Powered by Large Language Model Driven Agents

【速读】：该论文试图解决传统基于代理的建模方法（Agent-Based Modeling, ABM）在处理复杂个体背景信息和提供交互式预测结果方面的局限性。解决方案的关键在于引入基于大语言模型（Large Language Models）的选举模拟框架ElectionSim，该框架能够支持精确的选民模拟和定制化分布，并通过交互式平台与模拟选民进行对话。论文还通过从社交媒体平台抽样百万级别的选民池，以及引入基于民调的总统选举基准（Poll-based Presidential Election, PPE）来评估框架在美国总统选举场景下的性能，从而展示其有效性和鲁棒性。

链接: https://arxiv.org/abs/2410.20746
作者: Xinnong Zhang,Jiayu Lin,Libo Sun,Weihong Qi,Yihang Yang,Yue Chen,Hanjia Lyu,Xinyi Mou,Siming Chen,Jiebo Luo,Xuanjing Huang,Shiping Tang,Zhongyu Wei
关键词-EN: massive population election, massive population, preferences of specific, specific groups, population election simulation
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 41 pages, 13 figures

点击查看摘要

Abstract:The massive population election simulation aims to model the preferences of specific groups in particular election scenarios. It has garnered significant attention for its potential to forecast real-world social trends. Traditional agent-based modeling (ABM) methods are constrained by their ability to incorporate complex individual background information and provide interactive prediction results. In this paper, we introduce ElectionSim, an innovative election simulation framework based on large language models, designed to support accurate voter simulations and customized distributions, together with an interactive platform to dialogue with simulated voters. We present a million-level voter pool sampled from social media platforms to support accurate individual simulation. We also introduce PPE, a poll-based presidential election benchmark to assess the performance of our framework under the U.S. presidential election scenario. Through extensive experiments and analyses, we demonstrate the effectiveness and robustness of our framework in U.S. presidential election simulations.
摘要：大规模人口选举模拟旨在模拟特定群体在特定选举场景中的偏好，因其预测现实世界社会趋势的潜力而备受关注。传统的基于智能体建模 (Agent-based Modeling, ABM) 方法受限于其整合复杂个体背景信息的能力，以及提供互动预测结果的能力。本文中，我们介绍了 ElectionSim，一种基于大语言模型的创新选举模拟框架，旨在支持精确的选民模拟和定制化分布，并配备一个互动平台，以便与模拟选民进行对话。我们展示了一个从社交媒体平台抽取的百万级选民池，以支持精确的个体模拟。此外，我们引入了基于民调的总统选举基准 PPE，以评估我们的框架在美国总统选举场景下的表现。通过广泛的实验和分析，我们证明了我们的框架在美国总统选举模拟中的有效性和鲁棒性。

[NLP-55] Gender Bias in LLM -generated Interview Responses

【速读】：该论文旨在解决生成式大型语言模型（LLMs）在生成面试回答时表现出的性别偏见问题。解决方案的关键在于对不同LLMs（如GPT-3.5、GPT-4、Claude）生成的面试回答进行多维度审计，包括模型、问题类型和职位，并评估其与两种性别刻板印象的一致性。研究发现，性别偏见在不同模型和情境下具有一致性，并与性别刻板印象和职位主导性密切相关。因此，论文强调了在相关应用中需要采取谨慎的方法来减轻这种偏见。

链接: https://arxiv.org/abs/2410.20739
作者: Haein Kong,Yongsu Ahn,Sangyub Lee,Yunho Maeng
关键词-EN: including job-related texts, diverse text-generation tasks, text-generation tasks, including job-related, job-related texts
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have emerged as a promising tool for assisting individuals in diverse text-generation tasks, including job-related texts. However, LLM-generated answers have been increasingly found to exhibit gender bias. This study evaluates three LLMs (GPT-3.5, GPT-4, Claude) to conduct a multifaceted audit of LLM-generated interview responses across models, question types, and jobs, and their alignment with two gender stereotypes. Our findings reveal that gender bias is consistent, and closely aligned with gender stereotypes and the dominance of jobs. Overall, this study contributes to the systematic examination of gender bias in LLM-generated interview responses, highlighting the need for a mindful approach to mitigate such biases in related applications.
摘要：大语言模型 (LLM) 已成为辅助个人完成多样化文本生成任务的有力工具，包括与工作相关的文本。然而，越来越多的研究发现，LLM 生成的答案中存在性别偏见。本研究对三种大语言模型（GPT-3.5、GPT-4、Claude）进行了多方面的审计，评估了它们在不同模型、问题类型和工作岗位下生成的面试回答，以及这些回答与两种性别刻板印象的一致性。研究结果表明，性别偏见具有一致性，并与性别刻板印象及职业主导性密切相关。总体而言，本研究为系统性审视大语言模型生成的面试回答中的性别偏见提供了贡献，强调了在相关应用中采取审慎措施以减轻此类偏见的必要性。

[NLP-56] SEG:Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment

【速读】：该论文试图解决知识图谱中实体对齐的问题，特别是在多源数据导致对齐实体的非同构邻域结构的情况下，尤其是对于稀疏连接的实体。解决方案的关键在于提出了一种软标签传播框架，该框架整合了多源数据并通过迭代种子增强来处理大规模数据集的扩展性挑战。具体来说，该框架利用种子进行锚定，并选择最优关系对来生成富含邻域特征和语义关系数据的软标签。此外，实现了一个双向加权联合损失函数，该函数减少了正样本之间的距离，并对负样本进行差异化处理，从而考虑了非同构邻域结构的影响。这种方法在多个数据集上显著优于现有的半监督方法，显著提高了实体对齐的质量。

链接: https://arxiv.org/abs/2410.20733
作者: Wei Ai,Yinghui Gao,Jianbin Li,Jiayi Du,Tao Meng,Yuntao Shou,Keqin Li
关键词-EN: knowledge graphs, merging knowledge, crucial for merging, knowledge, entities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7, 2 figures

点击查看摘要

Abstract:Entity alignment is crucial for merging knowledge across knowledge graphs, as it matches entities with identical semantics. The standard method matches these entities based on their embedding similarities using semi-supervised learning. However, diverse data sources lead to non-isomorphic neighborhood structures for aligned entities, complicating alignment, especially for less common and sparsely connected entities. This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement, addressing scalability challenges in handling extensive datasets where scale computing excels. The framework uses seeds for anchoring and selects optimal relationship pairs to create soft labels rich in neighborhood features and semantic relationship data. A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples, taking into account the non-isomorphic neighborhood structures. Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets, significantly improving the quality of entity alignment.
摘要：实体对齐在知识图谱间的知识融合中至关重要，因为它能够匹配语义相同的实体。标准方法通过半监督学习基于实体嵌入的相似性进行匹配。然而，多样化的数据源导致对齐实体的邻域结构非同构，这使得对齐过程变得复杂，尤其是对于较少见且连接稀疏的实体。本文提出了一种软标签传播框架，该框架集成了多源数据和迭代种子增强，解决了在处理大规模数据集时面临的可扩展性挑战，其中规模计算表现出色。该框架利用种子进行锚定，并选择最优关系对来创建富含邻域特征和语义关系数据的软标签。我们实现了一个双向加权联合损失函数，该函数减少了正样本之间的距离，并对负样本进行差异化处理，考虑了非同构的邻域结构。我们的方法优于现有的半监督方法，这在多个数据集上的优越结果中得到了验证，显著提高了实体对齐的质量。

[NLP-57] Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

【速读】：该论文试图解决大语言模型（LLMs）在推理能力上表现出色但面临幻觉和知识过时的问题。解决方案的关键在于引入基于知识图谱（Knowledge Graph, KG）的检索增强生成框架（Retrieval-Augmented Generation, RAG），通过将LLM的输出锚定在来自KG的结构化外部知识上来解决这些问题。具体来说，论文提出了SubgraphRAG，该方法通过检索子图并利用LLMs进行推理和答案预测，创新性地结合了轻量级多层感知器（multilayer perceptron）与并行三元组评分机制（parallel triple-scoring mechanism），以实现高效且灵活的子图检索，同时编码方向性结构距离以增强检索效果。这种设计在模型复杂性和推理能力之间取得了平衡，使得检索过程既可扩展又具有通用性。通过灵活调整检索子图的大小，SubgraphRAG能够适应查询需求和下游LLM的能力，从而在不进行微调的情况下，使较小模型如Llama3.1-8B-Instruct展现出可解释的推理能力，而较大模型如GPT-4o则实现了与先前基线相比的最新准确性。

链接: https://arxiv.org/abs/2410.20724
作者: Mufei Li,Siqi Miao,Pan Li
关键词-EN: Large Language Models, Large Language, demonstrate strong reasoning, strong reasoning abilities, demonstrate strong
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query’s need and the downstream LLM’s capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines – all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG’s strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.
摘要：大语言模型 (LLM) 展示了强大的推理能力，但面临着如幻觉和知识过时等局限性。基于知识图谱 (Knowledge Graph, KG) 的检索增强生成 (Retrieval-Augmented Generation, RAG) 通过将 LLM 的输出基于 KG 中的结构化外部知识来解决这些问题。然而，当前基于 KG 的 RAG 框架在优化检索效果与效率之间的权衡方面仍存在困难，难以确定适合 LLM 消化的相关图信息量。我们引入了 SubgraphRAG，扩展了基于 KG 的 RAG 框架，该框架检索子图并利用 LLM 进行推理和答案预测。我们的方法创新性地集成了一个轻量级的多层感知器与并行三元组评分机制，以实现高效且灵活的子图检索，同时编码方向性结构距离以增强检索效果。检索到的子图大小可以根据查询需求和下游 LLM 的能力灵活调整。这种设计在模型复杂性和推理能力之间取得了平衡，使得检索过程具有可扩展性和通用性。值得注意的是，基于我们检索到的子图，较小的 LLM 如 Llama3.1-8B-Instruct 提供了具有可解释推理的竞争性结果，而较大的模型如 GPT-4o 在与先前基线的比较中达到了最先进的准确性——所有这些都不需要微调。在 WebQSP 和 CWQ 基准上的广泛评估突显了 SubgraphRAG 在效率、准确性和可靠性方面的优势，通过减少幻觉和提高响应的接地性。

[NLP-58] Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models INTERSPEECH2023

【速读】：该论文试图解决预训练语言模型在自然语言推理（NLI）任务中对反事实数据的鲁棒性不足的问题。解决方案的关键在于通过基于词和句子的增强方法生成属于每个类别的反事实句子对，并应用对比学习（contrastive learning）来帮助模型学习具有相似上下文的句子对之间的差异。这种方法旨在提高NLI模型的性能和鲁棒性。

链接: https://arxiv.org/abs/2410.20710
作者: Heerin Yang,Sseung-won Hwang,Jungmin So
关键词-EN: language processing tasks, natural language processing, pre-trained language models, language inference tasks, determine the outcome
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted at INTERSPEECH 2023

点击查看摘要

Abstract:Although pre-trained language models show good performance on various natural language processing tasks, they often rely on non-causal features and patterns to determine the outcome. For natural language inference tasks, previous results have shown that even a model trained on a large number of data fails to perform well on counterfactually revised data, indicating that the model is not robustly learning the semantics of the classes. In this paper, we propose a method in which we use token-based and sentence-based augmentation methods to generate counterfactual sentence pairs that belong to each class, and apply contrastive learning to help the model learn the difference between sentence pairs of different classes with similar contexts. Evaluation results with counterfactually-revised dataset and general NLI datasets show that the proposed method can improve the performance and robustness of the NLI model.
摘要：尽管预训练语言模型在各种自然语言处理任务中表现出色，但它们通常依赖于非因果特征和模式来确定结果。对于自然语言推理任务，先前的研究结果表明，即使模型在大量数据上进行训练，其在反事实修订数据上的表现仍然不佳，这表明模型并未稳健地学习各类别的语义。本文提出了一种方法，我们采用基于 Token 和基于句子的增强方法来生成属于每个类别的反事实句子对，并应用对比学习来帮助模型学习具有相似上下文的句子对之间的差异。在反事实修订数据集和通用自然语言推理数据集上的评估结果显示，所提出的方法能够提升自然语言推理模型的性能和鲁棒性。

[NLP-59] DisasterQA: A Benchmark for Assessing the performance of LLM s in Disaster Response

【速读】：该论文试图解决的问题是评估大型语言模型（LLMs）在灾难响应中的知识和决策能力，特别是它们是否适合在灾难情况下提供建议和决策支持。解决方案的关键在于引入了一个名为DisasterQA的基准测试，该基准从六个在线资源中创建，涵盖广泛的灾难响应主题。通过评估五种不同的LLMs在四种不同提示方法下的表现，研究测量了准确性和置信度（通过Logprobs），结果表明LLMs在灾难响应知识方面仍需改进。论文希望通过这一基准推动LLMs在灾难响应领域的进一步发展，使其最终能够与应急管理人员协同工作。

链接: https://arxiv.org/abs/2410.20707
作者: Rajat Rawat
关键词-EN: response times vital, quick response times, times vital, disaster response, disaster
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 tables

点击查看摘要

Abstract:Disasters can result in the deaths of many, making quick response times vital. Large Language Models (LLMs) have emerged as valuable in the field. LLMs can be used to process vast amounts of textual information quickly providing situational context during a disaster. However, the question remains whether LLMs should be used for advice and decision making in a disaster. To evaluate the capabilities of LLMs in disaster response knowledge, we introduce a benchmark: DisasterQA created from six online sources. The benchmark covers a wide range of disaster response topics. We evaluated five LLMs each with four different prompting methods on our benchmark, measuring both accuracy and confidence levels through Logprobs. The results indicate that LLMs require improvement on disaster response knowledge. We hope that this benchmark pushes forth further development of LLMs in disaster response, ultimately enabling these models to work alongside. emergency managers in disasters.
摘要：灾难可能导致大量人员伤亡，因此快速响应至关重要。大语言模型 (LLMs) 在这一领域已展现出其价值。LLMs 能够快速处理大量文本信息，为灾难中的情境提供上下文。然而，LLMs 是否应被用于灾难中的建议和决策仍是一个问题。为了评估 LLMs 在灾难响应知识方面的能力，我们引入了一个基准：DisasterQA，该基准来源于六个在线资源。该基准涵盖了广泛的灾难响应主题。我们在基准上评估了五种 LLMs，每种模型采用四种不同的提示方法，通过 Logprobs 测量其准确性和置信度。结果表明，LLMs 在灾难响应知识方面仍有待提升。我们希望这一基准能够推动 LLMs 在灾难响应领域的进一步发展，最终使这些模型能够与应急管理人员在灾难中协同工作。

[NLP-60] Combining Domain-Specific Models and LLM s for Automated Disease Phenotyping from Survey Data

【速读】：该论文试图解决从研究调查数据中自动进行疾病表型分类的问题，特别是如何高效且准确地将不断增长的调查数据与标准化疾病本体进行协调。解决方案的关键在于结合领域特定模型BERN2与大型语言模型（LLMs），通过提取和规范化疾病信息，并利用提示工程、检索增强生成（RAG）和指令微调（IFT）等技术来优化模型的输出。特别是通过少样本推理和RAG协调，结合结构化示例、逻辑推理提示和详细上下文，显著提升了疾病提及的提取和规范化准确性，为开发高效的数据协调和队列分析工具提供了有前景的路径。

链接: https://arxiv.org/abs/2410.20695
作者: Gal Beeri,Benoit Chamot,Elena Latchem,Shruthi Venkatesh,Sarah Whalan,Van Zyl Kruger,David Martino
关键词-EN: exploratory pilot study, pilot study investigated, enhance automated disease, automated disease phenotyping, survey data
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This exploratory pilot study investigated the potential of combining a domain-specific model, BERN2, with large language models (LLMs) to enhance automated disease phenotyping from research survey data. Motivated by the need for efficient and accurate methods to harmonize the growing volume of survey data with standardized disease ontologies, we employed BERN2, a biomedical named entity recognition and normalization model, to extract disease information from the ORIGINS birth cohort survey data. After rigorously evaluating BERN2’s performance against a manually curated ground truth dataset, we integrated various LLMs using prompt engineering, Retrieval-Augmented Generation (RAG), and Instructional Fine-Tuning (IFT) to refine the model’s outputs. BERN2 demonstrated high performance in extracting and normalizing disease mentions, and the integration of LLMs, particularly with Few Shot Inference and RAG orchestration, further improved accuracy. This approach, especially when incorporating structured examples, logical reasoning prompts, and detailed context, offers a promising avenue for developing tools to enable efficient cohort profiling and data harmonization across large, heterogeneous research datasets.
摘要：本探索性试点研究探讨了将领域特定模型 BERN2 与大语言模型 (LLMs) 结合以增强从研究调查数据中进行自动化疾病表型分析的潜力。鉴于需要高效且准确的方法来将日益增长的调查数据与标准化疾病本体进行协调，我们采用了 BERN2，一种生物医学命名实体识别与规范化模型，从 ORIGINS 出生队列调查数据中提取疾病信息。在严格评估 BERN2 相对于人工整理的基准数据集的表现后，我们通过提示工程、检索增强生成 (RAG) 和指令微调 (IFT) 整合了多种大语言模型以优化模型的输出。BERN2 在提取和规范化疾病提及方面表现出色，而大语言模型的集成，特别是通过少样本推理和 RAG 编排，进一步提高了准确性。这种方法，尤其是在结合结构化示例、逻辑推理提示和详细上下文时，为开发工具以实现大规模异质研究数据集中的高效队列分析和数据协调提供了有前景的途径。

[NLP-61] SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script

【速读】：该论文试图解决长期对话中互动性和持续性的问题，通过利用对话双方共享的记忆来增强对话的吸引力。解决方案的关键在于引入了一个名为SHARE的新数据集，该数据集从电影剧本中构建，包含了对话双方的人物信息和事件的摘要，以及可隐式提取的共享记忆。此外，论文还提出了一个基于SHARE的长期对话框架EPISODE，该框架利用共享经历来管理对话中的共享记忆，从而使长期对话更加生动和可持续。

链接: https://arxiv.org/abs/2410.20682
作者: Eunwon Kim,Chanho Park,Buru Chang
关键词-EN: Shared memories, Shared, memories, strengthen their bond, crucial for facilitating
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Shared memories between two individuals strengthen their bond and are crucial for facilitating their ongoing conversations. This study aims to make long-term dialogue more engaging by leveraging these shared memories. To this end, we introduce a new long-term dialogue dataset named SHARE, constructed from movie scripts, which are a rich source of shared memories among various relationships. Our dialogue dataset contains the summaries of persona information and events of two individuals, as explicitly revealed in their conversation, along with implicitly extractable shared memories. We also introduce EPISODE, a long-term dialogue framework based on SHARE that utilizes shared experiences between individuals. Through experiments using SHARE, we demonstrate that shared memories between two individuals make long-term dialogues more engaging and sustainable, and that EPISODE effectively manages shared memories during dialogue. Our new dataset is publicly available at this https URL.
摘要：两个人之间的共享记忆能够加强他们的联系，并且对于促进他们持续的对话至关重要。本研究旨在利用这些共享记忆，使长期对话更加引人入胜。为此，我们引入了一个名为 SHARE 的新长期对话数据集，该数据集基于电影剧本构建，电影剧本是各种关系中共享记忆的丰富来源。我们的对话数据集包含了两个人物信息和事件的摘要，这些信息和事件在他们的对话中被明确揭示，以及可以隐式提取的共享记忆。我们还介绍了 EPISODE，这是一个基于 SHARE 的长期对话框架，利用了个人之间的共享经历。通过使用 SHARE 进行实验，我们证明了两个人之间的共享记忆使得长期对话更加引人入胜和可持续，并且 EPISODE 在对话过程中有效地管理了共享记忆。我们的新数据集可通过此 https URL 公开获取。

[NLP-62] Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

【速读】：该论文试图解决大型语言模型（LLMs）部署成本高的问题，提出了一种通过参数共享来减小模型大小和成本的方法。解决方案的关键在于重新审视Transformer中的“层绑定”（layer tying）技术，并引入了一种新型的“递归Transformer”（Recursive Transformers）模型。这种模型通过重复使用单一的独特层块来构建，从而显著减少了参数数量，同时通过深度方向的低秩适应（LoRA）模块增加了灵活性，进一步提升了性能。论文还提出了“松弛递归Transformer”（Relaxed Recursive Transformers），通过在层绑定约束中引入灵活性，保持了模型的紧凑性。实验结果表明，递归模型在性能上优于类似大小的传统预训练模型和知识蒸馏基线，甚至能恢复大部分原始“全尺寸”模型的性能。此外，论文还提出了“连续深度批处理”（Continuous Depth-wise Batching）这一新的推理范式，有望在推理吞吐量上实现显著提升。

链接: https://arxiv.org/abs/2410.20672
作者: Sangmin Bae,Adam Fisch,Hrayr Harutyunyan,Ziwei Ji,Seungyeon Kim,Tal Schuster
关键词-EN: Large language models, Large language, Recursive Transformers, expensive to deploy, Relaxed Recursive Transformers
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 48 pages, 17 figures, 17 tables

点击查看摘要

Abstract:Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit “layer tying” as form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller “Recursive Transformers” that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines – and can even recover most of the performance of the original “full-size” model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
摘要：大语言模型 (LLMs) 的部署成本高昂。参数共享提供了一种可能的途径来减小其规模和成本，但在现代 LLMs 中的效果仍然相当有限。在本研究中，我们重新审视了“层绑定”作为 Transformer 中参数共享的一种形式，并引入了新的方法，将现有 LLMs 转换为更小的“递归 Transformer”，这些模型在层间共享参数，且性能损失最小。我们的递归 Transformer 能从标准的预训练 Transformer 高效初始化，但仅使用一个独特的层块，然后在循环中重复多次。我们进一步通过引入松弛递归 Transformer 来提升性能，该模型通过深度方向的低秩适应 (LoRA) 模块增加了层绑定约束的灵活性，但仍保持了整体模型的紧凑性。实验表明，我们的递归模型（如递归 Gemma 1B）在性能上优于类似规模的普通预训练模型（如 TinyLlama 1.1B 和 Pythia 1B）以及知识蒸馏基线，甚至能够恢复大部分原始“全尺寸”模型（如未共享参数的 Gemma 2B）的性能。最后，我们提出了连续深度批处理，这是一种由递归 Transformer 与早期退出相结合所启发的有前景的新推理范式。在理论分析中，我们展示了这有可能带来显著（2-3倍）的推理吞吐量提升。

[NLP-63] Guide-LLM : An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments

【速读】：该论文试图解决视觉障碍者（PVI）在大型室内环境中导航的挑战。解决方案的关键在于引入了一个基于大语言模型（LLM）的实体代理Guide-LLM，该代理通过创新的文本型拓扑地图来规划全局路径，使用简化的环境表示，专注于直线和直角转弯，以简化导航。此外，Guide-LLM利用LLM的常识推理能力进行危险检测和基于用户偏好的个性化路径规划。实验结果表明，该系统能够有效辅助视觉障碍者导航，展示了其在辅助技术领域的显著进步潜力。

链接: https://arxiv.org/abs/2410.20666
作者: Sangmim Song,Sarath Kodagoda,Amal Gunatilake,Marc G. Carmichael,Karthick Thiyagarajan,Jodi Martin
关键词-EN: visual impairments, challenge for persons, persons with visual, Navigation presents, presents a significant
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Navigation presents a significant challenge for persons with visual impairments (PVI). While traditional aids such as white canes and guide dogs are invaluable, they fall short in delivering detailed spatial information and precise guidance to desired locations. Recent developments in large language models (LLMs) and vision-language models (VLMs) offer new avenues for enhancing assistive navigation. In this paper, we introduce Guide-LLM, an embodied LLM-based agent designed to assist PVI in navigating large indoor environments. Our approach features a novel text-based topological map that enables the LLM to plan global paths using a simplified environmental representation, focusing on straight paths and right-angle turns to facilitate navigation. Additionally, we utilize the LLM’s commonsense reasoning for hazard detection and personalized path planning based on user preferences. Simulated experiments demonstrate the system’s efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology. The results highlight Guide-LLM’s ability to offer efficient, adaptive, and personalized navigation assistance, pointing to promising advancements in this field.
摘要：导航对于视觉障碍者（Persons with Visual Impairments, PVI）来说是一个重大挑战。尽管传统的辅助工具如白杖和导盲犬具有不可估量的价值，但它们在提供详细的空间信息和精确引导至目标位置方面存在不足。近年来，大语言模型（Large Language Models, LLMs）和视觉-语言模型（Vision-Language Models, VLMs）的发展为增强辅助导航提供了新的途径。本文介绍了Guide-LLM，这是一个基于LLM的具身智能体，旨在帮助PVI在大规模室内环境中进行导航。我们的方法采用了一种新颖的基于文本的拓扑地图，使LLM能够使用简化的环境表示来规划全局路径，重点在于直线和直角转弯，以促进导航。此外，我们还利用LLM的常识推理进行危险检测，并根据用户偏好进行个性化路径规划。模拟实验证明了该系统在引导PVI方面的有效性，突显了其在辅助技术领域的重大进步潜力。结果表明，Guide-LLM能够提供高效、适应性强且个性化的导航辅助，预示着该领域有望取得显著进展。

[NLP-64] Visualizing attention zones in machine reading comprehension models

【速读】：该论文试图解决机器阅读理解 (MRC) 模型中注意力机制的可解释性问题。解决方案的关键在于构建一个基于预训练语言模型的 MRC 模型，并通过可视化不同层中每个注意力区域的效果来展示模型的解释性。该方法不仅提供了具体的实现流程和代码，还具有通用性，可应用于其他预训练语言模型。

链接: https://arxiv.org/abs/2410.20652
作者: Yiming Cui,Wei-Nan Zhang,Ting Liu
关键词-EN: machine reading comprehension, attention mechanism plays, reading comprehension, MRC model, mechanism plays
类目: Computation and Language (cs.CL)
备注: 17 pages, published in STAR Protocols

点击查看摘要

Abstract:The attention mechanism plays an important role in the machine reading comprehension (MRC) model. Here, we describe a pipeline for building an MRC model with a pretrained language model and visualizing the effect of each attention zone in different layers, which can indicate the explainability of the model. With the presented protocol and accompanying code, researchers can easily visualize the relevance of each attention zone in the MRC model. This approach can be generalized to other pretrained language models.
摘要：注意力机制在机器阅读理解 (MRC) 模型中扮演着重要角色。本文描述了一种构建 MRC 模型的流程，该模型基于预训练语言模型，并能够可视化不同层中各个注意力区域的效果，从而揭示模型的可解释性。通过提供的协议和配套代码，研究人员可以轻松地可视化 MRC 模型中每个注意力区域的相关性。此方法可推广至其他预训练语言模型。

[NLP-65] SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts QA Through Six-Dimensional Feature Analysis NEURIPS2024

【速读】：该论文试图解决在事实正确但缺乏清晰度和相关性的主观回答中存在的信息透明度问题。解决方案的关键在于引入了一个名为SubjECTive-QA的人工标注数据集，该数据集基于收益电话会议（Earnings Call Transcripts, ECTs）的问答环节，涵盖了六个主观特征：断言性（Assertive）、谨慎性（Cautious）、乐观性（Optimistic）、具体性（Specific）、清晰性（Clear）和相关性（Relevant）。通过这些特征的标注，研究者能够评估和比较不同预训练语言模型（Pre-trained Language Model, PLM）在处理主观性问答时的表现，并展示了该数据集在不同领域（如白宫新闻简报）中的广泛适用性。

链接: https://arxiv.org/abs/2410.20651
作者: Huzaifa Pardawala,Siddhant Sukhani,Agam Shah,Veer Kejriwal,Abhishek Pillai,Rohan Bhasin,Andrew DiBiasio,Tarun Mandapati,Dhruv Adha,Sudheer Chava
关键词-EN: addressing objective inaccuracies, Fact-checking is extensively, addressing objective, objective inaccuracies, extensively studied
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Fact-checking is extensively studied in the context of misinformation and disinformation, addressing objective inaccuracies. However, a softer form of misinformation involves responses that are factually correct but lack certain features such as clarity and relevance. This challenge is prevalent in formal Question-Answer (QA) settings such as press conferences in finance, politics, sports, and other domains, where subjective answers can obscure transparency. Despite this, there is a lack of manually annotated datasets for subjective features across multiple dimensions. To address this gap, we introduce SubjECTive-QA, a human annotated dataset on Earnings Call Transcripts’ (ECTs) QA sessions as the answers given by company representatives are often open to subjective interpretations and scrutiny. The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant. These features are carefully selected to encompass the key attributes that reflect the tone of the answers provided during QA sessions across different domain. Our findings are that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, has similar weighted F1 scores to Llama-3-70b-Chat on features with lower subjectivity, such as Relevant and Clear, with a mean difference of 2.17% in their weighted F1 scores. The models perform significantly better on features with higher subjectivity, such as Specific and Assertive, with a mean difference of 10.01% in their weighted F1 scores. Furthermore, testing SubjECTive-QA’s generalizability using QAs from White House Press Briefings and Gaggles yields an average weighted F1 score of 65.97% using our best models for each feature, demonstrating broader applicability beyond the financial domain. SubjECTive-QA is publicly available under the CC BY 4.0 license
摘要：事实核查在处理虚假信息和误导信息中的客观不准确性方面得到了广泛研究。然而，一种较为隐蔽的虚假信息形式涉及事实正确但缺乏某些特征（如清晰度和相关性）的回答。这一挑战在正式的问答（QA）环境中尤为普遍，例如金融、政治、体育等领域的新闻发布会上，主观回答可能掩盖透明度。尽管如此，目前缺乏针对多维度主观特征的手动标注数据集。为填补这一空白，我们引入了SubjECTive-QA，这是一个基于财报电话会议（Earnings Call Transcripts, ECTs）QA环节的人工标注数据集，因为公司代表的回答往往容易受到主观解读和审查。该数据集包含49,446个针对长篇QA对的主观特征标注，涵盖六个特征：断言性（Assertive）、谨慎性（Cautious）、乐观性（Optimistic）、具体性（Specific）、清晰性（Clear）和相关性（Relevant）。这些特征经过精心挑选，以涵盖反映不同领域QA环节中回答语气的关键属性。我们的研究发现，表现最佳的预训练语言模型（PLM）RoBERTa-base在较低主观性特征（如相关性和清晰性）上的加权F1分数与Llama-3-70b-Chat相似，平均差异为2.17%。而在较高主观性特征（如具体性和断言性）上，模型的表现显著更好，平均差异为10.01%。此外，通过使用白宫新闻发布会和Gaggles的QA数据测试SubjECTive-QA的泛化能力，我们发现使用最佳模型对每个特征的平均加权F1分数为65.97%，表明其在金融领域之外的广泛适用性。SubjECTive-QA以CC BY 4.0许可证公开发布。

[NLP-66] LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

【速读】：该论文试图解决低秩适应 (LoRA) 优化器在参数微调过程中缺乏变换不变性的问题，这导致学习效率低下和次优解。解决方案的关键是引入了一种名为 LoRA-RITE 的新型自适应矩阵预处理方法，该方法能够在保持计算效率的同时实现变换不变性。通过理论分析和实验验证，LoRA-RITE 在多个大型语言模型 (LLM) 任务中显著提升了优化效果，例如在 Gemma-2B 模型上使用 LoRA-RITE 替代 Adam 进行微调，Super-Natural Instructions 任务的准确率提高了 4.6%，其他四个 LLM 基准测试的准确率平均提高了 3.5%。

链接: https://arxiv.org/abs/2410.20625
作者: Jui-Nan Yen,Si Si,Zhao Meng,Felix Yu,Sai Surya Duvvuri,Inderjit S. Dhillon,Cho-Jui Hsieh,Sanjiv Kumar
关键词-EN: reduces memory requirements, Low-rank adaption, parameter-efficient finetuning method, memory requirements, widely used parameter-efficient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the actual updates to the weights depends on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which can achieve transformation invariance and remain computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements against existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded 4.6% accuracy gain on Super-Natural Instructions and 3.5% accuracy gain across other four LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).
摘要：低秩适应 (Low-rank Adaption, LoRA) 是一种广泛使用的大语言模型参数高效微调方法，能够降低内存需求。然而，现有的 LoRA 优化器缺乏变换不变性，这意味着权重的实际更新依赖于两个 LoRA 因子如何缩放或旋转。这一缺陷导致实际学习效率低下，并可能得到次优解。本文提出了 LoRA-RITE，一种新颖的 LoRA 优化自适应矩阵预处理方法，能够在保持计算效率的同时实现变换不变性。我们提供了理论分析来证明该方法的益处，并在包括 Gemma 2B、7B 和 mT5-XXL 在内的不同模型上进行了多种大语言模型任务的实验。结果显示，与现有优化器相比，LoRA-RITE 表现出一致的改进。例如，在 Gemma-2B 的 LoRA 微调过程中，将 Adam 替换为 LoRA-RITE 在 Super-Natural Instructions 上获得了 4.6% 的准确率提升，在其他四个大语言模型基准测试（HellaSwag、ArcChallenge、GSM8K、OpenBookQA）上平均提升了 3.5% 的准确率。

[NLP-67] owards an LLM -Based Speech Interface for Robot-Assisted Feeding

【速读】：该论文试图解决为运动障碍或其他形式残疾的个体提供辅助机器人服务的问题，特别是如何通过自然语言处理技术增强这些机器人与用户的交互能力。解决方案的关键在于利用大型语言模型 (Large Language Models, LLMs) 构建语音接口，使得用户能够通过自然语言有效地传达高级指令和细微偏好。论文展示了一个基于LLM的语音接口系统，该系统应用于商业化的辅助喂食机器人，并通过迭代设计框架整合了以人为中心的元素。该系统在独立生活设施中对11名老年人进行了用户研究评估。

链接: https://arxiv.org/abs/2410.20624
作者: Jessie Yuan,Janavi Gupta,Akhil Padmanabha,Zulekha Karachiwalla,Carmel Majidi,Henny Admoni,Zackory Erickson
关键词-EN: Large Language Models, Physically assistive robots, utilize Large Language, assistive robots present, Physically assistive
类目: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Physically assistive robots present an opportunity to significantly increase the well-being and independence of individuals with motor impairments or other forms of disability who are unable to complete activities of daily living (ADLs). Speech interfaces, especially ones that utilize Large Language Models (LLMs), can enable individuals to effectively and naturally communicate high-level commands and nuanced preferences to robots. In this work, we demonstrate an LLM-based speech interface for a commercially available assistive feeding robot. Our system is based on an iteratively designed framework, from the paper “VoicePilot: Harnessing LLMs as Speech Interfaces for Physically Assistive Robots,” that incorporates human-centric elements for integrating LLMs as interfaces for robots. It has been evaluated through a user study with 11 older adults at an independent living facility. Videos are located on our project website: this https URL.
摘要：物理辅助机器人为那些因运动障碍或其他形式残疾而无法完成日常生活活动（ADLs）的个体提供了显著提升福祉和独立性的机会。语音接口，尤其是那些利用大语言模型（LLMs）的接口，能够使个体有效地、自然地向机器人传达高级指令和细微的偏好。在本研究中，我们展示了一个基于LLM的语音接口，用于一款商业化的辅助喂食机器人。我们的系统基于一个迭代设计的框架，源自论文“VoicePilot: 利用LLMs作为物理辅助机器人的语音接口”，该框架融合了以人为中心的元素，用于将LLMs整合为机器人接口。该系统已通过在一个独立生活设施中对11名老年人进行的用户研究进行了评估。相关视频可在我们的项目网站上查看：此https URL。

[NLP-68] Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?

【速读】：该论文试图解决的问题是如何利用“弱教师模型”（如普通人类标注员或现有AI系统）有效地监督大型语言模型（LLMs），以提升其在复杂推理任务上的表现，特别是那些对教师模型构成挑战并需要专业知识或日常实践的任务。解决方案的关键在于探索了多种数据驱动的策略，这些策略在不同质量水平上提供监督数据，以应对不同复杂度的任务。论文发现，即使在高错误率（如90%）的情况下，训练数据来自与目标推理任务难度相匹配的完整任务，其效果仍优于从更简单的子任务中获取的完美正确监督数据。此外，论文强调了逐步错误率（step-wise error rates）对训练性能的影响，指出即使在结果错误率相同的情况下，逐步错误率的差异也能导致在MATH基准测试中高达30%的准确率差距。最后，论文提出，结合硬任务监督和相应的子任务监督可以显著提升性能，这为数据增强提供了新的方向。

链接: https://arxiv.org/abs/2410.20533
作者: Xuan He,Da Yin,Nanyun(Violet)Peng
关键词-EN: effectively supervise LLMs, average human annotators, weak teacher models, teacher models, hard task supervision
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How can “weak teacher models” such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and requires expertise or daily practice from the teacher models? In this paper, we seek for empirical answers to this question by investigating various data-driven strategies that offer supervision data at different quality levels upon tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging. Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90%), training on such data can outperform perfectly correct supervision on easier subtasks on multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions. Specifically, training on hard task supervision with the same outcome error rates but disparate step-wise error rates can lead to a 30% accuracy gap on MATH benchmark. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements than simply combining rephrased hard full task supervision, suggesting new avenues for data augmentation. Data and code are released at \urlthis https URL.
摘要：如何利用“弱教师模型”，如普通人类标注员或现有 AI 系统，有效地监督大语言模型 (LLM) 以提升其在复杂推理任务上的表现，尤其是那些对教师模型提出挑战并需要其专业知识或日常实践的任务？本文通过研究各种数据驱动策略，探索了在不同质量水平上提供监督数据的方法，以期为这一问题提供实证答案。两种直观的策略在教师模型进行对齐训练时提供了监督：1) 使用与目标推理任务难度相匹配的完整任务的低质量监督；2) 利用来自较简单子任务的高质量监督，这些子任务挑战性较低。有趣的是，我们发现即使硬任务监督的错误率很高（例如，90%），基于此类数据的训练在多个硬数学基准测试中仍能超越在简单子任务上完美正确的监督。我们进一步识别出一个更关键的影响训练表现的因素：逐步错误率，它反映了解决方案中错误的严重程度。具体而言，在硬任务监督中，尽管结果错误率相同，但不同的逐步错误率会导致在 MATH 基准测试上出现 30% 的准确率差距。我们的研究结果还表明，将硬任务监督与相应的子任务监督相结合，比简单地合并重新表述的硬完整任务监督，能显著提升性能，这为数据增强提供了新的途径。数据和代码已发布在 \urlthis https URL。

[NLP-69] Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

【速读】：该论文试图解决稀疏自编码器 (Sparse Autoencoders, SAEs) 在大规模语言模型中训练的可扩展性问题。解决方案的关键在于引入了一套256个SAEs，这些SAEs分别在Llama-3.1-8B-Base模型的每一层和子层上进行训练，并使用了32K和128K特征。论文对当前最先进的SAE变体Top-K SAEs进行了改进，并在多个维度上评估了其泛化能力，特别是评估了在基础模型上训练的SAEs对更长上下文和微调模型的适应性。此外，论文还分析了学习到的SAE潜在特征的几何结构，确认了特征分裂 (feature splitting) 能够发现新的特征。通过公开Llama Scope SAE的检查点和相关的训练、解释及可视化工具，论文旨在推动开源稀疏自编码器生态系统的发展，并支持机制可解释性研究，减少冗余的SAE训练需求。

链接: https://arxiv.org/abs/2410.20526
作者: Zhengfu He,Wentao Shu,Xuyang Ge,Lingjie Chen,Junxuan Wang,Yunhua Zhou,Frances Liu,Qipeng Guo,Xuanjing Huang,Zuxuan Wu,Yu-Gang Jiang,Xipeng Qiu
关键词-EN: powerful unsupervised method, extracting sparse representations, significant challenge, powerful unsupervised, unsupervised method
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 22pages, 12 figures

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that \emphfeature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at~\urlthis https URL, alongside our scalable training, interpretation, and visualization tools at \urlthis https URL. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
摘要：稀疏自编码器 (Sparse Autoencoders, SAEs) 作为一种强大的无监督方法，用于从语言模型中提取稀疏表示，但其可扩展的训练仍然是一个重大挑战。我们引入了一套 256 个 SAE，分别在 Llama-3.1-8B-Base 模型的每一层和子层上进行训练，特征数量分别为 32K 和 128K。我们对一种最先进的 SAE 变体——Top-K SAE 进行了多维度的改进评估。特别地，我们评估了在基础模型上训练的 SAE 对更长上下文和微调模型的泛化能力。此外，我们分析了所学 SAE 潜在特征的几何结构，确认了特征分割能够发现新的特征。Llama Scope SAE 的检查点已公开发布于~\urlthis https URL，同时我们的可扩展训练、解释和可视化工具也发布于 \urlthis https URL。这些贡献旨在推动开源稀疏自编码器生态系统的发展，并通过减少冗余 SAE 训练来支持机制可解释性研究。

[NLP-70] Is Moral Self-correction An Innate Capability of Large Language Models ? A Mechanistic Analysis to Self-correction

【速读】：该论文试图解决两个关于大型语言模型（LLMs）道德自我修正能力的基本问题：(1) 自我修正的不同组成部分（如思维链推理 (Chain-of-Thought, CoT)、外部反馈和指令提示）如何相互作用以实现道德自我修正；(2) 自我修正是否是LLMs的固有能力。解决方案的关键在于通过实验分析不同自我修正组件的交互作用，以及引入自然语言干预和自区分验证框架（self-distinguish）来评估自我修正的鲁棒性和有效性。实验结果表明，尽管外部反馈和CoT可以提升性能，但不存在普遍最优的自我修正方法，且内部知识和外部反馈之间存在冲突。自区分实验显示，LLMs虽能自我修正，但无法可靠地区分期望和非期望输出，从而得出道德自我修正并非LLMs预训练时获得的固有能力的结论。

链接: https://arxiv.org/abs/2410.20513
作者: Zimo Qi,Guangliang Liu,Kristen Marie Johnson,Lu Chen
关键词-EN: Large Language Models, Language Models, Large Language, self-correction, intensive attentions
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Though intensive attentions to the self-correction capability of Large Language Models (LLMs), the underlying mechanism of this capability is still under-explored. In this paper, we aim to answer two fundamental questions for moral self-correction: (1) how different components in self-correction, such as Chain-of-Thought (CoT) reasoning, external feedback, and instructional prompts, interact to enable moral self-correction; and (2) is the self-correction one of LLMs’ innate capabilities? To answer the first question, we examine how different self-correction components interact to intervene the embedded morality within hidden states, therefore contributing to different performance. For the second question, we (i) evaluate the robustness of moral self-correction by introducing natural language interventions of weak evidence into prompts; (ii) propose a validation framework, self-distinguish, that requires effective self-correction to enable LLMs to distinguish between desirable and undesirable outputs. Our experimental results indicate that there is no universally optimal self-correction method for the tasks considered, although external feedback and CoT can contribute to additional performance gains. However, our mechanistic analysis reveals negative interactions among instructional prompts, CoT, and external feedback, suggesting a conflict between internal knowledge and external feedback. The self-distinguish experiments demonstrate that while LLMs can self-correct their responses, they are unable to reliably distinguish between desired and undesired outputs. With our empirical evidence, we can conclude that moral self-correction is not an innate capability of LLMs acquired during pretraining.
摘要：尽管大语言模型 (LLM) 的自校正能力受到了广泛关注，但其背后的机制仍未得到充分探索。本文旨在回答关于道德自校正的两个基本问题：(1) 自校正中的不同组成部分，如思维链 (Chain-of-Thought, CoT) 推理、外部反馈和指令提示，如何相互作用以实现道德自校正；(2) 自校正是否是 LLM 的固有能力？为回答第一个问题，我们研究了不同自校正组件如何相互作用以干预隐藏状态中的嵌入道德，从而影响不同的性能表现。对于第二个问题，我们 (i) 通过在提示中引入弱证据的自然语言干预来评估道德自校正的鲁棒性；(ii) 提出了一种验证框架，即自我区分 (self-distinguish)，该框架要求有效的自校正以使 LLM 能够区分可取和不可取的输出。我们的实验结果表明，尽管外部反馈和 CoT 可以带来额外的性能提升，但对于所考虑的任务，并不存在普遍最优的自校正方法。然而，我们的机制分析揭示了指令提示、CoT 和外部反馈之间的负面相互作用，表明内部知识和外部反馈之间存在冲突。自我区分实验表明，尽管 LLM 可以自我校正其响应，但它们无法可靠地区分可取和不可取的输出。基于我们的实证证据，我们可以得出结论：道德自校正并非 LLM 在预训练期间获得的固有能力。

[NLP-71] MatViX: Multimodal Information Extraction from Visually Rich Articles

【速读】：该论文试图解决多模态信息提取 (Multimodal Information Extraction, MIE) 在材料科学领域中的挑战，特别是在从科学文献中提取结构化信息以加速新材料发现的过程中。解决方案的关键在于引入了一个名为 MatViX 的基准数据集，该数据集包含324篇完整的研究文章和1,688个复杂的结构化JSON文件，这些文件由领域专家精心从文本、表格和图表中提取。论文还提出了一种评估方法，用于评估曲线相似性和层次结构对齐的准确性，并展示了在零样本设置下使用视觉语言模型 (Vision-Language Models, VLMs) 处理长上下文和多模态输入的能力。特别地，使用专门模型（如DePlot）可以显著提高曲线提取的性能。

链接: https://arxiv.org/abs/2410.20494
作者: Ghazal Khalighinejad,Sharon Scott,Ollie Liu,Kelly L. Anderson,Rickard Stureborg,Aman Tyagi,Bhuwan Dhingra
关键词-EN: Multimodal information extraction, valuable data, information extraction, scientific literature, JSON files
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data is often spread across text, figures, and tables. In materials science, extracting structured information from research articles can accelerate the discovery of new materials. However, the multimodal nature and complex interconnections of scientific content present challenges for traditional text-based methods. We introduce \textscMatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured JSON files, carefully curated by domain experts. These JSON files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE. We introduce an evaluation method to assess the accuracy of curve similarity and the alignment of hierarchical structures. Additionally, we benchmark vision-language models (VLMs) in a zero-shot manner, capable of processing long contexts and multimodal inputs, and show that using a specialized model (DePlot) can improve performance in extracting curves. Our results demonstrate significant room for improvement in current models. Our dataset and evaluation code are available\footnote\urlthis https URL.
摘要：多模态信息提取（Multimodal Information Extraction, MIE）在科学文献中至关重要，因为其中的宝贵数据往往分散在文本、图表和表格中。在材料科学领域，从研究文章中提取结构化信息可以加速新材料的发现。然而，科学内容的多模态特性和复杂关联性对传统的基于文本的方法提出了挑战。我们引入了 \textscMatViX，这是一个由领域专家精心挑选的包含 324 篇完整研究文章和 1,688 个复杂结构化 JSON 文件的基准数据集。这些 JSON 文件从完整文档的文本、表格和图表中提取，为 MIE 提供了一个全面的挑战。我们引入了一种评估方法，用于评估曲线相似度和层次结构对齐的准确性。此外，我们以零样本（zero-shot）方式对视觉-语言模型（Vision-Language Models, VLMs）进行了基准测试，这些模型能够处理长上下文和多模态输入，并展示了使用专用模型（DePlot）在提取曲线方面的性能提升。我们的结果表明，当前模型仍有显著的改进空间。我们的数据集和评估代码已公开。

[NLP-72] textitWho Speaks Matters: Analysing the Influence of the Speakers Ethnicity on Hate Classification NEURIPS

【速读】：该论文试图解决大型语言模型（LLMs）在仇恨言论检测中的鲁棒性问题，特别是当输入中包含说话者的种族身份的显式和隐式标记时。解决方案的关键在于通过分析模型输出在不同标记条件下的变化频率，揭示模型在不同种族和标记类型下的鲁棒性差异。研究发现，隐式方言标记比显式身份标记更容易导致模型输出翻转，且不同种族间的翻转比例存在差异。此外，较大的模型表现出更高的鲁棒性。这些发现强调了在部署LLMs进行高风险任务（如仇恨言论检测）时需要谨慎。

链接: https://arxiv.org/abs/2410.20490
作者: Ananya Malik,Kartik Sharma,Lynnette Hui Xian Ng,Shaily Bhatt
关键词-EN: Large Language Models, Large Language, scalable content moderation, hate speech detection, including hate speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 3 tables. To appear in NeurIPS SafeGenAI 2024 Workshop

点击查看摘要

Abstract:Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker’s ethnicity are injected into the input. For the explicit markers, we inject a phrase that mentions the speaker’s identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 4 popular LLMs and 5 ethnicities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.
摘要：大语言模型 (LLMs) 为可扩展的内容审核，包括仇恨言论检测，提供了诱人的前景。然而，这些模型也被认为对边缘化社区和方言存在脆弱性和偏见。因此，将其应用于仇恨言论检测等高风险任务时，需要进行严格的审查。在本研究中，我们探讨了使用 LLMs 进行仇恨言论分类的鲁棒性，特别是在输入中注入说话者种族的显式和隐式标记时。对于显式标记，我们注入提及说话者身份的短语。对于隐式标记，我们注入方言特征。通过分析在这些标记存在时模型输出的翻转频率，我们揭示了 4 种流行 LLMs 和 5 种种族之间不同程度的脆弱性。我们发现，输入中隐式方言标记的存在比显式标记更能导致模型输出的翻转。此外，翻转的百分比在不同种族之间有所不同。最后，我们发现较大的模型更为鲁棒。我们的研究结果表明，在将 LLMs 部署于仇恨言论检测等高风险任务时，需要谨慎行事。

[NLP-73] FIRP: Faster LLM inference via future intermediate representation prediction

【速读】：该论文试图解决大语言模型 (Large Language Models, LLMs) 在解码过程中由于自回归性质导致的计算效率低下问题。自回归解码每次仅生成一个token，未能充分利用GPU的并行计算能力，从而产生显著的延迟。论文提出的解决方案是引入一种名为FIRP的推测性解码方法，其关键在于通过预测未来token的中间隐藏状态来生成多个token。具体来说，FIRP在LLM的中间层通过简单的线性变换预测未来token的伪隐藏状态，这些伪隐藏状态随后参与后续所有层的计算，从而整合更丰富的语义信息。随着层数的加深，伪隐藏状态与真实隐藏状态之间的语义差距逐渐缩小，使得高精度地解码未来token成为可能。实验结果表明，FIRP在多个模型和数据集上实现了1.9x-3x的加速比，验证了其有效性。

链接: https://arxiv.org/abs/2410.20488
作者: Pengfei Wu,Jiahao Liu,Zhuocheng Gong,Qifan Wang,Jinpeng Li,Jingang Wang,Xunliang Cai,Dongyan Zhao
关键词-EN: Large Language Models, Large Language, shown remarkable performance, Recent advancements, advancements in Large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens have not been decoded yet) and then using these pseudo hidden states to decode future tokens, specifically, these pseudo hidden states are predicted with simple linear transformation in intermediate layers of LLMs. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic information. As the layers go deeper, the semantic gap between pseudo and real hidden states is narrowed and it becomes feasible to decode future tokens with high accuracy. To validate the effectiveness of FIRP, we conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets, analytical experiments also prove our motivations.
摘要：近年来，大语言模型 (Large Language Models, LLMs) 在众多任务中展现了卓越的性能。然而，LLM 解码的自回归特性，即每次前向传播仅生成一个 Token，未能充分利用 GPU 的并行计算能力，导致显著的延迟。为解决这一问题，我们提出了一种名为 FIRP 的新型推测性解码方法，该方法在每个解码步骤中生成多个 Token 而非一个。我们通过预测未来 Token 的中间隐藏状态（即尚未解码的 Token），并利用这些伪隐藏状态来解码未来 Token 来实现这一目标。具体而言，这些伪隐藏状态通过 LLM 中间层的简单线性变换进行预测。一旦预测完成，它们将参与后续所有层的计算，从而融入更丰富的语义信息。随着层数的加深，伪隐藏状态与真实隐藏状态之间的语义差距逐渐缩小，使得以高精度解码未来 Token 成为可能。为验证 FIRP 的有效性，我们进行了广泛的实验，结果显示在多个模型和数据集上实现了 1.9 倍至 3 倍的加速比，分析实验也证明了我们的动机。

[NLP-74] What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration NEURIPS2024

【速读】：该论文试图解决多模态上下文学习 (Multi-Modal In-Context Learning, MM-ICL) 在无需额外参数调优的情况下实现优异性能的背后机制问题。解决方案的关键在于系统地研究了MM-ICL的三个核心步骤：演示检索、演示排序和提示构建。通过使用6个视觉大型语言模型和20种策略进行广泛实验，研究发现：(1) 多模态检索器在演示检索中的必要性；(2) 演示内部排序比演示间排序更为重要；(3) 通过在提示中加入引导性指令可以增强任务理解。这些发现为未来优化MM-ICL策略提供了基础性指导。

链接: https://arxiv.org/abs/2410.20482
作者: Libo Qin,Qiguang Chen,Hao Fei,Zhi Chen,Min Li,Wanxiang Che
关键词-EN: achieved notable success, additional parameter tuning, Multi-Modal In-Context Learning, requiring additional parameter, In-Context Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, which is capable of achieving superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules for the effectiveness of MM-ICL remain under-explored. To fill this gap, this work aims to investigate the research question: "What factors affect the performance of MM-ICL?‘’ To this end, we investigate extensive experiments on the three core steps of MM-ICL including demonstration retrieval, demonstration ordering, and prompt construction using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.
摘要：近年来，多模态上下文学习 (Multi-Modal In-Context Learning, MM-ICL) 取得了显著进展，能够在无需额外参数调整的情况下在多种任务中实现卓越性能。然而，MM-ICL 有效性的内在规律仍未得到充分探索。为了填补这一研究空白，本研究旨在探讨以下研究问题：“哪些因素影响 MM-ICL 的性能？”为此，我们针对 MM-ICL 的三个核心步骤——示范检索、示范排序和提示构建，进行了广泛的实验研究，涉及 6 个视觉大语言模型和 20 种策略。我们的研究发现：(1) 示范检索中多模态检索器的必要性；(2) 示范内排序相对于示范间排序的重要性；(3) 通过提示中的引导指令增强任务理解。我们希望这项研究能为未来优化 MM-ICL 策略的研究提供基础性指导。

[NLP-75] Graph Neural Networks on Discriminative Graphs of Words

【速读】：该论文试图解决文本分类问题，特别是通过改进图神经网络 (Graph Neural Networks, GNNs) 在文本分类中的应用。解决方案的关键在于提出了一种新的判别词图图神经网络 (Discriminative Graph of Words Graph Neural Network, DGoW-GNN) 方法，该方法包括两个核心创新：一是构建了一个仅包含词节点且没有文档节点的判别图，通过将训练语料库根据标签分割成不连通的子图，并使用词的点互信息 (pointwise mutual information) 来加权边；二是将文本分类任务重新定义为路径分类任务，并结合图神经网络和序列模型来实现图基础的文本分类。尽管在七个基准数据集上的评估结果显示该方法性能不及一些最先进的基线模型，但论文提供了理论动机并分析了性能差异的原因，提出了未来可能改进的条件。

链接: https://arxiv.org/abs/2410.20469
作者: Yassine Abbahaddou,Johannes F. Lutzeyer,Michalis Vazirgiannis
关键词-EN: Graph Neural Networks, complex data structures, studies apply GNNs, Words Graph Neural, Graph Neural
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In light of the recent success of Graph Neural Networks (GNNs) and their ability to perform inference on complex data structures, many studies apply GNNs to the task of text classification. In most previous methods, a heterogeneous graph, containing both word and document nodes, is constructed using the entire corpus and a GNN is used to classify document nodes. In this work, we explore a new Discriminative Graph of Words Graph Neural Network (DGoW-GNN) approach encapsulating both a novel discriminative graph construction and model to classify text. In our graph construction, containing only word nodes and no document nodes, we split the training corpus into disconnected subgraphs according to their labels and weight edges by the pointwise mutual information of the represented words. Our graph construction, for which we provide theoretical motivation, allows us to reformulate the task of text classification as the task of walk classification. We also propose a new model for the graph-based classification of text, which combines a GNN and a sequence model. We evaluate our approach on seven benchmark datasets and find that it is outperformed by several state-of-the-art baseline models. We analyse reasons for this performance difference and hypothesise under which conditions it is likely to change.
摘要：鉴于图神经网络 (Graph Neural Networks, GNNs) 近期在处理复杂数据结构上的成功及其推断能力，许多研究将 GNNs 应用于文本分类任务。在大多数先前的方法中，通过构建包含词节点和文档节点的异构图，并利用 GNN 对文档节点进行分类。本文中，我们探索了一种新的判别词图图神经网络 (Discriminative Graph of Words Graph Neural Network, DGoW-GNN) 方法，该方法结合了新颖的判别图构建和文本分类模型。在我们的图构建中，仅包含词节点而不涉及文档节点，我们将训练语料库根据其标签分割成不连通的子图，并通过所表示词的点互信息来加权边。我们为这种图构建提供了理论动机，并将其重新表述为步行分类任务。此外，我们提出了一种新的基于图的文本分类模型，该模型结合了 GNN 和序列模型。我们在七个基准数据集上评估了我们的方法，发现其性能不及几种最先进的基线模型。我们分析了这种性能差异的原因，并假设在何种条件下这种差异可能会发生变化。

[NLP-76] A Derivational ChainBank for Modern Standard Arabic

【速读】：该论文试图解决阿拉伯语派生形态学建模的问题，提出了一个名为“阿拉伯派生链库 (Arabic Derivational ChainBank)”的新框架。解决方案的关键在于通过构建派生词链来反映其派生意义，并采用基于规则的方法来快速建立这些连接，避免了耗时的手动标注。随后，将派生网络与CamelMorph形态分析器数据库对齐，形成了一个包含23,333个派生关系的词根链，从而展示了ChainBank的高效性。

链接: https://arxiv.org/abs/2410.20463
作者: Reham Marzouk,Sondos Krouna,Nizar Habash
关键词-EN: Arabic derivational morphology, modeling Arabic derivational, Arabic Derivational ChainBank, modeling Arabic, Arabic Derivational
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study presents the ``Arabic Derivational ChainBank,‘’ a novel framework for modeling Arabic derivational morphology. It establishes connections between forms and meanings by constructing a chain of derived words that reflect their derivational significance. To expedite the process, a rule-based methodology was employed, avoiding time-consuming manual annotation. The derivational network was then aligned with the CamelMorph morphological analyzer database. This two-step process resulted in a chain of derived word lemmas linked to their roots, encompassing 23,333 evaluated derivational relations, thereby demonstrating the efficiency of the ChainBank.
摘要：本研究提出了“阿拉伯语派生链银行 (Arabic Derivational ChainBank)”，这是一个用于建模阿拉伯语派生形态学的新框架。它通过构建反映派生词派生意义的派生词链，建立了形式与意义之间的联系。为了加快这一过程，采用了基于规则的方法，避免了耗时的手动标注。随后，派生网络与 CamelMorph 形态分析器数据库进行了对齐。这一两步过程最终形成了一个由派生词词根链接的派生词链，涵盖了 23,333 个经过评估的派生关系，从而展示了 ChainBank 的高效性。

[NLP-77] rajAgent : An Agent Framework for Unified Trajectory Modelling

【速读】：该论文试图解决轨迹建模（Trajectory Modeling）中由于数据异质性和任务多样性导致的统一建模难题。解决方案的关键在于提出了TrajAgent框架，这是一个基于大型语言模型（Large Language Model）的代理框架，旨在统一各种轨迹建模任务。TrajAgent的核心创新包括：1) 开发了UniEnv，一个具有统一数据和模型接口的执行环境，以支持多种模型的执行和训练；2) 设计了TAgent，一个用于自动轨迹建模的代理工作流程，涵盖了多种轨迹任务；3) 引入了AutOpt，一个系统优化模块，以进一步提升集成模型的性能。通过这些创新，TrajAgent能够自动生成针对自然语言输入的多样化轨迹任务的竞争性结果，实验证明其在统一轨迹建模中比基线方法平均提升了15.43%的性能。

链接: https://arxiv.org/abs/2410.20445
作者: Yuwei Du,Jie Feng,Jie Zhao,Yong Li
关键词-EN: data pattern mining, trajectory modelling, Trajectory, urban transportation, future prediction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages; the code will be openly accessible at: this https URL

点击查看摘要

Abstract:Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modelling. However, due to the heterogeneity of data and the diversity of trajectory tasks, achieving unified trajectory modelling remains an important yet challenging task. In this paper, we propose TrajAgent, a large language model-based agentic framework, to unify various trajectory modelling tasks. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce TAgent, an agentic workflow designed for automatic trajectory modelling across various trajectory tasks. Specifically, we design AutOpt, a systematic optimization module within TAgent, to further improve the performance of the integrated model. With diverse trajectory tasks input in natural language, TrajAgent automatically generates competitive results via training and executing appropriate models. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in unified trajectory modelling, achieving an average performance improvement of 15.43% over baseline methods.
摘要：轨迹建模，包括轨迹数据模式挖掘和未来预测的研究，在生活服务、城市交通和公共管理等领域有着广泛的应用。针对轨迹建模中的具体问题，已经提出了多种方法。然而，由于数据的异质性和轨迹任务的多样性，实现统一的轨迹建模仍然是一个重要且具有挑战性的任务。本文中，我们提出了 TrajAgent，一个基于大语言模型的智能体框架，用于统一各种轨迹建模任务。在 TrajAgent 中，我们首先开发了 UniEnv，一个具有统一数据和模型接口的执行环境，以支持各种模型的执行和训练。基于 UniEnv，我们引入了 TAgent，一个针对不同轨迹任务自动轨迹建模的智能体工作流。具体来说，我们在 TAgent 中设计了 AutOpt，一个系统优化模块，以进一步提高集成模型的性能。通过输入自然语言描述的多样化轨迹任务，TrajAgent 能够通过训练和执行适当的模型自动生成具有竞争力的结果。在四个真实世界数据集上进行的四项任务的广泛实验表明，TrajAgent 在统一轨迹建模方面具有显著效果，平均性能比基线方法提高了 15.43%。

[NLP-78] MedGo: A Chinese Medical Large Language Model

【速读】：该论文试图解决当前大型语言模型在医疗应用中存在的准确性不足和功能局限性问题。解决方案的关键在于提出了一个专门针对中文医疗领域的大型语言模型——MedGo。MedGo通过结合高质量的无监督医疗数据、监督数据和偏好对齐数据进行训练，旨在提升其在医疗任务中的多功能性和精确性。该模型在CBLUE基准测试和自建的ClinicalQA数据集上均表现出色，尤其在ClinicalQA上超越了其基础模型Qwen2，显示出其在自动化医疗问答和临床决策支持方面的潜力。

链接: https://arxiv.org/abs/2410.20428
作者: Haitao Zhang,Bo An
关键词-EN: hot research topic, artificial intelligence, hot research, research topic, Chinese medical large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:Large models are a hot research topic in the field of artificial intelligence. Leveraging their generative capabilities has the potential to enhance the level and quality of medical services. In response to the limitations of current large language models, which often struggle with accuracy and have narrow capabilities in medical applications, this paper presents a Chinese medical large language model, MedGo. MedGo was trained using a combination of high quality unsupervised medical data, supervised data, and preference alignment data, aimed at enhancing both its versatility and precision in medical tasks. The model was evaluated through the public CBLUE benchmark and a manually constructed dataset ClinicalQA. The results demonstrate that MedGo achieved promising performance across various Chinese medical information processing tasks, achieved the first place in the CBLUE evaluation. Additionally, on our constructed dataset ClinicalQA, MedGo outperformed its base model Qwen2, highlighting its potential to improve both automated medical question answering and clinical decision support. These experimental results demonstrate that MedGo possesses strong information processing capabilities in the medical field. At present, we have successfully deployed MedGo at Shanghai East Hospital.
摘要：大模型是人工智能领域的热门研究课题。利用其生成能力有望提升医疗服务的水准和质量。针对当前大语言模型在准确性和医疗应用能力方面的局限性，本文提出了一种中文医疗大语言模型——MedGo。MedGo通过结合高质量的无监督医疗数据、监督数据和偏好对齐数据进行训练，旨在提升其在医疗任务中的通用性和精确性。该模型通过公开的CBLUE基准和手动构建的数据集ClinicalQA进行评估。结果显示，MedGo在各项中文医疗信息处理任务中表现出色，在CBLUE评测中取得了第一名。此外，在我们构建的数据集ClinicalQA上，MedGo的表现优于其基础模型Qwen2，显示出其在自动化医疗问答和临床决策支持方面的潜力。这些实验结果表明，MedGo在医疗领域具有强大的信息处理能力。目前，我们已成功将MedGo部署在上海东方医院。

[NLP-79] AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

【速读】：该论文试图解决涉及表格数据的数据科学任务中的复杂挑战，提出了一种名为AutoKaggle的框架。解决方案的关键在于通过一个协作的多代理系统，实现迭代开发过程，结合代码执行、调试和全面的单元测试，确保代码的正确性和逻辑一致性。AutoKaggle允许用户在每个阶段进行干预，从而将自动化智能与人类专家知识相结合。此外，该框架基于一个包含数据清洗、特征工程和建模验证功能的通用数据科学工具包，显著提升了生产力，并通过在8个Kaggle竞赛中的模拟验证了其有效性和实用性。

链接: https://arxiv.org/abs/2410.20424
作者: Ziming Li,Qianbo Zang,David Ma,Jiawei Guo,Tianyu Zheng,Minghao liu,Xinyao Niu,Xiang Yue,Yue Wang,Jian Yang,Jiaheng Liu,Wanjun Zhong,Wangchunshu Zhou,Wenhao Huang,Ge Zhang
关键词-EN: sophisticated problem-solving approaches, require sophisticated problem-solving, involving tabular data, tabular data present, tasks involving tabular
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 44 pages, 10 figures

点击查看摘要

Abstract:Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.
摘要：涉及表格数据的任务需要复杂的问题解决方法，这为数据科学带来了挑战。我们提出了AutoKaggle，这是一个强大且以用户为中心的框架，通过协作的多智能体系统帮助数据科学家完成日常的数据流水线任务。AutoKaggle采用迭代开发流程，结合代码执行、调试和全面的单元测试，以确保代码的正确性和逻辑一致性。该框架提供高度可定制的工作流程，允许用户在每个阶段进行干预，从而将自动化智能与人类专业知识相结合。我们的通用数据科学工具包，包括经过验证的数据清洗、特征工程和建模功能，构成了这一解决方案的基础，通过简化常见任务来提高生产力。我们选择了8个Kaggle竞赛来模拟现实应用场景中的数据处理工作流程。评估结果显示，AutoKaggle在典型的数据科学流水线中达到了0.85的验证提交率和0.82的综合评分，充分证明了其在处理复杂数据科学任务中的有效性和实用性。

[NLP-80] Open-Vocabulary Object Detection via Language Hierarchy NEURIPS2024

【速读】：该论文试图解决弱监督目标检测中存在的图像到边界框标签不匹配问题，即图像级别的标签无法提供精确的目标信息。解决方案的关键在于设计了一种名为语言层次自训练 (Language Hierarchical Self-training, LHST) 的方法，通过引入语言层次结构来增强弱监督检测器的训练。LHST 通过扩展图像级别标签并实现扩展标签与自训练之间的协同正则化，从而提供更丰富的监督信息并缓解标签不匹配问题。此外，论文还设计了语言层次提示生成 (Language Hierarchical Prompt Generation)，以帮助弥合训练和测试之间的词汇差距。实验结果表明，这些技术在14个广泛研究的目标检测数据集上持续实现了优越的泛化性能。

链接: https://arxiv.org/abs/2410.20371
作者: Jiaxing Huang,Jingyi Zhang,Kai Jiang,Shijian Lu
关键词-EN: attracted increasing attention, Recent studies, additional weak supervision, image-level labels, attracted increasing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2024 Camera Ready

点击查看摘要

Abstract:Recent studies on generalizable object detection have attracted increasing attention with additional weak supervision from large-scale datasets with image-level labels. However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-level labels do not convey precise object information. We design Language Hierarchical Self-training (LHST) that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors. LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability. In addition, we design language hierarchical prompt generation that introduces language hierarchy into prompt generation which helps bridge the vocabulary gaps between training and testing. Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.
摘要：近年来，关于可泛化对象检测的研究引起了越来越多的关注，这些研究通过大规模数据集的图像级标签提供了额外的弱监督。然而，弱监督检测学习常常面临图像到框标签不匹配的问题，即图像级标签无法传达精确的对象信息。我们设计了语言层次自训练（Language Hierarchical Self-training, LHST），将语言层次引入弱监督检测器的训练中，以学习更具泛化能力的检测器。LHST通过语言层次扩展图像级标签，并实现了扩展标签与自训练之间的协同正则化。具体而言，扩展标签通过提供更丰富的监督信息和缓解图像到框标签不匹配问题来正则化自训练，而自训练则根据预测的可靠性评估和选择扩展标签。此外，我们还设计了语言层次提示生成，将语言层次引入提示生成中，有助于弥合训练和测试之间的词汇差距。广泛的实验表明，所提出的技术在14个广泛研究的对象检测数据集上持续实现了卓越的泛化性能。

[NLP-81] Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

【速读】：该论文试图解决现有大型语言模型（LLM）在生成高质量指令数据方面的局限性问题。现有的方法主要依赖于标准监督指令微调模型，这些模型优化于通用问答/问题解决，而非数据生成。论文提出了一种新的范式，称为NOMAD，通过专门训练模型进行数据生成，从而实现显著的性能提升。解决方案的关键在于两个核心因素：无提示掩码训练（no-prompt-masked training）和适当的训练集大小选择。实验结果表明，NOMAD在TriviaQA和GSM8K数据集上分别取得了4%和2%的性能提升，尤其是在训练数据有限的情况下。此外，论文还通过“相关性”和“新颖性”的视角对生成的合成数据进行了新的解读。

链接: https://arxiv.org/abs/2410.20362
作者: Yifang Chen,David Zhu
关键词-EN: Recent advances, high-quality instruction data, large language model, high-quality instruction, advances in large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Recently, many works are exploring synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which contains a fundamental limitation: these models are optimized for general question-answering/problem-solving rather than data generation. We propose a paradigm shift named \textbfNOMAD by investigating how to specifically train models for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving 4% gains in TriviaQA and 2% in GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of “relevance” and “novelty”.
摘要：近年来，大语言模型 (LLM) 训练的进展突显了对多样化、高质量指令数据的迫切需求。近期，许多研究工作开始探索利用 LLM 生成合成数据。然而，这些研究主要集中在使用标准监督指令微调模型的提示工程上，这存在一个根本性的局限：这些模型针对的是通用问答/问题解决任务，而非数据生成任务。我们提出了一种名为 NOMAD 的范式转变，通过研究如何专门训练模型进行数据生成，证明了这一任务与训练经典语言模型 (LM) 存在显著差异。我们识别了两个关键因素：无提示掩码训练和适当的训练集规模选择。我们的方法 NOMAD 在基线方法上取得了显著改进，在 TriviaQA 上提升了 4%，在 GSM8K 上提升了 2%，且仅使用了有限的训练数据。最后，我们通过“相关性”和“新颖性”的视角解读了合成数据，提供了新的见解。

[NLP-82] Historical Test-time Prompt Tuning for Vision Foundation Models NEURIPS2024

【速读】：该论文试图解决测试时提示调优（Test-time Prompt Tuning）在处理连续变化的测试样本域时性能下降的问题。解决方案的关键是提出了历史测试时提示调优技术（HisTPT, Historical Test-time Prompt Tuning），通过引入三种知识库（local knowledge bank, hard-sample knowledge bank, 和 global knowledge bank）来记忆和利用测试样本中的有用知识，从而实现鲁棒的测试时提示优化。此外，HisTPT还采用了自适应知识检索机制，通过动态检索记忆的知识来正则化每个测试样本的预测，从而在处理不同视觉识别任务（如图像分类、语义分割和目标检测）时，即使在测试样本域不断变化的情况下，也能持续保持优越的提示调优性能。

链接: https://arxiv.org/abs/2410.20346
作者: Jingyi Zhang,Jiaxing Huang,Xiaoqin Zhang,Ling Shao,Shijian Lu
关键词-EN: Test-time prompt tuning, demonstrated great potential, learns prompts online, Test-time prompt, prompt tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2024 Camera Ready

点击查看摘要

Abstract:Test-time prompt tuning, which learns prompts online with unlabelled test samples during the inference stage, has demonstrated great potential by learning effective prompts on-the-fly without requiring any task-specific annotations. However, its performance often degrades clearly along the tuning process when the prompts are continuously updated with the test data flow, and the degradation becomes more severe when the domain of test samples changes continuously. We propose HisTPT, a Historical Test-time Prompt Tuning technique that memorizes the useful knowledge of the learnt test samples and enables robust test-time prompt tuning with the memorized knowledge. HisTPT introduces three types of knowledge banks, namely, local knowledge bank, hard-sample knowledge bank, and global knowledge bank, each of which works with different mechanisms for effective knowledge memorization and test-time prompt optimization. In addition, HisTPT features an adaptive knowledge retrieval mechanism that regularizes the prediction of each test sample by adaptively retrieving the memorized knowledge. Extensive experiments show that HisTPT achieves superior prompt tuning performance consistently while handling different visual recognition tasks (e.g., image classification, semantic segmentation, and object detection) and test samples from continuously changing domains.
摘要：测试时提示调优（Test-time prompt tuning）在推理阶段通过未标记的测试样本在线学习提示，展示了巨大的潜力，能够在无需任何任务特定标注的情况下即时学习有效的提示。然而，随着提示在测试数据流中不断更新，其性能往往在调优过程中明显下降，当测试样本的领域持续变化时，这种下降更为严重。我们提出了HisTPT，一种历史测试时提示调优技术，该技术记忆已学习测试样本的有用知识，并利用记忆的知识实现鲁棒的测试时提示调优。HisTPT引入了三种知识库，即局部知识库、难样本知识库和全局知识库，每种知识库通过不同的机制实现有效的知识记忆和测试时提示优化。此外，HisTPT还具备一种自适应知识检索机制，通过自适应地检索记忆的知识来规范每个测试样本的预测。大量实验表明，HisTPT在处理不同视觉识别任务（如图像分类、语义分割和目标检测）以及来自持续变化领域的测试样本时，始终能够实现卓越的提示调优性能。

[NLP-83] Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在文本生成过程中出现的幻觉问题（hallucinations），即模型在解码过程中未能保持上下文信息的忠实性和连贯性，有时会因采样策略和训练数据及微调差异导致的固有偏差而忽略关键细节。解决方案的关键在于提出了一种新的解码策略，该策略利用吸收马尔可夫链（absorbing Markov chains）来量化上下文信息的重要性，并测量生成过程中信息损失的程度。通过考虑从第一个词到最后一个词的所有可能路径，该方法在不需额外训练或外部数据的情况下，增强了模型输出的可靠性。实验结果表明，该方法在减少幻觉方面表现优异，特别是在确保网络应用中信息准确传播方面具有重要意义。

链接: https://arxiv.org/abs/2410.20340
作者: Jiemin Wu,Songning Lai,Ruiqiang Xiao,Tianlang Xue,Jiayu Yang,Yutao Yue
关键词-EN: Large Language Models, overlooking critical details, critical details due, Large Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful tools for text generation, translation, and summarization, but they often suffer from hallucinations-instances where they fail to maintain the fidelity and coherence of contextual information during decoding, sometimes overlooking critical details due to their sampling strategies and inherent biases from training data and fine-tuning discrepancies. These hallucinations can propagate through the web, affecting the trustworthiness of information disseminated online. To address this issue, we propose a novel decoding strategy that leverages absorbing Markov chains to quantify the significance of contextual information and measure the extent of information loss during generation. By considering all possible paths from the first to the last token, our approach enhances the reliability of model outputs without requiring additional training or external data. Evaluations on datasets including TruthfulQA, FACTOR, and HaluEval highlight the superior performance of our method in mitigating hallucinations, underscoring the necessity of ensuring accurate information flow in web-based applications.
摘要：大语言模型 (LLMs) 是文本生成、翻译和摘要的强大工具，但它们常常出现幻觉现象——即在解码过程中无法保持上下文信息的忠实性和连贯性，有时由于采样策略和训练数据及微调差异带来的固有偏差而忽略关键细节。这些幻觉现象可能会通过网络传播，影响在线信息的可信度。为解决这一问题，我们提出了一种新的解码策略，利用吸收马尔可夫链来量化上下文信息的重要性，并衡量生成过程中信息损失的程度。通过考虑从第一个 Token 到最后一个 Token 的所有可能路径，我们的方法在不需额外训练或外部数据的情况下增强了模型输出的可靠性。在 TruthfulQA、FACTOR 和 HaluEval 等数据集上的评估结果显示，我们的方法在减少幻觉现象方面表现优异，突显了在基于网络的应用中确保信息准确流动的必要性。

[NLP-84] Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

【速读】：该论文试图解决将大型语言模型（LLMs）扩展到语音生成任务中的问题，并提出了一种名为TTS-Llama的文本到语音（TTS）系统，该系统基于微调的Llama模型，实现了最先进的语音合成性能。解决方案的关键在于提出了MoLE-Llama，这是一种通过纯后期融合参数高效微调（PEFT）和混合专家架构开发的文本与语音多模态LLM。MoLE-Llama在文本问答（QA）和TTS任务中均表现出竞争性性能，有效缓解了多模态学习中的灾难性遗忘问题，并在文本输入语音输出的问答任务中展示了其作为多模态对话系统的巨大潜力。

链接: https://arxiv.org/abs/2410.20336
作者: Maohao Shen,Shun Zhang,Jilong Wu,Zhiping Xiu,Ehab AlBadawy,Yiting Lu,Mike Seltzer,Qing He
关键词-EN: Large language models, natural language processing, revolutionized natural language, Large language, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to with speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama’s competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
摘要：大语言模型（Large Language Models, LLMs）在自然语言处理（Natural Language Processing, NLP）领域取得了显著的进展，在各种基于文本的任务中表现出色。然而，将这些以文本为主导的 LLMs 扩展到语音生成任务的研究仍然较少。在本研究中，我们引入了一种基于微调的 Llama 模型的文本到语音（Text-to-Speech, TTS）系统，命名为 TTS-Llama，该系统实现了最先进的语音合成性能。在此基础上，我们进一步提出了 MoLE-Llama，这是一种通过纯后期融合参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）和专家混合架构开发的文本与语音多模态 LLM。广泛的实证结果表明，MoLE-Llama 在纯文本问答（Question-Answering, QA）和 TTS 任务中均表现出竞争性性能，有效缓解了任一模态中的灾难性遗忘问题。最后，我们进一步探索了 MoLE-Llama 在文本输入语音输出的问答任务中的应用，展示了其在语音生成能力方面的巨大潜力，有望成为一种多模态对话系统。

[NLP-85] Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLM s

【速读】：该论文试图解决从语音转录文本中识别情感状态的问题，即后自动语音识别（ASR）情感识别。解决方案的关键在于：首先对所有可用的转录文本进行精炼以确保数据可靠性；然后将完整的对话分割成较小的对话片段，并利用这些对话片段作为上下文来预测目标话语的情感；最后，通过研究不同的上下文长度和提示技术来提高预测准确性。该方法在挑战赛中表现优异，最佳提交结果在未加权准确率上超过了基线20%，取得了最佳性能。

链接: https://arxiv.org/abs/2410.20334
作者: Enshi Zhang,Christian Poellabauer
关键词-EN: Speech Emotion Recognition, Automatic Speech Recognition, identifying emotional states, Post Automatic Speech, Emotion Recognition
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language. The 2024 IEEE SLT-GenSEC Challenge on Post Automatic Speech Recognition (ASR) Emotion Recognition tasks participants to explore the capabilities of large language models (LLMs) for emotion recognition using only text data. We propose a novel approach that first refines all available transcriptions to ensure data reliability. We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue. Finally, we investigated different context lengths and prompting techniques to improve prediction accuracy. Our best submission exceeded the baseline by 20% in unweighted accuracy, achieving the best performance in the challenge. All our experiments’ codes, prediction results, and log files are publicly available.
摘要：语音情感识别 (Speech Emotion Recognition, SER) 专注于从口语中识别情感状态。2024年IEEE SLT-GenSEC挑战赛中的后自动语音识别 (ASR) 情感识别任务要求参与者探索仅使用文本数据进行情感识别的大语言模型 (LLM) 的能力。我们提出了一种新颖的方法，首先对所有可用的转录文本进行精炼，以确保数据可靠性。接着，我们将完整的对话分割成较小的对话片段，并利用这些对话片段作为上下文来预测对话中目标话语的情感。最后，我们研究了不同的上下文长度和提示技术，以提高预测准确性。我们的最佳提交结果在未加权准确率上超过了基线20%，在挑战赛中取得了最佳表现。所有实验的代码、预测结果和日志文件均已公开。

[NLP-86] Deep Learning Based Dense Retrieval: A Comparative Study

【速读】：该论文旨在评估密集检索系统（Dense Retrievers）在面对被污染的标记器（Tokenizer）时的鲁棒性。研究的关键在于通过对比监督学习模型（如BERT和Dense Passage Retrieval (DPR)）和无监督学习模型（如ANCE）在标记器被污染情况下的表现，发现监督学习模型在标记器被污染时性能显著下降，而无监督学习模型则表现出更强的抗干扰能力。实验结果表明，即使是微小的标记器扰动也能严重降低检索准确性，强调了在关键应用中需要开发更为鲁棒的防御机制。

链接: https://arxiv.org/abs/2410.20315
作者: Ming Zhong,Zhizhi Wu,Nanako Honda
关键词-EN: poisoning remains underexplored, information retrieval tasks, tokenizer poisoning remains, Dense Passage Retrieval, retrievers have achieved
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Dense retrievers have achieved state-of-the-art performance in various information retrieval tasks, but their robustness against tokenizer poisoning remains underexplored. In this work, we assess the vulnerability of dense retrieval systems to poisoned tokenizers by evaluating models such as BERT, Dense Passage Retrieval (DPR), Contriever, SimCSE, and ANCE. We find that supervised models like BERT and DPR experience significant performance degradation when tokenizers are compromised, while unsupervised models like ANCE show greater resilience. Our experiments reveal that even small perturbations can severely impact retrieval accuracy, highlighting the need for robust defenses in critical applications.
摘要：稠密检索器在多种信息检索任务中已达到最先进的性能，但其对分词器中毒的鲁棒性仍未得到充分探索。在本研究中，我们通过评估BERT、稠密段落检索（Dense Passage Retrieval, DPR）、Contriever、SimCSE和ANCE等模型，评估了稠密检索系统对中毒分词器的脆弱性。我们发现，监督模型如BERT和DPR在分词器受损时性能显著下降，而如ANCE等无监督模型则表现出更强的抗干扰能力。我们的实验表明，即使微小的扰动也能严重损害检索准确性，这突显了在关键应用中需要建立强有力的防御机制。

[NLP-87] Accelerating Direct Preference Optimization with Prefix Sharing NEURIPS2024

【速读】：该论文试图解决离线配对偏好优化算法在处理长共享提示任务时存在的冗余计算问题。解决方案的关键在于引入前缀共享（prefix sharing）技术，将选择和拒绝的响应作为一个带有共享前缀的序列进行处理，并采用自定义的块稀疏注意力掩码（block-sparse attention mask）来防止响应间的交叉污染。这种方法在不影响收敛性的前提下，显著提升了训练吞吐量，尤其在与序列打包（sequence packing）结合时，能实现1.3到1.6倍的加速，适用于各种序列长度的数据集。该方法不仅限于直接偏好优化（Direct Preference Optimization, DPO），还可应用于其他配对偏好调优方法，从而提高基于偏好的微调在更广泛应用和模型规模中的可访问性。

链接: https://arxiv.org/abs/2410.20305
作者: Franklin Wang,Sumanth Hegde
关键词-EN: outperforming traditional supervised, Offline paired preference, Offline paired, preference optimization algorithms, Offline
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: To appear in NeurIPS 2024 in the Fine-Tuning in Machine Learning Workshop

点击查看摘要

Abstract:Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computations, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves 1.1 - 1.5\times improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent 1.3 - 1.6\times speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes. We open-source our code at this https URL.
摘要：离线配对偏好优化算法已成为在偏好数据上进行微调的热门方法，在各种任务中优于传统的监督微调。然而，传统实现往往涉及冗余计算，特别是在共享提示较长的任务中。我们引入了偏好调优的前缀共享技术，这是一种新颖的方法，将选定和拒绝的响应作为一个带有共享前缀的序列进行处理。为防止响应间的交叉污染，我们使用了自定义的块稀疏注意力掩码。我们的方法在流行的 DPO 数据集上实现了 1.1 至 1.5 倍的训练吞吐量提升，且对收敛性无任何影响。结合序列打包技术，我们观察到一致的 1.3 至 1.6 倍加速，即使对于序列长度较小的数据集也有益。尽管我们专注于直接偏好优化 (DPO)，但我们的方法同样适用于其他配对偏好调优方法。通过提高计算效率，我们的工作有助于使基于偏好的微调更广泛地适用于各种应用和模型规模。我们在该 https URL 上开源了我们的代码。

[NLP-88] Sequential Large Language Model-Based Hyper-Parameter Optimization

【速读】：该论文试图解决超参数优化 (Hyperparameter Optimization, HPO) 中传统方法和基于大语言模型 (Large Language Models, LLMs) 方法的局限性问题。解决方案的关键在于提出了一个创新的框架 SLLMBO，该框架结合了动态搜索空间适应性、增强的参数景观利用以及一种混合的、新颖的 LLM-树结构Parzen估计器 (LLM-Tree-structured Parzen Estimator, LLM-TPE) 采样器。SLLMBO 通过集成 LLMs 在参数初始化方面的优势和 TPE 的探索能力，实现了探索-利用的平衡，降低了 API 成本，并减少了过早停止，从而在多个分类和回归任务中显著提升了优化效果。

链接: https://arxiv.org/abs/2410.20302
作者: Kanan Mahammadli
关键词-EN: Large Language Models, leverages Large Language, Parzen Estimator, Language Models, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study introduces SLLMBO, an innovative framework that leverages Large Language Models (LLMs) for hyperparameter optimization (HPO), incorporating dynamic search space adaptability, enhanced parameter landscape exploitation, and a hybrid, novel LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By addressing limitations in recent fully LLM-based methods and traditional Bayesian Optimization (BO), SLLMBO achieves more robust optimization. This comprehensive benchmarking evaluates multiple LLMs, including GPT-3.5-turbo, GPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond GPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a diverse set of LLMs for HPO. By integrating LLMs’ established strengths in parameter initialization with the exploitation abilities demonstrated in this study, alongside TPE’s exploration capabilities, the LLM-TPE sampler achieves a balanced exploration-exploitation trade-off, reduces API costs, and mitigates premature early stoppings for more effective parameter searches. Across 14 tabular tasks in classification and regression, the LLM-TPE sampler outperformed fully LLM-based methods and achieved superior results over BO methods in 9 tasks. Testing early stopping in budget-constrained scenarios further demonstrated competitive performance, indicating that LLM-based methods generally benefit from extended iterations for optimal results. This work lays the foundation for future research exploring open-source LLMs, reproducibility of LLM results in HPO, and benchmarking SLLMBO on complex datasets, such as image classification, segmentation, and machine translation.
摘要：本研究引入了 SLLMBO，这是一个创新框架，利用大语言模型 (LLM) 进行超参数优化 (HPO)，结合了动态搜索空间适应性、增强的参数景观利用以及一种混合的新型 LLM-树结构 Parzen 估计器 (LLM-TPE) 采样器。通过解决近期全 LLM 方法和传统贝叶斯优化 (BO) 的局限性，SLLMBO 实现了更稳健的优化。该综合基准测试评估了多个 LLM，包括 GPT-3.5-turbo、GPT-4o、Claude-Sonnet-3.5 和 Gemini-1.5-flash，扩展了先前工作超越 GPT-3.5 和 GPT-4，并确立了 SLLMBO 作为首个为 HPO 基准测试多样化 LLM 集合的框架。通过整合 LLM 在参数初始化方面的既有优势与本研究展示的利用能力，以及 TPE 的探索能力，LLM-TPE 采样器实现了探索-利用的平衡，降低了 API 成本，并减少了预算受限场景下的过早停止，从而更有效地进行参数搜索。在 14 个表格任务（分类和回归）中，LLM-TPE 采样器优于全 LLM 方法，并在 9 个任务中取得了优于 BO 方法的结果。在预算受限场景下测试早期停止进一步展示了其竞争性能，表明基于 LLM 的方法通常受益于延长迭代以获得最佳结果。本研究为未来探索开源 LLM、LLM 结果在 HPO 中的可重复性以及在复杂数据集（如图像分类、分割和机器翻译）上基准测试 SLLMBO 奠定了基础。

[NLP-89] Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

【速读】：该论文试图解决文本去毒化（text detoxification）任务中使用非平行数据（non-parallel data）进行微调时面临的偏好不完全问题。解决方案的关键在于提出了Stackelberg响应优化（Stackelberg Response Optimization, SRO），这是一种基于直接偏好优化（Direct Preference Optimization, DPO）的改进方法。SRO通过模拟LLM与毒性筛查器（toxicity screener）之间的Stackelberg博弈，使得LLM能够根据筛查器的反馈进行学习。具体来说，SRO在生成失败的情况下降低生成概率，而在生成成功的情况下执行DPO，从而有效解决了非平行数据微调中的偏好不完全问题，显著提升了去毒化效果，使其在风格准确性、内容相似度和流畅性方面达到或超越了现有最先进模型。

链接: https://arxiv.org/abs/2410.20298
作者: Xinhong Xie,Tao Li,Quanyan Zhu
关键词-EN: online social media, style transfer tasks, Text detoxification, LLM, transfer tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that only uses non-parallel data to turn large language models (LLM) into a detoxification rewritter. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference according to the screener and generate paraphases passing the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the challenge of incomplete preference, we propose Stackelberg response optimization (SRO), adapted from DPO, to enable the LLM to learn from the follower’s response. The gist is that SRO decreases the likelihood of generating the paraphrase if it fails the follower’s screening while performing DPO on the pair of the toxic input and its paraphrase when the latter passes the screening. Experiments indicate that the SRO-fine-tunned LLM achieves satisfying performance comparable to state-of-the-art models regarding style accuracy, content similarity, and fluency. The overall detoxification performance surpasses other computing methods and matches the human reference. Additional empirical evidence suggests that SRO is sensitive to the screener’s feedback, and a slight perturbation leads to a significant performance drop. We release the code and LLM models at \urlthis https URL.
摘要：文本去毒化，作为风格迁移任务的一个变种，在在线社交媒体中具有广泛的应用价值。本研究提出了一种仅使用非平行数据对大语言模型 (LLM) 进行微调的方法，使其成为去毒化重写器。我们将微调过程建模为一个大语言模型 (领导者) 与一个毒性筛查器 (跟随者) 之间的 Stackelberg 博弈，其中毒性筛查器是一个二元风格分类器 (有毒或无毒)。大语言模型的目标是根据筛查器的偏好进行调整，并生成通过筛查的释义。非平行数据微调的主要挑战在于偏好不完整。在释义失败的情况下，分类器无法在输入和释义之间建立偏好，因为它们都属于同一有毒风格。因此，传统的偏好对齐微调方法，如直接偏好优化 (DPO)，不再适用。为解决偏好不完整的问题，我们提出了 Stackelberg 响应优化 (SRO)，该方法从 DPO 改编而来，使大语言模型能够从跟随者的响应中学习。其核心思想是，如果释义未能通过跟随者的筛查，SRO 会降低生成该释义的概率；而在释义通过筛查的情况下，对有毒输入及其释义进行 DPO 处理。实验结果表明，经过 SRO 微调的大语言模型在风格准确性、内容相似性和流畅性方面达到了令人满意的表现，与最先进的模型相当。总体去毒化性能优于其他计算方法，并达到了与人工参考相匹配的水平。进一步的实证证据表明，SRO 对筛查器的反馈敏感，轻微的扰动会导致显著的性能下降。我们已在 \urlthis https URL 上发布了代码和 LLM 模型。

[NLP-90] Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain

【速读】：该论文试图解决当前大型语言模型（LLMs）在军事领域应用中表现不佳的问题，主要原因是这些模型缺乏军事领域特定的词汇和术语。解决方案的关键在于通过微调（fine-tuning）开源LLMs，使其适应军事领域的需求。具体来说，研究团队创建了TRACLM系列模型，通过不断优化训练流程，逐步提升模型在军事任务中的表现。此外，为了客观评估模型的军事领域知识，研究团队开发了MilBench评估框架，该框架基于军事教义和评估任务，能够有效评估LLMs在军事领域的知识掌握情况。这些成果不仅为国防部（DoD）的LLM技术发展提供了重要参考，还为高级领导层在人工智能集成决策中提供了有力支持。

链接: https://arxiv.org/abs/2410.20297
作者: Daniel C. Ruiz,John Sell
关键词-EN: Large Language Models, Large Language, adoption of Large, Language Models, Army Futures Command
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:In recent years, the widespread adoption of Large Language Models (LLMs) has sparked interest in their potential for application within the military domain. However, the current generation of LLMs demonstrate sub-optimal performance on Army use cases, due to the prevalence of domain-specific vocabulary and jargon. In order to fully leverage LLMs in-domain, many organizations have turned to fine-tuning to circumvent the prohibitive costs involved in training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for usage in the Army domain in order to address their existing lack of domain-specificity. Our investigations have resulted in the creation of three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Furthermore, throughout our fine-tuning experiments, we recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations on the creation of TRACLM and MilBench. Our work significantly informs the development of LLM technology across the DoD and augments senior leader decisions with respect to artificial intelligence integration.
摘要：近年来，大语言模型 (LLM) 的广泛应用引发了其在军事领域应用潜力的关注。然而，当前一代的 LLM 在处理军队特定用例时表现不佳，主要原因是存在大量领域特定的词汇和术语。为了充分利用 LLM 在特定领域的应用，许多组织转向了微调 (fine-tuning) 以规避从头训练新 LLM 的高昂成本。鉴于这一趋势，我们探讨了将开源 LLM 适应于军队领域的可行性，以解决其现有缺乏领域特定性的问题。我们的研究成果是创建了三个不同版本的 TRACLM，这是由陆军未来司令部 (AFC) 的研究与分析中心 (TRAC) 微调的一系列 LLM。通过不断优化我们的训练流程，每个后续版本的 TRACLM 在应用于军队任务和用例时都表现出更强的能力。此外，在我们的微调实验过程中，我们认识到需要一个评估框架来客观量化 LLM 在军队领域的特定知识。为此，我们开发了 MilBench，这是一个可扩展的软件框架，能够使用从教义和评估中提取的任务高效评估给定 LLM 的军队知识。我们分享了关于 TRACLM 和 MilBench 创建的初步结果、模型、方法和建议。我们的工作显著推动了国防部 (DoD) 内 LLM 技术的发展，并增强了高级领导者在人工智能集成方面的决策。

[NLP-91] Fast Best-of-N Decoding via Speculative Rejection NEURIPS2024

【速读】：该论文试图解决大语言模型（LLMs）在推理阶段进行对齐（alignment）时计算资源消耗过大的问题。解决方案的关键是提出了一种名为“推测性拒绝”（Speculative Rejection）的推理时对齐算法。该算法能够在生成高评分响应的同时，比现有的最佳推理时对齐方法（如Best-of-N）提高16到32倍的计算效率，从而在保证对齐效果的前提下显著降低计算成本。

链接: https://arxiv.org/abs/2410.20290
作者: Hanshi Sun,Momin Haider,Ruiqi Zhang,Huitao Yang,Jiahao Qiu,Ming Yin,Mengdi Wang,Peter Bartlett,Andrea Zanette
关键词-EN: Large Language Models, Large Language, deployment of Large, Language Models, involves a critical
类目: Computation and Language (cs.CL)
备注: NeurIPS 2024

点击查看摘要

Abstract:The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model’s responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally not viable. In this work, we introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being between 16 to 32 times more computationally efficient.
摘要：大语言模型 (LLM) 的安全有效部署涉及一个关键步骤，称为对齐 (alignment)，以确保模型的响应符合人类偏好。目前流行的对齐技术，如 DPO、PPO 及其变体，通过在训练后阶段调整预训练模型的权重来对齐 LLM。尽管这些训练后方法占据主导地位，但它们在 LLM 部署前增加了大量复杂性。推理时对齐方法避免了复杂的训练后步骤，而是倾向于生成与人类偏好对齐的响应。最著名的推理时对齐方法，称为 Best-of-N，其效果与最先进的训练后程序相当。然而，Best-of-N 在推理时所需的资源远超标准解码策略，这使得它在计算上不可行。在本研究中，我们引入了推测性拒绝 (Speculative Rejection)，这是一种计算上可行的推理时对齐算法。它像 Best-of-N 一样根据给定的奖励模型生成高评分响应，同时在计算效率上提高了 16 到 32 倍。

[NLP-92] Library Learning Doesnt: The Curious Case of the Single-Use “Library” NEURIPS’24

【速读】：该论文试图解决的问题是：当前的大语言模型（LLMs）在数学推理任务中是否真正学习到了可重用的工具库。解决方案的关键在于通过实验验证两个数学推理系统（LEGO-Prover和TroVE）在miniF2F和MATH数据集上的表现，发现函数重用（function reuse）在这些系统中极为罕见。进一步的消融实验表明，性能提升的主要驱动因素并非工具的重用，而是自校正（self-correction）和自一致性（self-consistency）。

链接: https://arxiv.org/abs/2410.20274
作者: Ian Berlot-Attwell,Frank Rudzicz,Xujie Si
关键词-EN: Large Language Models, Advances in Large, LLM library learning, library learning systems, Language Models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注: 24 pages, 7 figures. Accepted to the 4th MATH-AI Workshop at NeurIPS’24

点击查看摘要

Abstract:Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim to learn a reusable library of tools, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our followup ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains. Our code and data are available at this https URL Comments: 24 pages, 7 figures. Accepted to the 4th MATH-AI Workshop at NeurIPS’24 Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Symbolic Computation (cs.SC) Cite as: arXiv:2410.20274 [cs.LG] (or arXiv:2410.20274v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.20274 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要：大语言模型 (LLM) 的进步催生了一系列用于数学推理的 LLM 库学习系统。这些系统旨在学习一组可重用的工具库，例如针对一系列任务定制的正式 Isabelle 引理或 Python 程序。许多此类系统受到人类将知识结构化为可重用和可扩展概念的启发，但当前的方法是否真正学习了可重用的工具库？我们研究了两种数学库学习系统，它们均报告了准确率的提升：LEGO-Prover 和 TroVE。我们发现，在 miniF2F 和 MATH 数据集上，函数重用极为罕见。后续的消融实验表明，观察到的性能提升主要驱动因素并非重用，而是自我修正和自我一致性。我们的代码和数据可通过以下链接获取：https URL。

评论：24 页，7 幅图。已被 NeurIPS’24 的第 4 届 MATH-AI 研讨会接受。
主题：机器学习 (cs.LG)；计算与语言 (cs.CL)；符号计算 (cs.SC)
引用为：arXiv:2410.20274 [cs.LG]（或 arXiv:2410.20274v1 [cs.LG] 用于此版本）
https://doi.org/10.48550/arXiv.2410.20274
通过 DataCite 发布的 arXiv DOI（待注册）

[NLP-93] Improving Model Evaluation using SMART Filtering of Benchmark Datasets

【速读】：该论文试图解决自然语言处理 (NLP) 领域中评估基准数据集的质量和多样性问题，特别是基准饱和、数据污染和测试样本质量多样性不足的问题。解决方案的关键在于提出了一种名为“准确、精简、目标化筛选方法 (SMART)”的新方法，通过系统性地移除低信息量和低挑战性的样本，从现有基准数据集中筛选出高质量的子集。SMART 方法应用了三种过滤标准：移除简单样本、数据污染样本以及在嵌入空间中相似度高的样本。实验结果表明，SMART 方法在多个选择题问答数据集上显著提高了评估效率，平均减少了 48% 的数据集规模，同时增强了与 ChatBot Arena 这种开放式人类评估设置的皮尔逊相关性，从而在保持模型相对排名的同时，提升了数据集的挑战性和实用性。

链接: https://arxiv.org/abs/2410.20245
作者: Vipul Gupta,Candace Ross,David Pantoja,Rebecca J. Passonneau,Megan Ung,Adina Williams
关键词-EN: problems facing NLP, facing NLP today, facing NLP, NLP today, challenging problems facing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether using SMART to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
摘要：当前自然语言处理（NLP）面临的最具挑战性的问题之一是评估。其中一些最紧迫的问题涉及基准测试的饱和度、数据污染以及测试样本质量的多样性。为了应对这些挑战，我们提出了准确、简化、目标化的选择方法（SMART）过滤，这是一种通过系统地移除信息量较少和难度较低的样本，从现有基准数据集中选择高质量子集的新方法。我们的方法应用了三种过滤标准：（i）移除简单样本，（ii）移除数据污染样本，以及（iii）基于嵌入空间中的距离移除相似样本。我们在三个多项选择问答数据集上展示了SMART的有效性，我们的方法平均减少了48%的数据集规模，同时提高了与ChatBot Arena（一种更开放的人类评估设置）排名之间的Pearson相关性。无论是在使新基准更具挑战性还是使旧数据集焕发新生方面，我们的方法都能提高效率，同时保持模型排名的相对性。

[NLP-94] A Survey of Large Language Models for Arabic Language and its Dialects

【速读】：该论文试图解决阿拉伯语及其方言的大型语言模型（LLMs）的全面概述问题。解决方案的关键在于系统地分析了不同架构的LLMs（包括仅编码器、仅解码器和编码器-解码器模型），并详细探讨了用于预训练的数据集（涵盖古典阿拉伯语、现代标准阿拉伯语和方言阿拉伯语）。此外，论文还评估了单语、双语和多语种LLMs在下游任务（如情感分析、命名实体识别和问答）中的性能，并强调了模型开放性的重要性，包括源代码、训练数据、模型权重和文档的可用性。论文指出，需要更多多样化的方言数据集，并强调开放性对于研究的可重复性和透明性的重要性。

链接: https://arxiv.org/abs/2410.20238
作者: Malak Mashaabi,Shahad Al-Khalifa,Hend Al-Khalifa
关键词-EN: Large Language Models, Large Language, Modern Standard Arabic, overview of Large, Arabic language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.
摘要：本综述全面概述了针对阿拉伯语及其方言设计的大语言模型 (LLM)。内容涵盖了关键的模型架构，包括仅编码器、仅解码器和编码器-解码器模型，以及用于预训练的数据集，这些数据集涵盖了古典阿拉伯语、现代标准阿拉伯语和方言阿拉伯语。研究还探讨了单语、双语和多语种 LLM，分析了它们在下游任务（如情感分析、命名实体识别和问答）中的架构和性能。此外，本研究还评估了阿拉伯语 LLM 的开源性，基于源代码可用性、训练数据、模型权重和文档等因素。综述强调了更多样化的方言数据集的必要性，并强调了开源性对研究可重复性和透明度的重要性。最后，文章指出了未来研究的关键挑战和机遇，并强调了构建更具包容性和代表性模型的必要性。

[NLP-95] Ambiguity is the last thing you need

【速读】：该论文试图解决合同中因语言模糊性导致的争议和法律纠纷问题。解决方案的关键在于使用清晰、明确的语言（plain language），以减少误解和未来的诉讼风险。论文指出，合同语言的模糊性通常源于英语中大量的同义词和多义词，导致各方对合同条款的理解产生差异。通过采用明确的语言，可以有效降低合同解释的不确定性，从而减少法律纠纷的发生。

链接: https://arxiv.org/abs/2410.20222
作者: Emily Chivers,Shawn Curran
关键词-EN: Clear legal language, Clear legal, numerous reasons, forms the backbone, language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Clear legal language forms the backbone of a contract for numerous reasons. Disputes often arise between contract parties where ambiguous language has been used and parties often disagree on the meaning or effect of the words. Unambiguous language can also be important where there is an imbalance of bargaining strength between the parties, for instance where a business is contracting with a consumer, where the law actually requires plain language to be used. Thus, plain language minimises misinterpretation and prevents future litigation. Contracts become ambiguous when the language used is vague, imprecise, or open to multiple interpretations and this is due to the vast number of synonyms in the English Language which creates differences in interpretation between the meaning of the language. Ambiguity has always formed a prevalent issue in case-law, with a large percentage of cases based on ambiguous language. Thus, from an outside perspective the legal sector should look forward to ways of reducing this.
摘要：清晰的法律语言构成了合同的核心，原因众多。合同双方在使用模糊语言时常常产生争议，各方往往对词语的含义或效果存在分歧。在双方议价能力不平衡的情况下，明确的语言也显得尤为重要，例如在企业与消费者签订合同时，法律实际上要求使用简洁的语言。因此，简洁的语言可以减少误解并防止未来的诉讼。当使用的语言模糊、不精确或存在多种解释时，合同就会变得模糊不清，这主要是由于英语中存在大量同义词，导致语言含义的解释差异。模糊性一直是判例法中的一个普遍问题，很大一部分案件基于模糊的语言。因此，从外部视角来看，法律界应寻求减少模糊性的方法。

[NLP-96] Generative linguistics contribution to artificial intelligence: Where this contribution lies?

【速读】：该论文试图解决生成语言学 (Generative linguistics) 对人工智能 (AI) 的贡献问题，特别是在语言学是否应归属于人文科学还是自然科学的争议背景下。解决方案的关键在于从科学的角度独立分析生成语言学，特别是乔姆斯基学派 (Chomsky School)，对AI的贡献。论文通过探讨语法、语义、语言能力、普遍语法、人类语言的计算系统、语言习得、人脑、编程语言（如Python）、大型语言模型以及AI科学家的观点，提供了大量证据表明生成语言学对AI的贡献巨大且不可忽视。然而，论文也指出，尽管生成语言学对AI有巨大贡献，但在语言输入的性质和类型上仍存在分歧。

链接: https://arxiv.org/abs/2410.20221
作者: Mohammed Q. Shormani(Ibb University, University of Cyprus)
关键词-EN: characterize Generative linguistics, Generative linguistics, characterize Generative, artificial intelligence, humanities or science
类目: Computation and Language (cs.CL)
备注: 28 pages, 3 figures

点击查看摘要

Abstract:This article aims to characterize Generative linguistics (GL) contribution to artificial intelligence (AI), alluding to the debate among linguists and AI scientists on whether linguistics belongs to humanities or science. In this article, I will try not to be biased as a linguist, studying the phenomenon from an independent scientific perspective. The article walks the researcher/reader through the scientific theorems and rationales involved in AI which belong from GL, specifically the Chomsky School. It, thus, provides good evidence from syntax, semantics, language faculty, Universal Grammar, computational system of human language, language acquisition, human brain, programming languages (e.g. Python), Large Language Models, and unbiased AI scientists that this contribution is huge, and that this contribution cannot be denied. It concludes that however the huge GL contribution to AI, there are still points of divergence including the nature and type of language input."
摘要：本文旨在阐述生成语言学 (Generative Linguistics) 对人工智能 (AI) 的贡献，并涉及语言学家与 AI 科学家之间关于语言学属于人文还是科学的争论。本文试图以独立科学的角度研究这一现象，避免作为语言学家的偏见。文章引导研究者/读者了解 AI 中源自生成语言学，特别是乔姆斯基学派的科学定理和理论基础。通过语法、语义、语言能力、普遍语法、人类语言的计算系统、语言习得、人脑、编程语言（如 Python）、大语言模型以及无偏见的 AI 科学家提供的证据，本文展示了生成语言学对 AI 的巨大贡献，并指出这一贡献不可否认。文章最终得出结论，尽管生成语言学对 AI 的贡献巨大，但在语言输入的性质和类型上仍存在分歧。

[NLP-97] Pseudo-Label Enhanced Prototypical Contrastive Learning for Uniformed Intent Discovery EMNLP2024

【速读】：该论文试图解决面向任务的对话系统中新意图发现的问题，特别是在处理领域内（IND）和领域外（OOD）数据时，现有方法在表示学习和聚类过程中存在差距，以及在开放意图发现和OOD设置中单独处理的问题。解决方案的关键在于提出了一个伪标签增强的原型对比学习（PLPCL）模型，通过迭代利用伪标签来探索潜在的正负样本，从而弥合表示学习和聚类之间的差距。此外，通过设计一种结合IND和OOD样本的监督信号和伪信号的原型学习方法，实现了更好的知识迁移。该方法在两种不同的新意图发现设置中均被证明有效。

链接: https://arxiv.org/abs/2410.20219
作者: Yimin Deng,Yuxia Wu,Guoshuai Zhao,Li Zhu,Xueming Qian
关键词-EN: task-oriented dialogue systems, dialogue systems, crucial capability, capability for task-oriented, task-oriented dialogue
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 Findings

点击查看摘要

Abstract:New intent discovery is a crucial capability for task-oriented dialogue systems. Existing methods focus on transferring in-domain (IND) prior knowledge to out-of-domain (OOD) data through pre-training and clustering stages. They either handle the two processes in a pipeline manner, which exhibits a gap between intent representation and clustering process or use typical contrastive clustering that overlooks the potential supervised signals from the whole data. Besides, they often individually deal with open intent discovery or OOD settings. To this end, we propose a Pseudo-Label enhanced Prototypical Contrastive Learning (PLPCL) model for uniformed intent discovery. We iteratively utilize pseudo-labels to explore potential positive/negative samples for contrastive learning and bridge the gap between representation and clustering. To enable better knowledge transfer, we design a prototype learning method integrating the supervised and pseudo signals from IND and OOD samples. In addition, our method has been proven effective in two different settings of discovering new intents. Experiments on three benchmark datasets and two task settings demonstrate the effectiveness of our approach.
摘要：新意图发现是面向任务的对话系统中的一项关键能力。现有方法主要通过预训练和聚类阶段将领域内（IND）的先验知识迁移到领域外（OOD）数据。这些方法要么以流水线方式处理这两个过程，导致意图表示与聚类过程之间存在差距，要么使用典型的对比聚类方法，忽略了整个数据中潜在的监督信号。此外，它们通常单独处理开放意图发现或OOD设置。为此，我们提出了一种伪标签增强的原型对比学习（PLPCL）模型，用于统一的意图发现。我们迭代地利用伪标签来探索对比学习的潜在正负样本，并弥合表示与聚类之间的差距。为了实现更好的知识迁移，我们设计了一种结合IND和OOD样本的监督信号和伪信号的原型学习方法。此外，我们的方法在两种不同的新意图发现设置中已被证明是有效的。在三个基准数据集和两种任务设置上的实验证明了我们方法的有效性。

[NLP-98] DAWN-ICL: Strategic Planning of Problem-solving Trajectories for Zero-Shot In-Context Learning

【速读】：该论文试图解决零样本上下文学习（Zero-shot in-context learning, ZS-ICL）在实际应用中遇到的问题，即当问题来自不同任务时，随机遍历顺序可能导致不可靠的伪演示（pseudo-demonstrations）生成和错误累积。解决方案的关键在于将ZS-ICL重新构想为一个规划问题，并提出了一种基于蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）的演示感知方法（Demonstration-aware Monte Carlo Tree Search, DAWN-ICL）。DAWN-ICL通过MCTS策略性地规划问题解决轨迹，并引入了一种新的演示感知Q值函数（demonstration-aware Q-value function），以增强选择阶段并加速MCTS的扩展和模拟阶段，从而提高ZS-ICL在域内和跨域场景中的有效性和效率。

链接: https://arxiv.org/abs/2410.20215
作者: Xinyu Tang,Xiaolei Wang,Wayne Xin Zhao,Ji-Rong Wen
关键词-EN: Zero-shot in-context learning, conduct in-context learning, in-context learning, Zero-shot in-context, conduct in-context
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Zero-shot in-context learning (ZS-ICL) aims to conduct in-context learning (ICL) without using human-annotated demonstrations. Most ZS-ICL methods use large language models (LLMs) to generate (input, label) pairs as pseudo-demonstrations and leverage historical pseudo-demonstrations to help solve the current problem. They assume that problems are from the same task and traverse them in a random order. However, in real-world scenarios, problems usually come from diverse tasks, and only a few belong to the same task. The random traversing order may generate unreliable pseudo-demonstrations and lead to error accumulation. To address this problem, we reformulate ZS-ICL as a planning problem and propose a Demonstration-aware Monte Carlo Tree Search (MCTS) approach (DAWN-ICL), which leverages MCTS to strategically plan the problem-solving trajectories for ZS-ICL. In addition, to achieve effective and efficient Q value estimation, we propose a novel demonstration-aware Q-value function and use it to enhance the selection phase and accelerate the expansion and simulation phases in MCTS. Extensive experiments demonstrate the effectiveness and efficiency of DAWN-ICL on in-domain and cross-domain scenarios, and it even outperforms ICL using human-annotated labels. The code is available at this https URL.
摘要：零样本上下文学习（Zero-shot in-context learning, ZS-ICL）旨在不使用人工标注的示例进行上下文学习（In-context learning, ICL）。大多数 ZS-ICL 方法利用大语言模型（Large Language Models, LLMs）生成（输入，标签）对作为伪示例，并利用历史伪示例来辅助解决当前问题。这些方法假设问题来自同一任务，并以随机顺序遍历它们。然而，在现实场景中，问题通常来自不同的任务，只有少数属于同一任务。随机遍历顺序可能会生成不可靠的伪示例，并导致错误累积。为了解决这一问题，我们将 ZS-ICL 重新表述为一个规划问题，并提出了一种示例感知的蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）方法（DAWN-ICL），该方法利用 MCTS 策略性地规划 ZS-ICL 的问题解决轨迹。此外，为了实现有效且高效的 Q 值估计，我们提出了一种新颖的示例感知 Q 值函数，并将其用于增强选择阶段，并加速 MCTS 中的扩展和模拟阶段。大量实验证明了 DAWN-ICL 在领域内和跨领域场景中的有效性和效率，甚至在某些情况下优于使用人工标注标签的 ICL。代码可在以下链接获取：https URL。

[NLP-99] Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

【速读】：该论文试图解决的问题是理解Transformer模型在预测过程中如何以及何时确定每个token的排名，特别是在“饱和事件”（saturation event）发生后，模型如何逐步确定top-k token的顺序。解决方案的关键在于提出了一个任务过渡机制（task transition mechanism），认为模型在预测过程中通过一系列离散的任务过渡来逐步确定每个token的排名，其中每个任务对应于预测第k个最可能的token。论文通过实验验证了这一机制，并展示了如何从隐藏层嵌入中预测当前任务，以及如何通过干预方法促使模型从一个任务过渡到下一个任务。最终，基于这一发现，论文提出了一种新的token级早期退出策略（token-level early-exit strategy），该策略在平衡性能和效率方面优于现有方法。

链接: https://arxiv.org/abs/2410.20210
作者: Daria Lioubashevski,Tomer Schlank,Gabriel Stanovsky,Ariel Goldstein
关键词-EN: saturation events, crucial for achieving, achieving more accurate, accurate and efficient, saturation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the “saturation event”. We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden layer embedding. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
摘要：理解 Transformer 的内部工作机制对于实现更准确和高效的预测至关重要。在本研究中，我们分析了 Transformer 在顶部预测 Token 固定后各层中的计算过程，这一现象先前被称为“饱和事件”。我们将饱和事件的概念扩展到前 k 个 Token，证明在语言、视觉和语音模型中均存在类似的饱和事件。我们发现这些饱和事件按照相应 Token 的排名顺序发生，即模型首先确定排名最高的 Token，然后是第二高排名的 Token，依此类推。这种现象似乎是 Transformer 架构的内在特性，出现在不同的架构变体（仅解码器、仅编码器，以及在较小程度上全 Transformer）中，甚至在未经训练的 Transformer 中也能观察到。我们提出了一种任务转换的底层机制来解释这种顺序饱和现象，其中任务 k 对应于预测第 k 个最可能的 Token，而饱和事件实际上是任务之间的离散转换。为了支持这一观点，我们展示了从隐藏层嵌入中预测当前任务的可能性。此外，通过干预方法，我们证明了可以使模型从一个任务切换到下一个任务。最后，利用我们的发现，我们引入了一种新颖的 Token 级早期退出策略，该策略在平衡性能和效率方面超越了现有方法。

[NLP-100] Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLM s EMNLP

【速读】：该论文试图解决的问题是评估大型语言模型（LLMs）在解决组合性问题时是否真正进行了逻辑推理，还是仅仅依赖于隐含的线索来生成答案。解决方案的关键在于通过操纵两个组合性数据集（QASC和Bamboogle）中的事实，控制可能影响模型性能的潜在线索，包括词/短语重叠、模型预训练或微调时的固有知识以及命名实体。研究发现，尽管两种模型（LLaMA 2和Flan-T5）都利用了词/短语重叠，但Flan-T5在应对固有知识和命名实体的变化时表现出更强的适应性，这表明模型可能通过在相关数据集上的微调来发展对传递性的理解。

链接: https://arxiv.org/abs/2410.20200
作者: Houman Mehrafarin,Arash Eshghi,Ioannis Konstas
关键词-EN: Evaluating Large Language, Large Language Models, Evaluating Large, Large Language, solve compositional questions
类目: Computation and Language (cs.CL)
备注: To appear in EMNLP Main 2024

点击查看摘要

Abstract:Evaluating Large Language Models (LLMs) on reasoning benchmarks demonstrates their ability to solve compositional questions. However, little is known of whether these models engage in genuine logical reasoning or simply rely on implicit cues to generate answers. In this paper, we investigate the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. We controlled for potential cues that might influence the models’ performance, including (a) word/phrase overlaps across sections of test input; (b) models’ inherent knowledge during pre-training or fine-tuning; and © Named Entities. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments (b and c), having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on knowingly relevant datasets, a hypothesis we leave to future work.
摘要：在推理基准上评估大语言模型 (LLM) 的能力，展示了它们解决组合性问题的能力。然而，目前尚不清楚这些模型是否真正进行了逻辑推理，还是仅仅依赖于隐含的线索来生成答案。本文通过操纵两个组合性数据集（QASC 和 Bamboogle）中的事实，研究了两种不同 LLM 架构（LLaMA 2 和 Flan-T5）的传递推理能力。我们控制了可能影响模型性能的潜在线索，包括：(a) 测试输入各部分之间的词语/短语重叠；(b) 模型在预训练或微调期间固有的知识；以及 © 命名实体。研究结果表明，尽管两种模型都利用了 (a)，但 Flan-T5 在实验 (b 和 c) 中表现出更强的韧性，其变异性小于 LLaMA 2。这表明，模型可能通过在已知相关数据集上进行微调来发展对传递性的理解，这一假设留待未来工作进一步验证。

[NLP-101] Enhancing Inflation Nowcasting with LLM : Sentiment Analysis on News

【速读】：该论文试图解决在高通胀波动时期（如COVID-19疫情），如何提高通胀即时预测（nowcasting）的准确性问题。解决方案的关键在于提出了InflaBERT，这是一个基于BERT的大语言模型（LLM），经过微调以预测新闻中与通胀相关的情绪。通过使用InflaBERT生成的NEWS指数（一个捕捉新闻中每月通胀情绪的指数），并将其整合到克利夫兰联储的传统宏观经济自回归模型中，论文展示了在疫情期间模型预测准确性的边际提升。这一结果强调了将情绪分析与传统经济指标结合的潜力，为改进实时通胀监测方法提供了新的研究方向。

链接: https://arxiv.org/abs/2410.20198
作者: Marc-Antoine Allard,Paul Teiletche,Adam Zinebi
关键词-EN: inflation nowcasting frameworks, large language models, classic inflation nowcasting, high inflation volatility, inflation volatility periods
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study explores the integration of large language models (LLMs) into classic inflation nowcasting frameworks, particularly in light of high inflation volatility periods such as the COVID-19 pandemic. We propose InflaBERT, a BERT-based LLM fine-tuned to predict inflation-related sentiment in news. We use this model to produce NEWS, an index capturing the monthly sentiment of the news regarding inflation. Incorporating our expectation index into the Cleveland Fed’s model, which is only based on macroeconomic autoregressive processes, shows a marginal improvement in nowcast accuracy during the pandemic. This highlights the potential of combining sentiment analysis with traditional economic indicators, suggesting further research to refine these methodologies for better real-time inflation monitoring. The source code is available at this https URL.
摘要：本研究探讨了将大语言模型 (LLM) 整合到经典的通货膨胀实时预测框架中，特别是在高通胀波动期如 COVID-19 大流行期间的应用。我们提出了 InflaBERT，这是一个基于 BERT 的 LLM，经过微调以预测新闻中与通货膨胀相关的情绪。我们使用该模型生成了 NEWS 指数，该指数捕捉了新闻中关于通货膨胀的月度情绪。将我们的预期指数纳入克利夫兰联储的模型（该模型仅基于宏观经济自回归过程），在疫情期间显示出实时预测准确性的边际提升。这突显了将情绪分析与传统经济指标结合的潜力，并建议进一步研究以优化这些方法，从而实现更好的实时通货膨胀监测。源代码可在以下链接获取：https URL。

[NLP-102] LLM s Can Evolve Continually on Modality for X-Modal Reasoning

【速读】：该论文试图解决多模态大语言模型（MLLMs）在扩展到新模态时面临的计算负担问题。现有的方法依赖于大量的模态特定预训练和联合模态调优，这导致在扩展到新模态时计算成本显著增加。论文提出的解决方案是PathWeave框架，其关键在于模态路径切换和扩展能力，通过持续学习（Continual Learning）和增量训练策略，使得MLLMs能够利用单模态数据进行新模态的扩展，而无需执行联合模态预训练。具体来说，论文引入了一种新颖的Adapter-in-Adapter（AnA）框架，通过无缝集成单模态和跨模态适配器，实现高效的模态对齐和协作。此外，基于专家混合（MoE）的门控模块被应用于两种适配器之间，以进一步增强多模态交互。实验结果表明，PathWeave在保持与现有最先进MLLMs相当性能的同时，显著减少了参数训练的负担。

链接: https://arxiv.org/abs/2410.20178
作者: Jiazuo Yu,Haomiao Xiong,Lu Zhang,Haiwen Diao,Yunzhi Zhuge,Lanqing Hong,Dong Wang,Huchuan Lu,You He,Long Chen
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, gained significant attention
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for \mathbbX -modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code locates at this https URL
摘要：多模态大语言模型 (Multimodal Large Language Models, MLLMs) 因其卓越的多模态理解能力而备受关注。然而，现有方法严重依赖于广泛的模态特定预训练和联合模态调优，这在新模态扩展时带来了显著的计算负担。本文提出 PathWeave，一个灵活且可扩展的框架，具备模态路径切换和扩展能力，使 MLLMs 能够在 \mathbbX 模态推理中持续进化。我们利用持续学习 (Continual Learning) 的概念，在预训练的 MLLMs 基础上开发了一种增量训练策略，使其能够使用单模态数据扩展到新模态，而无需执行联合模态预训练。具体而言，我们引入了一种新颖的 Adapter-in-Adapter (AnA) 框架，其中单模态和跨模态适配器无缝集成，以促进高效的模态对齐和协作。此外，在两种适配器之间应用了基于门控机制 (MoE-based gating module) 的模块，进一步增强了多模态交互。为了验证所提出的方法，我们建立了一个名为模态持续学习 (Continual Learning of Modality, MCL) 的挑战性基准，该基准包含来自五种不同模态（图像、视频、音频、深度和点云）的高质量问答数据。大量实验表明，所提出的 AnA 框架在持续学习过程中在学习和记忆稳定性方面表现出色。此外，PathWeave 在性能上与最先进的 MLLMs 相当，同时将参数训练负担减少了 98.73%。我们的代码位于此 https URL。

[NLP-103] A Stack-Propagation Framework for Low-Resource Personalized Dialogue Generation

【速读】：该论文试图解决在有限的个性化对话数据下，如何构建一个能够生成自然且一致的对话响应的开放领域对话系统的问题。解决方案的关键在于提出了一种新颖的堆叠传播框架 (stack-propagation framework)，该框架通过将Transformer编码器与两个Transformer解码器堆叠，实现了生成与一致性理解的双重任务。第一个解码器负责响应生成，而第二个解码器则作为正则化器，共同建模响应生成和一致性理解。这种设计使得模型能够在较小的个性化对话数据集上进行有效训练，同时保持较高的响应质量和一致性，从而克服了传统模型依赖大量个性化对话数据的局限性。

链接: https://arxiv.org/abs/2410.20174
作者: Haoyu Song,Wei-Nan Zhang,Kaiyan Zhang,Ting Liu
关键词-EN: attracted increasing attention, building open-domain dialogue, open-domain dialogue systems, past few years, resurgent interest
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: published as a journal paper at ACM Transactions on Information Systems 2023. 35 pages, 5 figures

点击查看摘要

Abstract:With the resurgent interest in building open-domain dialogue systems, the dialogue generation task has attracted increasing attention over the past few years. This task is usually formulated as a conditional generation problem, which aims to generate a natural and meaningful response given dialogue contexts and specific constraints, such as persona. And maintaining a consistent persona is essential for the dialogue systems to gain trust from the users. Although tremendous advancements have been brought, traditional persona-based dialogue models are typically trained by leveraging a large number of persona-dense dialogue examples. Yet, such persona-dense training data are expensive to obtain, leading to a limited scale. This work presents a novel approach to learning from limited training examples by regarding consistency understanding as a regularization of response generation. To this end, we propose a novel stack-propagation framework for learning a generation and understanding this http URL, the framework stacks a Transformer encoder and two Transformer decoders, where the first decoder models response generation and the second serves as a regularizer and jointly models response generation and consistency understanding. The proposed framework can benefit from the stacked encoder and decoders to learn from much smaller personalized dialogue data while maintaining competitive performance. Under different low-resource settings, subjective and objective evaluations prove that the stack-propagation framework outperforms strong baselines in response quality and persona consistency and largely overcomes the shortcomings of traditional models that rely heavily on the persona-dense dialogue data.
摘要：随着近年来对构建开放域对话系统的兴趣再度兴起，对话生成任务受到了越来越多的关注。该任务通常被定义为一个条件生成问题，旨在根据对话上下文和特定约束（如角色）生成自然且有意义的响应。保持一致的角色设定对于对话系统赢得用户信任至关重要。尽管取得了巨大的进展，但传统的基于角色的对话模型通常通过利用大量角色密集的对话样本来进行训练。然而，这种角色密集的训练数据获取成本高昂，导致数据规模有限。本文提出了一种新颖的方法，通过将一致性理解视为响应生成的正则化，从有限的训练样本中学习。为此，我们提出了一种新颖的堆叠传播框架，用于学习生成和理解响应。该框架堆叠了一个 Transformer 编码器和两个 Transformer 解码器，其中第一个解码器用于建模响应生成，第二个解码器作为正则化器，共同建模响应生成和一致性理解。所提出的框架能够从更小的个性化对话数据中学习，同时保持竞争性能。在不同的低资源设置下，主观和客观评估均证明，堆叠传播框架在响应质量和角色一致性方面优于强基线，并很大程度上克服了传统模型严重依赖角色密集对话数据的缺点。

[NLP-104] UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers

【速读】：该论文试图解决现有信息检索（IR）模型在处理异构知识源和多样化用户指令时的局限性问题。解决方案的关键在于引入了一个统一的指令感知异构知识检索器（UniHGKR），其核心创新包括：(1) 构建一个统一的检索空间来处理异构知识；(2) 遵循多样化的用户指令来检索特定类型的知识。UniHGKR通过三个主要阶段实现这一目标：异构自监督预训练、文本锚定嵌入对齐和指令感知检索器微调，从而使其能够在不同的检索场景中泛化应用。此外，论文还引入了首个原生异构知识检索基准（CompMix-IR），并通过广泛的实验验证了UniHGKR在CompMix-IR上的优越性能，以及在开放域异构问答系统中的最新成果。

链接: https://arxiv.org/abs/2410.20163
作者: Dehai Min,Zhiyang Xu,Guilin Qi,Lifu Huang,Chenyu You
关键词-EN: Existing information retrieval, Existing information, limiting their applicability, assume a homogeneous, homogeneous structure
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 4.80 points.
摘要：现有的信息检索（IR）模型通常假设知识源和用户查询具有同质结构，这限制了其在真实世界中的适用性，因为在实际应用中，检索任务本质上是异质且多样的。本文提出了一种名为 UniHGKR 的统一指令感知异质知识检索器，该系统（1）构建了一个统一的异质知识检索空间，（2）遵循多样化的用户指令以检索特定类型的知识。UniHGKR 包含三个主要阶段：异质自监督预训练、文本锚定嵌入对齐以及指令感知检索器微调，使其能够在不同的检索场景中泛化应用。该框架具有高扩展性，包括基于 BERT 的版本和在大型语言模型上训练的 UniHGKR-7B 版本。此外，我们引入了 CompMix-IR，这是首个原生的异质知识检索基准。它包含两个检索场景，涵盖多种指令，超过 9,400 个问答（QA）对，以及一个包含 1000 万条目的语料库，覆盖四种不同类型的数据。广泛的实验表明，UniHGKR 在 CompMix-IR 上持续优于现有最先进的方法，在两个场景中分别实现了高达 6.36% 和 54.23% 的相对改进。最后，通过为开放域异质问答系统配备我们的检索器，我们在流行的 ConvMix 任务上取得了新的最先进结果，绝对提升高达 4.80 分。

[NLP-105] Causal Abstraction in Model Interpretability: A Compact Survey

【速读】：该论文试图解决复杂模型（如深度学习系统）决策过程的可解释性问题。解决方案的关键在于因果抽象（causal abstraction），这是一种理论框架，旨在通过理解和解释模型行为背后的因果机制，提供一种系统化的方法来增强模型的可解释性。

链接: https://arxiv.org/abs/2410.20161
作者: Yihao Zhang
关键词-EN: deep learning systems, interpretable artificial intelligence, learning systems, pursuit of interpretable, interpretable artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The pursuit of interpretable artificial intelligence has led to significant advancements in the development of methods that aim to explain the decision-making processes of complex models, such as deep learning systems. Among these methods, causal abstraction stands out as a theoretical framework that provides a principled approach to understanding and explaining the causal mechanisms underlying model behavior. This survey paper delves into the realm of causal abstraction, examining its theoretical foundations, practical applications, and implications for the field of model interpretability.
摘要：对可解释人工智能的追求推动了旨在解释复杂模型（如深度学习系统）决策过程的方法的显著发展。在这些方法中，因果抽象作为一种理论框架脱颖而出，它提供了一种原则性的方法来理解和解释模型行为背后的因果机制。本文深入探讨了因果抽象的领域，考察了其理论基础、实际应用以及对模型可解释性领域的影响。

[NLP-106] Hybrid Deep Learning for Legal Text Analysis: Predicting Punishment Durations in Indonesian Court Rulings

【速读】：该论文试图解决印尼法院系统中公众对法律程序理解有限以及判决不一致导致的广泛不满和法官压力增加的问题。解决方案的关键在于开发了一种基于深度学习的预测系统，用于预测法庭判决的长度。该系统结合了卷积神经网络 (CNN) 和双向长短期记忆网络 (BiLSTM) 以及注意力机制，实现了0.5893的R-squared得分，有效捕捉了法律文本中的局部模式和长期依赖关系。通过仅使用最频繁的30%词汇，平衡了信息保留和计算效率，同时改进了文本规范化过程，显著提升了模型性能。这些技术手段不仅有助于自动化法律文档处理，还提高了法律判决的透明度和公众理解度，为实现更一致和易懂的法律决策铺平了道路。

链接: https://arxiv.org/abs/2410.20104
作者: Muhammad Amien Ibrahim,Alif Tri Handoyo,Maria Susan Anggreainy
关键词-EN: Limited public understanding, court system led, stress on judges, Limited public, processes and inconsistent
类目: Computation and Language (cs.CL)
备注: 11 pages, 7 figures, 6 tables, submitted to Journal of Advances in Information Technology

点击查看摘要

Abstract:Limited public understanding of legal processes and inconsistent verdicts in the Indonesian court system led to widespread dissatisfaction and increased stress on judges. This study addresses these issues by developing a deep learning-based predictive system for court sentence lengths. Our hybrid model, combining CNN and BiLSTM with attention mechanism, achieved an R-squared score of 0.5893, effectively capturing both local patterns and long-term dependencies in legal texts. While document summarization proved ineffective, using only the top 30% most frequent tokens increased prediction performance, suggesting that focusing on core legal terminology balances information retention and computational efficiency. We also implemented a modified text normalization process, addressing common errors like misspellings and incorrectly merged words, which significantly improved the model’s performance. These findings have important implications for automating legal document processing, aiding both professionals and the public in understanding court judgments. By leveraging advanced NLP techniques, this research contributes to enhancing transparency and accessibility in the Indonesian legal system, paving the way for more consistent and comprehensible legal decisions.
摘要：由于公众对法律程序的理解有限，以及印度尼西亚法院系统中判决的不一致性，导致了广泛的公众不满和法官压力的增加。本研究通过开发一种基于深度学习的法庭判决长度预测系统来解决这些问题。我们的混合模型结合了卷积神经网络 (CNN) 和双向长短期记忆网络 (BiLSTM) 以及注意力机制，实现了 0.5893 的 R-squared 得分，有效地捕捉了法律文本中的局部模式和长期依赖关系。尽管文档摘要被证明是无效的，但仅使用最频繁的 30% Token 提高了预测性能，这表明专注于核心法律术语在信息保留和计算效率之间取得了平衡。我们还实施了一种改进的文本规范化过程，解决了常见的错误，如拼写错误和错误合并的单词，这显著提高了模型的性能。这些发现对自动化法律文档处理具有重要意义，有助于专业人士和公众更好地理解法庭判决。通过利用先进的自然语言处理 (NLP) 技术，本研究为提高印度尼西亚法律系统的透明度和可访问性做出了贡献，为实现更加一致和易于理解的法律决策铺平了道路。

[NLP-107] RARe: Retrieval Augmented Retrieval with In-Context Examples

【速读】：该论文试图解决在检索任务中，如何利用上下文示例（in-context examples）来提升嵌入模型（embedding model）性能的问题。解决方案的关键在于提出了一种名为RARe的方法，该方法通过微调预训练模型，使其能够利用与目标查询语义相似的上下文示例。具体来说，RARe方法在推理时将上下文示例（查询-文档对）与目标查询结合，从而在各种开放域检索数据集（如BeIR, RAR-b）上实现了高达+2.72%的nDCG性能提升。此外，RARe方法在跨域泛化方面表现出色，类似于在大型语言模型（LLMs）中观察到的上下文学习效果。

链接: https://arxiv.org/abs/2410.20088
作者: Atula Tejaswi,Yoonsang Lee,Sujay Sanghavi,Eunsol Choi
关键词-EN: improve embedding model, improve embedding, in-context, LLMs, decoder-only language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We investigate whether in-context examples, widely used in decoder-only language models (LLMs), can improve embedding model performance in retrieval tasks. Unlike in LLMs, naively prepending in-context examples (query-document pairs) to the target query at inference time does not work out of the box. We introduce a simple approach to enable retrievers to use in-context examples. Our approach, RARe, finetunes a pre-trained model with in-context examples whose query is semantically similar to the target query. This can be applied to adapt various base architectures (i.e., decoder-only language models, retriever models) and consistently achieves performance gains of up to +2.72% nDCG across various open-domain retrieval datasets (BeIR, RAR-b). In particular, we find RARe exhibits stronger out-of-domain generalization compared to models using queries without in-context examples, similar to what is seen for in-context learning in LLMs. We further provide analysis on the design choices of in-context example augmentation and lay the foundation for future work in this space.
摘要：我们研究了在仅解码器的大语言模型 (LLM) 中广泛使用的上下文示例是否能提升嵌入模型在检索任务中的性能。与大语言模型不同，简单地在推理时将上下文示例（查询-文档对）前置于目标查询并不能直接奏效。我们提出了一种简单的方法，使检索器能够利用上下文示例。我们的方法，RARe，通过微调预训练模型，使其能够使用与目标查询语义相似的上下文示例。这种方法可以应用于多种基础架构（即仅解码器的大语言模型、检索模型），并且在各种开放域检索数据集（BeIR、RAR-b）上一致地实现了高达 +2.72% 的 nDCG 性能提升。特别地，我们发现 RARe 在跨域泛化方面表现更强，相比于不使用上下文示例的模型，这与大语言模型中上下文学习的观察结果相似。我们进一步对上下文示例增强的设计选择进行了分析，并为该领域的未来工作奠定了基础。

[NLP-108] Multi-Field Adaptive Retrieval

【速读】：该论文试图解决在文档检索任务中，现有方法通常处理无结构化文本数据的问题。解决方案的关键在于提出了多字段自适应检索框架（Multi-Field Adaptive Retrieval, MFAR），该框架能够灵活处理任何数量和类型的结构化数据文档索引。其核心步骤包括：(1) 将现有文档分解为多个字段，并分别通过密集表示和词汇表示进行独立索引；(2) 学习一个模型，该模型能够根据查询内容自适应地预测各字段的重要性，从而在检索时动态调整各字段的权重。这种方法优化了密集表示与词汇表示在不同字段类型中的使用，显著提升了文档排序效果，并在多字段结构化数据检索中达到了最先进的性能。

链接: https://arxiv.org/abs/2410.20056
作者: Millicent Li,Tongfei Chen,Benjamin Van Durme,Patrick Xia
关键词-EN: retrieval-augmented generation typically, generation typically involves, typically involves datasets, explicit internal structure, free-form text
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Document retrieval for tasks such as search and retrieval-augmented generation typically involves datasets that are unstructured: free-form text without explicit internal structure in each document. However, documents can have a structured form, consisting of fields such as an article title, message body, or HTML header. To address this gap, we introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number of and any type of document indices on structured data. Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query, allowing on-the-fly weighting of the most likely field(s). We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves in document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
摘要：用于搜索和检索增强生成等任务的文档检索通常涉及非结构化数据集：每个文档中没有明确内部结构的自由形式文本。然而，文档可以具有结构化形式，由文章标题、消息正文或HTML头部等字段组成。为了填补这一空白，我们引入了多字段自适应检索（Multi-Field Adaptive Retrieval, MFAR），这是一个灵活的框架，能够适应结构化数据上的任意数量和类型的文档索引。我们的框架包括两个主要步骤：（1）将现有文档分解为字段，每个字段通过密集和词汇方法独立索引，以及（2）学习一个模型，该模型通过条件化文档查询来自适应预测字段的重要性，从而允许动态加权最可能的字段。我们发现，这种方法能够在字段类型之间优化密集与词汇表示的使用，显著提高文档排序性能，并在多字段结构化数据上达到最先进的性能。

[NLP-109] Architectural Flaw Detection in Civil Engineering Using GPT-4

【速读】：该论文试图解决建筑设计阶段中存在的建筑缺陷检测问题，特别是识别缺失的门和窗户。解决方案的关键在于利用先进的LLM GPT4 Turbo视觉模型，通过评估其精度（precision）、召回率（recall）和F1分数（F1 score）等指标，验证AI在缺陷检测方面的有效性。此外，研究还探讨了AI在识别承重问题、材料弱点和确保符合建筑规范方面的更广泛应用，强调AI如何通过提高设计准确性、减少成本高昂的修订和支持可持续实践，来显著改善建筑设计质量，从而在土木工程领域实现更安全、高效和美观的建筑结构。

链接: https://arxiv.org/abs/2410.20036
作者: Saket Kumar,Abul Ehtesham,Aditi Singh,Tala Talaei Khoei
关键词-EN: enhancing design quality, artificial intelligence, quality and safety, civil engineering presents, application of artificial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The application of artificial intelligence (AI) in civil engineering presents a transformative approach to enhancing design quality and safety. This paper investigates the potential of the advanced LLM GPT4 Turbo vision model in detecting architectural flaws during the design phase, with a specific focus on identifying missing doors and windows. The study evaluates the model’s performance through metrics such as precision, recall, and F1 score, demonstrating AI’s effectiveness in accurately detecting flaws compared to human-verified data. Additionally, the research explores AI’s broader capabilities, including identifying load-bearing issues, material weaknesses, and ensuring compliance with building codes. The findings highlight how AI can significantly improve design accuracy, reduce costly revisions, and support sustainable practices, ultimately revolutionizing the civil engineering field by ensuring safer, more efficient, and aesthetically optimized structures.
摘要：人工智能（AI）在土木工程中的应用为提升设计质量和安全性提供了变革性的方法。本文探讨了先进的大语言模型 GPT4 Turbo 视觉模型在设计阶段检测建筑缺陷的潜力，特别是识别缺失的门和窗。研究通过精确度、召回率和 F1 分数等指标评估了模型的性能，展示了 AI 在准确检测缺陷方面相较于人工验证数据的有效性。此外，研究还探索了 AI 的更广泛能力，包括识别承重问题、材料弱点以及确保符合建筑规范。研究结果表明，AI 可以显著提高设计准确性，减少成本高昂的修订，并支持可持续实践，最终通过确保更安全、更高效和美学优化的结构，彻底改变土木工程领域。

[NLP-110] raining the Untrainable: Introducing Inductive Bias via Representational Alignment

【速读】：该论文试图解决的问题是如何在不改变网络架构的情况下，通过引入其他架构的归纳偏置（inductive biases）来训练原本被认为不适合特定任务的网络。解决方案的关键在于引入“指导网络”（guide network），通过神经距离函数（neural distance function）引导“目标网络”（target network）。目标网络在优化过程中不仅要表现良好，还要逐层匹配指导网络的内部表示。这种方法能够将指导网络的部分架构先验和知识传递给目标网络，从而克服目标网络在视觉任务中的过拟合问题，使普通卷积神经网络（CNNs）与残差网络（ResNets）竞争，缩小普通循环神经网络（RNNs）与Transformer之间的差距，甚至帮助Transformer学习原本RNNs更容易完成的任务。此外，该方法还揭示了可能存在更好的全连接网络初始化方法以避免过拟合。

链接: https://arxiv.org/abs/2410.20035
作者: Vighnesh Subramaniam,David Mayo,Colin Conwell,Tomaso Poggio,Boris Katz,Brian Cheung,Andrei Barbu
关键词-EN: fully connected networks, guide, fully connected, connected networks, traditionally are considered
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review; 24 pages, 9 figures; Project page and code is at this https URL

点击查看摘要

Abstract:We demonstrate that architectures which traditionally are considered to be ill-suited for a task can be trained using inductive biases from another architecture. Networks are considered untrainable when they overfit, underfit, or converge to poor results even when tuning their hyperparameters. For example, plain fully connected networks overfit on object recognition while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although what that bias is remains unknown. We introduce guidance, where a guide network guides a target network using a neural distance function. The target is optimized to perform well and to match its internal representations, layer-by-layer, to those of the guide; the guide is unchanged. If the guide is trained, this transfers over part of the architectural prior and knowledge of the guide to the target. If the guide is untrained, this transfers over only part of the architectural prior of the guide. In this manner, we can investigate what kinds of priors different architectures place on untrainable networks such as fully connected networks. We demonstrate that this method overcomes the immediate overfitting of fully connected networks on vision tasks, makes plain CNNs competitive to ResNets, closes much of the gap between plain vanilla RNNs and Transformers, and can even help Transformers learn tasks which RNNs can perform more easily. We also discover evidence that better initializations of fully connected networks likely exist to avoid overfitting. Our method provides a mathematical tool to investigate priors and architectures, and in the long term, may demystify the dark art of architecture creation, even perhaps turning architectures into a continuous optimizable parameter of the network.
摘要：我们展示了那些传统上被认为不适合某项任务的架构，可以通过从另一种架构中引入归纳偏置来进行训练。当网络出现过拟合、欠拟合或即使调整超参数也收敛到较差结果时，它们被认为是不可训练的。例如，普通的完全连接网络在物体识别任务上容易过拟合，而没有残差连接的深度卷积网络则容易欠拟合。传统的解决方法是改变架构以施加某种归纳偏置，尽管这种偏置的具体内容仍未明确。我们引入了指导机制，其中指导网络通过神经距离函数引导目标网络。目标网络被优化以表现良好，并逐层匹配其内部表示与指导网络的内部表示；指导网络保持不变。如果指导网络是经过训练的，这会将部分架构先验和指导网络的知识传递给目标网络。如果指导网络是未经训练的，则只会传递部分架构先验。通过这种方式，我们可以研究不同架构对不可训练网络（如完全连接网络）施加的先验类型。我们展示了这种方法克服了完全连接网络在视觉任务上的即时过拟合问题，使普通卷积神经网络（CNN）与残差网络（ResNet）具有竞争力，缩小了普通循环神经网络（RNN）与Transformer之间的差距，甚至帮助Transformer学习那些RNN更容易完成的任务。我们还发现了证据，表明可能存在更好的完全连接网络初始化方法以避免过拟合。我们的方法提供了一种数学工具来研究先验和架构，从长远来看，可能有助于揭开架构设计的神秘面纱，甚至可能将架构转变为网络中一个可连续优化的参数。

[NLP-111] Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在自然语言处理中生成“幻觉”（hallucinations）的问题，即模型输出不准确或虚构信息，从而影响其在关键数据驱动决策中的可靠性。解决方案的关键在于引入并评估四种针对性策略：结构化输出生成（Structured Output Generation）、严格规则执行（Strict Rules Enforcement）、系统提示增强（System Prompt Enhancements）和语义层集成（Semantic Layer Integration）。研究表明，这些策略在减少幻觉方面比传统的微调方法更为有效，为在数据分析中部署LLMs提供了更可靠的框架，从而确保在数据驱动环境中获得更可靠的结果。

链接: https://arxiv.org/abs/2410.20024
作者: Mikhail Rumiantsau,Aliaksei Vertsel,Ilya Hrytsuk,Isaiah Ballah
关键词-EN: Large Language Models, Large Language, enabling advanced data, natural language queries, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become increasingly important in natural language processing, enabling advanced data analytics through natural language queries. However, these models often generate “hallucinations”-inaccurate or fabricated information-that can undermine their reliability in critical data-driven decision-making. Addressing the challenge of hallucinations is essential to improve the accuracy and trustworthiness of LLMs in processing natural language queries. This research focuses on mitigating hallucinations in LLMs, specifically within the context of data analytics. We introduce and evaluate four targeted strategies: Structured Output Generation, Strict Rules Enforcement, System Prompt Enhancements, and Semantic Layer Integration. Our findings show that these methods are more effective than traditional fine-tuning approaches in reducing hallucinations, offering a more reliable framework for deploying LLMs in natural language queries for data analytics. This research demonstrates the potential of these strategies to enhance the accuracy of LLM-driven data queries, ensuring more dependable results in data-driven environments.
摘要：大语言模型 (LLM) 在自然语言处理中变得越来越重要，通过自然语言查询实现了高级数据分析。然而，这些模型常常生成“幻觉”——即不准确或虚构的信息——这可能会削弱其在关键数据驱动决策中的可靠性。解决幻觉问题是提高大语言模型在处理自然语言查询时的准确性和可信度的关键。本研究专注于在大语言模型中减轻幻觉现象，特别是在数据分析的背景下。我们提出了并评估了四种针对性的策略：结构化输出生成、严格规则执行、系统提示增强和语义层集成。我们的研究结果表明，这些方法在减少幻觉方面比传统的微调方法更为有效，为在数据分析中部署大语言模型提供了更可靠的框架。本研究展示了这些策略在提升大语言模型驱动数据查询准确性方面的潜力，确保在数据驱动环境中获得更可靠的结果。

[NLP-112] Dynamic layer selection in decoder-only transformers

【速读】：该论文试图解决大型语言模型（LLMs）在推理过程中的计算效率问题。解决方案的关键在于动态推理（dynamic inference），特别是通过层跳过（layer skipping）和早期退出（early exiting）两种方法来优化自然语言生成（NLG）任务中的计算成本。研究发现，预训练的解码器模型在通过层跳过进行层移除时表现出更高的鲁棒性，而早期退出方法在基于每个token的计算适应性上存在困难。论文还展示了基于每个序列的动态计算分配具有显著提高效率的潜力，并通过构建一个oracle控制器证明了存在一种分配策略，可以在平均仅使用23.3%的层数的情况下达到与完整模型相当的性能。

链接: https://arxiv.org/abs/2410.20022
作者: Theodore Glavas,Joud Chataoui,Florence Regol,Wassim Jabbour,Antonios Valkanas,Boris N. Oreshkin,Mark Coates
关键词-EN: Large Language Models, size of Large, Large Language, vast size, prompted a search
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample-at-hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using only 23.3% of its layers on average.
摘要：大语言模型 (Large Language Model, LLM) 的庞大体量促使人们寻找优化推理的方法。动态推理是一种有效的方法，它根据当前样本调整架构以降低整体计算成本。我们实证研究了两种常见的自然语言生成 (Natural Language Generation, NLG) 动态推理方法：层跳过 (layer skipping) 和提前退出 (early exiting)。我们发现，与提前退出相比，预训练的仅解码器模型通过层跳过进行层移除时，显著更为稳健。我们还展示了在层跳过中，利用隐藏状态信息在每个 Token 基础上调整计算的难度。最后，我们通过构建一个预言控制器 (oracle controller)，展示了在每个序列基础上进行动态计算分配具有显著提高效率的潜力。值得注意的是，我们发现存在一种分配策略，平均仅使用全模型 23.3% 的层数，就能达到与全模型相同的性能。

[NLP-113] hink Carefully and Check Again! Meta-Generation Unlocking LLM s for Low-Resource Cross-Lingual Summarization

【速读】：该论文试图解决低资源语言的跨语言摘要生成（Cross-lingual Summarization, CLS）问题，特别是在指令调优的大型语言模型（Large Language Models, LLMs）在处理这些语言时表现不佳的情况下。解决方案的关键在于提出了一种四步零样本方法：摘要生成（Summarization）、改进（Improvement）、翻译（Translation）和精炼（Refinement），即SITR方法，并通过相应设计的提示（prompts）来充分发挥LLMs的潜力。实验结果表明，使用SITR方法的GPT-3.5和GPT-4在低资源语言的CLS任务中显著且一致地优于其他基线模型，从而有效地解锁了LLMs在处理低资源语言跨语言摘要生成任务中的潜力。

链接: https://arxiv.org/abs/2410.20021
作者: Zhecheng Li,Yiwei Wang,Bryan Hooi,Yujun Cai,Naifan Cheung,Nanyun Peng,Kai-wei Chang
关键词-EN: Cross-lingual summarization, cross-lingual summarization tasks, aims to generate, summarization, low-resource languages
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. However, unlike languages such as English, Chinese or Spanish, for those relatively low-resource languages with limited usage or data, recent studies have shown that LLMs’ performance on CLS tasks remains unsatisfactory even with few-shot settings. This raises the question: Are LLMs capable of handling cross-lingual summarization tasks for low-resource languages? To resolve this question, we fully explore the potential of large language models on cross-lingual summarization task for low-resource languages through our four-step zero-shot method: Summarization, Improvement, Translation and Refinement (SITR) with correspondingly designed prompts. We test our proposed method with multiple LLMs on two well-known cross-lingual summarization datasets with various low-resource target languages. The results show that: i) GPT-3.5 and GPT-4 significantly and consistently outperform other baselines when using our zero-shot SITR methods. ii) By employing our proposed method, we unlock the potential of LLMs, enabling them to effectively handle cross-lingual summarization tasks for relatively low-resource languages.
摘要：跨语言摘要 (Cross-lingual summarization, CLS) 旨在为目标语言生成源文本的摘要。目前，经过指令微调的大语言模型 (Large Language Models, LLMs) 在各种英语任务中表现出色。然而，与英语、中文或西班牙语等语言不同，对于使用率或数据有限的相对低资源语言，最近的研究表明，即使在少样本设置下，LLMs 在 CLS 任务上的表现仍然不尽如人意。这引发了一个问题：LLMs 是否能够处理低资源语言的跨语言摘要任务？为了解答这个问题，我们通过四步零样本方法：摘要 (Summarization)、改进 (Improvement)、翻译 (Translation) 和精炼 (Refinement) (SITR)，并结合相应设计的提示，全面探索了大语言模型在低资源语言跨语言摘要任务中的潜力。我们在两个著名的跨语言摘要数据集上，使用多种 LLMs 测试了我们提出的方法，这些数据集涵盖了多种低资源目标语言。结果显示：i) 在使用我们的零样本 SITR 方法时，GPT-3.5 和 GPT-4 显著且持续地优于其他基线模型。ii) 通过采用我们提出的方法，我们解锁了 LLMs 的潜力，使其能够有效处理相对低资源语言的跨语言摘要任务。

[NLP-114] Attacks against Abstractive Text Summarization Models through Lead Bias and Influence Functions EMNLP

【速读】：该论文试图解决生成式文本摘要模型在面对对抗性扰动和数据中毒攻击时的鲁棒性问题。解决方案的关键在于利用摘要模型固有的“领先偏差”（lead bias）进行对抗性扰动，并通过引入影响函数（influence functions）来实施数据中毒，从而破坏模型的完整性。这种方法不仅展示了模型行为向预期结果的偏斜，还揭示了模型在被攻击时倾向于生成抽取式摘要而非生成式摘要的新行为变化。

链接: https://arxiv.org/abs/2410.20019
作者: Poojitha Thota,Shirin Nilizadeh
关键词-EN: Large Language Models, Large Language, comprehension and generation, Language Models, introduced novel opportunities
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 10 pages, 3 figures, Accepted at EMNLP Findings 2024

点击查看摘要

Abstract:Large Language Models have introduced novel opportunities for text comprehension and generation. Yet, they are vulnerable to adversarial perturbations and data poisoning attacks, particularly in tasks like text classification and translation. However, the adversarial robustness of abstractive text summarization models remains less explored. In this work, we unveil a novel approach by exploiting the inherent lead bias in summarization models, to perform adversarial perturbations. Furthermore, we introduce an innovative application of influence functions, to execute data poisoning, which compromises the model’s integrity. This approach not only shows a skew in the models behavior to produce desired outcomes but also shows a new behavioral change, where models under attack tend to generate extractive summaries rather than abstractive summaries.
摘要：大语言模型为文本理解和生成带来了新的机遇。然而，它们在文本分类和翻译等任务中容易受到对抗性扰动和数据中毒攻击的影响。然而，抽象文本摘要模型的对抗性鲁棒性研究相对较少。在本研究中，我们揭示了一种利用摘要模型固有的领先偏差进行对抗性扰动的新方法。此外，我们引入了一种创新的影响函数应用，以执行数据中毒，从而破坏模型的完整性。这种方法不仅显示出模型行为向预期结果的偏斜，还展示了一种新的行为变化，即受攻击的模型倾向于生成抽取式摘要而非抽象式摘要。

[NLP-115] Vulnerability of LLM s to Vertically Aligned Text Manipulations

【速读】：该论文试图解决的问题是：在文本分类任务中，垂直格式化的文本输入是否会对基于解码器的大型语言模型 (LLMs) 的性能产生显著影响。解决方案的关键在于通过实验验证垂直文本输入对不同LLMs在多个文本分类数据集上的准确性影响，并分析其背后的原因。研究发现，垂直文本输入显著降低了LLMs的分类准确性，而Chain of Thought (CoT) 推理并不能帮助模型识别或缓解这种脆弱性，但通过精心设计的少样本学习方法可以部分缓解这一问题。此外，论文还探讨了这种脆弱性的根本原因，涉及分词 (tokenization) 和注意力矩阵 (attention matrices) 的内在问题。

链接: https://arxiv.org/abs/2410.20016
作者: Zhecheng Li,Yiwei Wang,Bryan Hooi,Yujun Cai,Zhen Xiong,Nanyun Peng,Kai-wei Chang
关键词-EN: text classification tasks, classification involves categorizing, Text classification, identifying harmful content, Text classification involves
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text classification involves categorizing a given text, such as determining its sentiment or identifying harmful content. With the advancement of large language models (LLMs), these models have become highly effective at performing text classification tasks. However, they still show vulnerabilities to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
摘要：文本分类涉及对给定文本进行分类，例如确定其情感或识别有害内容。随着大语言模型（LLM）的发展，这些模型在执行文本分类任务方面已变得非常有效。然而，它们仍然对文本格式的变化表现出脆弱性。最近的研究表明，修改输入格式，例如对基于编码器的模型进行垂直对齐的单词，可以显著降低文本分类任务的准确性。虽然这些输入对人类来说易于理解，但它们可以显著误导模型，在涉及有害或敏感信息的现实场景中，存在绕过检测的潜在风险。随着LLM应用的扩展，一个关键问题浮现：基于解码器的LLM是否对垂直格式的文本输入表现出类似的脆弱性？在本文中，我们研究了垂直文本输入对多种LLM在多个文本分类数据集上的性能影响，并分析了其根本原因。我们的研究结果如下：（i）垂直文本输入显著降低了LLM在文本分类任务中的准确性。（ii）链式思维（Chain of Thought, CoT）推理无法帮助LLM识别垂直输入或减轻其脆弱性，但经过仔细分析的少样本学习可以。（iii）我们通过分析Token化和注意力矩阵中的固有问题，探讨了这种脆弱性的根本原因。

[NLP-116] A Survey of Small Language Models

【速读】：该论文试图解决小型语言模型 (Small Language Models, SLMs) 的优化和应用问题。解决方案的关键在于提出了一种新的分类法，用于归类优化SLMs的方法，包括模型压缩、剪枝和量化技术。此外，论文还总结了用于基准测试SLMs的数据集和常用的评估指标，并指出了当前仍需解决的关键开放挑战。通过这些内容，论文旨在为研究人员和实践者在开发和部署高效的小型语言模型方面提供有价值的资源。

链接: https://arxiv.org/abs/2410.20011
作者: Chien Van Nguyen,Xuan Shen,Ryan Aponte,Yu Xia,Samyadeep Basu,Zhengmian Hu,Jian Chen,Mihir Parmar,Sasidhar Kunapuli,Joe Barrow,Junda Wu,Ashish Singh,Yu Wang,Jiuxiang Gu,Franck Dernoncourt,Nesreen K. Ahmed,Nedim Lipka,Ruiyi Zhang,Xiang Chen,Tong Yu,Sungchul Kim,Hanieh Deilamsalehy,Namyong Park,Mike Rimer,Zhehao Zhang,Huanrui Yang,Ryan A. Rossi,Thien Huu Nguyen
关键词-EN: increasingly important due, settings including on-device, edge devices, minimal computational resources, making them ideal
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, edge devices, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.
摘要：小型语言模型 (Small Language Models, SLMs) 由于其在执行各种语言任务时的高效性和性能，以及对计算资源的极低需求，正变得越来越重要。这使得它们非常适合多种应用场景，包括设备端、移动设备、边缘设备等。本文对 SLMs 进行了全面的综述，重点探讨了其架构、训练技术以及模型压缩技术。我们提出了一种新的分类法，用于对优化 SLMs 的方法进行分类，包括模型压缩、剪枝和量化技术。我们总结了用于基准测试 SLMs 的基准数据集，以及常用的评估指标。此外，我们还指出了当前仍需解决的关键开放性挑战。本综述旨在为对开发和部署小型但高效的语言模型感兴趣的研究人员和实践者提供有价值的参考资源。

[NLP-117] Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models EMNLP2024

【速读】：该论文试图解决的问题是：预训练的大型语言模型 (LLMs) 在多大程度上保留了任务特定的知识，以及指令微调 (instruction tuning) 如何影响这些模型在不同自然语言处理 (NLP) 任务中的表示。解决方案的关键在于使用矩阵分析工具来比较预训练和指令微调的 LLMs 在存储任务特定信息方面的差异。研究发现，虽然某些任务在预训练的 LLMs 中已经编码，但其他任务通过指令微调显著受益。此外，研究还确定了模型从高层通用表示过渡到更任务导向表示的层次，这有助于理解 LLMs 的机制，并为参数高效迁移学习和多任务学习领域的未来研究提供了基础。

链接: https://arxiv.org/abs/2410.20008
作者: Zheng Zhao,Yftah Ziser,Shay B. Cohen
关键词-EN: natural language processing, Fine-tuning pre-trained large, large language models, pre-trained large language, Fine-tuning pre-trained
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Fine-tuning pre-trained large language models (LLMs) on a diverse array of tasks has become a common approach for building models that can solve various natural language processing (NLP) tasks. However, where and to what extent these models retain task-specific knowledge remains largely unexplored. This study investigates the task-specific information encoded in pre-trained LLMs and the effects of instruction tuning on their representations across a diverse set of over 60 NLP tasks. We use a set of matrix analysis tools to examine the differences between the way pre-trained and instruction-tuned LLMs store task-specific information. Our findings reveal that while some tasks are already encoded within the pre-trained LLMs, others greatly benefit from instruction tuning. Additionally, we pinpointed the layers in which the model transitions from high-level general representations to more task-oriented representations. This finding extends our understanding of the governing mechanisms of LLMs and facilitates future research in the fields of parameter-efficient transfer learning and multi-task learning.
摘要：在多样化的任务上微调预训练的大语言模型（LLMs）已成为构建能够解决各种自然语言处理（NLP）任务的模型的常见方法。然而，这些模型在何处以及在多大程度上保留了任务特定的知识，仍然是一个未被充分探索的领域。本研究探讨了预训练 LLMs 中编码的任务特定信息，以及指令微调对其在超过 60 个 NLP 任务中的表示的影响。我们使用一组矩阵分析工具来检查预训练和指令微调 LLMs 存储任务特定信息的方式之间的差异。我们的研究发现，尽管某些任务已经在预训练 LLMs 中编码，但其他任务则从指令微调中获益匪浅。此外，我们确定了模型从高级通用表示过渡到更面向任务的表示的层级。这一发现扩展了我们对 LLMs 主导机制的理解，并促进了未来在参数高效迁移学习和多任务学习领域的研究。

[NLP-118] Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

【速读】：该论文试图解决大型语言模型 (LLMs) 在处理复杂、多步骤问题时推理能力不足的问题。解决方案的关键在于提出了一种新颖的合作多智能体推理框架 (CoPlanner)，通过将推理步骤分离并分配给不同的智能体来增强推理能力。CoPlanner 由两个 LLM 智能体组成：规划智能体和推理智能体。规划智能体提供高层次的战略提示，而推理智能体则根据这些提示进行推理并得出答案。通过使用近端策略优化 (PPO) 训练规划智能体的策略，CoPlanner 在 LogiQA 和 BBH 数据集上分别比之前的最优方法提升了 9.94% 和 3.09%。这一结果表明，规划智能体的指导和智能体之间的有效合作是 CoPlanner 在解决多步骤推理问题时表现优异的关键因素。

链接: https://arxiv.org/abs/2410.20007
作者: Danqing Wang,Zhuorui Ye,Fei Fang,Lei Li
关键词-EN: large language models, language models, tackle complex, reasoning, planning agent
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Working in progress

点击查看摘要

Abstract:Enhancing the reasoning capabilities of large language models (LLMs) is crucial for enabling them to tackle complex, multi-step problems. Multi-agent frameworks have shown great potential in enhancing LLMs’ reasoning capabilities. However, the lack of effective cooperation between LLM agents hinders their performance, especially for multi-step reasoning tasks. This paper proposes a novel cooperative multi-agent reasoning framework (CoPlanner) by separating reasoning steps and assigning distinct duties to different agents. CoPlanner consists of two LLM agents: a planning agent and a reasoning agent. The planning agent provides high-level strategic hints, while the reasoning agent follows these hints and infers answers. By training the planning agent’s policy through the interactive reasoning process via Proximal Policy Optimization (PPO), the LLaMA-3-8B-based CoPlanner outperforms the previous best method by 9.94% on LogiQA and 3.09% on BBH. Our results demonstrate that the guidance from the planning agent and the effective cooperation between the agents contribute to the superior performance of CoPlanner in tackling multi-step reasoning problems.
摘要：提升大语言模型（LLMs）的推理能力对于使其能够解决复杂、多步骤的问题至关重要。多智能体框架在增强 LLMs 的推理能力方面展现了巨大潜力。然而，LLM 智能体之间缺乏有效的合作阻碍了它们的性能，尤其是在多步骤推理任务中。本文提出了一种新颖的合作多智能体推理框架（CoPlanner），通过分离推理步骤并将不同的职责分配给不同的智能体。CoPlanner 由两个 LLM 智能体组成：一个规划智能体和一个推理智能体。规划智能体提供高层次的战略提示，而推理智能体则遵循这些提示并推断答案。通过使用近端策略优化（PPO）训练规划智能体的策略，基于 LLaMA-3-8B 的 CoPlanner 在 LogiQA 上比之前最佳方法提升了 9.94%，在 BBH 上提升了 3.09%。我们的结果表明，规划智能体的指导以及智能体之间的有效合作是 CoPlanner 在解决多步骤推理问题中表现优异的关键因素。

[NLP-119] Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

【速读】：该论文试图解决在多模态搜索场景中，如何选择最优的大型语言模型（LLMs）和多模态语言模型（MLLMs）以实现与人类判断的高度一致性。解决方案的关键在于全面评估不同模型在多种情境下的表现，分析成本与准确性之间的权衡，并揭示模型性能在不同上下文中的显著差异。特别值得注意的是，对于较小的模型，引入视觉组件可能反而会降低性能。这些发现为实际应用中选择最合适的模型提供了复杂性的考量。

链接: https://arxiv.org/abs/2410.19974
作者: Silvia Terragni,Hoang Cuong,Joachim Daiber,Pallavi Gudipati,Pablo N. Mendes
关键词-EN: Large Language Models, Large Language, search relevance evaluators, effective search relevance, relevance evaluators
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.
摘要：大语言模型 (LLM) 已展现出作为有效搜索相关性评估工具的潜力。然而，目前缺乏关于哪些模型在不同情境或特定应用场景中始终表现最佳的综合指导。本文中，我们评估了多个大语言模型和多模态语言模型 (MLLM) 在多种多模态搜索场景中与人类判断的一致性。我们的分析探讨了成本与准确性之间的权衡，强调了模型性能因情境而异。有趣的是，在较小模型中，加入视觉组件可能会阻碍性能而非提升性能。这些发现突显了在实际应用中选择最合适模型所涉及的复杂性。

[NLP-120] RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction

【速读】：该论文试图解决的问题是如何有效防御针对大型语言模型（LLM）的越狱攻击（jailbreak attacks）。越狱攻击通过在越狱提示中隐藏有害查询来绕过LLM的内置安全机制。现有的防御措施主要集中在减轻越狱提示的影响上，但由于越狱提示可以采取任意和自适应的形式，这些防御措施往往不够有效。论文提出的解决方案是RobustKV，其关键在于采用了一种根本不同的方法：通过选择性地从键值（KV）缓存中移除有害查询的关键标记（tokens）。具体来说，RobustKV通过策略性地驱逐KV缓存中重要性最低的标记，从而削弱有害查询在KV缓存中的存在，进而防止LLM生成恶意响应。这种方法不仅在基准数据集和模型上进行了广泛的评估，证明了其有效性，还为攻击者创造了一个逃避困境，迫使他们在逃避RobustKV和绕过LLM内置安全机制之间做出权衡，从而增强了RobustKV对自适应攻击的鲁棒性。

链接: https://arxiv.org/abs/2410.19937
作者: Tanqiu Jiang,Zian Wang,Jiacheng Liang,Changjiang Li,Yuhui Wang,Ting Wang
关键词-EN: circumvent LLMs’ built-in, attacks circumvent LLMs’, jailbreak prompts, Jailbreak attacks circumvent, circumvent LLMs’
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Jailbreak attacks circumvent LLMs’ built-in safeguards by concealing harmful queries within jailbreak prompts. While existing defenses primarily focus on mitigating the effects of jailbreak prompts, they often prove inadequate as jailbreak prompts can take arbitrary, adaptive forms. This paper presents RobustKV, a novel defense that adopts a fundamentally different approach by selectively removing critical tokens of harmful queries from key-value (KV) caches. Intuitively, for a jailbreak prompt to be effective, its tokens must achieve sufficient `importance’ (as measured by attention scores), which inevitably lowers the importance of tokens in the concealed harmful query. Thus, by strategically evicting the KVs of the lowest-ranked tokens, RobustKV diminishes the presence of the harmful query in the KV cache, thus preventing the LLM from generating malicious responses. Extensive evaluation using benchmark datasets and models demonstrates that RobustKV effectively counters state-of-the-art jailbreak attacks while maintaining the LLM’s general performance on benign queries. Moreover, RobustKV creates an intriguing evasiveness dilemma for adversaries, forcing them to balance between evading RobustKV and bypassing the LLM’s built-in safeguards. This trade-off contributes to RobustKV’s robustness against adaptive attacks. (warning: this paper contains potentially harmful content generated by LLMs.)
摘要：越狱攻击通过将有害查询隐藏在越狱提示中，绕过了大语言模型（LLM）内置的安全措施。尽管现有防御措施主要集中在减轻越狱提示的影响上，但它们往往显得不足，因为越狱提示可以采取任意、适应性的形式。本文提出了RobustKV，这是一种新颖的防御方法，通过有选择地从键值（KV）缓存中移除有害查询的关键Token，采用了根本不同的策略。直观地说，为了使越狱提示有效，其Token必须达到足够的“重要性”（通过注意力分数衡量），这不可避免地降低了隐藏有害查询中Token的重要性。因此，通过战略性地驱逐排名最低的Token的KV，RobustKV减少了有害查询在KV缓存中的存在，从而防止LLM生成恶意响应。使用基准数据集和模型进行的广泛评估表明，RobustKV有效地对抗了最先进的越狱攻击，同时保持了LLM在良性查询上的整体性能。此外，RobustKV为对手创造了一个有趣的逃避困境，迫使他们在逃避RobustKV和绕过LLM内置安全措施之间做出权衡。这种权衡有助于RobustKV抵御适应性攻击的鲁棒性。（警告：本文包含由LLM生成的潜在有害内容。）

[NLP-121] Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions? ICASSP2025

【速读】：该论文试图解决的问题是：通过自监督学习（Self-Supervised Learning, SSL）模型生成的离散语音表示是否能够充分捕捉到语言中的声调信息，特别是在低资源语言中。解决方案的关键在于评估使用k-means聚类生成的离散符号是否能够保留声调信息，并比较这些离散符号与SSL模型中的潜在向量在元音和声调分类任务中的表现。研究结果表明，使用离散符号会导致声调信息的显著损失，即使在使用特定语言优化的SSL模型时也是如此。因此，论文建议离散化过程需要考虑到下游任务的特性，尤其是对于依赖声调的任务。

链接: https://arxiv.org/abs/2410.19935
作者: Opeyemi Osakuade,Simon King
关键词-EN: Self-Supervised Learning, foundation models, language-specialised SSL models, limited data, SSL model
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.
摘要：通过自监督学习 (Self-Supervised Learning, SSL) 基础模型获得的语音离散表示在广泛应用，特别是在下游任务数据有限的情况下，例如对于低资源语言。通常，将语音离散化为符号序列是通过对 SSL 模型中的潜在变量进行无监督聚类实现的。我们的研究评估了使用 k-means 找到的离散符号是否能充分捕捉两种示例语言——普通话和约鲁巴语中的声调。我们比较了从 HuBERT base、MandarinHuBERT 或 XLS-R 获得的潜在向量与离散符号在元音和声调分类中的表现。我们发现，使用离散符号会导致声调信息的显著损失，即使是针对特定语言的 SSL 模型也是如此。我们建议，离散化过程需要具备任务感知能力，特别是对于依赖声调的下游任务。

[NLP-122] Improving Multimodal Large Language Models Using Continual Learning NEURIPS2024

【速读】：该论文试图解决在将预训练的视觉模型集成到生成式大型语言模型（LLM）中以创建多模态LLM（MLLM）时，导致自然语言理解和生成任务性能显著下降的问题。解决方案的关键在于将这种集成视为一个持续学习问题，并通过评估五种持续学习方法来减轻遗忘效应。研究识别出一种技术，该技术在增强视觉理解能力的同时，最小化语言性能的损失，从而将语言性能的降解减少了高达15%，同时保持了较高的多模态准确性。此外，该方法在处理一系列视觉-语言任务时表现出鲁棒性，能够在获取新的多模态能力的同时有效保留语言技能。

链接: https://arxiv.org/abs/2410.19925
作者: Shikhar Srivastava,Md Yousuf Harun,Robik Shrestha,Christopher Kanan
关键词-EN: Generative large language, pre-trained vision model, Generative large, large language models, exhibit impressive capabilities
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: NeurIPS 2024 Workshop on Scalable Continual Learning for Lifelong Foundation Models

点击查看摘要

Abstract:Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities.
摘要：生成式大语言模型（LLMs）展现出令人印象深刻的能力，通过将预训练的视觉模型整合到原始 LLM 中，可以创建一个多模态 LLM（MLLM），从而进一步提升其能力。然而，与原始 LLM 相比，这种整合通常会显著降低自然语言理解和生成任务的性能。本研究使用 LLaVA MLLM 探讨了这一问题，将整合视为一个持续学习问题。我们评估了五种持续学习方法以减轻遗忘现象，并确定了一种技术，该技术在增强视觉理解的同时，最小化了语言性能的损失。我们的方法在保持高多模态准确性的同时，将语言性能的下降减少了高达 15%，优于 LLaVA 的方案。此外，我们还通过在一系列视觉-语言任务上的持续学习，展示了我们方法的鲁棒性，有效地保留了语言技能，同时获得了新的多模态能力。

[NLP-123] Ensembling Finetuned Language Models for Text Classification NEURIPS2024

【速读】：该论文试图解决在文本分类任务中，如何通过集成学习（ensembling）提升微调（finetuning）预训练模型性能的问题。解决方案的关键在于构建一个包含五个大型微调模型在六个数据集上的预测结果的元数据集（metadataset），并评估不同集成策略对这些预测结果的影响。研究结果表明，集成学习能够显著提高微调文本分类器的性能，并为未来在类似任务中采用集成方法提供了激励。

链接: https://arxiv.org/abs/2410.19889
作者: Sebastian Pineda Arango,Maciej Janowski,Lennart Purucker,Arber Zela,Frank Hutter,Josif Grabocka
关键词-EN: common practice widespread, adapt pretrained models, common practice, practice widespread, communities to adapt
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Workshop on Fine-Tuning in Modern Machine Learning @ NeurIPS 2024. arXiv admin note: text overlap with arXiv:2410.04520

点击查看摘要

Abstract:Finetuning is a common practice widespread across different communities to adapt pretrained models to particular tasks. Text classification is one of these tasks for which many pretrained models are available. On the other hand, ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates. However, ensembling pretrained models for text classification is not a well-studied avenue. In this paper, we present a metadataset with predictions from five large finetuned models on six datasets, and report results of different ensembling strategies from these predictions. Our results shed light on how ensembling can improve the performance of finetuned text classifiers and incentivize future adoption of ensembles in such tasks.
摘要：微调（Finetuning）是一种广泛应用于不同领域的常见做法，旨在将预训练模型适应于特定任务。文本分类是其中一项任务，已有许多预训练模型可供使用。另一方面，神经网络的集成通常用于提升性能并提供可靠的不确定性估计。然而，针对文本分类的预训练模型集成尚未得到充分研究。本文中，我们提出了一种包含五种大型微调模型在六个数据集上的预测结果的元数据集，并报告了基于这些预测的不同集成策略的结果。我们的研究结果揭示了集成如何提升微调文本分类器的性能，并激励未来在此类任务中采用集成方法。

[NLP-124] Critical biblical studies via word frequency analysis: unveiling text authorship

【速读】：该论文试图解决《圣经》文本的作者身份问题，特别是通过统计分析词频的方法来区分《圣经》前九本书中三个不同作者（D、DtrH和P）的文本。解决方案的关键在于利用词频的微小差异来识别不同作者的语言特征，而不依赖于对作者身份的先验假设。通过将各章节与参考文集进行相似性评估，研究成功地区分了三个作者的文本，并发现D和DtrH的文本比P的文本更为相似，这与专家评估一致。这种方法为圣经文本的作者身份提供了可解释且具有统计显著性的证据。

链接: https://arxiv.org/abs/2410.19883
作者: Shira Faigenbaum-Golovin,Alon Kipnis,Axel Bühler,Eli Piasetzky,Thomas Römer,Israel Finkelstein
关键词-EN: transmission spanning centuries, oral-written transmission spanning, biblical texts, obscures the contours, earlier recensions
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Bible, a product of an extensive and intricate process of oral-written transmission spanning centuries, obscures the contours of its earlier recensions. Debate rages over determining the existing layers and identifying the date of composition and historical background of the biblical texts. Traditional manual methodologies have grappled with authorship challenges through scrupulous textual criticism, employing linguistic, stylistic, inner-biblical, and historical criteria. Despite recent progress in computer-assisted analysis, many patterns still need to be uncovered in Biblical Texts. In this study, we address the question of authorship of biblical texts by employing statistical analysis to the frequency of words using a method that is particularly sensitive to deviations in frequencies associated with a few words out of potentially many. We aim to differentiate between three distinct authors across numerous chapters spanning the first nine books of the Bible. In particular, we examine 50 chapters labeled according to biblical exegesis considerations into three corpora (D, DtrH, and P). Without prior assumptions about author identity, our approach leverages subtle differences in word frequencies to distinguish among the three corpora and identify author-dependent linguistic properties. Our analysis indicates that the first two authors (D and DtrH) are much more closely related compared to P, a fact that aligns with expert assessments. Additionally, we attain high accuracy in attributing authorship by evaluating the similarity of each chapter with the reference corpora. This study sheds new light on the authorship of biblical texts by providing interpretable, statistically significant evidence that there are different linguistic characteristics of biblical authors and that these differences can be identified.
摘要：《圣经》作为一部历经数世纪口传与书写复杂过程的产物，模糊了其早期修订版的轮廓。关于现有文本层次的确定以及《圣经》文本的创作日期和历史背景的识别，争论激烈。传统的手工方法通过严谨的文本批评，运用语言学、文体学、内部圣经和历史标准，应对了作者身份的挑战。尽管计算机辅助分析在近年来取得了进展，但《圣经》文本中仍有许多模式有待揭示。在本研究中，我们通过采用统计分析方法，对单词频率进行分析，特别关注在众多单词中少数单词频率的偏差，来探讨《圣经》文本的作者身份问题。我们的目标是区分《圣经》前九本书中跨越多个章节的三位不同作者。具体而言，我们根据圣经释经学的考虑，将50个章节分为三个语料库（D、DtrH和P）。在没有先验作者身份假设的情况下，我们的方法利用单词频率的细微差异来区分这三个语料库，并识别依赖于作者的语言特征。我们的分析表明，前两位作者（D和DtrH）之间的关系比与P的关系更为密切，这一事实与专家评估相符。此外，我们通过评估每个章节与参考语料库的相似性，实现了高准确性的作者归属。本研究通过提供可解释的、统计上显著的证据，揭示了《圣经》文本作者的不同语言特征，并表明这些差异是可以被识别的，从而为《圣经》文本的作者身份问题提供了新的见解。

[NLP-125] Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

【速读】：该论文试图解决大规模预训练模型在适应特定下游任务时面临的计算和存储成本问题。解决方案的关键在于参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT)，其核心思想是通过高效调整预训练模型的参数，使其适应特定任务或领域，同时尽量减少引入额外参数和所需的计算资源。PEFT方法在保持模型性能的同时，显著降低了硬件平台上的计算和存储负担，为大规模模型的实际应用提供了可行性。

链接: https://arxiv.org/abs/2410.19878
作者: Luping Wang,Sheng Chen,Linnan Jiang,Shu Pan,Runze Cai,Sen Yang,Fei Yang
关键词-EN: scaling raw forecasts, surpassed human levels, made groundbreaking progress, natural language generation, language generation tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The large models, as predicted by scaling raw forecasts, have made groundbreaking progress in many fields, particularly in natural language generation tasks, where they have approached or even surpassed human levels. However, the unprecedented scale of their parameters brings significant computational and storage costs. These large models require substantial computational resources and GPU memory to operate. When adapting large models to specific downstream tasks, their massive parameter scale poses a significant challenge in fine-tuning on hardware platforms with limited computational power and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) offers a practical solution by efficiently adjusting the parameters of large pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts the parameters of pre-trained large models to adapt to specific tasks or domains, minimizing the introduction of additional parameters and the computational resources required. This review mainly introduces the preliminary knowledge of PEFT, the core ideas and principles of various PEFT algorithms, the applications of PEFT, and potential future research directions. By reading this review, we believe that interested parties can quickly grasp the PEFT methodology, thereby accelerating its development and innovation.
摘要：随着原始预测的扩展，大规模模型在多个领域取得了突破性进展，特别是在自然语言生成任务中，它们已经接近甚至超越了人类水平。然而，这些模型前所未有的参数规模带来了显著的计算和存储成本。这些大规模模型需要大量的计算资源和GPU内存才能运行。当将这些大模型适应于特定的下游任务时，其庞大的参数规模在计算能力和GPU内存有限的硬件平台上进行微调时构成了重大挑战。为了解决这一问题，参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）提供了一种实用的解决方案，通过高效调整预训练大模型的参数以适应各种下游任务。具体而言，PEFT调整预训练大模型的参数以适应特定任务或领域，同时最小化引入额外参数和所需的计算资源。本文主要介绍了PEFT的初步知识、各种PEFT算法的核心思想和原理、PEFT的应用以及潜在的未来研究方向。通过阅读本文，我们相信感兴趣的读者可以快速掌握PEFT方法，从而加速其发展和创新。

[NLP-126] Benchmarking Large Language Models for Image Classification of Marine Mammals

【速读】：该论文试图解决在海洋哺乳动物分类领域缺乏专门基准数据集的问题。解决方案的关键在于构建了一个包含1,423张图片、涵盖65种海洋哺乳动物的基准数据集，并对不同层次的分类（从物种级到中等级再到群体级）进行了详细分类。此外，论文评估了多种分类方法，包括使用神经网络嵌入的机器学习算法、预训练神经网络、零样本模型（如CLIP和LLMs）以及基于LLM的多智能体系统（MAS）。通过这些方法的比较，论文展示了传统模型和LLMs在不同方面的优势，并验证了MAS在提升分类性能方面的潜力。

链接: https://arxiv.org/abs/2410.19848
作者: Yijiashun Qi,Shuzhang Cai,Zunduo Zhao,Jiaming Li,Yanbin Lin,Zhiqiang Wang
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, achieved ground-breaking performance, past few decades
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ICKG 2024

点击查看摘要

Abstract:As Artificial Intelligence (AI) has developed rapidly over the past few decades, the new generation of AI, Large Language Models (LLMs) trained on massive datasets, has achieved ground-breaking performance in many applications. Further progress has been made in multimodal LLMs, with many datasets created to evaluate LLMs with vision abilities. However, none of those datasets focuses solely on marine mammals, which are indispensable for ecological equilibrium. In this work, we build a benchmark dataset with 1,423 images of 65 kinds of marine mammals, where each animal is uniquely classified into different levels of class, ranging from species-level to medium-level to group-level. Moreover, we evaluate several approaches for classifying these marine mammals: (1) machine learning (ML) algorithms using embeddings provided by neural networks, (2) influential pre-trained neural networks, (3) zero-shot models: CLIP and LLMs, and (4) a novel LLM-based multi-agent system (MAS). The results demonstrate the strengths of traditional models and LLMs in different aspects, and the MAS can further improve the classification performance. The dataset is available on GitHub: this https URL.
摘要：随着人工智能 (AI) 在过去几十年中的迅速发展，新一代基于大规模数据集训练的大语言模型 (LLMs) 在许多应用中取得了突破性的性能。在多模态 LLMs 方面也取得了进一步的进展，许多数据集被创建用于评估具有视觉能力的 LLMs。然而，这些数据集均未专门针对海洋哺乳动物，而海洋哺乳动物对于生态平衡是不可或缺的。在本研究中，我们构建了一个包含 1,423 张图像的基准数据集，涵盖了 65 种海洋哺乳动物，每种动物被唯一分类为不同级别的类别，从物种级别到中等级别再到群体级别。此外，我们评估了几种分类这些海洋哺乳动物的方法：(1) 使用神经网络提供的嵌入的机器学习 (ML) 算法，(2) 有影响力的预训练神经网络，(3) 零样本模型：CLIP 和大语言模型，以及 (4) 一种基于大语言模型的多智能体系统 (MAS)。结果显示了传统模型和大语言模型在不同方面的优势，而 MAS 可以进一步提高分类性能。该数据集可在 GitHub 上获取：this https URL。

[NLP-127] Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning

【速读】：该论文试图解决大语言模型（LLMs）在数学推理方面的瓶颈问题，特别是当前方法要么依赖大规模推理数据集进行训练，要么采用少样本方法导致准确性下降的问题。解决方案的关键在于提出了一种名为“步骤引导推理”（Step Guidance Reasoning）的新方法，该方法无需进一步模型微调。通过在推理阶段引入反思过程，即模型对小步骤推理进行自我审视，类似于人类在解决问题时的思考方式，从而有效地指导推理从一步到下一步。这种方法显著提升了数学推理的准确性，例如在AMC23数据集上将准确率从30%提升至57.5%，相对提升了91.7%，在MATH数据集的5级样本问题上，准确率从43%提升至67%，相对提升了55.8%。

链接: https://arxiv.org/abs/2410.19817
作者: Lang Cao,Chao Peng,Yitong Li
关键词-EN: large language models, challenging aspect, aspect of large, large language, Mathematical reasoning
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 4 pages, 4 figures

点击查看摘要

Abstract:Mathematical reasoning has been a challenging aspect of large language models (LLMs). However, the introduction of step-by-step Chain-of-Thought (CoT) inference has significantly advanced the mathematical capabilities of LLMs. Despite this progress, current approaches either require massive inference datasets as training datasets or rely on few-shot methods that often sacrifice accuracy. To address this bottleneck in mathematical reasoning, we propose a novel method called Step Guidance Reasoning without involving further model fine-tuning. In this approach, LLMs reflect on small reasoning steps – similar to how humans deliberate on and focus attention on what to do next. By incorporating this reflective process into the inference stage, LLMs can effectively guide their reasoning from one step to the next. Our method significantly improved the math performance, raising the accuracy on the AMC23 dataset from 30% to 57.5%, a relative improvement of 91.7%, and on the sampled level 5 problem of the MATH dataset, we achieved a relative accuracy improvement of 55.8%, increasing from 43% to 67%.
摘要：数学推理一直是大型语言模型（LLMs）中的一个挑战性领域。然而，逐步推理的链式思维（Chain-of-Thought, CoT）推理方法显著提升了LLMs的数学能力。尽管取得了这些进展，当前的方法要么需要大量的推理数据集作为训练数据，要么依赖于少样本方法，这些方法往往以牺牲准确性为代价。为了解决数学推理中的这一瓶颈，我们提出了一种名为“无进一步模型微调的步骤引导推理”的新方法。在这种方法中，LLMs通过反思小的推理步骤——类似于人类在思考和集中注意力于下一步该做什么时的行为。通过将这种反思过程融入推理阶段，LLMs能够有效地从一个步骤引导到下一个步骤。我们的方法显著提升了数学表现，将AMC23数据集上的准确率从30%提高到57.5%，相对提升了91.7%，并且在MATH数据集的抽样5级问题上，我们实现了55.8%的相对准确率提升，从43%提高到67%。

[NLP-128] Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models

【速读】：该论文旨在解决大型语言模型（LLMs）在研究与应用中涉及的伦理问题。随着LLMs在广泛应用中的集成度增加，其社会影响也随之扩大，从而引发了重要的伦理问题。论文的关键解决方案在于提供了一套全面的、实用的最佳实践指南，旨在帮助科研人员和行业从业者在开发、部署和使用LLMs时，能够坚持最高的伦理标准。

链接: https://arxiv.org/abs/2410.19812
作者: Eddie L. Ungless,Nikolas Vitsakis,Zeerak Talat,James Garforth,Björn Ross,Arno Onken,Atoosa Kasirzadeh,Alexandra Birch
关键词-EN: large language models, considerations surrounding research, ethical considerations surrounding, language models, offers an overview
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 47 pages

点击查看摘要

Abstract:This whitepaper offers an overview of the ethical considerations surrounding research into or with large language models (LLMs). As LLMs become more integrated into widely used applications, their societal impact increases, bringing important ethical questions to the forefront. With a growing body of work examining the ethical development, deployment, and use of LLMs, this whitepaper provides a comprehensive and practical guide to best practices, designed to help those in research and in industry to uphold the highest ethical standards in their work.
摘要：本白皮书概述了围绕大语言模型 (LLM) 研究或应用的伦理考量。随着 LLM 在广泛使用的应用中越来越普及，其社会影响也随之增加，从而将重要的伦理问题推到了前台。随着越来越多的研究关注 LLM 的伦理开发、部署和使用，本白皮书提供了一份全面且实用的最佳实践指南，旨在帮助学术界和工业界的从业者在其工作中坚持最高的伦理标准。

[NLP-129] ControlAgent : Automating Control System Design via Novel Integration of LLM Agents and Domain Expertise

【速读】：该论文试图解决大型语言模型（LLM）在控制系统设计中的应用局限性问题，特别是在面对控制理论的复杂性和专业性时。解决方案的关键在于引入ControlAgent，这是一种通过将LLM代理与控制领域的专业知识相结合，自动化控制系统设计的新范式。ControlAgent的核心在于其能够编码专家控制知识，并通过模拟人类工程师的迭代设计过程，逐步调整控制器参数以满足用户指定的稳定性、性能和鲁棒性要求。ControlAgent通过集成多个协作的LLM代理，包括负责任务分配的中央代理和专门针对不同系统和需求进行详细控制器设计的任务特定代理，以及执行复杂计算和控制器评估的Python计算代理。结合历史和反馈模块，任务特定的LLM代理能够根据先前设计的实时反馈迭代优化控制器参数，从而实现端到端的自动化控制系统设计解决方案。

链接: https://arxiv.org/abs/2410.19811
作者: Xingang Guo,Darioush Keivan,Usman Syed,Lianhui Qin,Huan Zhang,Geir Dullerud,Peter Seiler,Bin Hu
关键词-EN: Large Language Models, Control system design, sectors including aerospace, Control system, diverse sectors including
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Control system design is a crucial aspect of modern engineering with far-reaching applications across diverse sectors including aerospace, automotive systems, power grids, and robotics. Despite advances made by Large Language Models (LLMs) in various domains, their application in control system design remains limited due to the complexity and specificity of control theory. To bridge this gap, we introduce ControlAgent, a new paradigm that automates control system design via novel integration of LLM agents and control-oriented domain expertise. ControlAgent encodes expert control knowledge and emulates human iterative design processes by gradually tuning controller parameters to meet user-specified requirements for stability, performance, and robustness. ControlAgent integrates multiple collaborative LLM agents, including a central agent responsible for task distribution and task-specific agents dedicated to detailed controller design for various types of systems and requirements. ControlAgent also employs a Python computation agent that performs complex calculations and controller evaluations based on standard design information provided by task-specified LLM agents. Combined with a history and feedback module, the task-specific LLM agents iteratively refine controller parameters based on real-time feedback from prior designs. Overall, ControlAgent mimics the design processes used by (human) practicing engineers, but removes all the human efforts and can be run in a fully automated way to give end-to-end solutions for control system design with user-specified requirements. To validate ControlAgent’s effectiveness, we develop ControlEval, an evaluation dataset that comprises 500 control tasks with various specific design goals. The effectiveness of ControlAgent is demonstrated via extensive comparative evaluations between LLM-based and traditional human-involved toolbox-based baselines.
摘要：控制系统设计是现代工程中至关重要的一个方面，广泛应用于航空航天、汽车系统、电力网络和机器人等多个领域。尽管大语言模型 (LLM) 在各个领域取得了显著进展，但由于控制理论的复杂性和专业性，其在控制系统设计中的应用仍然有限。为了填补这一空白，我们提出了 ControlAgent，这是一种通过将 LLM 智能体与面向控制的领域专业知识相结合，实现控制系统设计自动化的全新范式。ControlAgent 通过逐步调整控制器参数以满足用户指定的稳定性、性能和鲁棒性要求，编码了专家控制知识并模拟了人类工程师的迭代设计过程。ControlAgent 集成了多个协作的 LLM 智能体，包括一个负责任务分配的中央智能体和多个专门针对不同类型系统和需求进行详细控制器设计的任务特定智能体。ControlAgent 还采用了一个 Python 计算智能体，该智能体根据任务特定 LLM 智能体提供的设计标准信息执行复杂的计算和控制器评估。结合历史和反馈模块，任务特定 LLM 智能体根据先前设计的实时反馈迭代优化控制器参数。总体而言，ControlAgent 模仿了（人类）实践工程师使用的设计过程，但消除了所有人工操作，并可以以全自动方式运行，为用户指定的控制系统设计提供端到端解决方案。为了验证 ControlAgent 的有效性，我们开发了 ControlEval，这是一个包含 500 个具有各种特定设计目标的控制任务的评估数据集。通过在基于 LLM 的基准和传统的人工参与工具箱基准之间进行广泛的比较评估，展示了 ControlAgent 的有效性。

[NLP-130] First-Person Fairness in Chatbots

【速读】：该论文试图解决在聊天机器人（如ChatGPT）应用中确保“第一人称公平性”（first-person fairness）的问题，即确保所有用户无论其身份或背景都能获得高质量且无偏见的响应。解决方案的关键在于提出了一种可扩展且保护隐私的方法，用于评估聊天机器人响应中与用户姓名相关的潜在偏见。具体来说，该方法利用第二语言模型在保护用户隐私的前提下，分析聊天机器人响应中与用户姓名相关的敏感性，并通过独立的人类评估验证这些注释的有效性。此外，论文还展示了通过强化学习（RL）等后训练干预措施，显著减少有害刻板印象的效果。该方法不仅能够识别和描述不同任务中响应的差异，还为外部研究人员提供了系统消息，以便进一步研究ChatGPT在假设用户配置文件下的行为。

链接: https://arxiv.org/abs/2410.19803
作者: Tyna Eloundou,Alex Beutel,David G. Robinson,Keren Gu-Lemberg,Anna-Luisa Brakman,Pamela Mishkin,Meghan Shah,Johannes Heidecke,Lilian Weng,Adam Tauman Kalai
关键词-EN: diverse purposes, chatbot, fairness, users, user
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chatbots like ChatGPT are used for diverse purposes, ranging from resume writing to entertainment. These real-world applications are different from the institutional uses, such as resume screening or credit scoring, which have been the focus of much of AI research on fairness. Ensuring equitable treatment for all users in these first-person contexts is critical. In this work, we study “first-person fairness,” which means fairness toward the chatbot user. This includes providing high-quality responses to all users regardless of their identity or background and avoiding harmful stereotypes. We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness across a large, heterogeneous corpus of real-world chatbot interactions. Specifically, we assess potential bias linked to users’ names, which can serve as proxies for demographic attributes like gender or race, in chatbot systems such as ChatGPT, which provide mechanisms for storing and using user names. Our method leverages a second language model to privately analyze name-sensitivity in the chatbot’s responses. We verify the validity of these annotations through independent human evaluation. Further, we show that post-training interventions, including RL, significantly mitigate harmful stereotypes. Our approach also yields succinct descriptions of response differences across tasks. For instance, in the “writing a story” task, chatbot responses show a tendency to create protagonists whose gender matches the likely gender inferred from the user’s name. Moreover, a pattern emerges where users with female-associated names receive responses with friendlier and simpler language slightly more often than users with male-associated names. Finally, we provide the system messages required for external researchers to further investigate ChatGPT’s behavior with hypothetical user profiles. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2410.19803 [cs.CY] (or arXiv:2410.19803v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2410.19803 Focus to learn more arXiv-issued DOI via DataCite
摘要：像 ChatGPT 这样的聊天机器人被用于多种目的，从简历撰写到娱乐。这些实际应用与机构用途（如简历筛选或信用评分）不同，后者一直是 AI 公平性研究的重点。在这些第一人称情境中，确保所有用户得到公平对待至关重要。在本研究中，我们探讨了“第一人称公平性”，即对聊天机器人用户的公平性。这包括无论用户的身份或背景如何，都提供高质量的回应，并避免有害的刻板印象。我们提出了一种可扩展的、保护隐私的方法，用于评估大规模、异构的真实世界聊天机器人交互中的第一人称公平性的一个方面。具体来说，我们评估了与用户姓名相关的潜在偏见，这些姓名可以作为性别或种族等人口统计属性的代理，在提供存储和使用用户姓名机制的聊天机器人系统（如 ChatGPT）中。我们的方法利用第二个语言模型来私密地分析聊天机器人回应中的姓名敏感性。我们通过独立的人类评估验证了这些注释的有效性。此外，我们展示了包括强化学习在内的训练后干预措施显著减少了有害的刻板印象。我们的方法还产生了简洁的描述，说明了不同任务中回应的差异。例如，在“撰写故事”任务中，聊天机器人的回应倾向于创造与用户姓名推断出的性别相匹配的主角。此外，出现了一种模式，即与女性相关的姓名用户比与男性相关的姓名用户更频繁地收到语言更友好和更简单的回应。最后，我们提供了系统消息，供外部研究人员进一步研究 ChatGPT 在假设用户配置文件下的行为。

主题：计算机与社会 (cs.CY); 人工智能 (cs.AI); 计算与语言 (cs.CL)
引用为：arXiv:2410.19803 [cs.CY]（或 arXiv:2410.19803v1 [cs.CY] 用于此版本）
https://doi.org/10.48550/arXiv.2410.19803
通过 DataCite 发布的 arXiv DOI

[NLP-131] co-DPR: A Hybrid Dataset for Evaluating Retrieval Models of 3GPP Technical Specifications

【速读】：该论文试图解决电信领域中基于3GPP技术文档的问答系统（Question-Answering, QA）的构建问题。解决方案的关键在于提出了一个混合数据集Telco-DPR，该数据集结合了文本和表格形式的3GPP文档，并包含用于评估检索性能的合成问答对。论文通过评估和比较稀疏模型（如BM25）和密集模型（如DPR和DHR）的检索性能，发现利用层次化段落选择的密集层次检索模型（Dense Hierarchical Retrieval, DHR）在检索相关技术信息方面表现最佳，Top-10准确率达到86.2%。此外，论文还采用了检索增强生成技术（Retriever-Augmented Generation, RAG），结合GPT-4，使得问答系统的答案准确性比之前的基准提高了14%。

链接: https://arxiv.org/abs/2410.19790
作者: Thaina Saraiva,Marco Sousa,Pedro Vieira,António Rodrigues
关键词-EN: Generation Partnership Project, Partnership Project, Generation Partnership, proposes a Question-Answering, paper proposes
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:This paper proposes a Question-Answering (QA) system for the telecom domain using 3rd Generation Partnership Project (3GPP) technical documents. Alongside, a hybrid dataset, Telco-DPR, which consists of a curated 3GPP corpus in a hybrid format, combining text and tables, is presented. Additionally, the dataset includes a set of synthetic question/answer pairs designed to evaluate the retrieval performance of QA systems on this type of data. The retrieval models, including the sparse model, Best Matching 25 (BM25), as well as dense models, such as Dense Passage Retriever (DPR) and Dense Hierarchical Retrieval (DHR), are evaluated and compared using top-K accuracy and Mean Reciprocal Rank (MRR). The results show that DHR, a retriever model utilising hierarchical passage selection through fine-tuning at both the document and passage levels, outperforms traditional methods in retrieving relevant technical information, achieving a Top-10 accuracy of 86.2%. Additionally, the Retriever-Augmented Generation (RAG) technique, used in the proposed QA system, is evaluated to demonstrate the benefits of using the hybrid dataset and the DHR. The proposed QA system, using the developed RAG model and the Generative Pretrained Transformer (GPT)-4, achieves a 14% improvement in answer accuracy, when compared to a previous benchmark on the same dataset.
摘要：本文提出了一种基于第三代合作伙伴计划 (3GPP) 技术文档的电信领域问答 (QA) 系统。同时，本文还介绍了一种混合数据集 Telco-DPR，该数据集包含一个精心策划的 3GPP 语料库，采用文本和表格混合格式。此外，该数据集还包括一组用于评估 QA 系统在该类型数据上检索性能的合成问答对。本文评估并比较了多种检索模型，包括稀疏模型 Best Matching 25 (BM25) 以及密集模型如 Dense Passage Retriever (DPR) 和 Dense Hierarchical Retrieval (DHR)，使用 Top-K 准确率和平均倒数排名 (MRR) 作为评价指标。结果显示，DHR 模型通过在文档和段落级别进行微调，利用层次化段落选择，在检索相关技术信息方面优于传统方法，Top-10 准确率达到 86.2%。此外，本文还评估了所提出的 QA 系统中使用的检索增强生成 (RAG) 技术，以展示使用混合数据集和 DHR 的优势。所提出的 QA 系统采用开发的 RAG 模型和大语言模型 (GPT)-4，与之前在同一数据集上的基准相比，答案准确率提高了 14%。

[NLP-132] Author Unknown: Evaluating Performance of Author Extraction Libraries on Global Online News Articles

【速读】：该论文试图解决在线新闻内容的大规模语料库分析中，作者元数据提取方法的稳健性验证问题。解决方案的关键在于构建了一个手动编码的跨语言新闻文章作者数据集，并利用该数据集评估了五种现有软件包和一个定制模型的作者提取性能。研究结果表明，Go-readability和Trafilatura在作者提取方面表现最为一致，但所有软件包在不同语言中的表现存在显著差异，这提示研究人员在使用作者数据进行分析时，需要针对特定语言和地理区域进行进一步的验证。

链接: https://arxiv.org/abs/2410.19771
作者: Sriharsha Hatwar,Virginia Partridge,Rahul Bhargava,Fernando Bermejo
关键词-EN: content requires robust, metadata extraction methodologies, underlying metadata extraction, requires robust validation, large corpora
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Analysis of large corpora of online news content requires robust validation of underlying metadata extraction methodologies. Identifying the author of a given web-based news article is one example that enables various types of research questions. While numerous solutions for off-the-shelf author extraction exist, there is little work comparing performance (especially in multilingual settings). In this paper we present a manually coded cross-lingual dataset of authors of online news articles and use it to evaluate the performance of five existing software packages and one customized model. Our evaluation shows evidence for Go-readability and Trafilatura as the most consistent solutions for author extraction, but we find all packages produce highly variable results across languages. These findings are relevant for researchers wishing to utilize author data in their analysis pipelines, primarily indicating that further validation for specific languages and geographies is required to rely on results.
摘要：对大规模在线新闻内容语料库的分析需要对底层元数据提取方法进行强有力的验证。识别给定基于网络的新闻文章的作者是支持各种类型研究问题的一个例子。尽管存在许多现成的作者提取解决方案，但在多语言环境下的性能比较研究却很少。本文中，我们提供了一个手动编码的跨语言新闻文章作者数据集，并利用它来评估五个现有软件包和一个定制模型的性能。我们的评估结果显示，Go-readability 和 Trafilatura 在作者提取方面表现最为一致，但我们发现所有软件包在不同语言中的结果都存在高度变异性。这些发现对于希望在其分析流程中利用作者数据的研究人员具有重要意义，主要表明在依赖结果之前，需要对特定语言和地理区域进行进一步验证。

[NLP-133] A Comparative Analysis on Ethical Benchmarking in Large Language Models

【速读】：该论文试图解决当前机器伦理（Machine Ethics, ME）基准测试中存在的三个主要问题：生态效度有限、问题生成缺乏结构化标准、以及依赖人工注释导致的可扩展性不足。解决方案的关键在于引入两个新的ME基准测试：Triage Benchmark和Medical Law (MedLaw) Benchmark。这两个基准测试均基于医疗领域的真实伦理困境，其中MedLaw Benchmark完全由AI生成，提供了可扩展的替代方案。此外，论文还引入了上下文扰动（context perturbations）来评估模型的最坏情况性能，发现伦理提示并不总能改善决策，且上下文扰动不仅显著降低模型性能，还可能逆转错误模式并改变相对性能排名。通过这些改进，论文强调ME基准测试必须模拟真实世界场景和最坏情况性能，以确保评估的鲁棒性。

链接: https://arxiv.org/abs/2410.19753
作者: Kira Sam,Raja Vavekanand
关键词-EN: intelligent systems accurately, systems accurately represent, field of Machine, accurately represent human, Machine Ethics
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 62 pages

点击查看摘要

Abstract:This work contributes to the field of Machine Ethics (ME) benchmarking, which develops tests to assess whether intelligent systems accurately represent human values and act accordingly. We identify three major issues with current ME benchmarks: limited ecological validity due to unrealistic ethical dilemmas, unstructured question generation without clear inclusion/exclusion criteria, and a lack of scalability due to reliance on human annotations. Moreover, benchmarks often fail to include sufficient syntactic variations, reducing the robustness of findings. To address these gaps, we introduce two new ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both featuring real-world ethical dilemmas from the medical domain. The MedLaw Benchmark, fully AI-generated, offers a scalable alternative. We also introduce context perturbations in our benchmarks to assess models’ worst-case performance. Our findings reveal that ethics prompting does not always improve decision-making. Furthermore, context perturbations not only significantly reduce model performance but can also reverse error patterns and shift relative performance rankings. Lastly, our comparison of worst-case performance suggests that general model capability does not always predict strong ethical decision-making. We argue that ME benchmarks must approximate real-world scenarios and worst-case performance to ensure robust evaluation.
摘要：本研究为机器伦理（Machine Ethics, ME）基准测试领域做出了贡献，该领域旨在开发测试以评估智能系统是否能准确体现人类价值观并据此行事。我们识别出当前ME基准测试存在的三个主要问题：由于伦理困境不切实际而导致的生态效度有限、问题生成缺乏明确的纳入/排除标准，以及依赖人工标注导致的可扩展性不足。此外，基准测试往往未能包含足够的句法变体，从而降低了研究结果的稳健性。为解决这些差距，我们引入了两个新的ME基准测试：Triage基准测试和Medical Law（MedLaw）基准测试，两者均基于医疗领域的真实伦理困境。MedLaw基准测试完全由AI生成，提供了可扩展的替代方案。我们还在基准测试中引入了上下文扰动，以评估模型的最差性能。我们的研究结果表明，伦理提示并不总能改善决策质量。此外，上下文扰动不仅显著降低了模型性能，还可能逆转错误模式并改变相对性能排名。最后，我们对最差性能的比较表明，通用模型能力并不总能预测出强大的伦理决策能力。我们认为，ME基准测试必须接近真实世界场景和最差性能，以确保评估的稳健性。

[NLP-134] Scaling up Masked Diffusion Models on Text

【速读】：该论文试图解决掩码扩散模型（Masked Diffusion Models, MDMs）在语言建模中的可扩展性和有效性问题，特别是在文本生成和语言理解等核心语言任务中的表现。解决方案的关键在于：1) 建立了MDMs的第一个扩展定律，展示了其扩展速率与自回归模型（Autoregressive Models, ARMs）相当，且计算差距较小；2) 训练了一系列参数规模高达11亿（B）的MDMs，并与同等或更大规模的ARMs进行系统性对比评估；3) 提出了一个简单而有效的无监督分类器无指导方法（unsupervised classifier-free guidance），利用大规模未配对数据提升条件推断的性能；4) 在语言理解和文本生成任务中，MDMs展示了与ARMs相当的性能，同时在某些情况下表现出更高的效率或质量，尤其是在处理双向推理和适应数据时间变化方面。

链接: https://arxiv.org/abs/2410.18514
作者: Shen Nie,Fengqi Zhu,Chao Du,Tianyu Pang,Qian Liu,Guangtao Zeng,Min Lin,Chongxuan Li
关键词-EN: Masked diffusion models, Masked diffusion, remain underexplored, shown promise, effectiveness in core
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective \emphunsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the \emphreverse curse encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at \urlthis https URL.
摘要：掩码扩散模型（Masked Diffusion Models, MDMs）在语言建模中显示出潜力，然而其在核心语言任务（如文本生成和语言理解）中的可扩展性和有效性仍未得到充分探索。本文首次为MDMs建立了扩展定律，展示了其扩展速率与自回归模型（Autoregressive Models, ARMs）相当，且计算差距相对较小。受其可扩展性的启发，我们训练了一系列参数规模高达11亿（B）的MDMs，系统地评估了它们在与ARM相当或更大规模下的性能。充分利用MDMs的概率公式，我们提出了一种简单而有效的无监督无分类器引导方法，该方法有效利用大规模未配对数据，提升了条件推断的性能。在语言理解方面，一个1.1B的MDM在八个零样本基准测试中，有四个测试的表现优于更大的1.5B GPT-2模型。在文本生成方面，与使用KV-cache的ARM相比，MDMs提供了灵活的权衡：MDMs在速度快1.4倍的情况下与ARM性能相当，或在计算成本较高的情况下提供更高的质量。此外，MDMs通过有效处理双向推理和适应数据中的时间变化，解决了ARM面临的挑战性任务。值得注意的是，一个1.1B的MDM打破了更大规模的ARM（如Llama-2（13B）和GPT-3（175B））在大量数据和计算资源下遇到的反向诅咒。我们的代码可在以下链接获取：\urlthis https URL。

[NLP-135] LinBridge: A Learnable Framework for Interpreting Nonlinear Neural Encoding Models ICLR2025

【速读】：该论文试图解决的问题是如何在保持可解释性的前提下，开发非线性神经编码模型来更好地解释大脑对信息的处理。解决方案的关键在于提出了LinBridge框架，该框架基于雅可比矩阵分析，通过将非线性映射分解为线性固有成分和样本选择性映射偏差，从而实现对非线性编码模型的可解释性分析。LinBridge利用自监督学习策略从测试集的雅可比矩阵中提取线性固有成分和非线性映射偏差，使其能够有效适应各种非线性编码模型。

链接: https://arxiv.org/abs/2410.20053
作者: Xiaohui Gao,Yue Cheng,Peiyang Li,Yijie Niu,Yifan Ren,Yiheng Liu,Haiyang Sun,Zhuoyi Li,Weiwei Xing,Xintao Hu
关键词-EN: brain processes information, nonlinear encoding models, encoding models, artificial neural networks, neural encoding models
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
备注: 9 pages of main text, 23 pages total, submitted to ICLR 2025 and currently under review

点击查看摘要

Abstract:Neural encoding of artificial neural networks (ANNs) links their computational representations to brain responses, offering insights into how the brain processes information. Current studies mostly use linear encoding models for clarity, even though brain responses are often nonlinear. This has sparked interest in developing nonlinear encoding models that are still interpretable. To address this problem, we propose LinBridge, a learnable and flexible framework based on Jacobian analysis for interpreting nonlinear encoding models. LinBridge posits that the nonlinear mapping between ANN representations and neural responses can be factorized into a linear inherent component that approximates the complex nonlinear relationship, and a mapping bias that captures sample-selective nonlinearity. The Jacobian matrix, which reflects output change rates relative to input, enables the analysis of sample-selective mapping in nonlinear models. LinBridge employs a self-supervised learning strategy to extract both the linear inherent component and nonlinear mapping biases from the Jacobian matrices of the test set, allowing it to adapt effectively to various nonlinear encoding models. We validate the LinBridge framework in the scenario of neural visual encoding, using computational visual representations from CLIP-ViT to predict brain activity recorded via functional magnetic resonance imaging (fMRI). Our experimental results demonstrate that: 1) the linear inherent component extracted by LinBridge accurately reflects the complex mappings of nonlinear neural encoding models; 2) the sample-selective mapping bias elucidates the variability of nonlinearity across different levels of the visual processing hierarchy. This study presents a novel tool for interpreting nonlinear neural encoding models and offers fresh evidence about hierarchical nonlinearity distribution in the visual cortex.
摘要：人工神经网络 (ANNs) 的神经编码将它们的计算表示与大脑反应联系起来，提供了关于大脑如何处理信息的见解。当前的研究大多使用线性编码模型以确保清晰性，尽管大脑反应通常是非线性的。这引发了开发仍然可解释的非线性编码模型的兴趣。为了解决这一问题，我们提出了 LinBridge，这是一个基于 Jacobian 分析的可学习和灵活的框架，用于解释非线性编码模型。LinBridge 假设 ANN 表示与神经反应之间的非线性映射可以分解为一个线性内在成分，该成分近似复杂的非线性关系，以及一个捕捉样本选择性非线性的映射偏差。Jacobian 矩阵反映了输出相对于输入的变化率，使得在非线性模型中分析样本选择性映射成为可能。LinBridge 采用自监督学习策略，从测试集的 Jacobian 矩阵中提取线性内在成分和非线性映射偏差，使其能够有效适应各种非线性编码模型。我们在神经视觉编码的场景中验证了 LinBridge 框架，使用 CLIP-ViT 的计算视觉表示来预测通过功能性磁共振成像 (fMRI) 记录的大脑活动。我们的实验结果表明：1) LinBridge 提取的线性内在成分准确反映了非线性神经编码模型的复杂映射；2) 样本选择性映射偏差阐明了视觉处理层次结构不同层次上非线性的变异性。本研究提出了一种解释非线性神经编码模型的新工具，并为视觉皮层中层次非线性分布提供了新的证据。

人工智能

[AI-0] Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

链接: https://arxiv.org/abs/2410.21275
作者: Manuel Benavent-Lledo,David Mulero-Pérez,David Ortiz-Perez,Jose Garcia-Rodriguez,Antonis Argyros
关键词-EN: hierarchical structure consisting, action recognition, levels of abstraction, structure consisting, remain unexplored
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The sequential execution of actions and their hierarchical structure consisting of different levels of abstraction, provide features that remain unexplored in the task of action recognition. In this study, we present a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and prior actions to reflect the sequential context. To achieve this goal, we introduce a novel transformer architecture tailored for action recognition that utilizes both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse and fine-grained action recognition, thereby exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset to introduce action hierarchies, introducing the Hierarchical TSU dataset. We also conduct an ablation study to assess the impact of different methods for integrating contextual and hierarchical data on action recognition performance. Results show that the proposed approach outperforms pre-trained SOTA methods when trained with the same hyperparameters. Moreover, they also show a 17.12% improvement in top-1 accuracy over the equivalent fine-grained RGB version when using ground-truth contextual information, and a 5.33% improvement when contextual information is obtained from actual predictions.

[AI-1] LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

链接: https://arxiv.org/abs/2410.21264
作者: Hanyu Wang,Saksham Suri,Yixuan Ren,Hao Chen,Abhinav Shrivastava
关键词-EN: LARP, designed to overcome, overcome limitations, limitations in current, video tokenizer designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP’s strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).

[AI-2] BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference

链接: https://arxiv.org/abs/2410.21262
作者: Changwoo Lee,Soo Min Kwon,Qing Qu,Hun-Seok Kim
关键词-EN: Large-scale foundation models, demonstrated exceptional performance, Large-scale foundation, BLAST matrix, demonstrated exceptional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large-scale foundation models have demonstrated exceptional performance in language and vision tasks. However, the numerous dense matrix-vector operations involved in these large networks pose significant computational challenges during inference. To address these challenges, we introduce the Block-Level Adaptive STructured (BLAST) matrix, designed to learn and leverage efficient structures prevalent in the weight matrices of linear layers within deep learning models. Compared to existing structured matrices, the BLAST matrix offers substantial flexibility, as it can represent various types of structures that are either learned from data or computed from pre-existing weight matrices. We demonstrate the efficiency of using the BLAST matrix for compressing both language and vision tasks, showing that (i) for medium-sized models such as ViT and GPT-2, training with BLAST weights boosts performance while reducing complexity by 70% and 40%, respectively; and (ii) for large foundation models such as Llama-7B and DiT-XL, the BLAST matrix achieves a 2x compression while exhibiting the lowest performance degradation among all tested structured matrices. Our code is available at \urlthis https URL.

[AI-3] AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

链接: https://arxiv.org/abs/2410.21259
作者: Han Bao,Yue Huang,Yanbo Wang,Jiayi Ye,Xiangqi Wang,Xiuyin Chen,Mohamed Elhoseiny,Xiangliang Zhang
关键词-EN: Large Vision-Language Models, Large Vision-Language, linguistic information, facilitating a wide, essential for advancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges as the evaluation benchmark always demands lots of human cost for its construction, and remains static, lacking flexibility once constructed. Even though automatic evaluation has been explored in textual modality, the visual modality remains under-explored. As a result, in this work, we address a question: “Can LVLMs serve as a path to automatic benchmarking?”. We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five demanded user inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability. We observe the following: (1) Our constructed benchmark accurately reflects varying task difficulties; (2) As task difficulty rises, the performance gap between models widens; (3) While models exhibit strong performance in abstract level understanding, they underperform in details reasoning tasks; and (4) Constructing a dataset with varying levels of difficulties is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals that LVLMs as judges have significant potential in various domains.

[AI-4] Multi-modal AI for comprehensive breast cancer prognostication

链接: https://arxiv.org/abs/2410.21256
作者: Jan Witowski,Ken Zeng,Joseph Cappadona,Jailan Elayoubi,Elena Diana Chiru,Nancy Chan,Young-Joon Kang,Frederick Howard,Irina Ostrovnaya,Carlos Fernandez-Granda,Freya Schnabel,Ugur Ozerdem,Kangning Liu,Zoe Steinsnyder,Nitya Thakore,Mohammad Sadic,Frank Yeung,Elisa Liu,Theodore Hill,Benjamin Swett,Danielle Rigau,Andrew Clayburn,Valerie Speirs,Marcus Vetter,Lina Sojak,Simone Muenst Soysal,Daniel Baumhoer,Khalil Choucair,Yu Zong,Lina Daoud,Anas Saad,Waleed Abdulsattar,Rafic Beydoun,Jia-Wern Pan,Haslina Makmur,Soo-Hwang Teo,Linda Ma Pak,Victor Angel,Dovile Zilenaite-Petrulaitiene,Arvydas Laurinavicius,Natalie Klar,Brian D. Piening,Carlo Bifulco,Sun-Young Jun,Jae Pak Yi,Su Hyun Lim,Adam Brufsky,Francisco J. Esteva,Lajos Pusztai,Yann LeCun,Krzysztof J. Geras
关键词-EN: guided by molecular, test, breast cancer, Treatment, clinical
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. Recurrence risk assessment plays a crucial role in personalizing treatment. Current methods, including genomic assays, have limited accuracy and clinical utility, leading to suboptimal decisions for many patients. We developed a test for breast cancer patient stratification based on digital pathology and clinical characteristics using novel AI methods. Specifically, we utilized a vision transformer-based pan-cancer foundation model trained with self-supervised learning to extract features from digitized HE-stained slides. These features were integrated with clinical data to form a multi-modal AI test predicting cancer recurrence and death. The test was developed and evaluated using data from a total of 8,161 breast cancer patients across 15 cohorts originating from seven countries. Of these, 3,502 patients from five cohorts were used exclusively for evaluation, while the remaining patients were used for training. Our test accurately predicted our primary endpoint, disease-free interval, in the five external cohorts (C-index: 0.71 [0.68-0.75], HR: 3.63 [3.02-4.37, p0.01]). In a direct comparison (N=858), the AI test was more accurate than Oncotype DX, the standard-of-care 21-gene assay, with a C-index of 0.67 [0.61-0.74] versus 0.61 [0.49-0.73], respectively. Additionally, the AI test added independent information to Oncotype DX in a multivariate analysis (HR: 3.11 [1.91-5.09, p0.01)]). The test demonstrated robust accuracy across all major breast cancer subtypes, including TNBC (C-index: 0.71 [0.62-0.81], HR: 3.81 [2.35-6.17, p=0.02]), where no diagnostic tools are currently recommended by clinical guidelines. These results suggest that our AI test can improve accuracy, extend applicability to a wider range of patients, and enhance access to treatment selection tools.

[AI-5] Capacity-Aware Planning and Scheduling in Budget-Constrained Monotonic MDPs: A Meta-RL Approach

链接: https://arxiv.org/abs/2410.21249
作者: Manav Vora,Ilan Shomorony,Melkior Ornik
关键词-EN: Markov Decision Processes, monotonic Markov Decision, Decision Processes, Markov Decision, system state stochastically
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Many real-world sequential repair problems can be effectively modeled using monotonic Markov Decision Processes (MDPs), where the system state stochastically decreases and can only be increased by performing a restorative action. This work addresses the problem of solving multi-component monotonic MDPs with both budget and capacity constraints. The budget constraint limits the total number of restorative actions and the capacity constraint limits the number of restorative actions that can be performed simultaneously. While prior methods dealt with budget constraints, including capacity constraints in prior methods leads to an exponential increase in computational complexity as the number of components in the MDP grows. We propose a two-step planning approach to address this challenge. First, we partition the components of the multi-component MDP into groups, where the number of groups is determined by the capacity constraint. We achieve this partitioning by solving a Linear Sum Assignment Problem (LSAP). Each group is then allocated a fraction of the total budget proportional to its size. This partitioning effectively decouples the large multi-component MDP into smaller subproblems, which are computationally feasible because the capacity constraint is simplified and the budget constraint can be addressed using existing methods. Subsequently, we use a meta-trained PPO agent to obtain an approximately optimal policy for each group. To validate our approach, we apply it to the problem of scheduling repairs for a large group of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot swarm, particularly for large swarm sizes.

[AI-6] Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce

链接: https://arxiv.org/abs/2410.21237
作者: Zhantao Yang,Han Zhang,Fangyi Chen,Anudeepsekhar Bolimera,Marios Savvides
关键词-EN: increasingly important role, Knowledge Graph, knowledge graph construction, playing an increasingly, increasingly important
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Graph (KG) is playing an increasingly important role in various AI systems. For e-commerce, an efficient and low-cost automated knowledge graph construction method is the foundation of enabling various successful downstream applications. In this paper, we propose a novel method for constructing structured product knowledge graphs from raw product images. The method cooperatively leverages recent advances in the vision-language model (VLM) and large language model (LLM), fully automating the process and allowing timely graph updates. We also present a human-annotated e-commerce product dataset for benchmarking product property extraction in knowledge graph construction. Our method outperforms our baseline in all metrics and evaluated properties, demonstrating its effectiveness and bright usage potential.

[AI-7] Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

链接: https://arxiv.org/abs/2410.21220
作者: Zhixin Zhang,Yiyuan Zhang,Xiaohan Ding,Xiangyu Yue
关键词-EN: Search engines enable, Vision Search Assistant, engines enable, enable the retrieval, retrieval of unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user’s question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs’ visual understanding capabilities and web agents’ real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.

[AI-8] SeriesGAN: Time Series Generation via Adversarial and Autoregressive Learning

链接: https://arxiv.org/abs/2410.21203
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Current Generative Adversarial, Generative Adversarial Network, Current Generative, generation face challenges, Generative Adversarial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been accepted at BigData 2024 on October 26, 2024, as a regular paper for oral presentation

点击查看摘要

Abstract:Current Generative Adversarial Network (GAN)-based approaches for time series generation face challenges such as suboptimal convergence, information loss in embedding spaces, and instability. To overcome these challenges, we introduce an advanced framework that integrates the advantages of an autoencoder-generated embedding space with the adversarial training dynamics of GANs. This method employs two discriminators: one to specifically guide the generator and another to refine both the autoencoder’s and generator’s output. Additionally, our framework incorporates a novel autoencoder-based loss function and supervision from a teacher-forcing supervisor network, which captures the stepwise conditional distributions of the data. The generator operates within the latent space, while the two discriminators work on latent and feature spaces separately, providing crucial feedback to both the generator and the autoencoder. By leveraging this dual-discriminator approach, we minimize information loss in the embedding space. Through joint training, our framework excels at generating high-fidelity time series data, consistently outperforming existing state-of-the-art benchmarks both qualitatively and quantitatively across a range of real and synthetic multivariate time series datasets.

[AI-9] Deep Learning-Based Fatigue Cracks Detection in Bridge Girders using Feature Pyramid Networks

链接: https://arxiv.org/abs/2410.21175
作者: Jiawei Zhang,Jun Li,Reachsak Ly,Yunyi Liu,Jiangpeng Shu
关键词-EN: structural health monitoring, crack, Feature Pyramid Networks, health monitoring, challenging problem
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:For structural health monitoring, continuous and automatic crack detection has been a challenging problem. This study is conducted to propose a framework of automatic crack segmentation from high-resolution images containing crack information about steel box girders of bridges. Considering the multi-scale feature of cracks, convolutional neural network architecture of Feature Pyramid Networks (FPN) for crack detection is proposed. As for input, 120 raw images are processed via two approaches (shrinking the size of images and splitting images into sub-images). Then, models with the proposed structure of FPN for crack detection are developed. The result shows all developed models can automatically detect the cracks at the raw images. By shrinking the images, the computation efficiency is improved without decreasing accuracy. Because of the separable characteristic of crack, models using the splitting method provide more accurate crack segmentations than models using the resizing method. Therefore, for high-resolution images, the FPN structure coupled with the splitting method is an promising solution for the crack segmentation and detection.

[AI-10] CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants ICLR2025

链接: https://arxiv.org/abs/2410.21159
作者: Lize Alberts,Benjamin Ellis,Andrei Lupu,Jakob Foerster
关键词-EN: handle user-provided safety-critical, introduce a multi-turn, multi-turn benchmark, benchmark for evaluating, ability to handle
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Submitted to ICLR 2025 on 01/10/2024

点击查看摘要

Abstract:We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated “harmless” models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI’s o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic ‘harmless and helpful’ instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.

[AI-11] rajectory Flow Matching with Applications to Clinical Time Series Modeling NEURIPS2024

链接: https://arxiv.org/abs/2410.21154
作者: Xi Zhang,Yuan Pu,Yuki Kawamura,Andrew Loza,Yoshua Bengio,Dennis L. Shung,Alexander Tong
关键词-EN: challenging problem found, irregularly sampled time, range of applications, irregularly sampled, wide range
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Modeling stochastic and irregularly sampled time series is a challenging problem found in a wide range of applications, especially in medicine. Neural stochastic differential equations (Neural SDEs) are an attractive modeling technique for this problem, which parameterize the drift and diffusion terms of an SDE with neural networks. However, current algorithms for training Neural SDEs require backpropagation through the SDE dynamics, greatly limiting their scalability and stability. To address this, we propose Trajectory Flow Matching (TFM), which trains a Neural SDE in a simulation-free manner, bypassing backpropagation through the dynamics. TFM leverages the flow matching technique from generative modeling to model time series. In this work we first establish necessary conditions for TFM to learn time series data. Next, we present a reparameterization trick which improves training stability. Finally, we adapt TFM to the clinical time series setting, demonstrating improved performance on three clinical time series datasets both in terms of absolute performance and uncertainty prediction.

[AI-12] Fast Calibrated Explanations: Efficient and Uncertainty-Aware Explanations for Machine Learning Models

链接: https://arxiv.org/abs/2410.21129
作者: Tuwe Löfström,Fatima Rabia Yapicioglu,Alessandra Stramiglio,Helena Löfström,Fabio Vitali
关键词-EN: machine learning models, Fast Calibrated Explanations, paper introduces Fast, introduces Fast Calibrated, Calibrated Explanations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 36 pages, 5 figures, journal submission

点击查看摘要

Abstract:This paper introduces Fast Calibrated Explanations, a method designed for generating rapid, uncertainty-aware explanations for machine learning models. By incorporating perturbation techniques from ConformaSight - a global explanation framework - into the core elements of Calibrated Explanations (CE), we achieve significant speedups. These core elements include local feature importance with calibrated predictions, both of which retain uncertainty quantification. While the new method sacrifices a small degree of detail, it excels in computational efficiency, making it ideal for high-stakes, real-time applications. Fast Calibrated Explanations are applicable to probabilistic explanations in classification and thresholded regression tasks, where they provide the likelihood of a target being above or below a user-defined threshold. This approach maintains the versatility of CE for both classification and probabilistic regression, making it suitable for a range of predictive tasks where uncertainty quantification is crucial.

[AI-13] Large Language Model-assisted Speech and Pointing Benefits Multiple 3D Object Selection in Virtual Reality

链接: https://arxiv.org/abs/2410.21091
作者: Junlong Chen,Jens Grubert,Per Ola Kristensson
关键词-EN: virtual reality, virtual reality scene, challenging problem, occluded object selection, object selection technique
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: under review

点击查看摘要

Abstract:Selection of occluded objects is a challenging problem in virtual reality, even more so if multiple objects are involved. With the advent of new artificial intelligence technologies, we explore the possibility of leveraging large language models to assist multi-object selection tasks in virtual reality via a multimodal speech and raycast interaction technique. We validate the findings in a comparative user study (n=24), where participants selected target objects in a virtual reality scene with different levels of scene perplexity. The performance metrics and user experience metrics are compared against a mini-map based occluded object selection technique that serves as the baseline. Results indicate that the introduced technique, AssistVR, outperforms the baseline technique when there are multiple target objects. Contrary to the common belief for speech interfaces, AssistVR was able to outperform the baseline even when the target objects were difficult to reference verbally. This work demonstrates the viability and interaction potential of an intelligent multimodal interactive system powered by large laguage models. Based on the results, we discuss the implications for design of future intelligent multimodal interactive systems in immersive environments.

[AI-14] Efficient Mixture-of-Expert for Video-based Driver State and Physiological Multi-task Estimation in Conditional Autonomous Driving

链接: https://arxiv.org/abs/2410.21086
作者: Jiyao Wang,Xiao Yang,Zhenyu Wang,Ximeng Wei,Ange Wang,Dengbo He,Kaishun Wu
关键词-EN: Road safety remains, million fatalities annually, fatalities annually attributed, critical challenge worldwide, Road safety
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Road safety remains a critical challenge worldwide, with approximately 1.35 million fatalities annually attributed to traffic accidents, often due to human errors. As we advance towards higher levels of vehicle automation, challenges still exist, as driving with automation can cognitively over-demand drivers if they engage in non-driving-related tasks (NDRTs), or lead to drowsiness if driving was the sole task. This calls for the urgent need for an effective Driver Monitoring System (DMS) that can evaluate cognitive load and drowsiness in SAE Level-2/3 autonomous driving contexts. In this study, we propose a novel multi-task DMS, termed VDMoE, which leverages RGB video input to monitor driver states non-invasively. By utilizing key facial features to minimize computational load and integrating remote Photoplethysmography (rPPG) for physiological insights, our approach enhances detection accuracy while maintaining efficiency. Additionally, we optimize the Mixture-of-Experts (MoE) framework to accommodate multi-modal inputs and improve performance across different tasks. A novel prior-inclusive regularization method is introduced to align model outputs with statistical priors, thus accelerating convergence and mitigating overfitting risks. We validate our method with the creation of a new dataset (MCDD), which comprises RGB video and physiological indicators from 42 participants, and two public datasets. Our findings demonstrate the effectiveness of VDMoE in monitoring driver states, contributing to safer autonomous driving systems. The code and data will be released.

[AI-15] Skip2-LoRA: A Lightweight On-device DNN Fine-tuning Method for Low-cost Edge Devices

链接: https://arxiv.org/abs/2410.21073
作者: Hiroki Matsutani,Masaaki Kondo,Kazuki Sunaga,Radu Marculescu
关键词-EN: deep neural networks, lightweight fine-tuning method, paper proposes, deployed models, method for deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ASP-DAC 2025 (accepted)

点击查看摘要

Abstract:This paper proposes Skip2-LoRA as a lightweight fine-tuning method for deep neural networks to address the gap between pre-trained and deployed models. In our approach, trainable LoRA (low-rank adaptation) adapters are inserted between the last layer and every other layer to enhance the network expressive power while keeping the backward computation cost low. This architecture is well-suited to cache intermediate computation results of the forward pass and then can skip the forward computation of seen samples as training epochs progress. We implemented the combination of the proposed architecture and cache, denoted as Skip2-LoRA, and tested it on a 15 single board computer. Our results show that Skip2-LoRA reduces the fine-tuning time by 90.0% on average compared to the counterpart that has the same number of trainable parameters while preserving the accuracy, while taking only a few seconds on the microcontroller board.

[AI-16] EMOCPD: Efficient Attention-based Models for Computational Protein Design Using Amino Acid Microenvironment

链接: https://arxiv.org/abs/2410.21069
作者: Xiaoqi Ling,Cheng Cai,Zhaohong Deng,Lei Wang,Zhisheng Wei,Jing Wu
关键词-EN: Computational protein design, protein design, Computational protein, protein, EMOCPD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Computational protein design (CPD) refers to the use of computational methods to design proteins. Traditional methods relying on energy functions and heuristic algorithms for sequence design are inefficient and do not meet the demands of the big data era in biomolecules, with their accuracy limited by the energy functions and search algorithms. Existing deep learning methods are constrained by the learning capabilities of the networks, failing to extract effective information from sparse protein structures, which limits the accuracy of protein design. To address these shortcomings, we developed an Efficient attention-based Models for Computational Protein Design using amino acid microenvironment (EMOCPD). It aims to predict the category of each amino acid in a protein by analyzing the three-dimensional atomic environment surrounding the amino acids, and optimize the protein based on the predicted high-probability potential amino acid categories. EMOCPD employs a multi-head attention mechanism to focus on important features in the sparse protein microenvironment and utilizes an inverse residual structure to optimize the network architecture. The proposed EMOCPD achieves over 80% accuracy on the training set and 68.33% and 62.32% accuracy on two independent test sets, respectively, surpassing the best comparative methods by over 10%. In protein design, the thermal stability and protein expression of the predicted mutants from EMOCPD show significant improvements compared to the wild type, effectively validating EMOCPD’s potential in designing superior proteins. Furthermore, the predictions of EMOCPD are influenced positively, negatively, or have minimal impact based on the content of the 20 amino acids, categorizing amino acids as positive, negative, or neutral. Research findings indicate that EMOCPD is more suitable for designing proteins with lower contents of negative amino acids.

[AI-17] Learning to Handle Complex Constraints for Vehicle Routing Problems NEURIPS2024

链接: https://arxiv.org/abs/2410.21066
作者: Jieyi Bi,Yining Ma,Jianan Zhou,Wen Song,Zhiguang Cao,Yaoxin Wu,Jie Zhang
关键词-EN: Vehicle Routing Problems, Vehicle Routing, Routing Problems, involve complex constraints, Proactive Infeasibility Prevention
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Vehicle Routing Problems (VRPs) can model many real-world scenarios and often involve complex constraints. While recent neural methods excel in constructing solutions based on feasibility masking, they struggle with handling complex constraints, especially when obtaining the masking itself is NP-hard. In this paper, we propose a novel Proactive Infeasibility Prevention (PIP) framework to advance the capabilities of neural methods towards more complex VRPs. Our PIP integrates the Lagrangian multiplier as a basis to enhance constraint awareness and introduces preventative infeasibility masking to proactively steer the solution construction process. Moreover, we present PIP-D, which employs an auxiliary decoder and two adaptive strategies to learn and predict these tailored masks, potentially enhancing performance while significantly reducing computational costs during training. To verify our PIP designs, we conduct extensive experiments on the highly challenging Traveling Salesman Problem with Time Window (TSPTW), and TSP with Draft Limit (TSPDL) variants under different constraint hardness levels. Notably, our PIP is generic to boost many neural methods, and exhibits both a significant reduction in infeasible rate and a substantial improvement in solution quality.

[AI-18] Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework EMNLP2024

链接: https://arxiv.org/abs/2410.21061
作者: Vladimir Arkhipkin,Viacheslav Vasilev,Andrei Filatov,Igor Pavlov,Julia Agafonova,Nikolai Gerasimenko,Anna Averchenkova,Evelina Mironova,Anton Bukashkin,Konstantin Kulikov,Andrey Kuznetsov,Denis Dimitrov
关键词-EN: image manipulation methods, introducing image manipulation, manipulation methods, popular for introducing, image fusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted for EMNLP 2024 (Demo track)

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models are popular for introducing image manipulation methods, such as editing, image fusion, inpainting, etc. At the same time, image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism. The key feature of the new architecture is the simplicity and efficiency of its adaptation for many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, I2V and T2V generation. We also present a distilled version of the T2I model, evaluating inference in 4 steps of the reverse process without reducing image quality and 3 times faster than the base model. We deployed a user-friendly demo system in which all the features can be tested in the public domain. Additionally, we released the source code and checkpoints for the Kandinsky 3 and extended models. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.

[AI-19] CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity

链接: https://arxiv.org/abs/2410.21060
作者: Yutong Cheng,Osama Bajaber,Saimon Amanuel Tsegai,Dawn Song,Peng Gao
关键词-EN: cyber threat intelligence, Textual descriptions, rapidly evolving threat, evolving threat landscape, crucial for organizations
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: under peer-review

点击查看摘要

Abstract:Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms challenging to adapt to new threats and ontologies. To bridge the gap, we propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; (3) an ICL-enhanced long-distance relation prediction technique to further complete the CKSG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.

[AI-20] Getting By Goal Misgeneralization With a Little Help From a Mentor NEURIPS2024

链接: https://arxiv.org/abs/2410.21052
作者: Tu Trinh,Mohamad H. Danesh,Nguyen X. Khanh,Benjamin Plaut
关键词-EN: distribution shift, real-world deployments, agent, agent internal state, goal misgeneralization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: SATA Workshop @ NeurIPS 2024 (Towards Safe and Trustworthy Agents)

点击查看摘要

Abstract:While reinforcement learning (RL) agents often perform well during training, they can struggle with distribution shift in real-world deployments. One particularly severe risk of distribution shift is goal misgeneralization, where the agent learns a proxy goal that coincides with the true goal during training but not during deployment. In this paper, we explore whether allowing an agent to ask for help from a supervisor in unfamiliar situations can mitigate this issue. We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization. We evaluate multiple methods for determining when the agent should request help and find that asking for help consistently improves performance. However, we also find that methods based on the agent’s internal state fail to proactively request help, instead waiting until mistakes have already occurred. Further investigation suggests that the agent’s internal state does not represent the coin at all, highlighting the importance of learning nuanced representations, the risks of ignoring everything not immediately relevant to reward, and the necessity of developing ask-for-help strategies tailored to the agent’s training algorithm.

[AI-21] Disentangled and Self-Explainable Node Representation Learning

链接: https://arxiv.org/abs/2410.21043
作者: Simone Piaggesi,André Panisson,Megha Khosla
关键词-EN: capture node properties, unsupervised structural similarity, structural similarity objectives, typically learned, supervised tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Node representations, or embeddings, are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on explaining graph model decisions, the interpretability of unsupervised node embeddings remains underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that generates self-explainable embeddings in an unsupervised manner. Our method employs disentangled representation learning to produce dimension-wise interpretable embeddings, where each dimension is aligned with distinct topological structure of the graph. We formalize novel desiderata for disentangled and interpretable embeddings, which drive our new objective functions, optimizing simultaneously for both interpretability and disentanglement. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of our approach.

[AI-22] FairStream: Fair Multimedia Streaming Benchmark for Reinforcement Learning Agents

链接: https://arxiv.org/abs/2410.21029
作者: Jannis Weil,Jonas Ringsdorf,Julian Barthel,Yi-Ping Phoebe Chen,Tobias Meuser
关键词-EN: Multimedia streaming accounts, today internet, Multimedia streaming, Quality of Experience, streaming accounts
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Multimedia streaming accounts for the majority of traffic in today’s internet. Mechanisms like adaptive bitrate streaming control the bitrate of a stream based on the estimated bandwidth, ideally resulting in smooth playback and a good Quality of Experience (QoE). However, selecting the optimal bitrate is challenging under volatile network conditions. This motivated researchers to train Reinforcement Learning (RL) agents for multimedia streaming. The considered training environments are often simplified, leading to promising results with limited applicability. Additionally, the QoE fairness across multiple streams is seldom considered by recent RL approaches. With this work, we propose a novel multi-agent environment that comprises multiple challenges of fair multimedia streaming: partial observability, multiple objectives, agent heterogeneity and asynchronicity. We provide and analyze baseline approaches across five different traffic classes to gain detailed insights into the behavior of the considered agents, and show that the commonly used Proximal Policy Optimization (PPO) algorithm is outperformed by a simple greedy heuristic. Future work includes the adaptation of multi-agent RL algorithms and further expansions of the environment.

[AI-23] Graph Based Traffic Analysis and Delay Prediction

链接: https://arxiv.org/abs/2410.21028
作者: Gabriele Borg,Charlie Abela
关键词-EN: densely populated country, square kilometre, densely populated, populated country, inhabitants per square
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research is focused on traffic congestion in the small island of Malta which is the most densely populated country in the EU with about 1,672 inhabitants per square kilometre (4,331 inhabitants/sq mi). Furthermore, Malta has a rapid vehicle growth. Based on our research, the number of vehicles increased by around 11,000 in a little more than 6 months, which shows how important it is to have an accurate and comprehensive means of collecting data to tackle the issue of fluctuating traffic in Malta. In this paper, we first present the newly built comprehensive traffic dataset, called MalTra. This dataset includes realistic trips made by members of the public across the island over a period of 200 days. We then describe the methodology we adopted to generate syntactic data to complete our data set as much as possible. In our research, we consider both MalTra and the Q-Traffic dataset, which has been used in several other research studies. The statistical ARIMA model and two graph neural networks, the spatial temporal graph convolutional network (STGCN) and the diffusion convolutional recurrent network (DCRNN) were used to analyse and compare the results with existing research. From the evaluation, we found that the DCRNN model outperforms the STGCN with the former resulting in MAE of 3.98 (6.65 in the case of the latter) and a RMSE of 7.78 (against 12.73 of the latter).

[AI-24] Informed Deep Abstaining Classifier: Investigating noise-robust training for diagnostic decision support systems ICONIP2024

链接: https://arxiv.org/abs/2410.21014
作者: Helen Schneider,Sebastian Nowak,Aditya Parikh,Yannik C. Layer,Maike Theis,Wolfgang Block,Alois M. Sprinkart,Ulrike Attenberger,Rafet Sifa
关键词-EN: Image-based diagnostic decision, diagnostic decision support, utilizing deep learning, Image-based diagnostic, Deep Abstaining Classifier
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This preprint has no post-submission improvements or corrections. The Version of Record of this contribution is published in the Neural Information Processing, ICONIP 2024 Proceedings

点击查看摘要

Abstract:Image-based diagnostic decision support systems (DDSS) utilizing deep learning have the potential to optimize clinical workflows. However, developing DDSS requires extensive datasets with expert annotations and is therefore costly. Leveraging report contents from radiological data bases with Natural Language Processing to annotate the corresponding image data promises to replace labor-intensive manual annotation. As mining “real world” databases can introduce label noise, noise-robust training losses are of great interest. However, current noise-robust losses do not consider noise estimations that can for example be derived based on the performance of the automatic label generator used. In this study, we expand the noise-robust Deep Abstaining Classifier (DAC) loss to an Informed Deep Abstaining Classifier (IDAC) loss by incorporating noise level estimations during training. Our findings demonstrate that IDAC enhances the noise robustness compared to DAC and several state-of-the-art loss functions. The results are obtained on various simulated noise levels using a public chest X-ray data set. These findings are reproduced on an in-house noisy data set, where labels were extracted from the clinical systems of the University Hospital Bonn by a text-based transformer. The IDAC can therefore be a valuable tool for researchers, companies or clinics aiming to develop accurate and reliable DDSS from routine clinical data.

[AI-25] EEG-Driven 3D Object Reconstruction with Color Consistency and Diffusion Prior

链接: https://arxiv.org/abs/2410.20981
作者: Xin Xiang,Wenhui Zhou,Guojun Dai
关键词-EN: current research hotspot, EEG-based visual perception, visual perception reconstruction, research hotspot, perception reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:EEG-based visual perception reconstruction has become a current research hotspot. Neuroscientific studies have shown that humans can perceive various types of visual information, such as color, shape, and texture, when observing objects. However, existing technical methods often face issues such as inconsistencies in texture, shape, and color between the visual stimulus images and the reconstructed images. In this paper, we propose a method for reconstructing 3D objects with color consistency based on EEG signals. The method adopts a two-stage strategy: in the first stage, we train an implicit neural EEG encoder with the capability of perceiving 3D objects, enabling it to capture regional semantic features; in the second stage, based on the latent EEG codes obtained in the first stage, we integrate a diffusion model, neural style loss, and NeRF to implicitly decode the 3D objects. Finally, through experimental validation, we demonstrate that our method can reconstruct 3D objects with color consistency using EEG.

[AI-26] Geo-FuB: A Method for Constructing an Operator-Function Knowledge Base for Geospatial Code Generation Tasks Using Large Language Models

链接: https://arxiv.org/abs/2410.20975
作者: Shuyang Hou,Anqi Zhao,Jianyuan Liang,Zhangxiao Shen,Huayi Wu
关键词-EN: large language models, efficient geospatial modeling, language models, rise of spatiotemporal, spatiotemporal data
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The rise of spatiotemporal data and the need for efficient geospatial modeling have spurred interest in automating these tasks with large language models (LLMs). However, general LLMs often generate errors in geospatial code due to a lack of domain-specific knowledge on functions and operators. To address this, a retrieval-augmented generation (RAG) approach, utilizing an external knowledge base of geospatial functions and operators, is proposed. This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics. The framework includes: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Semantic Mapping (Geo-FuM). Techniques like Chain-of-Thought, TF-IDF, and the APRIORI algorithm are utilized to derive and align geospatial functions. An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub. Evaluation metrics show a high accuracy, reaching 88.89% overall, with structural and semantic accuracies of 92.03% and 86.79% respectively. Geo-FuB’s potential to optimize geospatial code generation through the RAG and fine-tuning paradigms is highlighted.

[AI-27] BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

链接: https://arxiv.org/abs/2410.20971
作者: Yunhan Zhao,Xiang Zheng,Lin Luo,Yige Li,Xingjun Ma,Yu-Gang Jiang
关键词-EN: superb multimodal capabilities, output harmful responses, multimodal capabilities, tricky prompts, superb multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite their superb multimodal capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks, which are inference-time attacks that induce the model to output harmful responses with tricky prompts. It is thus essential to defend VLMs against potential jailbreaks for their trustworthy deployment in real-world applications. In this work, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends the black-box target VLM against jailbreak attacks without compromising its performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning for enhancing cross-modal robustness. We empirically show on three VLMs (LLaVA, MiniGPT-4, and Gemini) and two safety benchmarks (MM-SafetyBench and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.

[AI-28] Improving Detection of Person Class Using Dense Pooling

链接: https://arxiv.org/abs/2410.20966
作者: Nouman Ahmad
关键词-EN: deep learning models, improve the accuracy, continuous development, development of deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Lately, the continuous development of deep learning models by many researchers in the area of computer vision has attracted more researchers to further improve the accuracy of these models. FasterRCNN [32] has already provided a state-of-the-art approach to improve the accuracy and detection of 80 different objects given in the COCO dataset. To further improve the performance of person detection we have conducted a different approach which gives the state-of-the-art conclusion. An ROI is a step in FasterRCNN that extract the features from the given image with a fixed size and transfer into for further classification. To enhance the ROI performance, we have conducted an approach that implements dense pooling and converts the image into a 3D model to further transform into UV(ultra Violet) images which makes it easy to extract the right features from the images. To implement our approach we have approached the state-of-the-art COCO datasets and extracted 6982 images that include a person object and our final achievements conclude that using our approach has made significant results in detecting the person object in the given image

[AI-29] Neuro-symbolic Learning Yielding Logical Constraints NEURIPS2023

链接: https://arxiv.org/abs/2410.20957
作者: Zenan Li,Yunpeng Huang,Zhaoyu Li,Yuan Yao,Jingwei Xu,Taolue Chen,Xiaoxing Ma,Jian Lu
关键词-EN: Neuro-symbolic systems combine, Neuro-symbolic systems, logical constraint, combine the abilities, logical
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a conference paper at NeurIPS 2023, and code is available at [this url]( this https URL )

点击查看摘要

Abstract:Neuro-symbolic systems combine the abilities of neural perception and logical reasoning. However, end-to-end learning of neuro-symbolic systems is still an unsolved challenge. This paper proposes a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis into a coherent and efficient end-to-end learning process. The capability of this framework comes from the improved interactions between the neural and the symbolic parts of the system in both the training and inference stages. Technically, to bridge the gap between the continuous neural network and the discrete logical constraint, we introduce a difference-of-convex programming technique to relax the logical constraints while maintaining their precision. We also employ cardinality constraints as the language for logical constraint learning and incorporate a trust region method to avoid the degeneracy of logical constraint in learning. Both theoretical analyses and empirical evaluations substantiate the effectiveness of the proposed framework.

[AI-30] Active Legibility in Multiagent Reinforcement Learning

链接: https://arxiv.org/abs/2410.20954
作者: Yanyu Liu,Yinghui Pan,Yifeng Zeng,Biyang Ma,Doshi Prashant
关键词-EN: autonomous driving cars, including urban transportation, critical applications including, applications including urban, multiagent sequential decision
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A multiagent sequential decision problem has been seen in many critical applications including urban transportation, autonomous driving cars, military operations, etc. Its widely known solution, namely multiagent reinforcement learning, has evolved tremendously in recent years. Among them, the solution paradigm of modeling other agents attracts our interest, which is different from traditional value decomposition or communication mechanisms. It enables agents to understand and anticipate others’ behaviors and facilitates their collaboration. Inspired by recent research on the legibility that allows agents to reveal their intentions through their behavior, we propose a multiagent active legibility framework to improve their performance. The legibility-oriented framework allows agents to conduct legible actions so as to help others optimise their behaviors. In addition, we design a series of problem domains that emulate a common scenario and best characterize the legibility in multiagent reinforcement learning. The experimental results demonstrate that the new framework is more efficient and costs less training time compared to several multiagent reinforcement learning algorithms.

[AI-31] FACTS: A Factored State-Space Framework For World Modelling

链接: https://arxiv.org/abs/2410.20922
作者: Li Nanbo,Firas Laakom,Yucheng Xu,Wenyi Wang,Jürgen Schmidhuber
关键词-EN: World modelling, essential for understanding, understanding and predicting, predicting the dynamics, dynamics of complex
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Code released in this https URL

点击查看摘要

Abstract:World modelling is essential for understanding and predicting the dynamics of complex systems by learning both spatial and temporal dependencies. However, current frameworks, such as Transformers and selective state-space models like Mambas, exhibit limitations in efficiently encoding spatial and temporal structures, particularly in scenarios requiring long-term high-dimensional sequence modelling. To address these issues, we propose a novel recurrent framework, the \textbfFACTored \textbfState-space (\textbfFACTS) model, for spatial-temporal world modelling. The FACTS framework constructs a graph-structured memory with a routing mechanism that learns permutable memory representations, ensuring invariance to input permutations while adapting through selective state-space propagation. Furthermore, FACTS supports parallel computation of high-dimensional sequences. We empirically evaluate FACTS across diverse tasks, including multivariate time series forecasting and object-centric world modelling, demonstrating that it consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.

[AI-32] Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM -driven Cyberattacks

链接: https://arxiv.org/abs/2410.20911
作者: Dario Pasquini,Evgenios M. Kornaropoulos,Giuseppe Ateniese
关键词-EN: Large language models, Large language, making sophisticated exploits, language models, making sophisticated
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: v0.1

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs’ susceptibility to adversarial inputs to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker’s LLM to disrupt their own operations (passive defense) or even compromise the attacker’s machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker’s LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool: this https URL

[AI-33] Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models

链接: https://arxiv.org/abs/2410.20898
作者: Weijian Luo,Colin Zhang,Debing Zhang,Zhengyang Geng
关键词-EN: generate highly realistic, highly realistic images, frame human preference, human preference alignment, human preference
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In this paper, we introduce the Diff-Instruct*(DI*), a data-free approach for building one-step text-to-image generative models that align with human preference while maintaining the ability to generate highly realistic images. We frame human preference alignment as online reinforcement learning using human feedback (RLHF), where the goal is to maximize the reward function while regularizing the generator distribution to remain close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization, which leads to significantly better performances. Although the direct calculation of this divergence remains intractable, we demonstrate that we can efficiently compute its \emphgradient by deriving an equivalent yet tractable loss function. Remarkably, with Stable Diffusion V1.5 as the reference diffusion model, DI* outperforms \emphall previously leading models by a large margin. When using the 0.6B PixelArt- \alpha model as the reference diffusion, DI* achieves a new record Aesthetic Score of 6.30 and an Image Reward of 1.31 with only a single generation step, almost doubling the scores of the rest of the models with similar sizes. It also achieves an HPSv2 score of 28.70, establishing a new state-of-the-art benchmark. We also observe that DI* can improve the layout and enrich the colors of generated images.

[AI-34] Active Causal Structure Learning with Latent Variables: Towards Learning to Detour in Autonomous Robots

链接: https://arxiv.org/abs/2410.20894
作者: Pablo de los Riscos,Fernando Corbacho
关键词-EN: Artificial General Intelligence, Artificial General, General Intelligence, build AGI agents, causal models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 44 pages, 12 figures

点击查看摘要

Abstract:Artificial General Intelligence (AGI) Agents and Robots must be able to cope with everchanging environments and tasks. They must be able to actively construct new internal causal models of their interactions with the environment when new structural changes take place in the environment. Thus, we claim that active causal structure learning with latent variables (ACSLWL) is a necessary component to build AGI agents and robots. This paper describes how a complex planning and expectation-based detour behavior can be learned by ACSLWL when, unexpectedly, and for the first time, the simulated robot encounters a sort of transparent barrier in its pathway towards its target. ACSWL consists of acting in the environment, discovering new causal relations, constructing new causal models, exploiting the causal models to maximize its expected utility, detecting possible latent variables when unexpected observations occur, and constructing new structures-internal causal models and optimal estimation of the associated parameters, to be able to cope efficiently with the new encountered situations. That is, the agent must be able to construct new causal internal models that transform a previously unexpected and inefficient (sub-optimal) situation, into a predictable situation with an optimal operating plan.

[AI-35] Explainability in AI Based Applications: A Framework for Comparing Different Techniques

链接: https://arxiv.org/abs/2410.20873
作者: Arne Grobrugge,Nidhi Mishra,Johannes Jakubik,Gerhard Satzger
关键词-EN: significantly enhanced decision-making, enhanced decision-making capabilities, explainability techniques, artificial intelligence, processes has significantly
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:The integration of artificial intelligence into business processes has significantly enhanced decision-making capabilities across various industries such as finance, healthcare, and retail. However, explaining the decisions made by these AI systems poses a significant challenge due to the opaque nature of recent deep learning models, which typically function as black boxes. To address this opacity, a multitude of explainability techniques have emerged. However, in practical business applications, the challenge lies in selecting an appropriate explainability method that balances comprehensibility with accuracy. This paper addresses the practical need of understanding differences in the output of explainability techniques by proposing a novel method for the assessment of the agreement of different explainability techniques. Based on our proposed methods, we provide a comprehensive comparative analysis of six leading explainability techniques to help guiding the selection of such techniques in practice. Our proposed general-purpose method is evaluated on top of one of the most popular deep learning architectures, the Vision Transformer model, which is frequently employed in business applications. Notably, we propose a novel metric to measure the agreement of explainability techniques that can be interpreted visually. By providing a practical framework for understanding the agreement of diverse explainability techniques, our research aims to facilitate the broader integration of interpretable AI systems in business applications.

[AI-36] Strada-LLM : Graph LLM for traffic prediction

链接: https://arxiv.org/abs/2410.20856
作者: Seyed Mohamad Moghadas,Yangxintong Lyu,Bruno Cornelis,Alexandre Alahi,Adrian Munteanu
关键词-EN: intelligent transportation systems, transportation systems, vital component, component of intelligent, intelligent transportation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Traffic prediction is a vital component of intelligent transportation systems. By reasoning about traffic patterns in both the spatial and temporal dimensions, accurate and interpretable predictions can be provided. A considerable challenge in traffic prediction lies in handling the diverse data distributions caused by vastly different traffic conditions occurring at different locations. LLMs have been a dominant solution due to their remarkable capacity to adapt to new datasets with very few labeled data samples, i.e., few-shot adaptability. However, existing forecasting techniques mainly focus on extracting local graph information and forming a text-like prompt, leaving LLM- based traffic prediction an open problem. This work presents a probabilistic LLM for traffic forecasting with three highlights. We propose a graph-aware LLM for traffic prediction that considers proximal traffic information. Specifically, by considering the traffic of neighboring nodes as covariates, our model outperforms the corresponding time-series LLM. Furthermore, we adopt a lightweight approach for efficient domain adaptation when facing new data distributions in few-shot fashion. The comparative experiment demonstrates the proposed method outperforms the state-of-the-art LLM-based methods and the traditional GNN- based supervised approaches. Furthermore, Strada-LLM can be easily adapted to different LLM backbones without a noticeable performance drop.

[AI-37] Deep Insights into Automated Optimization with Large Language Models and Evolutionary Algorithms

链接: https://arxiv.org/abs/2410.20848
作者: He Yu,Jing Liu
关键词-EN: Designing optimization approaches, diverse problem domains, Large Language Models, demands extensive manual, extensive manual intervention
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing optimization approaches, whether heuristic or meta-heuristic, usually demands extensive manual intervention and has difficulty generalizing across diverse problem domains. The combination of Large Language Models (LLMs) and Evolutionary Algorithms (EAs) offers a promising new approach to overcome these limitations and make optimization more automated. In this setup, LLMs act as dynamic agents that can generate, refine, and interpret optimization strategies, while EAs efficiently explore complex solution spaces through evolutionary operators. Since this synergy enables a more efficient and creative search process, we first conduct an extensive review of recent research on the application of LLMs in optimization. We focus on LLMs’ dual functionality as solution generators and algorithm designers. Then, we summarize the common and valuable designs in existing work and propose a novel LLM-EA paradigm for automated optimization. Furthermore, centered on this paradigm, we conduct an in-depth analysis of innovative methods for three key components: individual representation, variation operators, and fitness evaluation. We address challenges related to heuristic generation and solution exploration, especially from the LLM prompts’ perspective. Our systematic review and thorough analysis of the paradigm can assist researchers in better understanding the current research and promoting the development of combining LLMs with EAs for automated optimization.

[AI-38] ADLM – stega: A Universal Adaptive Token Selection Algorithm for Improving Steganographic Text Quality via Information Entropy

链接: https://arxiv.org/abs/2410.20825
作者: Zezheng Qin,Congcong Sun,Taiyi He,Yuke He,Azizol Abdullah,Normalia Samian,Nuur Alifah Roslan
关键词-EN: global information sharing, widespread global information, information entropy, information, focal points
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the context of widespread global information sharing, information security and privacy protection have become focal points. Steganographic systems enhance information security by embedding confidential information into public carriers; however, existing generative text steganography methods face challenges in handling the long-tail distribution of candidate word pools, which impacts the imperceptibility of steganographic information. This paper proposes a quality control theory for steganographic text generation based on information entropy constraints, exploring the relationship between the imperceptibility of steganographic texts and information entropy. By controlling the information entropy of the candidate word pool within a specific range, we optimize the imperceptibility of the steganographic text. We establish upper and lower bounds for information entropy and introduce an adaptive truncation method to balance semantic coherence and lexical diversity. Experimental results demonstrate that reasonably controlling the candidate pool size and information entropy thresholds significantly enhances the quality and detection resistance of steganographic texts, showcasing broad application potential in the field of natural language processing.

[AI-39] From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

链接: https://arxiv.org/abs/2410.20791
作者: Gopi Krishnan Rajbahadur,Gustavo A. Oliva,Dayi Lin,Ahmed E. Hassan
关键词-EN: large language models, integrate FMs, core components, rapid expansion, expansion of foundation
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid expansion of foundation models (FMs), such as large language models (LLMs), has given rise to FMware–software systems that integrate FMs as core components. While building demonstration-level FMware is relatively straightforward, transitioning to production-ready systems presents numerous challenges, including reliability, high implementation costs, scalability, and compliance with privacy regulations. This paper provides a thematic analysis of the key obstacles in productionizing FMware, synthesized from industry experience and diverse data sources, including hands-on involvement in the Open Platform for Enterprise AI (OPEA) and FMware lifecycle engineering. We identify critical issues in FM selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration. We discuss needed technologies and strategies to address these challenges and offer guidance on how to enable the transition from demonstration systems to scalable, production-ready FMware solutions. Our findings underscore the importance of continued research and multi-industry collaboration to advance the development of production-ready FMware.

[AI-40] Introducing Spectral Attention for Long-Range Dependency in Time Series Forecasting

链接: https://arxiv.org/abs/2410.20772
作者: Bong Gyun Kang,Dongjun Lee,HyunGi Kim,DoHyun Chung
关键词-EN: modeling faces challenges, Sequence modeling faces, Spectral Attention, diverse tasks, Spectral Attention mechanism
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Co-first Author: Bong Gyun Kang, Dongjun Lee

点击查看摘要

Abstract:Sequence modeling faces challenges in capturing long-range dependencies across diverse tasks. Recent linear and transformer-based forecasters have shown superior performance in time series forecasting. However, they are constrained by their inherent inability to effectively address long-range dependencies in time series data, primarily due to using fixed-size inputs for prediction. Furthermore, they typically sacrifice essential temporal correlation among consecutive training samples by shuffling them into mini-batches. To overcome these limitations, we introduce a fast and effective Spectral Attention mechanism, which preserves temporal correlations among samples and facilitates the handling of long-range information while maintaining the base model structure. Spectral Attention preserves long-period trends through a low-pass filter and facilitates gradient to flow between samples. Spectral Attention can be seamlessly integrated into most sequence models, allowing models with fixed-sized look-back windows to capture long-range dependencies over thousands of steps. Through extensive experiments on 11 real-world time series datasets using 7 recent forecasting models, we consistently demonstrate the efficacy of our Spectral Attention mechanism, achieving state-of-the-art results.

[AI-41] ODRL: A Benchmark for Off-Dynamics Reinforcement Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.20750
作者: Jiafei Lyu,Kang Xu,Jiacheng Xu,Mengbei Yan,Jingwen Yang,Zongzhang Zhang,Chenjia Bai,Zongqing Lu,Xiu Li
关键词-EN: off-dynamics reinforcement learning, reinforcement learning, transfer policies, dynamics mismatch, off-dynamics reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 DB Track

点击查看摘要

Abstract:We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent’s adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors. Our code is publicly available at this https URL.

[AI-42] Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20745
作者: Yilun Jin,Zheng Li,Chenwei Zhang,Tianyu Cao,Yifan Gao,Pratik Jayarao,Mao Li,Xin Liu,Ritesh Sarkhel,Xianfeng Tang,Haodong Wang,Zhengyang Wang,Wenju Xu,Jingfeng Yang,Qingyu Yin,Xian Li,Priyanka Nigam,Yi Xu,Kai Chen,Qiang Yang,Meng Jiang,Bing Yin
关键词-EN: Shopping MMLU, Online shopping, shopping, few-shot learning problem, multi-task online shopping
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 Datasets and Benchmarks Track Accepted

点击查看摘要

Abstract:Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality, and can thus comprehensively evaluate the abilities of LLMs as general shop assistants. With Shopping MMLU, we benchmark over 20 existing LLMs and uncover valuable insights about practices and prospects of building versatile LLM-based shop assistants. Shopping MMLU can be publicly accessed at this https URL. In addition, with Shopping MMLU, we host a competition in KDD Cup 2024 with over 500 participating teams. The winning solutions and the associated workshop can be accessed at our website this https URL.

[AI-43] Mitigating Unauthorized Speech Synthesis for Voice Protection CCS

链接: https://arxiv.org/abs/2410.20742
作者: Zhisheng Zhang,Qianyi Yang,Derui Wang,Pengyang Huang,Yuxin Cao,Kai Ye,Jie Hao
关键词-EN: illegal financial gain, malicious voice exploitation, brought huge hazards, recent years, telecom fraud
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ACM CCS Workshop (LAMPS) 2024

点击查看摘要

Abstract:With just a few speech samples, it is possible to perfectly replicate a speaker’s voice in recent years, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards in our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples to prevent them from being effectively learned for text-to-speech (TTS) synthesis models so that high-quality deepfake speeches cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.

[AI-44] GPRec: Bi-level User Modeling for Deep Recommenders

链接: https://arxiv.org/abs/2410.20730
作者: Yejing Wang,Dong Xu,Xiangyu Zhao,Zhiren Mao,Peng Xiang,Ling Yan,Yao Hu,Zijian Zhang,Xuetao Wei,Qidong Liu
关键词-EN: explicitly categorizes users, GPRec explicitly categorizes, explicitly categorizes, categorizes users, learnable manner
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:GPRec explicitly categorizes users into groups in a learnable manner and aligns them with corresponding group embeddings. We design the dual group embedding space to offer a diverse perspective on group preferences by contrasting positive and negative patterns. On the individual level, GPRec identifies personal preferences from ID-like features and refines the obtained individual representations to be independent of group ones, thereby providing a robust complement to the group-level modeling. We also present various strategies for the flexible integration of GPRec into various DRS models. Rigorous testing of GPRec on three public datasets has demonstrated significant improvements in recommendation quality.

[AI-45] Lecture I: Governing the Algorithmic City

链接: https://arxiv.org/abs/2410.20720
作者: Seth Lazar
关键词-EN: John Dewey observed, John Dewey, affected human relationships, Dewey observed, century ago
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A century ago, John Dewey observed that ‘[s]team and electricity have done more to alter the conditions under which men associate together than all the agencies which affected human relationships before our time’. In the last few decades, computing technologies have had a similar effect. Political philosophy’s central task is to help us decide how to live together, by analysing our social relations, diagnosing their failings, and articulating ideals to guide their revision. But these profound social changes have left scarcely a dent in the model of social relations that (analytical) political philosophers assume. This essay aims to reverse that trend. It first builds a model of our novel social relations as they are now, and as they are likely to evolved, and then explores how those differences affect our theories of how to live together. I introduce the ‘Algorithmic City’, the network of algorithmically-mediated social relations, then characterise the intermediary power by which it is governed. I show how algorithmic governance raises new challenges for political philosophy concerning the justification of authority, the foundations of procedural legitimacy, and the possibility of justificatory neutrality.

[AI-46] Lecture II: Communicative Justice and the Distribution of Attention

链接: https://arxiv.org/abs/2410.20718
作者: Seth Lazar
关键词-EN: digital public sphere, digital public, public sphere, amplification algorithms, moderation practices
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Algorithmic intermediaries govern the digital public sphere through their architectures, amplification algorithms, and moderation practices. In doing so, they shape public communication and distribute attention in ways that were previously infeasible with such subtlety, speed and scale. From misinformation and affective polarisation to hate speech and radicalisation, the many pathologies of the digital public sphere attest that they could do so better. But what ideals should they aim at? Political philosophy should be able to help, but existing theories typically assume that a healthy public sphere will spontaneously emerge if only we get the boundaries of free expression right. They offer little guidance on how to intentionally constitute the digital public sphere. In addition to these theories focused on expression, we need a further theory of communicative justice, targeted specifically at the algorithmic intermediaries that shape communication and distribute attention. This lecture argues that political philosophy urgently owes an account of how to govern communication in the digital public sphere, and introduces and defends a democratic egalitarian theory of communicative justice.

[AI-47] Contextual Representation Anchor Network to Alleviate Selection Bias in Few-Shot Drug Discovery

链接: https://arxiv.org/abs/2410.20711
作者: Ruifeng Li,Wei Liu,Xiangxin Zhou,Mingqian Li,Yuhua Zhou,Yuan Yao,Qiang Zhang,Hongyang Chen
关键词-EN: drug discovery process, drug candidate screening, low success rate, molecular property prediction, few-shot learning problem
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named contextual representation anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge to transfer enriched contextual knowledge into molecular representations and enhance their expressiveness. CRA introduces a dual-augmentation mechanism that includes context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state-of-the-art by 2.60% and 3.28% in AUC and \Delta AUC-PR metrics, respectively, and exhibits superior generalization capabilities.

[AI-48] Embedding with Large Language Models for Classification of HIPAA Safeguard Compliance Rules

链接: https://arxiv.org/abs/2410.20664
作者: Md Abdur Rahman,Md Abdul Barek,ABM Kamrul Islam Riad,Md Mostafizur Rahman,Md Bajlur Rashid,Smita Ambedkar,Md Raihan Miaa,Fan Wu,Alfredo Cuzzocrea,Sheikh Iqbal Ahamed
关键词-EN: HIPAA rules categories, protecting patient data, Google Play Store, HIPAA rules patterns, HIPAA rules
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although software developers of mHealth apps are responsible for protecting patient data and adhering to strict privacy and security requirements, many of them lack awareness of HIPAA regulations and struggle to distinguish between HIPAA rules categories. Therefore, providing guidance of HIPAA rules patterns classification is essential for developing secured applications for Google Play Store. In this work, we identified the limitations of traditional Word2Vec embeddings in processing code patterns. To address this, we adopt multilingual BERT (Bidirectional Encoder Representations from Transformers) which offers contextualized embeddings to the attributes of dataset to overcome the issues. Therefore, we applied this BERT to our dataset for embedding code patterns and then uses these embedded code to various machine learning approaches. Our results demonstrate that the models significantly enhances classification performance, with Logistic Regression achieving a remarkable accuracy of 99.95%. Additionally, we obtained high accuracy from Support Vector Machine (99.79%), Random Forest (99.73%), and Naive Bayes (95.93%), outperforming existing approaches. This work underscores the effectiveness and showcases its potential for secure application development.

[AI-49] urboHopp: Accelerated Molecule Scaffold Hopping with Consistency Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20660
作者: Kiwoong Yoo,Owen Oertell,Junhyun Lee,Sanghoon Lee,Jaewoo Kang
关键词-EN: identify viable candidates, Navigating the vast, vast chemical space, viable candidates, formidable challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 22 pages, 11 figures, 8 tables. Presented at NeurIPS 2024

点击查看摘要

Abstract:Navigating the vast chemical space of druggable compounds is a formidable challenge in drug discovery, where generative models are increasingly employed to identify viable candidates. Conditional 3D structure-based drug design (3D-SBDD) models, which take into account complex three-dimensional interactions and molecular geometries, are particularly promising. Scaffold hopping is an efficient strategy that facilitates the identification of similar active compounds by strategically modifying the core structure of molecules, effectively narrowing the wide chemical space and enhancing the discovery of drug-like products. However, the practical application of 3D-SBDD generative models is hampered by their slow processing speeds. To address this bottleneck, we introduce TurboHopp, an accelerated pocket-conditioned 3D scaffold hopping model that merges the strategic effectiveness of traditional scaffold hopping with rapid generation capabilities of consistency models. This synergy not only enhances efficiency but also significantly boosts generation speeds, achieving up to 30 times faster inference speed as well as superior generation quality compared to existing diffusion-based models, establishing TurboHopp as a powerful tool in drug discovery. Supported by faster inference speed, we further optimize our model, using Reinforcement Learning for Consistency Models (RLCM), to output desirable molecules. We demonstrate the broad applicability of TurboHopp across multiple drug discovery scenarios, underscoring its potential in diverse molecular settings.

[AI-50] NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

链接: https://arxiv.org/abs/2410.20650
作者: Yongchang Hao,Yanshuai Cao,Lili Mou
关键词-EN: neural networks improves, networks improves, neural networks, performance, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.

[AI-51] Language Models And A Second Opinion Use Case: The Pocket Professional

链接: https://arxiv.org/abs/2410.20636
作者: David Noever
关键词-EN: Large Language Models, Large Language, seek peer consultation, experienced physicians seek, physicians seek peer
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making, particularly focusing on complex medical cases where even experienced physicians seek peer consultation. The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs’ performance against crowd-sourced physician responses. A key finding was the high overall score possible in the latest foundational models (80% accuracy compared to consensus opinion), which exceeds most human metrics reported on the same clinical cases (450 pages of patient profiles, test results). The study rates the LLMs’ performance disparity between straightforward cases (81% accuracy) and complex scenarios (43% accuracy), particularly in these cases generating substantial debate among human physicians. The research demonstrates that LLMs may be valuable as generators of comprehensive differential diagnoses rather than as primary diagnostic tools, potentially helping to counter cognitive biases in clinical decision-making, reduce cognitive loads, and thus remove some sources of medical error. The inclusion of a second comparative legal dataset (Supreme Court cases, N=21) provides added empirical context to the AI use to foster second opinions, though these legal challenges proved considerably easier for LLMs to analyze. In addition to the original contributions of empirical evidence for LLM accuracy, the research aggregated a novel benchmark for others to score highly contested question and answer reliability between both LLMs and disagreeing human practitioners. These results suggest that the optimal deployment of LLMs in professional settings may differ substantially from current approaches that emphasize automation of routine tasks.

[AI-52] Implementation and Application of an Intelligibility Protocol for Interaction with an LLM

链接: https://arxiv.org/abs/2410.20600
作者: Ashwin Srinivasan,Karan Bania,Shreyas V,Harshvardhan Mestha,Sidong Liu
关键词-EN: interactive systems involving, data analysis tasks, constructing interactive systems, machine learning engine, analysis tasks
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Our interest is in constructing interactive systems involving a human-expert interacting with a machine learning engine on data analysis tasks. This is of relevance when addressing complex problems arising in areas of science, the environment, medicine and so on, which are not immediately amenable to the usual methods of statistical or mathematical modelling. In such situations, it is possible that harnessing human expertise and creativity to modern machine-learning capabilities of identifying patterns by constructing new internal representations of the data may provide some insight to possible solutions. In this paper, we examine the implementation of an abstract protocol developed for interaction between agents, each capable of constructing predictions and explanations. The \PXP protocol, described in [12] is motivated by the notion of ‘‘two-way intelligibility’’ and is specified using a pair of communicating finite-state machines. While the formalisation allows the authors to prove several properties about the protocol, no implementation was presented. Here, we address this shortcoming for the case in which one of the agents acts as a ‘‘generator’’ using a large language model (LLM) and the other is an agent that acts as a ‘‘tester’’ using either a human-expert, or a proxy for a human-expert (for example, a database compiled using human-expertise). We believe these use-cases will be a widely applicable form of interaction for problems of the kind mentioned above. We present an algorithmic description of general-purpose implementation, and conduct preliminary experiments on its use in two different areas (radiology and drug-discovery). The experimental results provide early evidence in support of the protocol’s capability of capturing one- and two-way intelligibility in human-LLM in the manner proposed in [12].

[AI-53] Generator Matching: Generative modeling with arbitrary Markov processes

链接: https://arxiv.org/abs/2410.20587
作者: Peter Holderrieth,Marton Havasi,Jason Yim,Neta Shaul,Itai Gat,Tommi Jaakkola,Brian Karrer,Ricky T. Q. Chen,Yaron Lipman
关键词-EN: arbitrary Markov processes, generative modeling, introduce generator matching, modality-agnostic framework, arbitrary Markov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce generator matching, a modality-agnostic framework for generative modeling using arbitrary Markov processes. Generators characterize the infinitesimal evolution of a Markov process, which we leverage for generative modeling in a similar vein to flow matching: we construct conditional generators which generate single data points, then learn to approximate the marginal generator which generates the full data distribution. We show that generator matching unifies various generative modeling methods, including diffusion models, flow matching and discrete diffusion models. Furthermore, it provides the foundation to expand the design space to new and unexplored Markov processes such as jump processes. Finally, generator matching enables the construction of superpositions of Markov generative processes and enables the construction of multimodal models in a rigorous manner. We empirically validate our method on protein and image structure generation, showing that superposition with a jump process improves image generation.

[AI-54] oward Conditional Distribution Calibration in Survival Prediction NEURIPS2024

链接: https://arxiv.org/abs/2410.20579
作者: Shi-ang Qi,Yakun Yu,Russell Greiner
关键词-EN: distribution from censored, involves estimating, conditional calibration, censored datasets, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted to NeurIPS 2024. 41 pages, 23 figures

点击查看摘要

Abstract:Survival prediction often involves estimating the time-to-event distribution from censored datasets. Previous approaches have focused on enhancing discrimination and marginal calibration. In this paper, we highlight the significance of conditional calibration for real-world applications – especially its role in individual decision-making. We propose a method based on conformal prediction that uses the model’s predicted individual survival probability at that instance’s observed time. This method effectively improves the model’s marginal and conditional calibration, without compromising discrimination. We provide asymptotic theoretical guarantees for both marginal and conditional calibration and test it extensively across 15 diverse real-world datasets, demonstrating the method’s practical effectiveness and versatility in various settings.

[AI-55] Unsupervised Panoptic Interpretation of Latent Spaces in GANs Using Space-Filling Vector Quantization

链接: https://arxiv.org/abs/2410.20573
作者: Mohammad Hassan Vali,Tom Bäckström
关键词-EN: latent space, latent, mapped to real-world, space, SFVQ
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative adversarial networks (GANs) learn a latent space whose samples can be mapped to real-world images. Such latent spaces are difficult to interpret. Some earlier supervised methods aim to create an interpretable latent space or discover interpretable directions that require exploiting data labels or annotated synthesized samples for training. However, we propose using a modification of vector quantization called space-filling vector quantization (SFVQ), which quantizes the data on a piece-wise linear curve. SFVQ can capture the underlying morphological structure of the latent space and thus make it interpretable. We apply this technique to model the latent space of pretrained StyleGAN2 and BigGAN networks on various datasets. Our experiments show that the SFVQ curve yields a general interpretable model of the latent space that determines which part of the latent space corresponds to what specific generative factors. Furthermore, we demonstrate that each line of SFVQ’s curve can potentially refer to an interpretable direction for applying intelligible image transformations. We also showed that the points located on an SFVQ line can be used for controllable data augmentation.

[AI-56] SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance

链接: https://arxiv.org/abs/2410.20553
作者: Deepak Vungarala,Sakila Alam,Arnob Ghosh,Shaahin Angizi
关键词-EN: Large Language Models, Large Language, Language Models, shown great potential, generate accurate circuit-level
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have shown great potential in automating code generation; however, their ability to generate accurate circuit-level SPICE code remains limited due to a lack of hardware-specific knowledge. In this paper, we analyze and identify the typical limitations of existing LLMs in SPICE code generation. To address these limitations, we present SPICEPilot a novel Python-based dataset generated using PySpice, along with its accompanying framework. This marks a significant step forward in automating SPICE code generation across various circuit configurations. Our framework automates the creation of SPICE simulation scripts, introduces standardized benchmarking metrics to evaluate LLM’s ability for circuit generation, and outlines a roadmap for integrating LLMs into the hardware design process. SPICEPilot is open-sourced under the permissive MIT license at this https URL.

[AI-57] SympCam: Remote Optical Measurement of Sympathetic Arousal ALT

链接: https://arxiv.org/abs/2410.20552
作者: Björn Braun,Daniel McDuff,Tadas Baltrusaitis,Paul Streli,Max Moebus,Christian Holz
关键词-EN: sympathetic arousal, remote sympathetic arousal, basic signal processing, sympathetic arousal prediction, person sympathetic arousal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the IEEE-EMBS International Conference on Biomedical and Health Informatics

点击查看摘要

Abstract:Recent work has shown that a person’s sympathetic arousal can be estimated from facial videos alone using basic signal processing. This opens up new possibilities in the field of telehealth and stress management, providing a non-invasive method to measure stress only using a regular RGB camera. In this paper, we present SympCam, a new 3D convolutional architecture tailored to the task of remote sympathetic arousal prediction. Our model incorporates a temporal attention module (TAM) to enhance the temporal coherence of our sequential data processing capabilities. The predictions from our method improve accuracy metrics of sympathetic arousal in prior work by 48% to a mean correlation of 0.77. We additionally compare our method with common remote photoplethysmography (rPPG) networks and show that they alone cannot accurately predict sympathetic arousal “out-of-the-box”. Furthermore, we show that the sympathetic arousal predicted by our method allows detecting physical stress with a balanced accuracy of 90% - an improvement of 61% compared to the rPPG method commonly used in related work, demonstrating the limitations of using rPPG alone. Finally, we contribute a dataset designed explicitly for the task of remote sympathetic arousal prediction. Our dataset contains synchronized face and hand videos of 20 participants from two cameras synchronized with electrodermal activity (EDA) and photoplethysmography (PPG) measurements. We will make this dataset available to the community and use it to evaluate the methods in this paper. To the best of our knowledge, this is the first dataset available to other researchers designed for remote sympathetic arousal prediction.

[AI-58] Deep Reinforcement Learning Agents for Strategic Production Policies in Microeconomic Market Simulations

链接: https://arxiv.org/abs/2410.20550
作者: Eduardo C. Garrido-Merchán,Maria Coronado-Vaca,Álvaro López-López,Carlos Martinez de Ibarreta
关键词-EN: traditional models assumptions, making traditional models, traditional models, limiting their ability, real-world scenarios
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Traditional economic models often rely on fixed assumptions about market dynamics, limiting their ability to capture the complexities and stochastic nature of real-world scenarios. However, reality is more complex and includes noise, making traditional models assumptions not met in the market. In this paper, we explore the application of deep reinforcement learning (DRL) to obtain optimal production strategies in microeconomic market environments to overcome the limitations of traditional models. Concretely, we propose a DRL-based approach to obtain an effective policy in competitive markets with multiple producers, each optimizing their production decisions in response to fluctuating demand, supply, prices, subsidies, fixed costs, total production curve, elasticities and other effects contaminated by noise. Our framework enables agents to learn adaptive production policies to several simulations that consistently outperform static and random strategies. As the deep neural networks used by the agents are universal approximators of functions, DRL algorithms can represent in the network complex patterns of data learnt by trial and error that explain the market. Through extensive simulations, we demonstrate how DRL can capture the intricate interplay between production costs, market prices, and competitor behavior, providing insights into optimal decision-making in dynamic economic settings. The results show that agents trained with DRL can strategically adjust production levels to maximize long-term profitability, even in the face of volatile market conditions. We believe that the study bridges the gap between theoretical economic modeling and practical market simulation, illustrating the potential of DRL to revolutionize decision-making in market strategies.

[AI-59] Malinowski in the Age of AI: Can large language models create a text game based on an anthropological classic?

链接: https://arxiv.org/abs/2410.20536
作者: Michael Peter Hoffmann,Jan Fillies,Adrian Paschke
关键词-EN: Large Language Models, Large Language, shown remarkable abilities, Recent advancements, advancements in Large
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted at KUI 2024

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) like ChatGPT and GPT-4 have shown remarkable abilities in a wide range of tasks such as summarizing texts and assisting in coding. Scientific research has demonstrated that these models can also play text-adventure games. This study aims to explore whether LLMs can autonomously create text-based games based on anthropological classics, evaluating also their effectiveness in communicating knowledge. To achieve this, the study engaged anthropologists in discussions to gather their expectations and design inputs for an anthropologically themed game. Through iterative processes following the established HCI principle of ‘design thinking’, the prompts and the conceptual framework for crafting these games were refined. Leveraging GPT3.5, the study created three prototypes of games centered around the seminal anthropological work of the social anthropologist’s Bronislaw Malinowski’s “Argonauts of the Western Pacific” (1922). Subsequently, evaluations were conducted by inviting senior anthropologists to playtest these games, and based on their inputs, the game designs were refined. The tests revealed promising outcomes but also highlighted key challenges: the models encountered difficulties in providing in-depth thematic understandings, showed suspectibility to misinformation, tended towards monotonic responses after an extended period of play, and struggled to offer detailed biographical information. Despite these limitations, the study’s findings open up new research avenues at the crossroads of artificial intelligence, machine learning, LLMs, ethnography, anthropology and human-computer interaction.

[AI-60] Asynchronous Perception Machine For Efficient Test-Time-Training NEURIPS2024

链接: https://arxiv.org/abs/2410.20535
作者: Rajat Modi,Yogesh Singh Rawat
关键词-EN: Asynchronous Perception Machine, propose Asynchronous Perception, propose Asynchronous, APM, Perception Machine
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024 Main Track. APM is a step to getting Geoffrey Hinton’s GLOM working. Original GLOM paper said: “This paper was quickly hijacked by the need to justify the design decisions”. 3 years have passed us. This work provides some justifications and been peer-reviewed and accepted by our peers. A humble blogpost can be found at this https URL

点击查看摘要

Abstract:In this work, we propose Asynchronous Perception Machine (APM), a computationally-efficient architecture for test-time-training (TTT). APM can process patches of an image one at a time in any order \textitasymmetrically, and \textitstill encode semantic-awareness in the net. We demonstrate APM’s ability to recognize out-of-distribution images \textitwithout dataset-specific pre-training, augmentation or any-pretext task. APM offers competitive performance over existing TTT approaches. To perform TTT, APM just distills test sample’s representation \textitonce. APM possesses a unique property: it can learn using just this single representation and starts predicting semantically-aware features. APM demostrates potential applications beyond test-time-training: APM can scale up to a dataset of 2D images and yield semantic-clusterings in a single forward pass. APM also provides first empirical evidence towards validating GLOM’s insight, i.e. input percept is a field. Therefore, APM helps us converge towards an implementation which can do \textitboth interpolation and perception on a \textitshared-connectionist hardware. Our code is publicly available at this link: this https URL. Comments: Accepted to NeurIPS 2024 Main Track. APM is a step to getting Geoffrey Hinton’s GLOM working. Original GLOM paper said: “This paper was quickly hijacked by the need to justify the design decisions”. 3 years have passed us. This work provides some justifications and been peer-reviewed and accepted by our peers. A humble blogpost can be found at this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.20535 [cs.CV] (or arXiv:2410.20535v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.20535 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-61] CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

链接: https://arxiv.org/abs/2410.20527
作者: Ali TehraniJamsaz,Arijit Bhattacharjee,Le Chen,Nesreen K. Ahmed,Amir Yazdanbakhsh,Ali Jannesari
关键词-EN: Large Language Models, Recent advancements, advancements in Large, automatic programming language, Large Language
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.

[AI-62] Props for Machine-Learning Security

链接: https://arxiv.org/abs/2410.20522
作者: Ari Juels,Farinaz Koushanfar
关键词-EN: propose protected pipelines, approach for authenticated, machine learning, propose protected, protected pipelines
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose protected pipelines or props for short, a new approach for authenticated, privacy-preserving access to deep-web data for machine learning (ML). By permitting secure use of vast sources of deep-web data, props address the systemic bottleneck of limited high-quality training data in ML development. Props also enable privacy-preserving and trustworthy forms of inference, allowing for safe use of sensitive data in ML applications. Props are practically realizable today by leveraging privacy-preserving oracle systems initially developed for blockchain applications.

[AI-63] MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music

链接: https://arxiv.org/abs/2410.20518
作者: Michał Wiszenko,Kacper Stefański,Piotr Malesa,Łukasz Pokorzyński,Mateusz Modrzejewski
关键词-EN: Symbolic music research, music-related machine learning, music research plays, Symbolic music, machine learning
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: in Extended Abstracts for the Late-Breaking Demo Sessionof the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024

点击查看摘要

Abstract:Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.

[AI-64] Symbotunes: unified hub for symbolic music generative models

链接: https://arxiv.org/abs/2410.20515
作者: Paweł Skierś,Maksymilian Łazarski,Michał Kopeć,Mateusz Modrzejewski
关键词-EN: project structure, symbolic music generative, music generative models, differ significantly, significantly in terms
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.

[AI-65] Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2410.20487
作者: Kaiyan Zhao,Yiming Wang,Yuyang Chen,Xiaoguang Niu,Yan Li,Leong Hou U
关键词-EN: Deep Reinforcement Learning, solving complex decision-making, complex decision-making problems, Deep Reinforcement, achieved remarkable success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) has achieved remarkable success in solving complex decision-making problems by combining the representation capabilities of deep learning with the decision-making power of reinforcement learning. However, learning in sparse reward environments remains challenging due to insufficient feedback to guide the optimization of agents, especially in real-life environments with high-dimensional states. To tackle this issue, experience replay is commonly introduced to enhance learning efficiency through past experiences. Nonetheless, current methods of experience replay, whether based on uniform or prioritized sampling, frequently struggle with suboptimal learning efficiency and insufficient utilization of samples. This paper proposes a novel approach, diversity-based experience replay (DBER), which leverages the deterministic point process to prioritize diverse samples in state realizations. We conducted extensive experiments on Robotic Manipulation tasks in MuJoCo, Atari games, and realistic in-door environments in Habitat. The results show that our method not only significantly improves learning efficiency but also demonstrates superior performance in sparse reward environments with high-dimensional states, providing a simple yet effective solution for this field.

[AI-66] Improving Decision Sparsity NEURIPS2024

链接: https://arxiv.org/abs/2410.20483
作者: Yiyang Sun,Tong Wang,Cynthia Rudin
关键词-EN: central aspect, aspect of interpretability, Sparsity, SEV, decision sparsity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Sparsity is a central aspect of interpretability in machine learning. Typically, sparsity is measured in terms of the size of a model globally, such as the number of variables it uses. However, this notion of sparsity is not particularly relevant for decision-making; someone subjected to a decision does not care about variables that do not contribute to the decision. In this work, we dramatically expand a notion of decision sparsity called the Sparse Explanation Value(SEV) so that its explanations are more meaningful. SEV considers movement along a hypercube towards a reference point. By allowing flexibility in that reference and by considering how distances along the hypercube translate to distances in feature space, we can derive sparser and more meaningful explanations for various types of function classes. We present cluster-based SEV and its variant tree-based SEV, introduce a method that improves credibility of explanations, and propose algorithms that optimize decision sparsity in machine learning models.

[AI-67] MusicFlow: Cascaded Flow Matching for Text Guided Music Generation ICML2024

链接: https://arxiv.org/abs/2410.20478
作者: K R Prajwal,Bowen Shi,Matthew Lee,Apoorv Vyas,Andros Tjandra,Mahi Luthra,Baishan Guo,Huiyu Wang,Triantafyllos Afouras,David Kant,Wei-Ning Hsu
关键词-EN: flow matching, flow matching networks, generation model based, model, introduce MusicFlow
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: ICML 2024

点击查看摘要

Abstract:We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over 2\sim5 times smaller and requiring 5 times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.

[AI-68] EAFormers: TEnsor-Augmented Transformers for Multi-Dimensional Time Series Forecasting

链接: https://arxiv.org/abs/2410.20439
作者: Linghang Kong,Elynn Chen,Yuzhou Chen,Yuefeng Han
关键词-EN: tensor-variate time series, climate science, time series data, matrix and tensor-variate, increasingly prevalent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multi-dimensional time series data, such as matrix and tensor-variate time series, are increasingly prevalent in fields such as economics, finance, and climate science. Traditional Transformer models, though adept with sequential data, do not effectively preserve these multi-dimensional structures, as their internal operations in effect flatten multi-dimensional observations into vectors, thereby losing critical multi-dimensional relationships and patterns. To address this, we introduce the Tensor-Augmented Transformer (TEAFormer), a novel method that incorporates tensor expansion and compression within the Transformer framework to maintain and leverage the inherent multi-dimensional structures, thus reducing computational costs and improving prediction accuracy. The core feature of the TEAFormer, the Tensor-Augmentation (TEA) module, utilizes tensor expansion to enhance multi-view feature learning and tensor compression for efficient information aggregation and reduced computational load. The TEA module is not just a specific model architecture but a versatile component that is highly compatible with the attention mechanism and the encoder-decoder structure of Transformers, making it adaptable to existing Transformer architectures. Our comprehensive experiments, which integrate the TEA module into three popular time series Transformer models across three real-world benchmarks, show significant performance enhancements, highlighting the potential of TEAFormers for cutting-edge time series forecasting.

[AI-69] NT-VOT211: A Large-Scale Benchmark for Night-time Visual Object Tracking ACCV

链接: https://arxiv.org/abs/2410.20421
作者: Yu Liu,Arif Mahmood,Muhammad Haris Khan
关键词-EN: visual object tracking, current visual object, predominantly contain day-time, visual object, tracking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Oral Acceptance at the Asian Conference on Computer Vision (ACCV) 2024, Hanoi, Vietnam

点击查看摘要

Abstract:Many current visual object tracking benchmarks such as OTB100, NfS, UAV123, LaSOT, and GOT-10K, predominantly contain day-time scenarios while the challenges posed by the night-time has been less investigated. It is primarily because of the lack of a large-scale, well-annotated night-time benchmark for rigorously evaluating tracking algorithms. To this end, this paper presents NT-VOT211, a new benchmark tailored for evaluating visual object tracking algorithms in the challenging night-time conditions. NT-VOT211 consists of 211 diverse videos, offering 211,000 well-annotated frames with 8 attributes including camera motion, deformation, fast motion, motion blur, tiny target, distractors, occlusion and out-of-view. To the best of our knowledge, it is the largest night-time tracking benchmark to-date that is specifically designed to address unique challenges such as adverse visibility, image blur, and distractors inherent to night-time tracking scenarios. Through a comprehensive analysis of results obtained from 42 diverse tracking algorithms on NT-VOT211, we uncover the strengths and limitations of these algorithms, highlighting opportunities for enhancements in visual object tracking, particularly in environments with suboptimal lighting. Besides, a leaderboard for revealing performance rankings, annotation tools, comprehensive meta-information and all the necessary code for reproducibility of results is made publicly available. We believe that our NT-VOT211 benchmark will not only be instrumental in facilitating field deployment of VOT algorithms, but will also help VOT enhancements and it will unlock new real-world tracking applications. Our dataset and other assets can be found at: this https URL.

[AI-70] Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models

链接: https://arxiv.org/abs/2410.20418
作者: Zhengmian Hu,Heng Huang
关键词-EN: Large language models, output distribution, sampling, sampling efficiency, generating content
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models are probabilistic models, and the process of generating content is essentially sampling from the output distribution of the language model. Existing watermarking techniques inject watermarks into the generated content without altering the output quality. On the other hand, existing acceleration techniques, specifically speculative sampling, leverage a draft model to speed up the sampling process while preserving the output distribution. However, there is no known method to simultaneously accelerate the sampling process and inject watermarks into the generated content. In this paper, we investigate this direction and find that the integration of watermarking and acceleration is non-trivial. We prove a no-go theorem, which states that it is impossible to simultaneously maintain the highest watermark strength and the highest sampling efficiency. Furthermore, we propose two methods that maintain either the sampling efficiency or the watermark strength, but not both. Our work provides a rigorous theoretical foundation for understanding the inherent trade-off between watermark strength and sampling efficiency in accelerating the generation of watermarked tokens for large language models. We also conduct numerical experiments to validate our theoretical findings and demonstrate the effectiveness of the proposed methods.

[AI-71] hunderKittens: Simple Fast and Adorable AI Kernels

链接: https://arxiv.org/abs/2410.20399
作者: Benjamin F. Spector,Simran Arora,Aaryan Singhal,Daniel Y. Fu,Christopher Ré
关键词-EN: challenge of mapping, mapping AI architectures, creating a critical, critical bottleneck, theoretical performance thresholds
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp-level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid-level, we provide support to help hide the block launch and tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by 10-40% on attention backwards, 8\times on state space models, and 14\times on linear attention.

[AI-72] Lodge: High-quality and Long Dance Generation with Vivid Choreography Patterns

链接: https://arxiv.org/abs/2410.20389
作者: Ronghui Li,Hongwen Zhang,Yachao Zhang,Yuxiang Zhang,Youliang Zhang,Jie Guo,Yan Zhang,Xiu Li,Yebin Liu
关键词-EN: global choreography patterns, global choreography, choreography patterns, vivid global choreography, music and desired
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:We propose Lodge++, a choreography framework to generate high-quality, ultra-long, and vivid dances given the music and desired genre. To handle the challenges in computational efficiency, the learning of complex and vivid global choreography patterns, and the physical quality of local dance movements, Lodge++ adopts a two-stage strategy to produce dances from coarse to fine. In the first stage, a global choreography network is designed to generate coarse-grained dance primitives that capture complex global choreography patterns. In the second stage, guided by these dance primitives, a primitive-based dance diffusion model is proposed to further generate high-quality, long-sequence dances in parallel, faithfully adhering to the complex choreography patterns. Additionally, to improve the physical plausibility, Lodge++ employs a penetration guidance module to resolve character self-penetration, a foot refinement module to optimize foot-ground contact, and a multi-genre discriminator to maintain genre consistency throughout the dance. Lodge++ is validated by extensive experiments, which show that our method can rapidly generate ultra-long dances suitable for various dance genres, ensuring well-organized global choreography patterns and high-quality local motion.

[AI-73] Addressing the Pitfalls of Image-Based Structural Health Monitoring: A Focus on False Positives False Negatives and Base Rate Bias

链接: https://arxiv.org/abs/2410.20384
作者: Vagelis Plevris
关键词-EN: image-based structural health, structural health monitoring, detecting structural damage, image-based SHM, image-based SHM offers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study explores the limitations of image-based structural health monitoring (SHM) techniques in detecting structural damage. Leveraging machine learning and computer vision, image-based SHM offers a scalable and efficient alternative to manual inspections. However, its reliability is impacted by challenges such as false positives, false negatives, and environmental variability, particularly in low base rate damage scenarios. The Base Rate Bias plays a significant role, as low probabilities of actual damage often lead to misinterpretation of positive results. This study uses both Bayesian analysis and a frequentist approach to evaluate the precision of damage detection systems, revealing that even highly accurate models can yield misleading results when the occurrence of damage is rare. Strategies for mitigating these limitations are discussed, including hybrid systems that combine multiple data sources, human-in-the-loop approaches for critical assessments, and improving the quality of training data. These findings provide essential insights into the practical applicability of image-based SHM techniques, highlighting both their potential and their limitations for real-world infrastructure monitoring.

[AI-74] Multiple kernel concept factorization algorithm based on global fusion

链接: https://arxiv.org/abs/2410.20383
作者: Fei Li,Liang Du,Chaohong Ren
关键词-EN: Non-negative Matrix Factorization, extends matrix factorization, Matrix Factorization, Concept Factorization, improving learning ability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: in Chinese language

点击查看摘要

Abstract:Non-negative Matrix Factorization(NMF) algorithm can only be used to find low rank approximation of original non-negative data while Concept Factorization(CF) algorithm extends matrix factorization to single non-linear kernel space, improving learning ability and adaptability of matrix factorization. In unsupervised environment, to design or select proper kernel function for specific dataset, a new algorithm called Globalized Multiple Kernel CF(GMKCF)was proposed. Multiple candidate kernel functions were input in the same time and learned in the CF framework based on global linear fusion, obtaining a clustering result with high quality and stability and solving the problem of kernel function selection that the CF faced. The convergence of the proposed algorithm was verified by solving the model with alternate iteration. The experimental results on several real databases show that the proposed algorithm outperforms comparison algorithms in data clustering, such as Kernel K-Means(KKM), Spectral Clustering(SC), Kernel CF(KCF), Co-regularized multi-view spectral clustering(Coreg), and Robust Multiple KKM(RMKKM).

[AI-75] FuseFL: One-Shot Federated Learning through the Lens of Causality with Progressive Model Fusion

链接: https://arxiv.org/abs/2410.20380
作者: Zhenheng Tang,Yonggang Zhang,Peijie Dong,Yiu-ming Cheung,Amelie Chi Zhou,Bo Han,Xiaowen Chu
关键词-EN: One-shot Federated Learning, One-shot Federated, significantly reduces communication, Federated Learning, OFL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:One-shot Federated Learning (OFL) significantly reduces communication costs in FL by aggregating trained models only once. However, the performance of advanced OFL methods is far behind the normal FL. In this work, we provide a causal view to find that this performance drop of OFL methods comes from the isolation problem, which means that local isolatedly trained models in OFL may easily fit to spurious correlations due to the data heterogeneity. From the causal perspective, we observe that the spurious fitting can be alleviated by augmenting intermediate features from other clients. Built upon our observation, we propose a novel learning approach to endow OFL with superb performance and low communication and storage costs, termed as FuseFL. Specifically, FuseFL decomposes neural networks into several blocks, and progressively trains and fuses each block following a bottom-up manner for feature augmentation, introducing no additional communication costs. Comprehensive experiments demonstrate that FuseFL outperforms existing OFL and ensemble FL by a significant margin. We conduct comprehensive experiments to show that FuseFL supports high scalability of clients, heterogeneous model training, and low memory costs. Our work is the first attempt using causality to analyze and alleviate data heterogeneity of OFL.

[AI-76] Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios WACV2025

链接: https://arxiv.org/abs/2410.20359
作者: Yongkang Cheng,Mingjiang Liang,Shaoli Huang,Gaoge Han,Jifeng Ning,Wei Liu
关键词-EN: Audio-driven simultaneous gesture, Audio-driven simultaneous, simultaneous gesture generation, human-computer communication, film production
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
*备注: Accepted by WACV 2025 (Round 1)

点击查看摘要

Abstract:Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieves breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.

[AI-77] RopeTP: Global Human Motion Recovery via Integrating Robust Pose Estimation with Diffusion Trajectory Prior WACV2025

链接: https://arxiv.org/abs/2410.20358
作者: Mingjiang Liang,Yongkang Cheng,Hualin Liang,Shaoli Huang,Wei Liu
关键词-EN: combines Robust pose, diffusion Trajectory Prior, Robust pose estimation, combines Robust, Prior to reconstruct
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by WACV 2025 (Round 1)

点击查看摘要

Abstract:We present RopeTP, a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos. At the heart of RopeTP is a hierarchical attention mechanism that significantly improves context awareness, which is essential for accurately inferring the posture of occluded body parts. This is achieved by exploiting the relationships with visible anatomical structures, enhancing the accuracy of local pose estimations. The improved robustness of these local estimations allows for the reconstruction of precise and stable global trajectories. Additionally, RopeTP incorporates a diffusion trajectory model that predicts realistic human motion from local pose sequences. This model ensures that the generated trajectories are not only consistent with observed local actions but also unfold naturally over time, thereby improving the realism and stability of 3D human motion reconstruction. Extensive experimental validation shows that RopeTP surpasses current methods on two benchmark datasets, particularly excelling in scenarios with occlusions. It also outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.

[AI-78] Dynamics as Prompts: In-Context Learning for Sim-to-Real System Identifications

链接: https://arxiv.org/abs/2410.20357
作者: Xilun Zhang,Shiqi Liu,Peide Huang,William Jongwon Han,Yiqi Lyu,Mengdi Xu,Ding Zhao
关键词-EN: remains a significant, significant challenge, challenge in robotics, robotics due, Domain Randomization
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: website: this https URL

点击查看摘要

Abstract:Sim-to-real transfer remains a significant challenge in robotics due to the discrepancies between simulated and real-world dynamics. Traditional methods like Domain Randomization often fail to capture fine-grained dynamics, limiting their effectiveness for precise control tasks. In this work, we propose a novel approach that dynamically adjusts simulation environment parameters online using in-context learning. By leveraging past interaction histories as context, our method adapts the simulation environment dynamics to real-world dynamics without requiring gradient updates, resulting in faster and more accurate alignment between simulated and real-world performance. We validate our approach across two tasks: object scooping and table air hockey. In the sim-to-sim evaluations, our method significantly outperforms the baselines on environment parameter estimation by 80% and 42% in the object scooping and table air hockey setups, respectively. Furthermore, our method achieves at least 70% success rate in sim-to-real transfer on object scooping across three different objects. By incorporating historical interaction data, our approach delivers efficient and smooth system identification, advancing the deployment of robots in dynamic real-world scenarios. Demos are available on our project page: this https URL

[AI-79] Uncovering Capabilities of Model Pruning in Graph Contrastive Learning

链接: https://arxiv.org/abs/2410.20356
作者: Wu Junran,Chen Xueyuan,Li Shangzhe
关键词-EN: achieved great success, Graph contrastive learning, pre-training graph neural, graph neural networks, contrastive learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: MM’ 24

点击查看摘要

Abstract:Graph contrastive learning has achieved great success in pre-training graph neural networks without ground-truth labels. Leading graph contrastive learning follows the classical scheme of contrastive learning, forcing model to identify the essential information from augmented views. However, general augmented views are produced via random corruption or learning, which inevitably leads to semantics alteration. Although domain knowledge guided augmentations alleviate this issue, the generated views are domain specific and undermine the generalization. In this work, motivated by the firm representation ability of sparse model from pruning, we reformulate the problem of graph contrastive learning via contrasting different model versions rather than augmented views. We first theoretically reveal the superiority of model pruning in contrast to data augmentations. In practice, we take original graph as input and dynamically generate a perturbed graph encoder to contrast with the original encoder by pruning its transformation weights. Furthermore, considering the integrity of node embedding in our method, we are capable of developing a local contrastive loss to tackle the hard negative samples that disturb the model training. We extensively validate our method on various benchmarks regarding graph classification via unsupervised and transfer learning. Compared to the state-of-the-art (SOTA) works, better performance can always be obtained by the proposed method.

[AI-80] An approach to hummed-tune and song sequences matching

链接: https://arxiv.org/abs/2410.20352
作者: Loc Bao Pham,Huong Hoang Luong,Phu Thien Tran,Phuc Hoang Ngo,Vi Hoang Nguyen,Thinh Nguyen
关键词-EN: Melody stuck, Melody, song, humming sound, Zalo AI Challenge
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Melody stuck in your head, also known as “earworm”, is tough to get rid of, unless you listen to it again or sing it out loud. But what if you can not find the name of that song? It must be an intolerable feeling. Recognizing a song name base on humming sound is not an easy task for a human being and should be done by machines. However, there is no research paper published about hum tune recognition. Adapting from Hum2Song Zalo AI Challenge 2021 - a competition about querying the name of a song by user’s giving humming tune, which is similar to Google’s Hum to Search. This paper covers details about the pre-processed data from the original type (mp3) to usable form for training and inference. In training an embedding model for the feature extraction phase, we ran experiments with some states of the art, such as ResNet, VGG, AlexNet, MobileNetV2. And for the inference phase, we use the Faiss module to effectively search for a song that matched the sequence of humming sound. The result comes at nearly 94% in MRR@10 metric on the public test set, along with the top 1 result on the public leaderboard.

[AI-81] Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition ECCV2024

链接: https://arxiv.org/abs/2410.20349
作者: Lilang Lin,Lehong Wu,Jiahang Zhang,Jiaying Liu
关键词-EN: Generative models, technique for generation, powerful technique, Generative, idempotent generative model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024

点击查看摘要

Abstract:Generative models, as a powerful technique for generation, also gradually become a critical tool for recognition tasks. However, in skeleton-based action recognition, the features obtained from existing pre-trained generative methods contain redundant information unrelated to recognition, which contradicts the nature of the skeleton’s spatially sparse and temporally consistent properties, leading to undesirable performance. To address this challenge, we make efforts to bridge the gap in theory and methodology and propose a novel skeleton-based idempotent generative model (IGM) for unsupervised representation learning. More specifically, we first theoretically demonstrate the equivalence between generative models and maximum entropy coding, which demonstrates a potential route that makes the features of generative models more compact by introducing contrastive learning. To this end, we introduce the idempotency constraint to form a stronger consistency regularization in the feature space, to push the features only to maintain the critical information of motion semantics for the recognition task. Our extensive experiments on benchmark datasets, NTU RGB+D and PKUMMD, demonstrate the effectiveness of our proposed method. On the NTU 60 xsub dataset, we observe a performance improvement from 84.6 % to 86.2 % . Furthermore, in zero-shot adaptation scenarios, our model demonstrates significant efficacy by achieving promising results in cases that were previously unrecognizable. Our project is available at \urlthis https URL.

[AI-82] R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest NAACL2025

链接: https://arxiv.org/abs/2410.20327
作者: Xupeng Chen,Zhixin Lai,Kangrui Ruan,Shichu Chen,Jiaxiang Liu,Zuozhu Liu
关键词-EN: made significant strides, interpret images holistically, Artificial intelligence, doctor prior knowledge, visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures, submitted to NAACL 2025

点击查看摘要

Abstract:Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking the visual regions of interest that may contain crucial information, potentially aligning with a doctor’s prior knowledge that can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, aiming to enrich the model’s understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA’s superiority over existing state-of-the-art (SoTA) methods. Additionally, to verify the model’s capability in visual comprehension, a novel multiple-choice medical visual understanding dataset is introduced, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.

[AI-83] Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs

链接: https://arxiv.org/abs/2410.20321
作者: Xingrui Zhuo,Jiapu Wang,Gongqing Wu,Shirui Pan,Xindong Wu
关键词-EN: Knowledge Graph Query, Graph Query Embedding, embed First-Order Logic, Knowledge Graph, First-Order Logic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, leading to the learned defective query patterns limiting KGQE models’ performance. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw materials for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP’s effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.

[AI-84] Few-shot Open Relation Extraction with Gaussian Prototype and Adaptive Margin

链接: https://arxiv.org/abs/2410.20320
作者: Tianlin Guo,Lingling Zhang,Jiaxin Wang,Yuokuo Lei,Yifei Li,Haofen Wang,Jun Liu
关键词-EN: Few-shot relation extraction, relation extraction task, FsRE with NOTA, unknown classes, relation extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 30 pages, 4 figures

点击查看摘要

Abstract:Few-shot relation extraction with none-of-the-above (FsRE with NOTA) aims at predicting labels in few-shot scenarios with unknown classes. FsRE with NOTA is more challenging than the conventional few-shot relation extraction task, since the boundaries of unknown classes are complex and difficult to learn. Meta-learning based methods, especially prototype-based methods, are the mainstream solutions to this task. They obtain the classification boundary by learning the sample distribution of each class. However, their performance is limited because few-shot overfitting and NOTA boundary confusion lead to misclassification between known and unknown classes. To this end, we propose a novel framework based on Gaussian prototype and adaptive margin named GPAM for FsRE with NOTA, which includes three modules, semi-factual representation, GMM-prototype metric learning and decision boundary learning. The first two modules obtain better representations to solve the few-shot problem through debiased information enhancement and Gaussian space distance measurement. The third module learns more accurate classification boundaries and prototypes through adaptive margin and negative sampling. In the training procedure of GPAM, we use contrastive learning loss to comprehensively consider the effects of range and margin on the classification of known and unknown classes to ensure the model’s stability and robustness. Sufficient experiments and ablations on the FewRel dataset show that GPAM surpasses previous prototype methods and achieves state-of-the-art performance.

[AI-85] ANOMIX: A Simple yet Effective Hard Negative Generation via Mixing for Graph Anomaly Detection

链接: https://arxiv.org/abs/2410.20310
作者: Hwan Kim,Junghoon Kim,Sungsu Lim
关键词-EN: Graph contrastive learning, number of samples, contrastive learning, generally requires, requires a large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph contrastive learning (GCL) generally requires a large number of samples. The one of the effective ways to reduce the number of samples is using hard negatives (e.g., Mixup). Designing mixing-based approach for GAD can be difficult due to imbalanced data or limited number of anomalies. We propose ANOMIX, a framework that consists of a novel graph mixing approach, ANOMIX-M, and multi-level contrasts for GAD. ANOMIX-M can effectively mix abnormality and normality from input graph to generate hard negatives, which are important for efficient GCL. ANOMIX is (a) A first mixing approach: firstly attempting graph mixing to generate hard negatives for GAD task and node- and subgraph-level contrasts to distinguish underlying anomalies. (b) Accurate: winning the highest AUC, up to 5.49% higher and 1.76% faster. © Effective: reducing the number of samples nearly 80% in GCL. Code is available at this https URL.

[AI-86] AI-Driven Cyber Threat Intelligence Automation

链接: https://arxiv.org/abs/2410.20287
作者: Shrit Shah,Fatemeh Khoda Parast
关键词-EN: Cyber Threat Intelligence, leveraging Microsoft AI-powered, automating Cyber Threat, Microsoft AI-powered security, Threat Intelligence
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 11 pages

点击查看摘要

Abstract:This study introduces an innovative approach to automating Cyber Threat Intelligence (CTI) processes in industrial environments by leveraging Microsoft’s AI-powered security technologies. Historically, CTI has heavily relied on manual methods for collecting, analyzing, and interpreting data from various sources such as threat feeds. This study introduces an innovative approach to automating CTI processes in industrial environments by leveraging Microsoft’s AI-powered security technologies. Historically, CTI has heavily relied on manual methods for collecting, analyzing, and interpreting data from various sources such as threat feeds, security logs, and dark web forums – a process prone to inefficiencies, especially when rapid information dissemination is critical. By employing the capabilities of GPT-4o and advanced one-shot fine-tuning techniques for large language models, our research delivers a novel CTI automation solution. The outcome of the proposed architecture is a reduction in manual effort while maintaining precision in generating final CTI reports. This research highlights the transformative potential of AI-driven technologies to enhance both the speed and accuracy of CTI and reduce expert demands, offering a vital advantage in today’s dynamic threat landscape.

[AI-87] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

链接: https://arxiv.org/abs/2410.20285
作者: Antonis Antoniades,Albert Örwall,Kexun Zhang,Yuxi Xie,Anirudh Goyal,William Wang
关键词-EN: Software engineers operating, Monte Carlo Tree, evolving requirements, engineers operating, reconsider their approaches
类目: Artificial Intelligence (cs.AI)
*备注: Main body: 10 pages, 5 figures. Appendix: 5 pages, 4 figures. Open-source codebase

点击查看摘要

Abstract:Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often rely on rigid processes and tend to repeat ineffective actions without the capacity to evaluate their performance or adapt their strategies over time. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents’ performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased search depth and identifies key factors that facilitate effective self-evaluation in software agents. This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.

[AI-88] MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

链接: https://arxiv.org/abs/2410.20280
作者: Haozhe Liu,Shikun Liu,Zijian Zhou,Mengmeng Xu,Yanping Xie,Xiao Han,Juan C. Pérez,Ding Liu,Kumara Kahatapitiya,Menglin Jia,Jui-Chieh Wu,Sen He,Tao Xiang,Jürgen Schmidhuber,Juan-Manuel Pérez-Rúa
关键词-EN: unified diffusion model, integrate the advantages, masked auto-regression, MAR, unified diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini’s MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.

[AI-89] EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering

链接: https://arxiv.org/abs/2410.20263
作者: Kai Cheng,Zhengyuan Li,Xingpeng Sun,Byung-Cheol Min,Amrit Singh Bedi,Aniket Bera
关键词-EN: robotic home assistants, Embodied Question Answering, home assistants, essential yet challenging, challenging task
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering without embodied exploration or rely on closed-form choice sets. In real-world scenarios, a robotic agent must efficiently explore and accurately answer questions in open-vocabulary settings. To address these challenges, we propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering. In EfficientEQA, the robot actively explores unknown environments using Semantic-Value-Weighted Frontier Exploration, a strategy that prioritizes exploration based on semantic importance provided by calibrated confidence from black-box VLMs to quickly gather relevant information. To generate accurate answers, we employ Retrieval-Augmented Generation (RAG), which utilizes BLIP to retrieve useful images from accumulated observations and VLM reasoning to produce responses without relying on predefined answer choices. Additionally, we detect observations that are highly relevant to the question as outliers, allowing the robot to determine when it has sufficient information to stop exploring and provide an answer. Experimental results demonstrate the effectiveness of our approach, showing an improvement in answering accuracy by over 15% and efficiency, measured in running steps, by over 20% compared to state-of-the-art methods.

[AI-90] Equivariant Blurring Diffusion for Hierarchical Molecular Conformer Generation NEURIPS2024

链接: https://arxiv.org/abs/2410.20255
作者: Jiwoong Park,Yang Shen
关键词-EN: fine atomic details, multiscale view, Equivariant Blurring Diffusion, multiscale manner, diffusion models process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注: NeurIPS 2024

点击查看摘要

Abstract:How can diffusion models process 3D geometries in a coarse-to-fine manner, akin to our multiscale view of the world? In this paper, we address the question by focusing on a fundamental biochemical problem of generating 3D molecular conformers conditioned on molecular graphs in a multiscale manner. Our approach consists of two hierarchical stages: i) generation of coarse-grained fragment-level 3D structure from the molecular graph, and ii) generation of fine atomic details from the coarse-grained approximated structure while allowing the latter to be adjusted simultaneously. For the challenging second stage, which demands preserving coarse-grained information while ensuring SE(3) equivariance, we introduce a novel generative model termed Equivariant Blurring Diffusion (EBD), which defines a forward process that moves towards the fragment-level coarse-grained structure by blurring the fine atomic details of conformers, and a reverse process that performs the opposite operation using equivariant networks. We demonstrate the effectiveness of EBD by geometric and chemical comparison to state-of-the-art denoising diffusion models on a benchmark of drug-like molecules. Ablation studies draw insights on the design of EBD by thoroughly analyzing its architecture, which includes the design of the loss function and the data corruption process. Codes are released at this https URL .

[AI-91] Adaptive Video Understanding Agent : Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

链接: https://arxiv.org/abs/2410.20252
作者: Sullam Jeoung,Goeric Huybrechts,Bhavana Ganesh,Aram Galstyan,Sravan Bodapati
关键词-EN: computational resources required, content presents significant, presents significant challenges, significant challenges due, substantial computational resources
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding long-form video content presents significant challenges due to its temporal complexity and the substantial computational resources required. In this work, we propose an agent-based approach to enhance both the efficiency and effectiveness of long-form video understanding by utilizing large language models (LLMs) and their tool-harnessing ability. A key aspect of our method is query-adaptive frame sampling, which leverages the reasoning capabilities of LLMs to process only the most relevant frames in real-time, and addresses an important limitation of existing methods which typically involve sampling redundant or irrelevant frames. To enhance the reasoning abilities of our video-understanding agent, we leverage the self-reflective capabilities of LLMs to provide verbal reinforcement to the agent, which leads to improved performance while minimizing the number of frames accessed. We evaluate our method across several video understanding benchmarks and demonstrate that not only it enhances state-of-the-art performance but also improves efficiency by reducing the number of frames sampled.

[AI-92] Neural Fields in Robotics: A Survey

链接: https://arxiv.org/abs/2410.20220
作者: Muhammad Zubair Irshad,Mauro Comi,Yen-Chen Lin,Nick Heppert,Abhinav Valada,Rares Ambrus,Zsolt Kira,Jonathan Tremblay
关键词-EN: Neural Fields, enabling accurate inference, neural representations enabling, Neural Radiance Fields, Neural Fields encompass
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages, 20 figures. Project Page: this https URL

点击查看摘要

Abstract:Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields’ applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: this https URL

[AI-93] Generative AI in Health Economics and Outcomes Research: A Taxonomy of Key Definitions and Emerging Applications an ISPOR Working Group Report

链接: https://arxiv.org/abs/2410.20204
作者: Rachael Fleurence,Xiaoyan Wang,Jiang Bian,Mitchell K. Higashi,Turgay Ayer,Hua Xu,Dalia Dawoud,Jagpreet Chhatwal
关键词-EN: generative artificial intelligence, health economic modeling, artificial intelligence, explores its emerging, AI-generated outputs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 36 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Objective: This article offers a taxonomy of generative artificial intelligence (AI) for health economics and outcomes research (HEOR), explores its emerging applications, and outlines methods to enhance the accuracy and reliability of AI-generated outputs. Methods: The review defines foundational generative AI concepts and highlights current HEOR applications, including systematic literature reviews, health economic modeling, real-world evidence generation, and dossier development. Approaches such as prompt engineering (zero-shot, few-shot, chain-of-thought, persona pattern prompting), retrieval-augmented generation, model fine-tuning, and the use of domain-specific models are introduced to improve AI accuracy and reliability. Results: Generative AI shows significant potential in HEOR, enhancing efficiency, productivity, and offering novel solutions to complex challenges. Foundation models are promising in automating complex tasks, though challenges remain in scientific reliability, bias, interpretability, and workflow integration. The article discusses strategies to improve the accuracy of these AI tools. Conclusion: Generative AI could transform HEOR by increasing efficiency and accuracy across various applications. However, its full potential can only be realized by building HEOR expertise and addressing the limitations of current AI technologies. As AI evolves, ongoing research and innovation will shape its future role in the field.

[AI-94] Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models

链接: https://arxiv.org/abs/2410.20199
作者: Mohammad Beigi,Sijia Wang,Ying Shen,Zihao Lin,Adithya Kulkarni,Jianfeng He,Feng Chen,Ming Jin,Jin-Hee Cho,Dawei Zhou,Chang-Tien Lu,Lifu Huang
关键词-EN: Large Language Models, Large Language, artificial intelligence applications, Language Models, recent years
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have become fundamental to a broad spectrum of artificial intelligence applications. As the use of LLMs expands, precisely estimating the uncertainty in their predictions has become crucial. Current methods often struggle to accurately identify, measure, and address the true uncertainty, with many focusing primarily on estimating model confidence. This discrepancy is largely due to an incomplete understanding of where, when, and how uncertainties are injected into models. This paper introduces a comprehensive framework specifically designed to identify and understand the types and sources of uncertainty, aligned with the unique characteristics of LLMs. Our framework enhances the understanding of the diverse landscape of uncertainties by systematically categorizing and defining each type, establishing a solid foundation for developing targeted methods that can precisely quantify these uncertainties. We also provide a detailed introduction to key related concepts and examine the limitations of current methods in mission-critical and safety-sensitive applications. The paper concludes with a perspective on future directions aimed at enhancing the reliability and practical adoption of these methods in real-world scenarios.

[AI-95] Uncertainty-Penalized Direct Preference Optimization NEURIPS2024

链接: https://arxiv.org/abs/2410.20187
作者: Sam Houliston,Alizée Pace,Alexander Immer,Gunnar Rätsch
关键词-EN: Aligning Large Language, Large Language Models, Aligning Large, Large Language, Direct Preference Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at the NeurIPS 2024 FITML Workshop

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.

[AI-96] Chemical Language Model Linker: blending text and molecules with modular adapters

链接: https://arxiv.org/abs/2410.20182
作者: Yifan Deng,Spencer S. Ericksen,Anthony Gitter
关键词-EN: enabled the appealing, appealing idea, models, Language Model Linker, Chemical Language Model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: 25 pages, 3 figures

点击查看摘要

Abstract:The development of large language models and multi-modal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large-scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multi-modal models combining text and molecules are often trained from scratch, without leveraging existing high-quality pretrained models. That approach consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the large PubChem dataset of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the dataset as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality.

[AI-97] Diff-CXR: Report-to-CXR generation through a disease-knowledge enhanced diffusion model

链接: https://arxiv.org/abs/2410.20165
作者: Peng Huang,Bowen Guo,Shuyu Liang,Junhu Fu,Yuanyuan Wang,Yi Guo
关键词-EN: broad potential applications, Diffusion-based TTI learning, diverse image generation, medical TTI methods, TTI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-To-Image (TTI) generation is significant for controlled and diverse image generation with broad potential applications. Although current medical TTI methods have made some progress in report-to-Chest-Xray (CXR) generation, their generation performance may be limited due to the intrinsic characteristics of medical data. In this paper, we propose a novel disease-knowledge enhanced Diffusion-based TTI learning framework, named Diff-CXR, for medical report-to-CXR generation. First, to minimize the negative impacts of noisy data on generation, we devise a Latent Noise Filtering Strategy that gradually learns the general patterns of anomalies and removes them in the latent space. Then, an Adaptive Vision-Aware Textual Learning Strategy is designed to learn concise and important report embeddings in a domain-specific Vision-Language Model, providing textual guidance for Chest-Xray generation. Finally, by incorporating the general disease knowledge into the pretrained TTI model via a delicate control adapter, a disease-knowledge enhanced diffusion model is introduced to achieve realistic and precise report-to-CXR generation. Experimentally, our Diff-CXR outperforms previous SOTA medical TTI methods by 33.4% / 8.0% and 23.8% / 56.4% in the FID and mAUC score on MIMIC-CXR and IU-Xray, with the lowest computational complexity at 29.641 GFLOPs. Downstream experiments on three thorax disease classification benchmarks and one CXR-report generation benchmark demonstrate that Diff-CXR is effective in improving classical CXR analysis methods. Notably, models trained on the combination of 1% real data and synthetic data can achieve a competitive mAUC score compared to models trained on all data, presenting promising clinical applications.

[AI-98] AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models NIPS2024

链接: https://arxiv.org/abs/2410.20149
作者: Yabin Zhang,Lei Zhang
关键词-EN: pre-trained vision-language models, Recent research, OOD, OOD images, actual OOD images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NIPS 2024 Camera Ready, Codes are available at \url{ this https URL }

点击查看摘要

Abstract:Recent research has shown that pre-trained vision-language models are effective at identifying out-of-distribution (OOD) samples by using negative labels as guidance. However, employing consistent negative labels across different OOD datasets often results in semantic misalignments, as these text labels may not accurately reflect the actual space of OOD images. To overcome this issue, we introduce \textitadaptive negative proxies, which are dynamically generated during testing by exploring actual OOD images, to align more closely with the underlying OOD label space and enhance the efficacy of negative proxy guidance. Specifically, our approach utilizes a feature memory bank to selectively cache discriminative features from test images, representing the targeted OOD distribution. This facilitates the creation of proxies that can better align with specific OOD datasets. While task-adaptive proxies average features to reflect the unique characteristics of each dataset, the sample-adaptive proxies weight features based on their similarity to individual test samples, exploring detailed sample-level nuances. The final score for identifying OOD samples integrates static negative labels with our proposed adaptive proxies, effectively combining textual and visual knowledge for enhanced performance. Our method is training-free and annotation-free, and it maintains fast testing speed. Extensive experiments across various benchmarks demonstrate the effectiveness of our approach, abbreviated as AdaNeg. Notably, on the large-scale ImageNet benchmark, our AdaNeg significantly outperforms existing methods, with a 2.45% increase in AUROC and a 6.48% reduction in FPR95. Codes are available at \urlthis https URL.

[AI-99] Exploring Welfare Maximization and Fairness in Participatory Budgeting

链接: https://arxiv.org/abs/2410.20143
作者: Gogulapati Sreedurga
关键词-EN: Participatory budgeting, divisible resource, called a budget, voting paradigm, paradigm for distributing
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: PhD Thesis

点击查看摘要

Abstract:Participatory budgeting (PB) is a voting paradigm for distributing a divisible resource, usually called a budget, among a set of projects by aggregating the preferences of individuals over these projects. It is implemented quite extensively for purposes such as government allocating funds to public projects and funding agencies selecting research proposals to support. This PhD dissertation studies the welfare-related and fairness-related objectives for different PB models. Our contribution lies in proposing and exploring novel PB rules that maximize welfare and promote fairness, as well as, in introducing and investigating a range of novel utility notions, axiomatic properties, and fairness notions, effectively filling the gaps in the existing literature for each PB model. The thesis is divided into two main parts, the first focusing on dichotomous and the second focusing on ordinal preferences. Each part considers two cases: (i) the cost of each project is restricted to a single value and partial funding is not permitted and (ii) the cost of each project is flexible and may assume multiple values.

[AI-100] Mask-based Membership Inference Attacks for Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2410.20142
作者: Mingrui Liu,Sixiao Zhang,Cheng Long
关键词-EN: RAG system, Membership Inference Attacks, RAG, RAG system knowledge, RAG knowledge databases
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has been an effective approach to mitigate hallucinations in large language models (LLMs) by incorporating up-to-date and domain-specific knowledge. Recently, there has been a trend of storing up-to-date or copyrighted data in RAG knowledge databases instead of using it for LLM training. This practice has raised concerns about Membership Inference Attacks (MIAs), which aim to detect if a specific target document is stored in the RAG system’s knowledge database so as to protect the rights of data producers. While research has focused on enhancing the trustworthiness of RAG systems, existing MIAs for RAG systems remain largely insufficient. Previous work either relies solely on the RAG system’s judgment or is easily influenced by other documents or the LLM’s internal knowledge, which is unreliable and lacks explainability. To address these limitations, we propose a Mask-Based Membership Inference Attacks (MBA) framework. Our framework first employs a masking algorithm that effectively masks a certain number of words in the target document. The masked text is then used to prompt the RAG system, and the RAG system is required to predict the mask values. If the target document appears in the knowledge database, the masked text will retrieve the complete target document as context, allowing for accurate mask prediction. Finally, we adopt a simple yet effective threshold-based method to infer the membership of target document by analyzing the accuracy of mask prediction. Our mask-based approach is more document-specific, making the RAG system’s generation less susceptible to distractions from other documents or the LLM’s internal knowledge. Extensive experiments demonstrate the effectiveness of our approach compared to existing baseline models.

[AI-101] MAD-Sherlock: Multi-Agent Debates for Out-of-Context Misinformation Detection

链接: https://arxiv.org/abs/2410.20140
作者: Kumud Lakara,Juil Sock,Christian Rupprecht,Philip Torr,John Collomosse,Christian Schroeder de Witt
关键词-EN: creating false narratives, misleading text, creating false, false narratives, OOC Misinformation Detection
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the most challenging forms of misinformation involves the out-of-context (OOC) use of images paired with misleading text, creating false narratives. Existing AI-driven detection systems lack explainability and require expensive fine-tuning. We address these issues with MAD-Sherlock: a Multi-Agent Debate system for OOC Misinformation Detection. MAD-Sherlock introduces a novel multi-agent debate framework where multimodal agents collaborate to assess contextual consistency and request external information to enhance cross-context reasoning and decision-making. Our framework enables explainable detection with state-of-the-art accuracy even without domain-specific fine-tuning. Extensive ablation studies confirm that external retrieval significantly improves detection accuracy, and user studies demonstrate that MAD-Sherlock boosts performance for both experts and non-experts. These results position MAD-Sherlock as a powerful tool for autonomous and citizen intelligence applications.

[AI-102] Estuary: A Framework For Building Multimodal Low-Latency Real-Time Socially Interactive Agents

链接: https://arxiv.org/abs/2410.20116
作者: Spencer Lin,Basem Rizk,Miru Jun,Andy Artze,Caitlin Sullivan,Sharon Mozgai,Scott Fisher
关键词-EN: Socially Interactive Agents, generative artificial intelligence, Socially Interactive, field of Socially, Interactive Agents
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: To be published in ACM Intelligent Virtual Agents (IVA) 2024 [DOI: https://doi.org/10.1145/3652988.3696198 ] [ACM ISBN: 979-8-4007-0625-7/24/09]

点击查看摘要

Abstract:The rise in capability and ubiquity of generative artificial intelligence (AI) technologies has enabled its application to the field of Socially Interactive Agents (SIAs). Despite rising interest in modern AI-powered components used for real-time SIA research, substantial friction remains due to the absence of a standardized and universal SIA framework. To target this absence, we developed Estuary: a multimodal (text, audio, and soon video) framework which facilitates the development of low-latency, real-time SIAs. Estuary seeks to reduce repeat work between studies and to provide a flexible platform that can be run entirely off-cloud to maximize configurability, controllability, reproducibility of studies, and speed of agent response times. We are able to do this by constructing a robust multimodal framework which incorporates current and future components seamlessly into a modular and interoperable architecture.

[AI-103] GiVE: Guiding Visual Encoder to Perceive Overlooked Information

链接: https://arxiv.org/abs/2410.20109
作者: Junjie Li,Jianghong Ma,Xiaofeng Zhang,Yuhang Li,Jianyang Shi
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.

[AI-104] Emergence of Globally Attracting Fixed Points in Deep Neural Networks With Nonlinear Activations

链接: https://arxiv.org/abs/2410.20107
作者: Amir Joudaki,Thomas Hofmann
关键词-EN: transform input data, neural networks transform, generalization capabilities, networks transform input, fundamental to unraveling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding how neural networks transform input data across layers is fundamental to unraveling their learning and generalization capabilities. Although prior work has used insights from kernel methods to study neural networks, a global analysis of how the similarity between hidden representations evolves across layers remains underexplored. In this paper, we introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representation for two different inputs. Operating under the mean-field regime, we show that the kernel sequence evolves deterministically via a kernel map, which only depends on the activation function. By expanding activation using Hermite polynomials and using their algebraic properties, we derive an explicit form for kernel map and fully characterize its fixed points. Our analysis reveals that for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture. We further extend our results to networks with residual connections and normalization layers, demonstrating similar convergence behaviors. This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.

[AI-105] Self-Normalized Resets for Plasticity in Continual Learning

链接: https://arxiv.org/abs/2410.20098
作者: Vivek F. Farias,Adam D. Jozefiak
关键词-EN: increasingly important phenomenon, mitigates plasticity loss, Plasticity Loss, changing tasks, task diminishes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Plasticity Loss is an increasingly important phenomenon that refers to the empirical observation that as a neural network is continually trained on a sequence of changing tasks, its ability to adapt to a new task diminishes over time. We introduce Self-Normalized Resets (SNR), a simple adaptive algorithm that mitigates plasticity loss by resetting a neuron’s weights when evidence suggests its firing rate has effectively dropped to zero. Across a battery of continual learning problems and network architectures, we demonstrate that SNR consistently attains superior performance compared to its competitor algorithms. We also demonstrate that SNR is robust to its sole hyperparameter, its rejection percentile threshold, while competitor algorithms show significant sensitivity. SNR’s threshold-based reset mechanism is motivated by a simple hypothesis test that we derive. Seen through the lens of this hypothesis test, competing reset proposals yield suboptimal error rates in correctly detecting inactive neurons, potentially explaining our experimental observations. We also conduct a theoretical investigation of the optimization landscape for the problem of learning a single ReLU. We show that even when initialized adversarially, an idealized version of SNR learns the target ReLU, while regularization-based approaches can fail to learn.

[AI-106] OGBench: Benchmarking Offline Goal-Conditioned RL

链接: https://arxiv.org/abs/2410.20092
作者: Seohong Park,Kevin Frans,Benjamin Eysenbach,Sergey Levine
关键词-EN: goal-conditioned reinforcement learning, reinforcement learning, acquire diverse behaviors, offline GCRL algorithms, Offline goal-conditioned reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Project page: this https URL

[AI-107] Optimizing Keyphrase Ranking for Relevance and Diversity Using Submodular Function Optimization (SFO)

链接: https://arxiv.org/abs/2410.20080
作者: Muhammad Umair,Syed Jalaluddin Hashmi,Young-Koo Lee
关键词-EN: relevant information efficiently, retrieving relevant information, information efficiently, Keyphrase ranking plays, information retrieval
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Keyphrase ranking plays a crucial role in information retrieval and summarization by indexing and retrieving relevant information efficiently. Advances in natural language processing, especially large language models (LLMs), have improved keyphrase extraction and ranking. However, traditional methods often overlook diversity, resulting in redundant keyphrases. We propose a novel approach using Submodular Function Optimization (SFO) to balance relevance and diversity in keyphrase ranking. By framing the task as submodular maximization, our method selects diverse and representative keyphrases. Experiments on benchmark datasets show that our approach outperforms existing methods in both relevance and diversity metrics, achieving SOTA performance in execution time. Our code is available online.

[AI-108] Evaluating Neural Networks for Early Maritime Threat Detection

链接: https://arxiv.org/abs/2410.20054
作者: Dhanush Tella,Chandra Teja Tiriveedhi,Naphtali Rishe,Dan E. Tamir,Jonathan I. Tamir
关键词-EN: assessing maritime threats, maritime threats, task of classifying, proxy for assessing, assessing maritime
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We consider the task of classifying trajectories of boat activities as a proxy for assessing maritime threats. Previous approaches have considered entropy-based metrics for clustering boat activity into three broad categories: random walk, following, and chasing. Here, we comprehensively assess the accuracy of neural network-based approaches as alternatives to entropy-based clustering. We train four neural network models and compare them to shallow learning using synthetic data. We also investigate the accuracy of models as time steps increase and with and without rotated data. To improve test-time robustness, we normalize trajectories and perform rotation-based data augmentation. Our results show that deep networks can achieve a test-set accuracy of up to 100% on a full trajectory, with graceful degradation as the number of time steps decreases, outperforming entropy-based clustering.

[AI-109] AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

链接: https://arxiv.org/abs/2410.20050
作者: Lei Li,Xiangxu Zhang,Xiao Zhou,Zheng Liu
关键词-EN: including electronic health, electronic health records, scientific literature, diverse sources, including electronic
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 15 pages, 3 figures

点击查看摘要

Abstract:Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: this https URL.

[AI-110] DQRM: Deep Quantized Recommendation Models

链接: https://arxiv.org/abs/2410.20046
作者: Yang Zhou,Zhen Dong,Ellick Chan,Dhiraj Kalamkar,Diana Marculescu,Kurt Keutzer
关键词-EN: large Internet companies, Large-scale recommendation models, Internet companies, large Internet, Large-scale recommendation
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large-scale recommendation models are currently the dominant workload for many large Internet companies. These recommenders are characterized by massive embedding tables that are sparsely accessed by the index for user and item features. The size of these 1TB+ tables imposes a severe memory bottleneck for the training and inference of recommendation models. In this work, we propose a novel recommendation framework that is small, powerful, and efficient to run and train, based on the state-of-the-art Deep Learning Recommendation Model (DLRM). The proposed framework makes inference more efficient on the cloud servers, explores the possibility of deploying powerful recommenders on smaller edge devices, and optimizes the workload of the communication overhead in distributed training under the data parallelism settings. Specifically, we show that quantization-aware training (QAT) can impose a strong regularization effect to mitigate the severe overfitting issues suffered by DLRMs. Consequently, we achieved INT4 quantization of DLRM models without any accuracy drop. We further propose two techniques that improve and accelerate the conventional QAT workload specifically for the embedding tables in the recommendation models. Furthermore, to achieve efficient training, we quantize the gradients of the embedding tables into INT8 on top of the well-supported specified sparsification. We show that combining gradient sparsification and quantization together significantly reduces the amount of communication. Briefly, DQRM models with INT4 can achieve 79.07% accuracy on Kaggle with 0.27 GB model size, and 81.21% accuracy on the Terabyte dataset with 1.57 GB, which even outperform FP32 DLRMs that have much larger model sizes (2.16 GB on Kaggle and 12.58 on Terabyte).

[AI-111] Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors

链接: https://arxiv.org/abs/2410.20034
作者: Wenqiang Chen,Jiaxuan Cheng,Leyao Wang,Wei Zhao,Wojciech Matusik
关键词-EN: generates textual responses, natural language question, Visual Question-Answering, progressed significantly, technology that generates
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Visual Question-Answering, a technology that generates textual responses from an image and natural language question, has progressed significantly. Notably, it can aid in tracking and inquiring about daily activities, crucial in healthcare monitoring, especially for elderly patients or those with memory disabilities. However, video poses privacy concerns and has a limited field of view. This paper presents Sensor2Text, a model proficient in tracking daily activities and engaging in conversations using wearable sensors. The approach outlined here tackles several challenges, including low information density in wearable sensor data, insufficiency of single wearable sensors in human activities recognition, and model’s limited capacity for Question-Answering and interactive conversations. To resolve these obstacles, transfer learning and student-teacher networks are utilized to leverage knowledge from visual-language models. Additionally, an encoder-decoder neural network model is devised to jointly process language and sensor data for conversational purposes. Furthermore, Large Language Models are also utilized to enable interactive capabilities. The model showcases the ability to identify human activities and engage in Q\A dialogues using various wearable sensor modalities. It performs comparably to or better than existing visual-language models in both captioning and conversational tasks. To our knowledge, this represents the first model capable of conversing about wearable sensor data, offering an innovative approach to daily activity tracking that addresses privacy and field-of-view limitations associated with current vision-based solutions.

[AI-112] SCube: Instant Large-Scale Scene Reconstruction using VoxSplats NEURIPS2024

链接: https://arxiv.org/abs/2410.20030
作者: Xuanchi Ren,Yifan Lu,Hanxue Liang,Zhangjie Wu,Huan Ling,Mike Chen,Sanja Fidler,Francis Williams,Jiahui Huang
关键词-EN: reconstructing large-scale, set of posed, images, Gaussians, set
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplat from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians with a 1024^3 voxel grid spanning hundreds of meters in 20 seconds. Past works tackling scene reconstruction from images either rely on per-scene optimization and fail to reconstruct the scene away from input views (thus requiring dense view coverage as input) or leverage geometric priors based on low-resolution models, which produce blurry results. In contrast, SCube leverages high-resolution sparse networks and produces sharp outputs from few views. We show the superiority of SCube compared to prior art using the Waymo self-driving dataset on 3D reconstruction and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.

[AI-113] FLOW: A Feedback LOop FrameWork for Simultaneously Enhancing Recommendation and User Agents

链接: https://arxiv.org/abs/2410.20027
作者: Shihao Cai,Jizhi Zhang,Keqin Bao,Chongming Gao,Fuli Feng
关键词-EN: large language models, user agent, recommendation agent, user, shown remarkable reasoning
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Agents powered by large language models have shown remarkable reasoning and execution capabilities, attracting researchers to explore their potential in the recommendation domain. Previous studies have primarily focused on enhancing the capabilities of either recommendation agents or user agents independently, but have not considered the interaction and collaboration between recommendation agents and user agents. To address this gap, we propose a novel framework named FLOW, which achieves collaboration between the recommendation agent and the user agent by introducing a feedback loop. Specifically, the recommendation agent refines its understanding of the user’s preferences by analyzing the user agent’s feedback on previously suggested items, while the user agent leverages suggested items to uncover deeper insights into the user’s latent interests. This iterative refinement process enhances the reasoning capabilities of both the recommendation agent and the user agent, enabling more precise recommendations and a more accurate simulation of user behavior. To demonstrate the effectiveness of the feedback loop, we evaluate both recommendation performance and user simulation performance on three widely used recommendation domain datasets. The experimental results indicate that the feedback loop can simultaneously improve the performance of both the recommendation and user agents.

[AI-114] GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

链接: https://arxiv.org/abs/2410.20018
作者: Kyle B. Hatch,Ashwin Balakrishna,Oier Mees,Suraj Nair,Seohong Park,Blake Wulfe,Masha Itkina,Benjamin Eysenbach,Sergey Levine,Thomas Kollar,Benjamin Burchfiel
关键词-EN: pre-trained on Internet-scale, Internet-scale data, policies, Internet-scale, models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code, model checkpoints and videos can be found at this https URL

点击查看摘要

Abstract:Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.

[AI-115] Off-Policy Selection for Initiating Human-Centric Experimental Design

链接: https://arxiv.org/abs/2410.20017
作者: Ge Gao,Xi Yang,Qitong Gao,Song Ju,Miroslav Pajic,Min Chi
关键词-EN: FPS, necessitates personalized treatments, students necessitates personalized, current OPS methods, OPS
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In human-centric tasks such as healthcare and education, the heterogeneity among patients and students necessitates personalized treatments and instructional interventions. While reinforcement learning (RL) has been utilized in those tasks, off-policy selection (OPS) is pivotal to close the loop by offline evaluating and selecting policies without online interactions, yet current OPS methods often overlook the heterogeneity among participants. Our work is centered on resolving a pivotal challenge in human-centric systems (HCSs): how to select a policy to deploy when a new participant joining the cohort, without having access to any prior offline data collected over the participant? We introduce First-Glance Off-Policy Selection (FPS), a novel approach that systematically addresses participant heterogeneity through sub-group segmentation and tailored OPS criteria to each sub-group. By grouping individuals with similar traits, FPS facilitates personalized policy selection aligned with unique characteristics of each participant or group of participants. FPS is evaluated via two important but challenging applications, intelligent tutoring systems and a healthcare application for sepsis treatment and intervention. FPS presents significant advancement in enhancing learning outcomes of students and in-hospital care outcomes.

[AI-116] Enhancing Battery Storage Energy Arbitrage with Deep Reinforcement Learning and Time-Series Forecasting

链接: https://arxiv.org/abs/2410.20005
作者: Manuel Sage,Joshua Campbell,Yaoyao Fiona Zhao
关键词-EN: sources of income, buying and selling, generating revenues, electricity prices, DRL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS); Systems and Control (eess.SY)
*备注: Accepted for publication at the 18th ASME International Conference on Energy Sustainability

点击查看摘要

Abstract:Energy arbitrage is one of the most profitable sources of income for battery operators, generating revenues by buying and selling electricity at different prices. Forecasting these revenues is challenging due to the inherent uncertainty of electricity prices. Deep reinforcement learning (DRL) emerged in recent years as a promising tool, able to cope with uncertainty by training on large quantities of historical data. However, without access to future electricity prices, DRL agents can only react to the currently observed price and not learn to plan battery dispatch. Therefore, in this study, we combine DRL with time-series forecasting methods from deep learning to enhance the performance on energy arbitrage. We conduct a case study using price data from Alberta, Canada that is characterized by irregular price spikes and highly non-stationary. This data is challenging to forecast even when state-of-the-art deep learning models consisting of convolutional layers, recurrent layers, and attention modules are deployed. Our results show that energy arbitrage with DRL-enabled battery control still significantly benefits from these imperfect predictions, but only if predictors for several horizons are combined. Grouping multiple predictions for the next 24-hour window, accumulated rewards increased by 60% for deep Q-networks (DQN) compared to the experiments without forecasts. We hypothesize that multiple predictors, despite their imperfections, convey useful information regarding the future development of electricity prices through a “majority vote” principle, enabling the DRL agent to learn more profitable control policies.

[AI-117] Artificial Intelligence of Things: A Survey

链接: https://arxiv.org/abs/2410.19998
作者: Shakhrul Iman Siam,Hyunho Ahn,Li Liu,Samiul Alam,Hui Shen,Zhichao Cao,Ness Shroff,Bhaskar Krishnamachari,Mani Srivastava,Mi Zhang
关键词-EN: Artificial Intelligence, modern Artificial Intelligence, Internet of Things, Intelligence of Things, Things
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: Accepted in ACM Transactions on Sensor Networks (TOSN)

点击查看摘要

Abstract:The integration of the Internet of Things (IoT) and modern Artificial Intelligence (AI) has given rise to a new paradigm known as the Artificial Intelligence of Things (AIoT). In this survey, we provide a systematic and comprehensive review of AIoT research. We examine AIoT literature related to sensing, computing, and networking communication, which form the three key components of AIoT. In addition to advancements in these areas, we review domain-specific AIoT systems that are designed for various important application domains. We have also created an accompanying GitHub repository, where we compile the papers included in this survey: this https URL. This repository will be actively maintained and updated with new research as it becomes available. As both IoT and AI become increasingly critical to our society, we believe AIoT is emerging as an essential research field at the intersection of IoT and modern AI. We hope this survey will serve as a valuable resource for those engaged in AIoT research and act as a catalyst for future explorations to bridge gaps and drive advancements in this exciting field.

[AI-118] SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies

链接: https://arxiv.org/abs/2410.19982
作者: Weiqin Chen,Santiago Paternain
关键词-EN: Pretrained foundation models, allowing zero-shot generalization, exhibited extraordinary in-context, extraordinary in-context learning, Pretrained foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pretrained foundation models have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during the pretraining. In the case of RL, in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as AD, DPT and DIT, impose stringent requirements on generating the pretraining dataset concerning the behavior (source) policies, context information, and action labels, etc. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all environments during the generation of the pretraining dataset. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be both prohibitively intractable and expensive. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that allows to generate a remarkable pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces by using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during the pretraining. To the best of our knowledge, this is the first work that enables promising ICRL under (e.g., uniform) random policies and random contexts. We establish theoretical analyses regarding the performance guarantees of SAD. Moreover, our empirical results across multiple ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 180.86% in the offline evaluation and by 172.8% in the online evaluation.

[AI-119] OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery

链接: https://arxiv.org/abs/2410.19965
作者: Philipe Dias,Aristeidis Tsaris,Jordan Bowman,Abhishek Potnis,Jacob Arndt,H. Lexie Yang,Dalton Lunga
关键词-EN: hundred million parameters, models remain restricted, remote sensing, Foundation Models, remain restricted
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:While the pretraining of Foundation Models (FMs) for remote sensing (RS) imagery is on the rise, models remain restricted to a few hundred million parameters. Scaling models to billions of parameters has been shown to yield unprecedented benefits including emergent abilities, but requires data scaling and computing resources typically not available outside industry RD labs. In this work, we pair high-performance computing resources including Frontier supercomputer, America’s first exascale system, and high-resolution optical RS data to pretrain billion-scale FMs. Our study assesses performance of different pretrained variants of vision Transformers across image classification, semantic segmentation and object detection benchmarks, which highlight the importance of data scaling for effective model scaling. Moreover, we discuss construction of a novel TIU pretraining dataset, model initialization, with data and pretrained models intended for public release. By discussing technical challenges and details often lacking in the related literature, this work is intended to offer best practices to the geospatial community toward efficient training and benchmarking of larger FMs.

[AI-120] Understanding Adam Requires Better Rotation Dependent Assumptions

链接: https://arxiv.org/abs/2410.19964
作者: Lucas Maes,Tianyue H. Zhang,Alexia Jolicoeur-Martineau,Ioannis Mitliagkas,Damien Scieur,Simon Lacoste-Julien,Charles Guille-Escuret
关键词-EN: Stochastic Gradient Descent, Gradient Descent, Stochastic Gradient, comprehensive theoretical explanation, widespread adoption
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite its widespread adoption, Adam’s advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam’s sensitivity to rotations of the parameter space. We demonstrate that Adam’s performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam’s advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam’s behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam’s empirical success in modern machine learning tasks.

[AI-121] DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives

链接: https://arxiv.org/abs/2410.19955
作者: Pengfei Hu,Chang Lu,Fei Wang,Yue Ning
关键词-EN: Electronic Health Records, Electronic Health, Health Records, revolutionized healthcare data, healthcare data management
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Electronic Health Records (EHR) has revolutionized healthcare data management and prediction in the field of AI and machine learning. Accurate predictions of diagnosis and medications significantly mitigate health risks and provide guidance for preventive care. However, EHR driven models often have limited scope on understanding medical-domain knowledge and mostly rely on simple-and-sole ontologies. In addition, due to the missing features and incomplete disease coverage of EHR, most studies only focus on basic analysis on conditions and medication. We propose DualMAR, a framework that enhances EHR prediction tasks through both individual observation data and public knowledge bases. First, we construct a bi-hierarchical Diagnosis Knowledge Graph (KG) using verified public clinical ontologies and augment this KG via Large Language Models (LLMs); Second, we design a new proxy-task learning on lab results in EHR for pretraining, which further enhance KG representation and patient embeddings. By retrieving radial and angular coordinates upon polar space, DualMAR enables accurate predictions based on rich hierarchical and semantic embeddings from KG. Experiments also demonstrate that DualMAR outperforms state-of-the-art models, validating its effectiveness in EHR prediction and KG integration in medical domains.

[AI-122] Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations

链接: https://arxiv.org/abs/2410.19948
作者: Kayvan Kousha,Mike Thelwall
关键词-EN: Research Excellence Framework, Academics and departments, Impact Case Studies, assesses Impact Case, benefitted society
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Academics and departments are sometimes judged by how their research has benefitted society. For example, the UK Research Excellence Framework (REF) assesses Impact Case Studies (ICS), which are five-page evidence-based claims of societal impacts. This study investigates whether ChatGPT can evaluate societal impact claims and therefore potentially support expert human assessors. For this, various parts of 6,220 public ICS from REF2021 were fed to ChatGPT 4o-mini along with the REF2021 evaluation guidelines, comparing the results with published departmental average ICS scores. The results suggest that the optimal strategy for high correlations with expert scores is to input the title and summary of an ICS but not the remaining text, and to modify the original REF guidelines to encourage a stricter evaluation. The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment (UoAs), with values between 0.18 (Economics and Econometrics) and 0.56 (Psychology, Psychiatry and Neuroscience). At the departmental level, the corresponding correlations were higher, reaching 0.71 for Sport and Exercise Sciences, Leisure and Tourism. Thus, ChatGPT-based ICS evaluations are simple and viable to support or cross-check expert judgments, although their value varies substantially between fields.

[AI-123] Cobblestone: Iterative Automation for Formal Verification

链接: https://arxiv.org/abs/2410.19940
作者: Saketh Ram Kasibatla,Arpan Agarwal,Yuriy Brun,Sorin Lerner,Talia Ringer,Emily First
关键词-EN: improving software quality, Cobblestone, proof, proof synthesis, software quality
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive. Writing proofs manually requires both significant effort and expertise. Recent research has used machine learning to automatically synthesize proofs, reducing verification effort, but these tools are able to prove only a fraction of the desired software properties. We introduce Cobblestone, a new proof-synthesis approach that improves on the state of the art by taking advantage of partial progress in proof synthesis attempts. Unlike prior tools, Cobblestone can produce multiple unsuccessful proofs using a large language model (LLM), identify the working portions of those proofs, and combine them into a single, successful proof, taking advantage of internal partial progress. We evaluate Cobblestone on two benchmarks of open-source Coq projects, controlling for training data leakage in LLM datasets. Fully automatically, Cobblestone can prove 48% of the theorems, while Proverbot9001, the previous state-of-the-art, learning-based, proof-synthesis tool, can prove 17%. Cobblestone establishes a new state of the art for fully automated proof synthesis tools for Coq. We also evaluate Cobblestone in a setting where it is given external partial proof progress from oracles, serving as proxies for a human proof engineer or another tool. When the theorem is broken down into a set of subgoals and Cobblestone is given a set of relevant lemmas already proven in the project, it can prove up to 58% of the theorems. We qualitatively study the theorems Cobblestone is and is not able to prove to outline potential future research directions to further improve proof synthesis, including developing interactive, semi-automated tools. Our research shows that tools can make better use of partial progress made during proof synthesis to more effectively automate formal verification.

[AI-124] Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

链接: https://arxiv.org/abs/2410.19933
作者: Xiyue Peng,Hengquan Guo,Jiawei Zhang,Dongqing Zou,Ziyu Shao,Honghao Wei,Xin Liu
关键词-EN: Markov Decision Process, aligning large language, large language models, Balancing helpfulness, constrained Markov Decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. However, these methods can lead to ``safety interference’', where average-based safety constraints compromise the safety of some prompts in favor of others. To address this issue, we propose \textbfRectified Policy Optimization (RePO), which replaces the average safety constraint with stricter (per prompt) safety constraints. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments on Alpaca-7B demonstrate that RePO improves the safety alignment and reduces the safety interference compared to baseline methods. Code is available at this https URL.

[AI-125] Language Agents Meet Causality – Bridging LLM s and Causal World Models

链接: https://arxiv.org/abs/2410.19923
作者: John Gkountouras,Matthias Lindemann,Phillip Lippe,Efstratios Gavves,Ivan Titov
关键词-EN: recently shown great, shown great promise, Large Language Models, Large Language, recently shown
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Project page: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRLs with LLMs to enable causally-aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons.

[AI-126] Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder

链接: https://arxiv.org/abs/2410.19922
作者: Anirudha Powadi,Talukder Zaki Jubery,Michael C. Tross,James C. Schnable,Baskar Ganapathysubramanian
关键词-EN: high-dimensional phenotype data, study introduces, designed to disentangle, disentangle the complex, complex interplay
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:This study introduces a compositional autoencoder (CAE) framework designed to disentangle the complex interplay between genotypic and environmental factors in high-dimensional phenotype data to improve trait prediction in plant breeding and genetics programs. Traditional predictive methods, which use compact representations of high-dimensional data through handcrafted features or latent features like PCA or more recently autoencoders, do not separate genotype-specific and environment-specific factors. We hypothesize that disentangling these features into genotype-specific and environment-specific components can enhance predictive models. To test this, we developed a compositional autoencoder (CAE) that decomposes high-dimensional data into distinct genotype-specific and environment-specific latent features. Our CAE framework employs a hierarchical architecture within an autoencoder to effectively separate these entangled latent features. Applied to a maize diversity panel dataset, the CAE demonstrates superior modeling of environmental influences and 5-10 times improved predictive performance for key traits like Days to Pollen and Yield, compared to the traditional methods, including standard autoencoders, PCA with regression, and Partial Least Squares Regression (PLSR). By disentangling latent features, the CAE provides powerful tool for precision breeding and genetic research. This work significantly enhances trait prediction models, advancing agricultural and biological sciences. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN) Cite as: arXiv:2410.19922 [cs.LG] (or arXiv:2410.19922v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.19922 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-127] Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

链接: https://arxiv.org/abs/2410.19919
作者: Avik Kar,Rahul Singh
关键词-EN: average-reward reinforcement learning, state-action space adaptively, Lipschitz MDPs, state-action space, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs and develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms into promising regions of the state-action space. We show that its regret can be bounded as \mathcal\tildeO\big(T^1 - d_\texteff.^-1\big) , where d_\texteff. = 2d_\mathcalS + d_z + 3 , d_\mathcalS is the dimension of the state space, and d_z is the zooming dimension. d_z is a problem-dependent quantity, which allows us to conclude that if MDP is benign, then its regret will be small. We note that the existing notion of zooming dimension for average reward RL is defined in terms of policy coverings, and hence it can be huge when the policy class is rich even though the underlying MDP is simple, so that the regret upper bound is nearly O(T) . The zooming dimension proposed in the current work is bounded above by d , the dimension of the state-action space, and hence is truly adaptive, i.e., shows how to capture adaptivity gains for infinite-horizon average-reward RL. ZoRL outperforms other state-of-the-art algorithms in experiments; thereby demonstrating the gains arising due to adaptivity.

[AI-128] Simmering: Sufficient is better than optimal for training neural networks

链接: https://arxiv.org/abs/2410.19912
作者: Irina Babayan,Hazhir Aliahmadi,Greg van Anders
关键词-EN: neural network training, network training techniques, neural networks, techniques that invoke, hoc modification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The broad range of neural network training techniques that invoke optimization but rely on ad hoc modification for validity suggests that optimization-based training is misguided. Shortcomings of optimization-based training are brought to particularly strong relief by the problem of overfitting, where naive optimization produces spurious outcomes. The broad success of neural networks for modelling physical processes has prompted advances that are based on inverting the direction of investigation and treating neural networks as if they were physical systems in their own right These successes raise the question of whether broader, physical perspectives could motivate the construction of improved training algorithms. Here, we introduce simmering, a physics-based method that trains neural networks to generate weights and biases that are merely ``good enough’', but which, paradoxically, outperforms leading optimization-based approaches. Using classification and regression examples we show that simmering corrects neural networks that are overfit by Adam, and show that simmering avoids overfitting if deployed from the outset. Our results question optimization as a paradigm for neural network training, and leverage information-geometric arguments to point to the existence of classes of sufficient training algorithms that do not take optimization as their starting point.

[AI-129] A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection

链接: https://arxiv.org/abs/2410.19898
作者: Muath Alsuhaibani,Ali Pourramezan Fard,Jian Sun,Farida Far Poor,Peter S. Pressman,Mohammad H. Mahoor
关键词-EN: explores recent advances, paper explores recent, cognitive impairment detection, explores recent, recent advances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This review paper explores recent advances in deep learning approaches for non-invasive cognitive impairment detection. We examine various non-invasive indicators of cognitive decline, including speech and language, facial, and motoric mobility. The paper provides an overview of relevant datasets, feature-extracting techniques, and deep-learning architectures applied to this domain. We have analyzed the performance of different methods across modalities and observed that speech and language-based methods generally achieved the highest detection performance. Studies combining acoustic and linguistic features tended to outperform those using a single modality. Facial analysis methods showed promise for visual modalities but were less extensively studied. Most papers focused on binary classification (impaired vs. non-impaired), with fewer addressing multi-class or regression tasks. Transfer learning and pre-trained language models emerged as popular and effective techniques, especially for linguistic analysis. Despite significant progress, several challenges remain, including data standardization and accessibility, model explainability, longitudinal analysis limitations, and clinical adaptation. Lastly, we propose future research directions, such as investigating language-agnostic speech analysis methods, developing multi-modal diagnostic systems, and addressing ethical considerations in AI-assisted healthcare. By synthesizing current trends and identifying key obstacles, this review aims to guide further development of deep learning-based cognitive impairment detection systems to improve early diagnosis and ultimately patient outcomes.

[AI-130] BBC: Predict True Bacteraemia in Blood Cultures via Deep Learning

链接: https://arxiv.org/abs/2410.19887
作者: Kira Sam
关键词-EN: poses significant diagnostic, significant diagnostic challenges, Antonius Hospital emergency, mortality rates, poses significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Bacteraemia, a bloodstream infection with high morbidity and mortality rates, poses significant diagnostic challenges. Accurate diagnosis through blood cultures is resource-intensive. Developing a machine learning model to predict blood culture outcomes in emergency departments offers potential for improved diagnosis, reduced healthcare costs, and mitigated antibiotic this http URL thesis aims to identify optimal machine learning techniques for predicting bacteraemia and develop a predictive model using data from St. Antonius Hospital’s emergency department. Based on current literature, CatBoost and Random Forest were selected as the most promising techniques. Model optimization using Optuna prioritized this http URL final Random Forest model achieved an ROC AUC of 0.78 and demonstrated 0.92 sensitivity on the test set. Notably, it accurately identified 36.02% of patients at low risk of bacteraemia, with only 0.85% false this http URL of this model in St. Antonius Hospital’s emergency department could reduce blood cultures, healthcare costs, and antibiotic treatments. Future studies should focus on external validation, exploring advanced techniques, and addressing potential confounders to ensure model generalizability.

[AI-131] Paved or unpaved? A Deep Learning derived Road Surface Global Dataset from Mapillary Street-View Imagery

链接: https://arxiv.org/abs/2410.19874
作者: Sukanya Randhawa,Eren Aygun,Guntaj Randhawa,Benjamin Herfort,Sven Lautenbach,Alexander Zipf
关键词-EN: street view platform, world largest crowdsourcing-based, largest crowdsourcing-based street, crowdsourcing-based street view, road surface characteristics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We have released an open dataset with global coverage on road surface characteristics (paved or unpaved) derived utilising 105 million images from the world’s largest crowdsourcing-based street view platform, Mapillary, leveraging state-of-the-art geospatial AI methods. We propose a hybrid deep learning approach which combines SWIN-Transformer based road surface prediction and CLIP-and-DL segmentation based thresholding for filtering of bad quality images. The road surface prediction results have been matched and integrated with OpenStreetMap (OSM) road geometries. This study provides global data insights derived from maps and statistics about spatial distribution of Mapillary coverage and road pavedness on a continent and countries scale, with rural and urban this http URL dataset expands the availability of global road surface information by over 3 million kilometers, now representing approximately 36% of the total length of the global road this http URL regions showed moderate to high paved road coverage (60-80%), but significant gaps were noted in specific areas of Africa and Asia. Urban areas tend to have near-complete paved coverage, while rural regions display more variability. Model validation against OSM surface data achieved strong performance, with F1 scores for paved roads between 91-97% across this http URL forward the work of Mapillary and their contributors and enrichment of OSM road attributes, our work provides valuable insights for applications in urban planning, disaster routing, logistics optimisation and addresses various Sustainable Development Goals (SDGS): especially SDGs 1 (No poverty), 3 (Good health and well-being), 8 (Decent work and economic growth), 9 (Industry, Innovation and Infrastructure), 11 (Sustainable cities and communities), 12 (Responsible consumption and production), and 13 (Climate action).

[AI-132] Evaluating Deep Learning Approaches for Predictions in Unmonitored Basins with Continental-scale Stream Temperature Models

链接: https://arxiv.org/abs/2410.19865
作者: Jared D. Willard,Fabio Ciulla,Helen Weierbach,Vipin Kumar,Charuleka Varadharajan
关键词-EN: challenge in hydrology, environmental variables, grand challenge, models, variables in unmonitored
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 47 pages, 12 figures, 7 tables, submitted to Water Resources Research

点击查看摘要

Abstract:The prediction of streamflows and other environmental variables in unmonitored basins is a grand challenge in hydrology. Recent machine learning (ML) models can harness vast datasets for accurate predictions at large spatial scales. However, there are open questions regarding model design and data needed for inputs and training to improve performance. This study explores these questions while demonstrating the ability of deep learning models to make accurate stream temperature predictions in unmonitored basins across the conterminous United States. First, we compare top-down models that utilize data from a large number of basins with bottom-up methods that transfer ML models built on local sites, reflecting traditional regionalization techniques. We also evaluate an intermediary grouped modeling approach that categorizes sites based on regional co-location or similarity of catchment characteristics. Second, we evaluate trade-offs between model complexity, prediction accuracy, and applicability for more target locations by systematically removing inputs. We then examine model performance when additional training data becomes available due to reductions in input requirements. Our results suggest that top-down models significantly outperform bottom-up and grouped models. Moreover, it is possible to get acceptable accuracy by reducing both dynamic and static inputs enabling predictions for more sites with lower model complexity and computational needs. From detailed error analysis, we determined that the models are more accurate for sites primarily controlled by air temperatures compared to locations impacted by groundwater and dams. By addressing these questions, this research offers a comprehensive perspective on optimizing ML model design for accurate predictions in unmonitored regions.

[AI-133] Real-Time Weapon Detection Using YOLOv8 for Enhanced Safety

链接: https://arxiv.org/abs/2410.19862
作者: Ayush Thakur,Akshat Shrivastav,Rohan Sharma,Triyank Kumar,Kabir Puri
关键词-EN: research paper presents, paper presents, presents the development, public transportation systems, model utilizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:This research paper presents the development of an AI model utilizing YOLOv8 for real-time weapon detection, aimed at enhancing safety in public spaces such as schools, airports, and public transportation systems. As incidents of violence continue to rise globally, there is an urgent need for effective surveillance technologies that can quickly identify potential threats. Our approach focuses on leveraging advanced deep learning techniques to create a highly accurate and efficient system capable of detecting weapons in real-time video streams. The model was trained on a comprehensive dataset containing thousands of images depicting various types of firearms and edged weapons, ensuring a robust learning process. We evaluated the model’s performance using key metrics such as precision, recall, F1-score, and mean Average Precision (mAP) across multiple Intersection over Union (IoU) thresholds, revealing a significant capability to differentiate between weapon and non-weapon classes with minimal error. Furthermore, we assessed the system’s operational efficiency, demonstrating that it can process frames at high speeds suitable for real-time applications. The findings indicate that our YOLOv8-based weapon detection model not only contributes to the existing body of knowledge in computer vision but also addresses critical societal needs for improved safety measures in vulnerable environments. By harnessing the power of artificial intelligence, this research lays the groundwork for developing practical solutions that can be deployed in security settings, ultimately enhancing the protective capabilities of law enforcement and public safety agencies.

[AI-134] Personalized Recommendation Systems using Multimodal Autonomous Multi Agent Systems

链接: https://arxiv.org/abs/2410.19855
作者: Param Thakkar,Anushka Yadav
关键词-EN: highly developed personalised, developed personalised recommendation, paper describes, describes a highly, highly developed
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper describes a highly developed personalised recommendation system using multimodal, autonomous, multi-agent systems. The system focuses on the incorporation of futuristic AI tech and LLMs like Gemini-1.5- pro and LLaMA-70B to improve customer service experiences especially within e-commerce. Our approach uses multi agent, multimodal systems to provide best possible recommendations to its users. The system is made up of three agents as a whole. The first agent recommends products appropriate for answering the given question, while the second asks follow-up questions based on images that belong to these recommended products and is followed up with an autonomous search by the third agent. It also features a real-time data fetch, user preferences-based recommendations and is adaptive learning. During complicated queries the application processes with Symphony, and uses the Groq API to answer quickly with low response times. It uses a multimodal way to utilize text and images comprehensively, so as to optimize product recommendation and customer interaction.

[AI-135] Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts ECAI2024

链接: https://arxiv.org/abs/2410.19852
作者: Sheryl Paul,Jyotirmoy V. Deshmukh
关键词-EN: Reinforcement learning, autonomous agents operating, distribution shifts, policy, optimal policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Neural and Evolutionary Computing (cs.NE)
*备注: Pubblished in ECAI 2024

点击查看摘要

Abstract:Reinforcement learning (RL) has been successfully applied to solve the problem of finding obstacle-free paths for autonomous agents operating in stochastic and uncertain environments. However, when the underlying stochastic dynamics of the environment experiences drastic distribution shifts, the optimal policy obtained in the trained environment may be sub-optimal or may entirely fail in helping find goal-reaching paths for the agent. Approaches like domain randomization and robust RL can provide robust policies, but typically assume minor (bounded) distribution shifts. For substantial distribution shifts, retraining (either with a warm-start policy or from scratch) is an alternative approach. In this paper, we develop a novel approach called \em Evolutionary Robust Policy Optimization (ERPO), an adaptive re-training algorithm inspired by evolutionary game theory (EGT). ERPO learns an optimal policy for the shifted environment iteratively using a temperature parameter that controls the trade off between exploration and adherence to the old optimal policy. The policy update itself is an instantiation of the replicator dynamics used in EGT. We show that under fairly common sparsity assumptions on rewards in such environments, ERPO converges to the optimal policy in the shifted environment. We empirically demonstrate that for path finding tasks in a number of environments, ERPO outperforms several popular RL and deep RL algorithms (PPO, A3C, DQN) in many scenarios and popular environments. This includes scenarios where the RL algorithms are allowed to train from scratch in the new environment, when they are retrained on the new environment, or when they are used in conjunction with domain randomization. ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.

[AI-136] AEPL: Automated and Editable Prompt Learning for Brain Tumor Segmentation

链接: https://arxiv.org/abs/2410.19847
作者: Yongheng Sun,Mingxia Liu,Chunfeng Lian
关键词-EN: diagnosisand treatment planning, pose significant challenges, accurate diagnosisand treatment, Brain tumor segmentation, irregular shapeof tumors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages paper for ISBI2025

点击查看摘要

Abstract:Brain tumor segmentation is crucial for accurate diagnosisand treatment planning, but the small size and irregular shapeof tumors pose significant challenges. Existing methods of-ten fail to effectively incorporate medical domain knowledgesuch as tumor grade, which correlates with tumor aggres-siveness and morphology, providing critical insights for moreaccurate detection of tumor subregions during this http URL propose an Automated and Editable Prompt Learning(AEPL) framework that integrates tumor grade into the seg-mentation process by combining multi-task learning andprompt learning with automatic and editable prompt gen-eration. Specifically, AEPL employs an encoder to extractimage features for both tumor-grade prediction and segmen-tation mask generation. The predicted tumor grades serveas auto-generated prompts, guiding the decoder to produceprecise segmentation masks. This eliminates the need formanual prompts while allowing clinicians to manually editthe auto-generated prompts to fine-tune the segmentation,enhancing both flexibility and precision. The proposed AEPLachieves state-of-the-art performance on the BraTS 2018dataset, demonstrating its effectiveness and clinical this http URL source code can be accessed online.

[AI-137] Enhancing Trust and Safety in Digital Payments: An LLM -Powered Approach

链接: https://arxiv.org/abs/2410.19845
作者: Devendra Dahiphale,Naveen Madiraju,Justin Lin,Rutvik Karve,Monu Agrawal,Anant Modwal,Ramanan Balakrishnan,Shanay Shah,Govind Kaushal,Priya Mandawat,Prakash Hariramani,Arif Merchant(Google, Inc)
关键词-EN: offering unparalleled convenience, Digital payment systems, revolutionized financial transactions, offering unparalleled, users worldwide
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed a 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. A promising fact, the model identified 32% new accurate reasons for suspected scams that human reviewers had not included in the review notes.

[AI-138] GreenEye: Development of Real-Time Traffic Signal Recognition System for Visual Impairments

链接: https://arxiv.org/abs/2410.19840
作者: Danu Kim
关键词-EN: visually impaired people, impaired people, significant challenges, challenges to visually, visually impaired
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Published in Korea Software Congress (2023)

点击查看摘要

Abstract:Recognizing a traffic signal, determining if the signal is green or red, and figuring out the time left to cross the crosswalk are significant challenges to visually impaired people. Previous research has focused on recognizing only two traffic signals, green and red lights, using machine learning techniques. The proposed method developed a GreenEye system that recognizes the traffic signals’ color and tells the time left for pedestrians to cross the crosswalk in real-time. GreenEye’s first training showed the highest precision of 74.6%; four classes reported 40% or lower recognition precision in this training session. The data imbalance caused low precision; thus, extra labeling and database formation were performed to stabilize the number of images between different classes. After the stabilization, all 14 classes showed excelling precision rate of 99.5%.

[AI-139] Multidimensional Knowledge Graph Embeddings for International Trade Flow Analysis

链接: https://arxiv.org/abs/2410.19835
作者: Durgesh Nandini,Simon Bloethner,Mirco Schoenfeld,Mario Larch
关键词-EN: poses significant challenges, offer limited capacity, methods offer limited, traditional regression methods, Understanding the complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Understanding the complex dynamics of high-dimensional, contingent, and strongly nonlinear economic data, often shaped by multiplicative processes, poses significant challenges for traditional regression methods as such methods offer limited capacity to capture the structural changes they feature. To address this, we propose leveraging the potential of knowledge graph embeddings for economic trade data, in particular, to predict international trade relationships. We implement KonecoKG, a knowledge graph representation of economic trade data with multidimensional relationships using SDM-RDFizer, and transform the relationships into a knowledge graph embedding using AmpliGraph.

[AI-140] GNNRL-Smoothing: A Prior-Free Reinforcement Learning Model for Mesh Smoothing

链接: https://arxiv.org/abs/2410.19834
作者: Zhichao Wang,Xinhai Chen,Chunye Gong,Bo Yang,Liang Deng,Yufei Sun,Yufei Pang,Jie Liu
关键词-EN: eliminating distorted elements, smoothing, Mesh, distorted elements, leading to improved
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Mesh smoothing methods can enhance mesh quality by eliminating distorted elements, leading to improved convergence in simulations. To balance the efficiency and robustness of traditional mesh smoothing process, previous approaches have employed supervised learning and reinforcement learning to train intelligent smoothing models. However, these methods heavily rely on labeled dataset or prior knowledge to guide the models’ learning. Furthermore, their limited capacity to enhance mesh connectivity often restricts the effectiveness of smoothing. In this paper, we first systematically analyze the learning mechanisms of recent intelligent smoothing methods and propose a prior-free reinforcement learning model for intelligent mesh smoothing. Our proposed model integrates graph neural networks with reinforcement learning to implement an intelligent node smoothing agent and introduces, for the first time, a mesh connectivity improvement agent. We formalize mesh optimization as a Markov Decision Process and successfully train both agents using Twin Delayed Deep Deterministic Policy Gradient and Double Dueling Deep Q-Network in the absence of any prior data or knowledge. We verified the proposed model on both 2D and 3D meshes. Experimental results demonstrate that our model achieves feature-preserving smoothing on complex 3D surface meshes. It also achieves state-of-the-art results among intelligent smoothing methods on 2D meshes and is 7.16 times faster than traditional optimization-based smoothing methods. Moreover, the connectivity improvement agent can effectively enhance the quality distribution of the mesh.

[AI-141] Human-Centric eXplainable AI in Education

链接: https://arxiv.org/abs/2410.19822
作者: Subhankar Maity,Aniket Deroy
关键词-EN: artificial intelligence, understandable and trustworthy, educational environments, developing HCXAI systems, systems
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Preprint. Under Review

点击查看摘要

Abstract:As artificial intelligence (AI) becomes more integrated into educational environments, how can we ensure that these systems are both understandable and trustworthy? The growing demand for explainability in AI systems is a critical area of focus. This paper explores Human-Centric eXplainable AI (HCXAI) in the educational landscape, emphasizing its role in enhancing learning outcomes, fostering trust among users, and ensuring transparency in AI-driven tools, particularly through the innovative use of large language models (LLMs). What challenges arise in the implementation of explainable AI in educational contexts? This paper analyzes these challenges, addressing the complexities of AI models and the diverse needs of users. It outlines comprehensive frameworks for developing HCXAI systems that prioritize user understanding and engagement, ensuring that educators and students can effectively interact with these technologies. Furthermore, what steps can educators, developers, and policymakers take to create more effective, inclusive, and ethically responsible AI solutions in education? The paper provides targeted recommendations to address this question, highlighting the necessity of prioritizing explainability. By doing so, how can we leverage AI’s transformative potential to foster equitable and engaging educational experiences that support diverse learners?

[AI-142] ScreenWriter: Automatic Screenplay Generation and Movie Summarisation

链接: https://arxiv.org/abs/2410.19809
作者: Louis Mahon,Mirella Lapata
关键词-EN: key plot points, recall key plot, creative video content, overview without watching, proliferation of creative
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The proliferation of creative video content has driven demand for textual descriptions or summaries that allow users to recall key plot points or get an overview without watching. The volume of movie content and speed of turnover motivates automatic summarisation, which is nevertheless challenging, requiring identifying character intentions and very long-range temporal dependencies. The few existing methods attempting this task rely heavily on textual screenplays as input, greatly limiting their applicability. In this work, we propose the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions. ScreenWriter introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors, and a novel method for the challenging problem of determining character names, based on a database of actors’ faces. We further demonstrate how these automatic screenplays can be used to generate plot synopses with a hierarchical summarisation method based on scene breaks. We test the quality of the final summaries on the recent MovieSum dataset, which we augment with videos, and show that they are superior to a number of comparison models which assume access to goldstandard screenplays.

[AI-143] LocateBench: Evaluating the Locating Ability of Vision Language Models

链接: https://arxiv.org/abs/2410.19808
作者: Ting-Rui Chiang,Joshua Robinson,Xinyan Velocity Yu,Dani Yogatama
关键词-EN: natural language instructions, real-world applications, locate an object, instructions is crucial, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: We release the dataset at this https URL

点击查看摘要

Abstract:The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches, and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.

[AI-144] Xeno-learning: knowledge transfer across species in deep learning-based spectral image analysis

链接: https://arxiv.org/abs/2410.19789
作者: Jan Sellner,Alexander Studier-Fischer,Ahmad Bin Qasim,Silvia Seidlitz,Nicholas Schreck,Minu Tizabi,Manuel Wiesenfarth,Annette Kopp-Schneider,Samuel Knödler,Caelan Max Haney,Gabriel Salg,Berkin Özdemir,Maximilian Dietrich,Maurice Stephan Michel,Felix Nickel,Karl-Friedrich Kowalewski,Lena Maier-Hein
关键词-EN: optical imaging techniques, clinical surgical imaging, revolutionize clinical surgical, imaging techniques, optical imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Jan Sellner and Alexander Studier-Fischer contributed equally to this work

点击查看摘要

Abstract:Novel optical imaging techniques, such as hyperspectral imaging (HSI) combined with machine learning-based (ML) analysis, have the potential to revolutionize clinical surgical imaging. However, these novel modalities face a shortage of large-scale, representative clinical data for training ML algorithms, while preclinical animal data is abundantly available through standardized experiments and allows for controlled induction of pathological tissue states, which is not ethically possible in patients. To leverage this situation, we propose a novel concept called “xeno-learning”, a cross-species knowledge transfer paradigm inspired by xeno-transplantation, where organs from a donor species are transplanted into a recipient species. Using a total of 11,268 HSI images from humans as well as porcine and rat models, we show that although spectral signatures of organs differ across species, shared pathophysiological mechanisms manifest as comparable relative spectral changes across species. Such changes learnt in one species can thus be transferred to a new species via a novel “physiology-based data augmentation” method, enabling the large-scale secondary use of preclinical animal data for humans. The resulting ethical, monetary, and performance benefits of the proposed knowledge transfer paradigm promise a high impact of the methodology on future developments in the field.

[AI-145] Deep Learning-driven Mobile Traffic Measurement Collection and Analysis

链接: https://arxiv.org/abs/2410.19777
作者: Yini Fang
关键词-EN: previous studies overlook, Modelling dynamic traffic, mobile traffic, continuously changing dependencies, traffic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: MPhil thesis

点击查看摘要

Abstract:Modelling dynamic traffic patterns and especially the continuously changing dependencies between different base stations, which previous studies overlook, is challenging. Traditional algorithms struggle to process large volumes of data and to extract deep insights that help elucidate mobile traffic demands with fine granularity, as well as how these demands will evolve in the future. Therefore, in this thesis we harness the powerful hierarchical feature learning abilities of Deep Learning (DL) techniques in both spatial and temporal domains and develop solutions for precise city-scale mobile traffic analysis and forecasting. Firstly, we design Spider, a mobile traffic measurement collection and reconstruction framework with a view to reducing the cost of measurement collection and inferring traffic consumption with high accuracy, despite working with sparse information. In particular, we train a reinforcement learning agent to selectively sample subsets of target mobile coverage areas and tackle the large action space problem specific to this setting. We then introduce a lightweight neural network model to reconstruct the traffic consumption based on historical sparse measurements. Our proposed framework outperforms existing solutions on a real-world mobile traffic dataset. Secondly, we design SDGNet, a handover-aware graph neural network model for long-term mobile traffic forecasting. We model the cellular network as a graph, and leverage handover frequency to capture the dependencies between base stations across time. Handover information reflects user mobility such as daily commute, which helps in increasing the accuracy of the forecasts made. We proposed dynamic graph convolution to extract features from both traffic consumption and handover data, showing that our model outperforms other benchmark graph models on a mobile traffic dataset collected by a major network operator.

[AI-146] Gender Bias of LLM in Economics: An Existentialism Perspective

链接: https://arxiv.org/abs/2410.19775
作者: Hui Zhong,Songsheng Chen,Mian Liang
关键词-EN: Large Language Models, Large Language, Language Models, natural language processing, rapidly gained traction
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Gender Bias, Large Language Models, Decision-Making

点击查看摘要

Abstract:Large Language Models (LLMs), such as GPT-4 and BERT, have rapidly gained traction in natural language processing (NLP) and are now integral to financial decision-making. However, their deployment introduces critical challenges, particularly in perpetuating gender biases that can distort decision-making outcomes in high-stakes economic environments. This paper investigates gender bias in LLMs through both mathematical proofs and empirical experiments using the Word Embedding Association Test (WEAT), demonstrating that LLMs inherently reinforce gender stereotypes even without explicit gender markers. By comparing the decision-making processes of humans and LLMs, we reveal fundamental differences: while humans can override biases through ethical reasoning and individualized understanding, LLMs maintain bias as a rational outcome of their mathematical optimization on biased data. Our analysis proves that bias in LLMs is not an unintended flaw but a systematic result of their rational processing, which tends to preserve and amplify existing societal biases encoded in training data. Drawing on existentialist theory, we argue that LLM-generated bias reflects entrenched societal structures and highlights the limitations of purely technical debiasing methods. This research underscores the need for new theoretical frameworks and interdisciplinary methodologies that address the ethical implications of integrating LLMs into economic and financial decision-making. We advocate for a reconceptualization of how LLMs influence economic decisions, emphasizing the importance of incorporating human-like ethical considerations into AI governance to ensure fairness and equity in AI-driven financial systems.

[AI-147] Copula-Linked Parallel ICA: A Method for Coupling Structural and Functional MRI brain Networks

链接: https://arxiv.org/abs/2410.19774
作者: Oktay Agcaoglu,Rogers F. Silva,Deniz Alacam,Sergey Plis,Tulay Adali,Vince Calhoun(for the Alzheimers Disease Neuroimaging Initiative)
关键词-EN: brain imaging modalities, imaging modalities offer, modalities offer unique, offer unique insights, brain imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
*备注: 25 pages, 10 figures, journal article

点击查看摘要

Abstract:Different brain imaging modalities offer unique insights into brain function and structure. Combining them enhances our understanding of neural mechanisms. Prior multimodal studies fusing functional MRI (fMRI) and structural MRI (sMRI) have shown the benefits of this approach. Since sMRI lacks temporal data, existing fusion methods often compress fMRI temporal information into summary measures, sacrificing rich temporal dynamics. Motivated by the observation that covarying networks are identified in both sMRI and resting-state fMRI, we developed a novel fusion method, by combining deep learning frameworks, copulas and independent component analysis (ICA), named copula linked parallel ICA (CLiP-ICA). This method estimates independent sources for each modality and links the spatial sources of fMRI and sMRI using a copula-based model for more flexible integration of temporal and spatial data. We tested CLiP-ICA using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our results showed that CLiP-ICA effectively captures both strongly and weakly linked sMRI and fMRI networks, including the cerebellum, sensorimotor, visual, cognitive control, and default mode networks. It revealed more meaningful components and fewer artifacts, addressing the long-standing issue of optimal model order in ICA. CLiP-ICA also detected complex functional connectivity patterns across stages of cognitive decline, with cognitively normal subjects generally showing higher connectivity in sensorimotor and visual networks compared to patients with Alzheimer, along with patterns suggesting potential compensatory mechanisms.

[AI-148] Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal Synergy of Poster

链接: https://arxiv.org/abs/2410.19764
作者: Utsav Kumar Nareti,Chandranath Adak,Soumi Chattopadhyay,Pichao Wang
关键词-EN: Movie genre classification, Movie, Movie posters, Movie genre, meticulously designed
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Movie posters are not just decorative; they are meticulously designed to capture the essence of a movie, such as its genre, storyline, and tone/vibe. For decades, movie posters have graced cinema walls, billboards, and now our digital screens as a form of digital posters. Movie genre classification plays a pivotal role in film marketing, audience engagement, and recommendation systems. Previous explorations into movie genre classification have been mostly examined in plot summaries, subtitles, trailers and movie scenes. Movie posters provide a pre-release tantalizing glimpse into a film’s key aspects, which can ignite public interest. In this paper, we presented the framework that exploits movie posters from a visual and textual perspective to address the multilabel movie genre classification problem. Firstly, we extracted text from movie posters using an OCR and retrieved the relevant embedding. Next, we introduce a cross-attention-based fusion module to allocate attention weights to visual and textual embedding. In validating our framework, we utilized 13882 posters sourced from the Internet Movie Database (IMDb). The outcomes of the experiments indicate that our model exhibited promising performance and outperformed even some prominent contemporary architectures.

[AI-149] Reliable Routable and Reproducible: Collection of Pedestrian Pathways at Statewide Scale

链接: https://arxiv.org/abs/2410.19762
作者: Yuxiang Zhang,Bill Howe,Anat Caspi
关键词-EN: mobility technology including, technology including autonomous, including autonomous vehicles, technologies depend crucially, improve mobility equity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2303.02323

点击查看摘要

Abstract:While advances in mobility technology including autonomous vehicles and multi-modal navigation systems can improve mobility equity for people with disabilities, these technologies depend crucially on accurate, standardized, and complete pedestrian path networks. Ad hoc collection efforts lead to a data record that is sparse, unreliable, and non-interoperable. This paper presents a sociotechnical methodology to collect, manage, serve, and maintain pedestrian path data at a statewide scale. Combining the automation afforded by computer-vision approaches applied to aerial imagery and existing road network data with the quality control afforded by interactive tools, we aim to produce routable pedestrian pathways for the entire State of Washington within approximately two years. We extract paths, crossings, and curb ramps at scale from aerial imagery, integrating multi-input segmentation methods with road topology data to ensure connected, routable networks. We then organize the predictions into project regions selected for their value to the public interest, where each project region is divided into intersection-scale tasks. These tasks are assigned and tracked through an interactive tool that manages concurrency, progress, feedback, and data management. We demonstrate that our automated systems outperform state-of-the-art methods in producing routable pathway networks, which then significantly reduces the time required for human vetting. Our results demonstrate the feasibility of yielding accurate, robust pedestrian pathway networks at the scale of an entire state. This paper intends to inform procedures for national-scale ADA compliance by providing pedestrian equity, safety, and accessibility, and improving urban environments for all users. Comments: arXiv admin note: text overlap with arXiv:2303.02323 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.19762 [cs.CV] (or arXiv:2410.19762v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.19762 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Yuxiang Zhang [view email] [v1] Sat, 12 Oct 2024 02:31:57 UTC (23,216 KB)

[AI-150] Movie Trailer Genre Classification Using Multimodal Pretrained Features

链接: https://arxiv.org/abs/2410.19760
作者: Serkan Sulun,Paula Viana,Matthew E. P. Davies
关键词-EN: readily accessible pretrained, diverse set, set of readily, readily accessible, accessible pretrained models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available.

[AI-151] PINNing Cerebral Blood Flow: Analysis of Perfusion MRI in Infants using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2410.19759
作者: Christoforos Galazis,Ching-En Chiu,Tomoki Arichi,Anil A. Bharath,Marta Varela
关键词-EN: Arterial spin labeling, magnetic resonance imaging, Arterial spin, managing neurological issues, infants born prematurely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Arterial spin labeling (ASL) magnetic resonance imaging (MRI) enables cerebral perfusion measurement, which is crucial in detecting and managing neurological issues in infants born prematurely or after perinatal complications. However, cerebral blood flow (CBF) estimation in infants using ASL remains challenging due to the complex interplay of network physiology, involving dynamic interactions between cardiac output and cerebral perfusion, as well as issues with parameter uncertainty and data noise. We propose a new spatial uncertainty-based physics-informed neural network (PINN), SUPINN, to estimate CBF and other parameters from infant ASL data. SUPINN employs a multi-branch architecture to concurrently estimate regional and global model parameters across multiple voxels. It computes regional spatial uncertainties to weigh the signal. SUPINN can reliably estimate CBF (relative error -0.3 \pm 71.7 ), bolus arrival time (AT) ( 30.5 \pm 257.8 ), and blood longitudinal relaxation time ( T_1b ) ( -4.4 \pm 28.9 ), surpassing parameter estimates performed using least squares or standard PINNs. Furthermore, SUPINN produces physiologically plausible spatially smooth CBF and AT maps. Our study demonstrates the successful modification of PINNs for accurate multi-parameter perfusion estimation from noisy and limited ASL data in infants. Frameworks like SUPINN have the potential to advance our understanding of the complex cardio-brain network physiology, aiding in the detection and management of diseases. Source code is provided at: this https URL.

[AI-152] A SAM based Tool for Semi-Automatic Food Annotation ECAI2024

链接: https://arxiv.org/abs/2410.19756
作者: Lubnaa Abdur Rahman,Ioannis Papathanail,Lorenzo Brigato,Stavroula Mougiakakou
关键词-EN: artificial intelligence, critical bottleneck, advancement of artificial, nutrition research, research is hindered
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted Demo Paper - ECAI 2024

点击查看摘要

Abstract:The advancement of artificial intelligence (AI) in food and nutrition research is hindered by a critical bottleneck: the lack of annotated food data. Despite the rise of highly efficient AI models designed for tasks such as food segmentation and classification, their practical application might necessitate proficiency in AI and machine learning principles, which can act as a challenge for non-AI experts in the field of nutritional sciences. Alternatively, it highlights the need to translate AI models into user-friendly tools that are accessible to all. To address this, we present a demo of a semi-automatic food image annotation tool leveraging the Segment Anything Model (SAM). The tool enables prompt-based food segmentation via user interactions, promoting user engagement and allowing them to further categorise food items within meal images and specify weight/volume if necessary. Additionally, we release a fine-tuned version of SAM’s mask decoder, dubbed MealSAM, with the ViT-B backbone tailored specifically for food image segmentation. Our objective is not only to contribute to the field by encouraging participation, collaboration, and the gathering of more annotated food data but also to make AI technology available for a broader audience by translating AI into practical tools.

[AI-153] Using AI Alignment Theory to understand the potential pitfalls of regulatory frameworks

链接: https://arxiv.org/abs/2410.19749
作者: Alejandro Tlaie
关键词-EN: Union Artificial Intelligence, European Union Artificial, Artificial Intelligence Act, Artificial Intelligence, Union Artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper leverages insights from Alignment Theory (AT) research, which primarily focuses on the potential pitfalls of technical alignment in Artificial Intelligence, to critically examine the European Union’s Artificial Intelligence Act (EU AI Act). In the context of AT research, several key failure modes - such as proxy gaming, goal drift, reward hacking or specification gaming - have been identified. These can arise when AI systems are not properly aligned with their intended objectives. The central logic of this report is: what can we learn if we treat regulatory efforts in the same way as we treat advanced AI systems? As we systematically apply these concepts to the EU AI Act, we uncover potential vulnerabilities and areas for improvement in the regulation.

[AI-154] C2DA: Contrastive and Context-aware Domain Adaptive Semantic Segmentation

链接: https://arxiv.org/abs/2410.19748
作者: Md. Al-Masrur Khan,Zheng Chen,Lantao Liu
关键词-EN: target annotation data, target domain data, Unsupervised domain adaptive, source domain data, accessing target annotation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This paper has 16 pages, 6 figures, 5 tables. It has been accepted for publication at the International Symposium of Robotics Research (ISRR), Long Beach, California, USA, 2024

点击查看摘要

Abstract:Unsupervised domain adaptive semantic segmentation (UDA-SS) aims to train a model on the source domain data (e.g., synthetic) and adapt the model to predict target domain data (e.g., real-world) without accessing target annotation data. Most existing UDA-SS methods only focus on inter-domain knowledge to mitigate the data-shift problem. However, learning the inherent structure of the images and exploring the intrinsic pixel distribution of both domains are ignored, which prevents the UDA-SS methods from producing satisfactory performance like supervised learning. Moreover, incorporating contextual knowledge is also often overlooked. Considering these issues, in this work, we propose a UDA-SS framework that learns both intra-domain and context-aware knowledge. To learn the intra-domain knowledge, we incorporate contrastive loss in both domains, which pulls pixels of similar classes together and pushes the rest away, facilitating intra-image-pixel-wise correlations. To learn context-aware knowledge, we modify the mixing technique by leveraging contextual dependency among the classes. Moreover, we adapt the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images. Comprehensive experiments validate that our proposed method improves the state-of-the-art UDA-SS methods by a margin of 0.51% mIoU and 0.54% mIoU in the adaptation of GTA-V-Cityscapes and Synthia-Cityscapes, respectively. We open-source our C2DA code. Code link: this http URL

[AI-155] owards Next-Generation LLM -based Recommender Systems: A Survey and Beyond

链接: https://arxiv.org/abs/2410.19744
作者: Qi Wang,Jindong Li,Shiqi Wang,Qianli Xing,Runliang Niu,He Kong,Rui Li,Guodong Long,Yi Chang,Chengqi Zhang
关键词-EN: impressive generalization capabilities, natural language processing, Large language models, Large language, recommender systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have not only revolutionized the field of natural language processing (NLP) but also have the potential to bring a paradigm shift in many other fields due to their remarkable abilities of language understanding, as well as impressive generalization capabilities and reasoning skills. As a result, recent studies have actively attempted to harness the power of LLMs to improve recommender systems, and it is imperative to thoroughly review the recent advances and challenges of LLM-based recommender systems. Unlike existing work, this survey does not merely analyze the classifications of LLM-based recommendation systems according to the technical framework of LLMs. Instead, it investigates how LLMs can better serve recommendation tasks from the perspective of the recommender system community, thus enhancing the integration of large language models into the research of recommender system and its practical application. In addition, the long-standing gap between academic research and industrial applications related to recommender systems has not been well discussed, especially in the era of large language models. In this review, we introduce a novel taxonomy that originates from the intrinsic essence of recommendation, delving into the application of large language model-based recommendation systems and their industrial implementation. Specifically, we propose a three-tier structure that more accurately reflects the developmental progression of recommendation systems from research to practical implementation, including representing and understanding, scheming and utilizing, and industrial deployment. Furthermore, we discuss critical challenges and opportunities in this emerging field. A more up-to-date version of the papers is maintained at: this https URL.

[AI-156] AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction

链接: https://arxiv.org/abs/2410.19743
作者: Hongru Wang,Rui Wang,Boyang Xue,Heming Xia,Jingtao Cao,Zeming Liu,Jeff Z. Pan,Kam-Fai Wong
关键词-EN: Large Language Models, Large Language, Language Models, task automation capabilities, versatile external APIs
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources (e.g., different Apps in the iPhone), especially for complex user instructions. In this paper, we introduce \textttAppBench, the first benchmark to evaluate LLMs’ ability to plan and execute multiple APIs from various sources in order to complete the user’s task. Specifically, we consider two significant challenges in multiple APIs: \textit1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and \textit2) permission constraints: which source is authorized to execute the API call. We have experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate at the most complex instruction, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning. Our code and data are publicly available at this https URL.

[AI-157] Integrating Reasoning Systems for Trustworthy AI Proceedings of the 4th Workshop on Logic and Practice of Programming (LPOP)

链接: https://arxiv.org/abs/2410.19738
作者: Anil Nerode,Yanhong A. Liu
关键词-EN: fourth Logic, Logic and Practice, position papers, Practice of Programming, Logic Programming
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:This proceedings contains abstracts and position papers for the work to be presented at the fourth Logic and Practice of Programming (LPOP) Workshop. The workshop is to be held in Dallas, Texas, USA, and as a hybrid event, on October 13, 2024, in conjunction with the 40th International Conference on Logic Programming (ICLP). The focus of this workshop is integrating reasoning systems for trustworthy AI, especially including integrating diverse models of programming with rules and constraints.

[AI-158] Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

链接: https://arxiv.org/abs/2410.21000
作者: Zhilin Zhang,Jie Wang,Ruiqi Zhu,Xiaoliang Gong
关键词-EN: Medical Visual Question, Medical Visual, natural language processing, Visual Question Answering, gained increasing attention
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical Visual Question Answering (MedVQA) has gained increasing attention at the intersection of computer vision and natural language processing. Its capability to interpret radiological images and deliver precise answers to clinical inquiries positions MedVQA as a valuable tool for supporting diagnostic decision-making for physicians and alleviating the workload on radiologists. While recent approaches focus on using unified pre-trained large models for multi-modal fusion like cross-modal Transformers, research on more efficient fusion methods remains relatively scarce within this discipline. In this paper, we introduce a novel fusion model that integrates Orthogonality loss, Multi-head attention and Bilinear Attention Network (OMniBAN) to achieve high computational efficiency and strong performance without the need for pre-training. We conduct comprehensive experiments and clarify aspects of how to enhance bilinear attention fusion to achieve performance comparable to that of large models. Experimental results show that OMniBAN outperforms traditional models on key MedVQA benchmarks while maintaining a lower computational cost, which indicates its potential for efficient clinical application in radiology and pathology image question answering.

[AI-159] Generative Simulations of The Solar Corona Evolution With Denoising Diffusion : Proof of Concept

链接: https://arxiv.org/abs/2410.20843
作者: Grégoire Francisco,Francesco Pio Ramunno,Manolis K. Georgoulis,João Fernandes,Teresa Barata,Dario Del Moro
关键词-EN: solar magnetized corona, coronal mass ejections, space weather impact, Denoising Diffusion Probabilistic, solar wind
类目: olar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The solar magnetized corona is responsible for various manifestations with a space weather impact, such as flares, coronal mass ejections (CMEs) and, naturally, the solar wind. Modeling the corona’s dynamics and evolution is therefore critical for improving our ability to predict space weather In this work, we demonstrate that generative deep learning methods, such as Denoising Diffusion Probabilistic Models (DDPM), can be successfully applied to simulate future evolutions of the corona as observed in Extreme Ultraviolet (EUV) wavelengths. Our model takes a 12-hour video of an Active Region (AR) as input and simulate the potential evolution of the AR over the subsequent 12 hours, with a time-resolution of two hours. We propose a light UNet backbone architecture adapted to our problem by adding 1D temporal convolutions after each classical 2D spatial ones, and spatio-temporal attention in the bottleneck part. The model not only produce visually realistic outputs but also captures the inherent stochasticity of the system’s evolution. Notably, the simulations enable the generation of reliable confidence intervals for key predictive metrics such as the EUV peak flux and fluence of the ARs, paving the way for probabilistic and interpretable space weather forecasting. Future studies will focus on shorter forecasting horizons with increased spatial and temporal resolution, aiming at reducing the uncertainty of the simulations and providing practical applications for space weather forecasting. The code used for this study is available at the following link: this https URL

[AI-160] Murine AI excels at cats and cheese: Structural differences between human and mouse neurons and their implementation in generative AIs

链接: https://arxiv.org/abs/2410.20735
作者: Rino Saiga,Kaede Shiga,Yo Maruta,Chie Inomoto,Hiroshi Kajiwara,Naoya Nakamura,Yu Kakimoto,Yoshiro Yamamoto,Masahiro Yasutake,Masayuki Uesugi,Akihisa Takeuchi,Kentaro Uesugi,Yasuko Terada,Yoshio Suzuki,Viktor Nikitin,Vincent De Andrade,Francesco De Carlo,Yuichi Yamashita,Masanari Itokawa,Soichiro Ide,Kazutaka Ikeda,Ryuta Mizutani
关键词-EN: human, human faces, image generation, Mouse, mouse neuronal somata
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph)
*备注: 41 pages, 4 figures

点击查看摘要

Abstract:Mouse and human brains have different functions that depend on their neuronal networks. In this study, we analyzed nanometer-scale three-dimensional structures of brain tissues of the mouse medial prefrontal cortex and compared them with structures of the human anterior cingulate cortex. The obtained results indicated that mouse neuronal somata are smaller and neurites are thinner than those of human neurons. These structural features allow mouse neurons to be integrated in the limited space of the brain, though thin neurites should suppress distal connections according to cable theory. We implemented this mouse-mimetic constraint in convolutional layers of a generative adversarial network (GAN) and a denoising diffusion implicit model (DDIM), which were then subjected to image generation tasks using photo datasets of cat faces, cheese, human faces, and birds. The mouse-mimetic GAN outperformed a standard GAN in the image generation task using the cat faces and cheese photo datasets, but underperformed for human faces and birds. The mouse-mimetic DDIM gave similar results, suggesting that the nature of the datasets affected the results. Analyses of the four datasets indicated differences in their image entropy, which should influence the number of parameters required for image generation. The preferences of the mouse-mimetic AIs coincided with the impressions commonly associated with mice. The relationship between the neuronal network and brain function should be investigated by implementing other biological findings in artificial neural networks.

[AI-161] A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data

链接: https://arxiv.org/abs/2410.20659
作者: Saptarshi Chakraborty,Peter L. Bartlett
关键词-EN: collaborative machine learning, data privacy concerns, address data privacy, beta, emphasizing decentralized model
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a groundbreaking paradigm in collaborative machine learning, emphasizing decentralized model training to address data privacy concerns. While significant progress has been made in optimizing federated learning, the exploration of generalization error, particularly in heterogeneous settings, has been limited, focusing mainly on parametric cases. This paper investigates the generalization properties of deep federated regression within a two-stage sampling model. Our findings highlight that the intrinsic dimension, defined by the entropic dimension, is crucial for determining convergence rates when appropriate network sizes are used. Specifically, if the true relationship between response and explanatory variables is charecterized by a \beta -Hölder function and there are n independent and identically distributed (i.i.d.) samples from m participating clients, the error rate for participating clients scales at most as \tildeO\left((mn)^-2\beta/(2\beta + \bard_2\beta(\lambda))\right) , and for non-participating clients, it scales as \tildeO\left(\Delta \cdot m^-2\beta/(2\beta + \bard_2\beta(\lambda)) + (mn)^-2\beta/(2\beta + \bard_2\beta(\lambda))\right) . Here, \bard_2\beta(\lambda) represents the 2\beta -entropic dimension of \lambda , the marginal distribution of the explanatory variables, and \Delta characterizes the dependence between the sampling stages. Our results explicitly account for the “closeness” of clients, demonstrating that the convergence rates of deep federated learners depend on intrinsic rather than nominal high-dimensionality.

[AI-162] Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

链接: https://arxiv.org/abs/2410.20578
作者: Ivan Kukanov,Janne Laakkonen,Tomi Kinnunen,Ville Hautamäki
关键词-EN: Current speech deepfake, detection approaches perform, approaches perform satisfactorily, deepfake detection approaches, speech deepfake detection
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

点击查看摘要

Abstract:Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to unseen attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features to adapt to unseen attacks with very few samples available. This approach is promising since generating of a high-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date.

[AI-163] Enhancing Community Vision Screening – AI Driven Retinal Photography for Early Disease Detection and Patient Trust MICCAI2024

链接: https://arxiv.org/abs/2410.20309
作者: Xiaofeng Lei,Yih-Chung Tham,Jocelyn Hui Lin Goh,Yangqin Feng,Yang Bai,Zhi Da Soh,Rick Siow Mong Goh,Xinxing Xu,Yong Liu,Ching-Yu Cheng
关键词-EN: preventing avoidable blindness, Community vision screening, vision screening plays, Enhancing Community Vision, eye care services
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 4 figures, published in MICCAI2024 OMIA XI workshop

点击查看摘要

Abstract:Community vision screening plays a crucial role in identifying individuals with vision loss and preventing avoidable blindness, particularly in rural communities where access to eye care services is limited. Currently, there is a pressing need for a simple and efficient process to screen and refer individuals with significant eye disease-related vision loss to tertiary eye care centers for further care. An ideal solution should seamlessly and readily integrate with existing workflows, providing comprehensive initial screening results to service providers, thereby enabling precise patient referrals for timely treatment. This paper introduces the Enhancing Community Vision Screening (ECVS) solution, which addresses the aforementioned concerns with a novel and feasible solution based on simple, non-invasive retinal photography for the detection of pathology-based visual impairment. Our study employs four distinct deep learning models: RETinal photo Quality Assessment (RETQA), Pathology Visual Impairment detection (PVI), Eye Disease Diagnosis (EDD) and Visualization of Lesion Regions of the eye (VLR). We conducted experiments on over 10 datasets, totaling more than 80,000 fundus photos collected from various sources. The models integrated into ECVS achieved impressive AUC scores of 0.98 for RETQA, 0.95 for PVI, and 0.90 for EDD, along with a DICE coefficient of 0.48 for VLR. These results underscore the promising capabilities of ECVS as a straightforward and scalable method for community-based vision screening.

[AI-164] Modelling of Economic Implications of Bias in AI-Powered Health Emergency Response Systems

链接: https://arxiv.org/abs/2410.20229
作者: Katsiaryna Bahamazava(Department of Mathematical Sciences G.L. Lagrange, Politecnico di Torino, Italy)
关键词-EN: theoretical framework assessing, present a theoretical, social welfare, AI-powered emergency response, social welfare models
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a theoretical framework assessing the economic implications of bias in AI-powered emergency response systems. Integrating health economics, welfare economics, and artificial intelligence, we analyze how algorithmic bias affects resource allocation, health outcomes, and social welfare. By incorporating a bias function into health production and social welfare models, we quantify its impact on demographic groups, showing that bias leads to suboptimal resource distribution, increased costs, and welfare losses. The framework highlights efficiency-equity trade-offs and provides economic interpretations. We propose mitigation strategies, including fairness-constrained optimization, algorithmic adjustments, and policy interventions. Our findings offer insights for policymakers, emergency service providers, and technology developers, emphasizing the need for AI systems that are efficient and equitable. By addressing the economic consequences of biased AI, this study contributes to policies and technologies promoting fairness, efficiency, and social welfare in emergency response services.

[AI-165] Physics informed Shadowgraph Density Field Reconstruction

链接: https://arxiv.org/abs/2410.20203
作者: Xutun Wang,Yuchen Zhang,Zidong Li,Haocheng Wen,Bing Wang
关键词-EN: reconstructing density fields, study presents, physics-informed framework, shadowgraph images, physics-informed neural networks
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study presents a novel approach to reconstructing density fields from shadowgraph images using a physics-informed framework. By integrating traditional shadowgraph imaging techniques with physics-informed neural networks (PINNs), we effectively capture refractive index variations within complex flow fields. The proposed method addresses the inherent challenges of shadowgraphy, such as noise and limited spatial resolution, enabling accurate visualization of fluid dynamics. Experimental results demonstrate the feasibility and robustness of our approach, with significant agreement observed between the reconstructed density fields and experimental measurements. This research contributes to the advancement of non-intrusive diagnostic techniques in fluid mechanics and enhances our understanding of flow structures in various applications.

[AI-166] On-Site Precise Screening of SARS-CoV-2 Systems Using a Channel-Wise Attention-Based PLS-1D-CNN Model with Limited Infrared Signatures

链接: https://arxiv.org/abs/2410.20132
作者: Wenwen Zhang,Zhouzhuo Tang,Yingmei Feng,Xia Yu,Qi Jie Wang,Zhiping Lin
关键词-EN: respiratory syndrome coronavirus, severe acute respiratory, acute respiratory syndrome, limited nasopharyngeal swabs, Beijing You’an Hospital
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:During the early stages of respiratory virus outbreaks, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the efficient utilize of limited nasopharyngeal swabs for rapid and accurate screening is crucial for public health. In this study, we present a methodology that integrates attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR) with the adaptive iteratively reweighted penalized least squares (airPLS) preprocessing algorithm and a channel-wise attention-based partial least squares one-dimensional convolutional neural network (PLS-1D-CNN) model, enabling accurate screening of infected individuals within 10 minutes. Two cohorts of nasopharyngeal swab samples, comprising 126 and 112 samples from suspected SARS-CoV-2 Omicron variant cases, were collected at Beijing You’an Hospital for verification. Given that ATR-FTIR spectra are highly sensitive to variations in experimental conditions, which can affect their quality, we propose a biomolecular importance (BMI) evaluation method to assess signal quality across different conditions, validated by comparing BMI with PLS-GBM and PLS-RF results. For the ATR-FTIR signals in cohort 2, which exhibited a higher BMI, airPLS was utilized for signal preprocessing, followed by the application of the channel-wise attention-based PLS-1D-CNN model for screening. The experimental results demonstrate that our model outperforms recently reported methods in the field of respiratory virus spectrum detection, achieving a recognition screening accuracy of 96.48%, a sensitivity of 96.24%, a specificity of 97.14%, an F1-score of 96.12%, and an AUC of 0.99. It meets the World Health Organization (WHO) recommended criteria for an acceptable product: sensitivity of 95.00% or greater and specificity of 97.00% or greater for testing prior SARS-CoV-2 infection in moderate to high volume scenarios.

[AI-167] A Multi-Modal Non-Invasive Deep Learning Framework for Progressive Prediction of Seizures

链接: https://arxiv.org/abs/2410.20066
作者: Ali Saeizadeh,Douglas Schonholtz,Joseph S. Neimat,Pedram Johari,Tommaso Melodia
关键词-EN: Deep Learning, innovative framework designed, designed for progressive, methodology based, paper introduces
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 4 pages, 5 figures, IEEE BSN 2024

点击查看摘要

Abstract:This paper introduces an innovative framework designed for progressive (granular in time to onset) prediction of seizures through the utilization of a Deep Learning (DL) methodology based on non-invasive multi-modal sensor networks. Epilepsy, a debilitating neurological condition, affects an estimated 65 million individuals globally, with a substantial proportion facing drug-resistant epilepsy despite pharmacological interventions. To address this challenge, we advocate for predictive systems that provide timely alerts to individuals at risk, enabling them to take precautionary actions. Our framework employs advanced DL techniques and uses personalized data from a network of non-invasive electroencephalogram (EEG) and electrocardiogram (ECG) sensors, thereby enhancing prediction accuracy. The algorithms are optimized for real-time processing on edge devices, mitigating privacy concerns and minimizing data transmission overhead inherent in cloud-based solutions, ultimately preserving battery energy. Additionally, our system predicts the countdown time to seizures (with 15-minute intervals up to an hour prior to the onset), offering critical lead time for preventive actions. Our multi-modal model achieves 95% sensitivity, 98% specificity, and 97% accuracy, averaged among 29 patients.

[AI-168] ransforming Precision: A Comparative Analysis of Vision Transformers CNNs and Traditional ML for Knee Osteoarthritis Severity Diagnosis

链接: https://arxiv.org/abs/2410.20062
作者: Tasnim Sakib Apon,Md.Fahim-Ul-Islam,Nafiz Imtiaz Rafin,Joya Akter,Md. Golam Rabiul Alam
关键词-EN: degenerative joint disease, Knee osteoarthritis, ViT models, pain and impairment, degenerative joint
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Knee osteoarthritis(KO) is a degenerative joint disease that can cause severe pain and impairment. With increased prevalence, precise diagnosis by medical imaging analytics is crucial for appropriate illness management. This research investigates a comparative analysis between traditional machine learning techniques and new deep learning models for diagnosing KO severity from X-ray pictures. This study does not introduce new architectural innovations but rather illuminates the robust applicability and comparative effectiveness of pre-existing ViT models in a medical imaging context, specifically for KO severity diagnosis. The insights garnered from this comparative analysis advocate for the integration of advanced ViT models in clinical diagnostic workflows, potentially revolutionizing the precision and reliability of KO assessments. This study does not introduce new architectural innovations but rather illuminates the robust applicability and comparative effectiveness of pre-existing ViT models in a medical imaging context, specifically for KO severity diagnosis. The insights garnered from this comparative analysis advocate for the integration of advanced ViT models in clinical diagnostic workflows, potentially revolutionizing the precision reliability of KO assessments. The study utilizes an osteoarthritis dataset from the Osteoarthritis Initiative (OAI) comprising images with 5 severity categories and uneven class distribution. While classic machine learning models like GaussianNB and KNN struggle in feature extraction, Convolutional Neural Networks such as Inception-V3, VGG-19 achieve better accuracy between 55-65% by learning hierarchical visual patterns. However, Vision Transformer architectures like Da-VIT, GCViT and MaxViT emerge as indisputable champions, displaying 66.14% accuracy, 0.703 precision, 0.614 recall, AUC exceeding 0.835 thanks to self-attention processes.

[AI-169] Roles of LLM s in the Overall Mental Architecture

链接: https://arxiv.org/abs/2410.20037
作者: Ron Sun
关键词-EN: human mental architecture, human mental, cognitive science literatures, understand existing LLMs, mental architecture
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:To better understand existing LLMs, we may examine the human mental (cognitive/psychological) architecture, and its components and structures. Based on psychological, philosophical, and cognitive science literatures, it is argued that, within the human mental architecture, existing LLMs correspond well with implicit mental processes (intuition, instinct, and so on). However, beyond such implicit processes, explicit processes (with better symbolic capabilities) are also present within the human mental architecture, judging from psychological, philosophical, and cognitive science literatures. Various theoretical and empirical issues and questions in this regard are explored. Furthermore, it is argued that existing dual-process computational cognitive architectures (models of the human cognitive/psychological architecture) provide usable frameworks for fundamentally enhancing LLMs by introducing dual processes (both implicit and explicit) and, in the meantime, can also be enhanced by LLMs. The results are synergistic combinations (in several different senses simultaneously).

[AI-170] Urban Mobility: AI ODE-Based Modeling and Scenario Planning

链接: https://arxiv.org/abs/2410.19915
作者: Katsiaryna Bahamazava(Department of Mathematical Sciences G.L. Lagrange, Politecnico di Torino, Italy, iLaVita Nonprofit Foundation, Italy - USA)
关键词-EN: Ordinary Differential Equations, Urbanization and technological, urban mobility, presenting both challenges, challenges and opportunities
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Urbanization and technological advancements are reshaping the future of urban mobility, presenting both challenges and opportunities. This paper combines foresight and scenario planning with mathematical modeling using Ordinary Differential Equations (ODEs) to explore how Artificial Intelligence (AI)-driven technologies can transform transportation systems. By simulating ODE-based models in Python, we quantify the impact of AI innovations, such as autonomous vehicles and intelligent traffic management, on reducing traffic congestion under different regulatory conditions. Our ODE models capture the dynamic relationship between AI adoption rates and traffic congestion, providing quantitative insights into how future scenarios might unfold. By incorporating industry collaborations and case studies, we offer strategic guidance for businesses and policymakers navigating this evolving landscape. This study contributes to understanding how foresight, scenario planning, and ODE modeling can inform strategies for creating more efficient, sustainable, and livable cities through AI adoption. Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.19915 [econ.GN] (or arXiv:2410.19915v1 [econ.GN] for this version) https://doi.org/10.48550/arXiv.2410.19915 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-171] Multi-Modal Transformer and Reinforcement Learning-based Beam Management

链接: https://arxiv.org/abs/2410.19859
作者: Mohammad Ghassemi,Han Zhang,Ali Afana,Akram Bin Sediq,Melike Erol-Kantarci
关键词-EN: improve signal strength, wireless communication systems, improve signal, signal strength, strength and reduce
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, IEEE Networking Letters

点击查看摘要

Abstract:Beam management is an important technique to improve signal strength and reduce interference in wireless communication systems. Recently, there has been increasing interest in using diverse sensing modalities for beam management. However, it remains a big challenge to process multi-modal data efficiently and extract useful information. On the other hand, the recently emerging multi-modal transformer (MMT) is a promising technique that can process multi-modal data by capturing long-range dependencies. While MMT is highly effective in handling multi-modal data and providing robust beam management, integrating reinforcement learning (RL) further enhances their adaptability in dynamic environments. In this work, we propose a two-step beam management method by combining MMT with RL for dynamic beam index prediction. In the first step, we divide available beam indices into several groups and leverage MMT to process diverse data modalities to predict the optimal beam group. In the second step, we employ RL for fast beam decision-making within each group, which in return maximizes throughput. Our proposed framework is tested on a 6G dataset. In this testing scenario, it achieves higher beam prediction accuracy and system throughput compared to both the MMT-only based method and the RL-only based method.

[AI-172] UniMTS: Unified Pre-training for Motion Time Series NEURIPS2024

链接: https://arxiv.org/abs/2410.19818
作者: Xiyuan Zhang,Diyan Teng,Ranak Roy Chowdhury,Shuheng Li,Dezhi Hong,Rajesh K. Gupta,Jingbo Shang
关键词-EN: Motion time series, smartwatches offer significant, offer significant insights, time series, Motion time
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS 2024. Code: this https URL . Model: this https URL

点击查看摘要

Abstract:Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns, with wide applications in healthcare, automation, IoT, and AR/XR due to their low-power, always-on nature. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, preventing the development of pre-trained models for human activity analysis. Typically, existing models are trained and tested on the same dataset, leading to poor generalizability across variations in device location, device mounting orientation and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series to generalize across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks are utilized to capture the relationships across joints for generalization across different device locations. We further design rotation-invariant augmentation to make the model agnostic to changes in device mounting orientations. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.

[AI-173] Single-word Auditory Attention Decoding Using Deep Learning Model

链接: https://arxiv.org/abs/2410.19793
作者: Nhan Duc Thanh Nguyen,Huy Phan,Kaare Mikkelsen,Preben Kidmose
关键词-EN: Identifying auditory attention, comparing auditory stimuli, Identifying auditory, auditory attention decoding, auditory attention
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Identifying auditory attention by comparing auditory stimuli and corresponding brain responses, is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case, neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD but leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for the unseen subjects. To our knowledge, this is the first study dealing with this problem.

[AI-174] Real-time Monitoring of Lower Limb Movement Resistance Based on Deep Learning

链接: https://arxiv.org/abs/2410.19769
作者: Buren Batu,Yuanmeng Liu,Tianyi Lyu
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 17 pages paper

点击查看摘要

[AI-175] he Geometry of Concepts: Sparse Autoencoder Feature Structure

链接: https://arxiv.org/abs/2410.19750
作者: Yuxiao Li,Eric J. Michaud,David D. Baek,Joshua Engels,Xiaoqing Sun,Max Tegmark
关键词-EN: large language models, recently produced dictionaries, Sparse autoencoders, language models, autoencoders have recently
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The “galaxy” scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.

[AI-176] Metamizer: a versatile neural optimizer for fast and accurate physics simulations

链接: https://arxiv.org/abs/2410.19746
作者: Nils Wandel,Stefan Schulz,Reinhard Klein
关键词-EN:
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-177] SALINA: Towards Sustainable Live Sonar Analytics in Wild Ecosystems

链接: https://arxiv.org/abs/2410.19742
作者: Chi Xu,Rongsheng Qian,Hao Fang,Xiaoqiang Ma,William I. Atlas,Jiangchuan Liu,Mark A. Spoljaric
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 14 pages, accepted by ACM SenSys 2024

点击查看摘要

[AI-178] ourism destination events classifier based on artificial intelligence techniques

链接: https://arxiv.org/abs/2410.19741
作者: Miguel Camacho-Ruiz,Ramón Alberto Carrasco,Gema Fernández-Avilés,Antonio LaTorre
关键词-EN: provide optimal services, Identifying client, tourist destination management, destination management, provide optimal
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying client needs to provide optimal services is crucial in tourist destination management. The events held in tourist destinations may help to meet those needs and thus contribute to tourist satisfaction. As with product management, the creation of hierarchical catalogs to classify those events can aid event management. The events that can be found on the internet are listed in dispersed, heterogeneous sources, which makes direct classification a difficult, time-consuming task. The main aim of this work is to create a novel process for automatically classifying an eclectic variety of tourist events using a hierarchical taxonomy, which can be applied to support tourist destination management. Leveraging data science methods such as CRISP-DM, supervised machine learning, and natural language processing techniques, the automatic classification process proposed here allows the creation of a normalized catalog across very different geographical regions. Therefore, we can build catalogs with consistent filters, allowing users to find events regardless of the event categories assigned at source, if any. This is very valuable for companies that offer this kind of information across multiple regions, such as airlines, travel agencies or hotel chains. Ultimately, this tool has the potential to revolutionize the way companies and end users interact with tourist events information.

[AI-179] Multi-view biomedical foundation models for molecule-target and property prediction

链接: https://arxiv.org/abs/2410.19704
作者: Parthasarathy Suryanarayanan,Yunguang Qiu,Shreyans Sethi,Diwakar Mahajan,Hongyang Li,Yuxin Yang,Elif Eyigoz,Aldo Guzman Saenz,Daniel E. Platt,Timothy H. Rumbell,Kenney Ng,Sanjoy Dey,Myson Burch,Bum Chul Kwon,Pablo Meyer,Feixiong Cheng,Jianying Hu,Joseph A. Morrone
关键词-EN:
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 34 pages including supplement. 9 figures, 4 tables

点击查看摘要

[AI-180] Analysis of Hopfield Model as Associative Memory

链接: https://arxiv.org/abs/2402.04264
作者: Matteo Silvestri
关键词-EN:
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 35 pages, 23 figures, 3 codes

点击查看摘要

计算机视觉

[CV-0] On Inductive Biases That Enable Generalization of Diffusion Transformers

链接: https://arxiv.org/abs/2410.21273
作者: Jie An,De Wang,Pengsheng Guo,Jiebo Luo,Alexander Schwing
关键词-EN: Recent work studying, UNet-based denoisers reveals, geometry-adaptive harmonic bases, denoisers reveals inductive, reveals inductive biases
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL ; Code repository: this https URL

点击查看摘要

Abstract:Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that locality of attention maps are closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows to a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: this http URL.

[CV-1] OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

链接: https://arxiv.org/abs/2410.21269
作者: Xize Cheng,Siqi Zheng,Zehan Wang,Minghui Fang,Ziang Zhang,Rongjie Huang,Ziyang Ma,Shengpeng Ji,Jialong Zuo,Tao Jin,Zhou Zhao
关键词-EN: brought tremendous success, recent years, brought tremendous, tremendous success, fields of vision
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Working in progress

点击查看摘要

Abstract:The scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at \urlthis https URL.

[CV-2] Exploring contextual modeling with linear complexity for point cloud segmentation

链接: https://arxiv.org/abs/2410.21211
作者: Yong Xien Chng,Xuchong Qiu,Yizeng Han,Yifan Pu,Jiewei Cao,Gao Huang
关键词-EN: Point cloud segmentation, important topic, Point cloud, Transformer attention mechanisms, Mamba
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Point cloud segmentation is an important topic in 3D understanding that has traditionally has been tackled using either the CNN or Transformer. Recently, Mamba has emerged as a promising alternative, offering efficient long-range contextual modeling capabilities without the quadratic complexity associated with Transformer’s attention mechanisms. However, despite Mamba’s potential, early efforts have all failed to achieve better performance than the best CNN-based and Transformer-based methods. In this work, we address this challenge by identifying the key components of an effective and efficient point cloud segmentation architecture. Specifically, we show that: 1) Spatial locality and robust contextual understanding are critical for strong performance, and 2) Mamba features linear computational complexity, offering superior data and inference efficiency compared to Transformers, while still being capable of delivering strong contextual understanding. Additionally, we further enhance the standard Mamba specifically for point cloud segmentation by identifying its two key shortcomings. First, the enforced causality in the original Mamba is unsuitable for processing point clouds that have no such dependencies. Second, its unidirectional scanning strategy imposes a directional bias, hampering its ability to capture the full context of unordered point clouds in a single pass. To address these issues, we carefully remove the causal convolutions and introduce a novel Strided Bidirectional SSM to enhance the model’s capability to capture spatial relationships. Our efforts culminate in the development of a novel architecture named MEEPO, which effectively integrates the strengths of CNN and Mamba. MEEPO surpasses the previous state-of-the-art method, PTv3, by up to +0.8 mIoU on multiple key benchmark datasets, while being 42.1% faster and 5.53x more memory efficient.

[CV-3] Joint Audio-Visual Idling Vehicle Detection with Streamlined Input Dependencies

链接: https://arxiv.org/abs/2410.21170
作者: Xiwen Li,Rehman Mohammed,Tristalee Mangin,Surojit Saha,Ross T Whitaker,Kerry E. Kelly,Tolga Tasdizen
关键词-EN: Idling vehicle detection, reducing unnecessary idling, harmful products, helpful in monitoring, integrated into real-time
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Idling vehicle detection (IVD) can be helpful in monitoring and reducing unnecessary idling and can be integrated into real-time systems to address the resulting pollution and harmful products. The previous approach [13], a non-end-to-end model, requires extra user clicks to specify a part of the input, making system deployment more error-prone or even not feasible. In contrast, we introduce an end-to-end joint audio-visual IVD task designed to detect vehicles visually under three states: moving, idling and engine off. Unlike feature co-occurrence task such as audio-visual vehicle tracking, our IVD task addresses complementary features, where labels cannot be determined by a single modality alone. To this end, we propose AVIVD-Net, a novel network that integrates audio and visual features through a bidirectional attention mechanism. AVIVD-Net streamlines the input process by learning a joint feature space, reducing the deployment complexity of previous methods. Additionally, we introduce the AVIVD dataset, which is seven times larger than previous datasets, offering significantly more annotated samples to study the IVD problem. Our model achieves performance comparable to prior approaches, making it suitable for automated deployment. Furthermore, by evaluating AVIVDNet on the feature co-occurrence public dataset MAVD [23], we demonstrate its potential for extension to self-driving vehicle video-camera setups.

[CV-4] Synthetica: Large Scale Synthetic Data for Robot Perception

链接: https://arxiv.org/abs/2410.21153
作者: Ritvik Singh,Jingzhou Liu,Karl Van Wyk,Yu-Wei Chao,Jean-Francois Lafleche,Florian Shkurti,Nathan Ratliff,Ankur Handa
关键词-EN: Vision-based object detectors, provide valuable information, Vision-based object, crucial basis, provide valuable
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 21 pages, 11 figures, 5 tables

点击查看摘要

Abstract:Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localisation in the environment. These need to ensure high reliability in different lighting conditions, occlusions, and visual artifacts, all while running in real-time. Collecting and annotating real-world data for these networks is prohibitively time consuming and costly, especially for custom assets, such as industrial objects, making it untenable for generalization to in-the-wild scenarios. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on the task of object detection, an important problem which can serve as the front-end for most state estimation problems, such as pose estimation. Leveraging data from a photorealistic ray-tracing renderer, we scale up data generation, generating 2.7 million images, to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art performance on the task of object detection while having detectors that run at 50-100Hz which is 9 times faster than the prior SOTA. We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for use in the real world with custom objects for which there do not exist prior datasets. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at this URL: this https URL.

[CV-5] Enhancing Learned Image Compression via Cross Window-based Attention

链接: https://arxiv.org/abs/2410.21144
作者: Priyanka Mudgal,Feng Liu
关键词-EN: learned image compression, image compression methods, traditional image compression, image compression, demonstrated superior rate-distortion
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注: Paper accepted and presented in ISVC’23. Copyrights stay with ISVC

点击查看摘要

Abstract:In recent years, learned image compression methods have demonstrated superior rate-distortion performance compared to traditional image compression methods. Recent methods utilize convolutional neural networks (CNN), variational autoencoders (VAE), invertible neural networks (INN), and transformers. Despite their significant contributions, a main drawback of these models is their poor performance in capturing local redundancy. Therefore, to leverage global features along with local redundancy, we propose a CNN-based solution integrated with a feature encoding module. The feature encoding module encodes important features before feeding them to the CNN and then utilizes cross-scale window-based attention, which further captures local redundancy. Cross-scale window-based attention is inspired by the attention mechanism in transformers and effectively enlarges the receptive field. Both the feature encoding module and the cross-scale window-based attention module in our architecture are flexible and can be incorporated into any other network architecture. We evaluate our method on the Kodak and CLIC datasets and demonstrate that our approach is effective and on par with state-of-the-art methods.

[CV-6] Extrapolating Prospective Glaucoma Fundus Images through Diffusion Model in Irregular Longitudinal Sequences

链接: https://arxiv.org/abs/2410.21130
作者: Zhihao Zhao,Junjie Yang,Shahrooz Faghihroohi,Yinzheng Zhao,Daniel Zapp,Kai Huang,Nassir Navab,M.Ali Nasseri
关键词-EN: early therapeutic interventions, support early therapeutic, progression prediction offers, glaucoma progression prediction, therapeutic interventions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BIBM 2024

点击查看摘要

Abstract:The utilization of longitudinal datasets for glaucoma progression prediction offers a compelling approach to support early therapeutic interventions. Predominant methodologies in this domain have primarily focused on the direct prediction of glaucoma stage labels from longitudinal datasets. However, such methods may not adequately encapsulate the nuanced developmental trajectory of the disease. To enhance the diagnostic acumen of medical practitioners, we propose a novel diffusion-based model to predict prospective images by extrapolating from existing longitudinal fundus images of patients. The methodology delineated in this study distinctively leverages sequences of images as inputs. Subsequently, a time-aligned mask is employed to select a specific year for image generation. During the training phase, the time-aligned mask resolves the issue of irregular temporal intervals in longitudinal image sequence sampling. Additionally, we utilize a strategy of randomly masking a frame in the sequence to establish the ground truth. This methodology aids the network in continuously acquiring knowledge regarding the internal relationships among the sequences throughout the learning phase. Moreover, the introduction of textual labels is instrumental in categorizing images generated within the sequence. The empirical findings from the conducted experiments indicate that our proposed model not only effectively generates longitudinal data but also significantly improves the precision of downstream classification tasks.

[CV-7] LAMA: Stable Dual-Domain Deep Reconstruction For Sparse-View CT ALT

链接: https://arxiv.org/abs/2410.21111
作者: Chi Ding,Qingchao Zhang,Ge Wang,Xiaojing Ye,Yunmei Chen
关键词-EN: Inverse problems arise, Alternating Minimization Algorithm, Learned Alternating Minimization, Inverse problems, tomographic imaging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Journal version for LAMA (Learned Alternating Minimization Algorithm)

点击查看摘要

Abstract:Inverse problems arise in many applications, especially tomographic imaging. We develop a Learned Alternating Minimization Algorithm (LAMA) to solve such problems via two-block optimization by synergizing data-driven and classical techniques with proven convergence. LAMA is naturally induced by a variational model with learnable regularizers in both data and image domains, parameterized as composite functions of neural networks trained with domain-specific data. We allow these regularizers to be nonconvex and nonsmooth to extract features from data effectively. We minimize the overall objective function using Nesterov’s smoothing technique and residual learning architecture. It is demonstrated that LAMA reduces network complexity, improves memory efficiency, and enhances reconstruction accuracy, stability, and interpretability. Extensive experiments show that LAMA significantly outperforms state-of-the-art methods on popular benchmark datasets for Computed Tomography.

[CV-8] LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

链接: https://arxiv.org/abs/2410.21108
作者: Naga Venkata Sai Raviteja Chappa,Khoa Luu
关键词-EN: Group Activity Recognition, Activity Recognition, computer vision due, Group Activity, Multi-Modal Group Activity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 4 figures, 10 tables

点击查看摘要

Abstract:Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR’s hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR’s superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.

[CV-9] Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

链接: https://arxiv.org/abs/2410.21088
作者: Wenda Li,Huijie Zhang,Qing Qu
关键词-EN: raised significant concerns, Shallow Diffuse, copyright infringement, raised significant, significant concerns
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduce Shallow Diffuse, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process, Shallow Diffuse decouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that our Shallow Diffuse outperforms existing watermarking methods in terms of robustness and consistency. The codes will be released at this https URL.

[CV-10] KA2ER: Knowledge Adaptive Amalgamation of ExpeRts for Medical Images Segmentation MICCAI2024

链接: https://arxiv.org/abs/2410.21085
作者: Shangde Gao,Yichao Fu,Ke Liu,Hongxia Xu,Jian Wu
关键词-EN: medical image analysis, medical image segmentation, foundation model, medical image, specific medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted to MICCAI2024

点击查看摘要

Abstract:Recently, many foundation models for medical image analysis such as MedSAM, SwinUNETR have been released and proven to be useful in multiple tasks. However, considering the inherent heterogeneity and inhomogeneity of real-world medical data, directly applying these models to specific medical image segmentation tasks often leads to negative domain shift effects, which can severely weaken the model’s segmentation capabilities. To this end, we propose an adaptive amalgamation knowledge framework that aims to train a versatile foundation model to handle the joint goals of multiple expert models, each specialized for a distinct task. Specifically, we first train an nnUNet-based expert model for each task, and reuse the pre-trained SwinUNTER as the target foundation model. Then, the input data for all challenging tasks are encoded in the foundation model and the expert models, respectively, and their backbone features are jointly projected into the adaptive amalgamation layer. Within the hidden layer, the hierarchical attention mechanisms are designed to achieve adaptive merging of the target model to the hidden layer feature knowledge of all experts, which significantly reduces the domain shift arising from the inter-task differences. Finally, the gold amalgamated features and the prompt features are fed into the mask decoder to obtain the segmentation results. Extensive experiments conducted in these challenging tasks demonstrate the effectiveness and adaptability of our foundation model for real-world medical image segmentation.

[CV-11] SPOTS-10: Animal Pattern Benchmark Dataset for Machine Learning Algorithms

链接: https://arxiv.org/abs/2410.21044
作者: John Atanbori
关键词-EN: Recognising animals based, distinctive body patterns, Recognising animals, computer vision, based on distinctive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Dataset and benchmark is freely available at this https URL

点击查看摘要

Abstract:Recognising animals based on distinctive body patterns, such as stripes, spots, or other markings, in night images is a complex task in computer vision. Existing methods for detecting animals in images often rely on colour information, which is not always available in night images, posing a challenge for pattern recognition in such conditions. Nevertheless, recognition at night-time is essential for most wildlife, biodiversity, and conservation applications. The SPOTS-10 dataset was created to address this challenge and to provide a resource for evaluating machine learning algorithms in situ. This dataset is an extensive collection of grayscale images showcasing diverse patterns found in ten animal species. Specifically, SPOTS-10 contains 50,000 32 x 32 grayscale images, divided into ten categories, with 5,000 images per category. The training set comprises 40,000 images, while the test set contains 10,000 images. The SPOTS-10 dataset is freely available on the project GitHub page: this https URL by cloning the repository.

[CV-12] Improving Visual Prompt Tuning by Gaussian Neighborhood Minimization for Long-Tailed Visual Recognition NEURIPS2024

链接: https://arxiv.org/abs/2410.21042
作者: Mengke Li,Ye Liu,Yang Lu,Yiqun Zhang,Yiu-ming Cheung,Hui Huang
关键词-EN: garnered widespread attention, achieved significant progress, Long-tail learning, learning has garnered, garnered widespread
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Long-tail learning has garnered widespread attention and achieved significant progress in recent times. However, even with pre-trained prior knowledge, models still exhibit weaker generalization performance on tail classes. The promising Sharpness-Aware Minimization (SAM) can effectively improve the generalization capability of models by seeking out flat minima in the loss landscape, which, however, comes at the cost of doubling the computational time. Since the update rule of SAM necessitates two consecutive (non-parallelizable) forward and backpropagation at each step. To address this issue, we propose a novel method called Random SAM prompt tuning (RSAM-PT) to improve the model generalization, requiring only one-step gradient computation at each step. Specifically, we search for the gradient descent direction within a random neighborhood of the parameters during each gradient update. To amplify the impact of tail-class samples and avoid overfitting, we employ the deferred re-weight scheme to increase the significance of tail-class samples. The classification accuracy of long-tailed data can be significantly improved by the proposed RSAM-PT, particularly for tail classes. RSAM-PT achieves the state-of-the-art performance of 90.3%, 76.5%, and 50.1% on benchmark datasets CIFAR100-LT (IF 100), iNaturalist 2018, and Places-LT, respectively. The source code is temporarily available at this https URL.

[CV-13] Push-Forward Signed Distance Functions enable interpretable and robust continuous shape quantification

链接: https://arxiv.org/abs/2410.21004
作者: Roua Rouatbi,Juan Esteban Suarez,Ivo F. Sbalzarini
关键词-EN: Push-Forward Signed Distance, Signed Distance Morphometric, Signed Distance, Push-Forward Signed, Distance Morphometric
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Quantitative Methods (q-bio.QM)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:We introduce the Push-Forward Signed Distance Morphometric (PF-SDM), a novel method for shape quantification in biomedical imaging that is continuous, interpretable, and invariant to shape-preserving transformations. PF-SDM effectively captures the geometric properties of shapes, including their topological skeletons and radial symmetries. This results in a robust and interpretable shape descriptor that generalizes to capture temporal shape dynamics. Importantly, PF-SDM avoids certain issues of previous geometric morphometrics, like Elliptical Fourier Analysis and Generalized Procrustes Analysis, such as coefficient correlations and landmark choices. We present the PF-SDM theory, provide a practically computable algorithm, and benchmark it on synthetic data.

[CV-14] Skinned Motion Retargeting with Dense Geometric Interaction Perception NEURIPS2024

链接: https://arxiv.org/abs/2410.20986
作者: Zijie Ye,Jia-Wei Liu,Jia Jia,Shikun Sun,Mike Zheng Shou
关键词-EN: Capturing and maintaining, parts is crucial, crucial for successful, Capturing, successful motion retargeting
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Capturing and maintaining geometric interactions among different body parts is crucial for successful motion retargeting in skinned characters. Existing approaches often overlook body geometries or add a geometry correction stage after skeletal motion retargeting. This results in conflicts between skeleton interaction and geometry correction, leading to issues such as jittery, interpenetration, and contact mismatches. To address these challenges, we introduce a new retargeting framework, MeshRet, which directly models the dense geometric interactions in motion retargeting. Initially, we establish dense mesh correspondences between characters using semantically consistent sensors (SCS), effective across diverse mesh topologies. Subsequently, we develop a novel spatio-temporal representation called the dense mesh interaction (DMI) field. This field, a collection of interacting SCS feature vectors, skillfully captures both contact and non-contact interactions between body geometries. By aligning the DMI field during retargeting, MeshRet not only preserves motion semantics but also prevents self-interpenetration and ensures contact preservation. Extensive experiments on the public Mixamo dataset and our newly-collected ScanRet dataset demonstrate that MeshRet achieves state-of-the-art performance. Code available at this https URL.

[CV-15] MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis

链接: https://arxiv.org/abs/2410.20974
作者: Di Qiu,Zheng Chen,Rui Wang,Mingyuan Fan,Changqian Yu,Junshi Huan,Xiang Wen
关键词-EN: hinder real-time applicability, Recent advancements, modeling processes, character video synthesis, fine-tuning or complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in character video synthesis still depend on extensive fine-tuning or complex 3D modeling processes, which can restrict accessibility and hinder real-time applicability. To address these challenges, we propose a simple yet effective tuning-free framework for character video synthesis, named MovieCharacter, designed to streamline the synthesis process while ensuring high-quality outcomes. Our framework decomposes the synthesis task into distinct, manageable modules: character segmentation and tracking, video object removal, character motion imitation, and video composition. This modular design not only facilitates flexible customization but also ensures that each component operates collaboratively to effectively meet user needs. By leveraging existing open-source models and integrating well-established techniques, MovieCharacter achieves impressive synthesis results without necessitating substantial resources or proprietary datasets. Experimental results demonstrate that our framework enhances the efficiency, accessibility, and adaptability of character video synthesis, paving the way for broader creative and interactive applications.

[CV-16] Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models!

链接: https://arxiv.org/abs/2410.20972
作者: Arash Marioriyad,Mohammadali Banayeeanzade,Reza Abbasi,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
关键词-EN: Stable Diffusion, diffusion models, generating high-quality, Diffusion and DALL-E, capable of generating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models, such as Stable Diffusion and DALL-E, are capable of generating high-quality, diverse, and realistic images from textual prompts. However, they sometimes struggle to accurately depict specific entities described in prompts, a limitation known as the entity missing problem in compositional generation. While prior studies suggested that adjusting cross-attention maps during the denoising process could alleviate this problem, they did not systematically investigate which objective functions could best address it. This study examines three potential causes of the entity-missing problem, focusing on cross-attention dynamics: (1) insufficient attention intensity for certain entities, (2) overly broad attention spread, and (3) excessive overlap between attention maps of different entities. We found that reducing overlap in attention maps between entities can effectively minimize the rate of entity missing. Specifically, we hypothesize that tokens related to specific entities compete for attention on certain image regions during the denoising process, which can lead to divided attention across tokens and prevent accurate representation of each entity. To address this issue, we introduced four loss functions, Intersection over Union (IoU), center-of-mass (CoM) distance, Kullback-Leibler (KL) divergence, and clustering compactness (CC) to regulate attention overlap during denoising steps without the need for retraining. Experimental results across a wide variety of benchmarks reveal that these proposed training-free methods significantly improve compositional accuracy, outperforming previous approaches in visual question answering (VQA), captioning scores, CLIP similarity, and human evaluations. Notably, these methods improved human evaluation scores by 9% over the best baseline, demonstrating substantial improvements in compositional alignment.

[CV-17] BEVPose: Unveiling Scene Semantics through Pose-Guided Multi-Modal BEV Alignment IROS

链接: https://arxiv.org/abs/2410.20969
作者: Mehdi Hosseinzadeh,Ian Reid
关键词-EN: Bird Eye View, create Bird Eye, Eye View, Bird Eye, create Bird
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024. Project page: this https URL

点击查看摘要

Abstract:In the field of autonomous driving and mobile robotics, there has been a significant shift in the methods used to create Bird’s Eye View (BEV) representations. This shift is characterised by using transformers and learning to fuse measurements from disparate vision sensors, mainly lidar and cameras, into a 2D planar ground-based representation. However, these learning-based methods for creating such maps often rely heavily on extensive annotated data, presenting notable challenges, particularly in diverse or non-urban environments where large-scale datasets are scarce. In this work, we present BEVPose, a framework that integrates BEV representations from camera and lidar data, using sensor pose as a guiding supervisory signal. This method notably reduces the dependence on costly annotated data. By leveraging pose information, we align and fuse multi-modal sensory inputs, facilitating the learning of latent BEV embeddings that capture both geometric and semantic aspects of the environment. Our pretraining approach demonstrates promising performance in BEV map segmentation tasks, outperforming fully-supervised state-of-the-art methods, while necessitating only a minimal amount of annotated data. This development not only confronts the challenge of data efficiency in BEV representation learning but also broadens the potential for such techniques in a variety of domains, including off-road and indoor environments.

[CV-18] IndraEye: Infrared Electro-Optical UAV-based Perception Dataset for Robust Downstream Tasks

链接: https://arxiv.org/abs/2410.20953
作者: Manjunath D,Prajwal Gurunath,Sumanth Udupa,Aditya Gandhamal,Shrikar Madhu,Aniruddh Sikdar,Suresh Sundaram
关键词-EN: Deep neural networks, provide rich texture, rich texture details, Deep neural, shown exceptional performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:Deep neural networks (DNNs) have shown exceptional performance when trained on well-illuminated images captured by Electro-Optical (EO) cameras, which provide rich texture details. However, in critical applications like aerial perception, it is essential for DNNs to maintain consistent reliability across all conditions, including low-light scenarios where EO cameras often struggle to capture sufficient detail. Additionally, UAV-based aerial object detection faces significant challenges due to scale variability from varying altitudes and slant angles, adding another layer of complexity. Existing methods typically address only illumination changes or style variations as domain shifts, but in aerial perception, correlation shifts also impact DNN performance. In this paper, we introduce the IndraEye dataset, a multi-sensor (EO-IR) dataset designed for various tasks. It includes 5,612 images with 145,666 instances, encompassing multiple viewing angles, altitudes, seven backgrounds, and different times of the day across the Indian subcontinent. The dataset opens up several research opportunities, such as multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor-specific strengths and weaknesses. IndraEye aims to advance the field by supporting the development of more robust and accurate aerial perception systems, particularly in challenging conditions. IndraEye dataset is benchmarked with object detection and semantic segmentation tasks. Dataset and source codes are available at this https URL.

[CV-19] Evaluating the Robustness of LiDAR Point Cloud Tracking Against Adversarial Attack

链接: https://arxiv.org/abs/2410.20893
作者: Shengjing Tian,Yinan Han,Xiantong Zhao,Bin Liu,Xiuping Liu
关键词-EN: Bird Eye View, adversarial attacks, cloud tracking models, neural network-based LiDAR, tracking models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this study, we delve into the robustness of neural network-based LiDAR point cloud tracking models under adversarial attacks, a critical aspect often overlooked in favor of performance enhancement. These models, despite incorporating advanced architectures like Transformer or Bird’s Eye View (BEV), tend to neglect robustness in the face of challenges such as adversarial attacks, domain shifts, or data corruption. We instead focus on the robustness of the tracking models under the threat of adversarial attacks. We begin by establishing a unified framework for conducting adversarial attacks within the context of 3D object tracking, which allows us to thoroughly investigate both white-box and black-box attack strategies. For white-box attacks, we tailor specific loss functions to accommodate various tracking paradigms and extend existing methods such as FGSM, C\W, and PGD to the point cloud domain. In addressing black-box attack scenarios, we introduce a novel transfer-based approach, the Target-aware Perturbation Generation (TAPG) algorithm, with the dual objectives of achieving high attack performance and maintaining low perceptibility. This method employs a heuristic strategy to enforce sparse attack constraints and utilizes random sub-vector factorization to bolster transferability. Our experimental findings reveal a significant vulnerability in advanced tracking methods when subjected to both black-box and white-box attacks, underscoring the necessity for incorporating robustness against adversarial attacks into the design of LiDAR point cloud tracking models. Notably, compared to existing methods, the TAPG also strikes an optimal balance between the effectiveness of the attack and the concealment of the perturbations.

[CV-20] Improving Generalization in Visual Reasoning via Self-Ensemble

链接: https://arxiv.org/abs/2410.20883
作者: Tien-Huy Nguyen,Quang-Khai Tran,Anh-Tuan Quang-Hoang
关键词-EN: multimodal perceptual processing, visual reasoning necessitates, cognitive faculty, necessitates the integration, integration of multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The cognitive faculty of visual reasoning necessitates the integration of multimodal perceptual processing and commonsense and external knowledge of the world. In recent years, a plethora of large vision-language models (LVLMs) have been proposed, demonstrating outstanding power and exceptional proficiency in commonsense reasoning across diverse domains and tasks. Nevertheless, training such LVLMs requires a lot of costly resources. Recent approaches, instead of training LVLMs from scratch on various large datasets, focus on exploring ways to take advantage of the capabilities of many different LVLMs, such as ensemble methods. In this work, we propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters, a training-free method. Our key insight is that we realized that LVLM itself can ensemble without the need for any other LVLMs, which helps to unlock their internal capabilities. Extensive experiments on various benchmarks demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.

[CV-21] he unrealized potential of agroforestry for an emissions-intensive agricultural commodity

链接: https://arxiv.org/abs/2410.20882
作者: Alexander Becker,Jan D. Wegner,Evans Dawoe,Konrad Schindler,William J. Thompson,Christian Bunn,Rachael D. Garrett,Fabio Castro,Simon P. Hart,Wilma J. Blaser-Hart
关键词-EN: Reconciling agricultural production, Reconciling agricultural, climate-change mitigation, mitigation and adaptation, formidable problems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reconciling agricultural production with climate-change mitigation and adaptation is one of the most formidable problems in sustainability. One proposed strategy for addressing this problem is the judicious retention of trees in agricultural systems. However, the magnitude of the current and future-potential benefit that trees contribute remains uncertain, particularly in the agricultural sector where trees can also limit production. Here we help to resolve these issues across a West African region responsible for producing \approx 60% of the world’s cocoa, a crop that contributes one of the highest per unit carbon footprints of all foods. We use machine learning to generate spatially-explicit estimates of shade-tree cover and carbon stocks across the region. We find that existing shade-tree cover is low, and not spatially aligned with climate threat. But we also find enormous unrealized potential for the sector to counterbalance a large proportion of their high carbon footprint annually, without threatening production. Our methods can be applied to other globally significant commodities that can be grown in agroforests, and align with accounting requirements of carbon markets, and emerging legislative requirements for sustainability reporting.

[CV-22] Evaluating Sugarcane Yield Variability with UAV-Derived Cane Height under Different Water and Nitrogen Conditions

链接: https://arxiv.org/abs/2410.20880
作者: Rajiv Ranjan,Tejasavi Birdh,Nandan Mandal,Dinesh Kumar,Shashank Tamaskar
关键词-EN: Unmanned Aerial Vehicle, pre-harvest Digital Surface, Digital Surface Model, Aerial Vehicle, Digital Surface
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 9 fugures, 1 table

点击查看摘要

Abstract:This study investigates the relationship between sugarcane yield and cane height derived under different water and nitrogen conditions from pre-harvest Digital Surface Model (DSM) obtained via Unmanned Aerial Vehicle (UAV) flights over a sugarcane test farm. The farm was divided into 62 blocks based on three water levels (low, medium, and high) and three nitrogen levels (low, medium, and high), with repeated treatments. In pixel distribution of DSM for each block, it provided bimodal distribution representing two peaks, ground level (gaps within canopies) and top of the canopies respectively. Using bimodal distribution, mean cane height was extracted for each block by applying a trimmed mean to the pixel distribution, focusing on the top canopy points. Similarly, the extracted mean elevation of the base was derived from the bottom points, representing ground level. The Derived Cane Height Model (DCHM) was generated by taking the difference between the mean canopy height and mean base elevation for each block. Yield measurements (tons/acre) were recorded post-harvest for each block. By aggregating the data into nine treatment zones (e.g., high water-low nitrogen, low water-high nitrogen), the DCHM and median yield were calculated for each zone. The regression analysis between the DCHM and corresponding yields for the different treatment zones yielded an R 2 of 0.95. This study demonstrates the significant impact of water and nitrogen treatments on sugarcane height and yield, utilizing one-time UAV-derived DSM data.

[CV-23] ByteNet: Rethinking Multimedia File Fragment Classification through Visual Perspectives

链接: https://arxiv.org/abs/2410.20855
作者: Wenyang Liu,Kejun Wu,Tianyi Liu,Yi Wang,Kim-Hui Yap,Lap-Pui Chau
关键词-EN: aims to identify, system metadata, Multimedia file fragment, file fragment types, text without system
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
*备注: Accepted in TMM

点击查看摘要

Abstract:Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte ngrams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network ByteNet to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several some specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on the two representative benchmarks, including 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.

[CV-24] FreqMark: Invisible Image Watermarking via Frequency Based Optimization in Latent Space

链接: https://arxiv.org/abs/2410.20824
作者: Yiyang Guo,Ruizhe Li,Mude Hui,Hanzhong Guo,Chen Zhang,Chuangjian Cai,Le Wan,Shangfei Wang
关键词-EN: enabling copyright protection, safeguarding digital content, Invisible watermarking, content authentication, enabling copyright
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Invisible watermarking is essential for safeguarding digital content, enabling copyright protection and content authentication. However, existing watermarking methods fall short in robustness against regeneration attacks. In this paper, we propose a novel method called FreqMark that involves unconstrained optimization of the image latent frequency space obtained after VAE encoding. Specifically, FreqMark embeds the watermark by optimizing the latent frequency space of the images and then extracts the watermark through a pre-trained image encoder. This optimization allows a flexible trade-off between image quality with watermark robustness and effectively resists regeneration attacks. Experimental results demonstrate that FreqMark offers significant advantages in image quality and robustness, permits flexible selection of the encoding bit number, and achieves a bit accuracy exceeding 90% when encoding a 48-bit hidden message under various attack scenarios.

[CV-25] Novel Object Synthesis via Adaptive Text-Image Harmony NEURIPS2024

链接: https://arxiv.org/abs/2410.20823
作者: Zeren Xiong,Zedong Zhang,Zikun Chen,Shuo Chen,Xiang Li,Gan Sun,Jian Yang,Jun Li
关键词-EN: object synthesis task, object, image, Adaptive Text-Image Harmony, object image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS2024

点击查看摘要

Abstract:In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, \textiti.e., often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. First, we introduce a scale factor and an injection step to balance text and image features in cross-attention and to preserve image information in self-attention during the text-image inversion diffusion process, respectively. Second, to better integrate object text and image, we design a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image. Third, to adaptively adjust these parameters, we present a novel similarity score function that not only maximizes the similarities between the generated object image and the input text/image but also balances these similarities to harmonize text and image integration. Extensive experiments demonstrate the effectiveness of our approach, showcasing remarkable object creations such as colobus-glass jar. Project page: this https URL.

[CV-26] Evaluation of neural network algorithms for atmospheric turbulence mitigation

链接: https://arxiv.org/abs/2410.20816
作者: Tushar Jain,Madeline Lubien,Jerome Gilles
关键词-EN: objects being captured, variety of neural, studied to tackle, images and videos, non-steady camera
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A variety of neural networks architectures are being studied to tackle blur in images and videos caused by a non-steady camera and objects being captured. In this paper, we present an overview of these existing networks and perform experiments to remove the blur caused by atmospheric turbulence. Our experiments aim to examine the reusability of existing networks and identify desirable aspects of the architecture in a system that is geared specifically towards atmospheric turbulence mitigation. We compare five different architectures, including a network trained in an end-to-end fashion, thereby removing the need for a stabilization step.

[CV-27] Grid4D: 4D Decomposed Hash Encoding for High-fidelity Dynamic Gaussian Splatting NEURIPS2024

链接: https://arxiv.org/abs/2410.20815
作者: Jiawei Xu,Zexin Fan,Jian Yang,Jin Xie
关键词-EN: static scene rendering, dynamic scene rendering, scene rendering, Gaussian-based dynamic scene, Gaussian splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Recently, Gaussian splatting has received more and more attention in the field of static scene rendering. Due to the low computational overhead and inherent flexibility of explicit representations, plane-based explicit methods are popular ways to predict deformations for Gaussian-based dynamic scene rendering models. However, plane-based methods rely on the inappropriate low-rank assumption and excessively decompose the space-time 4D encoding, resulting in overmuch feature overlap and unsatisfactory rendering quality. To tackle these problems, we propose Grid4D, a dynamic scene rendering model based on Gaussian splatting and employing a novel explicit encoding method for the 4D input through the hash encoding. Different from plane-based explicit representations, we decompose the 4D encoding into one spatial and three temporal 3D hash encodings without the low-rank assumption. Additionally, we design a novel attention module that generates the attention scores in a directional range to aggregate the spatial and temporal features. The directional attention enables Grid4D to more accurately fit the diverse deformations across distinct scene components based on the spatial encoded features. Moreover, to mitigate the inherent lack of smoothness in explicit representation methods, we introduce a smooth regularization term that keeps our model from the chaos of deformation prediction. Our experiments demonstrate that Grid4D significantly outperforms the state-of-the-art models in visual quality and rendering speed.

[CV-28] Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

链接: https://arxiv.org/abs/2410.20812
作者: Jiacheng Wang,Xiang Chen,Renjiu Hu,Rongguang Wang,Min Liu,Yaonan Wang,Jiazheng Wang,Hao Li,Hang Zhang
关键词-EN: pancreatic cancer tissues, Co-examination of second-harmonic, second-harmonic generation, microscopy enables, collagen fibers
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard.

[CV-29] Long-Tailed Out-of-Distribution Detection via Normalized Outlier Distribution Adaptation NIPS2024

链接: https://arxiv.org/abs/2410.20807
作者: Wenjun Miao,Guansong Pang,Jin Zheng,Xiao Bai
关键词-EN: OOD samples, OOD, true OOD samples, ground-truth OOD samples, true OOD
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NIPS2024

点击查看摘要

Abstract:One key challenge in Out-of-Distribution (OOD) detection is the absence of ground-truth OOD samples during training. One principled approach to address this issue is to use samples from external datasets as outliers (i.e., pseudo OOD samples) to train OOD detectors. However, we find empirically that the outlier samples often present a distribution shift compared to the true OOD samples, especially in Long-Tailed Recognition (LTR) scenarios, where ID classes are heavily imbalanced, \ie, the true OOD samples exhibit very different probability distribution to the head and tailed ID classes from the outliers. In this work, we propose a novel approach, namely normalized outlier distribution adaptation (AdaptOD), to tackle this distribution shift problem. One of its key components is dynamic outlier distribution adaptation that effectively adapts a vanilla outlier distribution based on the outlier samples to the true OOD distribution by utilizing the OOD knowledge in the predicted OOD samples during inference. Further, to obtain a more reliable set of predicted OOD samples on long-tailed ID data, a novel dual-normalized energy loss is introduced in AdaptOD, which leverages class- and sample-wise normalized energy to enforce a more balanced prediction energy on imbalanced ID samples. This helps avoid bias toward the head samples and learn a substantially better vanilla outlier distribution than existing energy losses during training. It also eliminates the need of manually tuning the sensitive margin hyperparameters in energy losses. Empirical results on three popular benchmarks for OOD detection in LTR show the superior performance of AdaptOD over state-of-the-art methods. Code is available at \urlthis https URL.

[CV-30] ransformer-Based Tooth Alignment Prediction With Occlusion And Collision Constraints

链接: https://arxiv.org/abs/2410.20806
作者: ZhenXing Dong,JiaZhou Chen,YangHui Xu
关键词-EN: treatment requires providing, requires providing tooth, providing tooth alignment, clinical experiences heavily, orthodontic treatment requires
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The planning of digital orthodontic treatment requires providing tooth alignment, which not only consumes a lot of time and labor to determine manually but also relays clinical experiences heavily. In this work, we proposed a lightweight tooth alignment neural network based on Swin-transformer. We first re-organized 3D point clouds based on virtual arch lines and converted them into order-sorted multi-channel textures, which improves the accuracy and efficiency simultaneously. We then designed two new occlusal loss functions that quantitatively evaluate the occlusal relationship between the upper and lower jaws. They are important clinical constraints, first introduced to the best of our knowledge, and lead to cutting-edge prediction accuracy. To train our network, we collected a large digital orthodontic dataset that has 591 clinical cases, including various complex clinical cases. This dataset will benefit the community after its release since there is no open dataset so far. Furthermore, we also proposed two new orthodontic dataset augmentation methods considering tooth spatial distribution and occlusion. We evaluated our method with this dataset and extensive experiments, including comparisons with STAT methods and ablation studies, and demonstrate the high prediction accuracy of our method.

[CV-31] SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity

链接: https://arxiv.org/abs/2410.20790
作者: Kunyun Wang,Jieru Zhao,Shuo Yang,Wenchao Ding,Minyi Guo
关键词-EN: Deep learning models, Convolutional Neural Networks, Deep learning, object detection, Diff Computation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 13 figures

点击查看摘要

Abstract:Deep learning models have become pivotal in the field of video processing and is increasingly critical in practical applications such as autonomous driving and object detection. Although Vision Transformers (ViTs) have demonstrated their power, Convolutional Neural Networks (CNNs) remain a highly efficient and high-performance choice for feature extraction and encoding. However, the intensive computational demands of convolution operations hinder its broader adoption as a video encoder. Given the inherent temporal continuity in video frames, changes between consecutive frames are minimal, allowing for the skipping of redundant computations. This technique, which we term as Diff Computation, presents two primary challenges. First, Diff Computation requires to cache intermediate feature maps to ensure the correctness of non-linear computations, leading to significant memory consumption. Second, the imbalance of sparsity among layers, introduced by Diff Computation, incurs accuracy degradation. To address these issues, we propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation. We integrate these techniques into our framework, SparseTem, to seamlessly support various CNN-based video encoders. SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead. Extensive experimental results demonstrate that SparseTem sets a new state-of-the-art by effectively utilizing temporal continuity to accelerate CNN-based video encoders.

[CV-32] Bidirectional Recurrence for Cardiac Motion Tracking with Gaussian Process Latent Coding NEURIPS2024

链接: https://arxiv.org/abs/2410.20752
作者: Jiewen Yang,Yiqun Lin,Bin Pu,Xiaomeng Li
关键词-EN: assessing cardiac function, Quantitative analysis, crucial for assessing, MRI and Echocardiograms, cardiac
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper Accepted by NeurIPS 2024

点击查看摘要

Abstract:Quantitative analysis of cardiac motion is crucial for assessing cardiac function. This analysis typically uses imaging modalities such as MRI and Echocardiograms that capture detailed image sequences throughout the heartbeat cycle. Previous methods predominantly focused on the analysis of image pairs lacking consideration of the motion dynamics and spatial variability. Consequently, these methods often overlook the long-term relationships and regional motion characteristic of cardiac. To overcome these limitations, we introduce the \textbfGPTrack, a novel unsupervised framework crafted to fully explore the temporal and spatial dynamics of cardiac motion. The GPTrack enhances motion tracking by employing the sequential Gaussian Process in the latent space and encoding statistics by spatial information at each time stamp, which robustly promotes temporal consistency and spatial variability of cardiac dynamics. Also, we innovatively aggregate sequential information in a bidirectional recursive manner, mimicking the behavior of diffeomorphic registration to better capture consistent long-term relationships of motions across cardiac regions such as the ventricles and atria. Our GPTrack significantly improves the precision of motion tracking in both 3D and 4D medical images while maintaining computational efficiency. The code is available at: this https URL

[CV-33] BLAPose: Enhancing 3D Human Pose Estimation with Bone Length Adjustment ACCV

链接: https://arxiv.org/abs/2410.20731
作者: C. Hsu,J. Jang
关键词-EN: Current approaches, neglecting critical physical, estimation primarily focus, joint locations, critical physical constraints
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 8 Postscript figures, uses this http URL and this http URL

点击查看摘要

Abstract:Current approaches in 3D human pose estimation primarily focus on regressing 3D joint locations, often neglecting critical physical constraints such as bone length consistency and body symmetry. This work introduces a recurrent neural network architecture designed to capture holistic information across entire video sequences, enabling accurate prediction of bone lengths. To enhance training effectiveness, we propose a novel augmentation strategy using synthetic bone lengths that adhere to physical constraints. Moreover, we present a bone length adjustment method that preserves bone orientations while substituting bone lengths with predicted values. Our results demonstrate that existing 3D human pose estimation models can be significantly enhanced through this adjustment process. Furthermore, we fine-tune human pose estimation models using inferred bone lengths, observing notable improvements. Our bone length prediction model surpasses the previous best results, and our adjustment and fine-tuning method enhance performance across several metrics on the Human3.6M dataset.

[CV-34] CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

链接: https://arxiv.org/abs/2410.20723
作者: Chongjian Ge,Chenfeng Xu,Yuanfeng Ji,Chensheng Peng,Masayoshi Tomizuka,Ping Luo,Mingyu Ding,Varun Jampani,Wei Zhan
关键词-EN: Recent breakthroughs, Score Distillation Sampling, breakthroughs in text-guided, significantly advanced, advanced the field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in text-guided image generation have significantly advanced the field of 3D generation. While generating a single high-quality 3D object is now feasible, generating multiple objects with reasonable interactions within a 3D space, a.k.a. compositional 3D generation, presents substantial challenges. This paper introduces CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. To achieve this goal, two core designs are proposed: (1) 3D Gaussians Initialization with 2D compositionality: We transfer the well-established 2D compositionality to initialize the Gaussian parameters on an entity-by-entity basis, ensuring both consistent 3D priors for each entity and reasonable interactions among multiple entities; (2) Dynamic Optimization: We propose a dynamic strategy to optimize 3D Gaussians using Score Distillation Sampling (SDS) loss. CompGS first automatically decomposes 3D Gaussians into distinct entity parts, enabling optimization at both the entity and composition levels. Additionally, CompGS optimizes across objects of varying scales by dynamically adjusting the spatial parameters of each entity, enhancing the generation of fine-grained details, particularly in smaller entities. Qualitative comparisons and quantitative evaluations on T3Bench demonstrate the effectiveness of CompGS in generating compositional 3D objects with superior image quality and semantic alignment over existing methods. CompGS can also be easily extended to controllable 3D editing, facilitating scene generation. We hope CompGS will provide new insights to the compositional 3D generation. Project page: this https URL.

[CV-35] Interpretable Image Classification with Adaptive Prototype-based Vision Transformers

链接: https://arxiv.org/abs/2410.20722
作者: Chiyu Ma,Jon Donnelly,Wenjun Liu,Soroush Vosoughi,Cynthia Rudin,Chaofan Chen
关键词-EN: classification combining deep, combining deep learning, interpretable image classification, image classification combining, Convolutional Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present ProtoViT, a method for interpretable image classification combining deep learning and case-based reasoning. This method classifies an image by comparing it to a set of learned prototypes, providing explanations of the form ``this looks like that.‘’ In our model, a prototype consists of \textitparts, which can deform over irregular geometries to create a better comparison between images. Unlike existing models that rely on Convolutional Neural Network (CNN) backbones and spatially rigid prototypes, our model integrates Vision Transformer (ViT) backbones into prototype based models, while offering spatially deformed prototypes that not only accommodate geometric variations of objects but also provide coherent and clear prototypical feature representations with an adaptive number of prototypical parts. Our experiments show that our model can generally achieve higher performance than the existing prototype based models. Our comprehensive analyses ensure that the prototypes are consistent and the interpretations are faithful.

[CV-36] Face-MLLM : A Large Face Perception Model

链接: https://arxiv.org/abs/2410.20717
作者: Haomiao Sun,Mingjie He,Tianheng Lian,Hu Han,Shiguang Shan
关键词-EN: achieved promising results, rarely explored, face perception tasks, achieved promising, wide range
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces is rarely explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate LAION-Face dataset with more detailed face captions and facial attribute labels. Besides, we re-formulate traditional face datasets using the question-answer style, which is fit for MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five famous face perception tasks. Besides, on our newly introduced zero-shot facial attribute analysis task, our Face-MLLM also presents superior performance.

[CV-37] Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition ECCV2024

链接: https://arxiv.org/abs/2410.20716
作者: Satoshi Ikehata,Yuta Asano
关键词-EN: field traditionally hindered, lighting or sensors, recovering surface normals, groundbreaking spectrally multiplexed, spectral ambiguity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV2024 (Oral)

点击查看摘要

Abstract:In this paper, we present a groundbreaking spectrally multiplexed photometric stereo approach for recovering surface normals of dynamic surfaces without the need for calibrated lighting or sensors, a notable advancement in the field traditionally hindered by stringent prerequisites and spectral ambiguity. By embracing spectral ambiguity as an advantage, our technique enables the generation of training data without specialized multispectral rendering frameworks. We introduce a unique, physics-free network architecture, SpectraM-PS, that effectively processes multiplexed images to determine surface normals across a wide range of conditions and material types, without relying on specific physically-based knowledge. Additionally, we establish the first benchmark dataset, SpectraM14, for spectrally multiplexed photometric stereo, facilitating comprehensive evaluations against existing calibrated methods. Our contributions significantly enhance the capabilities for dynamic surface recovery, particularly in uncalibrated setups, marking a pivotal step forward in the application of photometric stereo across various domains.

[CV-38] CIB-SE-YOLOv8: Optimized YOLOv8 for Real-Time Safety Equipment Detection on Construction Sites

链接: https://arxiv.org/abs/2410.20699
作者: Xiaoyi Liu,Ruina Du,Lianghao Tan,Junran Xu,Chen Chen,Huangqi Jiang,Saleh Aldwais
关键词-EN: Ensuring safety, reducing injuries, playing a key, key role, role in reducing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:Ensuring safety on construction sites is critical, with helmets playing a key role in reducing injuries. Traditional safety checks are labor-intensive and often insufficient. This study presents a computer vision-based solution using YOLO for real-time helmet detection, leveraging the SHEL5K dataset. Our proposed CIB-SE-YOLOv8 model incorporates SE attention mechanisms and modified C2f blocks, enhancing detection accuracy and efficiency. This model offers a more effective solution for promoting safety compliance on construction sites.

[CV-39] ODGS: 3D Scene Reconstruction from Omnidirectional Images with 3D Gaussian Splattings

链接: https://arxiv.org/abs/2410.20686
作者: Suyoung Lee,Jaeyoung Chung,Jaeyoo Huh,Kyoung Mu Lee
关键词-EN: omnidirectional images, Omnidirectional, Gaussian, single image, rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Omnidirectional (or 360-degree) images are increasingly being used for 3D applications since they allow the rendering of an entire scene with a single image. Existing works based on neural radiance fields demonstrate successful 3D reconstruction quality on egocentric videos, yet they suffer from long training and rendering times. Recently, 3D Gaussian splatting has gained attention for its fast optimization and real-time rendering. However, directly using a perspective rasterizer to omnidirectional images results in severe distortion due to the different optical properties between two image domains. In this work, we present ODGS, a novel rasterization pipeline for omnidirectional images, with geometric interpretation. For each Gaussian, we define a tangent plane that touches the unit sphere and is perpendicular to the ray headed toward the Gaussian center. We then leverage a perspective camera rasterizer to project the Gaussian onto the corresponding tangent plane. The projected Gaussians are transformed and combined into the omnidirectional image, finalizing the omnidirectional rasterization process. This interpretation reveals the implicit assumptions within the proposed pipeline, which we verify through mathematical proofs. The entire rasterization process is parallelized using CUDA, achieving optimization and rendering speeds 100 times faster than NeRF-based methods. Our comprehensive experiments highlight the superiority of ODGS by delivering the best reconstruction and perceptual quality across various datasets. Additionally, results on roaming datasets demonstrate that ODGS restores fine details effectively, even when reconstructing large 3D scenes. The source code is available on our project page (this https URL).

[CV-40] Video to Video Generative Adversarial Network for Few-shot Learning Based on Policy Gradient

链接: https://arxiv.org/abs/2410.20657
作者: Yintai Ma,Diego Klabjan,Jean Utke
关键词-EN: deep reinforcement learning, development of sophisticated, facilitated by recent, generative adversarial networks, reinforcement learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 11 figures, submitting to IEEE TNNLS

点击查看摘要

Abstract:The development of sophisticated models for video-to-video synthesis has been facilitated by recent advances in deep reinforcement learning and generative adversarial networks (GANs). In this paper, we propose RL-V2V-GAN, a new deep neural network approach based on reinforcement learning for unsupervised conditional video-to-video synthesis. While preserving the unique style of the source video domain, our approach aims to learn a mapping from a source video domain to a target video domain. We train the model using policy gradient and employ ConvLSTM layers to capture the spatial and temporal information by designing a fine-grained GAN architecture and incorporating spatio-temporal adversarial goals. The adversarial losses aid in content translation while preserving style. Unlike traditional video-to-video synthesis methods requiring paired inputs, our proposed approach is more general because it does not require paired inputs. Thus, when dealing with limited videos in the target domain, i.e., few-shot learning, it is particularly effective. Our experiments show that RL-V2V-GAN can produce temporally coherent video results. These results highlight the potential of our approach for further advances in video-to-video synthesis.

[CV-41] A Comparative Study of Multiple Deep Learning Algorithms for Efficient Localization of Bone Joints in the Upper Limbs of Human Body

链接: https://arxiv.org/abs/2410.20639
作者: Soumalya Bose,Soham Basu,Indranil Bera,Sambit Mallick,Snigdha Paul,Saumodip Das,Swarnendu Sil,Swarnava Ghosh,Anindya Sen
关键词-EN: medical imaging problem, imaging problem, bone joint detections, Automated joint localization, medical imaging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the medical imaging problem of joint detection in the upper limbs, viz. elbow, shoulder, wrist and finger joints. Localization of joints from X-Ray and Computerized Tomography (CT) scans is an essential step for the assessment of various bone-related medical conditions like Osteoarthritis, Rheumatoid Arthritis, and can even be used for automated bone fracture detection. Automated joint localization also detects the corresponding bones and can serve as input to deep learning-based models used for the computerized diagnosis of the aforementioned medical disorders. This in-creases the accuracy of prediction and aids the radiologists with analyzing the scans, which is quite a complex and exhausting task. This paper provides a detailed comparative study between diverse Deep Learning (DL) models - YOLOv3, YOLOv7, EfficientDet and CenterNet in multiple bone joint detections in the upper limbs of the human body. The research analyses the performance of different DL models, mathematically, graphically and visually. These models are trained and tested on a portion of the openly available MURA (musculoskeletal radiographs) dataset. The study found that the best Mean Average Precision (mAP at 0.5:0.95) values of YOLOv3, YOLOv7, EfficientDet and CenterNet are 35.3, 48.3, 46.5 and 45.9 respectively. Besides, it has been found YOLOv7 performed the best for accurately predicting the bounding boxes while YOLOv3 performed the worst in the Visual Analysis test. Code available at this https URL

[CV-42] Ant Detective: An Automated Approach for Counting Ants in Densely Populated Images and Gaining Insight into Ant Foraging Behavior

链接: https://arxiv.org/abs/2410.20638
作者: Mautushi Das,Fang-Ling Chloe Liu,Charly Hartle,Chin-Cheng Scotty Yang,C. P. James Chen
关键词-EN: densely populated images, Ant foraging behavior, foraging behavior, dynamics and developing, labor-intensive nature
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ant foraging behavior is essential to understanding ecological dynamics and developing effective pest management strategies, but quantifying this behavior is challenging due to the labor-intensive nature of manual counting, especially in densely populated images. This study presents an automated approach using computer vision to count ants and analyze their foraging behavior. Leveraging the YOLOv8 model, the system was calibrated and evaluated on datasets encompassing various imaging scenarios and densities. The study results demonstrate that the system achieves average precision and recall of up to 87.96% and 87,78%, respectively, with only 64 calibration images provided when the both calibration and evaluation images share similar imaging backgrounds. When the background is more complex than the calibration images, the system requires a larger calibration set to generalize effectively, with 1,024 images yielding the precision and recall of up to 83.60% and 78.88, respectively. In more challenging scenarios where more than one thousand ants are present in a single image, the system significantly improves detection accuracy by slicing images into smaller patches, reaching a precision and recall of 77.97% and 71.36%, respectively. The system’s ability to generate heatmaps visualizes the spatial distribution of ant activity over time, providing valuable insights into their foraging patterns. This spatial-temporal analysis enables a more comprehensive understanding of ant behavior, which is crucial for ecological studies and improving pest control methods. By automating the counting process and offering detailed behavioral analysis, this study provides an efficient tool for researchers and pest control professionals to develop more effective strategies.

[CV-43] PViT: Prior-augmented Vision Transformer for Out-of-distribution Detection

链接: https://arxiv.org/abs/2410.20631
作者: Tianhao Zhang,Zhixiang Chen,Lyudmila S. Mihaylova
关键词-EN: Prior-augmented Vision Transformer, biases remain underexplored, achieved remarkable success, inherent inductive biases, inductive biases remain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved remarkable success over various vision tasks, yet their robustness against data distribution shifts and inherent inductive biases remain underexplored. To enhance the robustness of ViT models for image Out-of-Distribution (OOD) detection, we introduce a novel and generic framework named Prior-augmented Vision Transformer (PViT). PViT identifies OOD samples by quantifying the divergence between the predicted class logits and the prior logits obtained from pre-trained models. Unlike existing state-of-the-art OOD detection methods, PViT shapes the decision boundary between ID and OOD by utilizing the proposed prior guide confidence, without requiring additional data modeling, generation methods, or structural modifications. Extensive experiments on the large-scale ImageNet benchmark demonstrate that PViT significantly outperforms existing state-of-the-art OOD detection methods. Additionally, through comprehensive analyses, ablation studies, and discussions, we show how PViT can strategically address specific challenges in managing large vision models, paving the way for new advancements in OOD detection.

[CV-44] Exocentric To Egocentric Transfer For Action Recognition: A Short Survey

链接: https://arxiv.org/abs/2410.20621
作者: Anirudh Thatipelli,Shao-Yuan Lo,Amit K. Roy-Chowdhury
关键词-EN: Egocentric vision captures, vision captures, scene context, camera wearer, captures the scene
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Egocentric vision captures the scene from the point of view of the camera wearer while exocentric vision captures the overall scene context. Jointly modeling ego and exo views is crucial to developing next-generation AI agents. The community has regained interest in the field of egocentric vision. While the third-person view and first-person have been thoroughly investigated, very few works aim to study both synchronously. Exocentric videos contain many relevant signals that are transferrable to egocentric videos. In this paper, we provide a broad overview of works combining egocentric and exocentric visions.

[CV-45] A Framework for Real-Time Volcano-Seismic Event Recognition Based on Multi-Station Seismograms and Semantic Segmentation Models

链接: https://arxiv.org/abs/2410.20595
作者: Camilo Espinosa-Curilem,Millaray Curilem,Daniel Basualto
关键词-EN: timely warning alerts, raising timely warning, understanding volcanic activity, warning alerts, essential for understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages, 9 figures. This is a pre-print, it is currently under review fro publication at Computers and Geosciences, by Elsevier

点击查看摘要

Abstract:In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor-intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single station information and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real-time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straight forward transformation of multi-channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data-driven, end-to-end design that integrates multi-station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state-of-the-art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25.000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue-Cordón Caulle. Among these models, the UNet architecture was identified as the most effective model, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.

[CV-46] Normal-GS: 3D Gaussian Splatting with Normal-Involved Rendering NEURIPS2024

链接: https://arxiv.org/abs/2410.20593
作者: Meng Wei,Qianyi Wu,Jianmin Zheng,Hamid Rezatofighi,Jianfei Cai
关键词-EN: vision and graphics, reconstruction are long-standing, long-standing topics, topics in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 5 figures, accepted at NeurIPS 2024

点击查看摘要

Abstract:Rendering and reconstruction are long-standing topics in computer vision and graphics. Achieving both high rendering quality and accurate geometry is a challenge. Recent advancements in 3D Gaussian Splatting (3DGS) have enabled high-fidelity novel view synthesis at real-time speeds. However, the noisy and discrete nature of 3D Gaussian primitives hinders accurate surface estimation. Previous attempts to regularize 3D Gaussian normals often degrade rendering quality due to the fundamental disconnect between normal vectors and the rendering pipeline in 3DGS-based methods. Therefore, we introduce Normal-GS, a novel approach that integrates normal vectors into the 3DGS rendering pipeline. The core idea is to model the interaction between normals and incident lighting using the physically-based rendering equation. Our approach re-parameterizes surface colors as the product of normals and a designed Integrated Directional Illumination Vector (IDIV). To optimize memory usage and simplify optimization, we employ an anchor-based 3DGS to implicitly encode locally-shared IDIVs. Additionally, Normal-GS leverages optimized normals and Integrated Directional Encoding (IDE) to accurately model specular effects, enhancing both rendering quality and surface normal precision. Extensive experiments demonstrate that Normal-GS achieves near state-of-the-art visual quality while obtaining accurate surface normals and preserving real-time rendering performance.

[CV-47] Detection of adrenal anomalous findings in spinal CT images using multi model graph aggregatio

链接: https://arxiv.org/abs/2410.20568
作者: Shabalin Carmel,Shenkman Israel,Shelef Ilan,Ben-Arie Gal,Alex Geftler,Shahar Yuval
关键词-EN: Low back pain, primary care physicians, Low back, back problems, back pain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Low back pain is the symptom that is the second most frequently reported to primary care physicians, effecting 50 to 80 percent of the population in a lifetime, resulting in multiple referrals of patients suffering from back problems, to CT and MRI scans, which are then examined by radiologists. The radiologists examining these spinal scans naturally focus on spinal pathologies and might miss other types of abnormalities, and in particular, abdominal ones, such as malignancies. Nevertheless, the patients whose spine was scanned might as well have malignant and other abdominal pathologies. Thus, clinicians have suggested the need for computerized assistance and decision support in screening spinal scans for additional abnormalities. In the current study, We have addressed the important case of detecting suspicious lesions in the adrenal glands as an example for the overall methodology we have developed. A patient CT scan is integrated from multiple slices with an axial orientation. Our method determines whether a patient has an abnormal adrenal gland, and localises the abnormality if it exists. Our method is composed of three deep learning models; each model has a different task for achieving the final goal. We call our compound method the Multi Model Graph Aggregation MMGA method. The novelty in this study is twofold. First, the use, for an important screening task, of CT scans that are originally focused and tuned for imaging the spine, which were acquired from patients with potential spinal disorders, for detection of a totally different set of abnormalities such as abdominal Adrenal glands pathologies. Second, we have built a complex pipeline architecture composed from three deep learning models that can be utilized for other organs (such as the pancreas or the kidney), or for similar applications, but using other types of imaging, such as MRI.

[CV-48] Fractal and Turbulent Feature Extraction and NFT Label Generation for Pollock Style Migration Paintings Based on VGG19

链接: https://arxiv.org/abs/2410.20519
作者: Yiquan Wang
关键词-EN: fuses deep learning, create abstract artworks, create abstract, deep learning, MindSpore deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:This paper puts forth an innovative approach that fuses deep learning, fractal analysis, and turbulence feature extraction techniques to create abstract artworks in the style of Pollock. The content and style characteristics of the image are extracted by the MindSpore deep learning framework and a pre-trained VGG19 model. An optimisation process is then employed to The method generates high-quality Pollock-style images by combining content loss, style loss and full variance loss to achieve accurate style migration. Furthermore, this paper implements a fractal dimension calculation method based on the difference box-counting method, which effectively estimates the fractal dimension of an image through edge extraction and fractal analysis. The method is based on a two-dimensional discrete wavelet transform using a Haar wavelet to decompose the image in order to extract different frequency information. This is followed by the combination of multiple features to generate unique non-homogeneous token (NFT) labels for the authentication and protection of digital artwork. The experimental results demonstrate that the generated artworks exhibit The method demonstrates significant diversity and complexity in terms of fractal dimensions and turbulence features, while the generated NFT tags ensure the uniqueness and tamperability of each digital collection. The present method organically combines computer vision, digital signal processing and blockchain technology to provide a new solution for the creation and authentication of digital artworks.

[CV-49] Referring Human Pose and Mask Estimation in the Wild NEURIPS2024

链接: https://arxiv.org/abs/2410.20508
作者: Bo Miao,Mingtao Feng,Zijie Wu,Mohammed Bennamoun,Yongsheng Gao,Ajmal Mian
关键词-EN: Referring Human Pose, introduce Referring Human, Referring Human, Human Pose, introduce Referring
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted by NeurIPS 2024. this https URL

点击查看摘要

Abstract:We introduce Referring Human Pose and Mask Estimation (R-HPM) in the wild, where either a text or positional prompt specifies the person of interest in an image. This new task holds significant potential for human-centric applications such as assistive robotics and sports analysis. In contrast to previous works, R-HPM (i) ensures high-quality, identity-aware results corresponding to the referred person, and (ii) simultaneously predicts human pose and mask for a comprehensive representation. To achieve this, we introduce a large-scale dataset named RefHuman, which substantially extends the MS COCO dataset with additional text and positional prompt annotations. RefHuman includes over 50,000 annotated instances in the wild, each equipped with keypoint, mask, and prompt annotations. To enable prompt-conditioned estimation, we propose the first end-to-end promptable approach named UniPHD for R-HPM. UniPHD extracts multimodal representations and employs a proposed pose-centric hierarchical decoder to process (text or positional) instance queries and keypoint queries, producing results specific to the referred person. Extensive experiments demonstrate that UniPHD produces quality results based on user-friendly prompts and achieves top-tier performance on RefHuman val and MS COCO val2017. Data and Code: this https URL

[CV-50] ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

链接: https://arxiv.org/abs/2410.20502
作者: Zongyi Li,Shujie Hu,Shujie Liu,Long Zhou,Jeongsoo Choi,Lingwei Meng,Xun Guo,Jinyu Li,Hefei Ling,Furu Wei
关键词-EN: recently undergone rapid, long video generation, DiT model, video generation, long videos
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive models for long video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. See demos of ARLON at \urlthis http URL.

[CV-51] GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation NEURIPS2024

链接: https://arxiv.org/abs/2410.20474
作者: Phillip Y. Lee,Taehoon Yoon,Minhyuk Sung
关键词-EN: spatial grounding technique, training-free spatial grounding, spatial grounding, Diffusion Transformers, bounding boxes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024. Project Page: this https URL

点击查看摘要

Abstract:We introduce a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become “semantic clones”. Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free spatial grounding approaches.

[CV-52] Unlocking Comics: The AI4VA Dataset for Visual Understanding ECCV2024

链接: https://arxiv.org/abs/2410.20459
作者: Peter Grönquist,Deblina Bhattacharjee,Bahar Aydemir,Baran Ozaydin,Tong Zhang,Mathieu Salzmann,Sabine Süsstrunk
关键词-EN: comprehensive datasets capable, deep learning, multiple modalities, evolving landscape, landscape of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
*备注: ECCV 2024 Workshop Proceedings

点击查看摘要

Abstract:In the evolving landscape of deep learning, there is a pressing need for more comprehensive datasets capable of training models across multiple modalities. Concurrently, in digital humanities, there is a growing demand to leverage technology for diverse media adaptation and creation, yet limited by sparse datasets due to copyright and stylistic constraints. Addressing this gap, our paper presents a novel dataset comprising Franco-Belgian comics from the 1950s annotated for tasks including depth estimation, semantic segmentation, saliency detection, and character identification. It consists of two distinct and consistent styles and incorporates object concepts and labels taken from natural images. By including such diverse information across styles, this dataset not only holds promise for computational creativity but also offers avenues for the digitization of art and storytelling innovation. This dataset is a crucial component of the AI4VA Workshop Challenges~\urlthis https URL, where we specifically explore depth and saliency. Dataset details at \urlthis https URL.

[CV-53] BlinkVision: A Benchmark for Optical Flow Scene Flow and Point Tracking Estimation using RGB Frames and Events ECCV2024 WWW

链接: https://arxiv.org/abs/2410.20451
作者: Yijin Li,Yichen Shen,Zhaoyang Huang,Shuo Chen,Weikang Bian,Xiaoyu Shi,Fu-Yun Wang,Keqiang Sun,Hujun Bao,Zhaopeng Cui,Guofeng Zhang,Hongsheng Li
关键词-EN: high dynamic range, systems complement traditional, frame rate limitations, Recent advances, complement traditional cameras
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024. Project Page: this https URL

点击查看摘要

Abstract:Recent advances in event-based vision suggest that these systems complement traditional cameras by providing continuous observation without frame rate limitations and a high dynamic range, making them well-suited for correspondence tasks such as optical flow and point tracking. However, there is still a lack of comprehensive benchmarks for correspondence tasks that include both event data and images. To address this gap, we propose BlinkVision, a large-scale and diverse benchmark with multiple modalities and dense correspondence annotations. BlinkVision offers several valuable features: 1) Rich modalities: It includes both event data and RGB images. 2) Extensive annotations: It provides dense per-pixel annotations covering optical flow, scene flow, and point tracking. 3) Large vocabulary: It contains 410 everyday categories, sharing common classes with popular 2D and 3D datasets like LVIS and ShapeNet. 4) Naturalistic: It delivers photorealistic data and covers various naturalistic factors, such as camera shake and deformation. BlinkVision enables extensive benchmarks on three types of correspondence tasks (optical flow, point tracking, and scene flow estimation) for both image-based and event-based methods, offering new observations, practices, and insights for future research. The benchmark website is this https URL.

[CV-54] Vector Quantization Prompting for Continual Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.20444
作者: Li Jiao,Qiuxia Lai,Yu Li,Qiang Xu
关键词-EN: overcome catastrophic forgetting, Continual learning, Continual learning requires, task, requires to overcome
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in NeurIPS 2024

点击查看摘要

Abstract:Continual learning requires to overcome catastrophic forgetting when training a single model on a sequence of tasks. Recent top-performing approaches are prompt-based methods that utilize a set of learnable parameters (i.e., prompts) to encode task knowledge, from which appropriate ones are selected to guide the fixed pre-trained model in generating features tailored to a certain task. However, existing methods rely on predicting prompt identities for prompt selection, where the identity prediction process cannot be optimized with task loss. This limitation leads to sub-optimal prompt selection and inadequate adaptation of pre-trained features for a specific task. Previous efforts have tried to address this by directly generating prompts from input queries instead of selecting from a set of candidates. However, these prompts are continuous, which lack sufficient abstraction for task knowledge representation, making them less effective for continual learning. To address these challenges, we propose VQ-Prompt, a prompt-based continual learning method that incorporates Vector Quantization (VQ) into end-to-end training of a set of discrete prompts. In this way, VQ-Prompt can optimize the prompt selection process with task loss and meanwhile achieve effective abstraction of task knowledge for continual learning. Extensive experiments show that VQ-Prompt outperforms state-of-the-art continual learning methods across a variety of benchmarks under the challenging class-incremental setting. The code is available at \hrefthis https URLthis https URL.

[CV-55] CoralSCOP-LAT: Labeling and Analyzing Tool for Coral Reef Images with Dense Mask

链接: https://arxiv.org/abs/2410.20436
作者: Yuk-Kwan Wong,Ziqiang Zheng,Mingzhe Zhang,David Suggett,Sai-Kit Yeung
关键词-EN: coral reef, provide invaluable information, coral, coral reef analysis, reefs provide invaluable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The coral reef labeling and analysis tool is available at this https URL

点击查看摘要

Abstract:Images of coral reefs provide invaluable information, which is essentially critical for surveying and monitoring the coral reef ecosystems. Robust and precise identification of coral reef regions within surveying imagery is paramount for assessing coral coverage, spatial distribution, and other statistical analyses. However, existing coral reef analytical approaches mainly focus on sparse points sampled from the whole imagery, which are highly subject to the sampling density and cannot accurately express the coral ambulance. Meanwhile, the analysis is both time-consuming and labor-intensive, and it is also limited to coral biologists. In this work, we propose CoralSCOP-LAT, an automatic and semi-automatic coral reef labeling and analysis tool, specially designed to segment coral reef regions (dense pixel masks) in coral reef images, significantly promoting analysis proficiency and accuracy. CoralSCOP-LAT leverages the advanced coral reef foundation model to accurately delineate coral regions, supporting dense coral reef analysis and reducing the dependency on manual annotation. The proposed CoralSCOP-LAT surpasses the existing tools by a large margin from analysis efficiency, accuracy, and flexibility. We perform comprehensive evaluations from various perspectives and the comparison demonstrates that CoralSCOP-LAT not only accelerates the coral reef analysis but also improves accuracy in coral segmentation and analysis. Our CoralSCOP-LAT, as the first dense coral reef analysis tool in the market, facilitates repeated large-scale coral reef monitoring analysis, contributing to more informed conservation efforts and sustainable management of coral reef ecosystems. Our tool will be available at this https URL.

[CV-56] YourSkatingCoach: A Figure Skating Video Benchmark for Fine-Grained Element Analysis

链接: https://arxiv.org/abs/2410.20427
作者: Wei-Yi Chen,Yu-An Su,Wei-Hsin Yeh,Lun-Wei Ku
关键词-EN: machine learning involves, learning involves leveraging, Combining sports, game footage, player statistics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Combining sports and machine learning involves leveraging ML algorithms and techniques to extract insight from sports-related data such as player statistics, game footage, and other relevant information. However, datasets related to figure skating in the literature focus primarily on element classification and are currently unavailable or exhibit only limited access, which greatly raise the entry barrier to developing visual sports technology for it. Moreover, when using such data to help athletes improve their skills, we find they are very coarse-grained: they work for learning what an element is, but they are poorly suited to learning whether the element is good or bad. Here we propose air time detection, a novel motion analysis task, the goal of which is to accurately detect the duration of the air time of a jump. We present YourSkatingCoach, a large, novel figure skating dataset which contains 454 videos of jump elements, the detected skater skeletons in each video, along with the gold labels of the start and ending frames of each jump, together as a video benchmark for figure skating. In addition, although this type of task is often viewed as classification, we cast it as a sequential labeling problem and propose a Transformer-based model to calculate the duration. Experimental results show that the proposed model yields a favorable results for a strong baseline. To further verify the generalizability of the fine-grained labels, we apply the same process to other sports as cross-sports tasks but for coarse-grained task action classification. Here we fine-tune the classification to demonstrate that figure skating, as it contains the essential body movements, constitutes a strong foundation for adaptation to other sports.

[CV-57] Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis NEURIPS2024

链接: https://arxiv.org/abs/2410.20406
作者: Hongyu Sun,Qiuhong Ke,Yongcai Wang,Wang Chen,Kang Yang,Deying Li,Jianfei Cai
关键词-EN: prevalent prompt learning, paper investigates, based on prevalent, domain generalization, prevalent prompt
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by NeurIPS 2024

点击查看摘要

Abstract:This paper investigates the 3D domain generalization (3DDG) ability of large 3D models based on prevalent prompt learning. Recent works demonstrate the performances of 3D point cloud recognition can be boosted remarkably by parameter-efficient prompt tuning. However, we observe that the improvement on downstream tasks comes at the expense of a severe drop in 3D domain generalization. To resolve this challenge, we present a comprehensive regulation framework that allows the learnable prompts to actively interact with the well-learned general knowledge in large 3D models to maintain good generalization. Specifically, the proposed framework imposes multiple explicit constraints on the prompt learning trajectory by maximizing the mutual agreement between task-specific predictions and task-agnostic knowledge. We design the regulation framework as a plug-and-play module to embed into existing representative large 3D models. Surprisingly, our method not only realizes consistently increasing generalization ability but also enhances task-specific 3D recognition performances across various 3DDG benchmarks by a clear margin. Considering the lack of study and evaluation on 3DDG, we also create three new benchmarks, namely base-to-new, cross-dataset and few-shot generalization benchmarks, to enrich the field and inspire future research. Code and benchmarks are available at \urlthis https URL.

[CV-58] Deep Learning-Driven Microstructure Characterization and Vickers Hardness Prediction of Mg-Gd Alloys

链接: https://arxiv.org/abs/2410.20402
作者: Lu Wang,Hongchan Chen,Bing Wang,Qian Li,Qun Luo,Yuexing Han
关键词-EN: critical research focus, solid-solution Mg-Gd alloys, Mg-Gd alloys, solid-solution Mg-Gd, exploring the relationship
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the field of materials science, exploring the relationship between composition, microstructure, and properties has long been a critical research focus. The mechanical performance of solid-solution Mg-Gd alloys is significantly influenced by Gd content, dendritic structures, and the presence of secondary phases. To better analyze and predict the impact of these factors, this study proposes a multimodal fusion learning framework based on image processing and deep learning techniques. This framework integrates both elemental composition and microstructural features to accurately predict the Vickers hardness of solid-solution Mg-Gd alloys. Initially, deep learning methods were employed to extract microstructural information from a variety of solid-solution Mg-Gd alloy images obtained from literature and experiments. This provided precise grain size and secondary phase microstructural features for performance prediction tasks. Subsequently, these quantitative analysis results were combined with Gd content information to construct a performance prediction dataset. Finally, a regression model based on the Transformer architecture was used to predict the Vickers hardness of Mg-Gd alloys. The experimental results indicate that the Transformer model performs best in terms of prediction accuracy, achieving an R^2 value of 0.9. Additionally, SHAP analysis identified critical values for four key features affecting the Vickers hardness of Mg-Gd alloys, providing valuable guidance for alloy design. These findings not only enhance the understanding of alloy performance but also offer theoretical support for future material design and optimization.

[CV-59] Depth Attention for Robust RGB Tracking ACCV

链接: https://arxiv.org/abs/2410.20395
作者: Yu Liu,Arif Mahmood,Muhammad Haris Khan
关键词-EN: RGB video object, computer vision, fundamental task, task in computer, RGB video
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Oral Acceptance at the Asian Conference on Computer Vision (ACCV) 2024, Hanoi, Vietnam

点击查看摘要

Abstract:RGB video object tracking is a fundamental task in computer vision. Its effectiveness can be improved using depth information, particularly for handling motion-blurred target. However, depth information is often missing in commonly used tracking benchmarks. In this work, we propose a new framework that leverages monocular depth estimation to counter the challenges of tracking targets that are out of view or affected by motion blur in RGB video sequences. Specifically, our work introduces following contributions. To the best of our knowledge, we are the first to propose a depth attention mechanism and to formulate a simple framework that allows seamlessly integration of depth information with state of the art tracking algorithms, without RGB-D cameras, elevating accuracy and robustness. We provide extensive experiments on six challenging tracking benchmarks. Our results demonstrate that our approach provides consistent gains over several strong baselines and achieves new SOTA performance. We believe that our method will open up new possibilities for more sophisticated VOT solutions in real-world scenarios. Our code and models are publicly released: this https URL.

[CV-60] UTSRMorph: A Unified Transformer and Superresolution Network for Unsupervised Medical Image Registration

链接: https://arxiv.org/abs/2410.20348
作者: Runshi Zhang,Hao Mo,Junchen Wang,Bimeng Jie,Yang He,Nenghao Jin,Liang Zhu
关键词-EN: Complicated image registration, medical image analysis, deep learning-based methods, Complicated image, key issue
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13pages,10 figures

点击查看摘要

Abstract:Complicated image registration is a key issue in medical image analysis, and deep learning-based methods have achieved better results than traditional methods. The methods include ConvNet-based and Transformer-based methods. Although ConvNets can effectively utilize local information to reduce redundancy via small neighborhood convolution, the limited receptive field results in the inability to capture global dependencies. Transformers can establish long-distance dependencies via a self-attention mechanism; however, the intense calculation of the relationships among all tokens leads to high redundancy. We propose a novel unsupervised image registration method named the unified Transformer and superresolution (UTSRMorph) network, which can enhance feature representation learning in the encoder and generate detailed displacement fields in the decoder to overcome these problems. We first propose a fusion attention block to integrate the advantages of ConvNets and Transformers, which inserts a ConvNet-based channel attention module into a multihead self-attention module. The overlapping attention block, a novel cross-attention method, uses overlapping windows to obtain abundant correlations with match information of a pair of images. Then, the blocks are flexibly stacked into a new powerful encoder. The decoder generation process of a high-resolution deformation displacement field from low-resolution features is considered as a superresolution process. Specifically, the superresolution module was employed to replace interpolation upsampling, which can overcome feature degradation. UTSRMorph was compared to state-of-the-art registration methods in the 3D brain MR (OASIS, IXI) and MR-CT datasets. The qualitative and quantitative results indicate that UTSRMorph achieves relatively better performance. The code and datasets are publicly available at this https URL.

[CV-61] Wavelet-based Mamba with Fourier Adjustment for Low-light Image Enhancement ACCV2024

链接: https://arxiv.org/abs/2410.20314
作者: Junhao Tan,Songwen Pei,Wei Qin,Bo Fu,Ximing Li,Libo Huang
关键词-EN: Discrete Wavelet Transform, Fast Fourier Transform, Low-Light Image Enhancement, Fourier frequency information, Fast Fourier Adjustment
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 18 pages, 8 figures, ACCV2024

点击查看摘要

Abstract:Frequency information (e.g., Discrete Wavelet Transform and Fast Fourier Transform) has been widely applied to solve the issue of Low-Light Image Enhancement (LLIE). However, existing frequency-based models primarily operate in the simple wavelet or Fourier space of images, which lacks utilization of valid global and local information in each space. We found that wavelet frequency information is more sensitive to global brightness due to its low-frequency component while Fourier frequency information is more sensitive to local details due to its phase component. In order to achieve superior preliminary brightness enhancement by optimally integrating spatial channel information with low-frequency components in the wavelet transform, we introduce channel-wise Mamba, which compensates for the long-range dependencies of CNNs and has lower complexity compared to Diffusion and Transformer models. So in this work, we propose a novel Wavelet-based Mamba with Fourier Adjustment model called WalMaFa, consisting of a Wavelet-based Mamba Block (WMB) and a Fast Fourier Adjustment Block (FFAB). We employ an Encoder-Latent-Decoder structure to accomplish the end-to-end transformation. Specifically, WMB is adopted in the Encoder and Decoder to enhance global brightness while FFAB is adopted in the Latent to fine-tune local texture details and alleviate ambiguity. Extensive experiments demonstrate that our proposed WalMaFa achieves state-of-the-art performance with fewer computational resources and faster speed. Code is now available at: this https URL.

[CV-62] GUMBEL-NERF: Representing Unseen Objects as Part-Compositional Neural Radiance Fields ICIP2024

链接: https://arxiv.org/abs/2410.20306
作者: Yusuke Sekikawa,Chingwei Hsu,Satoshi Ikehata,Rei Kawakami,Ikuro Sato
关键词-EN: neural radiance fields, expert selection mechanism, hindsight expert selection, neural radiance, expert selection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages. Presented at ICIP2024

点击查看摘要

Abstract:We propose Gumbel-NeRF, a mixture-of-expert (MoE) neural radiance fields (NeRF) model with a hindsight expert selection mechanism for synthesizing novel views of unseen objects. Previous studies have shown that the MoE structure provides high-quality representations of a given large-scale scene consisting of many objects. However, we observe that such a MoE NeRF model often produces low-quality representations in the vicinity of experts’ boundaries when applied to the task of novel view synthesis of an unseen object from one/few-shot input. We find that this deterioration is primarily caused by the foresight expert selection mechanism, which may leave an unnatural discontinuity in the object shape near the experts’ boundaries. Gumbel-NeRF adopts a hindsight expert selection mechanism, which guarantees continuity in the density field even near the experts’ boundaries. Experiments using the SRN cars dataset demonstrate the superiority of Gumbel-NeRF over the baselines in terms of various image quality metrics.

[CV-63] Deep Learning Machine Learning – Digital Signal and Image Processing: From Theory to Application

链接: https://arxiv.org/abs/2410.20304
作者: Weiche Hsieh,Ziqian Bi,Junyu Liu,Benji Peng,Sen Zhang,Xuanhe Pan,Jiawei Xu,Jinlang Wang,Keyu Chen,Caitlyn Heqi Yin,Pohsun Feng,Yizhu Wen,Tianyang Wang,Ming Li,Jintao Ren,Qian Niu,Silin Chen,Ming Liu
关键词-EN: Digital Signal Processing, Digital Image Processing, Machine Learning, Deep Learning, popular research areas
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 293 pages

点击查看摘要

Abstract:Digital Signal Processing (DSP) and Digital Image Processing (DIP) with Machine Learning (ML) and Deep Learning (DL) are popular research areas in Computer Vision and related fields. We highlight transformative applications in image enhancement, filtering techniques, and pattern recognition. By integrating frameworks like the Discrete Fourier Transform (DFT), Z-Transform, and Fourier Transform methods, we enable robust data manipulation and feature extraction essential for AI-driven tasks. Using Python, we implement algorithms that optimize real-time data processing, forming a foundation for scalable, high-performance solutions in computer vision. This work illustrates the potential of ML and DL to advance DSP and DIP methodologies, contributing to artificial intelligence, automated feature extraction, and applications across diverse domains.

[CV-64] Harmony4D: A Video Dataset for In-The-Wild Close Human Interactions NEURIPS2024

链接: https://arxiv.org/abs/2410.20294
作者: Rawal Khirodkar,Jyun-Ting Song,Jinkun Cao,Zhengyi Luo,Kris Kitani
关键词-EN: building realistic multi-human, realistic multi-human virtual, multi-human virtual reality, virtual reality systems, key to building
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Understanding how humans interact with each other is key to building realistic multi-human virtual reality systems. This area remains relatively unexplored due to the lack of large-scale datasets. Recent datasets focusing on this issue mainly consist of activities captured entirely in controlled indoor environments with choreographed actions, significantly affecting their diversity. To address this, we introduce Harmony4D, a multi-view video dataset for human-human interaction featuring in-the-wild activities such as wrestling, dancing, MMA, and more. We use a flexible multi-view capture system to record these dynamic activities and provide annotations for human detection, tracking, 2D/3D pose estimation, and mesh recovery for closely interacting subjects. We propose a novel markerless algorithm to track 3D human poses in severe occlusion and close interaction to obtain our annotations with minimal manual intervention. Harmony4D consists of 1.66 million images and 3.32 million human instances from more than 20 synchronized cameras with 208 video sequences spanning diverse environments and 24 unique subjects. We rigorously evaluate existing state-of-the-art methods for mesh recovery and highlight their significant limitations in modeling close interaction scenarios. Additionally, we fine-tune a pre-trained HMR2.0 model on Harmony4D and demonstrate an improved performance of 54.8% PVE in scenes with severe occlusion and contact. Code and data are available at this https URL.

[CV-65] You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20265
作者: Eric Slyman,Anirudh Kanneganti,Sanghyun Hong,Stefan Lee
关键词-EN: produce socially-fair outputs, compressing foundation vision-language, foundation vision-language models, socially-fair outputs, study the impact
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Workshop paper at NeurIPS 2024 RBFM. 6 pages, 3 figures

点击查看摘要

Abstract:We study the impact of a standard practice in compressing foundation vision-language models - quantization - on the models’ ability to produce socially-fair outputs. In contrast to prior findings with unimodal models that compression consistently amplifies social biases, our extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, we find no consistent change in bias magnitude or direction across a population of compressed models due to quantization.

[CV-66] Enhancing CNN Classification with Lamarckian Memetic Algorithms and Local Search

链接: https://arxiv.org/abs/2410.20234
作者: Akhilbaran Ghosh,Rama Sai Adithya Kalidindi
关键词-EN: critical for optimal, optimal performance, performance in deep, Optimization, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in IEEE SPARC 2024

点击查看摘要

Abstract:Optimization is critical for optimal performance in deep neural networks (DNNs). Traditional gradient-based methods often face challenges like local minima entrapment. This paper explores population-based metaheuristic optimization algorithms for image classification networks. We propose a novel approach integrating a two-stage training technique with population-based optimization algorithms incorporating local search capabilities. Our experiments demonstrate that the proposed method outperforms state-of-the-art gradient-based techniques, such as ADAM, in accuracy and computational efficiency, particularly with high computational complexity and numerous trainable parameters. The results suggest that our approach offers a robust alternative to traditional methods for weight optimization in convolutional neural networks (CNNs). Future work will explore integrating adaptive mechanisms for parameter tuning and applying the proposed method to other types of neural networks and real-time applications.

[CV-67] CAVE: Classifying Abnormalities in Video Capsule Endoscopy

链接: https://arxiv.org/abs/2410.20231
作者: Ishita Harish,Saurav Mishra,Neha Bhadoria,Rithik Kumar,Madhav Arora,Syed Rameem Zahra,Ankur Gupta
关键词-EN: complex image datasets, Block Attention Module, Deep Neural Network, Convolutional Block Attention, Support Vector Machine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this study, we explore an ensemble-based approach to improve classification accuracy in complex image datasets. Utilizing a Convolutional Block Attention Module (CBAM) alongside a Deep Neural Network (DNN) we leverage the unique feature-extraction capabilities of each model to enhance the overall accuracy. Additional models, such as Random Forest, XGBoost, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), are introduced to further diversify the predictive power of our ensemble. By leveraging these methods, the proposed approach provides robust feature discrimination and improved classification results. Experimental evaluations demonstrate that the ensemble achieves higher accuracy and robustness across challenging and imbalanced classes, showing significant promise for broader applications in computer vision tasks.

[CV-68] An Efficient Watermarking Method for Latent Diffusion Models via Low-Rank Adaptation

链接: https://arxiv.org/abs/2410.20202
作者: Dongdong Lin,Yue Li,Benedetta Tondi,Bin Li,Mauro Barni
关键词-EN: deep neural networks, trained deep models, model watermarking technologies, deep neural, trained deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid proliferation of deep neural networks (DNNs) is driving a surge in model watermarking technologies, as the trained deep models themselves serve as intellectual properties. The core of existing model watermarking techniques involves modifying or tuning the models’ weights. However, with the emergence of increasingly complex models, ensuring the efficiency of watermarking process is essential to manage the growing computational demands. Prioritizing efficiency not only optimizes resource utilization, making the watermarking process more applicable, but also minimizes potential impacts on model performance. In this letter, we propose an efficient watermarking method for latent diffusion models (LDMs) which is based on Low-Rank Adaptation (LoRA). We specifically choose to add trainable low-rank matrices to the existing weight matrices of the models to embed watermark, while keeping the original weights frozen. Moreover, we also propose a dynamic loss weight tuning algorithm to balance the generative task with the watermark embedding task, ensuring that the model can be watermarked with a limited impact on the quality of the generated images. Experimental results show that the proposed method ensures fast watermark embedding and maintains a very low bit error rate of the watermark, a high-quality of the generated image, and a zero false negative rate (FNR) for verification.

[CV-69] ransferable Adversarial Attacks on SAM and Its Downstream Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20197
作者: Song Xia,Wenhan Yang,Yi Yu,Xun Lin,Henghui Ding,Lingyu Duan,Xudong Jiang
关键词-EN: large foundational models, fine-tuning downstream tasks, downstream models, open-sourced SAM, downstream models fine-tuned
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work is accepted by Neurips2024

点击查看摘要

Abstract:The utilization of large foundational models has a dilemma: while fine-tuning downstream tasks from them holds promise for making use of the well-generalized knowledge in practical applications, their open accessibility also poses threats of adverse usage. This paper, for the first time, explores the feasibility of adversarial attacking various downstream models fine-tuned from the segment anything model (SAM), by solely utilizing the information from the open-sourced SAM. In contrast to prevailing transfer-based adversarial attacks, we demonstrate the existence of adversarial dangers even without accessing the downstream task and dataset to train a similar surrogate model. To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm to extract the intrinsic vulnerability inherent in the foundation model, which is then utilized as the prior knowledge to guide the generation of adversarial perturbations. Moreover, by formulating the gradient difference in the attacking process between the open-sourced SAM and its fine-tuned downstream models, we theoretically demonstrate that a deviation occurs in the adversarial update direction by directly maximizing the distance of encoded feature embeddings in the open-sourced SAM. Consequently, we propose a gradient robust loss that simulates the associated uncertainty with gradient-based noise augmentation to enhance the robustness of generated adversarial examples (AEs) towards this deviation, thus improving the transferability. Extensive experiments demonstrate the effectiveness of the proposed universal meta-initialized and gradient robust adversarial attack (UMI-GRAT) toward SAMs and their downstream models. Code is available at this https URL.

[CV-70] Image Generation from Image Captioning – Invertible Approach

链接: https://arxiv.org/abs/2410.20171
作者: Nandakishore S Menon,Chandramouli Kamanchi,Raghuram Bharadwaj Diddigi
关键词-EN: performs dual tasks, work aims, aims to build, performs dual, image captioning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as Tiny Paper at ICVGIP 2024 conference

点击查看摘要

Abstract:Our work aims to build a model that performs dual tasks of image captioning and image generation while being trained on only one task. The central idea is to train an invertible model that learns a one-to-one mapping between the image and text embeddings. Once the invertible model is efficiently trained on one task, the image captioning, the same model can generate new images for a given text through the inversion process, with no additional training. This paper proposes a simple invertible neural network architecture for this problem and presents our current findings.

[CV-71] Prompt Diffusion Robustifies Any-Modality Prompt Learning

链接: https://arxiv.org/abs/2410.20164
作者: Yingjun Du,Gaowen Liu,Yuzhang Shang,Yuguang Yao,Ramana Kompella,Cees G. M. Snoek
关键词-EN: Foundation models enable, enable prompt-based classifiers, models enable prompt-based, Foundation models, prompt diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained, prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.

[CV-72] Your Image is Secretly the Last Frame of a Pseudo Video

链接: https://arxiv.org/abs/2410.20158
作者: Wenlong Chen,Wenlin Chen,Lapo Rastrelli,Yingzhen Li
关键词-EN: hierarchical variational autoencoders, shown profound success, Diffusion models, generating photo-realistic images, variational autoencoders
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Diffusion models, which can be viewed as a special case of hierarchical variational autoencoders (HVAEs), have shown profound success in generating photo-realistic images. In contrast, standard HVAEs often produce images of inferior quality compared to diffusion models. In this paper, we hypothesize that the success of diffusion models can be partly attributed to the additional self-supervision information for their intermediate latent states provided by corrupted images, which along with the original image form a pseudo video. Based on this hypothesis, we explore the possibility of improving other types of generative models with such pseudo videos. Specifically, we first extend a given image generative model to their video generative model counterpart, and then train the video generative model on pseudo videos constructed by applying data augmentation to the original images. Furthermore, we analyze the potential issues of first-order Markov data augmentation methods, which are typically used in diffusion models, and propose to use more expressive data augmentation to construct more useful information in pseudo videos. Our empirical results on the CIFAR10 and CelebA datasets demonstrate that improved image generation quality can be achieved with additional self-supervised information from pseudo videos.

[CV-73] Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20155
作者: Liulei Li,Wenguan Wang,Yi Yang
关键词-EN: Prevalent human-object interaction, detection approaches typically, approaches typically leverage, typically leverage large-scale, leverage large-scale visual-linguistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle at compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions, and extract HOI-relevant cues from images without heavy fine-tuning. Benefited from above, DIFFUSIONHOI achieves SOTA performance on three datasets under both regular and zero-shot setups.

[CV-74] Detection-Guided Deep Learning-Based Model with Spatial Regularization for Lung Nodule Segmentation

链接: https://arxiv.org/abs/2410.20154
作者: Jiasen Zhang,Mingrui Yang,Weihong Guo,Brian A. Xavier,Michael Bolen,Xiaojuan Li
关键词-EN: Lung cancer ranks, cancer-related mortality worldwide, lung nodules plays, lung nodules, cancer ranks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Lung cancer ranks as one of the leading causes of cancer diagnosis and is the foremost cause of cancer-related mortality worldwide. The early detection of lung nodules plays a pivotal role in improving outcomes for patients, as it enables timely and effective treatment interventions. The segmentation of lung nodules plays a critical role in aiding physicians in distinguishing between malignant and benign lesions. However, this task remains challenging due to the substantial variation in the shapes and sizes of lung nodules, and their frequent proximity to lung tissues, which complicates clear delineation. In this study, we introduce a novel model for segmenting lung nodules in computed tomography (CT) images, leveraging a deep learning framework that integrates segmentation and classification processes. This model is distinguished by its use of feature combination blocks, which facilitate the sharing of information between the segmentation and classification components. Additionally, we employ the classification outcomes as priors to refine the size estimation of the predicted nodules, integrating these with a spatial regularization technique to enhance precision. Furthermore, recognizing the challenges posed by limited training datasets, we have developed an optimal transfer learning strategy that freezes certain layers to further improve performance. The results show that our proposed model can capture the target nodules more accurately compared to other commonly used models. By applying transfer learning, the performance can be further improved, achieving a sensitivity score of 0.885 and a Dice score of 0.814.

[CV-75] Semantic Feature Decomposition based Semantic Communication System of Images with Large-scale Visual Generation Models

链接: https://arxiv.org/abs/2410.20126
作者: Senran Fan,Zhicheng Bao,Chen Dong,Haotai Liang,Xiaodong Xu,Ping Zhang
关键词-EN: image communication systems, semantic communication, image communication, communication, communication systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 13 figures

点击查看摘要

Abstract:The end-to-end image communication system has been widely studied in the academic community. The escalating demands on image communication systems in terms of data volume, environmental complexity, and task precision require enhanced communication efficiency, anti-noise ability and semantic fidelity. Therefore, we proposed a novel paradigm based on Semantic Feature Decomposition (SeFD) for the integration of semantic communication and large-scale visual generation models to achieve high-performance, highly interpretable and controllable image communication. According to this paradigm, a Texture-Color based Semantic Communication system of Images TCSCI is proposed. TCSCI decomposing the images into their natural language description (text), texture and color semantic features at the transmitter. During the transmission, features are transmitted over the wireless channel, and at the receiver, a large-scale visual generation model is utilized to restore the image through received features. TCSCI can achieve extremely compressed, highly noise-resistant, and visually similar image semantic communication, while ensuring the interpretability and editability of the transmission process. The experiments demonstrate that the TCSCI outperforms traditional image communication systems and existing semantic communication systems under extreme compression with good anti-noise performance and interpretability.

[CV-76] Anatomical 3D Style Transfer Enabling Efficient Federated Learning with Extremely Low Communication Costs NEURIPS2024

链接: https://arxiv.org/abs/2410.20102
作者: Yuto Shibata,Yasunori Kudo,Yohei Sugawara
关键词-EN: multi-organ segmentation task, federated learning, segmentation task, multi-organ segmentation, Frequency Domain Generalization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by AIM-FM Workshop at NeurIPS 2024

点击查看摘要

Abstract:In this study, we propose a novel federated learning (FL) approach that utilizes 3D style transfer for the multi-organ segmentation task. The multi-organ dataset, obtained by integrating multiple datasets, has high scalability and can improve generalization performance as the data volume increases. However, the heterogeneity of data owing to different clients with diverse imaging conditions and target organs can lead to severe overfitting of local models. To align models that overfit to different local datasets, existing methods require frequent communication with the central server, resulting in higher communication costs and risk of privacy leakage. To achieve an efficient and safe FL, we propose an Anatomical 3D Frequency Domain Generalization (A3DFDG) method for FL. A3DFDG utilizes structural information of human organs and clusters the 3D styles based on the location of organs. By mixing styles based on these clusters, it preserves the anatomical information and leads models to learn intra-organ diversity, while aligning the optimization of each local model. Experiments indicate that our method can maintain its accuracy even in cases where the communication cost is highly limited (=1.25% of the original cost) while achieving a significant difference compared to baselines, with a higher global dice similarity coefficient score of 4.3%. Despite its simplicity and minimal computational overhead, these results demonstrate that our method has high practicality in real-world scenarios where low communication costs and a simple pipeline are required. The code used in this project will be publicly available.

[CV-77] Generative Adversarial Patches for Physical Attacks on Cross-Modal Pedestrian Re-Identification

链接: https://arxiv.org/abs/2410.20097
作者: Yue Su,Hao Li,Maoguo Gong
关键词-EN: Visible-infrared pedestrian Re-identification, infrared cameras, visible cameras, Visible-infrared pedestrian, pedestrian Re-identification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visible-infrared pedestrian Re-identification (VI-ReID) aims to match pedestrian images captured by infrared cameras and visible cameras. However, VI-ReID, like other traditional cross-modal image matching tasks, poses significant challenges due to its human-centered nature. This is evidenced by the shortcomings of existing methods, which struggle to extract common features across modalities, while losing valuable information when bridging the gap between them in the implicit feature space, potentially compromising security. To address this vulnerability, this paper introduces the first physical adversarial attack against VI-ReID models. Our method, termed Edge-Attack, specifically tests the models’ ability to leverage deep-level implicit features by focusing on edge information, the most salient explicit feature differentiating individuals across modalities. Edge-Attack utilizes a novel two-step approach. First, a multi-level edge feature extractor is trained in a self-supervised manner to capture discriminative edge representations for each individual. Second, a generative model based on Vision Transformer Generative Adversarial Networks (ViTGAN) is employed to generate adversarial patches conditioned on the extracted edge features. By applying these patches to pedestrian clothing, we create realistic, physically-realizable adversarial samples. This black-box, self-supervised approach ensures the generalizability of our attack against various VI-ReID models. Extensive experiments on SYSU-MM01 and RegDB datasets, including real-world deployments, demonstrate the effectiveness of Edge- Attack in significantly degrading the performance of state-of-the-art VI-ReID methods.

[CV-78] UniVST: A Unified Framework for Training-free Localized Video Style Transfer

链接: https://arxiv.org/abs/2410.20084
作者: Quanjian Song,Mingbao Lin,Wengyi Zhan,Shuicheng Yan,Liujuan Cao
关键词-EN: unified framework, paper presents UniVST, paper presents, style transfer, video style transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages not including reference

点击查看摘要

Abstract:This paper presents UniVST, a unified framework for localized video style transfer. It operates without the need for training, offering a distinct advantage over existing methods that transfer style across entire videos. The endeavors of this paper comprise: (1) A point-matching mask propagation strategy that leverages feature maps from the DDIM inversion. This streamlines the model’s architecture by obviating the need for tracking models. (2) An AdaIN-guided style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding window smoothing strategy that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in video outputs. Our proposed UniVST has been validated to be superior to existing methods in quantitative and qualitative metrics. It adeptly addresses the challenges of preserving the primary object’s style while ensuring temporal consistency and detail preservation.

[CV-79] SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects IROS2024

链接: https://arxiv.org/abs/2410.20079
作者: InPyo Song,Jangwon Lee
关键词-EN: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, paper addresses, addresses the problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IROS 2024 selected Oral

点击查看摘要

Abstract:This paper addresses the problem of multi-object tracking in Unmanned Aerial Vehicle (UAV) footage. It plays a critical role in various UAV applications, including traffic monitoring systems and real-time suspect tracking by the police. However, this task is highly challenging due to the fast motion of UAVs, as well as the small size of target objects in the videos caused by the high-altitude and wide angle views of drones. In this study, we thus introduce a simple yet more effective method compared to previous work to overcome these challenges. Our approach involves a new tracking strategy, which initiates the tracking of target objects from low-confidence detections commonly encountered in UAV application scenarios. Additionally, we propose revisiting traditional appearance-based matching algorithms to improve the association of low-confidence detections. To evaluate the effectiveness of our method, we conducted benchmark evaluations on two UAV-specific datasets (VisDrone2019, UAVDT) and one general object tracking dataset (MOT17). The results demonstrate that our approach surpasses current state-of-the art methodologies, highlighting its robustness and adaptability in diverse tracking environments. Furthermore, we have improved the annotation of the UAVDT dataset by rectifying several errors and addressing omissions found in the original annotations. We will provide this refined version of the dataset to facilitate better benchmarking in the field.

[CV-80] 3D Distance-color-coded Assessment of PCI Stent Apposition via Deep-learning-based Three-dimensional Multi-object Segmentation

链接: https://arxiv.org/abs/2410.20055
作者: Xiaoyang Qin,Hao Huang,Shuaichen Lin,Xinhao Zeng,Kaizhi Cao,Renxiong Wu,Yuming Huang,Junqing Yang,Yong Liu,Gang Li,Guangming Ni
关键词-EN: percutaneous coronary intervention, necessitating percutaneous coronary, Coronary artery disease, global health challenge, artery disease poses
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition holds pivotal importance in averting and identifying PCI complications that lead to in-stent restenosis. Here we proposed a novel three-dimensional (3D) distance-color-coded assessment (DccA)for PCI stent apposition via deep-learning-based 3D multi-object segmentation in intravascular optical coherence tomography (IV-OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV-OCT images, using a spatial matching network and dual-layer training with style transfer. It quantifies and maps stent-lumen distances into a 3D color space, facilitating 3D visual assessment of PCI stent apposition. Achieving over 95% segmentation precision, our proposed DccA enhances clinical evaluation of PCI stent deployment and supports personalized treatment planning.

[CV-81] ResAD: A Simple Framework for Class Generalizable Anomaly Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.20047
作者: Xincheng Yao,Zixin Chen,Chao Gao,Guangtao Zhai,Chongyang Zhang
关键词-EN: feature, feature distribution, residual feature distribution, residual features, target data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper was accepted as a spotlight papaer by NeurIPS 2024

点击查看摘要

Abstract:This paper explores the problem of class-generalizable anomaly detection, where the objective is to train one unified AD model that can generalize to detect anomalies in diverse classes from different domains without any retraining or fine-tuning on the target data. Because normal feature representations vary significantly across classes, this will cause the widely studied one-for-one AD models to be poorly classgeneralizable (i.e., performance drops dramatically when used for new classes). In this work, we propose a simple but effective framework (called ResAD) that can be directly applied to detect anomalies in new classes. Our main insight is to learn the residual feature distribution rather than the initial feature distribution. In this way, we can significantly reduce feature variations. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. Therefore, the learned model can be directly adapted to new classes. ResAD consists of three components: (1) a Feature Converter that converts initial features into residual features; (2) a simple and shallow Feature Constraintor that constrains normal residual features into a spatial hypersphere for further reducing feature variations and maintaining consistency in feature scales among different classes; (3) a Feature Distribution Estimator that estimates the normal residual feature distribution, anomalies can be recognized as out-of-distribution. Despite the simplicity, ResAD can achieve remarkable anomaly detection results when directly used in new classes. The code is available at this https URL.

[CV-82] owards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation

链接: https://arxiv.org/abs/2410.20026
作者: Hao Ding,Yuqian Zhang,Hongchao Shu,Xu Lian,Ji Woong Kim,Axel Krieger,Mathias Unberath
关键词-EN: surgical data science, enabling high-level surgical, Surgical phase recognition, Surgical phase, surgical phase directly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Purpose: Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set, resulting in poor generalizability. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm – an intermediary layer to separate high-level analysis (SPR) from low-level processing (geometric understanding). This approach takes advantage of the recent vision foundation models that ensure reliable low-level scene understanding to craft DT-based scene representations that support various high-level tasks. Methods: We present a DT-based framework for SPR from videos. The framework employs vision foundation models to extract representations. We embed the representation in place of raw video inputs in the state-of-the-art Surgformer model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Results: Contrary to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 51.1 on the challenging CRCD dataset, 96.0 on an internal robotics training dataset, and 64.4 on a highly corrupted Cholec80 test set. Conclusion: Our findings lend support to the thesis that DT-based scene representations are effective in enhancing model robustness. Future work will seek to improve the feature informativeness, automate feature extraction, and incorporate interpretability for a more comprehensive framework. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.20026 [cs.CV] (or arXiv:2410.20026v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.20026 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hao Ding [view email] [v1] Sat, 26 Oct 2024 00:49:06 UTC (1,665 KB)

[CV-83] Unsupervised Machine Learning for Detecting and Locating Human-Made Objects in 3D Point Cloud

链接: https://arxiv.org/abs/2410.20006
作者: Hong Zhao,Huyunting Huang,Tonglin Zhang,Baijian Yang,Jin Wei-Kocsis,Songlin Fei
关键词-EN: ground filtering, ground filtering stage, ground, typically collected, geological region
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A 3D point cloud is an unstructured, sparse, and irregular dataset, typically collected by airborne LiDAR systems over a geological region. Laser pulses emitted from these systems reflect off objects both on and above the ground, resulting in a dataset containing the longitude, latitude, and elevation of each point, as well as information about the corresponding laser pulse strengths. A widely studied research problem, addressed in many previous works, is ground filtering, which involves partitioning the points into ground and non-ground subsets. This research introduces a novel task: detecting and identifying human-made objects amidst natural tree structures. This task is performed on the subset of non-ground points derived from the ground filtering stage. Marked Point Fields (MPFs) are used as models well-suited to these tasks. The proposed methodology consists of three stages: ground filtering, local information extraction (LIE), and clustering. In the ground filtering stage, a statistical method called One-Sided Regression (OSR) is introduced, addressing the limitations of prior ground filtering methods on uneven terrains. In the LIE stage, unsupervised learning methods are lacking. To mitigate this, a kernel-based method for the Hessian matrix of the MPF is developed. In the clustering stage, the Gaussian Mixture Model (GMM) is applied to the results of the LIE stage to partition the non-ground points into trees and human-made objects. The underlying assumption is that LiDAR points from trees exhibit a three-dimensional distribution, while those from human-made objects follow a two-dimensional distribution. The Hessian matrix of the MPF effectively captures this distinction. Experimental results demonstrate that the proposed ground filtering method outperforms previous techniques, and the LIE method successfully distinguishes between points representing trees and human-made objects.

[CV-84] A-MFST: Adaptive Multi-Flow Sparse Tracker for Real-Time Tissue Tracking Under Occlusion

链接: https://arxiv.org/abs/2410.19996
作者: Yuxin Chen,Zijian Wu,Adam Schmidt,Septimiu E. Salcudean
关键词-EN: Efficient Neural Depth, Sparse Efficient Neural, robot-assisted surgery, critical for downstream, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 12 pages, 6 figures. Submitted to IPCAI2025

点击查看摘要

Abstract:Purpose: Tissue tracking is critical for downstream tasks in robot-assisted surgery. The Sparse Efficient Neural Depth and Deformation (SENDD) model has previously demonstrated accurate and real-time sparse point tracking, but struggled with occlusion handling. This work extends SENDD to enhance occlusion detection and tracking consistency while maintaining real-time performance. Methods: We use the Segment Anything Model2 (SAM2) to detect and mask occlusions by surgical tools, and we develop and integrate into SENDD an Adaptive Multi-Flow Sparse Tracker (A-MFST) with forward-backward consistency metrics, to enhance occlusion and uncertainty estimation. A-MFST is an unsupervised variant of the Multi-Flow Dense Tracker (MFT). Results: We evaluate our approach on the STIR dataset and demonstrate a significant improvement in tracking accuracy under occlusion, reducing average tracking errors by 12 percent in Mean Endpoint Error (MEE) and showing a 6 percent improvement in the averaged accuracy over thresholds of 4, 8, 16, 32, and 64 pixels. The incorporation of forward-backward consistency further improves the selection of optimal tracking paths, reducing drift and enhancing robustness. Notably, these improvements were achieved without compromising the model’s real-time capabilities. Conclusions: Using A-MFST and SAM2, we enhance SENDD’s ability to track tissue in real time under instrument and tissue occlusions.

[CV-85] urn-by-Turn Indoor Navigation for the Visually Impaired

链接: https://arxiv.org/abs/2410.19954
作者: Santosh Srinivasaiah,Sai Kumar Nekkanti,Rohith Reddy Nedhunuri
关键词-EN: environments presents significant, presents significant challenges, visually impaired individuals, impaired individuals due, Navigating indoor environments
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Navigating indoor environments presents significant challenges for visually impaired individuals due to complex layouts and the absence of GPS signals. This paper introduces a novel system that provides turn-by-turn navigation inside buildings using only a smartphone equipped with a camera, leveraging multimodal models, deep learning algorithms, and large language models (LLMs). The smartphone’s camera captures real-time images of the surroundings, which are then sent to a nearby Raspberry Pi capable of running on-device LLM models, multimodal models, and deep learning algorithms to detect and recognize architectural features, signage, and obstacles. The interpreted visual data is then translated into natural language instructions by an LLM running on the Raspberry Pi, which is sent back to the user, offering intuitive and context-aware guidance via audio prompts. This solution requires minimal workload on the user’s device, preventing it from being overloaded and offering compatibility with all types of devices, including those incapable of running AI models. This approach enables the client to not only run advanced models but also ensure that the training data and other information do not leave the building. Preliminary evaluations demonstrate the system’s effectiveness in accurately guiding users through complex indoor spaces, highlighting its potential for widespread application

[CV-86] A Multimodal Approach For Endoscopic VCE Image Classification Using BiomedCLIP-PubMedBERT

链接: https://arxiv.org/abs/2410.19944
作者: Nagarajan Ganapathy,Podakanti Satyajith Chary,Teja Venkata Ramana Kumar Pithani,Pavan Kavati,Arun Kumar S
关键词-EN: Video Capsule Endoscopy, Capsule Endoscopy, Video Capsule, Paper presents, enhance diagnostic efficiency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 Pages, 2 Figures, Capsule Vision 2024 Challenge

点击查看摘要

Abstract:This Paper presents an advanced approach for fine-tuning BiomedCLIP PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule Endoscopy (VCE) frames, aiming to enhance diagnostic efficiency in gastrointestinal healthcare. By integrating the PubMedBERT language model with a Vision Transformer (ViT) to process endoscopic images, our method categorizes images into ten specific classes: angioectasia, bleeding, erosion, erythema, foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Our workflow incorporates image preprocessing and fine-tunes the BiomedCLIP model to generate high-quality embeddings for both visual and textual inputs, aligning them through similarity scoring for classification. Performance metrics, including classification, accuracy, recall, and F1 score, indicate the models strong ability to accurately identify abnormalities in endoscopic frames, showing promise for practical use in clinical diagnostics.

[CV-87] racking and triangulating firefly flashes in field recordings

链接: https://arxiv.org/abs/2410.19932
作者: Raphael Sarfati
关键词-EN: Identifying firefly flashes, Identifying firefly, bright features, features in nature, Identifying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Identifying firefly flashes from other bright features in nature images is complicated. I provide a training dataset and trained neural networks for reliable flash classification. The training set consists of thousands of cropped images (patches) extracted by manual labeling from video recordings of fireflies in their natural habitat. The trained network appears as considerably more reliable to differentiate flashes from other sources of light compared to traditional methods relying solely on intensity thresholding. This robust tracking enables a new calibration-free method for the 3D reconstruction of flash occurrences from stereoscopic 360-degree videos, which I also present here.

[CV-88] Exploring Self-Supervised Learning with U-Net Masked Autoencoders and EfficientNet B7 for Improved Classification

链接: https://arxiv.org/abs/2410.19899
作者: Vamshi Krishna Kancharla,Pavan Kumar Kaveti
关键词-EN: reconstruct original images, present a self-supervised, original images, designed to reconstruct, reconstruct original
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Capsule Vision 2024 Challenge

点击查看摘要

Abstract:We present a self-supervised U-Net-based masked autoencoder and noise removal model designed to reconstruct original images. Once adequately trained, this model extracts high-level features, which are then combined with features from the EfficientNet B7 model. These integrated features are subsequently fed into dense layers for classification. Among the approaches of masked input and Gaussian noise removal, we selected the best U-Net reconstruction model. Additionally, we explored various configurations, including EfficientNet with attention, attention fusion of the autoencoder, and classification utilizing U-Net encoder features. The best performance was achieved with EfficientNet B7 combined with U-Net encoder features. We employed the Adam optimizer with a learning rate of 0.0001, achieving a top accuracy of 0.94 on the validation set.

[CV-89] FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

链接: https://arxiv.org/abs/2410.19896
作者: Naga VS Raviteja Chappa,Page Daniel Dobbs,Bhiksha Raj,Khoa Luu
关键词-EN: Semantic Hierarchical Fusion, poses significant challenges, Adaptive Semantic Hierarchical, monitoring and intervention, Flow-Attention Adaptive Semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review at International Journal of Computer Vision; 20 pages, 4 figures, 5 tables

点击查看摘要

Abstract:The proliferation of tobacco-related content on social media platforms poses significant challenges for public health monitoring and intervention. This paper introduces a novel multi-modal deep learning framework named Flow-Attention Adaptive Semantic Hierarchical Fusion (FLAASH) designed to analyze tobacco-related video content comprehensively. FLAASH addresses the complexities of integrating visual and textual information in short-form videos by leveraging a hierarchical fusion mechanism inspired by flow network theory. Our approach incorporates three key innovations, including a flow-attention mechanism that captures nuanced interactions between visual and textual modalities, an adaptive weighting scheme that balances the contribution of different hierarchical levels, and a gating mechanism that selectively emphasizes relevant features. This multi-faceted approach enables FLAASH to effectively process and analyze diverse tobacco-related content, from product showcases to usage scenarios. We evaluate FLAASH on the Multimodal Tobacco Content Analysis Dataset (MTCAD), a large-scale collection of tobacco-related videos from popular social media platforms. Our results demonstrate significant improvements over existing methods, outperforming state-of-the-art approaches in classification accuracy, F1 score, and temporal consistency. The proposed method also shows strong generalization capabilities when tested on standard video question-answering datasets, surpassing current models. This work contributes to the intersection of public health and artificial intelligence, offering an effective tool for analyzing tobacco promotion in digital media.

[CV-90] opology-aware Mamba for Crack Segmentation in Structures

链接: https://arxiv.org/abs/2410.19894
作者: Xin Zuo,Yu Sheng,Jifeng Shen,Yongwei Shan
关键词-EN: Convolutional Neural Network, Traditional Convolutional Neural, Mamba-based model, health of infrastructure, efficient and accurate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at Journal of Automation in Construction

点击查看摘要

Abstract:CrackMamba, a Mamba-based model, is designed for efficient and accurate crack segmentation for monitoring the structural health of infrastructure. Traditional Convolutional Neural Network (CNN) models struggle with limited receptive fields, and while Vision Transformers (ViT) improve segmentation accuracy, they are computationally intensive. CrackMamba addresses these challenges by utilizing the VMambaV2 with pre-trained ImageNet-1k weights as the encoder and a newly designed decoder for better performance. To handle the random and complex nature of crack development, a Snake Scan module is proposed to reshape crack feature sequences, enhancing feature extraction. Additionally, the three-branch Snake Conv VSS (SCVSS) block is proposed to target cracks more effectively. Experiments show that CrackMamba achieves state-of-the-art (SOTA) performance on the CrackSeg9k and SewerCrack datasets, and demonstrates competitive performance on the retinal vessel segmentation dataset CHASE\underline~DB1, highlighting its generalization capability. The code is publicly available at: this https URL.

[CV-91] A Survey of AI-Generated Video Evaluation

链接: https://arxiv.org/abs/2410.19884
作者: Xiao Liu,Xinhao Xiang,Zizhong Li,Yongheng Wang,Zhuoheng Li,Zhuosheng Liu,Weidi Zhang,Weiqi Ye,Jiawei Zhang
关键词-EN: brought forward significant, forward significant challenges, generating video content, video content, growing capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The growing capabilities of AI in generating video content have brought forward significant challenges in effectively evaluating these videos. Unlike static images or text, video content involves complex spatial and temporal dynamics which may require a more comprehensive and systematic evaluation of its contents in aspects like video presentation quality, semantic information delivery, alignment with human intentions, and the virtual-reality consistency with our physical world. This survey identifies the emerging field of AI-Generated Video Evaluation (AIGVE), highlighting the importance of assessing how well AI-generated videos align with human perception and meet specific instructions. We provide a structured analysis of existing methodologies that could be potentially used to evaluate AI-generated videos. By outlining the strengths and gaps in current approaches, we advocate for the development of more robust and nuanced evaluation frameworks that can handle the complexities of video content, which include not only the conventional metric-based evaluations, but also the current human-involved evaluations, and the future model-centered evaluations. This survey aims to establish a foundational knowledge base for both researchers from academia and practitioners from the industry, facilitating the future advancement of evaluation methods for AI-generated video content.

[CV-92] Radar and Camera Fusion for Object Detection and Tracking: A Comprehensive Survey

链接: https://arxiv.org/abs/2410.19872
作者: Kun Shi,Shibo He,Zhenyu Shi,Anjun Chen,Zehui Xiong,Jiming Chen,Jun Luo
关键词-EN: Multi-modal fusion, reliable object detection, complex environments, implementation of reliable, radar-camera fusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal fusion is imperative to the implementation of reliable object detection and tracking in complex environments. Exploiting the synergy of heterogeneous modal information endows perception systems the ability to achieve more comprehensive, robust, and accurate performance. As a nucleus concern in wireless-vision collaboration, radar-camera fusion has prompted prospective research directions owing to its extensive applicability, complementarity, and compatibility. Nonetheless, there still lacks a systematic survey specifically focusing on deep fusion of radar and camera for object detection and tracking. To fill this void, we embark on an endeavor to comprehensively review radar-camera fusion in a holistic way. First, we elaborate on the fundamental principles, methodologies, and applications of radar-camera fusion perception. Next, we delve into the key techniques concerning sensor calibration, modal representation, data alignment, and fusion operation. Furthermore, we provide a detailed taxonomy covering the research topics related to object detection and tracking in the context of radar and camera this http URL, we discuss the emerging perspectives in the field of radar-camera fusion perception and highlight the potential areas for future research.

[CV-93] Comparing YOLO11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard environment

链接: https://arxiv.org/abs/2410.19869
作者: Ranjan Sapkota,Manoj Karkee
关键词-EN: comprehensive performance evaluation, instance segmentation capabilities, study conducted, conducted a comprehensive, comprehensive performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 Pages, 10 Figures, 3 Tables

点击查看摘要

Abstract:This study conducted a comprehensive performance evaluation on YOLO11 and YOLOv8, the latest in the “You Only Look Once” (YOLO) series, focusing on their instance segmentation capabilities for immature green apples in orchard environments. YOLO11n-seg achieved the highest mask precision across all categories with a notable score of 0.831, highlighting its effectiveness in fruit detection. YOLO11m-seg and YOLO11l-seg excelled in non-occluded and occluded fruitlet segmentation with scores of 0.851 and 0.829, respectively. Additionally, YOLO11x-seg led in mask recall for all categories, achieving a score of 0.815, with YOLO11m-seg performing best for non-occluded immature green fruitlets at 0.858 and YOLOv8x-seg leading the occluded category with 0.800. In terms of mean average precision at a 50% intersection over union (mAP@50), YOLO11m-seg consistently outperformed, registering the highest scores for both box and mask segmentation, at 0.876 and 0.860 for the “All” class and 0.908 and 0.909 for non-occluded immature fruitlets, respectively. YOLO11l-seg and YOLOv8l-seg shared the top box mAP@50 for occluded immature fruitlets at 0.847, while YOLO11m-seg achieved the highest mask mAP@50 of 0.810. Despite the advancements in YOLO11, YOLOv8n surpassed its counterparts in image processing speed, with an impressive inference speed of 3.3 milliseconds, compared to the fastest YOLO11 series model at 4.8 milliseconds, underscoring its suitability for real-time agricultural applications related to complex green fruit environments.

[CV-94] Breaking the Illusion: Real-world Challenges for Adversarial Patches in Object Detection

链接: https://arxiv.org/abs/2410.19863
作者: Jakob Shack,Katarina Petrovic,Olga Saukh
关键词-EN: computer vision applications, Adversarial attacks pose, pose a significant, significant threat, robustness and reliability
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: - 21 pages, 17 figures, 7 tables - accepted in 1st Workshop on Enabling Machine Learning Operations for next-Gen Embedded Wireless Networked Devices (EMERGE), 2024

点击查看摘要

Abstract:Adversarial attacks pose a significant threat to the robustness and reliability of machine learning systems, particularly in computer vision applications. This study investigates the performance of adversarial patches for the YOLO object detection network in the physical world. Two attacks were tested: a patch designed to be placed anywhere within the scene - global patch, and another patch intended to partially overlap with specific object targeted for removal from detection - local patch. Various factors such as patch size, position, rotation, brightness, and hue were analyzed to understand their impact on the effectiveness of the adversarial patches. The results reveal a notable dependency on these parameters, highlighting the challenges in maintaining attack efficacy in real-world conditions. Learning to align digitally applied transformation parameters with those measured in the real world still results in up to a 64% discrepancy in patch performance. These findings underscore the importance of understanding environmental influences on adversarial attacks, which can inform the development of more robust defenses for practical machine learning applications.

[CV-95] YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning

链接: https://arxiv.org/abs/2410.19846
作者: Ranjan Sapkota,Manoj Karkee
关键词-EN: algorithm alongside Vision, alongside Vision Transformers, Dense Prediction Transformer, alongside Vision, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 Pages, 13 Figures, 1 Table

点击查看摘要

Abstract:In this study, a robust method for 3D pose estimation of immature green apples (fruitlets) in commercial orchards was developed, utilizing the YOLO11 object detection and pose estimation algorithm alongside Vision Transformers (ViT) for depth estimation (Dense Prediction Transformer (DPT) and Depth Anything V2). For object detection and pose estimation, performance comparisons of YOLO11 (YOLO11n, YOLO11s, YOLO11m, YOLO11l and YOLO11x) and YOLOv8 (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x) were made under identical hyperparameter settings among the all configurations. It was observed that YOLO11n surpassed all configurations of YOLO11 and YOLOv8 in terms of box precision and pose precision, achieving scores of 0.91 and 0.915, respectively. Conversely, YOLOv8n exhibited the highest box and pose recall scores of 0.905 and 0.925, respectively. Regarding the mean average precision at 50% intersection over union (mAP@50), YOLO11s led all configurations with a box mAP@50 score of 0.94, while YOLOv8n achieved the highest pose mAP@50 score of 0.96. In terms of image processing speed, YOLO11n outperformed all configurations with an impressive inference speed of 2.7 ms, significantly faster than the quickest YOLOv8 configuration, YOLOv8n, which processed images in 7.8 ms. Subsequent integration of ViTs for the green fruit’s pose depth estimation revealed that Depth Anything V2 outperformed Dense Prediction Transformer in 3D pose length validation, achieving the lowest Root Mean Square Error (RMSE) of 1.52 and Mean Absolute Error (MAE) of 1.28, demonstrating exceptional precision in estimating immature green fruit lengths. Integration of YOLO11 and Depth Anything Model provides a promising solution to 3D pose estimation of immature green fruits for robotic thinning applications.

[CV-96] Scene-Segmentation-Based Exposure Compensation for Tone Mapping of High Dynamic Range Scenes

链接: https://arxiv.org/abs/2410.19839
作者: Yuma Kinoshita,Hitoshi Kiya
关键词-EN: MEF-based tone mapping, tone mapping, input HDR image, tone, exposure compensation
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: to be presented in APSIPA ASC 2024

点击查看摘要

Abstract:We propose a novel scene-segmentation-based exposure compensation method for multi-exposure image fusion (MEF) based tone mapping. The aim of MEF-based tone mapping is to display high dynamic range (HDR) images on devices with limited dynamic range. To achieve this, this method generates a stack of differently exposed images from an input HDR image and fuses them into a single image. Our approach addresses the limitations of MEF-based tone mapping with existing segmentation-based exposure compensation, which often result in visually unappealing outcomes due to inappropriate exposure value selection. The proposed exposure compensation method first segments the input HDR image into subregions based on luminance values of pixels. It then determines exposure values for multi-exposure images to maximize contrast between regions while preserving relative luminance relationships. This approach contrasts with conventional methods that may invert luminance relationships or compromise contrast between regions. Additionally, we present an improved technique for calculating fusion weights to better reflect the effects of exposure compensation in the final fused image. In a simulation experiment to evaluate the quality of tone-mapped images, the MEF-based tone mapping with the proposed method outperforms three typical tone mapping methods including conventional MEF-based one, in terms of the tone mapped image quality index (TMQI).

[CV-97] Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

链接: https://arxiv.org/abs/2410.19836
作者: Ronan Docherty,Antonis Vamvakeros,Samuel J. Cooper
关键词-EN: self-supervised vision transformers, positional information relevant, vision transformers, self-supervised vision, semantic and positional
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The features of self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation. Recent works combine these features with traditional methods like clustering, graph partitioning or region correlations to achieve impressive baselines without finetuning or training additional networks. We leverage upsampled features from ViT networks (e.g DINOv2) in two workflows: in a clustering based approach for object localization and segmentation, and paired with standard classifiers in weakly supervised materials segmentation. Both show strong performance on benchmarks, especially in weakly supervised segmentation where the ViT features capture complex relationships inaccessible to classical approaches. We expect the flexibility and generalizability of these features will both speed up and strengthen materials characterization, from segmentation to property-prediction.

[CV-98] GL-NeRF: Gauss-Laguerre Quadrature Enables Training-Free NeRF Acceleration NEURIPS2024

链接: https://arxiv.org/abs/2410.19831
作者: Silong Yong,Yaqi Xie,Simon Stepputtis,Katia Sycara
关键词-EN: inherently time-consuming due, neural radiance fields, number of MLP, MLP calls, sampled per ray
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:Volume rendering in neural radiance fields is inherently time-consuming due to the large number of MLP calls on the points sampled per ray. Previous works would address this issue by introducing new neural networks or data structures. In this work, We propose GL-NeRF, a new perspective of computing volume rendering with the Gauss-Laguerre quadrature. GL-NeRF significantly reduces the number of MLP calls needed for volume rendering, introducing no additional data structures or neural networks. The simple formulation makes adopting GL-NeRF in any NeRF model possible. In the paper, we first justify the use of the Gauss-Laguerre quadrature and then demonstrate this plug-and-play attribute by implementing it in two different NeRF models. We show that with a minimal drop in performance, GL-NeRF can significantly reduce the number of MLP calls, showing the potential to speed up any NeRF model.

[CV-99] Automating Video Thumbnails Selection and Generation with Multimodal and Multistage Analysis

链接: https://arxiv.org/abs/2410.19825
作者: Elia Fantini
关键词-EN: traditional broadcast content, broadcast content, thesis presents, presents an innovative, selection for traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 150 pages, 60 figures

点击查看摘要

Abstract:This thesis presents an innovative approach to automate video thumbnail selection for traditional broadcast content. Our methodology establishes stringent criteria for diverse, representative, and aesthetically pleasing thumbnails, considering factors like logo placement space, incorporation of vertical aspect ratios, and accurate recognition of facial identities and emotions. We introduce a sophisticated multistage pipeline that can select candidate frames or generate novel images by blending video elements or using diffusion models. The pipeline incorporates state-of-the-art models for various tasks, including downsampling, redundancy reduction, automated cropping, face recognition, closed-eye and emotion detection, shot scale and aesthetic prediction, segmentation, matting, and harmonization. It also leverages large language models and visual transformers for semantic consistency. A GUI tool facilitates rapid navigation of the pipeline’s output. To evaluate our method, we conducted comprehensive experiments. In a study of 69 videos, 53.6% of our proposed sets included thumbnails chosen by professional designers, with 73.9% containing similar images. A survey of 82 participants showed a 45.77% preference for our method, compared to 37.99% for manually chosen thumbnails and 16.36% for an alternative method. Professional designers reported a 3.57-fold increase in valid candidates compared to the alternative method, confirming that our approach meets established criteria. In conclusion, our findings affirm that the proposed method accelerates thumbnail creation while maintaining high-quality standards and fostering greater user engagement.

[CV-100] Flame quality monitoring of flare stack based on deep visual features

链接: https://arxiv.org/abs/2410.19823
作者: Xing Mu
关键词-EN: Flare stacks play, fossil energy plants, petroleum fossil energy, Flare stacks, energy plants
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 9 figures, 2 tables, International Conference on Computer Information Science and Artificial Intelligence(accepted)

点击查看摘要

Abstract:Flare stacks play an important role in the treatment of waste gas and waste materials in petroleum fossil energy plants. Monitoring the efficiency of flame combustion is of great significance for environmental protection. The traditional method of monitoring with sensors is not only expensive, but also easily damaged in harsh combustion environments. In this paper, we propose to monitor the quality of flames using only visual features, including the area ratio of flame to smoke, RGB information of flames, angle of flames and other features. Comprehensive use of image segmentation, target detection, target tracking, principal component analysis, GPT-4 and other methods or tools to complete this task. In the end, real-time monitoring of the picture can be achieved, and when the combustion efficiency is low, measures such as adjusting the ratio of air and waste can be taken in time. As far as we know, the method of this paper is relatively innovative and has industrial production value.

[CV-101] Explainable AI in Handwriting Detection for Dyslexia Using Transfer Learning

链接: https://arxiv.org/abs/2410.19821
作者: Mahmoud Robaa,Mazen Balat,Rewaa Awaad,Esraa Omar,Salah A. Aly
关键词-EN: common learning disorders, characterized by distinct, learning disorders, common learning, distinct features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 4 figures, JAC-ECC Conference

点击查看摘要

Abstract:Dyslexia is one of the most common learning disorders, often characterized by distinct features in handwriting. Early detection is essential for effective intervention. In this paper, we propose an explainable AI (XAI) framework for dyslexia detection through handwriting analysis, utilizing transfer learning and transformer-based models. Our approach surpasses state-of-the-art methods, achieving a test accuracy of 0.9958, while ensuring model interpretability through Grad-CAM visualizations that highlight the critical handwriting features influencing model decisions. The main contributions of this work include the integration of XAI for enhanced interpretability, adaptation to diverse languages and writing systems, and demonstration of the method’s global applicability. This framework not only improves diagnostic accuracy but also fosters trust and understanding among educators, clinicians, and parents, supporting earlier diagnoses and the development of personalized educational strategies.

[CV-102] DivShift: Exploring Domain-Specific Distribution Shift in Volunteer-Collected Biodiversity Datasets NEURIPS2024

链接: https://arxiv.org/abs/2410.19816
作者: Elena Sierra,Lauren E. Gillespie,Salim Soltani,Moises Exposito-Alonso,Teja Kattenborn
关键词-EN: negatively impacting, North American West, American West Coast, impacting the world, biodiversity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning

点击查看摘要

Abstract:Climate change is negatively impacting the world’s biodiversity. To build automated systems to monitor these negative biodiversity impacts, large-scale, volunteer-collected datasets like iNaturalist are built from community-identified, natural imagery. However, such volunteer-based data are opportunistic and lack a structured sampling strategy, resulting in geographic, temporal, observation quality, and socioeconomic, biases that stymie uptake of these models for downstream biodiversity monitoring tasks. Here we introduce DivShift North American West Coast (DivShift-NAWC), a curated dataset of almost 8 million iNaturalist plant images across the western coast of North America, for exploring the effects of these biases on deep learning model performance. We compare model performance across four known biases and observe that they indeed confound model performance. We suggest practical strategies for curating datasets to train deep learning models for monitoring climate change’s impacts on the world’s biodiversity.

[CV-103] Stochastic Flow Matching for Resolving Small-Scale Physics

链接: https://arxiv.org/abs/2410.19814
作者: Stathi Fotiadis,Noah Brenowitz,Tomas Geffner,Yair Cohen,Michael Pritchard,Arash Vahdat,Morteza Mardani
关键词-EN: partial differential equations, distinct partial differential, poses significant challenges, significant challenges due, details poses significant
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
*备注: 31 pages

点击查看摘要

Abstract:Conditioning diffusion and flow models have proven effective for super-resolving small-scale details in natural this http URL, in physical sciences such as weather, super-resolving small-scale details poses significant challenges due to: (i) misalignment between input and output distributions (i.e., solutions to distinct partial differential equations (PDEs) follow different trajectories), (ii) multi-scale dynamics, deterministic dynamics at large scales vs. stochastic at small scales, and (iii) limited data, increasing the risk of overfitting. To address these challenges, we propose encoding the inputs to a latent base distribution that is closer to the target distribution, followed by flow matching to generate small-scale physics. The encoder captures the deterministic components, while flow matching adds stochastic small-scale details. To account for uncertainty in the deterministic part, we inject noise into the encoder output using an adaptive noise scaling mechanism, which is dynamically adjusted based on maximum-likelihood estimates of the encoder predictions. We conduct extensive experiments on both the real-world CWA weather dataset and the PDE-based Kolmogorov dataset, with the CWA task involving super-resolving the weather variables for the region of Taiwan from 25 km to 2 km scales. Our results show that the proposed stochastic flow matching (SFM) framework significantly outperforms existing methods such as conditional diffusion and flows.

[CV-104] Comparing Surface Landmine Object Detection Models on a New Drone Flyby Dataset

链接: https://arxiv.org/abs/2410.19807
作者: Navin Agrawal-Chung,Zohran Moin
关键词-EN: surface landmine detection, methods is slow, dangerous and prohibitively, prohibitively expensive, Landmine detection
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 22 figures, 7 tables

点击查看摘要

Abstract:Landmine detection using traditional methods is slow, dangerous and prohibitively expensive. Using deep learning-based object detection algorithms drone videos is promising but has multiple challenges due to the small, soda-can size of recently prevalent surface landmines. The literature currently lacks scientific evaluation of optimal ML models for this problem since most object detection research focuses on analysis of ground video surveillance images. In order to help train comprehensive models and drive research for surface landmine detection, we first create a custom dataset comprising drone images of POM-2 and POM-3 Russian surface landmines. Using this dataset, we train, test and compare 4 different computer vision foundation models YOLOF, DETR, Sparse-RCNN and VFNet. Generally, all 4 detectors do well with YOLOF outperforming other models with a mAP score of 0.89 while DETR, VFNET and Sparse-RCNN mAP scores are all around 0.82 for drone images taken from 10m AGL. YOLOF is also quicker to train consuming 56min of training time on a Nvidia V100 compute cluster. Finally, this research contributes landmine image, video datasets and model Jupyter notebooks at this https URL to enable future research in surface landmine detection.

[CV-105] Stable Diffusion with Continuous-time Neural Network

链接: https://arxiv.org/abs/2410.19798
作者: Andras Horvath
关键词-EN: Stable diffusion models, exhibiting unparalleled performance, Stable diffusion, exhibiting unparalleled, models have ushered
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Stable diffusion models have ushered in a new era of advancements in image generation, currently reigning as the state-of-the-art approach, exhibiting unparalleled performance. The process of diffusion, accompanied by denoising through iterative convolutional or transformer network steps, stands at the core of their implementation. Neural networks operating in continuous time naturally embrace the concept of diffusion, this way they could enable more accurate and energy efficient implementation. Within the confines of this paper, my focus delves into an exploration and demonstration of the potential of celllular neural networks in image generation. I will demonstrate their superiority in performance, showcasing their adeptness in producing higher quality images and achieving quicker training times in comparison to their discrete-time counterparts on the commonly cited MNIST dataset. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.19798 [cs.CV] (or arXiv:2410.19798v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.19798 Focus to learn more arXiv-issued DOI via DataCite

[CV-106] Feature Clipping for Uncertainty Calibration

链接: https://arxiv.org/abs/2410.19796
作者: Linwei Tao,Minjing Dong,Chang Xu
关键词-EN: Deep neural networks, reliable uncertainty estimates, ensuring reliable uncertainty, Deep neural, achieved significant success
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved significant success across various tasks, but ensuring reliable uncertainty estimates, known as model calibration, is crucial for their safe and effective deployment. Modern DNNs often suffer from overconfidence, leading to miscalibration. We propose a novel post-hoc calibration method called feature clipping (FC) to address this issue. FC involves clipping feature values to a specified threshold, effectively increasing entropy in high calibration error samples while maintaining the information in low calibration error samples. This process reduces the overconfidence in predictions, improving the overall calibration of the model. Our extensive experiments on datasets such as CIFAR-10, CIFAR-100, and ImageNet, and models including CNNs and transformers, demonstrate that FC consistently enhances calibration performance. Additionally, we provide a theoretical analysis that validates the effectiveness of our method. As the first calibration technique based on feature modification, feature clipping offers a novel approach to improving model calibration, showing significant improvements over both post-hoc and train-time calibration methods and pioneering a new avenue for feature-based model calibration.

[CV-107] DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks

链接: https://arxiv.org/abs/2410.19794
作者: Zohreh Aghababaeyan,Manel Abdellatif,Lionel Briand,Ramesh S
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, deployed across applications, increasingly deployed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) are increasingly deployed across applications. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between models, especially with limited test datasets, making it difficult to select or combine models effectively. Differential testing addresses this by generating test inputs that expose discrepancies in DNN model behavior. However, existing approaches face significant limitations: many rely on model internals or are constrained by available seed inputs. To address these challenges, we propose DiffGAN, a black-box test image generation approach for differential testing of DNN models. DiffGAN leverages a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to generate diverse and valid triggering inputs that reveal behavioral discrepancies between models. DiffGAN employs two custom fitness functions, focusing on diversity and divergence, to guide the exploration of the GAN input space and identify discrepancies between models’ outputs. By strategically searching this space, DiffGAN generates inputs with specific features that trigger differences in model behavior. DiffGAN is black-box, making it applicable in more situations. We evaluate DiffGAN on eight DNN model pairs trained on widely used image datasets. Our results show DiffGAN significantly outperforms a SOTA baseline, generating four times more triggering inputs, with greater diversity and validity, within the same budget. Additionally, the generated inputs improve the accuracy of a machine learning-based model selection mechanism, which selects the best-performing model based on input characteristics and can serve as a smart output voting mechanism when using alternative models.

[CV-108] Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

链接: https://arxiv.org/abs/2410.19787
作者: Clement Wang,Antoine Debouchage,Valentin Goldité,Aurélien Wery,Jules Salzinger
关键词-EN: Leaf Area Index, Area Index, Leaf Area, understand ecosystem health, vegetation dynamics
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Leaf Area Index (LAI) is a critical parameter to understand ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it is comprised of several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R2 score on publicly available data. We make our contributions available at this https URL for future works to further improve on our current progress.

[CV-109] Resolution Enhancement of Under-sampled Photoacoustic Microscopy Images using Implicit Neural Representations

链接: https://arxiv.org/abs/2410.19786
作者: Youshen Xiao,Sheng Liao,Xuanyang Tian,Fan Zhang,Xinlong Dong,Yunhui Jiang,Xiyu Chen,Ruixi Sun,Yuyao Zhang,Fei Gao
关键词-EN: Point Spread Function, Spread Function, Point Spread, Acoustic-Resolution Photoacoustic Microscopy, PSF
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Acoustic-Resolution Photoacoustic Microscopy (AR-PAM) is promising for subcutaneous vascular imaging, but its spatial resolution is constrained by the Point Spread Function (PSF). Traditional deconvolution methods like Richardson-Lucy and model-based deconvolution use the PSF to improve resolution. However, accurately measuring the PSF is difficult, leading to reliance on less accurate blind deconvolution techniques. Additionally, AR-PAM suffers from long scanning times, which can be reduced via down-sampling, but this necessitates effective image recovery from under-sampled data, a task where traditional interpolation methods fall short, particularly at high under-sampling rates. To address these challenges, we propose an approach based on Implicit Neural Representations (INR). This method learns a continuous mapping from spatial coordinates to initial acoustic pressure, overcoming the limitations of discrete imaging and enhancing AR-PAM’s resolution. By treating the PSF as a learnable parameter within the INR framework, our technique mitigates inaccuracies associated with PSF estimation. We evaluated our method on simulated vascular data, showing significant improvements in Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) over conventional methods. Qualitative enhancements were also observed in leaf vein and in vivo mouse brain microvasculature images. When applied to a custom AR-PAM system, experiments with pencil lead demonstrated that our method delivers sharper, higher-resolution results, indicating its potential to advance photoacoustic microscopy.

[CV-110] How to Backdoor Consistency Models?

链接: https://arxiv.org/abs/2410.19785
作者: Chengen Wang,Murat Kantarcioglu
关键词-EN: Consistency models, Consistency, models, directly mapping noise, allowing for one-step
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Consistency models are a new class of models that generate images by directly mapping noise to data, allowing for one-step generation and significantly accelerating the sampling process. However, their robustness against adversarial attacks has not yet been thoroughly investigated. In this work, we conduct the first study on the vulnerability of consistency models to backdoor attacks. While previous research has explored backdoor attacks on diffusion models, these studies have primarily focused on conventional diffusion models, employing a customized backdoor training process and objective, whereas consistency models have distinct training processes and objectives. Our proposed framework demonstrates the vulnerability of consistency models to backdoor attacks. During image generation, poisoned consistency models produce images with a Fréchet Inception Distance (FID) comparable to that of a clean model when sampling from Gaussian noise. However, once the trigger is activated, they generate backdoor target images. We explore various trigger and target configurations to evaluate the vulnerability of consistency models, including the use of random noise as a trigger. This type of trigger is less conspicuous and aligns well with the sampling process of consistency models. Across all configurations, our framework successfully compromises the consistency models while maintaining high utility and specificity.

[CV-111] Enhancing Apples Defect Classification: Insights from Visible Spectrum and Narrow Spectral Band Imaging

链接: https://arxiv.org/abs/2410.19784
作者: Omar Coello,Moisés Coronel,Darío Carpio,Boris Vintimilla,Luis Chuquimarca
关键词-EN: food supply chain, mitigate economic losses, entire visible spectrum, visible spectrum, supply chain
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:This study addresses the classification of defects in apples as a crucial measure to mitigate economic losses and optimize the food supply chain. An innovative approach is employed that integrates images from the visible spectrum and 660 nm spectral wavelength to enhance accuracy and efficiency in defect classification. The methodology is based on the use of Single-Input and Multi-Inputs convolutional neural networks (CNNs) to validate the proposed strategies. Steps include image acquisition and preprocessing, classification model training, and performance evaluation. Results demonstrate that defect classification using the 660 nm spectral wavelength reveals details not visible in the entire visible spectrum. It is seen that the use of the appropriate spectral range in the classification process is slightly superior to the entire visible spectrum. The MobileNetV1 model achieves an accuracy of 98.80% on the validation dataset versus the 98.26% achieved using the entire visible spectrum. Conclusions highlight the potential to enhance the method by capturing images with specific spectral ranges using filters, enabling more effective network training for classification task. These improvements could further enhance the system’s capability to identify and classify defects in apples.

[CV-112] Data-Driven Uncertainty-Aware Forecasting of Sea Ice Conditions in the Gulf of Ob Based on Satellite Radar Imagery

链接: https://arxiv.org/abs/2410.19782
作者: Stefan Maria Ailuro,Anna Nedorubova,Timofey Grigoryev,Evgeny Burnaev,Vladimir Vanovskiy
关键词-EN: marine activity due, loss necessitates highly, ensure maritime safety, Arctic marine activity, ice loss necessitates
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increase in Arctic marine activity due to rapid warming and significant sea ice loss necessitates highly reliable, short-term sea ice forecasts to ensure maritime safety and operational efficiency. In this work, we present a novel data-driven approach for sea ice condition forecasting in the Gulf of Ob, leveraging sequences of radar images from Sentinel-1, weather observations, and GLORYS forecasts. Our approach integrates advanced video prediction models, originally developed for vision tasks, with domain-specific data preprocessing and augmentation techniques tailored to the unique challenges of Arctic sea ice dynamics. Central to our methodology is the use of uncertainty quantification to assess the reliability of predictions, ensuring robust decision-making in safety-critical applications. Furthermore, we propose a confidence-based model mixture mechanism that enhances forecast accuracy and model robustness, crucial for reliable operations in volatile Arctic environments. Our results demonstrate substantial improvements over baseline approaches, underscoring the importance of uncertainty quantification and specialized data handling for effective and safe operations and reliable forecasting.

[CV-113] Developing Gridded Emission Inventory from High-Resolution Satellite Object Detection for Improved Air Quality Forecasts

链接: https://arxiv.org/abs/2410.19773
作者: Shubham Ghosal,Manmeet Singh,Sachin Ghude,Harsh Kamath,Vaisakh SB,Subodh Wasekar,Anoop Mahajan,Hassan Dashtian,Zong-Liang Yang,Michael Young,Dev Niyogi
关键词-EN: Forecasting model coupled, coupled with Chemistry, satellite detectable resolution, based emission inventory, Forecasting model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study presents an innovative approach to creating a dynamic, AI based emission inventory system for use with the Weather Research and Forecasting model coupled with Chemistry (WRF Chem), designed to simulate vehicular and other anthropogenic emissions at satellite detectable resolution. The methodology leverages state of the art deep learning based computer vision models, primarily employing YOLO (You Only Look Once) architectures (v8 to v10) and T Rex, for high precision object detection. Through extensive data collection, model training, and finetuning, the system achieved significant improvements in detection accuracy, with F1 scores increasing from an initial 0.15 at 0.131 confidence to 0.72 at 0.414 confidence. A custom pipeline converts model outputs into netCDF files storing latitude, longitude, and vehicular count data, enabling real time processing and visualization of emission patterns. The resulting system offers unprecedented temporal and spatial resolution in emission estimates, facilitating more accurate short term air quality forecasts and deeper insights into urban emission dynamics. This research not only enhances WRF Chem simulations but also bridges the gap between AI technologies and atmospheric science methodologies, potentially improving urban air quality management and environmental policymaking. Future work will focus on expanding the system’s capabilities to non vehicular sources and further improving detection accuracy in challenging environmental conditions.

[CV-114] Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition

链接: https://arxiv.org/abs/2410.19766
作者: Yuxuan Weng,Guoquan Wu,Tianyue Zheng,Yanbing Yang,Jun Luo
关键词-EN: Human Activity Recognition, based Human Activity, requiring computer visions, Activity Recognition, Human Activity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Radio-Frequency (RF)-based Human Activity Recognition (HAR) rises as a promising solution for applications unamenable to techniques requiring computer visions. However, the scarcity of labeled RF data due to their non-interpretable nature poses a significant obstacle. Thanks to the recent breakthrough of foundation models (FMs), extracting deep semantic insights from unlabeled visual data become viable, yet these vision-based FMs fall short when applied to small RF datasets. To bridge this gap, we introduce FM-Fi, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems. FM-Fi involves a novel cross-modal contrastive knowledge distillation mechanism, enabling an RF encoder to inherit the interpretative power of FMs for achieving zero-shot learning. It also employs the intrinsic capabilities of FM and RF to remove extraneous features for better alignment between the two modalities. The framework is further refined through metric-based few-shot learning techniques, aiming to boost the performance for predefined HAR tasks. Comprehensive evaluations evidently indicate that FM-Fi rivals the effectiveness of vision-based methodologies, and the evaluation results provide empirical validation of FM-Fi’s generalizability across various environments.

[CV-115] Adaptive Real-Time Multi-Loss Function Optimization Using Dynamic Memory Fusion Framework: A Case Study on Breast Cancer Segmentation

链接: https://arxiv.org/abs/2410.19745
作者: Amin Golnari,Mostafa Diba
关键词-EN: highly effective tool, Deep learning, deep learning tasks, range of applications, highly effective
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has proven to be a highly effective tool for a wide range of applications, significantly when leveraging the power of multi-loss functions to optimize performance on multiple criteria simultaneously. However, optimal selection and weighting loss functions in deep learning tasks can significantly influence model performance, yet manual tuning of these functions is often inefficient and inflexible. We propose a novel framework called dynamic memory fusion for adaptive multi-loss function penalizing in real-time to address this. This framework leverages historical loss values data to dynamically adjust the weighting of multiple loss functions throughout the training process. Additionally, this framework integrates an auxiliary loss function to enhance model performance in the early stages. To further research horizons, we introduce the class-balanced dice loss function, designed to address class imbalance by prioritizing underrepresented classes. Experiments on breast ultrasound datasets demonstrate that the framework improves segmentation performance across various metrics. These results demonstrate the effectiveness of our proposed framework in ensuring that the model dynamically adjusts its focus to prioritize the most relevant criteria, leading to improved performance in evolving environments. The source code for our proposed methodology is publicly available on GitHub.

[CV-116] Less Cybersickness Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in Virtual Reality Apps

链接: https://arxiv.org/abs/2406.09313
作者: Shuqing Li,Cuiyun Gao,Jianping Zhang,Yujia Zhang,Yepang Liu,Jiazhen Gu,Yun Peng,Michael R. Lyu
关键词-EN: Graphical User Interface, Virtual Reality, quality of Virtual, SVI issues, User Interface
类目: oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: This work has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, Brazil. DOI: this https URL

点击查看摘要

Abstract:The quality of Virtual Reality (VR) apps is vital, particularly the rendering quality of the VR Graphical User Interface (GUI). Different from traditional 2D apps, VR apps create a 3D digital scene for users, by rendering two distinct 2D images for the user’s left and right eyes, respectively. Stereoscopic visual inconsistency (denoted as “SVI”) issues, however, undermine the rendering process of the user’s brain, leading to user discomfort and even adverse health effects. Such issues commonly exist but remain underexplored. We conduct an empirical analysis on 282 SVI bug reports from 15 VR platforms, summarizing 15 types of manifestations. The empirical analysis reveals that automatically detecting SVI issues is challenging, mainly because: (1) lack of training data; (2) the manifestations of SVI issues are diverse, complicated, and often application-specific; (3) most accessible VR apps are closed-source commercial software. Existing pattern-based supervised classification approaches may be inapplicable or ineffective in detecting the SVI issues. To counter these challenges, we propose an unsupervised black-box testing framework named StereoID to identify the stereoscopic visual inconsistencies, based only on the rendered GUI states. StereoID generates a synthetic right-eye image based on the actual left-eye image and computes distances between the synthetic right-eye image and the actual right-eye image to detect SVI issues. We propose a depth-aware conditional stereo image translator to power the image generation process, which captures the expected perspective shifts between left-eye and right-eye images. We build a large-scale unlabeled VR stereo screenshot dataset with larger than 171K images from 288 real-world VR apps for experiments. After substantial experiments, StereoID demonstrates superior performance for detecting SVI issues in both user reports and wild VR apps.

[CV-117] KaLDeX: Kalman Filter based Linear Deformable Cross Attention for Retina Vessel Segmentation

链接: https://arxiv.org/abs/2410.21160
作者: Zhihao Zhao,Shahrooz Faghihroohi,Yinzheng Zhao,Junjie Yang,Shipeng Zhong,Kai Huang,Nassir Navab,Boyang Li,M.Ali Nasseri
关键词-EN: accurate vascular segmentation, ophthalmic imaging, eye diseases, realm of ophthalmic, paramount for diagnosing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Background and Objective: In the realm of ophthalmic imaging, accurate vascular segmentation is paramount for diagnosing and managing various eye diseases. Contemporary deep learning-based vascular segmentation models rival human accuracy but still face substantial challenges in accurately segmenting minuscule blood vessels in neural network applications. Due to the necessity of multiple downsampling operations in the CNN models, fine details from high-resolution images are inevitably lost. The objective of this study is to design a structure to capture the delicate and small blood vessels. Methods: To address these issues, we propose a novel network (KaLDeX) for vascular segmentation leveraging a Kalman filter based linear deformable cross attention (LDCA) module, integrated within a UNet++ framework. Our approach is based on two key components: Kalman filter (KF) based linear deformable convolution (LD) and cross-attention (CA) modules. The LD module is designed to adaptively adjust the focus on thin vessels that might be overlooked in standard convolution. The CA module improves the global understanding of vascular structures by aggregating the detailed features from the LD module with the high level features from the UNet++ architecture. Finally, we adopt a topological loss function based on persistent homology to constrain the topological continuity of the segmentation. Results: The proposed method is evaluated on retinal fundus image datasets (DRIVE, CHASE_BD1, and STARE) as well as the 3mm and 6mm of the OCTA-500 dataset, achieving an average accuracy (ACC) of 97.25%, 97.77%, 97.85%, 98.89%, and 98.21%, respectively. Conclusions: Empirical evidence shows that our method outperforms the current best models on different vessel segmentation datasets. Our source code is available at: this https URL.

[CV-118] Scaling-based Data Augmentation for Generative Models and its Theoretical Extension

链接: https://arxiv.org/abs/2410.20780
作者: Yoshitaka Koike,Takumi Nakagawa,Hiroki Waida,Takafumi Kanamori
关键词-EN: paper studies stable, paper studies, generative models, models that enable, high-quality data generation
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies stable learning methods for generative models that enable high-quality data generation. Noise injection is commonly used to stabilize learning. However, selecting a suitable noise distribution is challenging. Diffusion-GAN, a recently developed method, addresses this by using the diffusion process with a timestep-dependent discriminator. We investigate Diffusion-GAN and reveal that data scaling is a key component for stable learning and high-quality data generation. Building on our findings, we propose a learning algorithm, Scale-GAN, that uses data scaling and variance-based regularization. Furthermore, we theoretically prove that data scaling controls the bias-variance trade-off of the estimation error bound. As a theoretical extension, we consider GAN with invertible data augmentations. Comparative evaluations on benchmark datasets demonstrate the effectiveness of our method in improving stability and accuracy.

[CV-119] CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos ECCV2024

链接: https://arxiv.org/abs/2410.20769
作者: Jiewen Yang,Yiqun Lin,Bin Pu,Jiarong Guo,Xiaowei Xu,Xiaomeng Li
关键词-EN: analysing cardiac function, Echocardiogram video plays, plays a crucial, crucial role, role in analysing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper Accepted by ECCV 2024 with Oral Presentation

点击查看摘要

Abstract:Echocardiogram video plays a crucial role in analysing cardiac function and diagnosing cardiac diseases. Current deep neural network methods primarily aim to enhance diagnosis accuracy by incorporating prior knowledge, such as segmenting cardiac structures or lesions annotated by human experts. However, diagnosing the inconsistent behaviours of the heart, which exist across both spatial and temporal dimensions, remains extremely challenging. For instance, the analysis of cardiac motion acquires both spatial and temporal information from the heartbeat cycle. To address this issue, we propose a novel reconstruction-based approach named CardiacNet to learn a better representation of local cardiac structures and motion abnormalities through echocardiogram videos. CardiacNet is accompanied by the Consistency Deformation Codebook (CDC) and the Consistency Deformed-Discriminator (CDD) to learn the commonalities across abnormal and normal samples by incorporating cardiac prior knowledge. In addition, we propose benchmark datasets named CardiacNet-PAH and CardiacNet-ASD to evaluate the effectiveness of cardiac disease assessment. In experiments, our CardiacNet can achieve state-of-the-art results in three different cardiac disease assessment tasks on public datasets CAMUS, EchoNet, and our datasets. The code and dataset are available at: this https URL.

[CV-120] Neural rendering enables dynamic tomography NEURIPS2024

链接: https://arxiv.org/abs/2410.20558
作者: Ivan Grega,William F. Whitney,Vikram S. Deshpande
关键词-EN: Interrupted X-ray computed, X-ray computed tomography, X-ray computed, Interrupted X-ray, computed tomography
类目: Instrumentation and Detectors (physics.ins-det); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 24 pages, 14 figures. Submitted to NeurIPS 2024 ML4PS. For associated visualizations, see this https URL

点击查看摘要

Abstract:Interrupted X-ray computed tomography (X-CT) has been the common way to observe the deformation of materials during an experiment. While this approach is effective for quasi-static experiments, it has never been possible to reconstruct a full 3d tomography during a dynamic experiment which cannot be interrupted. In this work, we propose that neural rendering tools can be used to drive the paradigm shift to enable 3d reconstruction during dynamic events. First, we derive theoretical results to support the selection of projections angles. Via a combination of synthetic and experimental data, we demonstrate that neural radiance fields can reconstruct data modalities of interest more efficiently than conventional reconstruction methods. Finally, we develop a spatio-temporal model with spline-based deformation field and demonstrate that such model can reconstruct the spatio-temporal deformation of lattice samples in real-world experiments.

[CV-121] Sebica: Lightweight Spatial and Efficient Bidirectional Channel Attention Super Resolution Network

链接: https://arxiv.org/abs/2410.20546
作者: Chongxiao Liu
关键词-EN: Single Image Super-Resolution, Single Image, Image Super-Resolution, low-resolution images, vital technique
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, 26 conferences

点击查看摘要

Abstract:Single Image Super-Resolution (SISR) is a vital technique for improving the visual quality of low-resolution images. While recent deep learning models have made significant advancements in SISR, they often encounter computational challenges that hinder their deployment in resource-limited or time-sensitive environments. To overcome these issues, we present Sebica, a lightweight network that incorporates spatial and efficient bidirectional channel attention mechanisms. Sebica significantly reduces computational costs while maintaining high reconstruction quality, achieving PSNR/SSIM scores of 28.29/0.7976 and 30.18/0.8330 on the Div2K and Flickr2K datasets, respectively. These results surpass most baseline lightweight models and are comparable to the highest-performing model, but with only 17% and 15% of the parameters and GFLOPs. Additionally, our small version of Sebica has only 7.9K parameters and 0.41 GFLOPS, representing just 3% of the parameters and GFLOPs of the highest-performing model, while still achieving PSNR and SSIM metrics of 28.12/0.7931 and 0.3009/0.8317, on the Flickr2K dataset respectively. In addition, Sebica demonstrates significant improvements in real-world applications, specifically in object detection tasks, where it enhances detection accuracy in traffic video scenarios.

[CV-122] Search Wide Focus Deep: Automated Fetal Brain Extraction with Sparse Training Data

链接: https://arxiv.org/abs/2410.20532
作者: Javid Dadashkarimi,Valeria Pena Trujillo,Camilo Jaimes,Lilla Zöllei,Malte Hoffmann
关键词-EN: challenging task due, Automated fetal brain, Automated fetal, full-uterus MRI, complex anatomy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated fetal brain extraction from full-uterus MRI is a challenging task due to variable head sizes, orientations, complex anatomy, and prevalent artifacts. While deep-learning (DL) models trained on synthetic images have been successful in adult brain extraction, adapting these networks for fetal MRI is difficult due to the sparsity of labeled data, leading to increased false-positive predictions. To address this challenge, we propose a test-time strategy that reduces false positives in networks trained on sparse, synthetic labels. The approach uses a breadth-fine search (BFS) to identify a subvolume likely to contain the fetal brain, followed by a deep-focused sliding window (DFS) search to refine the extraction, pooling predictions to minimize false positives. We train models at different window sizes using synthetic images derived from a small number of fetal brain label maps, augmented with random geometric shapes. Each model is trained on diverse head positions and scales, including cases with partial or no brain tissue. Our framework matches state-of-the-art brain extraction methods on clinical HASTE scans of third-trimester fetuses and exceeds them by up to 5% in terms of Dice in the second trimester as well as EPI scans across both trimesters. Our results demonstrate the utility of a sliding-window approach and combining predictions from several models trained on synthetic images, for improving brain-extraction accuracy by progressively refining regions of interest and minimizing the risk of missing brain mask slices or misidentifying other tissues as brain.

[CV-123] Guidance Disentanglement Network for Optics-Guided Thermal UAV Image Super-Resolution

链接: https://arxiv.org/abs/2410.20466
作者: Zhicheng Zhao,Juanjuan Gu,Chenglong Li,Chun Wang,Zhongling Huang,Jin Tang
关键词-EN: Optics-guided Thermal UAV, Optics-guided Thermal, Thermal UAV image, Thermal UAV, UAV image Super-Resolution
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 19 figures, 8 tables

点击查看摘要

Abstract:Optics-guided Thermal UAV image Super-Resolution (OTUAV-SR) has attracted significant research interest due to its potential applications in security inspection, agricultural measurement, and object detection. Existing methods often employ single guidance model to generate the guidance features from optical images to assist thermal UAV images super-resolution. However, single guidance models make it difficult to generate effective guidance features under favorable and adverse conditions in UAV scenarios, thus limiting the performance of OTUAV-SR. To address this issue, we propose a novel Guidance Disentanglement network (GDNet), which disentangles the optical image representation according to typical UAV scenario attributes to form guidance features under both favorable and adverse conditions, for robust OTUAV-SR. Moreover, we design an attribute-aware fusion module to combine all attribute-based optical guidance features, which could form a more discriminative representation and fit the attribute-agnostic guidance process. To facilitate OTUAV-SR research in complex UAV scenarios, we introduce VGTSR2.0, a large-scale benchmark dataset containing 3,500 aligned optical-thermal image pairs captured under diverse conditions and scenes. Extensive experiments on VGTSR2.0 demonstrate that GDNet significantly improves OTUAV-SR performance over state-of-the-art methods, especially in the challenging low-light and foggy environments commonly encountered in UAV scenarios. The dataset and code will be publicly available at this https URL.

[CV-124] Super-resolved virtual staining of label-free tissue using diffusion models

链接: https://arxiv.org/abs/2410.20073
作者: Yijie Zhang,Luzhe Huang,Nir Pillar,Yuzhu Li,Hanlong Chen,Aydogan Ozcan
关键词-EN: transforming label-free microscopy, Virtual staining, histochemically stained samples, virtual tissue staining, label-free microscopy images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Optics (physics.optics)
*备注: 26 Pages, 5 Figures

点击查看摘要

Abstract:Virtual staining of tissue offers a powerful tool for transforming label-free microscopy images of unstained tissue into equivalents of histochemically stained samples. This study presents a diffusion model-based super-resolution virtual staining approach utilizing a Brownian bridge process to enhance both the spatial resolution and fidelity of label-free virtual tissue staining, addressing the limitations of traditional deep learning-based methods. Our approach integrates novel sampling techniques into a diffusion model-based image inference process to significantly reduce the variance in the generated virtually stained images, resulting in more stable and accurate outputs. Blindly applied to lower-resolution auto-fluorescence images of label-free human lung tissue samples, the diffusion-based super-resolution virtual staining model consistently outperformed conventional approaches in resolution, structural similarity and perceptual accuracy, successfully achieving a super-resolution factor of 4-5x, increasing the output space-bandwidth product by 16-25-fold compared to the input label-free microscopy images. Diffusion-based super-resolved virtual tissue staining not only improves resolution and image quality but also enhances the reliability of virtual staining without traditional chemical staining, offering significant potential for clinical diagnostics.

[CV-125] Multi-Class Abnormality Classification Task in Video Capsule Endoscopy

链接: https://arxiv.org/abs/2410.19973
作者: Dev Rishi Verma,Vibhor Saxena,Dhruv Sharma,Arpan Gupta
关键词-EN: Video Capsule Endoscopy, Capsule Endoscopy, Video Capsule, multi-class anomaly classification, advanced transformer architectures
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Video Capsule Endoscopy Challenge

点击查看摘要

Abstract:In this work we addressed the challenge of multi-class anomaly classification in Video Capsule Endoscopy (VCE)[1] with a variety of deep learning models, ranging from custom CNNs to advanced transformer architectures. The purpose is to correctly classify diverse gastrointestinal disorders, which is critical for increasing diagnostic efficiency in clinical settings. We started with a proprietary CNN and improved performance with ResNet[7] for better feature extraction, followed by Vision Transformer (ViT)[2] to capture global dependencies. Multiscale Vision Transformer (MViT)[6] improved hierarchical feature extraction, while Dual Attention Vision Transformer (DaViT)[4] delivered cutting-edge results by combining spatial and channel attention methods. This methodology enabled us to improve model accuracy across a wide range of criteria, greatly surpassing older methods.

[CV-126] Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review

链接: https://arxiv.org/abs/2410.19820
作者: Ahmad Obeid,Said Boumaraf,Anabia Sohail,Taimur Hassan,Sajid Javed,Jorge Dias,Mohammed Bennamoun,Naoufel Werghi
关键词-EN: Recent years witnessed, years witnessed remarkable, Recent years, witnessed remarkable progress, largely fueled
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 36 pages

点击查看摘要

Abstract:Recent years witnessed remarkable progress in computational histopathology, largely fueled by deep learning. This brought the clinical adoption of deep learning-based tools within reach, promising significant benefits to healthcare, offering a valuable second opinion on diagnoses, streamlining complex tasks, and mitigating the risks of inconsistency and bias in clinical decisions. However, a well-known challenge is that deep learning models may contain up to billions of parameters; supervising their training effectively would require vast labeled datasets to achieve reliable generalization and noise resilience. In medical imaging, particularly histopathology, amassing such extensive labeled data collections places additional demands on clinicians and incurs higher costs, which hinders the art’s progress. Addressing this challenge, researchers devised various strategies for leveraging deep learning with limited data and annotation availability. In this paper, we present a comprehensive review of deep learning applications in histopathology, with a focus on the challenges posed by data scarcity over the past decade. We systematically categorize and compare various approaches, evaluate their distinct contributions using benchmarking tables, and highlight their respective advantages and limitations. Additionally, we address gaps in existing reviews and identify underexplored research opportunities, underscoring the potential for future advancements in this field.

[CV-127] raining Compute-Optimal Vision Transformers for Brain Encoding

链接: https://arxiv.org/abs/2410.19810
作者: Sana Ahmadi,Francois Paugam,Tristan Glatard,Pierre Lune Bellec
关键词-EN: brain encoding, brain encoding performance, encoding, brain encoding depends, brain
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources. This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity based on these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating point representations. Our results demonstrate that increasing the hidden layer dimensions significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads does not have a significant effect on the encoding results. Additionally, increasing the number of layers shows some improvement in brain encoding correlations, but the trend is not as consistent as that observed with hidden layer dimensions. The data scaling results show that larger training datasets lead to improved brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings highlight that the effects of data scaling are more significant compared to model scaling in enhancing brain encoding performance. Furthermore, we explored the impact of floating-point precision by comparing 32-bit and 16-bit representations. Training with 16-bit precision yielded the same brain encoding accuracy as 32-bit, while reducing training time by 1.17 times, demonstrating its efficiency for high-performance computing tasks.

[CV-128] Radon Implicit Field Transform (RIFT): Learning Scenes from Radar Signals ICLR2025

链接: https://arxiv.org/abs/2410.19801
作者: Daqian Bao,Alex Saad-Falcon,Justin Romberg
关键词-EN: wide frequency bandwidths, range resolutions require, resolutions require large, require large antenna, large antenna apertures
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: A version of this work is under review as a submission to ICLR 2025 Conference

点击查看摘要

Abstract:Data acquisition in array signal processing (ASP) is costly, as high angular and range resolutions require large antenna apertures and wide frequency bandwidths. Data requirements grow multiplicatively with viewpoints and frequencies, increasing collection burdens. Implicit Neural Representations (INRs)–neural network models of 3D scenes–offer compact, continuous representations with minimal data, interpolating to unseen viewpoints, potentially reducing sampling costs in ASP. We propose the Radon Implicit Field Transform (RIFT), combining a radar forward model (Generalized Radon Transform, GRT) with an INR-based scene representation learned from radar signals. This method extends to other ASP problems by replacing the GRT with appropriate algorithms. In experiments, we synthesize radar data using the GRT and train the INR model by minimizing radar signal reconstruction error. We render the scene using the trained INR and evaluate it against ground truth. We introduce new error metrics: phase-Root Mean Square Error (p-RMSE) and magnitude-Structural Similarity Index Measure (m-SSIM). Compared to traditional scene models, our RIFT model achieves up to 188% improvement in scene reconstruction with only 10% of the data. Using the same amount of data, RIFT achieves 3x better reconstruction and shows a 10% improvement when generalizing to unseen viewpoints.

[CV-129] Data-Driven Cellular Network Selector for Vehicle Teleoperations

链接: https://arxiv.org/abs/2410.19791
作者: Barak Gahtan,Reuven Cohen,Alex M. Bronstein,Eli Shapira
关键词-EN: autonomous vehicle, control of robotic, development of autonomous, Remote control, Active Network Selector
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: IEEE Network of Future 2024

点击查看摘要

Abstract:Remote control of robotic systems, also known as teleoperation, is crucial for the development of autonomous vehicle (AV) technology. It allows a remote operator to view live video from AVs and, in some cases, to make real-time decisions. The effectiveness of video-based teleoperation systems is heavily influenced by the quality of the cellular network and, in particular, its packet loss rate and latency. To optimize these parameters, an AV can be connected to multiple cellular networks and determine in real time over which cellular network each video packet will be transmitted. We present an algorithm, called Active Network Selector (ANS), which uses a time series machine learning approach for solving this problem. We compare ANS to a baseline non-learning algorithm, which is used today in commercial systems, and show that ANS performs much better, with respect to both packet loss and packet latency.

[CV-130] Multi-modal Image and Radio Frequency Fusion for Optimizing Vehicle Positioning

链接: https://arxiv.org/abs/2410.19788
作者: Ouwen Huan,Tao Luo,Mingzhe Chen
关键词-EN: channel state information, unlabeled CSI, CSI, jointly localizes vehicles, state information
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, a multi-modal vehicle positioning framework that jointly localizes vehicles with channel state information (CSI) and images is designed. In particular, we consider an outdoor scenario where each vehicle can communicate with only one BS, and hence, it can upload its estimated CSI to only its associated BS. Each BS is equipped with a set of cameras, such that it can collect a small number of labeled CSI, a large number of unlabeled CSI, and the images taken by cameras. To exploit the unlabeled CSI data and position labels obtained from images, we design an meta-learning based hard expectation-maximization (EM) algorithm. Specifically, since we do not know the corresponding relationship between unlabeled CSI and the multiple vehicle locations in images, we formulate the calculation of the training objective as a minimum matching problem. To reduce the impact of label noises caused by incorrect matching between unlabeled CSI and vehicle locations obtained from images and achieve better convergence, we introduce a weighted loss function on the unlabeled datasets, and study the use of a meta-learning algorithm for computing the weighted loss. Subsequently, the model parameters are updated according to the weighted loss function of unlabeled CSI samples and their matched position labels obtained from images. Simulation results show that the proposed method can reduce the positioning error by up to 61% compared to a baseline that does not use images and uses only CSI fingerprint for vehicle positioning.

机器学习

[LG-0] Online Weighted Paging with Unknown Weights

链接: https://arxiv.org/abs/2410.21266
作者: Orin Levy,Noam Touitou,Aviv Rosenberg
关键词-EN: fetching pages arrive, slots as requests, Buchbinder and Naor, pages arrive online, page weights
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Online paging is a fundamental problem in the field of online algorithms, in which one maintains a cache of k slots as requests for fetching pages arrive online. In the weighted variant of this problem, each page has its own fetching cost; a substantial line of work on this problem culminated in an (optimal) O(\log k) -competitive randomized algorithm, due to Bansal, Buchbinder and Naor (FOCS’07). Existing work for weighted paging assumes that page weights are known in advance, which is not always the case in practice. For example, in multi-level caching architectures, the expected cost of fetching a memory block is a function of its probability of being in a mid-level cache rather than the main memory. This complex property cannot be predicted in advance; over time, however, one may glean information about page weights through sampling their fetching cost multiple times. We present the first algorithm for online weighted paging that does not know page weights in advance, but rather learns from weight samples. In terms of techniques, this requires providing (integral) samples to a fractional solver, requiring a delicate interface between this solver and the randomized rounding scheme; we believe that our work can inspire online algorithms to other problems that involve cost sampling. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2410.21266 [cs.LG] (or arXiv:2410.21266v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.21266 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] Modular Duality in Deep Learning

链接: https://arxiv.org/abs/2410.21265
作者: Jeremy Bernstein,Laker Newhouse
关键词-EN: dual vector, weights reside, primal space, duality map, optimization theory
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers – the latter two methods are based on a new rectangular Newton-Schulz iteration that we propose. Our iteration was recently used to set new speed records for training NanoGPT. Overall, we hope that our theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.

[LG-2] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

链接: https://arxiv.org/abs/2410.21257
作者: Zhendong Wang,Zhaoshuo Li,Ajay Mandlekar,Zhenjia Xu,Jiaojiao Fan,Yashraj Narang,Linxi Fan,Yuke Zhu,Yogesh Balaji,Mingyuan Zhou,Ming-Yu Liu,Yu Zeng
关键词-EN: demonstrating exceptional performance, demonstrating exceptional, behavior cloning, increasingly being applied, exceptional performance
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process stemming from iterative denoising steps poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only 2% - 10% additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. We share the project page at this https URL.

[LG-3] textttskwdro: a library for Wasserstein distributionally robust machine learning

链接: https://arxiv.org/abs/2410.21231
作者: Florian Vincent,Waïss Azizian,Franck Iutzeler,Jérôme Malick
关键词-EN: training robust machine, robust machine learning, Python library, machine learning models, present skwdro
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Optimization and Control (math.OC)
*备注: 6 pages 1 figure

点击查看摘要

Abstract:We present skwdro, a Python library for training robust machine learning models. The library is based on distributionally robust optimization using optimal transport distances. For ease of use, it features both scikit-learn compatible estimators for popular objectives, as well as a wrapper for PyTorch modules, enabling researchers and practitioners to use it in a wide range of models with minimal code changes. Its implementation relies on an entropic smoothing of the original robust objective in order to ensure maximal model flexibility. The library is available at this https URL

[LG-4] Reconstructing dynamics from sparse observations with no training on target system

链接: https://arxiv.org/abs/2410.21222
作者: Zheng-Meng Zhai,Jun-Yin Huang,Benjamin D. Stern,Ying-Cheng Lai
关键词-EN: target system, data, system, nonlinear target system, target
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 31 pages, 21 figures

点击查看摘要

Abstract:In applications, an anticipated situation is where the system of interest has never been encountered before and sparse observations can be made only once. Can the dynamics be faithfully reconstructed from the limited observations without any training data? This problem defies any known traditional methods of nonlinear time-series analysis as well as existing machine-learning methods that typically require extensive data from the target system for training. We address this challenge by developing a hybrid transformer and reservoir-computing machine-learning scheme. The key idea is that, for a complex and nonlinear target system, the training of the transformer can be conducted not using any data from the target system, but with essentially unlimited synthetic data from known chaotic systems. The trained transformer is then tested with the sparse data from the target system. The output of the transformer is further fed into a reservoir computer for predicting the long-term dynamics or the attractor of the target system. The power of the proposed hybrid machine-learning framework is demonstrated using a large number of prototypical nonlinear dynamical systems, with high reconstruction accuracy even when the available data is only 20% of that required to faithfully represent the dynamical behavior of the underlying system. The framework provides a paradigm of reconstructing complex and nonlinear dynamics in the extreme situation where training data does not exist and the observations are random and sparse.

[LG-5] SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications

链接: https://arxiv.org/abs/2410.21194
作者: Ilias Diakonikolas,Samuel B. Hopkins,Ankit Pensia,Stefan Tiegel
关键词-EN: variate polynomial, mathbb, square polynomials, centered subgaussian distribution, universal constant
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We prove that there is a universal constant C0 so that for every d \in \mathbb N , every centered subgaussian distribution \mathcal D on \mathbb R^d , and every even p \in \mathbb N , the d -variate polynomial (Cp)^p/2 \cdot |v|_2^p - \mathbb E_X \sim \mathcal D \langle v,X\rangle^p is a sum of square polynomials. This establishes that every subgaussian distribution is \emphSoS-certifiably subgaussian – a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand’s generic chaining/majorizing measures theorem.

[LG-6] On Homomorphic Encryption Based Strategies for Class Imbalance in Federated Learning

链接: https://arxiv.org/abs/2410.21192
作者: Arpit Guleria,J. Harshan,Ranjitha Prasad,B. N. Bharath
关键词-EN: machine learning models, Class imbalance, lead to bias, bias and poor, poor generalization
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted for Presentation at CODS COMAD 2024

点击查看摘要

Abstract:Class imbalance in training datasets can lead to bias and poor generalization in machine learning models. While pre-processing of training datasets can efficiently address both these issues in centralized learning environments, it is challenging to detect and address these issues in a distributed learning environment such as federated learning. In this paper, we propose FLICKER, a privacy preserving framework to address issues related to global class imbalance in federated learning. At the heart of our contribution lies the popular CKKS homomorphic encryption scheme, which is used by the clients to privately share their data attributes, and subsequently balance their datasets before implementing the FL scheme. Extensive experimental results show that our proposed method significantly improves the FL accuracy numbers when used along with popular datasets and relevant baselines.

[LG-7] Differentially Private Learned Indexes

链接: https://arxiv.org/abs/2410.21164
作者: Jianzhang Du,Tilak Mudgal,Rutvi Rahul Gadre,Yukui Luo,Chenghong Wang
关键词-EN: Trusted Execution Environments, Execution Environments, Trusted Execution, efficiently answering predicate, secured by Trusted
类目: Databases (cs.DB); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we address the problem of efficiently answering predicate queries on encrypted databases, those secured by Trusted Execution Environments (TEEs), which enable untrusted providers to process encrypted user data without revealing its contents. A common strategy in modern databases to accelerate predicate queries is the use of indexes, which map attribute values (keys) to their corresponding positions in a sorted data array. This allows for fast lookup and retrieval of data subsets that satisfy specific predicates. Unfortunately, indexes cannot be directly applied to encrypted databases due to strong data dependent leakages. Recent approaches apply differential privacy (DP) to construct noisy indexes that enable faster access to encrypted data while maintaining provable privacy guarantees. However, these methods often suffer from large storage costs, with index sizes typically scaling linearly with the key space. To address this challenge, we propose leveraging learned indexes, a trending technique that repurposes machine learning models as indexing structures, to build more compact DP indexes.

[LG-8] Resilience in Knowledge Graph Embeddings

链接: https://arxiv.org/abs/2410.21163
作者: Arnab Sharma,N’Dah Jean Kouagou,Axel-Cyrille Ngonga Ngomo
关键词-EN: knowledge graphs, witnessed widespread applications, recommendation systems, recent years, knowledge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, knowledge graphs have gained interest and witnessed widespread applications in various domains, such as information retrieval, question-answering, recommendation systems, amongst others. Large-scale knowledge graphs to this end have demonstrated their utility in effectively representing structured knowledge. To further facilitate the application of machine learning techniques, knowledge graph embedding (KGE) models have been developed. Such models can transform entities and relationships within knowledge graphs into vectors. However, these embedding models often face challenges related to noise, missing information, distribution shift, adversarial attacks, etc. This can lead to sub-optimal embeddings and incorrect inferences, thereby negatively impacting downstream applications. While the existing literature has focused so far on adversarial attacks on KGE models, the challenges related to the other critical aspects remain unexplored. In this paper, we, first of all, give a unified definition of resilience, encompassing several factors such as generalisation, performance consistency, distribution adaption, and robustness. After formalizing these concepts for machine learning in general, we define them in the context of knowledge graphs. To find the gap in the existing works on resilience in the context of knowledge graphs, we perform a systematic survey, taking into account all these aspects mentioned previously. Our survey results show that most of the existing works focus on a specific aspect of resilience, namely robustness. After categorizing such works based on their respective aspects of resilience, we discuss the challenges and future research directions.

[LG-9] Offline Reinforcement Learning With Combinatorial Action Spaces

链接: https://arxiv.org/abs/2410.21151
作者: Matthew Landers,Taylor W. Killian,Hugo Barnes,Thomas Hartvigsen,Afsaneh Doryab
关键词-EN: Reinforcement learning problems, Reinforcement learning, combinatorial action spaces, simultaneous execution, execution of multiple
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning problems often involve large action spaces arising from the simultaneous execution of multiple sub-actions, resulting in combinatorial action spaces. Learning in combinatorial action spaces is difficult due to the exponential growth in action space size with the number of sub-actions and the dependencies among these sub-actions. In offline settings, this challenge is compounded by limited and suboptimal data. Current methods for offline learning in combinatorial spaces simplify the problem by assuming sub-action independence. We propose Branch Value Estimation (BVE), which effectively captures sub-action dependencies and scales to large combinatorial spaces by learning to evaluate only a small subset of actions at each timestep. Our experiments show that BVE outperforms state-of-the-art methods across a range of action space sizes.

[LG-10] LLM -initialized Differentiable Causal Discovery

链接: https://arxiv.org/abs/2410.21141
作者: Shiv Kampani,David Hidary,Constantijn van der Poel,Martin Ganahl,Brenda Miao
关键词-EN: Large Language Models, causal discovery, causal, scientific domains, discovery
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The discovery of causal relationships between random variables is an important yet challenging problem that has applications across many scientific domains. Differentiable causal discovery (DCD) methods are effective in uncovering causal relationships from observational data; however, these approaches often suffer from limited interpretability and face challenges in incorporating domain-specific prior knowledge. In contrast, Large Language Models (LLMs)-based causal discovery approaches have recently been shown capable of providing useful priors for causal discovery but struggle with formal causal reasoning. In this paper, we propose LLM-DCD, which uses an LLM to initialize the optimization of the maximum likelihood objective function of DCD approaches, thereby incorporating strong priors into the discovery method. To achieve this initialization, we design our objective function to depend on an explicitly defined adjacency matrix of the causal graph as its only variational parameter. Directly optimizing the explicitly defined adjacency matrix provides a more interpretable approach to causal discovery. Additionally, we demonstrate higher accuracy on key benchmarking datasets of our approach compared to state-of-the-art alternatives, and provide empirical evidence that the quality of the initialization directly impacts the quality of the final output of our DCD approach. LLM-DCD opens up new opportunities for traditional causal discovery methods like DCD to benefit from future improvements in the causal reasoning capabilities of LLMs.

[LG-11] FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge

链接: https://arxiv.org/abs/2410.21120
作者: Sifat Ut Taki,Arthi Padmanabhan,Spyridon Mastorakis
关键词-EN: DNN models, aimed to revolutionize, models, serverless inference services, popular DNN models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Edge AI computing boxes are a new class of computing devices that are aimed to revolutionize the AI industry. These compact and robust hardware units bring the power of AI processing directly to the source of data–on the edge of the network. On the other hand, on-demand serverless inference services are becoming more and more popular as they minimize the infrastructural cost associated with hosting and running DNN models for small to medium-sized businesses. However, these computing devices are still constrained in terms of resource availability. As such, the service providers need to load and unload models efficiently in order to meet the growing demand. In this paper, we introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge. FusedInf combines multiple models into a single Direct Acyclic Graph (DAG) to efficiently load the models into the GPU memory and make execution faster. Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14% faster while reducing the memory requirement by up to 17%. The prototype implementation is available at this https URL.

[LG-12] A Unified Solution to Diverse Heterogeneities in One-shot Federated Learning

链接: https://arxiv.org/abs/2410.21119
作者: Jun Bai,Yiliao Song,Di Wu,Atul Sajjanhar,Yong Xiang,Wei Zhou,Xiaohui Tao,Yan Li
关键词-EN: requiring multiple communications, privacy leakage risks, traditional FLs requiring, FLs requiring multiple, One-shot federated learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One-shot federated learning (FL) limits the communication between the server and clients to a single round, which largely decreases the privacy leakage risks in traditional FLs requiring multiple communications. However, we find existing one-shot FL frameworks are vulnerable to distributional heterogeneity due to their insufficient focus on data heterogeneity while concentrating predominantly on model heterogeneity. Filling this gap, we propose a unified, data-free, one-shot federated learning framework (FedHydra) that can effectively address both model and data heterogeneity. Rather than applying existing value-only learning mechanisms, a structure-value learning mechanism is proposed in FedHydra. Specifically, a new stratified learning structure is proposed to cover data heterogeneity, and the value of each item during computation reflects model heterogeneity. By this design, the data and model heterogeneity issues are simultaneously monitored from different aspects during learning. Consequently, FedHydra can effectively mitigate both issues by minimizing their inherent conflicts. We compared FedHydra with three SOTA baselines on four benchmark datasets. Experimental results show that our method outperforms the previous one-shot FL methods in both homogeneous and heterogeneous settings.

[LG-13] Dual-Agent Deep Reinforcement Learning for Dynamic Pricing and Replenishment

链接: https://arxiv.org/abs/2410.21109
作者: Yi Zheng,Zehao Li,Peng Jiang,Yijie Peng
关键词-EN: study the dynamic, inconsistent decision frequencies, inconsistent decision, dynamic pricing, traditional demand assumption
类目: Machine Learning (cs.LG); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:We study the dynamic pricing and replenishment problems under inconsistent decision frequencies. Different from the traditional demand assumption, the discreteness of demand and the parameter within the Poisson distribution as a function of price introduce complexity into analyzing the problem property. We demonstrate the concavity of the single-period profit function with respect to product price and inventory within their respective domains. The demand model is enhanced by integrating a decision tree-based machine learning approach, trained on comprehensive market data. Employing a two-timescale stochastic approximation scheme, we address the discrepancies in decision frequencies between pricing and replenishment, ensuring convergence to local optimum. We further refine our methodology by incorporating deep reinforcement learning (DRL) techniques and propose a fast-slow dual-agent DRL algorithm. In this approach, two agents handle pricing and inventory and are updated on different scales. Numerical results from both single and multiple products scenarios validate the effectiveness of our methods.

[LG-14] ree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy

链接: https://arxiv.org/abs/2410.21107
作者: Ya-Wei Eileen Lin,Ronald R. Coifman,Gal Mishne,Ronen Talmon
关键词-EN: Finding meaningful distances, important scientific task, Finding meaningful, latent feature hierarchy, high-dimensional data samples
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Finding meaningful distances between high-dimensional data samples is an important scientific task. To this end, we propose a new tree-Wasserstein distance (TWD) for high-dimensional data with two key aspects. First, our TWD is specifically designed for data with a latent feature hierarchy, i.e., the features lie in a hierarchical space, in contrast to the usual focus on embedding samples in hyperbolic space. Second, while the conventional use of TWD is to speed up the computation of the Wasserstein distance, we use its inherent tree as a means to learn the latent feature hierarchy. The key idea of our method is to embed the features into a multi-scale hyperbolic space using diffusion geometry and then present a new tree decoding method by establishing analogies between the hyperbolic embedding and trees. We show that our TWD computed based on data observations provably recovers the TWD defined with the latent feature hierarchy and that its computation is efficient and scalable. We showcase the usefulness of the proposed TWD in applications to word-document and single-cell RNA-sequencing datasets, demonstrating its advantages over existing TWDs and methods based on pre-trained models.

[LG-15] Federated Time Series Generation on Feature and Temporally Misaligned Data

链接: https://arxiv.org/abs/2410.21072
作者: Chenrui Fan,Zhi Wen Soi,Aditya Shankar,Abele Mălan,Lydia Y. Chen
关键词-EN: Distributed time series, Distributed time, federated time series, time series, misaligned time steps
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Distributed time series data presents a challenge for federated learning, as clients often possess different feature sets and have misaligned time steps. Existing federated time series models are limited by the assumption of perfect temporal or feature alignment across clients. In this paper, we propose FedTDD, a novel federated time series diffusion model that jointly learns a synthesizer across clients. At the core of FedTDD is a novel data distillation and aggregation framework that reconciles the differences between clients by imputing the misaligned timesteps and features. In contrast to traditional federated learning, FedTDD learns the correlation across clients’ time series through the exchange of local synthetic outputs instead of model parameters. A coordinator iteratively improves a global distiller network by leveraging shared knowledge from clients through the exchange of synthetic data. As the distiller becomes more refined over time, it subsequently enhances the quality of the clients’ local feature estimates, allowing each client to then improve its local imputations for missing data using the latest, more accurate distiller. Experimental results on five datasets demonstrate FedTDD’s effectiveness compared to centralized training, and the effectiveness of sharing synthetic outputs to transfer knowledge of local time series. Notably, FedTDD achieves 79.4% and 62.8% improvement over local training in Context-FID and Correlational scores.

[LG-16] Computable Lipschitz Bounds for Deep Neural Networks

链接: https://arxiv.org/abs/2410.21053
作者: Moreno Pintore,Bruno Després
关键词-EN: neural-network based models, convolutional neural networks, Deriving sharp, Lipschitz constant, computable upper bounds
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deriving sharp and computable upper bounds of the Lipschitz constant of deep neural networks is crucial to formally guarantee the robustness of neural-network based models. We analyse three existing upper bounds written for the l^2 norm. We highlight the importance of working with the l^1 and l^\infty norms and we propose two novel bounds for both feed-forward fully-connected neural networks and convolutional neural networks. We treat the technical difficulties related to convolutional neural networks with two different methods, called explicit and implicit. Several numerical tests empirically confirm the theoretical results, help to quantify the relationship between the presented bounds and establish the better accuracy of the new bounds. Four numerical tests are studied: two where the output is derived from an analytical closed form are proposed; another one with random matrices; and the last one for convolutional neural networks trained on the MNIST dataset. We observe that one of our bound is optimal in the sense that it is exact for the first test with the simplest analytical form and it is better than other bounds for the other tests.

[LG-17] Physics-informed Partitioned Coupled Neural Operator for Complex Networks

链接: https://arxiv.org/abs/2410.21025
作者: Weidong Wu,Yong Zhang,Lili Hao,Yang Chen,Xiaoyan Sun,Dunwei Gong
关键词-EN: partial differential equations, Operators provide efficient, Neural Operators provide, Fourier Neural Operator, Partitioned Coupled Neural
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Operators provide efficient, high-fidelity simulations for systems governed by partial differential equations (PDEs). However, most existing studies focus only on multi-scale, multi-physics systems within a single spatial region, neglecting the case with multiple interconnected sub-regions, such as gas and thermal systems. To address this, this paper proposes a Physics-Informed Partitioned Coupled Neural Operator (PCNO) to enhance the simulation performance of such networks. Compared to the existing Fourier Neural Operator (FNO), this method designs a joint convolution operator within the Fourier layer, enabling global integration capturing all sub-regions. Additionally, grid alignment layers are introduced outside the Fourier layer to help the joint convolution operator accurately learn the coupling relationship between sub-regions in the frequency domain. Experiments on gas networks demonstrate that the proposed operator not only accurately simulates complex systems but also shows good generalization and low model complexity.

[LG-18] A Review of Graph-Powered Data Quality Applications for IoT Monitoring Sensor Networks

链接: https://arxiv.org/abs/2410.21006
作者: Pau Ferrer-Cid,Jose M. Barcelo-Ordinas,Jorge Garcia-Vidal
关键词-EN: Internet of Things, development of Internet, smart cities, precision agriculture, widespread adoption
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Paper submitted to Journal of Network and Computer Applications

点击查看摘要

Abstract:The development of Internet of Things (IoT) technologies has led to the widespread adoption of monitoring networks for a wide variety of applications, such as smart cities, environmental monitoring, and precision agriculture. A major research focus in recent years has been the development of graph-based techniques to improve the quality of data from sensor networks, a key aspect for the use of sensed data in decision-making processes, digital twins, and other applications. Emphasis has been placed on the development of machine learning and signal processing techniques over graphs, taking advantage of the benefits provided by the use of structured data through a graph topology. Many technologies such as the graph signal processing (GSP) or the successful graph neural networks (GNNs) have been used for data quality enhancement tasks. In this survey, we focus on graph-based models for data quality control in monitoring sensor networks. Furthermore, we delve into the technical details that are commonly leveraged for providing powerful graph-based solutions for data quality tasks in sensor networks, including missing value imputation, outlier detection, or virtual sensing. To conclude, we have identified future trends and challenges such as graph-based models for digital twins or model transferability and generalization.

[LG-19] SepMamba: State-space models for speaker separation using Mamba

链接: https://arxiv.org/abs/2410.20997
作者: Thor Højhus Avenstrup,Boldizsár Elek,István László Mádi,András Bence Schin,Morten Mørup,Bjørn Sand Jensen,Kenny Falkær Olsen
关键词-EN: learning-based single-channel speaker, recent years largely, years largely due, Deep learning-based single-channel, transformer-based attention mechanism
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.

[LG-20] Reference-Free Formula Drift with Reinforcement Learning: From Driving Data to Tire Energy-Inspired Real-World Policies ICRA2025

链接: https://arxiv.org/abs/2410.20990
作者: Franck Djeumou,Michael Thompson,Makoto Suminaka,John Subosits
关键词-EN: give future autonomous, future autonomous cars, autonomous cars maximum, cars maximum flexibility, professional drivers
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Initial submission to ICRA 2025

点击查看摘要

Abstract:The skill to drift a car–i.e., operate in a state of controlled oversteer like professional drivers–could give future autonomous cars maximum flexibility when they need to retain control in adverse conditions or avoid collisions. We investigate real-time drifting strategies that put the car where needed while bypassing expensive trajectory optimization. To this end, we design a reinforcement learning agent that builds on the concept of tire energy absorption to autonomously drift through changing and complex waypoint configurations while safely staying within track bounds. We achieve zero-shot deployment on the car by training the agent in a simulation environment built on top of a neural stochastic differential equation vehicle model learned from pre-collected driving data. Experiments on a Toyota GR Supra and Lexus LC 500 show that the agent is capable of drifting smoothly through varying waypoint configurations with tracking error as low as 10 cm while stably pushing the vehicles to sideslip angles of up to 63°.

[LG-21] Refining CART Models for Covariate Shift with Importance Weight

链接: https://arxiv.org/abs/2410.20978
作者: Mingyang Cai,Thomas Klausch,Mark A. van de Wiel
关键词-EN: Machine learning models, Machine learning, medical applications due, decrease predictive accuracy, applications due
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Machine learning models often face challenges in medical applications due to covariate shifts, where discrepancies between training and target data distributions can decrease predictive accuracy. This paper introduces an adaptation of Classification and Regression Trees (CART) that incorporates importance weighting to address these distributional differences effectively. By assigning greater weight to training samples that closely represent the target distribution, our approach modifies the CART model to improve performance in the presence of covariate shift. We evaluate the effectiveness of this method through simulation studies and apply it to real-world medical data, showing significant improvements in predictive accuracy. The results indicate that this weighted CART approach can be valuable in medical and other fields where covariate shift poses challenges, enabling more reliable predictions across diverse data distributions.

[LG-22] Simultaneous Unlearning of Multiple Protected User Attributes From Variational Autoencoder Recommenders Using Adversarial Training

链接: https://arxiv.org/abs/2410.20965
作者: Gustavo Escobedo,Christian Ganhör,Stefan Brandl,Mirjam Augstein,Markus Schedl
关键词-EN: neural network-based collaborative, network-based collaborative filtering, collaborative filtering models, users’ history logs, users’ protected attributes
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In widely used neural network-based collaborative filtering models, users’ history logs are encoded into latent embeddings that represent the users’ preferences. In this setting, the models are capable of mapping users’ protected attributes (e.g., gender or ethnicity) from these user embeddings even without explicit access to them, resulting in models that may treat specific demographic user groups unfairly and raise privacy issues. While prior work has approached the removal of a single protected attribute of a user at a time, multiple attributes might come into play in real-world scenarios. In the work at hand, we present AdvXMultVAE which aims to unlearn multiple protected attributes (exemplified by gender and age) simultaneously to improve fairness across demographic user groups. For this purpose, we couple a variational autoencoder (VAE) architecture with adversarial training (AdvMultVAE) to support simultaneous removal of the users’ protected attributes with continuous and/or categorical values. Our experiments on two datasets, LFM-2b-100k and Ml-1m, from the music and movie domains, respectively, show that our approach can yield better results than its singular removal counterparts (based on AdvMultVAE) in effectively mitigating demographic biases whilst improving the anonymity of latent embeddings.

[LG-23] Neural Hamilton: Can A.I. Understand Hamiltonian Mechanics?

链接: https://arxiv.org/abs/2410.20951
作者: Tae-Geun Kim,Seong Chan Park
关键词-EN: reformulates classical mechanics, operator learning problem, reformulates classical, classical mechanics, operator learning
类目: Machine Learning (cs.LG); Classical Physics (physics.class-ph); Computational Physics (physics.comp-ph)
*备注: 33 pages, 8 figures, 9 tables

点击查看摘要

Abstract:We propose a novel framework based on neural network that reformulates classical mechanics as an operator learning problem. A machine directly maps a potential function to its corresponding trajectory in phase space without solving the Hamilton equations. Most notably, while conventional methods tend to accumulate errors over time through iterative time integration, our approach prevents error propagation. Two newly developed neural network architectures, namely VaRONet and MambONet, are introduced to adapt the Variational LSTM sequence-to-sequence model and leverage the Mamba model for efficient temporal dynamics processing. We tested our approach with various 1D physics problems: harmonic oscillation, double-well potentials, Morse potential, and other potential models outside the training data. Compared to traditional numerical methods based on the fourth-order Runge-Kutta (RK4) algorithm, our model demonstrates improved computational efficiency and accuracy. Code is available at: this https URL Comments: 33 pages, 8 figures, 9 tables Subjects: Machine Learning (cs.LG); Classical Physics (physics.class-ph); Computational Physics (physics.comp-ph) Cite as: arXiv:2410.20951 [cs.LG] (or arXiv:2410.20951v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.20951 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-24] Constrained Optimal Fuel Consumption of HEV:Considering the Observational Perturbation

链接: https://arxiv.org/abs/2410.20913
作者: Shuchang Yan,Haoran Sun
关键词-EN: constrained reinforcement learning, assume accurate observation, precise speed curves, constrained optimal fuel, state of charge
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We assume accurate observation of battery state of charge (SOC) and precise speed curves when addressing the constrained optimal fuel consumption (COFC) problem via constrained reinforcement learning (CRL). However, in practice, SOC measurements are often distorted by noise or confidentiality protocols, and actual reference speeds may deviate from expectations. We aim to minimize fuel consumption while maintaining SOC balance under observational perturbations in SOC and speed. This work first worldwide uses seven training approaches to solve the COFC problem under five types of perturbations, including one based on a uniform distribution, one designed to maximize rewards, one aimed at maximizing costs, and one along with its improved version that seeks to decrease reward on Toyota Hybrid Systems (THS) under New European Driving Cycle (NEDC) condition. The result verifies that the six can successfully solve the COFC problem under observational perturbations, and we further compare the robustness and safety of these training approaches and analyze their impact on optimal fuel consumption.

[LG-25] Deep Recurrent Stochastic Configuration Networks for Modelling Nonlinear Dynamic Systems

链接: https://arxiv.org/abs/2410.20904
作者: Gang Dang,Dianhui Wang
关键词-EN: Deep learning techniques, domain applications, techniques have shown, shown promise, termed deep recurrent
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deep learning techniques have shown promise in many domain applications. This paper proposes a novel deep reservoir computing framework, termed deep recurrent stochastic configuration network (DeepRSCN) for modelling nonlinear dynamic systems. DeepRSCNs are incrementally constructed, with all reservoir nodes directly linked to the final output. The random parameters are assigned in the light of a supervisory mechanism, ensuring the universal approximation property of the built model. The output weights are updated online using the projection algorithm to handle the unknown dynamics. Given a set of training samples, DeepRSCNs can quickly generate learning representations, which consist of random basis functions with cascaded input and readout weights. Experimental results over a time series prediction, a nonlinear system identification problem, and two industrial data predictive analyses demonstrate that the proposed DeepRSCN outperforms the single-layer network in terms of modelling efficiency, learning capability, and generalization performance.

[LG-26] Generative Example-Based Explanations: Bridging the Gap between Generative Modeling and Explainability

链接: https://arxiv.org/abs/2410.20890
作者: Philipp Vaeth,Alexander M. Fruehwald,Benjamin Paassen,Magda Gregorova
关键词-EN: high-dimensional input data, decision algorithms, produce example-based explanations, leveraged deep generative, Recently
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recently, several methods have leveraged deep generative modeling to produce example-based explanations of decision algorithms for high-dimensional input data. Despite promising results, a disconnect exists between these methods and the classical explainability literature, which focuses on lower-dimensional data with semantically meaningful features. This conceptual and communication gap leads to misunderstandings and misalignments in goals and expectations. In this paper, we bridge this gap by proposing a novel probabilistic framework for local example-based explanations. Our framework integrates the critical characteristics of classical local explanation desiderata while being amenable to high-dimensional data and their modeling through deep generative models. Our aim is to facilitate communication, foster rigor and transparency, and improve the quality of peer discussion and research progress.

[LG-27] CODES: Benchmarking Coupled ODE Surrogates NEURIPS2024

链接: https://arxiv.org/abs/2410.20886
作者: Robin Janssen,Immanuel Sulzer,Tobias Buck
关键词-EN: coupled ODE systems, ODE systems, coupled ODE, comprehensive evaluation, architectures for coupled
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computational Physics (physics.comp-ph)
*备注: 12 pages, 10 figures, accepted for the Machine Learning and the Physical Sciences workshop at NeurIPS 2024, source code available on GitHub at this https URL

点击查看摘要

Abstract:We introduce CODES, a benchmark for comprehensive evaluation of surrogate architectures for coupled ODE systems. Besides standard metrics like mean squared error (MSE) and inference time, CODES provides insights into surrogate behaviour across multiple dimensions like interpolation, extrapolation, sparse data, uncertainty quantification and gradient correlation. The benchmark emphasizes usability through features such as integrated parallel training, a web-based configuration generator, and pre-implemented baseline models and datasets. Extensive documentation ensures sustainability and provides the foundation for collaborative improvement. By offering a fair and multi-faceted comparison, CODES helps researchers select the most suitable surrogate for their specific dataset and application while deepening our understanding of surrogate learning behaviour.

[LG-28] On Probabilistic Pullback Metrics on Latent Hyperbolic Manifolds

链接: https://arxiv.org/abs/2410.20850
作者: Luis Augenstein,Noémie Jaquier,Tamim Asfour,Leonel Rozo
关键词-EN: Gaussian Process Latent, Latent Variable Models, Process Latent Variable, Gaussian Process, Variable Models
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages, 7 figures, 1 table

点击查看摘要

Abstract:Gaussian Process Latent Variable Models (GPLVMs) have proven effective in capturing complex, high-dimensional data through lower-dimensional representations. Recent advances show that using Riemannian manifolds as latent spaces provides more flexibility to learn higher quality embeddings. This paper focuses on the hyperbolic manifold, a particularly suitable choice for modeling hierarchical relationships. While previous approaches relied on hyperbolic geodesics for interpolating the latent space, this often results in paths crossing low-data regions, leading to highly uncertain predictions. Instead, we propose augmenting the hyperbolic metric with a pullback metric to account for distortions introduced by the GPLVM’s nonlinear mapping. Through various experiments, we demonstrate that geodesics on the pullback metric not only respect the geometry of the hyperbolic latent space but also align with the underlying data distribution, significantly reducing uncertainty in predictions.

[LG-29] mporal Streaming Batch Principal Component Analysis for Time Series Classification

链接: https://arxiv.org/abs/2410.20820
作者: Enshuo Yan,Huachuan Wang,Weihao Xia
关键词-EN: excellent classification capabilities, show significant shortcomings, prolonged training times, long sequence multivariate, current sequence analysis
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In multivariate time series classification, although current sequence analysis models have excellent classification capabilities, they show significant shortcomings when dealing with long sequence multivariate data, such as prolonged training times and decreased accuracy. This paper focuses on optimizing model performance for long-sequence multivariate data by mitigating the impact of extended time series and multiple variables on the model. We propose a principal component analysis (PCA)-based temporal streaming compression and dimensionality reduction algorithm for time series data (temporal streaming batch PCA, TSBPCA), which continuously updates the compact representation of the entire sequence through streaming PCA time estimation with time block updates, enhancing the data representation capability of a range of sequence analysis models. We evaluated this method using various models on five real datasets, and the experimental results show that our method performs well in terms of classification accuracy and time efficiency. Notably, our method demonstrates a trend of increasing effectiveness as sequence length grows; on the two longest sequence datasets, accuracy improved by about 7.2%, and execution time decreased by 49.5%.

[LG-30] zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation

链接: https://arxiv.org/abs/2410.20808
作者: Azizjon Azimi,Bonu Boboeva,Ilyas Varshavskiy,Shuhrat Khalilbekov,Akhlitdin Nizamitdinov,Najima Noyoftova,Sergey Shulgin
关键词-EN: classical machine learning, machine learning models, black swans, posed a fundamental, fundamental challenge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The phenomenon of “black swans” has posed a fundamental challenge to performance of classical machine learning models. Perceived rise in frequency of outlier conditions, especially in post-pandemic environment, has necessitated exploration of synthetic data as a complement real data in model training. This article provides a general overview and experimental investigation of the zGAN model architecture developed for the purpose of generating synthetic tabular data with outlier characteristics. The model is put to test in binary classification environments and shows promising results on not only synthetic data generation, but also on uplift capabilities vis-à-vis model performance. A distinctive feature of zGAN is its enhanced correlation capability between features in the generated data, replicating correlations of features in real training data. Furthermore, crucial is the ability of zGAN to generate outliers based on covariance of real data or synthetically generated covariances. This approach to outlier generation enables modeling of complex economic events and augmentation of outliers for tasks such as training predictive models and detecting, processing or removing outliers. Experiments and comparative analyses as part of this study were conducted on both private (credit risk in financial services) and public datasets.

[LG-31] Reduction-based Pseudo-label Generation for Instance-dependent Partial Label Learning

链接: https://arxiv.org/abs/2410.20797
作者: Congyu Qiao,Ning Xu,Yihao Hu,Xin Geng
关键词-EN: Instance-dependent Partial Label, Partial Label Learning, Instance-dependent Partial, Label Learning, Partial Label
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Instance-dependent Partial Label Learning (ID-PLL) aims to learn a multi-class predictive model given training instances annotated with candidate labels related to features, among which correct labels are hidden fixed but unknown. The previous works involve leveraging the identification capability of the training model itself to iteratively refine supervision information. However, these methods overlook a critical aspect of ID-PLL: the training model is prone to overfitting on incorrect candidate labels, thereby providing poor supervision information and creating a bottleneck in training. In this paper, we propose to leverage reduction-based pseudo-labels to alleviate the influence of incorrect candidate labels and train our predictive model to overcome this bottleneck. Specifically, reduction-based pseudo-labels are generated by performing weighted aggregation on the outputs of a multi-branch auxiliary model, with each branch trained in a label subspace that excludes certain labels. This approach ensures that each branch explicitly avoids the disturbance of the excluded labels, allowing the pseudo-labels provided for instances troubled by these excluded labels to benefit from the unaffected branches. Theoretically, we demonstrate that reduction-based pseudo-labels exhibit greater consistency with the Bayes optimal classifier compared to pseudo-labels directly generated from the predictive model.

[LG-32] Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

链接: https://arxiv.org/abs/2410.20786
作者: Jianmina Ma,Jingtian Ji,Yue Gao
关键词-EN: achieved promising progress, Constrained reinforcement learning, reinforcement learning, Constrained reinforcement, achieved promising
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods face challenges in striking the right balance between task performance and constraint satisfaction and it is prone for them to get stuck in over-conservative or constraint violating local minima. In this paper, we propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of reward and the adaptation of cost budgets during training. Our approach divides original constrained problem into two adversarial stages that are solved alternately, and the policy update performance of our algorithm can be theoretically guaranteed. We validate our method through experiments conducted on Safety Gymnasium and quadruped locomotion tasks. Results demonstrate that our algorithm achieves better performances compared to commonly used baselines.

[LG-33] An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

链接: https://arxiv.org/abs/2410.20773
作者: Saarth Vardhan,Pavani R Acharya,Samarth S Rao,Oorjitha Ratna Jasthi,S Natarajan
关键词-EN: Music source separation, individual sound sources, mixed audio signals, involves isolating individual, isolating individual sound
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance across traditional Vocal, Drum, and Bass (VDB) stems, as well as expanding into second-level hierarchical separation for sub-stems like kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of various models, leading to more balanced results across stems. For stem selection, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), ensuring that extreme values do not skew the results and that both metrics are weighted effectively. In addition to consistently high performance across the VDB stems, we also explored second-level hierarchical separation, revealing important insights into the complexities of MSS and how factors like genre and instrumentation can influence model performance. While the second-level separation results show room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in expanding model capabilities beyond VDB and improving niche stem separations such as guitar and piano.

[LG-34] ask Confusion and Catastrophic Forgetting in Class-Incremental Learning: A Mathematical Framework for Discriminative and Generative Modelings NEURIPS2024

链接: https://arxiv.org/abs/2410.20768
作者: Milad Khademi Nori,Il-Min Kim
关键词-EN: task confusion, class-incremental learning, models must classify, time without task-IDs, classify all previously
类目: Machine Learning (cs.LG)
*备注: 30 pages, 15 figures, Camera-Ready NeurIPS 2024

点击查看摘要

Abstract:In class-incremental learning (class-IL), models must classify all previously seen classes at test time without task-IDs, leading to task confusion. Despite being a key challenge, task confusion lacks a theoretical understanding. We present a novel mathematical framework for class-IL and prove the Infeasibility Theorem, showing optimal class-IL is impossible with discriminative modeling due to task confusion. However, we establish the Feasibility Theorem, demonstrating that generative modeling can achieve optimal class-IL by overcoming task confusion. We then assess popular class-IL strategies, including regularization, bias-correction, replay, and generative classifier, using our framework. Our analysis suggests that adopting generative modeling, either for generative replay or direct classification (generative classifier), is essential for optimal class-IL.

[LG-35] Faster WIND: Accelerating Iterative Best-of-N Distillation for LLM Alignment

链接: https://arxiv.org/abs/2410.20727
作者: Tong Yang,Jincheng Mei,Hanjun Dai,Zixin Wen,Shicong Cen,Dale Schuurmans,Yuejie Chi,Bo Dai
关键词-EN: aligning large language, large language models, Recent advances, iterative BOND, iterative BOND algorithm
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-N distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to the sample and computation inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on the connection, we establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization that approximates iterative BOND in the parameter space. We provides provable sample efficiency guarantee for one of the WIND variant with the square loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.

[LG-36] Wireless-Friendly Window Position Optimization for RIS-Aided Outdoor-to-Indoor Networks based on Multi-Modal Large Language Model

链接: https://arxiv.org/abs/2410.20691
作者: Jinbo Hou,Kehai Qiu,Zitian Zhang,Yong Yu,Kezhi Wang,Stefano Capolongo,Jiliang Zhang,Zeyang Li,Jie Zhang
关键词-EN: reconfigurable intelligent surfaces, window-deployed reconfigurable intelligent, utilizing large language, large language models, networks utilizing large
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks utilizing large language models (LLM) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and formulate a joint optimization problem to enhance both wireless traffic sum rate and daylight illumination performance. Then, we present a multi-modal LLM-based window optimization (LMWO) framework, accompanied by a prompt construction template to optimize the overall performance in a zero-shot fashion, functioning as both an architect and a wireless network planner. Finally, we analyze the optimization performance of the LMWO framework and the impact of the number of windows, room size, number of RIS units, and daylight factor. Numerical results demonstrate that our proposed LMWO framework can achieve outstanding optimization performance in terms of initial performance, convergence speed, final outcomes, and time complexity, compared with classic optimization methods. The building’s wireless performance can be significantly enhanced while ensuring indoor daylight performance.

[LG-37] Reprogramming Pretrained Target-Specific Diffusion Models for Dual-Target Drug Design NEURIPS2024

链接: https://arxiv.org/abs/2410.20688
作者: Xiangxin Zhou,Jiaqi Guan,Yijia Zhang,Xingang Peng,Liang Wang,Jianzhu Ma
关键词-EN: attracted significant attention, significant attention due, Dual-target therapeutic strategies, overcoming drug resistance, cancer therapy
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Dual-target therapeutic strategies have become a compelling approach and attracted significant attention due to various benefits, such as their potential in overcoming drug resistance in cancer therapy. Considering the tremendous success that deep generative models have achieved in structure-based drug design in recent years, we formulate dual-target drug design as a generative task and curate a novel dataset of potential target pairs based on synergistic drug combinations. We propose to design dual-target drugs with diffusion models that are trained on single-target protein-ligand complex pairs. Specifically, we align two pockets in 3D space with protein-ligand binding priors and build two complex graphs with shared ligand nodes for SE(3)-equivariant composed message passing, based on which we derive a composed drift in both 3D and categorical probability space in the generative process. Our algorithm can well transfer the knowledge gained in single-target pretraining to dual-target scenarios in a zero-shot manner. We also repurpose linker design methods as strong baselines for this task. Extensive experiments demonstrate the effectiveness of our method compared with various baselines.

[LG-38] Segmenting Watermarked Texts From Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.20670
作者: Xingchi Li,Guanxun Li,Xianyang Zhang
关键词-EN: unnoticeable statistical signals, involves embedding, embedding nearly unnoticeable, Watermarking, text
类目: Machine Learning (cs.LG); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 25 pages, 12 figures, 2 tables, NeurIPS 2024

点击查看摘要

Abstract:Watermarking is a technique that involves embedding nearly unnoticeable statistical signals within generated content to help trace its source. This work focuses on a scenario where an untrusted third-party user sends prompts to a trusted language model (LLM) provider, who then generates a text from their LLM with a watermark. This setup makes it possible for a detector to later identify the source of the text if the user publishes it. The user can modify the generated text by substitutions, insertions, or deletions. Our objective is to develop a statistical method to detect if a published text is LLM-generated from the perspective of a detector. We further propose a methodology to segment the published text into watermarked and non-watermarked sub-strings. The proposed approach is built upon randomization tests and change point detection techniques. We demonstrate that our method ensures Type I and Type II error control and can accurately identify watermarked sub-strings by finding the corresponding change point locations. To validate our technique, we apply it to texts generated by several language models with prompts extracted from Google’s C4 dataset and obtain encouraging numerical results. We release all code publicly at this https URL.

[LG-39] Learning Variational Inequalities from Data: Fast Generalization Rates under Strong Monotonicity

链接: https://arxiv.org/abs/2410.20649
作者: Eric Zhao,Tatjana Chavdarova,Michael Jordan
关键词-EN: Variational inequalities, problems encompassing machine, optimization problems encompassing, encompassing machine learning, machine learning problems
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Variational inequalities (VIs) are a broad class of optimization problems encompassing machine learning problems ranging from standard convex minimization to more complex scenarios like min-max optimization and computing the equilibria of multi-player games. In convex optimization, strong convexity allows for fast statistical learning rates requiring only \Theta(1/\epsilon) stochastic first-order oracle calls to find an \epsilon -optimal solution, rather than the standard \Theta(1/\epsilon^2) calls. In this paper, we explain how one can similarly obtain fast \Theta(1/\epsilon) rates for learning VIs that satisfy strong monotonicity, a generalization of strong convexity. Specifically, we demonstrate that standard stability-based generalization arguments for convex minimization extend directly to VIs when the domain admits a small covering, or when the operator is integrable and suboptimality is measured by potential functions; such as when finding equilibria in multi-player games.

[LG-40] General Causal Imputation via Synthetic Interventions

链接: https://arxiv.org/abs/2410.20647
作者: Marco Jiralerspong,Thomas Jiralerspong,Vedant Shah,Dhanya Sridhar,Gauthier Gidel
关键词-EN: sets of elements, drug compounds, researchers typically, cell types, types and drug
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Given two sets of elements (such as cell types and drug compounds), researchers typically only have access to a limited subset of their interactions. The task of causal imputation involves using this subset to predict unobserved interactions. Squires et al. (2022) have proposed two estimators for this task based on the synthetic interventions (SI) estimator: SI-A (for actions) and SI-C (for contexts). We extend their work and introduce a novel causal imputation estimator, generalized synthetic interventions (GSI). We prove the identifiability of this estimator for data generated from a more complex latent factor model. On synthetic and real data we show empirically that it recovers or outperforms their estimators.

[LG-41] Plastic Learning with Deep Fourier Features

链接: https://arxiv.org/abs/2410.20634
作者: Alex Lewandowski,Dale Schuurmans,Marlos C. Machado
关键词-EN: deep Fourier features, deep Fourier, face of non-stationarity, struggle to learn, learn continually
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks can struggle to learn continually in the face of non-stationarity. This phenomenon is known as loss of plasticity. In this paper, we identify underlying principles that lead to plastic algorithms. In particular, we provide theoretical results showing that linear function approximation, as well as a special case of deep linear networks, do not suffer from loss of plasticity. We then propose deep Fourier features, which are the concatenation of a sine and cosine in every layer, and we show that this combination provides a dynamic balance between the trainability obtained through linearity and the effectiveness obtained through the nonlinearity of neural networks. Deep networks composed entirely of deep Fourier features are highly trainable and sustain their trainability over the course of learning. Our empirical results show that continual learning performance can be drastically improved by replacing ReLU activations with deep Fourier features. These results hold for different continual learning scenarios (e.g., label noise, class incremental learning, pixel permutations) on all major supervised learning datasets used for continual learning research, such as CIFAR10, CIFAR100, and tiny-ImageNet.

[LG-42] abDiff: a Multi-Modal Diffusion Model for Tabular Data Generation

链接: https://arxiv.org/abs/2410.20626
作者: Juntong Shi,Minkai Xu,Harper Hua,Hengrui Zhang,Stefano Ermon,Jure Leskovec
关键词-EN: Synthesizing high-quality tabular, Synthesizing high-quality, data science tasks, science tasks, privacy protection
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at this https URL.

[LG-43] Practical Bayesian Algorithm Execution via Posterior Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.20596
作者: Chu Xin Cheng,Raul Astudillo,Thomas Desautels,Yisong Yue
关键词-EN: Bayesian algorithm execution, efficiently selecting evaluation, Bayesian algorithm, selecting evaluation points, framework for efficiently
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Published as a conference paper at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:We consider Bayesian algorithm execution (BAX), a framework for efficiently selecting evaluation points of an expensive function to infer a property of interest encoded as the output of a base algorithm. Since the base algorithm typically requires more evaluations than are feasible, it cannot be directly applied. Instead, BAX methods sequentially select evaluation points using a probabilistic numerical approach. Current BAX methods use expected information gain to guide this selection. However, this approach is computationally intensive. Observing that, in many tasks, the property of interest corresponds to a target set of points defined by the function, we introduce PS-BAX, a simple, effective, and scalable BAX method based on posterior sampling. PS-BAX is applicable to a wide range of problems, including many optimization variants and level set estimation. Experiments across diverse tasks demonstrate that PS-BAX performs competitively with existing baselines while being significantly faster, simpler to implement, and easily parallelizable, setting a strong baseline for future research. Additionally, we establish conditions under which PS-BAX is asymptotically convergent, offering new insights into posterior sampling as an algorithm design paradigm.

[LG-44] PaPaGei: Open Foundation Models for Optical Physiological Signals

链接: https://arxiv.org/abs/2410.20542
作者: Arvind Pillai,Dimitris Spathis,Fahim Kawsar,Mohammad Malekzadeh
关键词-EN: PPG signals, wearable devices, PPG, widely used non-invasive, non-invasive technique
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Code and models: this https URL

点击查看摘要

Abstract:Photoplethysmography (PPG) is the most widely used non-invasive technique for monitoring biosignals and cardiovascular health, with applications in both clinical settings and consumer health through wearable devices. Current machine learning models trained on PPG signals are mostly task-specific and lack generalizability. Previous works often used single-device datasets, did not explore out-of-domain generalization, or did not release their models, hindering reproducibility and further research. We introduce PaPaGei, the first open foundation model for PPG signals. PaPaGei is pre-trained on more than 57,000 hours of 20 million unlabeled segments of PPG signals using publicly available datasets exclusively. We evaluate against popular time-series foundation models and other benchmarks on 20 tasks of 10 diverse datasets spanning cardiovascular health, sleep disorders, pregnancy monitoring, and wellbeing assessment. Our architecture incorporates novel representation learning approaches that leverage differences in PPG signal morphology across individuals, enabling it to capture richer representations than traditional contrastive learning methods. Across 20 tasks, PaPaGei improves classification and regression performance by an average of 6.3% and 2.9%, respectively, compared to other competitive time-series foundation models in at least 14 tasks. PaPaGei is more data- and parameter-efficient than other foundation models or methods, as it outperforms 70x larger models. Beyond accuracy, we also investigate robustness against different skin tones, establishing a benchmark for bias evaluations of future models. Notably, PaPaGei can be used out of the box as both a feature extractor and an encoder for other multimodal models, opening up new opportunities for multimodal health monitoring

[LG-45] Info-CELS: Informative Saliency Map Guided Counterfactual Explanation

链接: https://arxiv.org/abs/2410.20539
作者: Peiyu Li,Omar Bahri,Pouya Hosseinzadeh,Soukaïna Filali Boubrahimi,Shah Muhammad Hamdi
关键词-EN: interpretable machine learning, machine learning approaches, learning approaches continues, Explainable Artificial Intelligence, continues to grow
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:As the demand for interpretable machine learning approaches continues to grow, there is an increasing necessity for human involvement in providing informative explanations for model decisions. This is necessary for building trust and transparency in AI-based systems, leading to the emergence of the Explainable Artificial Intelligence (XAI) field. Recently, a novel counterfactual explanation model, CELS, has been introduced. CELS learns a saliency map for the interest of an instance and generates a counterfactual explanation guided by the learned saliency map. While CELS represents the first attempt to exploit learned saliency maps not only to provide intuitive explanations for the reason behind the decision made by the time series classifier but also to explore post hoc counterfactual explanations, it exhibits limitations in terms of high validity for the sake of ensuring high proximity and sparsity. In this paper, we present an enhanced approach that builds upon CELS. While the original model achieved promising results in terms of sparsity and proximity, it faced limitations in validity. Our proposed method addresses this limitation by removing mask normalization to provide more informative and valid counterfactual explanations. Through extensive experimentation on datasets from various domains, we demonstrate that our approach outperforms the CELS model, achieving higher validity and producing more informative explanations.

[LG-46] A Cosmic-Scale Benchmark for Symmetry-Preserving Data Processing NEURIPS2024

链接: https://arxiv.org/abs/2410.20516
作者: Julia Balla,Siddharth Mishra-Sharma,Carolina Cuesta-Lazaro,Tommi Jaakkola,Tess Smidt
关键词-EN: Efficiently processing structured, processing structured point, Efficiently processing, structured point cloud, preserving multiscale information
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: 19 pages, 3 figures; To appear at the NeurReps Workshop @ NeurIPS 2024

点击查看摘要

Abstract:Efficiently processing structured point cloud data while preserving multiscale information is a key challenge across domains, from graphics to atomistic modeling. Using a curated dataset of simulated galaxy positions and properties, represented as point clouds, we benchmark the ability of graph neural networks to simultaneously capture local clustering environments and long-range correlations. Given the homogeneous and isotropic nature of the Universe, the data exhibits a high degree of symmetry. We therefore focus on evaluating the performance of Euclidean symmetry-preserving ( E(3) -equivariant) graph neural networks, showing that they can outperform non-equivariant counterparts and domain-specific information extraction techniques in downstream performance as well as simulation-efficiency. However, we find that current architectures fail to capture information from long-range correlations as effectively as domain-specific baselines, motivating future work on architectures better suited for extracting long-range information.

[LG-47] When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning

链接: https://arxiv.org/abs/2410.20495
作者: Advik Raj Basani,Siddharth Chaitra Vivek,Advaith Krishna,Arnab K. Paul
关键词-EN: Distributed Machine Learning, Distributed Machine, Machine Learning, holds immense potential, devices holds immense
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 pages, 19 figures, 3 tables; code: this https URL

点击查看摘要

Abstract:Distributed Machine Learning (DML) on resource-constrained edge devices holds immense potential for real-world applications. However, achieving fast convergence in DML in these heterogeneous environments remains a significant challenge. Traditional frameworks like Bulk Synchronous Parallel and Asynchronous Stochastic Parallel rely on frequent, small updates that incur substantial communication overhead and hinder convergence speed. Furthermore, these frameworks often employ static dataset sizes, neglecting the heterogeneity of edge devices and potentially leading to straggler nodes that slow down the entire training process. The straggler nodes, i.e., edge devices that take significantly longer to process their assigned data chunk, hinder the overall training speed. To address these limitations, this paper proposes Hermes, a novel probabilistic framework for efficient DML on edge devices. This framework leverages a dynamic threshold based on recent test loss behavior to identify statistically significant improvements in the model’s generalization capability, hence transmitting updates only when major improvements are detected, thereby significantly reducing communication overhead. Additionally, Hermes employs dynamic dataset allocation to optimize resource utilization and prevents performance degradation caused by straggler nodes. Our evaluations on a real-world heterogeneous resource-constrained environment demonstrate that Hermes achieves faster convergence compared to state-of-the-art methods, resulting in a remarkable 13.22 x reduction in training time and a 62.1% decrease in communication overhead.

[LG-48] Hamiltonian Score Matching and Generative Flows

链接: https://arxiv.org/abs/2410.20470
作者: Peter Holderrieth,Yilun Xu,Tommi Jaakkola
关键词-EN: Hamiltonian Monte Carlo, Classical Hamiltonian mechanics, Monte Carlo, Classical Hamiltonian, Carlo for applications
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Classical Hamiltonian mechanics has been widely used in machine learning in the form of Hamiltonian Monte Carlo for applications with predetermined force fields. In this work, we explore the potential of deliberately designing force fields for Hamiltonian ODEs, introducing Hamiltonian velocity predictors (HVPs) as a tool for score matching and generative models. We present two innovations constructed with HVPs: Hamiltonian Score Matching (HSM), which estimates score functions by augmenting data via Hamiltonian trajectories, and Hamiltonian Generative Flows (HGFs), a novel generative model that encompasses diffusion models and flow matching as HGFs with zero force fields. We showcase the extended design space of force fields by introducing Oscillation HGFs, a generative model inspired by harmonic oscillators. Our experiments validate our theoretical insights about HSM as a novel score matching metric and demonstrate that HGFs rival leading generative modeling techniques.

[LG-49] Integrating uncertainty quantification into randomized smoothing based robustness guarantees

链接: https://arxiv.org/abs/2410.20432
作者: Sina Däubener,Kira Maag,David Krueger,Asja Fischer
关键词-EN: Deep neural networks, Deep neural, hazardous incorrect predictions, extremely powerful, safety-critical applications
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deep neural networks have proven to be extremely powerful, however, they are also vulnerable to adversarial attacks which can cause hazardous incorrect predictions in safety-critical applications. Certified robustness via randomized smoothing gives a probabilistic guarantee that the smoothed classifier’s predictions will not change within an \ell_2 -ball around a given input. On the other hand (uncertainty) score-based rejection is a technique often applied in practice to defend models against adversarial attacks. In this work, we fuse these two approaches by integrating a classifier that abstains from predicting when uncertainty is high into the certified robustness framework. This allows us to derive two novel robustness guarantees for uncertainty aware classifiers, namely (i) the radius of an \ell_2 -ball around the input in which the same label is predicted and uncertainty remains low and (ii) the \ell_2 -radius of a ball in which the predictions will either not change or be uncertain. While the former provides robustness guarantees with respect to attacks aiming at increased uncertainty, the latter informs about the amount of input perturbation necessary to lead the uncertainty aware model into a wrong prediction. Notably, this is on CIFAR10 up to 20.93% larger than for models not allowing for uncertainty based rejection. We demonstrate, that the novel framework allows for a systematic robustness evaluation of different network architectures and uncertainty measures and to identify desired properties of uncertainty quantification techniques. Moreover, we show that leveraging uncertainty in a smoothed classifier helps out-of-distribution detection.

[LG-50] Causal Modeling in Multi-Context Systems: Distinguishing Multiple Context-Specific Causal Graphs which Account for Observational Support

链接: https://arxiv.org/abs/2410.20405
作者: Martin Rabel,Wiebke Günther,Jakob Runge,Andreas Gerhardus
关键词-EN: multiple contexts carries, Causal, multiple contexts, contexts carries, causal graphs
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Causal structure learning with data from multiple contexts carries both opportunities and challenges. Opportunities arise from considering shared and context-specific causal graphs enabling to generalize and transfer causal knowledge across contexts. However, a challenge that is currently understudied in the literature is the impact of differing observational support between contexts on the identifiability of causal graphs. Here we study in detail recently introduced [6] causal graph objects that capture both causal mechanisms and data support, allowing for the analysis of a larger class of context-specific changes, characterizing distribution shifts more precisely. We thereby extend results on the identifiability of context-specific causal structures and propose a framework to model context-specific independence (CSI) within structural causal models (SCMs) in a refined way that allows to explore scenarios where these graph objects differ. We demonstrate how this framework can help explaining phenomena like anomalies or extreme events, where causal mechanisms change or appear to change under different conditions. Our results contribute to the theoretical foundations for understanding causal relations in multi-context systems, with implications for generalization, transfer learning, and anomaly detection. Future work may extend this approach to more complex data types, such as time-series.

[LG-51] Prototypical Extreme Multi-label Classification with a Dynamic Margin Loss

链接: https://arxiv.org/abs/2410.20401
作者: Kunal Dahiya,Diego Ortego,David Jiménez
关键词-EN: Extreme Multi-label Classification, Multi-label Classification, predict relevant labels, methods predict relevant, Extreme Multi-label
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Extreme Multi-label Classification (XMC) methods predict relevant labels for a given query in an extremely large label space. Recent works in XMC address this problem using deep encoders that project text descriptions to an embedding space suitable for recovering the closest labels. However, learning deep models can be computationally expensive in large output spaces, resulting in a trade-off between high performing brute-force approaches and efficient solutions. In this paper, we propose PRIME, a XMC method that employs a novel prototypical contrastive learning technique to reconcile efficiency and performance surpassing brute-force approaches. We frame XMC as a data-to-prototype prediction task where label prototypes aggregate information from related queries. More precisely, we use a shallow transformer encoder that we coin as Label Prototype Network, which enriches label representations by aggregating text-based embeddings, label centroids and learnable free vectors. We jointly train a deep encoder and the Label Prototype Network using an adaptive triplet loss objective that better adapts to the high granularity and ambiguity of extreme label spaces. PRIME achieves state-of-the-art results in several public benchmarks of different sizes and domains, while keeping the model efficient.

[LG-52] Evaluation of uncertainty estimations for Gaussian process regression based machine learning interatomic potentials

链接: https://arxiv.org/abs/2410.20398
作者: Matthias Holzenkamp,Dongyu Lyu,Ulrich Kleinekathöfer,Peter Zaspel
关键词-EN: Machine learning interatomic, quantum chemical calculations, expensive quantum chemical, Machine learning, learning interatomic potentials
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Machine learning interatomic potentials (MLIPs) have seen significant advances as efficient replacement of expensive quantum chemical calculations. Uncertainty estimations for MLIPs are crucial to quantify the additional model error they introduce and to leverage this information in active learning strategies. MLIPs that are based on Gaussian process regression provide a standard deviation as a possible uncertainty measure. An alternative approach are ensemble-based uncertainties. Although these uncertainty measures have been applied to active learning, it has rarely been studied how they correlate with the error, and it is not always clear whether active learning actually outperforms random sampling strategies. We consider GPR models with Coulomb and SOAP representations as inputs to predict potential energy surfaces and excitation energies of molecules. We evaluate, how the GPR variance and ensemble-based uncertainties relate to the error and whether model performance improves by selecting the most uncertain samples from a fixed configuration space. For the ensemble based uncertainty estimations, we find that they often do not provide any information about the error. For the GPR standard deviation, we find that often predictions with an increasing standard deviation also have an increasing systematical bias, which is not captured by the uncertainty. In these cases, selecting training samples with the highest uncertainty leads to a model with a worse test error compared to random sampling. We conclude that confidence intervals, which are derived from the predictive standard deviation, can be highly overconfident. Selecting samples with high GPR standard deviation leads to a model that overemphasizes the borders of the configuration space represented in the fixed dataset. This may result in worse performance in more densely sampled areas but better generalization for extrapolation tasks.

[LG-53] Hierarchical Multiple Kernel K-Means Algorithm Based on Sparse Connectivity

链接: https://arxiv.org/abs/2410.20391
作者: Lei Wang,Liang Du,Peng Zhou
关键词-EN: hierarchical multiple kernel, Multiple kernel, multiple kernel K-Means, aims to find, find an optimal
类目: Machine Learning (cs.LG)
*备注: in Chinese language

点击查看摘要

Abstract:Multiple kernel learning (MKL) aims to find an optimal, consistent kernel function. In the hierarchical multiple kernel clustering (HMKC) algorithm, sample features are extracted layer by layer from a high-dimensional space to maximize the retention of effective information. However, information interaction between layers is often ignored. In this model, only corresponding nodes in adjacent layers exchange information; other nodes remain isolated, and if full connectivity is adopted, the diversity of the final consistency matrix is reduced. Therefore, this paper proposes a hierarchical multiple kernel K-Means (SCHMKKM) algorithm based on sparse connectivity, which controls the assignment matrix to achieve sparse connections through a sparsity rate, thereby locally fusing the features obtained by distilling information between layers. Finally, we conduct cluster analysis on multiple datasets and compare it with the fully connected hierarchical multiple kernel K-Means (FCHMKKM) algorithm in experiments. It is shown that more discriminative information fusion is beneficial for learning a better consistent partition matrix, and the fusion strategy based on sparse connection outperforms the full connection strategy.

[LG-54] Unsupervised Feature Selection Algorithm Based on Dual Manifold Re-ranking

链接: https://arxiv.org/abs/2410.20388
作者: Yunhui Liang,Jianwen Gan,Yan Chen,Peng Zhou,Liang Du
关键词-EN: data analysis tasks, numerous data analysis, High-dimensional data, unsupervised feature selection, Feature selection
类目: Machine Learning (cs.LG)
*备注: in Chinese language

点击查看摘要

Abstract:High-dimensional data is commonly encountered in numerous data analysis tasks. Feature selection techniques aim to identify the most representative features from the original high-dimensional data. Due to the absence of class label information, it is significantly more challenging to select appropriate features in unsupervised learning scenarios compared to supervised ones. Traditional unsupervised feature selection methods typically score the features of samples based on certain criteria, treating samples indiscriminately. However, these approaches fail to fully capture the internal structure of the data. The importance of different samples should vary, and there is a dual relationship between the weight of samples and features that will influence each other. Therefore, an unsupervised feature selection algorithm based on dual manifold re-ranking (DMRR) is proposed in this paper. Different similarity matrices are constructed to depict the manifold structures among samples, between samples and features, and among features themselves. Then, manifold re-ranking is performed by combining the initial scores of samples and features. By comparing DMRR with three original unsupervised feature selection algorithms and two unsupervised feature selection post-processing algorithms, experimental results confirm that the importance information of different samples and the dual relationship between sample and feature are beneficial for achieving better feature selection.

[LG-55] Rethinking Reconstruction-based Graph-Level Anomaly Detection: Limitations and a Simple Remedy NEURIPS2024

链接: https://arxiv.org/abs/2410.20366
作者: Sunwoo Kim,Soo Yong Lee,Fanchen Bu,Shinhwan Kang,Kyungho Kim,Jaemin Yoo,Kijung Shin
关键词-EN: GLAD, learn representations, aiming to accurately, Graph, Graph autoencoders
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Published as a conference paper at NeurIPS 2024

点击查看摘要

Abstract:Graph autoencoders (Graph-AEs) learn representations of given graphs by aiming to accurately reconstruct them. A notable application of Graph-AEs is graph-level anomaly detection (GLAD), whose objective is to identify graphs with anomalous topological structures and/or node features compared to the majority of the graph population. Graph-AEs for GLAD regard a graph with a high mean reconstruction error (i.e. mean of errors from all node pairs and/or nodes) as anomalies. Namely, the methods rest on the assumption that they would better reconstruct graphs with similar characteristics to the majority. We, however, report non-trivial counter-examples, a phenomenon we call reconstruction flip, and highlight the limitations of the existing Graph-AE-based GLAD methods. Specifically, we empirically and theoretically investigate when this assumption holds and when it fails. Through our analyses, we further argue that, while the reconstruction errors for a given graph are effective features for GLAD, leveraging the multifaceted summaries of the reconstruction errors, beyond just mean, can further strengthen the features. Thus, we propose a novel and simple GLAD method, named MUSE. The key innovation of MUSE involves taking multifaceted summaries of reconstruction errors as graph features for GLAD. This surprisingly simple method obtains SOTA performance in GLAD, performing best overall among 14 methods across 10 datasets.

[LG-56] FoldMark: Protecting Protein Generative Models with Watermarking

链接: https://arxiv.org/abs/2410.20354
作者: Zaixi Zhang,Ruofan Jin,Kaidi Fu,Le Cong,Marinka Zitnik,Mengdi Wang
关键词-EN: protein generative models, understanding protein function, generative models, protein generative, Protein
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Protein structure is key to understanding protein function and is essential for progress in bioengineering, drug discovery, and molecular biology. Recently, with the incorporation of generative AI, the power and accuracy of computational protein structure prediction/design have been improved significantly. However, ethical concerns such as copyright protection and harmful content generation (biosecurity) pose challenges to the wide implementation of protein generative models. Here, we investigate whether it is possible to embed watermarks into protein generative models and their outputs for copyright authentication and the tracking of generated structures. As a proof of concept, we propose a two-stage method FoldMark as a generalized watermarking strategy for protein generative models. FoldMark first pretrain watermark encoder and decoder, which can minorly adjust protein structures to embed user-specific information and faithfully recover the information from the encoded structure. In the second step, protein generative models are fine-tuned with watermark Low-Rank Adaptation (LoRA) modules to preserve generation quality while learning to generate watermarked structures with high recovery rates. Extensive experiments are conducted on open-source protein structure prediction models (e.g., ESMFold and MultiFlow) and de novo structure design models (e.g., FrameDiff and FoldFlow) and we demonstrate that our method is effective across all these generative models. Meanwhile, our watermarking framework only exerts a negligible impact on the original protein structure quality and is robust under potential post-processing and adaptive attacks.

[LG-57] Leveraging Auxiliary Task Relevance for Enhanced Industrial Fault Diagnosis through Curriculum Meta-learning

链接: https://arxiv.org/abs/2410.20351
作者: Jinze Wang,Tiehua Zhang,Boon Xian Chai,Adriano Di Pietro,Dimitrios Georgakopoulos,Jiong Jin
关键词-EN: maintaining operational safety, smart manufacturing, machine breakdowns, breakdowns is crucial, crucial for maintaining
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The accurate diagnosis of machine breakdowns is crucial for maintaining operational safety in smart manufacturing. Despite the promise shown by deep learning in automating fault identification, the scarcity of labeled training data, particularly for equipment failure instances, poses a significant challenge. This limitation hampers the development of robust classification models. Existing methods like model-agnostic meta-learning (MAML) do not adequately address variable working conditions, affecting knowledge transfer. To address these challenges, a Related Task Aware Curriculum Meta-learning (RT-ACM) enhanced fault diagnosis framework is proposed in this paper, inspired by human cognitive learning processes. RT-ACM improves training by considering the relevance of auxiliary working conditions, adhering to the principle of paying more attention to more relevant knowledge", and focusing on easier first, harder later" curriculum sampling. This approach aids the meta-learner in achieving a superior convergence state. Extensive experiments on two real-world datasets demonstrate the superiority of RT-ACM framework.

[LG-58] Logarithmically Quantized Distributed Optimization over Dynamic Multi-Agent Networks

链接: https://arxiv.org/abs/2410.20345
作者: Mohammadreza Doostmohammadian,Sérgio Pequito
关键词-EN: signal processing, machine learning, Distributed optimization finds, Distributed optimization, optimization finds
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Distributed optimization finds many applications in machine learning, signal processing, and control systems. In these real-world applications, the constraints of communication networks, particularly limited bandwidth, necessitate implementing quantization techniques. In this paper, we propose distributed optimization dynamics over multi-agent networks subject to logarithmically quantized data transmission. Under this condition, data exchange benefits from representing smaller values with more bits and larger values with fewer bits. As compared to uniform quantization, this allows for higher precision in representing near-optimal values and more accuracy of the distributed optimization algorithm. The proposed optimization dynamics comprise a primary state variable converging to the optimizer and an auxiliary variable tracking the objective function’s gradient. Our setting accommodates dynamic network topologies, resulting in a hybrid system requiring convergence analysis using matrix perturbation theory and eigenspectrum analysis.

[LG-59] Intuitionistic Fuzzy Universum Twin Support Vector Machine for Imbalanced Data

链接: https://arxiv.org/abs/2410.20335
作者: A. Quadir,M. Tanveer
关键词-EN: proposed IFUTSVM-ID model, proposed IFUTSVM-ID, major difficulties, machine learning methods, IFUTSVM-ID model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the major difficulties in machine learning methods is categorizing datasets that are imbalanced. This problem may lead to biased models, where the training process is dominated by the majority class, resulting in inadequate representation of the minority class. Universum twin support vector machine (UTSVM) produces a biased model towards the majority class, as a result, its performance on the minority class is often poor as it might be mistakenly classified as noise. Moreover, UTSVM is not proficient in handling datasets that contain outliers and noises. Inspired by the concept of incorporating prior information about the data and employing an intuitionistic fuzzy membership scheme, we propose intuitionistic fuzzy universum twin support vector machines for imbalanced data (IFUTSVM-ID). We use an intuitionistic fuzzy membership scheme to mitigate the impact of noise and outliers. Moreover, to tackle the problem of imbalanced class distribution, data oversampling and undersampling methods are utilized. Prior knowledge about the data is provided by universum data. This leads to better generalization performance. UTSVM is susceptible to overfitting risks due to the omission of the structural risk minimization (SRM) principle in their primal formulations. However, the proposed IFUTSVM-ID model incorporates the SRM principle through the incorporation of regularization terms, effectively addressing the issue of overfitting. We conduct a comprehensive evaluation of the proposed IFUTSVM-ID model on benchmark datasets from KEEL and compare it with existing baseline models. Furthermore, to assess the effectiveness of the proposed IFUTSVM-ID model in diagnosing Alzheimer’s disease (AD), we applied them to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Experimental results showcase the superiority of the proposed IFUTSVM-ID models compared to the baseline models.

[LG-60] Embedded Nonlocal Operator Regression (ENOR): Quantifying model error in learning nonlocal operators

链接: https://arxiv.org/abs/2410.20331
作者: Yiming Fan,Habib Najm,Yue Yu,Stewart Silling,Marta D’Elia
关键词-EN: represent long-range dependence, Nonlocal Operator Regression, Operator Regression, Nonlocal Operator, multiscale effects
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Nonlocal, integral operators have become an efficient surrogate for bottom-up homogenization, due to their ability to represent long-range dependence and multiscale effects. However, the nonlocal homogenized model has unavoidable discrepancy from the microscale model. Such errors accumulate and propagate in long-term simulations, making the resultant prediction unreliable. To develop a robust and reliable bottom-up homogenization framework, we propose a new framework, which we coin Embedded Nonlocal Operator Regression (ENOR), to learn a nonlocal homogenized surrogate model and its structural model error. This framework provides discrepancy-adaptive uncertainty quantification for homogenized material response predictions in long-term simulations. The method is built on Nonlocal Operator Regression (NOR), an optimization-based nonlocal kernel learning approach, together with an embedded model error term in the trainable kernel. Then, Bayesian inference is employed to infer the model error term parameters together with the kernel parameters. To make the problem computationally feasible, we use a multilevel delayed acceptance Markov chain Monte Carlo (MLDA-MCMC) method, enabling efficient Bayesian model calibration and model error estimation. We apply this technique to predict long-term wave propagation in a heterogeneous one-dimensional bar, and compare its performance with additive noise models. Owing to its ability to capture model error, the learned ENOR achieves improved estimation of posterior predictive uncertainty.

[LG-61] Domain Specific Data Distillation and Multi-modal Embedding Generation

链接: https://arxiv.org/abs/2410.20325
作者: Sharadind Peddiraju,Srini Rajagopal
关键词-EN: creating domain-centric embeddings, domain-centric embeddings arises, challenge of creating, creating domain-centric, Hybrid Collaborative Filtering
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:The challenge of creating domain-centric embeddings arises from the abundance of unstructured data and the scarcity of domain-specific structured data. Conventional embedding techniques often rely on either modality, limiting their applicability and efficacy. This paper introduces a novel modeling approach that leverages structured data to filter noise from unstructured data, resulting in embeddings with high precision and recall for domain-specific attribute prediction. The proposed model operates within a Hybrid Collaborative Filtering (HCF) framework, where generic entity representations are fine-tuned through relevant item prediction tasks. Our experiments, focusing on the cloud computing domain, demonstrate that HCF-based embeddings outperform AutoEncoder-based embeddings (using purely unstructured data), achieving a 28% lift in precision and an 11% lift in recall for domain-specific attribute prediction.

[LG-62] ProtSCAPE: Mapping the landscape of protein conformations in molecular dynamics

链接: https://arxiv.org/abs/2410.20317
作者: Siddharth Viswanath,Dhananjay Bhaskar,David R. Johnson,Joao Felipe Rocha,Egbert Castro,Jackson D. Grady,Alex T. Grigas,Michael A. Perlmutter,Corey S. O’Hern,Smita Krishnaswamy
关键词-EN: biological functions, essential for comprehending, comprehending their biological, geometric scattering transform, protein
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注: Accepted as a short paper at the 5th Molecular Machine Learning Conference (MoML 2024)

点击查看摘要

Abstract:Understanding the dynamic nature of protein structures is essential for comprehending their biological functions. While significant progress has been made in predicting static folded structures, modeling protein motions on microsecond to millisecond scales remains challenging. To address these challenges, we introduce a novel deep learning architecture, Protein Transformer with Scattering, Attention, and Positional Embedding (ProtSCAPE), which leverages the geometric scattering transform alongside transformer-based attention mechanisms to capture protein dynamics from molecular dynamics (MD) simulations. ProtSCAPE utilizes the multi-scale nature of the geometric scattering transform to extract features from protein structures conceptualized as graphs and integrates these features with dual attention structures that focus on residues and amino acid signals, generating latent representations of protein trajectories. Furthermore, ProtSCAPE incorporates a regression head to enforce temporally coherent latent representations.

[LG-63] Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model NEURIPS2024

链接: https://arxiv.org/abs/2410.20312
作者: Jing Zhang,Linjiajie Fang,Kexin Shi,Wenjia Wang,Bing-Yi Jing
关键词-EN: offline reinforcement learning, main obstacle, success of offline, offline reinforcement, Q-value distribution learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Neurips 2024

点击查看摘要

Abstract:``Distribution shift’’ is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy’s knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. The proposed QDQ demonstrates solid theoretical guarantees for the accuracy of Q-value distribution learning and uncertainty measurement, as well as the performance of the learning policy. QDQ consistently shows strong performance on the D4RL benchmark and achieves significant improvements across many tasks.

[LG-64] Predicting Mortality and Functional Status Scores of Traumatic Brain Injury Patients using Supervised Machine Learning

链接: https://arxiv.org/abs/2410.20300
作者: Lucas Steinmetz,Shivam Maheshwari,Garik Kazanjian,Abigail Loyson,Tyler Alexander,Venkat Margapuri,C. Nataraj
关键词-EN: Traumatic brain injury, public health challenge, significant public health, Functional Status Scale, Traumatic brain
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traumatic brain injury (TBI) presents a significant public health challenge, often resulting in mortality or lasting disability. Predicting outcomes such as mortality and Functional Status Scale (FSS) scores can enhance treatment strategies and inform clinical decision-making. This study applies supervised machine learning (ML) methods to predict mortality and FSS scores using a real-world dataset of 300 pediatric TBI patients from the University of Colorado School of Medicine. The dataset captures clinical features, including demographics, injury mechanisms, and hospitalization outcomes. Eighteen ML models were evaluated for mortality prediction, and thirteen models were assessed for FSS score prediction. Performance was measured using accuracy, ROC AUC, F1-score, and mean squared error. Logistic regression and Extra Trees models achieved high precision in mortality prediction, while linear regression demonstrated the best FSS score prediction. Feature selection reduced 103 clinical variables to the most relevant, enhancing model efficiency and interpretability. This research highlights the role of ML models in identifying high-risk patients and supporting personalized interventions, demonstrating the potential of data-driven analytics to improve TBI care and integrate into clinical workflows.

[LG-65] DeCaf: A Causal Decoupling Framework for OOD Generalization on Node Classification

链接: https://arxiv.org/abs/2410.20295
作者: Xiaoxue Han,Huzefa Rangwala,Yue Ning
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, creating vulnerability, critical domains
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are susceptible to distribution shifts, creating vulnerability and security issues in critical domains. There is a pressing need to enhance the generalizability of GNNs on out-of-distribution (OOD) test data. Existing methods that target learning an invariant (feature, structure)-label mapping often depend on oversimplified assumptions about the data generation process, which do not adequately reflect the actual dynamics of distribution shifts in graphs. In this paper, we introduce a more realistic graph data generation model using Structural Causal Models (SCMs), allowing us to redefine distribution shifts by pinpointing their origins within the generation process. Building on this, we propose a casual decoupling framework, DeCaf, that independently learns unbiased feature-label and structure-label mappings. We provide a detailed theoretical framework that shows how our approach can effectively mitigate the impact of various distribution shifts. We evaluate DeCaf across both real-world and synthetic datasets that demonstrate different patterns of shifts, confirming its efficacy in enhancing the generalizability of GNNs.

[LG-66] A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods Challenges and Biases

链接: https://arxiv.org/abs/2410.20293
作者: Yunchong Liu,Xiaorui Shen,Yeyubei Zhang,Zhongyan Wang,Yexin Tian,Jianglai Dai,Yuchen Cao
关键词-EN: necessitating automated detection, automated detection systems, Social media platforms, platforms like Twitter, Instagram have facilitated
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.

[LG-67] Classification under strategic adversary manipulation using pessimistic bilevel optimisation

链接: https://arxiv.org/abs/2410.20284
作者: David Benfield,Stefano Coniglio,Martin Kunc,Phan Tu Vuong,Alain Zemkoho
关键词-EN: Adversarial machine learning, machine learning concerns, learning concerns situations, learners face attacks, Adversarial machine
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 27 pages, 5 figures, under review

点击查看摘要

Abstract:Adversarial machine learning concerns situations in which learners face attacks from active adversaries. Such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever improving generation of malicious this http URL model these interactions between the learner and the adversary as a game and formulate the problem as a pessimistic bilevel optimisation problem with the learner taking the role of the leader. The adversary, modelled as a stochastic data generator, takes the role of the follower, generating data in response to the classifier. While existing models rely on the assumption that the adversary will choose the least costly solution leading to a convex lower-level problem with a unique solution, we present a novel model and solution method which do not make such assumptions. We compare these to the existing approach and see significant improvements in performance suggesting that relaxing these assumptions leads to a more realistic model.

[LG-68] Proactive Fraud Defense: Machine Learnings Evolving Role in Protecting Against Online Fraud

链接: https://arxiv.org/abs/2410.20281
作者: Md Kamrul Hasan Chy
关键词-EN: machine learning, evolving tactics employed, traditional fraud detection, fraud detection methods, sophisticated and pervasive
类目: Machine Learning (cs.LG)
*备注: World Journal of Advanced Research and Reviews (2024)

点击查看摘要

Abstract:As online fraud becomes more sophisticated and pervasive, traditional fraud detection methods are struggling to keep pace with the evolving tactics employed by fraudsters. This paper explores the transformative role of machine learning in addressing these challenges by offering more advanced, scalable, and adaptable solutions for fraud detection and prevention. By analyzing key models such as Random Forest, Neural Networks, and Gradient Boosting, this paper highlights the strengths of machine learning in processing vast datasets, identifying intricate fraud patterns, and providing real-time predictions that enable a proactive approach to fraud prevention. Unlike rule-based systems that react after fraud has occurred, machine learning models continuously learn from new data, adapting to emerging fraud schemes and reducing false positives, which ultimately minimizes financial losses. This research emphasizes the potential of machine learning to revolutionize fraud detection frameworks by making them more dynamic, efficient, and capable of handling the growing complexity of fraud across various industries. Future developments in machine learning, including deep learning and hybrid models, are expected to further enhance the predictive accuracy and applicability of these systems, ensuring that organizations remain resilient in the face of new and emerging fraud tactics.

[LG-69] Centaur: a foundation model of human cognition

链接: https://arxiv.org/abs/2410.20268
作者: Marcel Binz,Elif Akata,Matthias Bethge,Franziska Brändle,Fred Callaway,Julian Coda-Forno,Peter Dayan,Can Demircan,Maria K. Eckstein,Noémi Éltető,Thomas L. Griffiths,Susanne Haridi,Akshay K. Jagadish,Li Ji-An,Alexander Kipnis,Sreejan Kumar,Tobias Ludwig,Marvin Mathony,Marcelo Mattar,Alireza Modirshanechi,Surabhi S. Nath,Joshua C. Peterson,Milena Rmus,Evan M. Russek,Tankred Saanum,Natalia Scharfenberg,Johannes A. Schubert,Luca M. Schulze Buschoff,Nishad Singhi,Xin Sui,Mirko Thalmann,Fabian Theis,Vuong Truong,Vishaal Udandarao,Konstantinos Voudouris,Robert Wilson,Kristin Witte,Shuchen Wu,Dirk Wulff,Huadong Xiong,Eric Schulz
关键词-EN: Establishing a unified, goal of psychology, major goal, Establishing, Centaur
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Establishing a unified theory of cognition has been a major goal of psychology. While there have been previous attempts to instantiate such theories by building computational models, we currently do not have one model that captures the human mind in its entirety. Here we introduce Centaur, a computational model that can predict and simulate human behavior in any experiment expressible in natural language. We derived Centaur by finetuning a state-of-the-art language model on a novel, large-scale data set called Psych-101. Psych-101 reaches an unprecedented scale, covering trial-by-trial data from over 60,000 participants performing over 10,000,000 choices in 160 experiments. Centaur not only captures the behavior of held-out participants better than existing cognitive models, but also generalizes to new cover stories, structural task modifications, and entirely new domains. Furthermore, we find that the model’s internal representations become more aligned with human neural activity after finetuning. Taken together, Centaur is the first real candidate for a unified model of human cognition. We anticipate that it will have a disruptive impact on the cognitive sciences, challenging the existing paradigm for developing computational models.

[LG-70] Learning Approximated Maximal Safe Sets via Hypernetworks for MPC-Based Local Motion Planning

链接: https://arxiv.org/abs/2410.20267
作者: Bojan Derajić,Mohamed-Khalil Bouzidi,Sebastian Bernhard,Wolfgang Hönig
关键词-EN: maximal safe sets, motion planning tasks, local motion planning, mobile robotics, paper presents
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper presents a novel learning-based approach for online estimation of maximal safe sets for local motion planning tasks in mobile robotics. We leverage the idea of hypernetworks to achieve good generalization properties and real-time performance simultaneously. As the source of supervision, we employ the Hamilton-Jacobi (HJ) reachability analysis, allowing us to consider general nonlinear dynamics and arbitrary constraints. We integrate our model into a model predictive control (MPC) local planner as a safety constraint and compare the performance with relevant baselines in realistic 3D simulations for different environments and robot dynamics. The results show the advantages of our approach in terms of a significantly higher success rate: 2 to 18 percent over the best baseline, while achieving real-time performance.

[LG-71] Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL NEURIPS2024

链接: https://arxiv.org/abs/2410.20254
作者: Andrew Wagenmaker,Kevin Huang,Liyiming Ke,Byron Boots,Kevin Jamieson,Abhishek Gupta
关键词-EN: real world, generalizes effectively, order to mitigate, train a policy, deploy this policy
类目: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emphdirect sim2real transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emphexploratory policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches – least-squares regression oracles and naive randomized exploration – yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.

[LG-72] Convergence Guarantees for the DeepWalk Embedding on Block Models

链接: https://arxiv.org/abs/2410.20248
作者: Christopher Harker,Aditya Bhaskara
关键词-EN: powerful tool, tool for understanding, Stochastic Block Model, graphs, solving nonlinear optimization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph embeddings have emerged as a powerful tool for understanding the structure of graphs. Unlike classical spectral methods, recent methods such as DeepWalk, Node2Vec, etc. are based on solving nonlinear optimization problems on the graph, using local information obtained by performing random walks. These techniques have empirically been shown to produce ‘‘better’’ embeddings than their classical counterparts. However, due to their reliance on solving a nonconvex optimization problem, obtaining theoretical guarantees on the properties of the solution has remained a challenge, even for simple classes of graphs. In this work, we show convergence properties for the DeepWalk algorithm on graphs obtained from the Stochastic Block Model (SBM). Despite being simplistic, the SBM has proved to be a classic model for analyzing the behavior of algorithms on large graphs. Our results mirror the existing ones for spectral embeddings on SBMs, showing that even in the case of one-dimensional embeddings, the output of the DeepWalk algorithm provably recovers the cluster structure with high probability.

[LG-73] Model Equality Testing: Which Model Is This API Serving?

链接: https://arxiv.org/abs/2410.20247
作者: Irena Gao,Percy Liang,Carlos Guestrin
关键词-EN: Azure AI Studio, Amazon Bedrock, Bedrock and Azure, accessed via Amazon, large language models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution – often without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.

[LG-74] Hoeffding adaptive trees for multi-label classification on data streams

链接: https://arxiv.org/abs/2410.20242
作者: Aurora Esteban,Alberto Cano,Amelia Zafra,Sebastián Ventura
关键词-EN: increasing real-world scenarios, real-world scenarios generating, scenarios generating data, multi-label data stream, Data stream learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data stream learning is a very relevant paradigm because of the increasing real-world scenarios generating data at high velocities and in unbounded sequences. Stream learning aims at developing models that can process instances as they arrive, so models constantly adapt to new concepts and the temporal evolution in the stream. In multi-label data stream environments where instances have the peculiarity of belonging simultaneously to more than one class, the problem becomes even more complex and poses unique challenges such as different concept drifts impacting different labels at simultaneous or distinct times, higher class imbalance, or new labels emerging in the stream. This paper proposes a novel approach to multi-label data stream classification called Multi-Label Hoeffding Adaptive Tree (MLHAT). MLHAT leverages the Hoeffding adaptive tree to address these challenges by considering possible relations and label co-occurrences in the partitioning process of the decision tree, dynamically adapting the learner in each leaf node of the tree, and implementing a concept drift detector that can quickly detect and replace tree branches that are no longer performing well. The proposed approach is compared with other 18 online multi-label classifiers on 41 datasets. The results, validated with statistical analysis, show that MLHAT outperforms other state-of-the-art approaches in 12 well-known multi-label metrics.

[LG-75] SAFE setup for generative molecular design

链接: https://arxiv.org/abs/2410.20232
作者: Yassir El Mesbahi,Emmanuel Noutahi
关键词-EN: Sequential Attachment-based Fragment, Attachment-based Fragment Embedding, pivotal in drug, face challenges, challenges in fragment-constrained
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines those tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Positional Embedding proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, particularly with BRICS decomposition yielding the best results. These insights highlight key factors that significantly impact the efficacy of SAFE-based generative models.

[LG-76] Recursive Function Definitions in Static Dataflow Graphs and their Implementation in TensorFlow

链接: https://arxiv.org/abs/2410.20225
作者: Kelly Kostopoulou,Angelos Charalambidis,Panos Rondogiannis
关键词-EN: represent their computations, Modern machine learning, machine learning systems, learning systems represent, dataflow
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern machine learning systems represent their computations as dataflow graphs. The increasingly complex neural network architectures crave for more powerful yet efficient programming abstractions. In this paper we propose an efficient technique for supporting recursive function definitions in dataflow-based systems such as TensorFlow. The proposed approach transforms the given recursive definitions into a static dataflow graph that is enriched with two simple yet powerful dataflow operations. Since static graphs do not change during execution, they can be easily partitioned and executed efficiently in distributed and heterogeneous environments. The proposed technique makes heavy use of the idea of tagging, which was one of the cornerstones of dataflow systems since their inception. We demonstrate that our technique is compatible with the idea of automatic differentiation, a notion that is crucial for dataflow systems that focus on deep learning applications. We describe the principles of an actual implementation of the technique in the TensorFlow framework, and present experimental results that demonstrate that the use of tagging is of paramount importance for developing efficient high-level abstractions for modern dataflow systems.

[LG-77] Revisiting Differential Verification: Equivalence Verification with Confidence

链接: https://arxiv.org/abs/2410.20207
作者: Samuel Teuber,Philipp Kern,Marvin Janzen,Bernhard Beckert
关键词-EN: validated neural networks, neural networks, validated neural, desirable to prove, behaves equivalently
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 47 pages (main paper has 16 pages); 8 figures

点击查看摘要

Abstract:When validated neural networks (NNs) are pruned (and retrained) before deployment, it is desirable to prove that the new NN behaves equivalently to the (original) reference NN. To this end, our paper revisits the idea of differential verification which performs reasoning on differences between NNs: On the one hand, our paper proposes a novel abstract domain for differential verification admitting more efficient reasoning about equivalence. On the other hand, we investigate empirically and theoretically which equivalence properties are (not) efficiently solved using differential reasoning. Based on the gained insights, and following a recent line of work on confidence-based verification, we propose a novel equivalence property that is amenable to Differential Verification while providing guarantees for large parts of the input space instead of small-scale guarantees constructed w.r.t. predetermined input points. We implement our approach in a new tool called VeryDiff and perform an extensive evaluation on numerous old and new benchmark families, including new pruned NNs for particle jet classification in the context of CERN’s LHC where we observe median speedups 300x over the State-of-the-Art verifier alpha,beta-CROWN.

[LG-78] Copyright-Aware Incentive Scheme for Generative Art Models Using Hierarchical Reinforcement Learning

链接: https://arxiv.org/abs/2410.20180
作者: Zhuan Shi,Yifei Song,Xiaoli Tang,Lingjuan Lyu,Boi Faltings
关键词-EN: achieved remarkable performance, Generative art, generative art raises, Diffusion models, perturbing Diffusion models
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 9 pages, 9 figures

点击查看摘要

Abstract:Generative art using Diffusion models has achieved remarkable performance in image generation and text-to-image tasks. However, the increasing demand for training data in generative art raises significant concerns about copyright infringement, as models can produce images highly similar to copyrighted works. Existing solutions attempt to mitigate this by perturbing Diffusion models to reduce the likelihood of generating such images, but this often compromises model performance. Another approach focuses on economically compensating data holders for their contributions, yet it fails to address copyright loss adequately. Our approach begin with the introduction of a novel copyright metric grounded in copyright law and court precedents on infringement. We then employ the TRAK method to estimate the contribution of data holders. To accommodate the continuous data collection process, we divide the training into multiple rounds. Finally, We designed a hierarchical budget allocation method based on reinforcement learning to determine the budget for each round and the remuneration of the data holder based on the data holder’s contribution and copyright loss in each round. Extensive experiments across three datasets show that our method outperforms all eight benchmarks, demonstrating its effectiveness in optimizing budget distribution in a copyright-aware manner. To the best of our knowledge, this is the first technical work that introduces to incentive contributors and protect their copyrights by compensating them.

[LG-79] Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

链接: https://arxiv.org/abs/2410.20176
作者: Yuting Tang,Xin-Qiang Cai,Jing-Cheng Pang,Qiyu Wu,Yao-Xiang Ding,Masashi Sugiyama
关键词-EN: Reinforcement Learning, delayed reward, Composite Delayed Reward, Composite Delayed, delayed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent’s performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to effectively model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across evaluated metrics. Additionally, we demonstrate that it effectively identifies the most significant time steps within the sequence and accurately predicts rewards that closely reflect the environment feedback.

[LG-80] Alternatives of Unsupervised Representations of Variables on the Latent Space

链接: https://arxiv.org/abs/2410.20172
作者: Alex Glushkovsky
关键词-EN: unsupervised machine learning, latent space, variational autoencoder, machine learning, applying a variational
类目: Machine Learning (cs.LG)
*备注: 20 pages, 15 figures, 4 tables

点击查看摘要

Abstract:The article addresses the application of unsupervised machine learning to represent variables on the 2D latent space by applying a variational autoencoder (beta-VAE). Representation of variables on low dimensional spaces allows for data visualization, disentanglement of variables based on underlying characteristics, finding of meaningful patterns and outliers, and supports interpretability. Five distinct methods have been introduced to represent variables on the latent space: (1) straightforward transposed, (2) univariate metadata of variables, such as variable statistics, empirical probability density and cumulative distribution functions, (3) adjacency matrices of different metrics, such as correlations, R2 values, Jaccard index, cosine similarity, and mutual information, (4) gradient mappings followed by spot cross product calculation, and (5) combined. Twenty-eight approaches of variable representations by beta-VAE have been considered. The pairwise spot cross product addresses relationships of gradients of two variables along latent space axes, such as orthogonal, confounded positive, confounded negative, and everything in between. The article addresses generalized representations of variables that cover both features and labels. Dealing with categorical variables, reinforced entanglement has been introduced to represent one-hot encoded categories. The article includes three examples: (1) synthetic data with known dependencies, (2) famous MNIST example of handwritten numbers, and (3) real-world multivariate time series of Canadian financial market interest rates. As a result, unsupervised representations of interest rates on the latent space correctly disentangled rates based on their type, such as bonds, T-bills, GICs, or conventional mortgages, positioned bonds and T-bills along a single curve, and ordered rates by their terms along that curve.

[LG-81] Cyberbullying or just Sarcasm? Unmasking Coordinated Networks on Reddit

链接: https://arxiv.org/abs/2410.20170
作者: Pinky Pamecha,Chaitya Shah,Divyam Jain,Kashish Gandhi,Kiran Bhowmick,Meera Narvekar
关键词-EN: social media usage, make sarcastic comments, media usage, comments on posts, rapid growth
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:With the rapid growth of social media usage, a common trend has emerged where users often make sarcastic comments on posts. While sarcasm can sometimes be harmless, it can blur the line with cyberbullying, especially when used in negative or harmful contexts. This growing issue has been exacerbated by the anonymity and vast reach of the internet, making cyberbullying a significant concern on platforms like Reddit. Our research focuses on distinguishing cyberbullying from sarcasm, particularly where online language nuances make it difficult to discern harmful intent. This study proposes a framework using natural language processing (NLP) and machine learning to differentiate between the two, addressing the limitations of traditional sentiment analysis in detecting nuanced behaviors. By analyzing a custom dataset scraped from Reddit, we achieved a 95.15% accuracy in distinguishing harmful content from sarcasm. Our findings also reveal that teenagers and minority groups are particularly vulnerable to cyberbullying. Additionally, our research uncovers coordinated graphs of groups involved in cyberbullying, identifying common patterns in their behavior. This research contributes to improving detection capabilities for safer online communities.

[LG-82] Infectious Disease Forecasting in India using LLM s and Deep Learning

链接: https://arxiv.org/abs/2410.20168
作者: Chaitya Shah,Kashish Gandhi,Javal Shah,Kreena Shah,Nilesh Patil,Kiran Bhowmick
关键词-EN: healthcare systems worldwide, uncontrollable disease outbreaks, exposed several vulnerabilities, systems worldwide, disease outbreaks
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Many uncontrollable disease outbreaks of the past exposed several vulnerabilities in the healthcare systems worldwide. While advancements in technology assisted in the rapid creation of the vaccinations, there needs to be a pressing focus on the prevention and prediction of such massive outbreaks. Early detection and intervention of an outbreak can drastically reduce its impact on public health while also making the healthcare system more resilient. The complexity of disease transmission dynamics, influence of various directly and indirectly related factors and limitations of traditional approaches are the main bottlenecks in taking preventive actions. Specifically, this paper implements deep learning algorithms and LLM’s to predict the severity of infectious disease outbreaks. Utilizing the historic data of several diseases that have spread in India and the climatic data spanning the past decade, the insights from our research aim to assist in creating a robust predictive system for any outbreaks in the future.

[LG-83] DeepMIDE: A Multivariate Spatio-Temporal Method for Ultra-Scale Offshore Wind Energy Forecasting

链接: https://arxiv.org/abs/2410.20166
作者: Feng Ye,Xinxi Zhang,Michael Stein,Ahmed Aziz Ezzat
关键词-EN: taller wind turbines, offshore wind industry, unlock access, access to stronger, industry is advancing
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:To unlock access to stronger winds, the offshore wind industry is advancing with significantly larger and taller wind turbines. This massive upscaling motivates a departure from univariate wind forecasting methods that traditionally focused on a single representative height. To fill this gap, we propose DeepMIDE–a statistical deep learning method which jointly models the offshore wind speeds across space, time, and height. DeepMIDE is formulated as a multi-output integro-difference equation model with a multivariate, nonstationary, and state-dependent kernel characterized by a set of advection vectors that encode the physics of wind field formation and propagation. Embedded within DeepMIDE, an advanced deep learning architecture learns these advection vectors from high dimensional streams of exogenous weather information, which, along with other parameters, are plugged back into the statistical model for probabilistic multi-height space-time forecasting. Tested on real-world data from future offshore wind energy sites in the Northeastern United States, the wind speed and power forecasts from DeepMIDE are shown to outperform those from prevalent time series, spatio-temporal, and deep learning methods.

[LG-84] GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks

链接: https://arxiv.org/abs/2410.20147
作者: Ryoichi Takase,Masaya Tsunokake,Yuta Tsuchiya,Shota Inuzuka
关键词-EN: typically require, require an understanding, understanding of fundamental, Mathematical reasoning problems, fundamental laws
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mathematical reasoning problems are among the most challenging, as they typically require an understanding of fundamental laws to solve. The laws are universal, but the derivation of the final answer changes depending on how a problem is approached. When training large language models (LLMs), learning the capability of generating such multiple solutions is essential to accelerate their use in mathematical education. To this end, we train LLMs using generative flow network (GFlowNet). Different from reward-maximizing reinforcement learning (RL), GFlowNet fine-tuning seeks to find diverse solutions by training the LLM whose distribution is proportional to a reward function. In numerical experiments, we evaluate GFlowNet fine-tuning and reward-maximizing RL in terms of accuracy and diversity. The results show that GFlowNet fine-tuning derives correct final answers from diverse intermediate reasoning steps, indicating the improvement of the capability of alternative solution generation.

[LG-85] FedMABA: Towards Fair Federated Learning through Multi-Armed Bandits Allocation

链接: https://arxiv.org/abs/2410.20141
作者: Zhichao Wang,Lin Wang,Yongxin Guo,Ying-Jun Angela Zhang,Xiaoying Tang
关键词-EN: privacy-preserving collaborative paradigm, federated learning, collaborative paradigm, increasing concern, privacy has driven
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The increasing concern for data privacy has driven the rapid development of federated learning (FL), a privacy-preserving collaborative paradigm. However, the statistical heterogeneity among clients in FL results in inconsistent performance of the server model across various clients. Server model may show favoritism towards certain clients while performing poorly for others, heightening the challenge of fairness. In this paper, we reconsider the inconsistency in client performance distribution and introduce the concept of adversarial multi-armed bandit to optimize the proposed objective with explicit constraints on performance disparities. Practically, we propose a novel multi-armed bandit-based allocation FL algorithm (FedMABA) to mitigate performance unfairness among diverse clients with different data distributions. Extensive experiments, in different Non-I.I.D. scenarios, demonstrate the exceptional performance of FedMABA in enhancing fairness.

[LG-86] CodePurify: Defend Backdoor Attacks on Neural Code Models via Entropy-based Purification

链接: https://arxiv.org/abs/2410.20136
作者: Fangwen Mu,Junjie Wang,Zhuohao Yu,Lin Shi,Song Wang,Mingyang Li,Qing Wang
关键词-EN: found widespread success, backdoor attacks, victim model behavior, code, advanced backdoor attacks
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural code models have found widespread success in tasks pertaining to code intelligence, yet they are vulnerable to backdoor attacks, where an adversary can manipulate the victim model’s behavior by inserting triggers into the source code. Recent studies indicate that advanced backdoor attacks can achieve nearly 100% attack success rates on many software engineering tasks. However, effective defense techniques against such attacks remain insufficiently explored. In this study, we propose CodePurify, a novel defense against backdoor attacks on code models through entropy-based purification. Entropy-based purification involves the process of precisely detecting and eliminating the possible triggers in the source code while preserving its semantic information. Within this process, CodePurify first develops a confidence-driven entropy-based measurement to determine whether a code snippet is poisoned and, if so, locates the triggers. Subsequently, it purifies the code by substituting the triggers with benign tokens using a masked language model. We extensively evaluate CodePurify against four advanced backdoor attacks across three representative tasks and two popular code models. The results show that CodePurify significantly outperforms four commonly used defense baselines, improving average defense performance by at least 40%, 40%, and 12% across the three tasks, respectively. These findings highlight the potential of CodePurify to serve as a robust defense against backdoor attacks on neural code models.

[LG-87] Analyzing Multi-Stage Loss Curve: Plateau and Descent Mechanisms in Neural Networks

链接: https://arxiv.org/abs/2410.20119
作者: Zheng-An Chen,Tao Luo,GuiHong Wang
关键词-EN: training loss curves, secondary plateau stage, plateau stage, neural networks, reflecting the non-linearity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The multi-stage phenomenon in the training loss curves of neural networks has been widely observed, reflecting the non-linearity and complexity inherent in the training process. In this work, we investigate the training dynamics of neural networks (NNs), with particular emphasis on the small initialization regime and identify three distinct stages observed in the loss curve during training: initial plateau stage, initial descent stage, and secondary plateau stage. Through rigorous analysis, we reveal the underlying challenges causing slow training during the plateau stages. Building on existing work, we provide a more detailed proof for the initial plateau. This is followed by a comprehensive analysis of the dynamics in the descent stage. Furthermore, we explore the mechanisms that enable the network to overcome the prolonged secondary plateau stage, supported by both experimental evidence and heuristic reasoning. Finally, to better understand the relationship between global training trends and local parameter adjustments, we employ the Wasserstein distance to capture the microscopic evolution of weight amplitude distribution.

[LG-88] GeoFUSE: A High-Efficiency Surrogate Model for Seawater Intrusion Prediction and Uncertainty Reduction

链接: https://arxiv.org/abs/2410.20118
作者: Su Jiang,Chuyang Liu,Dipankar Dwivedi
关键词-EN: rising sea levels, sea levels due, coastal aquifers poses, Fourier Neural Operator, Principal Component Analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Seawater intrusion into coastal aquifers poses a significant threat to groundwater resources, especially with rising sea levels due to climate change. Accurate modeling and uncertainty quantification of this process are crucial but are often hindered by the high computational costs of traditional numerical simulations. In this work, we develop GeoFUSE, a novel deep-learning-based surrogate framework that integrates the U-Net Fourier Neural Operator (U-FNO) with Principal Component Analysis (PCA) and Ensemble Smoother with Multiple Data Assimilation (ESMDA). GeoFUSE enables fast and efficient simulation of seawater intrusion while significantly reducing uncertainty in model predictions. We apply GeoFUSE to a 2D cross-section of the Beaver Creek tidal stream-floodplain system in Washington State. Using 1,500 geological realizations, we train the U-FNO surrogate model to approximate salinity distribution and accumulation. The U-FNO model successfully reduces the computational time from hours (using PFLOTRAN simulations) to seconds, achieving a speedup of approximately 360,000 times while maintaining high accuracy. By integrating measurement data from monitoring wells, the framework significantly reduces geological uncertainty and improves the predictive accuracy of the salinity distribution over a 20-year period. Our results demonstrate that GeoFUSE improves computational efficiency and provides a robust tool for real-time uncertainty quantification and decision making in groundwater management. Future work will extend GeoFUSE to 3D models and incorporate additional factors such as sea-level rise and extreme weather events, making it applicable to a broader range of coastal and subsurface flow systems.

[LG-89] FedSSP: Federated Graph Learning with Spectral Knowledge and Personalized Preference

链接: https://arxiv.org/abs/2410.20105
作者: Zihan Tan,Guancheng Wan,Wenke Huang,Mang Ye
关键词-EN: Federated Graph Learning, Graph Neural Networks, Neural Networks, Personalized Federated Graph, accommodating personalized requirements
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Personalized Federated Graph Learning (pFGL) facilitates the decentralized training of Graph Neural Networks (GNNs) without compromising privacy while accommodating personalized requirements for non-IID participants. In cross-domain scenarios, structural heterogeneity poses significant challenges for pFGL. Nevertheless, previous pFGL methods incorrectly share non-generic knowledge globally and fail to tailor personalized solutions locally under domain structural shift. We innovatively reveal that the spectral nature of graphs can well reflect inherent domain structural shifts. Correspondingly, our method overcomes it by sharing generic spectral knowledge. Moreover, we indicate the biased message-passing schemes for graph structures and propose the personalized preference module. Combining both strategies, we propose our pFGL framework FedSSP which Shares generic Spectral knowledge while satisfying graph Preferences. Furthermore, We perform extensive experiments on cross-dataset and cross-domain settings to demonstrate the superiority of our framework. The code is available at this https URL.

[LG-90] Latent Neural Operator Pretraining for Solving Time-Dependent PDEs

链接: https://arxiv.org/abs/2410.20100
作者: Tian Wang,Chuang Wang
关键词-EN: Latent Neural Operator, gain increasing attraction, increasing attraction recently, neural operator, Neural Operator Pretraining
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Pretraining methods gain increasing attraction recently for solving PDEs with neural operators. It alleviates the data scarcity problem encountered by neural operator learning when solving single PDE via training on large-scale datasets consisting of various PDEs and utilizing shared patterns among different PDEs to improve the solution precision. In this work, we propose the Latent Neural Operator Pretraining (LNOP) framework based on the Latent Neural Operator (LNO) backbone. We achieve universal transformation through pretraining on hybrid time-dependent PDE dataset to extract representations of different physical systems and solve various time-dependent PDEs in the latent space through finetuning on single PDE dataset. Our proposed LNOP framework reduces the solution error by 31.7% on four problems and can be further improved to 57.1% after finetuning. On out-of-distribution dataset, our LNOP model achieves roughly 50% lower error and 3 \times data efficiency on average across different dataset sizes. These results show that our method is more competitive in terms of solution precision, transfer capability and data efficiency compared to non-pretrained neural operators.

[LG-91] Sample Efficient Bayesian Learning of Causal Graphs from Interventions NEURIPS24

链接: https://arxiv.org/abs/2410.20089
作者: Zihan Zhou,Muhammad Qasim Elahi,Murat Kocaoglu
关键词-EN: causal graph, Causal, science and engineering, interventional samples, fundamental problem
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: To appear in Proceedings of NeurIPS 24

点击查看摘要

Abstract:Causal discovery is a fundamental problem with applications spanning various areas in science and engineering. It is well understood that solely using observational data, one can only orient the causal graph up to its Markov equivalence class, necessitating interventional data to learn the complete causal graph. Most works in the literature design causal discovery policies with perfect interventions, i.e., they have access to infinite interventional samples. This study considers a Bayesian approach for learning causal graphs with limited interventional samples, mirroring real-world scenarios where such samples are usually costly to obtain. By leveraging the recent result of Wienöbst et al. (2023) on uniform DAG sampling in polynomial time, we can efficiently enumerate all the cut configurations and their corresponding interventional distributions of a target set, and further track their posteriors. Given any number of interventional samples, our proposed algorithm randomly intervenes on a set of target vertices that cut all the edges in the graph and returns a causal graph according to the posterior of each target set. When the number of interventional samples is large enough, we show theoretically that our proposed algorithm will return the true causal graph with high probability. We compare our algorithm against various baseline methods on simulated datasets, demonstrating its superior accuracy measured by the structural Hamming distance between the learned DAG and the ground truth. Additionally, we present a case study showing how this algorithm could be modified to answer more general causal questions without learning the whole graph. As an example, we illustrate that our method can be used to estimate the causal effect of a variable that cannot be intervened.

[LG-92] mg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography NEURIPS2024

链接: https://arxiv.org/abs/2410.20081
作者: Viswanath Sivakumar,Jeffrey Seely,Alan Du,Sean R Bittner,Adam Berenzweig,Anuoluwapo Bolarinwa,Alexandre Gramfort,Michael I Mandel
关键词-EN: detect individual spinal, Surface electromyography, non-invasively measures signals, measures signals generated, individual spinal neurons
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
*备注: Submitted to NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at this https URL.

[LG-93] CGKN: A Deep Learning Framework for Modeling Complex Dynamical Systems and Efficient Data Assimilation

链接: https://arxiv.org/abs/2410.20072
作者: Chuanqi Chen,Nan Chen,Yinling Zhang,Jin-Long Wu
关键词-EN: predict complex dynamical, deep learning models, Deep learning, complex dynamical systems, engineering areas
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Deep learning is widely used to predict complex dynamical systems in many scientific and engineering areas. However, the black-box nature of these deep learning models presents significant challenges for carrying out simultaneous data assimilation (DA), which is a crucial technique for state estimation, model identification, and reconstructing missing data. Integrating ensemble-based DA methods with nonlinear deep learning models is computationally expensive and may suffer from large sampling errors. To address these challenges, we introduce a deep learning framework designed to simultaneously provide accurate forecasts and efficient DA. It is named Conditional Gaussian Koopman Network (CGKN), which transforms general nonlinear systems into nonlinear neural differential equations with conditional Gaussian structures. CGKN aims to retain essential nonlinear components while applying systematic and minimal simplifications to facilitate the development of analytic formulae for nonlinear DA. This allows for seamless integration of DA performance into the deep learning training process, eliminating the need for empirical tuning as required in ensemble methods. CGKN compensates for structural simplifications by lifting the dimension of the system, which is motivated by Koopman theory. Nevertheless, CGKN exploits special nonlinear dynamics within the lifted space. This enables the model to capture extreme events and strong non-Gaussian features in joint and marginal distributions with appropriate uncertainty quantification. We demonstrate the effectiveness of CGKN for both prediction and DA on three strongly nonlinear and non-Gaussian turbulent systems: the projected stochastic Burgers–Sivashinsky equation, the Lorenz 96 system, and the El Niño-Southern Oscillation. The results justify the robustness and computational efficiency of CGKN.

[LG-94] Understanding the Effect of GCN Convolutions in Regression Tasks

链接: https://arxiv.org/abs/2410.20068
作者: Juntong Chen,Johannes Schmidt-Hieber,Claire Donnat,Olga Klopp
关键词-EN: pivotal method, method in machine, Graph Convolutional Networks, Convolutional Networks, modeling functions
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 31 pages

点击查看摘要

Abstract:Graph Convolutional Networks (GCNs) have become a pivotal method in machine learning for modeling functions over graphs. Despite their widespread success across various applications, their statistical properties (e.g. consistency, convergence rates) remain ill-characterized. To begin addressing this knowledge gap, in this paper, we provide a formal analysis of the impact of convolution operators on regression tasks over homophilic networks. Focusing on estimators based solely on neighborhood aggregation, we examine how two common convolutions - the original GCN and GraphSage convolutions - affect the learning error as a function of the neighborhood topology and the number of convolutional layers. We explicitly characterize the bias-variance trade-off incurred by GCNs as a function of the neighborhood size and identify specific graph topologies where convolution operators are less effective. Our theoretical findings are corroborated by synthetic experiments, and provide a start to a deeper quantitative understanding of convolutional effects in GCNs for offering rigorous guidelines for practitioners.

[LG-95] Deep Concept Identification for Generative Design

链接: https://arxiv.org/abs/2410.20061
作者: Ryo Tsumoto,Kentaro Yaji,Yutaka Nomaguchi,Kikuo Fujita
关键词-EN: high design degree, topology optimization, alternatives, generative design, concept identification
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A generative design based on topology optimization provides diverse alternatives as entities in a computational model with a high design degree. However, as the diversity of the generated alternatives increases, the cognitive burden on designers to select the most appropriate alternatives also increases. Whereas the concept identification approach, which finds various categories of entities, is an effective means to structure alternatives, evaluation of their similarities is challenging due to shape diversity. To address this challenge, this study proposes a concept identification framework for generative design using deep learning (DL) techniques. One of the key abilities of DL is the automatic learning of different representations of a specific task. Deep concept identification finds various categories that provide insights into the mapping relationships between geometric properties and structural performance through representation learning using DL. The proposed framework generates diverse alternatives using a generative design technique, clusters the alternatives into several categories using a DL technique, and arranges these categories for design practice using a classification model. This study demonstrates its fundamental capabilities by implementing variational deep embedding, a generative and clustering model based on the DL paradigm, and logistic regression as a classification model. A simplified design problem of a two-dimensional bridge structure is applied as a case study to validate the proposed framework. Although designers are required to determine the viewing aspect level by setting the number of concepts, this implementation presents the identified concepts and their relationships in the form of a decision tree based on a specified level.

[LG-96] Mechanism learning: Reverse causal inference in the presence of multiple unknown confounding through front-door causal bootstrapping

链接: https://arxiv.org/abs/2410.20057
作者: Jianqiao Mao,Max A. Little
关键词-EN: recover associational, major limitation, limitation of machine, learn predictive relationships, predictive relationships
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:A major limitation of machine learning (ML) prediction models is that they recover associational, rather than causal, predictive relationships between variables. In high-stakes automation applications of ML this is problematic, as the model often learns spurious, non-causal associations. This paper proposes mechanism learning, a simple method which uses front-door causal bootstrapping to deconfound observational data such that any appropriate ML model is forced to learn predictive relationships between effects and their causes (reverse causal inference), despite the potential presence of multiple unknown and unmeasured confounding. Effect variables can be very high dimensional, and the predictive relationship nonlinear, as is common in ML applications. This novel method is widely applicable, the only requirement is the existence of a mechanism variable mediating the cause (prediction target) and effect (feature data), which is independent of the (unmeasured) confounding variables. We test our method on fully synthetic, semi-synthetic and real-world datasets, demonstrating that it can discover reliable, unbiased, causal ML predictors where by contrast, the same ML predictor trained naively using classical supervised learning on the original observational data, is heavily biased by spurious associations. We provide code to implement the results in the paper, online.

[LG-97] Annotation Efficiency: Identifying Hard Samples via Blocked Sparse Linear Bandits

链接: https://arxiv.org/abs/2410.20041
作者: Adit Jain,Soumyabrata Pal,Sunav Choudhary,Ramasuri Narayanam,Vikram Krishnamurthy
关键词-EN: frac, label-scarce setting, mathsf, annotation, annotating datapoints
类目: Machine Learning (cs.LG)
*备注: 31 Pages

点击查看摘要

Abstract:This paper considers the problem of annotating datapoints using an expert with only a few annotation rounds in a label-scarce setting. We propose soliciting reliable feedback on difficulty in annotating a datapoint from the expert in addition to ground truth label. Existing literature in active learning or coreset selection turns out to be less relevant to our setting since they presume the existence of a reliable trained model, which is absent in the label-scarce regime. However, the literature on coreset selection emphasizes the presence of difficult data points in the training set to perform supervised learning in downstream tasks (Mindermann et al., 2022). Therefore, for a given fixed annotation budget of \mathsfT rounds, we model the sequential decision-making problem of which (difficult) datapoints to choose for annotation in a sparse linear bandits framework with the constraint that no arm can be pulled more than once (blocking constraint). With mild assumptions on the datapoints, our (computationally efficient) Explore-Then-Commit algorithm BSLB achieves a regret guarantee of \widetilde\mathsfO(k^\frac13 \mathsfT^\frac23 +k^-\frac12 \beta_k + k^-\frac112 \beta_k^\frac12\mathsfT^\frac56) where the unknown parameter vector has tail magnitude \beta_k at sparsity level k . To this end, we show offline statistical guarantees of Lasso estimator with mild Restricted Eigenvalue (RE) condition that is also robust to sparsity. Finally, we propose a meta-algorithm C-BSLB that does not need knowledge of the optimal sparsity parameters at a no-regret cost. We demonstrate the efficacy of our BSLB algorithm for annotation in the label-scarce setting for an image classification task on the PASCAL-VOC dataset, where we use real-world annotation difficulty scores.

[LG-98] Revisiting PlayeRank

链接: https://arxiv.org/abs/2410.20038
作者: Louise Schmidt,Cristian Lillo,Javier Bustos
关键词-EN: Linear Support Vector, evaluated by Pappalardo, performance score called, Support Vector Machine, designed and evaluated
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article we revise the football’s performance score called PlayeRank, designed and evaluated by Pappalardo et al.\ in 2019. First, we analyze the weights extracted from the Linear Support Vector Machine (SVM) that solves the classification problem of “which set of events has a higher impact on the chances of winning a match”. Here, we notice that the previously published results include the Goal-Scored event during the training phase, which produces inconsistencies. We fix these inconsistencies, and show new weights capable of solving the same problem. Following the intuition that the best team should always win a match, we define the team’s quality as the average number of players involved in the game. We show that, using the original PlayeRank, in 94.13% of the matches either the superior team beats the inferior team or the teams end tied if the scores are similar. Finally, we present a way to use PlayeRank in an online fashion using modified free analysis tools. Calculating this modified version of PlayeRank, we performed an online analysis of a real football match every five minutes of game. Here, we evaluate the usefulness of that information with experts and managers, and conclude that the obtained data indeed provides useful information that was not previously available to the manager during the match. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2410.20038 [cs.LG] (or arXiv:2410.20038v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.20038 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-99] On-Robot Reinforcement Learning with Goal-Contrastive Rewards

链接: https://arxiv.org/abs/2410.19989
作者: Ondrej Biza,Thomas Weng,Lingfeng Sun,Karl Schmeckpeper,Tarik Kelestemur,Yecheng Jason Ma,Robert Platt,Jan-Willem van de Meent,Lawson L. S. Wong
关键词-EN: Reinforcement Learning, potential to enable, Reinforcement, reward, Learning
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive, in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose GCR (Goal-Contrastive Rewards), a dense reward function learning method that can be trained on passive video demonstrations. By using videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions, an implicit value loss function that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to a more-sample efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task. Appendix: \urlthis https URL.

[LG-100] Residual Random Neural Networks

链接: https://arxiv.org/abs/2410.19987
作者: M. Andrecut
关键词-EN: single-layer feedforward neural, neural networks literature, feedforward neural network, single-layer feedforward, recurring motif
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:The single-layer feedforward neural network with random weights is a recurring motif in the neural networks literature. The advantage of these networks is their simplified training, which reduces to solving a ridge-regression problem. However, a general assumption is that these networks require a large number of hidden neurons relative to the dimensionality of the data samples, in order to achieve good classification accuracy. Contrary to this assumption, here we show that one can obtain good classification results even if the number of hidden neurons has the same order of magnitude as the dimensionality of the data samples, if this dimensionality is reasonably high. We also develop an efficient iterative residual training method for such random neural networks, which significantly improves their classification accuracy. Moreover, we also describe an encryption (obfuscation) method which can be used to protect both the data and the neural network model.

[LG-101] Resolving Domain Shift For Representations Of Speech In Non-Invasive Brain Recordings ICLR2025

链接: https://arxiv.org/abs/2410.19986
作者: Jeremiah Ridge,Oiwi Parker Jones
关键词-EN: amazing recent successes, recent successes achieved, brain activity, invasive devices, enabled researchers
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
*备注: Submitted to ICLR 2025

点击查看摘要

Abstract:Machine learning techniques have enabled researchers to leverage neuroimaging data to decode speech from brain activity, with some amazing recent successes achieved by applications built using invasive devices. However, research requiring surgical implants has a number of practical limitations. Non-invasive neuroimaging techniques provide an alternative but come with their own set of challenges, the limited scale of individual studies being among them. Without the ability to pool the recordings from different non-invasive studies, data on the order of magnitude needed to leverage deep learning techniques to their full potential remains out of reach. In this work, we focus on non-invasive data collected using magnetoencephalography (MEG). We leverage two different, leading speech decoding models to investigate how an adversarial domain adaptation framework augments their ability to generalize across datasets. We successfully improve the performance of both models when training across multiple datasets. To the best of our knowledge, this study is the first ever application of feature-level, deep learning based harmonization for MEG neuroimaging data. Our analysis additionally offers further evidence of the impact of demographic features on neuroimaging data, demonstrating that participant age strongly affects how machine learning models solve speech decoding tasks using MEG data. Lastly, in the course of this study we produce a new open-source implementation of one of these models to the benefit of the broader scientific community.

[LG-102] Global Graph Counterfactual Explanation: A Subgraph Mapping Approach

链接: https://arxiv.org/abs/2410.19978
作者: Yinhan He,Wendy Zheng,Yaochen Zhu,Jing Ma,Saumitra Mishra,Natraj Raman,Ninghao Liu,Jundong Li
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, real-world applications, widely deployed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have been widely deployed in various real-world applications. However, most GNNs are black-box models that lack explanations. One strategy to explain GNNs is through counterfactual explanation, which aims to find minimum perturbations on input graphs that change the GNN predictions. Existing works on GNN counterfactual explanations primarily concentrate on the local-level perspective (i.e., generating counterfactuals for each individual graph), which suffers from information overload and lacks insights into the broader cross-graph relationships. To address such issues, we propose GlobalGCE, a novel global-level graph counterfactual explanation method. GlobalGCE aims to identify a collection of subgraph mapping rules as counterfactual explanations for the target GNN. According to these rules, substituting certain significant subgraphs with their counterfactual subgraphs will change the GNN prediction to the desired class for most graphs (i.e., maximum coverage). Methodologically, we design a significant subgraph generator and a counterfactual subgraph autoencoder in our GlobalGCE, where the subgraphs and the rules can be effectively generated. Extensive experiments demonstrate the superiority of our GlobalGCE compared to existing baselines. Our code can be found at this https URL.

[LG-103] Privacy without Noisy Gradients: Slicing Mechanism for Generative Model Training NEURIPS2024

链接: https://arxiv.org/abs/2410.19941
作者: Kristjan Greenewald,Yuancheng Yu,Hao Wang,Kai Xu
关键词-EN: typically involves injecting, discriminator training procedure, involves injecting noise, typically involves, involves injecting
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: accepted to Neurips 2024

点击查看摘要

Abstract:Training generative models with differential privacy (DP) typically involves injecting noise into gradient updates or adapting the discriminator’s training procedure. As a result, such approaches often struggle with hyper-parameter tuning and convergence. We consider the slicing privacy mechanism that injects noise into random low-dimensional projections of the private data, and provide strong privacy guarantees for it. These noisy projections are used for training generative models. To enable optimizing generative models using this DP approach, we introduce the smoothed-sliced f -divergence and show it enjoys statistical consistency. Moreover, we present a kernel-based estimator for this divergence, circumventing the need for adversarial training. Extensive numerical experiments demonstrate that our approach can generate synthetic data of higher quality compared with baselines. Beyond performance improvement, our method, by sidestepping the need for noisy gradients, offers data scientists the flexibility to adjust generator architecture and hyper-parameters, run the optimization over any number of epochs, and even restart the optimization process – all without incurring additional privacy costs.

[LG-104] Provable optimal transport with transformers: The essence of depth and prompt engineering

链接: https://arxiv.org/abs/2410.19931
作者: Hadi Daneshmand
关键词-EN: provable performance guarantees, establish provable performance, provable performance, optimal transport, performance guarantees
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Can we establish provable performance guarantees for transformers? Establishing such theoretical guarantees is a milestone in developing trustworthy generative AI. In this paper, we take a step toward addressing this question by focusing on optimal transport, a fundamental problem at the intersection of combinatorial and continuous optimization. Leveraging the computational power of attention layers, we prove that a transformer with fixed parameters can effectively solve the optimal transport problem in Wasserstein-2 with entropic regularization for an arbitrary number of points. Consequently, the transformer can sort lists of arbitrary sizes up to an approximation factor. Our results rely on an engineered prompt that enables the transformer to implement gradient descent with adaptive stepsizes on the dual optimal transport. Combining the convergence analysis of gradient descent with Sinkhorn dynamics, we establish an explicit approximation bound for optimal transport with transformers, which improves as depth increases. Our findings provide novel insights into the essence of prompt engineering and depth for solving optimal transport. In particular, prompt engineering boosts the algorithmic expressivity of transformers, allowing them implement an optimization method. With increasing depth, transformers can simulate several iterations of gradient descent.

[LG-105] Prediction of Final Phosphorus Content of Steel in a Scrap-Based Electric Arc Furnace Using Artificial Neural Networks

链接: https://arxiv.org/abs/2410.19924
作者: Riadh Azzaz,Valentin Hurel,Patrice Menard,Mohammad Jahazi,Samira Ebrahimi Kahou,Elmira Moosavi-Khoonsari
关键词-EN: scrap-based electric arc, electric arc furnace, reducing environmental impacts, arc furnace process, scrap-based electric
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 53 pages, 8 figures

点击查看摘要

Abstract:The scrap-based electric arc furnace process is expected to capture a significant share of the steel market in the future due to its potential for reducing environmental impacts through steel recycling. However, managing impurities, particularly phosphorus, remains a challenge. This study aims to develop a machine learning model to estimate the steel phosphorus content at the end of the process based on input parameters. Data were collected over two years from a steel plant, focusing on the chemical composition and weight of the scrap, the volume of oxygen injected, and process duration. After preprocessing the data, several machine learning models were evaluated, with the artificial neural network (ANN) emerging as the most effective. The best ANN model included four hidden layers. The model was trained for 500 epochs with a batch size of 50. The best model achieves a mean square error (MSE) of 0.000016, a root-mean-square error (RMSE) of 0.0049998, a coefficient of determination (R2) of 99.96%, and a correlation coefficient ® of 99.98%. Notably, the model achieved a 100% hit rate for predicting phosphorus content within ±0.001 wt% (±10 ppm). These results demonstrate that the optimized ANN model offers accurate predictions for the steel final phosphorus content.

[LG-106] Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

链接: https://arxiv.org/abs/2410.19920
作者: Mohamed Salim Aissi,Clement Romac,Thomas Carta,Sylvain Lamprier,Pierre-Yves Oudeyer,Olivier Sigaud,Laure Soulier,Nicolas Thome
关键词-EN: sequential decision-making tasks, aligning large language, Reinforcement learning, large language models, knowledge with sequential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine-tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.

[LG-107] Collaborative Inference over Wireless Channels with Feature Differential Privacy

链接: https://arxiv.org/abs/2410.19917
作者: Mohamed Seif,Yuqi Nie,Andrea J. Goldsmith,H. Vincent Poor
关键词-EN: enhance Artificial Intelligence, significantly enhance Artificial, Artificial Intelligence, enhance Artificial, multiple wireless edge
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: This work is under review for possible IEEE publication. arXiv admin note: substantial text overlap with arXiv:2406.00256

点击查看摘要

Abstract:Collaborative inference among multiple wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for sensing and computer vision. This approach typically involves a three-stage process: a) data acquisition through sensing, b) feature extraction, and c) feature encoding for transmission. However, transmitting the extracted features poses a significant privacy risk, as sensitive personal data can be exposed during the process. To address this challenge, we propose a novel privacy-preserving collaborative inference mechanism, wherein each edge device in the network secures the privacy of extracted features before transmitting them to a central server for inference. Our approach is designed to achieve two primary objectives: 1) reducing communication overhead and 2) ensuring strict privacy guarantees during feature transmission, while maintaining effective inference performance. Additionally, we introduce an over-the-air pooling scheme specifically designed for classification tasks, which provides formal guarantees on the privacy of transmitted features and establishes a lower bound on classification accuracy.

[LG-108] Air Quality Prediction with Physics-Informed Dual Neural ODEs in Open Systems

链接: https://arxiv.org/abs/2410.19892
作者: Jindong Tian,Yuxuan Liang,Ronghui Xu,Peng Chen,Chenjuan Guo,Aoying Zhou,Lujia Pan,Zhongwen Rao,Bin Yang
关键词-EN: inform public policy, pollution significantly threatens, significantly threatens human, threatens human health, necessitating effective air
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Air pollution significantly threatens human health and ecosystems, necessitating effective air quality prediction to inform public policy. Traditional approaches are generally categorized into physics-based and data-driven models. Physics-based models usually struggle with high computational demands and closed-system assumptions, while data-driven models may overlook essential physical dynamics, confusing the capturing of spatiotemporal correlations. Although some physics-informed approaches combine the strengths of both models, they often face a mismatch between explicit physical equations and implicit learned representations. To address these challenges, we propose Air-DualODE, a novel physics-informed approach that integrates dual branches of Neural ODEs for air quality prediction. The first branch applies open-system physical equations to capture spatiotemporal dependencies for learning physics dynamics, while the second branch identifies the dependencies not addressed by the first in a fully data-driven way. These dual representations are temporally aligned and fused to enhance prediction accuracy. Our experimental results demonstrate that Air-DualODE achieves state-of-the-art performance in predicting pollutant concentrations across various spatial scales, thereby offering a promising solution for real-world air quality challenges.

[LG-109] EnergyPlus Room Simulator

链接: https://arxiv.org/abs/2410.19888
作者: Manuel Weber,Philipp Bogdain,Sophia Viktoria Weißenberger,Diana Marjanovic,Katharina Sammet,Jan Vellmer,Farzan Banihashemi,Peter Mandl
关键词-EN: Research towards energy, buildings heavily relies, measured indoor climate, energy optimization, heavily relies
类目: Machine Learning (cs.LG)
*备注: Presented at BuildSim Nordic 2024. The conference was held from June 9 to 11, 2024, in Espoo, Finland

点击查看摘要

Abstract:Research towards energy optimization in buildings heavily relies on building-related data such as measured indoor climate factors. While data collection is a labor- and cost-intensive task, simulations are a cheap alternative to generate datasets of arbitrary sizes, particularly useful for data-intensive deep learning methods. In this paper, we present the tool EnergyPlus Room Simulator, which enables the simulation of indoor climate in a specific room of a building using the simulation software EnergyPlus. It allows to alter room models and simulate various factors such as temperature, humidity, and CO2 concentration. In contrast to manually working with EnergyPlus, this tool enhances the simulation process by offering a convenient interface, including a user-friendly graphical user interface (GUI) as well as a REST API. The tool is intended to support scientific, building-related tasks such as occupancy detection on a room level by facilitating fast access to simulation data that may, for instance, be used for pre-training machine learning models.

[LG-110] Recommendations for Comprehensive and Independent Evaluation of Machine Learning-Based Earth System Models

链接: https://arxiv.org/abs/2410.19882
作者: Paul A. Ullrich,Elizabeth A. Barnes,William D. Collins,Katherine Dagon,Shiheng Duan,Joshua Elms,Jiwoo Lee,L. Ruby Leung,Dan Lu,Maria J. Molina,Travis A. O’Brien
关键词-EN: Machine learning, Earth system, multiple disciplines, coupled Earth system, Earth
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Machine learning (ML) is a revolutionary technology with demonstrable applications across multiple disciplines. Within the Earth science community, ML has been most visible for weather forecasting, producing forecasts that rival modern physics-based models. Given the importance of deepening our understanding and improving predictions of the Earth system on all time scales, efforts are now underway to develop forecasting models into Earth-system models (ESMs), capable of representing all components of the coupled Earth system (or their aggregated behavior) and their response to external changes. Modeling the Earth system is a much more difficult problem than weather forecasting, not least because the model must represent the alternate (e.g., future) coupled states of the system for which there are no historical observations. Given that the physical principles that enable predictions about the response of the Earth system are often not explicitly coded in these ML-based models, demonstrating the credibility of ML-based ESMs thus requires us to build evidence of their consistency with the physical system. To this end, this paper puts forward five recommendations to enhance comprehensive, standardized, and independent evaluation of ML-based ESMs to strengthen their credibility and promote their wider use.

[LG-111] Causal Order Discovery based on Monotonic SCMs NEURIPS2024

链接: https://arxiv.org/abs/2410.19870
作者: Ali Izadi,Martin Ester
关键词-EN: Structural Causal Models, monotonic Structural Causal, causal order discovery, enable causal inference, Triangular Monotonic Increasing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to the NeurIPS 2024 Workshop on Causal Representation Learning

点击查看摘要

Abstract:In this paper, we consider the problem of causal order discovery within the framework of monotonic Structural Causal Models (SCMs), which have gained attention for their potential to enable causal inference and causal discovery from observational data. While existing approaches either assume prior knowledge about the causal order or use complex optimization techniques to impose sparsity in the Jacobian of Triangular Monotonic Increasing maps, our work introduces a novel sequential procedure that directly identifies the causal order by iteratively detecting the root variable. This method eliminates the need for sparsity assumptions and the associated optimization challenges, enabling the identification of a unique SCM without the need for multiple independence tests to break the Markov equivalence class. We demonstrate the effectiveness of our approach in sequentially finding the root variable, comparing it to methods that maximize Jacobian sparsity.

[LG-112] Hypergraph Neural Networks Reveal Spatial Domains from Single-cell Transcriptomics Data

链接: https://arxiv.org/abs/2410.19868
作者: Mehrad Soltani,Luis Rueda
关键词-EN: Graph Neural Networks, Hypergraph Neural Networks, paramount importance, Neural Networks, transcriptomics data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The task of spatial clustering of transcriptomics data is of paramount importance. It enables the classification of tissue samples into diverse subpopulations of cells, which, in turn, facilitates the analysis of the biological functions of clusters, tissue reconstruction, and cell-cell interactions. Many approaches leverage gene expressions, spatial locations, and histological images to detect spatial domains; however, Graph Neural Networks (GNNs) as state of the art models suffer from a limitation in the assumption of pairwise connections between nodes. In the case of domain detection in spatial transcriptomics, some cells are found to be not directly related. Still, they are grouped as the same domain, which shows the incapability of GNNs for capturing implicit connections among the cells. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes along their edges, which lets Hypergraph Neural Networks (HGNNs) capture and utilize richer and more complex structural information than traditional GNNs. We use autoencoders to address the limitation of not having the actual labels, which are well-suited for unsupervised learning. Our model has demonstrated exceptional performance, achieving the highest iLISI score of 1.843 compared to other methods. This score indicates the greatest diversity of cell types identified by our method. Furthermore, our model outperforms other methods in downstream clustering, achieving the highest ARI values of 0.51 and Leiden score of 0.60. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2410.19868 [cs.LG] (or arXiv:2410.19868v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.19868 Focus to learn more arXiv-issued DOI via DataCite

[LG-113] Simultaneous Dimensionality Reduction for Extracting Useful Representations of Large Empirical Multimodal Datasets

链接: https://arxiv.org/abs/2410.19867
作者: Eslam Abdelaleem
关键词-EN: concise mathematical representations, Deep Variational Multivariate, Multivariate Information Bottleneck, quest for simplification, drives the exploration
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: PhD Dissertation, available at Emory EDT @ this https URL

点击查看摘要

Abstract:The quest for simplification in physics drives the exploration of concise mathematical representations for complex systems. This Dissertation focuses on the concept of dimensionality reduction as a means to obtain low-dimensional descriptions from high-dimensional data, facilitating comprehension and analysis. We address the challenges posed by real-world data that defy conventional assumptions, such as complex interactions within neural systems or high-dimensional dynamical systems. Leveraging insights from both theoretical physics and machine learning, this work unifies diverse reduction methods under a comprehensive framework, the Deep Variational Multivariate Information Bottleneck. This framework enables the design of tailored reduction algorithms based on specific research questions. We explore and assert the efficacy of simultaneous reduction approaches over their independent reduction counterparts, demonstrating their superiority in capturing covariation between multiple modalities, while requiring less data. We also introduced novel techniques, such as the Deep Variational Symmetric Information Bottleneck, for general nonlinear simultaneous reduction. We show that the same principle of simultaneous reduction is the key to efficient estimation of mutual information. We show that our new method is able to discover the coordinates of high-dimensional observations of dynamical systems. Through analytical investigations and empirical validations, we shed light on the intricacies of dimensionality reduction methods, paving the way for enhanced data analysis across various domains. We underscore the potential of these methodologies to extract meaningful insights from complex datasets, driving advancements in fundamental research and applied sciences. As these methods evolve, they promise to deepen our understanding of complex systems and inform more effective data analysis strategies.

[LG-114] Enhancing Deep Learning based RMT Data Inversion using Gaussian Random Field

链接: https://arxiv.org/abs/2410.19858
作者: Koustav Ghosal,Arun Singh,Samir Malakar,Shalivahan Srivastava,Deepak Gupta
关键词-EN: Deep learning, powerful tool, Gaussian Random Fields, Radio Magnetotelluric data, GRF dataset
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Deep learning (DL) methods have emerged as a powerful tool for the inversion of geophysical data. When applied to field data, these models often struggle without additional fine-tuning of the network. This is because they are built on the assumption that the statistical patterns in the training and test datasets are the same. To address this, we propose a DL-based inversion scheme for Radio Magnetotelluric data where the subsurface resistivity models are generated using Gaussian Random Fields (GRF). The network’s generalization ability was tested with an out-of-distribution (OOD) dataset comprising a homogeneous background and various rectangular-shaped anomalous bodies. After end-to-end training with the GRF dataset, the pre-trained network successfully identified anomalies in the OOD dataset. Synthetic experiments confirmed that the GRF dataset enhances generalization compared to a homogeneous background OOD dataset. The network accurately recovered structures in a checkerboard resistivity model, and demonstrated robustness to noise, outperforming traditional gradient-based methods. Finally, the developed scheme is tested using exemplary field data from a waste site near Roorkee, India. The proposed scheme enhances generalization in a data-driven supervised learning framework, suggesting a promising direction for OOD generalization in DL methods.

[LG-115] Prototype-Based Methods in Explainable AI and Emerging Opportunities in the Geosciences ICML

链接: https://arxiv.org/abs/2410.19856
作者: Anushka Narayanan,Karianne J. Bergen
关键词-EN: intrinsically interpretable XAI, prototype-based XAI, interpretable XAI methods, comparing input data, learning tasks
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Accepted at AI for Science Workshop-Oral (Attention Track), Proceedings of 41st International Conference on Machine Learning (ICML) 2024

点击查看摘要

Abstract:Prototype-based methods are intrinsically interpretable XAI methods that produce predictions and explanations by comparing input data with a set of learned prototypical examples that are representative of the training data. In this work, we discuss a series of developments in the field of prototype-based XAI that show potential for scientific learning tasks, with a focus on the geosciences. We organize the prototype-based XAI literature into three themes: the development and visualization of prototypes, types of prototypes, and the use of prototypes in various learning tasks. We discuss how the authors use prototype-based methods, their novel contributions, and any limitations or challenges that may arise when adapting these methods for geoscientific learning tasks. We highlight differences between geoscientific data sets and the standard benchmarks used to develop XAI methods, and discuss how specific geoscientific applications may benefit from using or modifying existing prototype-based XAI techniques.

[LG-116] Deep Learning and Machine Learning – Python Data Structures and Mathematics Fundamental: From Theory to Practice

链接: https://arxiv.org/abs/2410.19849
作者: Silin Chen,Ziqian Bi,Junyu Liu,Benji Peng,Sen Zhang,Xuanhe Pan,Jiawei Xu,Jinlang Wang,Keyu Chen,Caitlyn Heqi Yin,Pohsun Feng,Yizhu Wen,Tianyang Wang,Ming Li,Jintao Ren,Qian Niu,Ming Liu
关键词-EN: machine learning, deep learning, comprehensive introduction, foundational concepts, concepts of machine
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Programming Languages (cs.PL)
*备注: 298 pages

点击查看摘要

Abstract:This book provides a comprehensive introduction to the foundational concepts of machine learning (ML) and deep learning (DL). It bridges the gap between theoretical mathematics and practical application, focusing on Python as the primary programming language for implementing key algorithms and data structures. The book covers a wide range of topics, including basic and advanced Python programming, fundamental mathematical operations, matrix operations, linear algebra, and optimization techniques crucial for training ML and DL models. Advanced subjects like neural networks, optimization algorithms, and frequency domain methods are also explored, along with real-world applications of large language models (LLMs) and artificial intelligence (AI) in big data management. Designed for both beginners and advanced learners, the book emphasizes the critical role of mathematical principles in developing scalable AI solutions. Practical examples and Python code are provided throughout, ensuring readers gain hands-on experience in applying theoretical knowledge to solve complex problems in ML, DL, and big data analytics.

[LG-117] Artificial intelligence for partial differential equations in computational mechanics: A review

链接: https://arxiv.org/abs/2410.19843
作者: Yizheng Wang,Jinshuai Bai,Zhongya Lin,Qimin Wang,Cosmin Anitescu,Jia Sun,Mohammad Sadegh Eshaghi,Yuantong Gu,Xi-Qiao Feng,Xiaoying Zhuang,Timon Rabczuk,Yinghua Liu
关键词-EN: Artificial intelligence, integrating artificial intelligence, attracted widespread attention, partial differential equations, artificial intelligence algorithms
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, Artificial intelligence (AI) has become ubiquitous, empowering various fields, especially integrating artificial intelligence and traditional science (AI for Science: Artificial intelligence for science), which has attracted widespread attention. In AI for Science, using artificial intelligence algorithms to solve partial differential equations (AI for PDEs: Artificial intelligence for partial differential equations) has become a focal point in computational mechanics. The core of AI for PDEs is the fusion of data and partial differential equations (PDEs), which can solve almost any PDEs. Therefore, this article provides a comprehensive review of the research on AI for PDEs, summarizing the existing algorithms and theories. The article discusses the applications of AI for PDEs in computational mechanics, including solid mechanics, fluid mechanics, and biomechanics. The existing AI for PDEs algorithms include those based on Physics-Informed Neural Networks (PINNs), Deep Energy Methods (DEM), Operator Learning, and Physics-Informed Neural Operator (PINO). AI for PDEs represents a new method of scientific simulation that provides approximate solutions to specific problems using large amounts of data, then fine-tuning according to specific physical laws, avoiding the need to compute from scratch like traditional algorithms. Thus, AI for PDEs is the prototype for future foundation models in computational mechanics, capable of significantly accelerating traditional numerical algorithms.

[LG-118] Novel Development of LLM Driven mCODE Data Model for Improved Clinical Trial Matching to Enable Standardization and Interoperability in Oncology Research

链接: https://arxiv.org/abs/2410.19826
作者: Aarsh Shekhar,Mincheol Kim
关键词-EN: costs nationally reaching, cancer costs nationally, cancer care contributes, costs nationally, effective diagnosis
类目: Machine Learning (cs.LG)
*备注: 18 pages, 13 figures, accessible and published at: The Young Researcher Fall 2024 Volume 8, Number 2(Special Edition in Collaboration with Harvard Undergraduate Openbio Laboratory); Pages 28-45

点击查看摘要

Abstract:Each year, the lack of efficient data standardization and interoperability in cancer care contributes to the severe lack of timely and effective diagnosis, while constantly adding to the burden of cost, with cancer costs nationally reaching over 208 billion in 2023 alone. Traditional methods regarding clinical trial enrollment and clinical care in oncology are often manual, time-consuming, and lack a data-driven approach. This paper presents a novel framework to streamline standardization, interoperability, and exchange of cancer domains and enhance the integration of oncology-based EHRs across disparate healthcare systems. This paper utilizes advanced LLMs and Computer Engineering to streamline cancer clinical trials and discovery. By utilizing FHIR’s resource-based approach and LLM-generated mCODE profiles, we ensure timely, accurate, and efficient sharing of patient information across disparate healthcare systems. Our methodology involves transforming unstructured patient treatment data, PDFs, free-text information, and progress notes into enriched mCODE profiles, facilitating seamless integration with our novel AI and ML-based clinical trial matching engine. The results of this study show a significant improvement in data standardization, with accuracy rates of our trained LLM peaking at over 92% with datasets consisting of thousands of patient data. Additionally, our LLM demonstrated an accuracy rate of 87% for SNOMED-CT, 90% for LOINC, and 84% for RxNorm codes. This trumps the current status quo, with LLMs such as GPT-4 and Claude’s 3.5 peaking at an average of 77%. This paper successfully underscores the potential of our standardization and interoperability framework, paving the way for more efficient and personalized cancer treatment.

[LG-119] Substance Beats Style: Why Beginning Students Fail to Code with LLM s

链接: https://arxiv.org/abs/2410.19792
作者: Francesca Lucchetti,Zixuan Wu,Arjun Guha,Molly Q Feldman,Carolyn Jane Anderson
关键词-EN: existing work shows, professional programmers, existing work, increasing the productivity, productivity of professional
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.

[LG-120] A Human-Centered Approach for Improving Supervised Learning

链接: https://arxiv.org/abs/2410.19778
作者: Shubhi Bansal,Atharva Tendulkar,Nagendra Kumar
关键词-EN: Artificial Intelligence systems, developing Artificial Intelligence, Artificial Intelligence, Supervised Learning, labeled data inputs
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised Learning is a way of developing Artificial Intelligence systems in which a computer algorithm is trained on labeled data inputs. Effectiveness of a Supervised Learning algorithm is determined by its performance on a given dataset for a particular problem. In case of Supervised Learning problems, Stacking Ensembles usually perform better than individual classifiers due to their generalization ability. Stacking Ensembles combine predictions from multiple Machine Learning algorithms to make final predictions. Inspite of Stacking Ensembles superior performance, the overhead of Stacking Ensembles such as high cost, resources, time, and lack of explainability create challenges in real-life applications. This paper shows how we can strike a balance between performance, time, and resource constraints. Another goal of this research is to make Ensembles more explainable and intelligible using the Human-Centered approach. To achieve the aforementioned goals, we proposed a Human-Centered Behavior-inspired algorithm that streamlines the Ensemble Learning process while also reducing time, cost, and resource overhead, resulting in the superior performance of Supervised Learning in real-world applications. To demonstrate the effectiveness of our method, we perform our experiments on nine real-world datasets. Experimental results reveal that the proposed method satisfies our goals and outperforms the existing methods.

[LG-121] Learning Robust Representations for Communications over Interference-limited Channels

链接: https://arxiv.org/abs/2410.19767
作者: Shubham Paul,Sudharsan Senthil,Preethi Seshadri,Nambi Seshadri,R David Koilpillai
关键词-EN: two-user interference channel, neighbouring cells, cellular networks, users located, context of cellular
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Submitted to WCNC 2025

点击查看摘要

Abstract:In the context of cellular networks, users located at the periphery of cells are particularly vulnerable to substantial interference from neighbouring cells, which can be represented as a two-user interference channel. This study introduces two highly effective methodologies, namely TwinNet and SiameseNet, using autoencoders, tailored for the design of encoders and decoders for block transmission and detection in interference-limited environments. The findings unambiguously illustrate that the developed models are capable of leveraging the interference structure to outperform traditional methods reliant on complete orthogonality. While it is recognized that systems employing coordinated transmissions and independent detection can offer greater capacity, the specific gains of data-driven models have not been thoroughly quantified or elucidated. This paper conducts an analysis to demonstrate the quantifiable advantages of such models in particular scenarios. Additionally, a comprehensive examination of the characteristics of codewords generated by these models is provided to offer a more intuitive comprehension of how these models achieve superior performance.

[LG-122] A New Perspective to Boost Performance Fairness for Medical Federated Learning

链接: https://arxiv.org/abs/2410.19765
作者: Yunlu Yan,Lei Zhu,Yuexiang Li,Xinxing Xu,Rick Siow Mong Goh,Yong Liu,Salman Khan,Chun-Mei Feng
关键词-EN: benefits healthy, sustainable collaboration, healthy and sustainable, Improving the fairness, Improving
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
*备注: 11 pages, 2 Figures

点击查看摘要

Abstract:Improving the fairness of federated learning (FL) benefits healthy and sustainable collaboration, especially for medical applications. However, existing fair FL methods ignore the specific characteristics of medical FL applications, i.e., domain shift among the datasets from different hospitals. In this work, we propose Fed-LWR to improve performance fairness from the perspective of feature shift, a key issue influencing the performance of medical FL systems caused by domain shift. Specifically, we dynamically perceive the bias of the global model across all hospitals by estimating the layer-wise difference in feature representations between local and global models. To minimize global divergence, we assign higher weights to hospitals with larger differences. The estimated client weights help us to re-aggregate the local models per layer to obtain a fairer global model. We evaluate our method on two widely used federated medical image segmentation benchmarks. The results demonstrate that our method achieves better and fairer performance compared with several state-of-the-art fair FL methods.

[LG-123] Physical Simulation for Multi-agent Multi-machine Tending

链接: https://arxiv.org/abs/2410.19761
作者: Abdalwhab Abdalwhab,Giovanni Beltrame,David St-Onge
关键词-EN: workforce shortages, heavily minimize, sector was recently, recently affected, affected by workforce
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 3 pages, one figure, an extended abstract presented at the 7th edition of the Montreal AI symposium (MAIS) 2024

点击查看摘要

Abstract:The manufacturing sector was recently affected by workforce shortages, a problem that automation and robotics can heavily minimize. Simultaneously, reinforcement learning (RL) offers a promising solution where robots can learn through interaction with the environment. In this work, we leveraged a simplistic robotic system to work with RL with “real” data without having to deploy large expensive robots in a manufacturing setting. A real-world tabletop arena was designed with robots that mimic the agents’ behavior in the simulation. Despite the difference in dynamics and machine size, the robots were able to depict the same behavior as in the simulation. In addition, those experiments provided an initial understanding of the real deployment challenges.

[LG-124] Establishing Nationwide Power System Vulnerability Index across US Counties Using Interpretable Machine Learning

链接: https://arxiv.org/abs/2410.19754
作者: Junwei Ma,Bo Li,Olufemi A. Omitaomu,Ali Mostafavi
关键词-EN: power system vulnerability, rising energy demand, power system, system vulnerability, Power outages
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Power outages have become increasingly frequent, intense, and prolonged in the US due to climate change, aging electrical grids, and rising energy demand. However, largely due to the absence of granular spatiotemporal outage data, we lack data-driven evidence and analytics-based metrics to quantify power system vulnerability. This limitation has hindered the ability to effectively evaluate and address vulnerability to power outages in US communities. Here, we collected ~179 million power outage records at 15-minute intervals across 3022 US contiguous counties (96.15% of the area) from 2014 to 2023. We developed a power system vulnerability assessment framework based on three dimensions (intensity, frequency, and duration) and applied interpretable machine learning models (XGBoost and SHAP) to compute Power System Vulnerability Index (PSVI) at the county level. Our analysis reveals a consistent increase in power system vulnerability over the past decade. We identified 318 counties across 45 states as hotspots for high power system vulnerability, particularly in the West Coast (California and Washington), the East Coast (Florida and the Northeast area), the Great Lakes megalopolis (Chicago-Detroit metropolitan areas), and the Gulf of Mexico (Texas). Heterogeneity analysis indicates that urban counties, counties with interconnected grids, and states with high solar generation exhibit significantly higher vulnerability. Our results highlight the significance of the proposed PSVI for evaluating the vulnerability of communities to power outages. The findings underscore the widespread and pervasive impact of power outages across the country and offer crucial insights to support infrastructure operators, policymakers, and emergency managers in formulating policies and programs aimed at enhancing the resilience of the US power infrastructure.

[LG-125] Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis

链接: https://arxiv.org/abs/2410.19736
作者: William Murphy,Nikolaus Holzer,Feitong Qiao,Leyi Cui,Raven Rothkopf,Nathan Koenig,Mark Santolucito
关键词-EN: Large Language Models, Large Language, Language Models, past few years, exploded in usefulness
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In the past few years, Large Language Models (LLMs) have exploded in usefulness and popularity for code generation tasks. However, LLMs still struggle with accuracy and are unsuitable for high-risk applications without additional oversight and verification. In particular, they perform poorly at generating code for highly complex systems, especially with unusual or out-of-sample logic. For such systems, verifying the code generated by the LLM may take longer than writing it by hand. We introduce a solution that divides the code generation into two parts; one to be handled by an LLM and one to be handled by formal methods-based program synthesis. We develop a benchmark to test our solution and show that our method allows the pipeline to solve problems previously intractable for LLM code generation.

[LG-126] Adaptive Transfer Clustering: A Unified Framework

链接: https://arxiv.org/abs/2410.21263
作者: Yuqi Gu,Zhongyuan Lyu,Kaizheng Wang
关键词-EN: general transfer learning, transfer learning framework, learning framework, main dataset, Gaussian mixture
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 52 pages, 8 figures

点击查看摘要

Abstract:We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method’s effectiveness in various scenarios.

[LG-127] Quantum computing and persistence in topological data analysis

链接: https://arxiv.org/abs/2410.21258
作者: Casper Gyurik,Alexander Schmidhuber,Robbie King,Vedran Dunjko,Ryu Hayakawa
关键词-EN: Topological data analysis, extract noise-robust features, Topological data, data analysis, aims to extract
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:Topological data analysis (TDA) aims to extract noise-robust features from a data set by examining the number and persistence of holes in its topology. We show that a computational problem closely related to a core task in TDA – determining whether a given hole persists across different length scales – is \mathsfBQP_1 -hard and contained in \mathsfBQP . This result implies an exponential quantum speedup for this problem under standard complexity-theoretic assumptions. Our approach relies on encoding the persistence of a hole in a variant of the guided sparse Hamiltonian problem, where the guiding state is constructed from a harmonic representative of the hole.

[LG-128] On learning higher-order cumulants in diffusion models NEURIPS2024

链接: https://arxiv.org/abs/2410.21212
作者: Gert Aarts,Diaa E. Habibi,Lingxiao Wang,Kai Zhou
关键词-EN: connected n-point functions, diffusion models learn, forward process, forward process higher-order, models learn correlations
类目: High Energy Physics - Lattice (hep-lat); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 21 pages, many figures. Extended version of contribution accepted in the NeurIPS 2024 workshop “Machine Learning and the Physical Sciences”

点击查看摘要

Abstract:To analyse how diffusion models learn correlations beyond Gaussian ones, we study the behaviour of higher-order cumulants, or connected n-point functions, under both the forward and backward process. We derive explicit expressions for the moment- and cumulant-generating functionals, in terms of the distribution of the initial data and properties of forward process. It is shown analytically that during the forward process higher-order cumulants are conserved in models without a drift, such as the variance-expanding scheme, and that therefore the endpoint of the forward process maintains nontrivial correlations. We demonstrate that since these correlations are encoded in the score function, higher-order cumulants are learnt in the backward process, also when starting from a normal prior. We confirm our analytical results in an exactly solvable toy model with nonzero cumulants and in scalar lattice field theory.

[LG-129] Robustness and Generalization in Quantum Reinforcement Learning via Lipschitz Regularization

链接: https://arxiv.org/abs/2410.21117
作者: Nico Meyer,Julian Berberich,Christopher Mutschler,Daniel D. Scherer
关键词-EN: promising significant advancements, reduce model complexity, model complexity compared, leverages quantum computing, classical approaches
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Quantum machine learning leverages quantum computing to enhance accuracy and reduce model complexity compared to classical approaches, promising significant advancements in various fields. Within this domain, quantum reinforcement learning has garnered attention, often realized using variational quantum circuits to approximate the policy function. This paper addresses the robustness and generalization of quantum reinforcement learning by combining principles from quantum computing and control theory. Leveraging recent results on robust quantum machine learning, we utilize Lipschitz bounds to propose a regularized version of a quantum policy gradient approach, named the RegQPG algorithm. We show that training with RegQPG improves the robustness and generalization of the resulting policies. Furthermore, we introduce an algorithmic variant that incorporates curriculum learning, which minimizes failures during training. Our findings are validated through numerical experiments, demonstrating the practical benefits of our approach.

[LG-130] Stronger Regret Bounds for Safe Online Reinforcement Learning in the Linear Quadratic Regulator

链接: https://arxiv.org/abs/2410.21081
作者: Benjamin Schiffer,Lucas Janson
关键词-EN: online reinforcement learning, reinforcement learning require, Linear Quadratic Regulator, study Linear Quadratic, practical applications
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we study Linear Quadratic Regulator (LQR) learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Unlike in previous works, we allow for both bounded and unbounded noise distributions and study stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to these complications, we focus on 1-dimensional state- and action- spaces, however we also discuss how we expect the high-level takeaways can generalize to higher dimensions. Our primary contribution is the first \tildeO_T(\sqrtT) -regret bound for constrained LQR learning, which we show relative to a specific baseline of non-linear controllers. We then prove that, for any non-linear baseline satisfying natural assumptions, \tildeO_T(\sqrtT) -regret is possible when the noise distribution has sufficiently large support and \tildeO_T(T^2/3) -regret is possible for any subgaussian noise distribution. An overarching theme of our results is that enforcing safety provides “free exploration” that compensates for the added cost of uncertainty in safety constrained control, resulting in the same regret rate as in the unconstrained problem.

[LG-131] Accelerated Bayesian parameter estimation and model selection for gravitational waves with normalizing flows NEURIPS2024

链接: https://arxiv.org/abs/2410.21076
作者: Alicja Polanska,Thibeau Wouters,Peter T. H. Pang,Kaze K. W. Wong,Jason D. McEwen
关键词-EN: joint Bayesian parameter, Bayesian parameter estimation, high-performance computing techniques, gravitational wave astrophysics, chain Monte Carlo
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: accepted to NeurIPS 2024 workshop on Machine Learning and the Physical Sciences

点击查看摘要

Abstract:We present an accelerated pipeline, based on high-performance computing techniques and normalizing flows, for joint Bayesian parameter estimation and model selection and demonstrate its efficiency in gravitational wave astrophysics. We integrate the Jim inference toolkit, a normalizing flow-enhanced Markov chain Monte Carlo (MCMC) sampler, with the learned harmonic mean estimator. Our Bayesian evidence estimates run on 1 GPU are consistent with traditional nested sampling techniques run on 16 CPU cores, while reducing the computation time by factors of 5\times and 15\times for 4 -dimensional and 11 -dimensional gravitational wave inference problems, respectively. Our code is available in well-tested and thoroughly documented open-source packages, ensuring accessibility and reproducibility for the wider research community.

[LG-132] BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration

链接: https://arxiv.org/abs/2410.21033
作者: James Sharpnack,Kevin Hao,Phoebe Mulcaire,Klinton Bicknell,Geoff LaFlair,Kevin Yancey,Alina A. von Davier
关键词-EN: robust large-scale computerized, large-scale computerized adaptive, present a complete, quickly calibrating, calibrating and administering
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (this http URL, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2410.21033 [stat.ML] (or arXiv:2410.21033v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2410.21033 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-133] Breccia and basalt classification of thin sections of Apollo rocks with deep learning

链接: https://arxiv.org/abs/2410.21024
作者: Freja Thoresen,Aidan Cowley,Romeo Haak,Jonas Lewe,Clara Moriceau,Piotr Knapczyk,Victoria S. Engelschiøn
关键词-EN: Apollo programme time, Human exploration, moon is expected, programme time, expected to resume
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human exploration of the moon is expected to resume in the next decade, following the last such activities in the Apollo programme time. One of the major objectives of returning to the Moon is to continue retrieving geological samples, with a focus on collecting high-quality specimens to maximize scientific return. Tools that assist astronauts in making informed decisions about sample collection activities can maximize the scientific value of future lunar missions. A lunar rock classifier is a tool that can potentially provide the necessary information for astronauts to analyze lunar rock samples, allowing them to augment in-situ value identification of samples. Towards demonstrating the value of such a tool, in this paper, we introduce a framework for classifying rock types in thin sections of lunar rocks. We leverage the vast collection of petrographic thin-section images from the Apollo missions, captured under plane-polarized light (PPL), cross-polarised light (XPL), and reflected light at varying magnifications. Advanced machine learning methods, including contrastive learning, are applied to analyze these images and extract meaningful features. The contrastive learning approach fine-tunes a pre-trained Inception-Resnet-v2 network with the SimCLR loss function. The fine-tuned Inception-Resnet-v2 network can then extract essential features effectively from the thin-section images of Apollo rocks. A simple binary classifier is trained using transfer learning from the fine-tuned Inception-ResNet-v2 to 98.44% ( \pm 1.47) accuracy in separating breccias from basalts.

[LG-134] Large Language Model-Guided Prediction Toward Quantum Materials Synthesis

链接: https://arxiv.org/abs/2410.20976
作者: Ryotaro Okabe,Zack West,Abhijatmedhi Chotrattanapituk,Mouyang Cheng,Denisse Córdova Carrizales,Weiwei Xie,Robert J. Cava,Mingda Li
关键词-EN: modern technology, essential for modern, quantum materials development, inorganic crystalline materials, materials
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 66 pages total, 6 main figures + 3 supplementary figures

点击查看摘要

Abstract:The synthesis of inorganic crystalline materials is essential for modern technology, especially in quantum materials development. However, designing efficient synthesis workflows remains a significant challenge due to the precise experimental conditions and extensive trial and error. Here, we present a framework using large language models (LLMs) to predict synthesis pathways for inorganic materials, including quantum materials. Our framework contains three models: LHS2RHS, predicting products from reactants; RHS2LHS, predicting reactants from products; and TGT2CEQ, generating full chemical equations for target compounds. Fine-tuned on a text-mined synthesis database, our model raises accuracy from under 40% with pretrained models, to under 80% using conventional fine-tuning, and further to around 90% with our proposed generalized Tanimoto similarity, while maintaining robust to additional synthesis steps. Our model further demonstrates comparable performance across materials with varying degrees of quantumness quantified using quantum weight, indicating that LLMs offer a powerful tool to predict balanced chemical equations for quantum materials discovery.

[LG-135] Asteroid Mining: ACTFriends Results for the GTOC 12 Problem

链接: https://arxiv.org/abs/2410.20839
作者: Dario Izzo,Marcus Märtens,Laurent Beauregard,Max Bannach,Giacomo Acciarini,Emmanuel Blazquez,Alexander Hadjiivanov,Jai Grover,Gernot Heißel,Yuri Shimane,Chit Hong Yam
关键词-EN: Global Trajectory Competition, Global Trajectory, Sustainable Asteroid Mining, Advanced Concepts Team, ESA Advanced Concepts
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In 2023, the 12th edition of Global Trajectory Competition was organised around the problem referred to as “Sustainable Asteroid Mining”. This paper reports the developments that led to the solution proposed by ESA’s Advanced Concepts Team. Beyond the fact that the proposed approach failed to rank higher than fourth in the final competition leader-board, several innovative fundamental methodologies were developed which have a broader application. In particular, new methods based on machine learning as well as on manipulating the fundamental laws of astrodynamics were developed and able to fill with remarkable accuracy the gap between full low-thrust trajectories and their representation as impulsive Lambert transfers. A novel technique was devised to formulate the challenge of optimal subset selection from a repository of pre-existing optimal mining trajectories as an integer linear programming problem. Finally, the fundamental problem of searching for single optimal mining trajectories (mining and collecting all resources), albeit ignoring the possibility of having intra-ship collaboration and thus sub-optimal in the case of the GTOC12 problem, was efficiently solved by means of a novel search based on a look-ahead score and thus making sure to select asteroids that had chances to be re-visited later on.

[LG-136] Robust Estimation for Kernel Exponential Families with Smoothed Total Variation Distances

链接: https://arxiv.org/abs/2410.20760
作者: Takafumi Kanamori,Kodai Yokoyama,Takayuki Kawashima
关键词-EN: commonly assume, independent and identically, identically distributed, pre-specified statistical model, probability distribution included
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In statistical inference, we commonly assume that samples are independent and identically distributed from a probability distribution included in a pre-specified statistical model. However, such an assumption is often violated in practice. Even an unexpected extreme sample called an \it outlier can significantly impact classical estimators. Robust statistics studies how to construct reliable statistical methods that efficiently work even when the ideal assumption is violated. Recently, some works revealed that robust estimators such as Tukey’s median are well approximated by the generative adversarial net (GAN), a popular learning method for complex generative models using neural networks. GAN is regarded as a learning method using integral probability metrics (IPM), which is a discrepancy measure for probability distributions. In most theoretical analyses of Tukey’s median and its GAN-based approximation, however, the Gaussian or elliptical distribution is assumed as the statistical model. In this paper, we explore the application of GAN-like estimators to a general class of statistical models. As the statistical model, we consider the kernel exponential family that includes both finite and infinite-dimensional models. To construct a robust estimator, we propose the smoothed total variation (STV) distance as a class of IPMs. Then, we theoretically investigate the robustness properties of the STV-based estimators. Our analysis reveals that the STV-based estimator is robust against the distribution contamination for the kernel exponential family. Furthermore, we analyze the prediction accuracy of a Monte Carlo approximation method, which circumvents the computational difficulty of the normalization constant.

[LG-137] Likelihood approximations via Gaussian approximate inference

链接: https://arxiv.org/abs/2410.20754
作者: Thang D. Bui
关键词-EN: modelling complex real-world, complex real-world observations, pose significant computational, significant computational challenges, essential for modelling
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-Gaussian likelihoods are essential for modelling complex real-world observations but pose significant computational challenges in learning and inference. Even with Gaussian priors, non-Gaussian likelihoods often lead to analytically intractable posteriors, necessitating approximation methods. To this end, we propose efficient schemes to approximate the effects of non-Gaussian likelihoods by Gaussian densities based on variational inference and moment matching in transformed bases. These enable efficient inference strategies originally designed for models with a Gaussian likelihood to be deployed. Our empirical results demonstrate that the proposed matching strategies attain good approximation quality for binary and multiclass classification in large-scale point-estimate and distributional inferential settings. In challenging streaming problems, the proposed methods outperform all existing likelihood approximations and approximate inference methods in the exact models. As a by-product, we show that the proposed approximate log-likelihoods are a superior alternative to least-squares on raw labels for neural network classification.

[LG-138] Wearable-Based Real-time Freezing of Gait Detection in Parkinsons Disease Using Self-Supervised Learning

链接: https://arxiv.org/abs/2410.20715
作者: Shovito Barua Soumma,Kartik Mangipudi,Daniel Peterson,Shyamal Mehta,Hassan Ghasemzadeh
关键词-EN: single triaxial accelerometer, Freezing of Gait, Parkinson Disease, innovative self-supervised learning, self-supervised learning framework
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 2pages, 2 figures, submitted in BHI’24

点击查看摘要

Abstract:LIFT-PD is an innovative self-supervised learning framework developed for real-time detection of Freezing of Gait (FoG) in Parkinson’s Disease (PD) patients, using a single triaxial accelerometer. It minimizes the reliance on large labeled datasets by applying a Differential Hopping Windowing Technique (DHWT) to address imbalanced data during training. Additionally, an Opportunistic Inference Module is used to reduce energy consumption by activating the model only during active movement periods. Extensive testing on publicly available datasets showed that LIFT-PD improved precision by 7.25% and accuracy by 4.4% compared to supervised models, while using 40% fewer labeled samples and reducing inference time by 67%. These findings make LIFT-PD a highly practical and energy-efficient solution for continuous, in-home monitoring of PD patients.

[LG-139] Super Resolution Based on Deep Operator Networks

链接: https://arxiv.org/abs/2410.20706
作者: Siyuan Yang
关键词-EN: Deep Operator Networks, partial differential equations, perform super-resolution reconstruction, super-resolution reconstruction, conventional interpolation methods
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We use Deep Operator Networks (DeepONets) to perform super-resolution reconstruction of the solutions of two types of partial differential equations and compare the model predictions with the results obtained using conventional interpolation methods to verify the advantages of DeepONets. We employ two pooling methods to downsample the origin data and conduct super-resolution reconstruction under three different resolutions of input images. The results show that the DeepONet model can predict high-frequency oscillations and small-scale structures from low-resolution inputs very well. For the two-dimensional problem, we introduce convolutional layers to extract information from input images at a lower cost than purer MLPs. We adjust the size of the training set and observe the variation of prediction errors. In both one-dimensional and two-dimensional cases, the super-resolution reconstruction using the DeepONet model demonstrates much more accurate prediction results than cubic spline interpolation, highlighting the superiority of operator learning methods in handling such problems compared to traditional interpolation techniques.

[LG-140] Joint Channel Selection using FedDRL in V2X

链接: https://arxiv.org/abs/2410.20687
作者: Lorenzo Mancini,Safwan Labbi,Karim Abed Meraim,Fouzi Boukhalfa,Alain Durmus,Paul Mangold,Eric Moulines
关键词-EN: Deep Reinforcement Learning, revolutionizing transportation, enabling interactions, Machine Learning, Proximal Policy Optimization
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vehicle-to-everything (V2X) communication technology is revolutionizing transportation by enabling interactions between vehicles, devices, and infrastructures. This connectivity enhances road safety, transportation efficiency, and driver assistance systems. V2X benefits from Machine Learning, enabling real-time data analysis, better decision-making, and improved traffic predictions, making transportation safer and more efficient. In this paper, we study the problem of joint channel selection, where vehicles with different technologies choose one or more Access Points (APs) to transmit messages in a network. In this problem, vehicles must learn a strategy for channel selection, based on observations that incorporate vehicles’ information (position and speed), network and communication data (Signal-to-Interference-plus-Noise Ratio from past communications), and environmental data (road type). We propose an approach based on Federated Deep Reinforcement Learning (FedDRL), which enables each vehicle to benefit from other vehicles’ experiences. Specifically, we apply the federated Proximal Policy Optimization (FedPPO) algorithm to this task. We show that this method improves communication reliability while minimizing transmission costs and channel switches. The efficiency of the proposed solution is assessed via realistic simulations, highlighting the potential of FedDRL to advance V2X technology.

[LG-141] Multi-modal Data based Semi-Supervised Learning for Vehicle Positioning

链接: https://arxiv.org/abs/2410.20680
作者: Ouwen Huan,Yang Yang,Tao Luo,Mingzhe Chen
关键词-EN: unlabeled CSI data, CSI data, unlabeled CSI, based semi-supervised learning, multi-modal data based
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, a multi-modal data based semi-supervised learning (SSL) framework that jointly use channel state information (CSI) data and RGB images for vehicle positioning is designed. In particular, an outdoor positioning system where the vehicle locations are determined by a base station (BS) is considered. The BS equipped with several cameras can collect a large amount of unlabeled CSI data and a small number of labeled CSI data of vehicles, and the images taken by cameras. Although the collected images contain partial information of vehicles (i.e. azimuth angles of vehicles), the relationship between the unlabeled CSI data and its azimuth angle, and the distances between the BS and the vehicles captured by images are both unknown. Therefore, the images cannot be directly used as the labels of unlabeled CSI data to train a positioning model. To exploit unlabeled CSI data and images, a SSL framework that consists of a pretraining stage and a downstream training stage is proposed. In the pretraining stage, the azimuth angles obtained from the images are considered as the labels of unlabeled CSI data to pretrain the positioning model. In the downstream training stage, a small sized labeled dataset in which the accurate vehicle positions are considered as labels is used to retrain the model. Simulation results show that the proposed method can reduce the positioning error by up to 30% compared to a baseline where the model is not pretrained.

[LG-142] MCI-GRU: Stock Prediction Model Based on Multi-Head Cross-Attention and Improved GRU

链接: https://arxiv.org/abs/2410.20679
作者: Peng Zhu,Yuante Li,Yifan Hu,Sheng Xiang,Qinyuan Liu,Dawei Cheng,Yuqi Liang
关键词-EN: big data era, financial markets grow, markets grow increasingly, grow increasingly complex, Graph Neural Networks
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:As financial markets grow increasingly complex in the big data era, accurate stock prediction has become more critical. Traditional time series models, such as GRUs, have been widely used but often struggle to capture the intricate nonlinear dynamics of markets, particularly in the flexible selection and effective utilization of key historical information. Recently, methods like Graph Neural Networks and Reinforcement Learning have shown promise in stock prediction but require high data quality and quantity, and they tend to exhibit instability when dealing with data sparsity and noise. Moreover, the training and inference processes for these models are typically complex and computationally expensive, limiting their broad deployment in practical applications. Existing approaches also generally struggle to capture unobservable latent market states effectively, such as market sentiment and expectations, microstructural factors, and participant behavior patterns, leading to an inadequate understanding of market dynamics and subsequently impact prediction accuracy. To address these challenges, this paper proposes a stock prediction model, MCI-GRU, based on a multi-head cross-attention mechanism and an improved GRU. First, we enhance the GRU model by replacing the reset gate with an attention mechanism, thereby increasing the model’s flexibility in selecting and utilizing historical information. Second, we design a multi-head cross-attention mechanism for learning unobservable latent market state representations, which are further enriched through interactions with both temporal features and cross-sectional features. Finally, extensive experiments on four main stock markets show that the proposed method outperforms SOTA techniques across multiple metrics. Additionally, its successful application in real-world fund management operations confirms its effectiveness and practicality.

[LG-143] A Machine Learning-Driven Wireless System for Structural Health Monitoring

链接: https://arxiv.org/abs/2410.20678
作者: Marius Pop,Mihai Tudose,Daniel Visan,Mircea Bocioaga,Mihai Botan,Cesar Banu,Tiberiu Salaoru
关键词-EN: fiber reinforced polymer, carbon fiber reinforced, primarily targeting aerospace, wireless system integrated, machine learning
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 16 pages, 14 figures

点击查看摘要

Abstract:The paper presents a wireless system integrated with a machine learning (ML) model for structural health monitoring (SHM) of carbon fiber reinforced polymer (CFRP) structures, primarily targeting aerospace applications. The system collects data via carbon nanotube (CNT) piezoresistive sensors embedded within CFRP coupons, wirelessly transmitting these data to a central server for processing. A deep neural network (DNN) model predicts mechanical properties and can be extended to forecast structural failures, facilitating proactive maintenance and enhancing safety. The modular design supports scalability and can be embedded within digital twin frameworks, offering significant benefits to aircraft operators and manufacturers. The system utilizes an ML model with a mean absolute error (MAE) of 0.14 on test data for forecasting mechanical properties. Data transmission latency throughout the entire system is less than one second in a LAN setup, highlighting its potential for real-time monitoring applications in aerospace and other industries. However, while the system shows promise, challenges such as sensor reliability under extreme environmental conditions and the need for advanced ML models to handle diverse data streams have been identified as areas for future research.

[LG-144] Injectivity capacity of ReLU gates

链接: https://arxiv.org/abs/2410.20646
作者: Mihailo Stojnic
关键词-EN: ReLU networks layers, ReLU layers injectivity, ReLU injectivity capacity, ReLU networks, injectivity property
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the injectivity property of the ReLU networks layers. Determining the ReLU injectivity capacity (ratio of the number of layer’s inputs and outputs) is established as isomorphic to determining the capacity of the so-called \ell_0 spherical perceptron. Employing \emphfully lifted random duality theory (fl RDT) a powerful program is developed and utilized to handle the \ell_0 spherical perceptron and implicitly the ReLU layers injectivity. To put the entire fl RDT machinery in practical use, a sizeable set of numerical evaluations is conducted as well. The lifting mechanism is observed to converge remarkably fast with relative corrections in the estimated quantities not exceeding \sim 0.1% already on the third level of lifting. Closed form explicit analytical relations among key lifting parameters are uncovered as well. In addition to being of incredible importance in handling all the required numerical work, these relations also shed a new light on beautiful parametric interconnections within the lifting structure. Finally, the obtained results are also shown to fairly closely match the replica predictions from [40].

[LG-145] Near Optimal Pure Exploration in Logistic Bandits

链接: https://arxiv.org/abs/2410.20640
作者: Eduardo Ochoa Rivera,Ambuj Tewari
关键词-EN: garnered significant attention, significant attention due, real-world scenarios, garnered significant, significant attention
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 25 pages, 2 figures

点击查看摘要

Abstract:Bandit algorithms have garnered significant attention due to their practical applications in real-world scenarios. However, beyond simple settings such as multi-arm or linear bandits, optimal algorithms remain scarce. Notably, no optimal solution exists for pure exploration problems in the context of generalized linear model (GLM) bandits. In this paper, we narrow this gap and develop the first track-and-stop algorithm for general pure exploration problems under the logistic bandit called logistic track-and-stop (Log-TS). Log-TS is an efficient algorithm that asymptotically matches an approximation for the instance-specific lower bound of the expected sample complexity up to a logarithmic factor.

[LG-146] Kernel Approximation of Fisher-Rao Gradient Flows

链接: https://arxiv.org/abs/2410.20622
作者: Jia-Jie Zhu,Alexander Mielke
关键词-EN: PDE gradient flows, gradient flows, PDE gradient, Wasserstein type gradient, flows
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:The purpose of this paper is to answer a few open questions in the interface of kernel methods and PDE gradient flows. Motivated by recent advances in machine learning, particularly in generative modeling and sampling, we present a rigorous investigation of Fisher-Rao and Wasserstein type gradient flows concerning their gradient structures, flow equations, and their kernel approximations. Specifically, we focus on the Fisher-Rao (also known as Hellinger) geometry and its various kernel-based approximations, developing a principled theoretical framework using tools from PDE gradient flows and optimal transport theory. We also provide a complete characterization of gradient flows in the maximum-mean discrepancy (MMD) space, with connections to existing learning and inference algorithms. Our analysis reveals precise theoretical insights linking Fisher-Rao flows, Stein flows, kernel discrepancies, and nonparametric regression. We then rigorously prove evolutionary \Gamma -convergence for kernel-approximated Fisher-Rao flows, providing theoretical guarantees beyond pointwise convergence. Finally, we analyze energy dissipation using the Helmholtz-Rayleigh principle, establishing important connections between classical theory in mechanics and modern machine learning practice. Our results provide a unified theoretical foundation for understanding and analyzing approximations of gradient flows in machine learning applications through a rigorous gradient flow and variational method perspective.

[LG-147] SIGMA: Single Interpolated Generative Model for Anomalies

链接: https://arxiv.org/abs/2410.20537
作者: Ranit Das,David Shih
关键词-EN: resonant anomaly detection, anomaly detection search, signal region, key step, resonant anomaly
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:A key step in any resonant anomaly detection search is accurate modeling of the background distribution in each signal region. Data-driven methods like CATHODE accomplish this by training separate generative models on the complement of each signal region, and interpolating them into their corresponding signal regions. Having to re-train the generative model on essentially the entire dataset for each signal region is a major computational cost in a typical sliding window search with many signal regions. Here, we present SIGMA, a new, fully data-driven, computationally-efficient method for estimating background distributions. The idea is to train a single generative model on all of the data and interpolate its parameters in sideband regions in order to obtain a model for the background in the signal region. The SIGMA method significantly reduces the computational cost compared to previous approaches, while retaining a similar high quality of background modeling and sensitivity to anomalous signals.

[LG-148] Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds

链接: https://arxiv.org/abs/2410.20318
作者: Tiangang Cui,Alex Gorodetsky
关键词-EN: enabling efficient computation, Bayesian matrix completion, low-rank Bayesian matrix, enabling efficient, efficient computation
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (-within-Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate an applications of our approach to fit the categorical data of a mice protein dataset and the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered.

[LG-149] On the Gaussian process limit of Bayesian Additive Regression Trees

链接: https://arxiv.org/abs/2410.20289
作者: Giacomo Petrillo
关键词-EN: Bayesian Additive Regression, Bayesian Additive, Additive Regression Trees, nonparametric Bayesian regression, Bayesian regression technique
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian Additive Regression Trees (BART) is a nonparametric Bayesian regression technique of rising fame. It is a sum-of-decision-trees model, and is in some sense the Bayesian version of boosting. In the limit of infinite trees, it becomes equivalent to Gaussian process (GP) regression. This limit is known but has not yet led to any useful analysis or application. For the first time, I derive and compute the exact BART prior covariance function. With it I implement the infinite trees limit of BART as GP regression. Through empirical tests, I show that this limit is worse than standard BART in a fixed configuration, but also that tuning the hyperparameters in the natural GP way yields a competitive method, although a properly tuned BART is still superior. The advantage of using a GP surrogate of BART is the analytical likelihood, which simplifies model building and sidesteps the complex BART MCMC. More generally, this study opens new ways to understand and develop BART and GP regression. The implementation of BART as GP is available in the Python package this https URL .

[LG-150] Robust Model Evaluation over Large-scale Federated Networks

链接: https://arxiv.org/abs/2410.20250
作者: Amir Najafi,Samin Mahdizadeh Sani,Farzan Farnia
关键词-EN: machine learning model, unseen target network, source network, address the challenge, challenge of certifying
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 40 pages

点击查看摘要

Abstract:In this paper, we address the challenge of certifying the performance of a machine learning model on an unseen target network, using measurements from an available source network. We focus on a scenario where heterogeneous datasets are distributed across a source network of clients, all connected to a central server. Specifically, consider a source network “A” composed of K clients, each holding private data from unique and heterogeneous distributions, which are assumed to be independent samples from a broader meta-distribution \mu . Our goal is to provide certified guarantees for the model’s performance on a different, unseen target network “B,” governed by another meta-distribution \mu’ , assuming the deviation between \mu and \mu’ is bounded by either the Wasserstein distance or an f -divergence. We derive theoretical guarantees for the model’s empirical average loss and provide uniform bounds on the risk CDF, where the latter correspond to novel and adversarially robust versions of the Glivenko-Cantelli theorem and the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Our bounds are computable in polynomial time with a polynomial number of queries to the K clients, preserving client privacy by querying only the model’s (potentially adversarial) loss on private data. We also establish non-asymptotic generalization bounds that consistently converge to zero as both K and the minimum client sample size grow. Extensive empirical evaluations validate the robustness and practicality of our bounds across real-world tasks.

[LG-151] he inexact power augmented Lagrangian method for constrained nonconvex optimization

链接: https://arxiv.org/abs/2410.20153
作者: Alexander Bodard,Konstantinos Oikonomidis,Emanuel Laude,Panagiotis Patrinos
关键词-EN: inexact augmented Lagrangian, augmented Lagrangian method, augmented Lagrangian, unconventional inexact augmented, Euclidean norm raised
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work introduces an unconventional inexact augmented Lagrangian method, where the augmenting term is a Euclidean norm raised to a power between one and two. The proposed algorithm is applicable to a broad class of constrained nonconvex minimization problems, that involve nonlinear equality constraints over a convex set under a mild regularity condition. First, we conduct a full complexity analysis of the method, leveraging an accelerated first-order algorithm for solving the Hölder-smooth subproblems. Next, we present an inexact proximal point method to tackle these subproblems, demonstrating that it achieves an improved convergence rate. Notably, this rate reduces to the best-known convergence rate for first-order methods when the augmenting term is a squared Euclidean norm. Our worst-case complexity results further show that using lower powers for the augmenting term leads to faster constraint satisfaction, albeit with a slower decrease in the dual residual. Numerical experiments support our theoretical findings, illustrating that this trade-off between constraint satisfaction and cost minimization is advantageous for certain practical problems.

[LG-152] Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD NEURIPS2024

链接: https://arxiv.org/abs/2410.20135
作者: Aniket Das,Dheeraj Nagaraj,Soumyabrata Pal,Arun Suggala,Prateek Varshney
关键词-EN: traditional batch setting, batch setting due, high-dimensional heavy-tailed statistical, heavy-tailed statistical estimation, Sigma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:We consider the problem of high-dimensional heavy-tailed statistical estimation in the streaming setting, which is much harder than the traditional batch setting due to memory constraints. We cast this problem as stochastic convex optimization with heavy tailed stochastic gradients, and prove that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite. More precisely, with T samples, we show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of \sqrt\frac\mathsfTr(\Sigma)+\sqrt\mathsfTr(\Sigma)|\Sigma|_2\log(\frac\log(T)\delta)T with probability 1-\delta , where \Sigma is the covariance of the clipped gradient. Note that the fluctuations (depending on \frac1\delta ) are of lower order than the term \mathsfTr(\Sigma) . This improves upon the current best rate of \sqrt\frac\mathsfTr(\Sigma)\log(\frac1\delta)T for Clipped-SGD, known only for smooth and strongly convex objectives. Our results also extend to smooth convex and lipschitz convex objectives. Key to our result is a novel iterative refinement strategy for martingale concentration, improving upon the PAC-Bayes approach of Catoni and Giulini.

[LG-153] ISDNN: A Deep Neural Network for Channel Estimation in Massive MIMO systems

链接: https://arxiv.org/abs/2410.20110
作者: Do Hai Son,Vu Tung Lam,Tran Thi Thuy Quynh
关键词-EN: Massive Multiple-Input Multiple-Output, massive MIMO technology, massive MIMO, Multiple-Input Multiple-Output, Iterative Sequential DNN
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Massive Multiple-Input Multiple-Output (massive MIMO) technology stands as a cornerstone in 5G and beyonds. Despite the remarkable advancements offered by massive MIMO technology, the extreme number of antennas introduces challenges during the channel estimation (CE) phase. In this paper, we propose a single-step Deep Neural Network (DNN) for CE, termed Iterative Sequential DNN (ISDNN), inspired by recent developments in data detection algorithms. ISDNN is a DNN based on the projected gradient descent algorithm for CE problems, with the iterative iterations transforming into a DNN using the deep unfolding method. Furthermore, we introduce the structured channel ISDNN (S-ISDNN), extending ISDNN to incorporate side information such as directions of signals and antenna array configurations for enhanced CE. Simulation results highlight that ISDNN significantly outperforms another DNN-based CE (DetNet), in terms of training time (13%), running time (4.6%), and accuracy (0.43 dB). Furthermore, the S-ISDNN demonstrates even faster than ISDNN in terms of training time, though its overall performance still requires further improvement.

[LG-154] Dimension reduction via score ratio matching

链接: https://arxiv.org/abs/2410.19990
作者: Ricardo Baptista,Michael Brennan,Youssef Marzouk
关键词-EN: identifying maximally informative, allowing high-dimensional problems, Gradient-based dimension reduction, dimension reduction decreases, maximally informative
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 9 figures, 1 table

点击查看摘要

Abstract:Gradient-based dimension reduction decreases the cost of Bayesian inference and probabilistic modeling by identifying maximally informative (and informed) low-dimensional projections of the data and parameters, allowing high-dimensional problems to be reformulated as cheaper low-dimensional problems. A broad family of such techniques identify these projections and provide error bounds on the resulting posterior approximations, via eigendecompositions of certain diagnostic matrices. Yet these matrices require gradients or even Hessians of the log-likelihood, excluding the purely data-driven setting and many problems of simulation-based inference. We propose a framework, derived from score-matching, to extend gradient-based dimension reduction to problems where gradients are unavailable. Specifically, we formulate an objective function to directly learn the score ratio function needed to compute the diagnostic matrices, propose a tailored parameterization for the score ratio network, and introduce regularization methods that capitalize on the hypothesized low-dimensional structure. We also introduce a novel algorithm to iteratively identify the low-dimensional reduced basis vectors more accurately with limited data based on eigenvalue deflation methods. We show that our approach outperforms standard score-matching for problems with low-dimensional structure, and demonstrate its effectiveness for PDE-constrained Bayesian inverse problems and conditional generative modeling.

[LG-155] Statistical Inference in Classification of High-dimensional Gaussian Mixture

链接: https://arxiv.org/abs/2410.19950
作者: Hanwen Huang,Peng Zeng
关键词-EN: general covariance matrices, classification problem, high-dimensional mixture, Gaussians with general, covariance matrices
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size n and the dimension p approach infinity while their ratio \alpha=n/p remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using L_1 -regularized logistic regression as an example, we conducted extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.

[LG-156] Method for noise-induced regularization in quantum neural networks

链接: https://arxiv.org/abs/2410.19921
作者: Wilfrid Somogyi,Ekaterina Pankovets,Viacheslav Kuzmin,Alexey Melnikov
关键词-EN: quantum computing paradigm, current quantum computing, computing paradigm, significant focus, reduction or mitigation
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 6 figures, 2 tables

点击查看摘要

Abstract:In the current quantum computing paradigm, significant focus is placed on the reduction or mitigation of quantum decoherence. When designing new quantum processing units, the general objective is to reduce the amount of noise qubits are subject to, and in algorithm design, a large effort is underway to provide scalable error correction or mitigation techniques. Yet some previous work has indicated that certain classes of quantum algorithms, such as quantum machine learning, may, in fact, be intrinsically robust to or even benefit from the presence of a small amount of noise. Here, we demonstrate that noise levels in quantum hardware can be effectively tuned to enhance the ability of quantum neural networks to generalize data, acting akin to regularisation in classical neural networks. As an example, we consider a medical regression task, where, by tuning the noise level in the circuit, we improved the mean squared error loss by 8%.

[LG-157] Predicting potato plant vigor from the seed tuber properties

链接: https://arxiv.org/abs/2410.19875
作者: Elisa Atza,Rob Klooster,Falko Hofstra,Frank van der Werff,Hans van Doorn,Neil Budko
关键词-EN: exponential growth stage, vigor, seed tuber, seed tuber biochemistry, growth stage
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Numerical Analysis (math.NA); Populations and Evolution (q-bio.PE); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The vigor of potato plants, defined as the canopy area at the end of the exponential growth stage, depends on the origin and physiological state of the seed tuber. Experiments carried out with six potato varieties in three test fields over three years show that there is a 73%-90% correlation in the vigor of the plants from the same seedlot grown in different test fields. However, these correlations are not always observed on the level of individual varieties and vanish or become negative when the seed tubers and young plants experience environmental stress. A comprehensive study of the association between the vigor and the seed tuber biochemistry has revealed that, while 50%-70% of the variation in the plant vigor is explained by the tuber data, the vigor is dominated by the potato genotype. Analysis of individual predictors, such as the abundance of a particular metabolite, indicates that the vigor enhancing properties of the seed tubers differ between genotypes. Variety-specific models show that, for some varieties, up to 30% of the vigor variation within the variety is explained by and can be predicted from the tuber biochemistry, whereas, for other varieties, the association between the tuber composition and the vigor is much weaker.

[LG-158] Dynamic User Grouping based on Location and Heading in 5G NR Systems

链接: https://arxiv.org/abs/2410.19854
作者: Dino Pjanić,Korkut Emre Arslantürk,Xuesong Cai,Fredrik Tufvesson
关键词-EN: improve network performance, significantly improve network, Sounding Reference Signals, service delivery, User grouping based
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:User grouping based on geographic location in fifth generation (5G) New Radio (NR) systems has several applications that can significantly improve network performance, user experience, and service delivery. We demonstrate how Sounding Reference Signals channel fingerprints can be used for dynamic user grouping in a 5G NR commercial deployment based on outdoor positions and heading direction employing machine learning methods such as neural networks combined with clustering methods.

[LG-159] A practical fast method for solving sum-of-squares problems for very large polynomials

链接: https://arxiv.org/abs/2410.19844
作者: Daniel Keren,Margarita Osadchy,Roi Poranne
关键词-EN: Sum of squares, Semidefinite Program, SOS problem, SOS, Sum
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sum of squares (SOS) optimization is a powerful technique for solving problems where the positivity of a polynomials must be enforced. The common approach to solve an SOS problem is by relaxation to a Semidefinite Program (SDP). The main advantage of this transormation is that SDP is a convex problem for which efficient solvers are readily available. However, while considerable progress has been made in recent years, the standard approaches for solving SDPs are still known to scale poorly. Our goal is to devise an approach that can handle larger, more complex problems than is currently possible. The challenge indeed lies in how SDPs are commonly solved. State-Of-The-Art approaches rely on the interior point method, which requires the factorization of large matrices. We instead propose an approach inspired by polynomial neural networks, which exhibit excellent performance when optimized using techniques from the deep learning toolbox. In a somewhat counter-intuitive manner, we replace the convex SDP formulation with a non-convex, unconstrained, and \emphover parameterized formulation, and solve it using a first order optimization method. It turns out that this approach can handle very large problems, with polynomials having over four million coefficients, well beyond the range of current SDP-based approaches. Furthermore, we highlight theoretical and practical results supporting the experimental success of our approach in avoiding spurious local minima, which makes it amenable to simple and fast solutions based on gradient descent. In all the experiments, our approach had always converged to a correct global minimum, on general (non-sparse) polynomials, with running time only slightly higher than linear in the number of polynomial coefficients, compared to higher than quadratic in the number of coefficients for SDP-based methods.

[LG-160] Contrastive random lead coding for channel-agnostic self-supervision of biosignals

链接: https://arxiv.org/abs/2410.19842
作者: Thea Brüsch,Mikkel N. Schmidt,Tommy S. Alstrøm
关键词-EN: learning yields impressive, Contrastive learning yields, yields impressive results, computer vision, learning yields
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning yields impressive results for self-supervision in computer vision. The approach relies on the creation of positive pairs, something which is often achieved through augmentations. However, for multivariate time series effective augmentations can be difficult to design. Additionally, the number of input channels for biosignal datasets often varies from application to application, limiting the usefulness of large self-supervised models trained with specific channel configurations. Motivated by these challenges, we set out to investigate strategies for creation of positive pairs for channel-agnostic self-supervision of biosignals. We introduce contrastive random lead coding (CRLC), where random subsets of the input channels are used to create positive pairs and compare with using augmentations and neighboring segments in time as positive pairs. We validate our approach by pre-training models on EEG and ECG data, and then fine-tuning them for downstream tasks. CRLC outperforms competing strategies in both scenarios in the channel-agnostic setting. For EEG, the approach additionally outperforms the state-of-the-art reference model. Notably, for EEG tasks CRLC surpasses the current state-of-the-art reference model. While, the state-of-the-art reference model is superior in the ECG task, incorporating CRLC allows us to obtain comparable results. In conclusion, CRLC helps generalization across variable channel setups when training our channel-agnostic model.

[LG-161] Non-invasive Neural Decoding in Source Reconstructed Brain Space

链接: https://arxiv.org/abs/2410.19838
作者: Yonatan Gideoni,Ryan Charles Timms,Oiwi Parker Jones
关键词-EN: Non-invasive brainwave decoding, Non-invasive brainwave, Electroencephalography, EEG, brainwave decoding
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 14 tables, under review

点击查看摘要

Abstract:Non-invasive brainwave decoding is usually done using Magneto/Electroencephalography (MEG/EEG) sensor measurements as inputs. This makes combining datasets and building models with inductive biases difficult as most datasets use different scanners and the sensor arrays have a nonintuitive spatial structure. In contrast, fMRI scans are acquired directly in brain space, a voxel grid with a typical structured input representation. By using established techniques to reconstruct the sensors’ sources’ neural activity it is possible to decode from voxels for MEG data as well. We show that this enables spatial inductive biases, spatial data augmentations, better interpretability, zero-shot generalisation between datasets, and data harmonisation.

[LG-162] Automatic Classification of Sleep Stages from EEG Signals Using Riemannian Metrics and Transformer Networks

链接: https://arxiv.org/abs/2410.19819
作者: Mathieu Seraphim,Alexis Lechervy(GREYC),Florian Yger(MILES, LAMSADE, LITIS, App - LITIS),Luc Brun,Olivier Etard(COMETE, UNICAEN)
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-163] BUNDL: Bayesian Uncertainty-aware Deep Learning with Noisy training Labels for Seizure Detection in EEG

链接: https://arxiv.org/abs/2410.19815
作者: Deeksha M Shama,Archana Venkataraman
关键词-EN: Deep learning methods, Deep learning, learning methods, Deep, deep learning model
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deep learning methods are at the forefront of automated epileptic seizure detection and onset zone localization using scalp-EEG. However, the performance of deep learning methods rely heavily on the quality of annotated training datasets. Scalp EEG is susceptible to high noise levels, which in turn leads to imprecise annotations of the seizure timing and characteristics. This label noise presents a significant challenge in model training and generalization. In this paper, we introduce a novel statistical framework that informs a deep learning model of label ambiguity, thereby enhancing the overall seizure detection performance. Our Bayesian UncertaiNty-aware Deep Learning, BUNDL, strategy offers a straightforward and model-agnostic method for training deep neural networks with noisy training labels that does not add any parameters to existing architectures. By integrating domain knowledge into the statistical framework, we derive a novel KL-divergence-based loss function that capitalizes on uncertainty to better learn seizure characteristics from scalp EEG. Additionally, we explore the impact of improved seizure detection on the task of automated onset zone localization. We validate BUNDL using a comprehensive simulated EEG dataset and two publicly available datasets, TUH and CHB-MIT. BUNDL consistently improves the performance of three base models on simulated data under seven types of label noise and three EEG signal-to-noise ratios. Similar improvements were observed in the real-world TUH and CHB-MIT datasets. Finally, we demonstrate that BUNDL improves the accuracy of seizure onset zone localization. BUNDL is specifically designed to address label ambiguities, enabling the training of reliable and trustworthy models for epilepsy evaluation.

[LG-164] he Useful Side of Motion: Using Head Motion Parameters to Correct for Respiratory Confounds in BOLD fMRI

链接: https://arxiv.org/abs/2410.19802
作者: Abdoljalil Addeh,G. Bruce Pike,M. Ethan MacDonald
关键词-EN: Magnetic Resonance Imaging, functional Magnetic Resonance, Acquiring accurate external, Resonance Imaging, Magnetic Resonance
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 3 pahes, 1 Figure, 2024 ISMRM Workshop on Motion Correction in MR, 03-06 September 2024, Québec City, QC, Canada. Abstract Number 23

点击查看摘要

Abstract:Acquiring accurate external respiratory data during functional Magnetic Resonance Imaging (fMRI) is challenging, prompting the exploration of machine learning methods to estimate respiratory variation (RV) from fMRI data. Respiration induces head motion, including real and pseudo motion, which likely provides useful information about respiratory events. Recommended notch filters mitigate respiratory-induced motion artifacts, suggesting that a bandpass filter at the respiratory frequency band isolates respiratory-induced head motion. This study seeks to enhance the accuracy of RV estimation from resting-state BOLD-fMRI data by integrating estimated head motion parameters. Specifically, we aim to determine the impact of incorporating raw versus bandpass-filtered head motion parameters on RV reconstruction accuracy using one-dimensional convolutional neural networks (1D-CNNs). This approach addresses the limitations of traditional filtering techniques and leverages the potential of head motion data to provide a more robust estimation of respiratory-induced variations.

[LG-165] Feasibility Analysis of Federated Neural Networks for Explainable Detection of Atrial Fibrillation

链接: https://arxiv.org/abs/2410.19781
作者: Diogo Reis Santos,Andrea Protani,Lorenzo Giusti,Albert Sund Aillet,Pierpaolo Brutti,Luigi Serio
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-166] Sampling from Bayesian Neural Network Posteriors with Symmetric Minibatch Splitting Langevin Dynamics

链接: https://arxiv.org/abs/2410.19780
作者: Daniel Paulin,Peter A. Whalley,Neil K. Chada,Benedict Leimkuhler
关键词-EN: scalable kinetic Langevin, kinetic Langevin dynamics, sampling parameter spaces, Langevin dynamics algorithm, Langevin dynamics
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO); Methodology (stat.ME)
*备注: 35 pages, 6 figures. The first two authors contributed equally

点击查看摘要

Abstract:We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias O(h^2 d^1/2) in dimension d0 with stepsize h0 , despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging.

[LG-167] EEGPT: Unleashing the Potential of EEG Generalist Foundation Model by Autoregressive Pre-training

链接: https://arxiv.org/abs/2410.19779
作者: Tongtian Yue,Shuning Xue,Xuange Gao,Yepeng Tang,Longteng Guo,Jie Jiang,Jing Liu
关键词-EN: spontaneous brain activity, EEG, brain activity, highlighting their significant, generalist EEG
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalogram (EEG) signals are pivotal in providing insights into spontaneous brain activity, highlighting their significant importance in neuroscience research. However, the exploration of versatile EEG models is constrained by diverse data formats, outdated pre-training paradigms, and limited transfer learning methods, only leading to specialist models on single dataset. In this paper, we introduce EEGPT, the first generalist EEG foundation model designed to address these challenges. First, we propose an electrode-wise modeling strategy that treats each electrode as a fundamental unit, enabling the integration of diverse EEG datasets collected from up to 138 electrodes, amassing 37.5M pre-training samples. Second, we develop the first autoregressive EEG pre-trained model, moving away from traditional masked autoencoder approaches to a next signal prediction task that better captures the sequential and temporal dependencies of EEG data. We also explore scaling laws with model up to 1.1B parameters: the largest in EEG research to date. Third, we introduce a multi-task transfer learning paradigm using a learnable electrode graph network shared across tasks, which for the first time confirms multi-task compatibility and synergy. As the first generalist EEG foundation model, EEGPT shows broad compatibility with various signal acquisition devices, subjects, and tasks. It supports up to 138 electrodes and any combination thereof as input. Furthermore, we simultaneously evaluate it on 5 distinct tasks across 12 benchmarks. EEGPT consistently outperforms existing specialist models across all downstream tasks, with its effectiveness further validated through extensive ablation studies. This work sets a new direction for generalist EEG modeling, offering improved scalability, transferability, and adaptability for a wide range of EEG applications. The code and models will be released.

[LG-168] Real-Time Stress Detection via Photoplethysmogram Signals: Implementation of a Combined Continuous Wavelet Transform and Convolutional Neural Network on Resource-Constrained Microcontrollers MICRO

链接: https://arxiv.org/abs/2410.19776
作者: Yasin Hasanpoor,Amin Rostami,Bahram Tarvirdizadeh,Khalil Alipour,Mohammad Ghamari
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 figures, implemented on Microcontroller

点击查看摘要

[LG-169] Statistical Test for Auto Feature Engineering by Selective Inference

链接: https://arxiv.org/abs/2410.19768
作者: Tatsuya Matsukawa,Tomohiro Shiraishi,Shuichi Nishino,Teruyuki Katsuoka,Ichiro Takeuchi
关键词-EN: Auto Feature Engineering, developing practical machine, practical machine learning, machine learning pipelines, enhance model performance
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 36 pages, 7 figures

点击查看摘要

Abstract:Auto Feature Engineering (AFE) plays a crucial role in developing practical machine learning pipelines by automating the transformation of raw data into meaningful features that enhance model performance. By generating features in a data-driven manner, AFE enables the discovery of important features that may not be apparent through human experience or intuition. On the other hand, since AFE generates features based on data, there is a risk that these features may be overly adapted to the data, making it essential to assess their reliability appropriately. Unfortunately, because most AFE problems are formulated as combinatorial search problems and solved by heuristic algorithms, it has been challenging to theoretically quantify the reliability of generated features. To address this issue, we propose a new statistical test for generated features by AFE algorithms based on a framework called selective inference. As a proof of concept, we consider a simple class of tree search-based heuristic AFE algorithms, and consider the problem of testing the generated features when they are used in a linear model. The proposed test can quantify the statistical significance of the generated features in the form of p -values, enabling theoretically guaranteed control of the risk of false findings.

[LG-170] he Effect of Acute Stress on the Interpretability and Generalization of Schizophrenia Predictive Machine Learning Models

链接: https://arxiv.org/abs/2410.19739
作者: Gideon Vos,Maryam Ebrahimpour,Liza van Eijk,Zoltan Sarnyai,Mostafa Rahimi Azghadi
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 20 pages, 7 figures

点击查看摘要

信息检索

[IR-0] Pay Attention to Attention for Sequential Recommendation RECSYS2024

链接: https://arxiv.org/abs/2410.21048
作者: Yuli Liu,Min Liu,Xiaojing Liu
关键词-EN: demonstrated remarkable success, Transformer-based approaches, approaches have demonstrated, demonstrated remarkable, remarkable success
类目: Information Retrieval (cs.IR)
*备注: Accepted at RecSys 2024

点击查看摘要

Abstract:Transformer-based approaches have demonstrated remarkable success in various sequence-based tasks. However, traditional self-attention models may not sufficiently capture the intricate dependencies within items in sequential recommendation scenarios. This is due to the lack of explicit emphasis on attention weights, which play a critical role in allocating attention and understanding item-to-item correlations. To better exploit the potential of attention weights and improve the capability of sequential recommendation in learning high-order dependencies, we propose a novel sequential recommendation (SR) approach called attention weight refinement (AWRSR). AWRSR enhances the effectiveness of self-attention by additionally paying attention to attention weights, allowing for more refined attention distributions of correlations among items. We conduct comprehensive experiments on multiple real-world datasets, demonstrating that our approach consistently outperforms state-of-the-art SR models. Moreover, we provide a thorough analysis of AWRSR’s effectiveness in capturing higher-level dependencies. These findings suggest that AWRSR offers a promising new direction for enhancing the performance of self-attention architecture in SR tasks, with potential applications in other sequence-based problems as well.

[IR-1] Challenges in Implementing a Recommender System for Historical Research in the Humanities ALT RECSYS2024

链接: https://arxiv.org/abs/2410.20909
作者: Florian Atzenhofer-Baumgartner,Bernhard C. Geiger,Christoph Trattner,Georg Vogeler,Dominik Kowald
关键词-EN: historical legal documents, extended abstract describes, http URL, implementing recommender systems, legal documents
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注: Presented at AltRecSys 2024: The First Workshop on Alternative, Unexpected, and Critical Ideas in Recommendation, October 18, 2024, co-located with the ACM Conference on Recommender Systems 2024 (RecSys 2024), Bari, Italy

点击查看摘要

Abstract:This extended abstract describes the challenges in implementing recommender systems for digital archives in the humanities, focusing on this http URL, a platform for historical legal documents. We discuss three key aspects: (i) the unique characteristics of so-called charters as items for recommendation, (ii) the complex multi-stakeholder environment, and (iii) the distinct information-seeking behavior of scholars in the humanities. By examining these factors, we aim to contribute to the development of more effective and tailored recommender systems for (digital) humanities research.

[IR-2] RecFlow: An Industrial Full Flow Recommendation Dataset

链接: https://arxiv.org/abs/2410.20868
作者: Qi Liu,Kai Zheng,Rui Huang,Wuchao Li,Kuo Cai,Yuan Chai,Yanan Niu,Yiqun Hui,Bing Han,Na Mou,Hongning Wang,Wentian Bao,Yunen Yu,Guorui Zhou,Han Li,Yang Song,Defu Lian,Kun Gai
关键词-EN: pipeline to balance, efficiency when delivering, vast corpus, algorithms, RecFlow
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Industrial recommendation systems (RS) rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real world industrial RS, they face a critical challenge of handling unexposed items which are a significantly larger space than the exposed one. This discrepancy profoundly impacts their practical performance. Additionally, these algorithms often overlook the intricate interplay between multiple RS stages, resulting in suboptimal overall system performance. To address this issue, we introduce RecFlow, an industrial full flow recommendation dataset designed to bridge the gap between offline RS benchmarks and the real online environment. Unlike existing datasets, RecFlow includes samples not only from the exposure space but also unexposed items filtered at each stage of the RS funnel. Our dataset comprises 38M interactions from 42K users across nearly 9M items with additional 1.9B stage samples collected from 9.3M online requests over 37 days and spanning 6 stages. Leveraging the RecFlow dataset, we conduct courageous exploration experiments, showcasing its potential in designing new algorithms to enhance effectiveness by incorporating stage-specific samples. Some of these algorithms have already been deployed online, consistently yielding significant gains. We propose RecFlow as the first comprehensive benchmark dataset for the RS community, supporting research on designing algorithms at any stage, study of selection bias, debiased algorithms, multi-stage consistency and optimality, multi-task recommendation, and user behavior modeling. The RecFlow dataset, along with the corresponding source code, is available at this https URL.

[IR-3] Leveraging AI and Sentiment Analysis for Forecasting Election Outcomes in Mauritius

链接: https://arxiv.org/abs/2410.20859
作者: Missie Chercheur,Malkenzie Bovafiz
关键词-EN: focusing on Mauritius’, AI-driven sentiment analysis, forecasting election outcomes, election outcomes, Sentiment Impact Score
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study explores the use of AI-driven sentiment analysis as a novel tool for forecasting election outcomes, focusing on Mauritius’ 2024 elections. In the absence of reliable polling data, we analyze media sentiment toward two main political parties L’Alliance Lepep and L’Alliance Du Changement by classifying news articles from prominent Mauritian media outlets as positive, negative, or neutral. We employ a multilingual BERT-based model and a custom Sentiment Scoring Algorithm to quantify sentiment dynamics and apply the Sentiment Impact Score (SIS) for measuring sentiment influence over time. Our forecast model suggests L’Alliance Du Changement is likely to secure a minimum of 37 seats, while L’Alliance Lepep is predicted to obtain the remaining 23 seats out of the 60 available. Findings indicate that positive media sentiment strongly correlates with projected electoral gains, underscoring the role of media in shaping public perception. This approach not only mitigates media bias through adjusted scoring but also serves as a reliable alternative to traditional polling. The study offers a scalable methodology for political forecasting in regions with limited polling infrastructure and contributes to advancements in the field of political data science.

[IR-4] Beyond Positive History: Re-ranking with List-level Hybrid Feedback

链接: https://arxiv.org/abs/2410.20778
作者: Muyan Weng,Yunjia Xi,Weiwen Liu,Bo Chen,Jianghao Lin,Ruiming Tang,Weinan Zhang,Yong Yu
关键词-EN: list-level hybrid feedback, recommender systems, list-level hybrid, behavior patterns, stage of recommender
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:As the last stage of recommender systems, re-ranking generates a re-ordered list that aligns with the user’s preference. However, previous works generally focus on item-level positive feedback as history (e.g., only clicked items) and ignore that users provide positive or negative feedback on items in the entire list. This list-level hybrid feedback can reveal users’ holistic preferences and reflect users’ comparison behavior patterns manifesting within a list. Such patterns could predict user behaviors on candidate lists, thus aiding better re-ranking. Despite appealing benefits, extracting and integrating preferences and behavior patterns from list-level hybrid feedback into re-ranking multiple items remains challenging. To this end, we propose Re-ranking with List-level Hybrid Feedback (dubbed RELIFE). It captures user’s preferences and behavior patterns with three modules: a Disentangled Interest Miner to disentangle the user’s preferences into interests and disinterests, a Sequential Preference Mixer to learn users’ entangled preferences considering the context of feedback, and a Comparison-aware Pattern Extractor to capture user’s behavior patterns within each list. Moreover, for better integration of patterns, contrastive learning is adopted to align the behavior patterns of candidate and historical lists. Extensive experiments show that RELIFE significantly outperforms SOTA re-ranking baselines.

[IR-5] GenUP: Generative User Profilers as In-Context Learners for Next POI Recommender Systems

链接: https://arxiv.org/abs/2410.20643
作者: Wilson Wongso,Hao Xue,Flora D. Salim
关键词-EN: Traditional POI recommendation, vector-based user embeddings, Traditional POI, dense vector-based user, POI recommendation systems
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Traditional POI recommendation systems often lack transparency, interpretability, and scrutability due to their reliance on dense vector-based user embeddings. Furthermore, the cold-start problem – where systems have insufficient data for new users – limits their ability to generate accurate recommendations. Existing methods often address this by leveraging similar trajectories from other users, but this approach can be computationally expensive and increases the context length for LLM-based methods, making them difficult to scale. To address these limitations, we propose a method that generates natural language (NL) user profiles from large-scale, location-based social network (LBSN) check-ins, utilizing robust personality assessments and behavioral theories. These NL profiles capture user preferences, routines, and behaviors, improving POI prediction accuracy while offering enhanced transparency. By incorporating NL profiles as system prompts to LLMs, our approach reduces reliance on extensive historical data, while remaining flexible, easily updated, and computationally efficient. Our method is not only competitive with other LLM-based and complex agentic frameworks but is also more scalable for real-world scenarios and on-device POI recommendations. Results demonstrate that our approach consistently outperforms baseline methods, offering a more interpretable and resource-efficient solution for POI recommendation systems. Our source code is available at: \urlthis https URL.

[IR-6] Collaborative Knowledge Fusion: A Novel Approach for Multi-task Recommender Systems via LLM s

链接: https://arxiv.org/abs/2410.20642
作者: Chuang Zhao,Xing Su,Ming He,Hongke Zhao,Jianping Fan,Xiaomeng Li
关键词-EN: impressive general intelligence, recommender systems, LLMs-based recommender systems, recommender systems primarily, impressive general
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Owing to the impressive general intelligence of large language models (LLMs), there has been a growing trend to integrate them into recommender systems to gain a more profound insight into human interests and intentions. Existing LLMs-based recommender systems primarily leverage item attributes and user interaction histories in textual format, improving the single task like rating prediction or explainable recommendation. Nevertheless, these approaches overlook the crucial contribution of traditional collaborative signals in discerning users’ profound intentions and disregard the interrelatedness among tasks. To address these limitations, we introduce a novel framework known as CKF, specifically developed to boost multi-task recommendations via personalized collaborative knowledge fusion into LLMs. Specifically, our method synergizes traditional collaborative filtering models to produce collaborative embeddings, subsequently employing the meta-network to construct personalized mapping bridges tailored for each user. Upon mapped, the embeddings are incorporated into meticulously designed prompt templates and then fed into an advanced LLM to represent user interests. To investigate the intrinsic relationship among diverse recommendation tasks, we develop Multi-Lora, a new parameter-efficient approach for multi-task optimization, adept at distinctly segregating task-shared and task-specific information. This method forges a connection between LLMs and recommendation scenarios, while simultaneously enriching the supervisory signal through mutual knowledge transfer among various tasks. Extensive experiments and in-depth robustness analyses across four common recommendation tasks on four large public data sets substantiate the effectiveness and superiority of our framework.

[IR-7] R3AG: First Workshop on Refined and Reliable Retrieval Augmented Generation SIGIR

链接: https://arxiv.org/abs/2410.20598
作者: Zihan Wang,Xuri Ge,Joemon M. Jose,Haitao Yu,Weizhi Ma,Zhaochun Ren,Xin Xin
关键词-EN: gained wide attention, external knowledge augmentation, improve generative models, RAG, Retrieval-augmented generation
类目: Information Retrieval (cs.IR)
*备注: R^3AG workshop overview at SIGIR-AP 2024

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has gained wide attention as the key component to improve generative models with external knowledge augmentation from information retrieval. It has shown great prominence in enhancing the functionality and performance of large language model (LLM)-based applications. However, with the comprehensive application of RAG, more and more problems and limitations have been identified, thus urgently requiring further fundamental exploration to improve current RAG frameworks. This workshop aims to explore in depth how to conduct refined and reliable RAG for downstream AI tasks. To this end, we propose to organize the first R3AG workshop at SIGIR-AP 2024 to call for participants to re-examine and formulate the basic principles and practical implementation of refined and reliable RAG. The workshop serves as a platform for both academia and industry researchers to conduct discussions, share insights, and foster research to build the next generation of RAG systems. Participants will engage in discussions and presentations focusing on fundamental challenges, cutting-edge research, and potential pathways to improve RAG. At the end of the workshop, we aim to have a clearer understanding of how to improve the reliability and applicability of RAG with more robust information retrieval and language generation. Comments: R^3AG workshop overview at SIGIR-AP 2024 Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2410.20598 [cs.IR] (or arXiv:2410.20598v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2410.20598 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-8] Coherence-guided Preference Disentanglement for Cross-domain Recommendations

链接: https://arxiv.org/abs/2410.20580
作者: Zongyi Xiang,Yan Zhang,Lixin Duan,Hongzhi Yin,Ivor W. Tsang
关键词-EN: user-item interactive data, platforms lack comprehensive, lack comprehensive user-item, comprehensive user-item interactive, Discovering user preferences
类目: Information Retrieval (cs.IR)
*备注: 28 pages

点击查看摘要

Abstract:Discovering user preferences across different domains is pivotal in cross-domain recommendation systems, particularly when platforms lack comprehensive user-item interactive data. The limited presence of shared users often hampers the effective modeling of common preferences. While leveraging shared items’ attributes, such as category and popularity, can enhance cross-domain recommendation performance, the scarcity of shared items between domains has limited research in this area. To address this, we propose a Coherence-guided Preference Disentanglement (CoPD) method aimed at improving cross-domain recommendation by i) explicitly extracting shared item attributes to guide the learning of shared user preferences and ii) disentangling these preferences to identify specific user interests transferred between domains. CoPD introduces coherence constraints on item embeddings of shared and specific domains, aiding in extracting shared attributes. Moreover, it utilizes these attributes to guide the disentanglement of user preferences into separate embeddings for interest and conformity through a popularity-weighted loss. Experiments conducted on real-world datasets demonstrate the superior performance of our proposed CoPD over existing competitive baselines, highlighting its effectiveness in enhancing cross-domain recommendation performance.

[IR-9] Automatic Estimation of Singing Voice Musical Dynamics

链接: https://arxiv.org/abs/2410.20540
作者: Jyoti Narang,Nazif Can Tamer,Viviana De La Vega,Xavier Serra
关键词-EN: Musical dynamics, Musical dynamics form, singing voice, singing voice performances, form a core
类目: ound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: To be published in ISMIR 2024, 6 pages

点击查看摘要

Abstract:Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and bark-scale based features. For testing, we manually curate another dataset of 25 musical dynamics annotated performances in collaboration with a professional vocalist. We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic.

[IR-10] Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

链接: https://arxiv.org/abs/2410.20381
作者: Haoyu Zhang,Jun Liu,Zhenhua Zhu,Shulin Zeng,Maojia Sheng,Tao Yang,Guohao Dai,Yu Wang
关键词-EN: important information representations, information retrieval, important information, texts is commonly, embedded vector representations
类目: Information Retrieval (cs.IR)
*备注: 8 pages

点击查看摘要

Abstract:ANNS for embedded vector representations of texts is commonly used in information retrieval, with two important information representations being sparse and dense vectors. While it has been shown that combining these representations improves accuracy, the current method of conducting sparse and dense vector searches separately suffers from low scalability and high system complexity. Alternatively, building a unified index faces challenges with accuracy and efficiency. To address these issues, we propose a graph-based ANNS algorithm for dense-sparse hybrid vectors. Firstly, we propose a distribution alignment method to improve accuracy, which pre-samples dense and sparse vectors to analyze their distance distribution statistic, resulting in a 1% \sim 9% increase in accuracy. Secondly, to improve efficiency, we design an adaptive two-stage computation strategy that initially computes dense distances only and later computes hybrid distances. Further, we prune the sparse vectors to speed up the calculation. Compared to naive implementation, we achieve \sim2.1\times acceleration. Thorough experiments show that our algorithm achieves 8.9x \sim 11.7x throughput at equal accuracy compared to existing hybrid vector search algorithms.

[IR-11] WindTunnel – A Framework for Community Aware Sampling of Large Corpora

链接: https://arxiv.org/abs/2410.20301
作者: Michael Iannelli
关键词-EN: Conducting comprehensive information, high computational costs, Conducting comprehensive, retrieval augmented generation, augmented generation
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Conducting comprehensive information retrieval experiments, such as in search or retrieval augmented generation, often comes with high computational costs. This is because evaluating a retrieval algorithm requires indexing the entire corpus, which is significantly larger than the set of (query, result) pairs under evaluation. This issue is especially pronounced in big data and neural retrieval, where indexing becomes increasingly time-consuming and complex. In this paper, we present WindTunnel, a novel framework developed at Yext to generate representative samples of large corpora, enabling efficient end-to-end information retrieval experiments. By preserving the community structure of the dataset, WindTunnel overcomes limitations in current sampling methods, providing more accurate evaluations.

[IR-12] Quam: Adaptive Retrieval through Query Affinity Modelling

链接: https://arxiv.org/abs/2410.20286
作者: Mandeep Rathee,Sean MacAvaney,Avishek Anand
关键词-EN: Building relevance models, NLP community, Building relevance, rank documents based, user information
类目: Information Retrieval (cs.IR)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:Building relevance models to rank documents based on user information needs is a central task in information retrieval and the NLP community. Beyond the direct ad-hoc search setting, many knowledge-intense tasks are powered by a first-stage retrieval stage for context selection, followed by a more involved task-specific model. However, most first-stage ranking stages are inherently limited by the recall of the initial ranking documents. Recently, adaptive re-ranking techniques have been proposed to overcome this issue by continually selecting documents from the whole corpus, rather than only considering an initial pool of documents. However, so far these approaches have been limited to heuristic design choices, particularly in terms of the criteria for document selection. In this work, we propose a unifying view of the nascent area of adaptive retrieval by proposing, Quam, a \textitquery-affinity model that exploits the relevance-aware document similarity graph to improve recall, especially for low re-ranking budgets. Our extensive experimental evidence shows that our proposed approach, Quam improves the recall performance by up to 26% over the standard re-ranking baselines. Further, the query affinity modelling and relevance-aware document graph modules can be injected into any adaptive retrieval approach. The experimental results show the existing adaptive retrieval approach improves recall by up to 12%. The code of our work is available at \urlthis https URL.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2024-10-29

目录

概览 (2024-10-29)

自然语言处理

人工智能

计算机视觉

机器学习

信息检索

附件下载