本篇博文主要展示 2024-11-08 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2024-11-08)

今日共更新429篇论文,其中:

  • 自然语言处理60篇(Computation and Language (cs.CL))
  • 人工智能107篇(Artificial Intelligence (cs.AI))
  • 计算机视觉102篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习152篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Analyzing The Language of Visual Tokens

【速读】: 该论文试图解决的问题是理解基于transformer的视觉和语言模型中图像离散化表示的统计行为,特别是这些视觉语言是否遵循与自然语言相似的频率分布、语法结构或拓扑结构。解决方案的关键在于采用以自然语言为中心的方法,分析视觉语言的统计特性,揭示其与自然语言的显著相似性和根本差异。具体来说,论文发现视觉语言虽然遵循Zipfian分布,但其高创新性导致更高的熵和更低的压缩率,且主要表示对象部分,显示出中间粒度。此外,视觉语言缺乏连贯的语法结构,导致更高的困惑度和较弱的层次组织。尽管视觉模型与自然语言的对齐比其他模型更紧密,但这种对齐仍然远不如自然语言内部的凝聚力。通过这些实验,论文展示了理解视觉语言的统计特性如何指导更有效的计算机视觉模型设计。

链接: https://arxiv.org/abs/2411.05001
作者: David M. Chan,Rodolfo Corona,Joonyong Park,Cheol Jun Cho,Yutong Bai,Trevor Darrell
关键词-EN: LLaVA and Chameleon, discrete tokenized representation, visual languages, languages, discrete visual languages
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we demonstrate how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.
摘要:随着基于 Transformer 的模型在视觉和语言任务中的引入,如 LLaVA 和 Chameleon,人们对图像的离散 Token 化表示重新产生了兴趣。这些模型通常将图像块视为离散的 Token,类似于自然语言中的单词,学习视觉与人类语言之间的联合对齐。然而,关于这些视觉语言的统计行为知之甚少——它们是否遵循与自然语言相似的频率分布、语法结构或拓扑结构。本文采用以自然语言为中心的方法来分析离散视觉语言,并揭示了显著的相似性和根本差异。我们证明,尽管视觉语言遵循 Zipfian 分布,但更高的 Token 创新性驱动了更高的熵和更低的压缩率,其中 Token 主要代表对象的部分,表明中间粒度。我们还表明,视觉语言缺乏连贯的语法结构,导致更高的困惑度和比自然语言更弱的层次组织。最后,我们证明,尽管视觉模型与自然语言的对齐比其他模型更紧密,但这种对齐仍然明显弱于自然语言内部的凝聚力。通过这些实验,我们展示了理解离散视觉语言的统计特性如何指导设计更有效的计算机视觉模型。

[NLP-1] Needle Threading: Can LLM s Follow Threads through Near-Million-Scale Haystacks?

【速读】: 该论文试图解决的问题是如何评估大型语言模型(LLMs)在处理长上下文信息时的有效性和性能。解决方案的关键在于设计了一系列检索实验,以评估17个领先的LLMs在长上下文窗口中跟踪信息线索的能力。研究发现,尽管许多模型在同时跟踪多条信息线索时表现出较高的“线程安全性”,但它们的有效上下文限制实际上远低于其支持的上下文长度,且随着上下文窗口的扩大,准确性显著下降。此外,研究强调了不同分词器(tokenizers)的标记计数不应直接比较,因为它们通常对应于显著不同的字符数量。

链接: https://arxiv.org/abs/2411.05000
作者: Jonathan Roberts,Kai Han,Samuel Albanie
关键词-EN: Large Language Models, Large Language, downstream functions broadens, Language Models, functions broadens
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared – they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
摘要:随着大语言模型 (Large Language Models, LLMs) 的上下文限制增加,其可能的应用范围和下游功能也随之扩大。在许多现实世界的任务中,决策依赖于分散在大量通常不相关文档中的细节信息。长上下文 LLMs 似乎非常适合这种复杂的信息检索和推理任务,而传统上这些任务既耗时又费力。然而,尽管近年来长上下文模型的开发取得了快速进展,但我们对 LLMs 如何有效利用其上下文的理解并未同步跟上。为了解决这一问题,我们进行了一系列检索实验,旨在评估 17 个领先 LLMs 的能力,例如它们在上下文窗口中跟踪信息线索的能力。令人惊讶的是,我们发现许多模型在跟踪多条线索时表现出色,即具有显著的“线索安全性”:能够在不显著损失性能的情况下同时跟踪多条线索。尽管如此,对于许多模型而言,我们发现其有效上下文限制显著短于支持的上下文长度,随着上下文窗口的扩大,准确性逐渐下降。我们的研究还强调了一个重要观点,即不同 Tokenizer 的 Token 计数不应直接比较——它们通常对应于数量显著不同的书写字符。我们公开了代码和长上下文实验数据。

[NLP-2] LLM 2CLIP: Powerful Language Model Unlock Richer Visual Representation

【速读】: 该论文试图解决CLIP在处理长且复杂的文本描述时的局限性问题,并提出了一种名为LLM2CLIP的新方法。解决方案的关键在于利用大型语言模型(LLMs)的强大文本理解能力来增强CLIP的跨模态表示学习。具体来说,通过在对比学习框架下微调LLM,将其文本能力嵌入到输出嵌入中,从而显著提高输出层的文本区分度。此外,设计了一种高效的训练过程,使微调后的LLM作为CLIP视觉编码器的强大教师,从而能够处理更长和更复杂的文本描述,突破了传统CLIP文本编码器的上下文窗口和能力限制。实验结果表明,这种方法在跨模态任务中带来了显著的性能提升。

链接: https://arxiv.org/abs/2411.04997
作者: Weiquan Huang,Aoqi Wu,Yifan Yang,Xufang Luo,Yuqing Yang,Liang Hu,Qi Dai,Xiyang Dai,Dongdong Chen,Chong Luo,Lili Qiu
关键词-EN: foundational models today, important multimodal foundational, CLIP, multimodal foundational models, LLMs
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:CLIP is one of the most important multimodal foundational models today. What powers CLIP’s capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models LLMs like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs’ strong textual understanding can fundamentally improve CLIP’s ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP’s potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer’s textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP’s visual encoder. Thanks to the LLM’s presence, we can now incorporate longer and more complex captions without being restricted by vanilla CLIP’s text encoder’s context window and ability limitations. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
摘要:CLIP 是当今最重要的多模态基础模型之一。是什么赋予了 CLIP 如此强大的能力?自然语言,作为人类知识的载体,提供了丰富的监督信号,塑造了一个强大的跨模态表示空间。然而,随着 GPT-4 和 LLaMA 等大语言模型 (LLM) 的快速发展,语言理解和生成的边界不断被拓展。这引发了一个有趣的问题:能否利用大语言模型的能力来进一步提升多模态表示学习的效果?将大语言模型融入 CLIP 的潜在好处显而易见。大语言模型强大的文本理解能力可以根本性地提升 CLIP 处理图像描述的能力,显著增强其处理长而复杂文本的能力,这是传统 CLIP 的一个众所周知的局限。此外,大语言模型在海量文本语料库上进行训练,拥有开放世界的知识。这使得它们在训练过程中能够扩展描述信息,从而提高学习过程的效率。在本文中,我们提出了 LLM2CLIP,一种利用大语言模型力量来释放 CLIP 潜力的新方法。通过在描述空间中使用对比学习对大语言模型进行微调,我们将其文本能力提取到输出嵌入中,显著提高了输出层的文本区分能力。然后,我们设计了一个高效的训练过程,其中微调后的大语言模型作为 CLIP 视觉编码器的强大教师。得益于大语言模型的存在,我们现在可以在不受传统 CLIP 文本编码器上下文窗口和能力限制的情况下,融入更长和更复杂的描述。我们的实验表明,这种方法在跨模态任务中带来了显著的改进。

[NLP-3] Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

【速读】: 该论文试图解决多模态大语言模型(Large Language Models, LLMs)在处理文本、图像和语音时面临的计算资源和数据集扩展问题。解决方案的关键是引入了一种稀疏的多模态Transformer架构,称为Mixture-of-Transformers (MoT)。MoT通过解耦模型的非嵌入参数(包括前馈网络、注意力矩阵和层归一化),实现了模态特定的处理,同时保持全局自注意力机制覆盖整个输入序列。这种设计显著降低了预训练的计算成本,并在多个实验设置和模型规模下验证了其有效性,例如在Chameleon 7B设置中,MoT以55.8%的FLOPs达到了与密集基线相当的性能。

链接: https://arxiv.org/abs/2411.04996
作者: Weixin Liang,Lili Yu,Liang Luo,Srinivasan Iyer,Ning Dong,Chunting Zhou,Gargi Ghosh,Mike Lewis,Wen-tau Yih,Luke Zettlemoyer,Xi Victoria Lin
关键词-EN: large language models, dense baseline, unified framework, development of large, large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality – including feed-forward networks, attention matrices, and layer normalization – enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline’s performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT’s practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
摘要:大语言模型 (LLM) 的发展已扩展至多模态系统,这些系统能够在统一的框架内处理文本、图像和语音。与仅处理文本的 LLM 相比,训练这些模型需要更大规模的数据集和计算资源。为应对扩展挑战,我们提出了混合 Transformer (Mixture-of-Transformers, MoT),这是一种稀疏的多模态 Transformer 架构,显著降低了预训练的计算成本。MoT 通过模态解耦模型的非嵌入参数——包括前馈网络、注意力矩阵和层归一化——实现了对完整输入序列的全局自注意力机制的模态特定处理。我们在多种设置和模型规模下评估了 MoT。在 Chameleon 7B 设置(自回归文本和图像生成)中,MoT 仅使用 55.8% 的浮点运算 (FLOPs) 就达到了与密集基线相当的性能。当扩展至包含语音时,MoT 仅用 37.2% 的 FLOPs 就达到了与密集基线相当的语音性能。在 Transfusion 设置中,文本和图像采用不同的训练目标,7B 的 MoT 模型在仅使用三分之一 FLOPs 的情况下,达到了与密集基线相当的图像模态性能,而 760M 的 MoT 模型在关键图像生成指标上优于 1.4B 的密集基线。系统分析进一步凸显了 MoT 的实际效益,在 47.2% 的挂钟时间内实现了与密集基线相当的图像质量,在 75.6% 的挂钟时间内实现了与密集基线相当的文本质量(在配备 NVIDIA A100 GPU 的 AWS p4de.24xlarge 实例上测量)。

[NLP-4] he Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities

【速读】: 该论文试图解决现代语言模型如何处理跨语言和跨模态输入的问题。解决方案的关键在于提出了“语义枢纽假说”(semantic hub hypothesis),即模型通过学习一个跨异质数据类型(如不同语言和模态)的共享表征空间来实现这一能力。在这个空间中,语义相似的输入(即使来自不同模态或语言)会被放置在相近的位置。论文通过实验验证了模型在中间层对不同语言的语义等价输入的表征相似性,并展示了这种共享表征空间可以通过模型的主导预训练语言进行解释。此外,论文还发现,对共享表征空间的干预会影响模型在其他数据类型上的输出,这表明共享表征空间不仅是大规模训练的副产品,而是模型在输入处理过程中主动利用的结构。

链接: https://arxiv.org/abs/2411.04986
作者: Zhaofeng Wu,Xinyan Velocity Yu,Dani Yogatama,Jiasen Lu,Yoon Kim
关键词-EN: Modern language models, Modern language, modalities, Modern, shared representation space
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages and modalities), which places semantically similar inputs near one another, even if they are from different modalities/languages. We term this the semantic hub hypothesis, following the hub-and-spoke model from neuroscience (Patterson et al., 2007) which posits that semantic knowledge in the human brain is organized through a transmodal semantic “hub” which integrates information from various modality-specific “spokes” regions. We first show that model representations for semantically equivalent inputs in different languages are similar in the intermediate layers, and that this space can be interpreted using the model’s dominant pretraining language via the logit lens. This tendency extends to other data types, including arithmetic expressions, code, and visual/audio inputs. Interventions in the shared representation space in one data type also predictably affect model outputs in other data types, suggesting that this shared representations space is not simply a vestigial byproduct of large-scale training on broad data, but something that is actively utilized by the model during input processing.
摘要:现代语言模型能够处理跨多种语言和模态的输入。我们假设,模型通过学习一个跨异构数据类型(例如,不同语言和模态)的共享表示空间来获得这种能力,该空间将语义上相似的输入放置在相近的位置,即使它们来自不同的模态或语言。我们称之为语义枢纽假设,借鉴了神经科学中的枢纽与辐条模型(Patterson et al., 2007),该模型认为人类大脑中的语义知识是通过一个跨模态的语义“枢纽”组织的,该枢纽整合了来自各个模态特定“辐条”区域的信息。我们首先展示了模型在不同语言中语义等价输入的表示在中间层是相似的,并且可以通过模型的主导预训练语言通过logit透镜来解释这个空间。这种趋势扩展到其他数据类型,包括算术表达式、代码以及视觉/音频输入。在一个数据类型的共享表示空间中的干预也会可预测地影响模型在其他数据类型中的输出,这表明这个共享表示空间不仅仅是大规模广泛数据训练的残留副产品,而是在输入处理过程中被模型积极利用的。

[NLP-5] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference

【速读】: 该论文试图解决大型语言模型(LLM)推理过程中的加速问题,提出了一种名为SuffixDecoding的新方法。解决方案的关键在于利用后缀树(suffix trees)从先前生成的输出中提取模式,以高效预测候选词序列。与依赖草稿模型或专用解码头的现有方法不同,SuffixDecoding通过动态构建和更新后缀树,利用基于经验词频的评分机制来构建推测树,从而实现灵活的树结构推测,且无需维护和协调额外模型。该方法仅需CPU内存,适用于典型的LLM服务节点,并在多种任务中展示了与基于模型的方法相媲美的加速效果。

链接: https://arxiv.org/abs/2411.04975
作者: Gabriele Oliaro,Zhihao Jia,Daniel Campos,Aurick Qiao
关键词-EN: accelerating large language, large language model, accelerating large, large language, SuffixDecoding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to 1.4\times higher output throughput than SpecInfer and up to 1.1\times lower time-per-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to 2.9\times higher output throughput and 3\times lower latency than speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
摘要:我们提出了 SuffixDecoding,这是一种通过推测性解码加速大语言模型 (LLM) 推理的新颖无模型方法。与依赖草稿模型或专用解码头的现有方法不同,SuffixDecoding 利用从先前生成的输出构建的后缀树来高效预测候选 Token 序列。我们的方法实现了灵活的树结构推测,而无需维护和协调额外模型的开销。SuffixDecoding 构建并动态更新后缀树,以捕捉生成文本中的模式,并通过基于经验 Token 频率的原则性评分机制来构建推测树。SuffixDecoding 仅需要 CPU 内存,这在典型 LLM 服务节点上丰富且未充分利用。我们证明,SuffixDecoding 在包括开放领域聊天、代码生成和文本到 SQL 任务在内的多样化工作负载中,相比基于模型的方法实现了具有竞争力的加速效果。对于开放式聊天和代码生成任务,SuffixDecoding 的输出吞吐量比 SpecInfer 高出最多 1.4 倍,每 Token 时间 (TPOT) 延迟降低最多 1.1 倍。对于一个专有的多 LLM 文本到 SQL 应用,SuffixDecoding 的输出吞吐量比推测性解码高出最多 2.9 倍,延迟降低最多 3 倍。我们的评估显示,即使在小规模参考语料库(256 个示例)的情况下,SuffixDecoding 仍能保持高接受率,并且随着更多历史输出的加入,性能持续提升。

[NLP-6] BitNet a4.8: 4-bit Activations for 1-bit LLM s

【速读】: 该论文试图解决1-bit大型语言模型(LLMs)在推理成本和性能之间的平衡问题。解决方案的关键在于引入BitNet a4.8,通过混合量化和稀疏化策略,将激活值量化为4-bit(INT4/FP4),同时对中间状态进行稀疏化并随后进行8-bit量化,以减轻异常通道引入的量化误差。这种方法不仅在性能上与BitNet b1.58相当,而且在推理速度上更快,同时激活参数减少至55%,并支持3-bit的KV缓存,从而显著提升了大规模LLM部署和推理的效率。

链接: https://arxiv.org/abs/2411.04965
作者: Hongyu Wang,Shuming Ma,Furu Wei
关键词-EN: Large Language Models, Large Language, Language Models, Recent research, presents a promising
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Recent research on the 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed with 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster in inference with enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
摘要:近期关于1-bit大语言模型(LLMs)的研究,如BitNet b1.58,展示了一个在保持性能的同时降低LLMs推理成本的有前景的方向。在本研究中,我们引入了BitNet a4.8,为1-bit LLMs启用了4-bit激活功能。BitNet a4.8采用了一种混合量化和稀疏化策略,以缓解由异常通道引入的量化误差。具体而言,我们为注意力层和前馈网络层的输入使用了4-bit激活,同时对中间状态进行稀疏化处理,随后进行8-bit量化。广泛的实验表明,BitNet a4.8在训练成本相同的情况下,其性能与BitNet b1.58相当,并且在推理速度上更快,支持4-bit(INT4/FP4)内核。此外,BitNet a4.8仅激活了55%的参数,并支持3-bit KV缓存,进一步提升了大规模LLM部署和推理的效率。

[NLP-7] Position Paper On Diagnostic Uncertainty Estimation from Large Language Models : Next-Word Probability Is Not Pre-test Probability NEURIPS2024 ALT

【速读】: 该论文试图解决大型语言模型(LLMs)在临床决策支持中估计预测试概率(pre-test probabilities)能力有限的问题。解决方案的关键在于评估现有LLMs(如Mistral-7B和Llama3-70B)在诊断任务中的表现,并揭示当前提取LLM概率估计方法的局限性。通过这一研究,论文强调了改进LLM置信度估计技术的必要性。

链接: https://arxiv.org/abs/2411.04962
作者: Yanjun Gao,Skatje Myers,Shan Chen,Dmitriy Dligach,Timothy A Miller,Danielle Bitterman,Guanhua Chen,Anoop Mayampurath,Matthew Churpek,Majid Afshar
关键词-EN: Large language models, diagnostic decision support, estimate pre-test probabilities, Large language, remains limited
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to GenAI4Health Workshop at NeurIPS 2024

点击查看摘要

Abstract:Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examined three current methods of extracting LLM probability estimations and revealed their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.
摘要:大语言模型 (LLM) 正在被探索用于诊断决策支持,但其估计预测试概率的能力,这对临床决策至关重要,仍然有限。本研究评估了两个 LLM,Mistral-7B 和 Llama3-70B,使用结构化的电子健康记录数据在三个诊断任务上。我们考察了三种当前提取 LLM 概率估计的方法,并揭示了它们的局限性。我们的目标是强调改进 LLM 置信度估计技术的必要性。

[NLP-8] M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

【速读】: 该论文试图解决文档视觉问答(DocVQA)中现有方法在处理多页文档、跨文档信息以及视觉元素(如图表)时的局限性。解决方案的关键是引入了一种名为M3DocRAG的新型多模态检索增强生成(RAG)框架,该框架能够灵活处理不同文档上下文(封闭域和开放域)、问题跳跃(单跳和多跳)以及证据模态(文本、图表、图像等)。M3DocRAG通过结合多模态检索器和多模态语言模型(MLM),能够高效处理单个或多个文档,同时保留视觉信息。此外,论文还提出了一个新的基准数据集M3DocVQA,用于评估开放域DocVQA任务。

链接: https://arxiv.org/abs/2411.04952
作者: Jaemin Cho,Debanjan Mahata,Ozan Irsoy,Yujie He,Mohit Bansal
关键词-EN: broad applications, text extraction tools, documents, visual question answering, text extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal language models (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, there are difficulties in applying these methods in real-world scenarios: (a) questions often require information across different pages or documents, where MLMs cannot handle many long documents; (b) documents often have important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so that it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. In three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance than many strong baselines, including state-of-the-art performance in MP-DocVQA. We provide comprehensive analyses of different indexing, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence only exists in images.
摘要:文档视觉问答(Document Visual Question Answering, DocVQA)管道能够回答来自文档的问题,具有广泛的应用前景。现有方法主要集中在处理单页文档的多模态语言模型(Multi-modal Language Models, MLMs),或依赖于基于文本的检索增强生成(Retrieval-Augmented Generation, RAG),后者使用光学字符识别(Optical Character Recognition, OCR)等文本提取工具。然而,这些方法在实际应用中存在诸多困难:(a)问题往往需要跨不同页面或文档的信息,而MLMs难以处理大量长文档;(b)文档中常包含图表等视觉元素的重要信息,但文本提取工具会忽略这些内容。我们提出了M3DocRAG,一种新颖的多模态RAG框架,能够灵活适应各种文档上下文(封闭域和开放域)、问题跳跃(单跳和多跳)以及证据模态(文本、图表、图形等)。M3DocRAG通过多模态检索器和MLM来查找相关文档并回答问题,从而能够高效处理单个或多个文档,同时保留视觉信息。由于先前的DocVQA数据集仅在特定文档的上下文中提问,我们还推出了M3DocVQA,这是一个新的基准,用于评估超过3,000个PDF文档和40,000多页的开放域DocVQA。在三个基准测试(M3DocVQA/MMLongBench-Doc/MP-DocVQA)中,实证结果表明,M3DocRAG结合ColPali和Qwen2-VL 7B在众多强基线中表现优异,包括在MP-DocVQA中达到最先进水平。我们提供了对不同索引、MLMs和检索模型的全面分析。最后,我们定性地展示了M3DocRAG在各种场景中的成功应用,例如当相关信息分布在多个页面时,以及当答案证据仅存在于图像中时。

[NLP-9] Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach

【速读】: 该论文试图解决文本分类中文学特征(如作者风格和主题内容)之间的重叠问题,特别是这些特征在文本单元之间的序列相关性对分类结果的影响。解决方案的关键在于引入了一种假设检验方法,通过多元二元分布模型来评估序列相关文学特征对文本分类的影响。该方法将文本单元之间的序列相关性建模为随机过程,并评估在不同邻接尺度下的聚类可能性,从而确定分类是否主要由序列相关特征驱动。实验结果表明,该方法能够有效识别文本分类是否主要受序列相关文学特征的影响,尤其是在文本差异主要体现在作者风格或体裁而非单一作者在相似体裁中的情况下。

链接: https://arxiv.org/abs/2411.04950
作者: Gideon Yoffe,Nachum Dershowitz,Ariel Vishne,Barak Sober
关键词-EN: reflect semi-conscious choices, semi-conscious choices distinct, Stylometry aims, analyzing literary traits, literary traits assumed
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stylometry aims to distinguish authors by analyzing literary traits assumed to reflect semi-conscious choices distinct from elements like genre or theme. However, these components often overlap, complicating text classification based solely on feature distributions. While some literary properties, such as thematic content, are likely to manifest as correlations between adjacent text units, others, like authorial style, may be independent thereof. We introduce a hypothesis-testing approach to evaluate the influence of sequentially correlated literary properties on text classification, aiming to determine when these correlations drive classification. Using a multivariate binary distribution, our method models sequential correlations between text units as a stochastic process, assessing the likelihood of clustering across varying adjacency scales. This enables us to examine whether classification is dominated by sequentially correlated properties or remains independent. In experiments on a diverse English prose corpus, our analysis integrates traditional and neural embeddings within supervised and unsupervised frameworks. Results demonstrate that our approach effectively identifies when textual classification is not primarily influenced by sequentially correlated literary properties, particularly in cases where texts differ in authorial style or genre rather than by a single author within a similar genre.
摘要:文体学旨在通过分析被认为反映作者半意识选择的文学特征来区分作者,这些特征与体裁或主题等元素不同。然而,这些成分往往重叠,仅基于特征分布进行文本分类变得复杂。虽然某些文学属性,如主题内容,可能会表现为相邻文本单元之间的相关性,但其他属性,如作者风格,则可能是独立的。我们提出了一种假设检验方法,以评估顺序相关的文学属性对文本分类的影响,旨在确定这些相关性何时驱动分类。使用多元二元分布,我们的方法将文本单元之间的顺序相关性建模为随机过程,评估在不同邻接尺度上聚类的可能性。这使我们能够检查分类是否主要由顺序相关的属性主导,还是保持独立。在多样化的英语散文语料库上的实验中,我们的分析结合了传统和神经嵌入在监督和无监督框架内。结果表明,我们的方法有效地识别出文本分类何时不受顺序相关的文学属性的主要影响,特别是在文本在作者风格或体裁上存在差异,而不是在相似体裁内由单一作者创作的情况下。

[NLP-10] GPTKB: Building Very Large Knowledge Bases from Language Models

【速读】: 该论文试图解决通用领域知识库(KB)建设中的创新问题,特别是针对“三大巨头”(Wikidata、Yago 和 DBpedia)之外的全新尝试。解决方案的关键在于利用大型语言模型(LLM)完全自主构建大规模通用领域知识库。论文展示了从 LLM 中进行大规模知识库构建的可行性,并强调了实体识别、实体和属性规范化以及分类法构建等方面的具体挑战。作为原型,研究团队使用 GPT-4o-mini 构建了 GPTKB,该知识库包含超过 2.9 百万个实体的 105 百万个三元组,成本仅为以往知识库构建项目的 1/100。这一工作在自然语言处理(NLP)和语义网领域均具有里程碑意义,首次为 LLM 的知识(或信念)提供了建设性见解,并为通用领域知识库建设展示了新的前进方向。

链接: https://arxiv.org/abs/2411.04920
作者: Yujia Hu,Shrestha Ghosh,Tuan-Phong Nugyen,Simon Razniewski
关键词-EN: Yago and DBpedia, intelligent applications, General-domain knowledge bases, Wikidata, Yago
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 11 pages, 4 tables

点击查看摘要

Abstract:General-domain knowledge bases (KB), in particular the “big three” – Wikidata, Yago and DBpedia – are the backbone of many intelligent applications. While these three have seen steady development, comprehensive KB construction at large has seen few fresh attempts. In this work, we propose to build a large general-domain KB entirely from a large language model (LLM). We demonstrate the feasibility of large-scale KB construction from LLMs, while highlighting specific challenges arising around entity recognition, entity and property canonicalization, and taxonomy construction. As a prototype, we use GPT-4o-mini to construct GPTKB, which contains 105 million triples for more than 2.9 million entities, at a cost 100x less than previous KBC projects. Our work is a landmark for two fields: For NLP, for the first time, it provides \textitconstructive insights into the knowledge (or beliefs) of LLMs. For the Semantic Web, it shows novel ways forward for the long-standing challenge of general-domain KB construction. GPTKB is accessible at this https URL.
摘要:通用领域知识库(KB),特别是“三大巨头”——Wikidata、Yago 和 DBpedia,是众多智能应用的基石。尽管这三者在稳步发展,但大规模的全面知识库构建却鲜有新的尝试。在本研究中,我们提出完全基于大语言模型(LLM)构建一个大规模的通用领域知识库。我们展示了从 LLM 进行大规模知识库构建的可行性,同时指出了在实体识别、实体与属性规范化以及分类体系构建等方面存在的具体挑战。作为原型,我们使用 GPT-4o-mini 构建了 GPTKB,该知识库包含超过 2.9 百万个实体的 1.05 亿个三元组,成本仅为以往知识库构建项目的 1/100。我们的工作在两个领域具有里程碑意义:对于自然语言处理(NLP),首次提供了关于 LLM 知识(或信念)的构建性见解;对于语义网,它为长期存在的通用领域知识库构建挑战展示了新的前进方向。GPTKB 可通过以下链接访问。

[NLP-11] GASE: Generatively Augmented Sentence Encoding

【速读】: 该论文试图解决在推理阶段通过数据增强提升句子嵌入(sentence embeddings)性能的问题。解决方案的关键在于利用生成式文本模型(generative text models)在推理时生成多样化的语言变体(如改写、摘要或提取关键词),并将原始文本和生成的变体嵌入进行池化(pooling)。这种方法无需访问模型参数或进行大规模的微调,显著降低了计算资源需求。实验结果表明,生成式增强(generative augmentation)显著提升了在语义文本相似度(STS)基准上的性能,特别是在基线性能较低的嵌入模型上效果更为显著。这表明生成式增强在推理阶段的集成不仅增加了语义多样性,还增强了嵌入模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2411.04914
作者: Manuel Frank,Haithem Afli
关键词-EN: Augmented Sentence Encoding, Semantic Textual Similarity, augmentation, applying generative text, data augmentation
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic training data, our approach does not require access to model parameters or the computational resources typically required for fine-tuning state-of-the-art models. Generatively Augmented Sentence Encoding uses diverse linguistic synthetic variants of input texts generated by paraphrasing, summarising, or extracting keywords, followed by pooling the original and synthetic embeddings. Experimental results on the Massive Text Embedding Benchmark for Semantic Textual Similarity (STS) demonstrate performance improvements across a range of embedding models using different generative models for augmentation. We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance. These findings suggest that integrating generative augmentation at inference time adds semantic diversity and can enhance the robustness and generalizability of sentence embeddings for embedding models. Our results show that the degree to which generative augmentation can improve STS performance depends not only on the embedding model but also on the dataset. From a broader perspective, the approach allows trading training for inference compute.
摘要:我们提出了一种通过在推理时应用生成式文本模型进行数据增强来增强句子嵌入的方法。与利用合成训练数据的传统数据增强方法不同,我们的方法不需要访问模型参数或通常用于微调最先进模型所需的计算资源。生成式增强句子编码使用通过改写、总结或提取关键词生成的输入文本的多语言合成变体,然后对原始和合成嵌入进行池化处理。在语义文本相似性(STS)的大规模文本嵌入基准测试中的实验结果表明,使用不同生成模型进行增强的多种嵌入模型在性能上有所提升。我们发现,生成式增强对基线性能较低的嵌入模型带来了更大的性能提升。这些发现表明,在推理时集成生成式增强不仅增加了语义多样性,还能增强句子嵌入的鲁棒性和泛化能力。我们的结果显示,生成式增强对STS性能的提升程度不仅取决于嵌入模型,还取决于数据集。从更广泛的角度来看,该方法允许在推理计算中进行训练资源的权衡。

[NLP-12] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

【速读】: 该论文试图解决当前代码大型语言模型(LLMs)在科学研究中缺乏透明性和可重复性的问题。解决方案的关键在于全面开放模型的训练数据、数据处理流程、实验结果和训练协议,具体包括:(1) 代码优化的启发式规则用于数据清洗和去重方法,(2) 代码相关文本语料库的召回,以及 (3) 在退火和监督微调阶段使用高质量的合成数据。通过这些关键要素,论文提出了OpenCoder模型,旨在为研究社区提供一个强大的、开放的基础,以促进代码AI领域的可重复性研究和加速技术进步。

链接: https://arxiv.org/abs/2411.04905
作者: Siming Huang,Tianhao Cheng,Jason Klein Liu,Jiaran Hao,Liuyihan Song,Yang Xu,J. Yang,J.H. Liu,Chenchen Zhang,Linzheng Chai,Ruifeng Yuan,Zhaoxiang Zhang,Jie Fu,Qian Liu,Ge Zhang,Zili Wang,Yuan Qi,Yinghui Xu,Wei Chu
关键词-EN: http URL open-access, Large language models, URL open-access code, Large language, http URL
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent this http URL open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an ``open cookbook’’ for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.
摘要:用于代码的大语言模型(LLMs)在包括代码生成、推理任务和 AI 智能体在内的多个领域中已成为不可或缺的工具。开源代码 LLMs 的性能正逐渐接近专有模型的水平,然而,适合严格科学研究的、具备可重复数据处理管道和透明训练协议的高质量代码 LLMs 仍然有限。这种稀缺性源于多种挑战,包括资源限制、伦理考量以及保持模型先进性的竞争优势。为了填补这一空白,我们推出了 OpenCoder,这是一个顶级代码 LLM,不仅在性能上可与领先模型相媲美,而且作为研究社区的“开放食谱”。与大多数先前的努力不同,我们不仅发布了模型权重和推理代码,还提供了可重复的训练数据、完整的数据处理管道、严格的实验消融结果以及详细的训练协议,以供开放科学研究。通过这种全面的发布,我们确定了构建顶级代码 LLM 的关键要素:(1) 针对数据清洗和去重方法的代码优化启发式规则,(2) 与代码相关的文本语料库的召回,以及 (3) 在退火和监督微调阶段的高质量合成数据。通过提供这种程度的开放性,我们的目标是扩大对顶级代码 LLM 各个方面的访问,使 OpenCoder 既作为一个强大的模型,又作为一个开放的基础,以加速研究并实现代码 AI 的可重复进步。

[NLP-13] Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models

【速读】: 该论文试图解决西班牙政治党派在Twitter上的情感表达模式问题,并探讨这些情感表达与其政治意识形态之间的关系。解决方案的关键在于利用预训练语言模型BETO和RoBERTuito对西班牙语文本进行情感分析,通过分析PSOE、PP、Vox、Podemos和Ciudadanos等主要政党的推文数据,研究情感分布及其与党派意识形态的关联。研究发现,各党派在情感表达上存在显著差异,如Vox表现出较高的负面情感,而PSOE则表现出较高的正面情感,这些情感模式与其政治立场相符,从而验证了情感表达在政治传播中的重要性。

链接: https://arxiv.org/abs/2411.04862
作者: Chuqiao Song,Shunzhang Chen,Xinyi Cai,Hao Chen
关键词-EN: Pre-trained Language Models, Spanish Political Party, Political Party Communications, Spanish Political, Pre-trained Language
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 21 pages, 6 figures

点击查看摘要

Abstract:Title: Sentiment Analysis of Spanish Political Party Communications on Twitter Using Pre-trained Language Models Authors: Chuqiao Song, Shunzhang Chen, Xinyi Cai, Hao Chen Comments: 21 pages, 6 figures Abstract: This study investigates sentiment patterns within Spanish political party communications on Twitter by leveraging BETO and RoBERTuito, two pre-trained language models optimized for Spanish text. Using a dataset of tweets from major Spanish political parties: PSOE, PP, Vox, Podemos, and Ciudadanos, spanning 2019 to 2024, this research analyzes sentiment distributions and explores the relationship between sentiment expression and party ideology. The findings indicate that both models consistently identify a predominant Neutral sentiment across all parties, with significant variations in Negative and Positive sentiments that align with ideological distinctions. Specifically, Vox exhibits higher levels of Negative sentiment, while PSOE demonstrates relatively high Positive sentiment, supporting the hypothesis that emotional appeals in political messaging reflect ideological stances. This study underscores the potential of pre-trained language models for non-English sentiment analysis on social media, providing insights into sentiment dynamics that shape public discourse within Spain’s multi-party political system. Keywords: Spanish politics, sentiment analysis, pre-trained language models, Twitter, BETO, RoBERTuito, political ideology, multi-party system Comments: 21 pages, 6 figures Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY) MSC classes: 68T50 (Natural Language Processing), 68T10 (Pattern Recognition, Speech Recognition), 91F10 (Political Science) Cite as: arXiv:2411.04862 [cs.CL] (or arXiv:2411.04862v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2411.04862 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Chuqiao Song [view email] [v1] Thu, 7 Nov 2024 16:53:09 UTC (497 KB) Full-text links: Access Paper: View a PDF of the paper titled Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models, by Chuqiao Song and 3 other authorsView PDFOther Formats view license Current browse context: cs.CL prev | next new | recent | 2024-11 Change to browse by: cs cs.CY References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
摘要:
标题:利用预训练语言模型对西班牙政党在Twitter上的沟通进行情感分析
作者:宋楚乔、陈顺章、蔡欣怡、陈浩
评论:21页,6幅图
摘要:本研究通过利用BETO和RoBERTuito这两个针对西班牙语文本优化的预训练语言模型,探讨了西班牙主要政党在Twitter上的沟通中的情感模式。研究使用了来自西班牙主要政党(PSOE、PP、Vox、Podemos和Ciudadanos)的推文数据集,时间跨度从2019年到2024年,分析了情感分布,并探讨了情感表达与政党意识形态之间的关系。研究发现,两个模型均一致识别出所有政党中占主导地位的中性情感,但在负面和正面情感上存在显著差异,这些差异与政党的意识形态区分相吻合。具体而言,Vox表现出较高的负面情感水平,而PSOE则显示出相对较高的正面情感,这支持了政治信息中的情感诉求反映了意识形态立场的假设。本研究强调了预训练语言模型在非英语社交媒体情感分析中的潜力,为理解塑造西班牙多党政治体系中公众话语的情感动态提供了见解。
关键词:西班牙政治、情感分析、预训练语言模型、Twitter、BETO、RoBERTuito、政治意识形态、多党制
评论:21页,6幅图
学科:计算与语言(cs.CL);计算机与社会(cs.CY
MSC类别:68T50(自然语言处理),68T10(模式识别,语音识别),91F10(政治科学)
引用为:arXiv:2411.04862 [cs.CL]
(或 arXiv:2411.04862v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2411.04862
arXiv-issued DOI via DataCite(待注册)
提交历史:从:宋楚乔 [查看电子邮件]
[v1] 2024年11月7日 16:53:09 UTC(497 KB)
全文链接:访问论文:查看题为《利用预训练语言模型对西班牙政党推文进行情感分析》的PDF,作者为宋楚乔及其他三位作者查看PDF其他格式查看许可证当前浏览上下文:cs.CL上一篇 | 下一篇新 | 最近 | 2024-11更改浏览方式:cs cs.CY参考文献引用NASA ADSGoogle Scholar Semantic Scholar导出BibTeX引用加载中…BibTeX格式引用加载中…数据由以下提供:书签已检查=“已检查”>书目工具书目和引用工具书目浏览器切换书目浏览器(什么是浏览器?)连接的论文切换连接的论文(什么是连接的论文?)Litmaps切换Litmaps(什么是Litmaps?)scite.ai切换scite智能引用(什么是智能引用?)代码、数据、媒体与本文相关的代码、数据和媒体alphaXiv切换alphaXiv(什么是alphaXiv?)代码链接切换CatalyzeX代码查找器(什么是CatalyzeX?)DagsHub切换DagsHub(什么是DagsHub?)GotitPub切换Gotit.pub(什么是GotitPub?)Huggingface切换Hugging Face(什么是Huggingface?)代码链接切换Papers with Code(什么是Papers with Code?)ScienceCast切换ScienceCast(什么是ScienceCast?)演示演示Replicate切换Replicate(什么是Replicate?)Spaces切换Hugging Face Spaces(什么是Spaces?)Spaces切换TXYZ.AI什么是TXYZ.AI?)相关论文推荐和搜索工具影响花链接影响花(什么是影响花?)核心推荐者切换CORE推荐者(什么是CORE?)作者地点机构主题关于arXivLabsarXivLabs:社区合作者的实验项目arXivLabs是一个框架,允许合作者在我们的网站上直接开发和分享新的arXiv功能。无论是个人还是组织,与arXivLabs合作的人都接受了我们的开放、社区、卓越和用户数据隐私的价值观。arXiv致力于这些价值观,并且只与遵守这些价值观的合作伙伴合作。有一个为arXiv社区增加价值的项目想法?了解更多关于arXivLabs的信息。本文的哪些作者是支持者?|禁用MathJax(什么是MathJax?)mathjaxToggle();关于帮助联系arXiv点击此处联系arXiv联系订阅arXiv邮件点击此处订阅订阅版权隐私政策网页无障碍辅助arXiv操作状态通过电子邮件或Slack获取状态通知

[NLP-14] Prompt-Guided Internal States for Hallucination Detection of Large Language Models

【速读】: 该论文试图解决大型语言模型 (Large Language Models, LLMs) 在跨领域应用中产生的幻觉 (hallucinations) 检测问题,即模型生成的响应在逻辑上连贯但事实错误或误导。解决方案的关键在于提出了一种名为 PRISM 的新框架,通过使用适当的提示 (prompts) 引导 LLM 内部状态中与文本真实性相关的结构变化,使其在不同领域的文本中更加显著和一致,从而增强现有幻觉检测方法的跨领域泛化能力。

链接: https://arxiv.org/abs/2411.04847
作者: Fujie Zhang,Peiqi Yu,Biao Yi,Baolei Zhang,Tong Li,Zheli Liu
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, demonstrated remarkable
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of tasks in different domains. However, they sometimes generate responses that are logically coherent but factually incorrect or misleading, which is known as LLM hallucinations. Data-driven supervised methods train hallucination detectors by leveraging the internal states of LLMs, but detectors trained on specific domains often struggle to generalize well to other domains. In this paper, we aim to enhance the cross-domain performance of supervised detectors with only in-domain data. We propose a novel framework, prompt-guided internal states for hallucination detection of LLMs, namely PRISM. By utilizing appropriate prompts to guide changes in the structure related to text truthfulness within the LLM’s internal states, we make this structure more salient and consistent across texts from different domains. We integrated our framework with existing hallucination detection methods and conducted experiments on datasets from different domains. The experimental results indicate that our framework significantly enhances the cross-domain generalization of existing hallucination detection methods.
摘要:大语言模型 (LLM) 在不同领域的多种任务中展示了卓越的能力。然而,它们有时会生成逻辑上连贯但事实错误或误导性的响应,这种现象被称为 LLM 幻觉。基于数据的监督方法通过利用 LLM 的内部状态来训练幻觉检测器,但这些在特定领域训练的检测器往往难以很好地泛化到其他领域。本文旨在仅使用领域内数据来提升监督检测器的跨领域性能。我们提出了一种新颖的框架,即用于 LLM 幻觉检测的提示引导内部状态,简称 PRISM。通过利用适当的提示来引导 LLM 内部状态中与文本真实性相关的结构变化,我们使得这一结构在不同领域的文本中更加显著和一致。我们将该框架与现有的幻觉检测方法结合,并在来自不同领域的数据集上进行了实验。实验结果表明,我们的框架显著增强了现有幻觉检测方法的跨领域泛化能力。

[NLP-15] VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

【速读】: 该论文试图解决学术文本向普通读者简化表达的问题,特别是在文档级别上缺乏相关数据集和有效模型的问题。解决方案的关键在于发布了首个学术文本向普通读者简化表达的数据集VTechAGP,并提出了动态软提示生成语言模型DSPT5。DSPT5通过对比生成损失函数学习动态提示中的关键词向量,并在推理阶段采用人群采样解码策略,以在语义和结构层面上进一步筛选最佳输出候选。实验结果表明,DSPT5在轻量级模型中表现出色,能够与最先进的语言模型相媲美。

链接: https://arxiv.org/abs/2411.04825
作者: Ming Cheng,Jiaying Gong,Chenhan Yuan,William A. Ingram,Edward Fox,Hoda Eldardiry
关键词-EN: Existing text simplification, sentence-level text generation, Existing text, text paraphrase dataset, focus on sentence-level
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: 21 pages, 3 figures

点击查看摘要

Abstract:Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset consisting of 4,938 document-level these and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives. Results demonstrate that the SOTA LLMs does not provide satisfactory outcomes, while the lightweight DSPT5 can achieve competitive results. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for academic-to-general-audience text paraphrase dataset.
摘要:现有的文本简化或释义数据集主要集中在通用领域的句子级文本生成上。这些数据集通常是在不使用领域知识的情况下开发的。本文中,我们发布了一个新的数据集,即 VTechAGP,这是首个由 4,938 对来自 8 所学院的学术论文和面向普通读者的摘要组成的文档级学术到普通读者的释义数据集,这些文档跨越了 25 年的时间。我们还提出了一种新颖的动态软提示生成语言模型,即 DSPT5。在训练过程中,我们利用对比生成损失函数来学习动态提示中的关键词向量。在推理阶段,我们在语义和结构层面上采用众包采样解码策略,以进一步选择最佳输出候选。我们从多个角度评估了 DSPT5 和多种最先进的生成语言模型(LLMs)。结果表明,现有的最先进大语言模型并不能提供令人满意的结果,而轻量级的 DSPT5 则能取得具有竞争力的成果。据我们所知,我们是首个为学术到普通读者的文本释义数据集构建基准数据集和解决方案的团队。

[NLP-16] When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun

【速读】: 该论文试图解决的问题是古典中文资源在处理韩文(Hanja)和日文(Kanbun)历史文档时的跨语言迁移能力。研究通过实验发现,古典中文数据集对韩文历史文档的机器翻译、命名实体识别和标点恢复任务的影响有限,性能差异在序列标注任务中仅为±0.0068 F1-score,翻译任务中最高为+0.84 BLEU score。关键在于,随着本地语言数据的增加,古典中文资源对Hanja的益处迅速减少,仅在极低资源场景下对韩文和日文历史文档有显著改善。这强调了在跨语言迁移中需要谨慎的实证验证,而非盲目假设其益处。

链接: https://arxiv.org/abs/2411.04822
作者: Seyoung Song,Haneul Yoo,Jiho Jin,Kyunghyun Cho,Alice Oh
关键词-EN: Classical Chinese resources, Classical Chinese, Korea and Japan, Classical Chinese datasets, Sinosphere have led
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within \pm0.0068 F1-score for sequence labeling tasks and up to +0.84 BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.
摘要:汉学圈内的历史与语言联系促使研究者在处理来自韩国和日本的历史文献时,利用古汉语资源进行跨语言迁移。本文质疑了从古汉语到韩文(Hanja)和日文(Kanbun)这两种古代书写语言的跨语言迁移能力。我们在机器翻译、命名实体识别和标点恢复任务上的实验表明,古汉语数据集对用韩文书写的古代韩国文献的语言模型性能影响甚微,序列标注任务的性能差异在±0.0068 F1-score以内,翻译任务的性能差异最高可达+0.84 BLEU分数。这些局限性在不同模型规模、架构和领域特定数据集中持续存在。我们的分析显示,随着韩文本地语言数据的增加,古汉语资源的优势迅速减弱,而在韩日历史文献的极低资源场景中,仅显示出显著的改进。这些混合结果强调了需要谨慎的实证验证,而不是盲目假设跨语言迁移的益处。

[NLP-17] LuxBank: The First Universal Dependency Treebank for Luxembourgish

【速读】: 该论文试图解决卢森堡语(Luxembourgish)在通用依存关系(Universal Dependencies, UD)项目中的缺失问题。解决方案的关键在于引入了LuxBank,这是首个针对卢森堡语的UD树库(UD Treebank),通过建立正式的卢森堡语标注指南,填补了该语言在句法标注和分析方面的空白。LuxBank不仅为语言学家和语言学习者提供了资源,还为开发拼写检查器、语法检查器、整理现有文本档案以及训练大型语言模型提供了工具。通过将卢森堡语纳入UD框架,该研究旨在增强对西日耳曼语言句法变异的理解,并为记录较小、半标准化语言提供模型,从而在更广泛的语义学和自然语言处理(NLP)社区中将卢森堡语定位为有价值的资源。

链接: https://arxiv.org/abs/2411.04813
作者: Alistair Plum,Caroline Döhmer,Emilia Milano,Anne-Marie Lutgen,Christoph Purschke
关键词-EN: Universal Dependencies, significantly expanded linguistic, expanded linguistic coverage, Germanic language spoken, West Germanic
类目: Computation and Language (cs.CL)
备注: Accepted at 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)

点击查看摘要

Abstract:The Universal Dependencies (UD) project has significantly expanded linguistic coverage across 161 languages, yet Luxembourgish, a West Germanic language spoken by approximately 400,000 people, has remained absent until now. In this paper, we introduce LuxBank, the first UD Treebank for Luxembourgish, addressing the gap in syntactic annotation and analysis for this `low-research’ language. We establish formal guidelines for Luxembourgish language annotation, providing the foundation for the first large-scale quantitative analysis of its syntax. LuxBank serves not only as a resource for linguists and language learners but also as a tool for developing spell checkers and grammar checkers, organising existing text archives and even training large language models. By incorporating Luxembourgish into the UD framework, we aim to enhance the understanding of syntactic variation within West Germanic languages and offer a model for documenting smaller, semi-standardised languages. This work positions Luxembourgish as a valuable resource in the broader linguistic and NLP communities, contributing to the study of languages with limited research and resources.
摘要:通用依存关系 (Universal Dependencies, UD) 项目已显著扩展了 161 种语言的语言覆盖范围,然而,卢森堡语(一种由约 40 万人使用的西日耳曼语)至今仍未被纳入其中。本文介绍了 LuxBank,这是首个针对卢森堡语的 UD 树库,旨在填补该“低研究”语言在句法标注与分析方面的空白。我们为卢森堡语标注制定了正式指南,为首次对其句法进行大规模定量分析奠定了基础。LuxBank 不仅为语言学家和语言学习者提供了资源,还为开发拼写检查器和语法检查器、整理现有文本档案,甚至训练大语言模型提供了工具。通过将卢森堡语纳入 UD 框架,我们旨在增进对西日耳曼语系句法变异的理解,并为记录较小、半标准化的语言提供模型。这项工作将卢森堡语定位为语言学和自然语言处理 (NLP) 领域中的宝贵资源,有助于推动对研究资源有限语言的研究。

[NLP-18] Kwai-STaR: Transform LLM s into State-Transition Reasoners

【速读】: 该论文试图解决大语言模型(LLMs)在数学推理能力上的不足,关键解决方案在于将数学问题求解过程定义为从初始未解决状态到最终解决状态的状态转移过程,并提出了Kwai-STaR框架。该框架通过以下三个步骤将LLMs转化为状态转移推理器(State-Transition Reasoners),以提升其直观推理能力:(1) 定义适用于数学推理的状态空间;(2) 基于状态空间生成状态转移数据;(3) 通过课程训练策略将原始LLMs转换为状态转移推理器。实验结果表明,经过小规模Kwai-STaR数据集训练后,包括Mistral-7B和LLaMA-3在内的通用LLMs在GSM8K和GSM-Hard数据集上取得了显著的性能提升,同时状态转移设计还赋予了Kwai-STaR出色的训练和推理效率。

链接: https://arxiv.org/abs/2411.04799
作者: Xingyu Lu,Yuhang Hu,Changyi Liu,Tianke Zhang,Zhenyu Yang,Zhixiang Ding,Shengsheng Qian,Meng Du,Ruiwen Kang,Kaiyu Tang,Fan Yang,Tingting Gao,Di Zhang,Hai-Tao Zheng,Bin Wen
关键词-EN: Mathematical reasoning presents, presents a significant, significant challenge, Mathematical reasoning, LLMs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Mathematical reasoning presents a significant challenge to the cognitive capabilities of LLMs. Various methods have been proposed to enhance the mathematical ability of LLMs. However, few recognize the value of state transition for LLM reasoning. In this work, we define mathematical problem-solving as a process of transiting from an initial unsolved state to the final resolved state, and propose Kwai-STaR framework, which transforms LLMs into State-Transition Reasoners to improve their intuitive reasoning capabilities. Our approach comprises three main steps: (1) Define the state space tailored to the mathematical reasoning. (2) Generate state-transition data based on the state space. (3) Convert original LLMs into State-Transition Reasoners via a curricular training strategy. Our experiments validate the effectiveness of Kwai-STaR in enhancing mathematical reasoning: After training on the small-scale Kwai-STaR dataset, general LLMs, including Mistral-7B and LLaMA-3, achieve considerable performance gain on the GSM8K and GSM-Hard dataset. Additionally, the state transition-based design endows Kwai-STaR with remarkable training and inference efficiency. Further experiments are underway to establish the generality of Kwai-STaR.
摘要:数学推理对大语言模型(LLM)的认知能力提出了重大挑战。已有多种方法被提出以增强 LLM 的数学能力,但鲜有研究认识到状态转移在 LLM 推理中的价值。在本研究中,我们将数学问题解决定义为从初始未解决状态向最终解决状态的转移过程,并提出了 Kwai-STaR 框架,该框架将 LLM 转化为状态转移推理器,以提升其直观推理能力。我们的方法包括三个主要步骤:(1)定义适用于数学推理的状态空间;(2)基于状态空间生成状态转移数据;(3)通过课程训练策略将原始 LLM 转换为状态转移推理器。我们的实验验证了 Kwai-STaR 在增强数学推理方面的有效性:在经过小规模 Kwai-STaR 数据集训练后,包括 Mistral-7B 和 LLaMA-3 在内的通用 LLM 在 GSM8K 和 GSM-Hard 数据集上取得了显著的性能提升。此外,基于状态转移的设计赋予了 Kwai-STaR 出色的训练和推理效率。进一步的实验正在进行中,以验证 Kwai-STaR 的通用性。

[NLP-19] AlignXIE: Improving Multilingual Information Extraction by Cross-Lingual Alignment

【速读】: 该论文试图解决跨语言信息抽取(Information Extraction, IE)中存在的语言间不平衡问题,即尽管大型语言模型(LLMs)在跨语言对齐方面表现出色,但在信息抽取任务中仍存在显著的语言间不平衡,揭示了IE对齐的内在缺陷。解决方案的关键在于提出了AlignXIE,一个基于代码的LLM,通过两种策略显著增强跨语言IE对齐:首先,将不同语言的IE任务形式化为代码生成任务,使用Python类标准化各种模式的表示,确保不同语言中相同本体的统一性和模式对齐;其次,通过引入翻译实例预测任务的跨语言对齐阶段,利用ParallelNER这一包含257,190个样本的双语平行数据集,通过LLM自动生成并手动标注,以确保数据质量,从而对齐抽取过程。最终,通过多语言IE指令调优获得AlignXIE,在未见过的9种语言上,其性能超越了ChatGPT 30.17%和当前最先进模型(SoTA)20.03%,展示了卓越的跨语言IE能力。

链接: https://arxiv.org/abs/2411.04794
作者: Yuxin Zuo,Wenxuan Jiang,Wenxuan Liu,Zixuan Li,Long Bai,Hanbin Wang,Yutao Zeng,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
关键词-EN: Empirical evidence suggests, LLMs exhibit spontaneous, Empirical evidence, exhibit spontaneous cross-lingual, spontaneous cross-lingual alignment
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Empirical evidence suggests that LLMs exhibit spontaneous cross-lingual alignment. Our findings suggest that although LLMs also demonstrate promising cross-lingual alignment in Information Extraction, there remains significant imbalance across languages, revealing an underlying deficiency in the IE alignment. To address this issue, we propose AlignXIE, a powerful code-based LLM that significantly enhances cross-lingual IE alignment through two strategies. Firstly, AlignXIE formulates IE across different languages, especially non-English ones, as code generation tasks, standardizing the representation of various schemas using Python classes to ensure consistency of the same ontology in different languages and align the schema. Secondly, it incorporates an IE cross-lingual alignment phase through a translated instance prediction task proposed in this paper to align the extraction process, utilizing ParallelNER, an IE bilingual parallel dataset with 257,190 samples, generated by our proposed LLM-based automatic pipeline for IE parallel data construction, with manual annotation to ensure quality. Ultimately, we obtain AlignXIE through multilingual IE instruction tuning. Although without training in 9 unseen languages, AlignXIE surpasses ChatGPT by 30.17% and SoTA by 20.03% , thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 63 IE benchmarks in Chinese and English under various settings, demonstrate that AlignXIE significantly enhances cross-lingual and multilingual IE through boosting the IE alignment.
摘要:实证证据表明,大语言模型 (LLM) 表现出自发的跨语言对齐能力。我们的研究发现,尽管大语言模型在信息抽取 (Information Extraction) 中也展现出有前景的跨语言对齐能力,但不同语言之间仍存在显著的不平衡,揭示了信息抽取对齐中的潜在缺陷。为解决这一问题,我们提出了 AlignXIE,一种基于代码的大语言模型,通过两种策略显著增强了跨语言信息抽取对齐能力。首先,AlignXIE 将不同语言,特别是非英语语言的信息抽取任务形式化为代码生成任务,使用 Python 类标准化各种模式表示,以确保不同语言中相同本体的统一性和模式对齐。其次,它通过本文提出的翻译实例预测任务,引入了一个信息抽取跨语言对齐阶段,利用 ParallelNER,一个包含 257,190 个样本的双语平行数据集,该数据集由我们提出的基于大语言模型的信息抽取平行数据自动构建管道生成,并经过人工标注以确保质量。最终,我们通过多语言信息抽取指令微调获得了 AlignXIE。尽管在没有训练的 9 种未见语言上,AlignXIE 仍超越了 ChatGPT 30.17% 和当前最先进技术 (SoTA) 20.03%,从而展示了卓越的跨语言信息抽取能力。在各种设置下对 63 个中英文信息抽取基准的综合评估表明,AlignXIE 通过提升信息抽取对齐能力,显著增强了跨语言和多语言信息抽取效果。

[NLP-20] Enhancing Investment Analysis: Optimizing AI-Agent Collaboration in Financial Research

【速读】: 该论文试图解决现有生成式人工智能 (GenAI) 在金融分析和投资决策中主要依赖单一智能体系统,未能充分利用多智能体协作潜力的问题。解决方案的关键在于提出了一种新颖的多智能体协作系统,该系统通过配置不同规模的智能体组和协作结构,利用各智能体组的优势,采用次优组合策略动态适应不同市场条件和投资场景,从而在基本面分析、市场情绪分析和风险分析等子任务中优化性能。研究结果表明,该多智能体协作系统在复杂金融环境中相比传统单一智能体模型,提供了更高的准确性、效率和适应性。

链接: https://arxiv.org/abs/2411.04788
作者: Xuewen Han,Neng Wang,Shangkun Che,Hongyang Yang,Kunpeng Zhang,Sean Xin Xu
关键词-EN: generative artificial intelligence, gained significant attention, recent years, artificial intelligence, application of generative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
备注:

点击查看摘要

Abstract:In recent years, the application of generative artificial intelligence (GenAI) in financial analysis and investment decision-making has gained significant attention. However, most existing approaches rely on single-agent systems, which fail to fully utilize the collaborative potential of multiple AI agents. In this paper, we propose a novel multi-agent collaboration system designed to enhance decision-making in financial investment research. The system incorporates agent groups with both configurable group sizes and collaboration structures to leverage the strengths of each agent group type. By utilizing a sub-optimal combination strategy, the system dynamically adapts to varying market conditions and investment scenarios, optimizing performance across different tasks. We focus on three sub-tasks: fundamentals, market sentiment, and risk analysis, by analyzing the 2023 SEC 10-K forms of 30 companies listed on the Dow Jones Index. Our findings reveal significant performance variations based on the configurations of AI agents for different tasks. The results demonstrate that our multi-agent collaboration system outperforms traditional single-agent models, offering improved accuracy, efficiency, and adaptability in complex financial environments. This study highlights the potential of multi-agent systems in transforming financial analysis and investment decision-making by integrating diverse analytical perspectives.
摘要:近年来,生成式人工智能 (GenAI) 在金融分析和投资决策中的应用引起了广泛关注。然而,大多数现有方法依赖于单一代理系统,未能充分利用多个 AI 智能体之间的协作潜力。本文提出了一种新颖的多智能体协作系统,旨在增强金融投资研究中的决策能力。该系统集成了具有可配置群体规模和协作结构的智能体群体,以利用每种智能体群体类型的优势。通过采用次优组合策略,系统能够动态适应不同的市场条件和投资场景,优化不同任务的性能。我们重点关注三个子任务:基本面分析、市场情绪分析和风险分析,通过分析道琼斯指数中 30 家公司的 2023 年 SEC 10-K 表格来实现。研究结果表明,不同任务的 AI 智能体配置会导致显著的性能差异。实验结果显示,我们的多智能体协作系统优于传统的单智能体模型,在复杂的金融环境中提供了更高的准确性、效率和适应性。本研究强调了多智能体系统在整合多样化分析视角方面,对金融分析和投资决策变革的潜力。

[NLP-21] A study of Vietnamese readability assessing through semantic and statistical features

【速读】: 该论文试图解决越南语文本可读性评估中仅依赖统计特征的局限性问题。解决方案的关键在于引入了一种结合统计和语义分析的新方法,通过使用先进的语言模型(如PhoBERT、ViDeBERTa和ViBERT)进行语义分析,并结合统计方法提取文本的句法和词汇特征。实验结果表明,这种联合方法显著提高了可读性分类的准确性,强调了在越南语文本难度评估中同时考虑统计和语义特征的重要性。

链接: https://arxiv.org/abs/2411.04756
作者: Hung Tuan Le,Long Truong To,Manh Trong Nguyen,Quyen Nguyen,Trong-Hop Do
关键词-EN: reader text comprehension, Vietnamese Text Readability, text involves assessing, Vietnamese Text, impact the reader
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Determining the difficulty of a text involves assessing various textual features that may impact the reader’s text comprehension, yet current research in Vietnamese has only focused on statistical features. This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability. Our research utilized three distinct datasets: the Vietnamese Text Readability Dataset (ViRead), OneStopEnglish, and RACE, with the latter two translated into Vietnamese. Advanced semantic analysis methods were employed for the semantic aspect using state-of-the-art language models such as PhoBERT, ViDeBERTa, and ViBERT. In addition, statistical methods were incorporated to extract syntactic and lexical features of the text. We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees and evaluated their performance using accuracy and F1 score metrics. Our results indicate that a joint approach that combines semantic and statistical features significantly enhances the accuracy of readability classification compared to using each method in isolation. The current study emphasizes the importance of considering both statistical and semantic aspects for a more accurate assessment of text difficulty in Vietnamese. This contribution to the field provides insights into the adaptability of advanced language models in the context of Vietnamese text readability. It lays the groundwork for future research in this area.
摘要:确定文本的难度涉及评估可能影响读者文本理解的各种文本特征,然而当前针对越南语的研究仅关注统计特征。本文提出了一种新的方法,将统计和语义方法结合起来评估文本的可读性。我们的研究使用了三个不同的数据集:越南语文本可读性数据集(ViRead)、OneStopEnglish 和 RACE,后两者被翻译成越南语。在语义分析方面,我们采用了先进的语义分析方法,使用了如 PhoBERT、ViDeBERTa 和 ViBERT 等最先进的语言模型。此外,我们还结合了统计方法来提取文本的句法和词汇特征。我们使用多种机器学习模型进行了实验,包括支持向量机(SVM)、随机森林和极端树,并使用准确率和 F1 分数指标评估了它们的性能。结果表明,结合语义和统计特征的联合方法显著提高了可读性分类的准确性,相比于单独使用每种方法。当前研究强调了在越南语文本难度评估中同时考虑统计和语义方面的重要性。这一贡献为该领域提供了关于先进语言模型在越南语文本可读性背景下适应性的见解,并为该领域的未来研究奠定了基础。

[NLP-22] RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval

【速读】: 该论文试图解决从代码混合(code-mixing)的对话中提取相关信息的问题,特别是在罗马字转写孟加拉语与英语混合的场景中。解决方案的关键在于开发了一种自动识别代码混合对话中最相关答案的机制。具体方法包括使用GPT-3.5 Turbo进行提示,并利用相关文档的顺序性构建数学模型,以检测与查询相对应的相关文档。实验结果表明,该方法在从复杂的代码混合数字对话中提取相关信息方面具有显著效果。

链接: https://arxiv.org/abs/2411.04752
作者: Aniket Deroy,Subhankar Maity
关键词-EN: widespread linguistic phenomenon, single sentence, linguistic phenomenon, integration of lexical, lexical and grammatical
类目: Computation and Language (cs.CL)
备注: Accepted at FIRE 2024 (Track: Code-Mixed Information Retrieval from Social Media Data)

点击查看摘要

Abstract:Code-mixing, the integration of lexical and grammatical elements from multiple languages within a single sentence, is a widespread linguistic phenomenon, particularly prevalent in multilingual societies. In India, social media users frequently engage in code-mixed conversations using the Roman script, especially among migrant communities who form online groups to share relevant local information. This paper focuses on the challenges of extracting relevant information from code-mixed conversations, specifically within Roman transliterated Bengali mixed with English. This study presents a novel approach to address these challenges by developing a mechanism to automatically identify the most relevant answers from code-mixed conversations. We have experimented with a dataset comprising of queries and documents from Facebook, and Query Relevance files (QRels) to aid in this task. Our results demonstrate the effectiveness of our approach in extracting pertinent information from complex, code-mixed digital conversations, contributing to the broader field of natural language processing in multilingual and informal text environments. We use GPT-3.5 Turbo via prompting alongwith using the sequential nature of relevant documents to frame a mathematical model which helps to detect relevant documents corresponding to a query.
摘要:代码混合(Code-mixing),即在单一句子中融合多种语言的词汇和语法元素,是一种广泛的语言现象,尤其在多语言社会中普遍存在。在印度,社交媒体用户经常使用罗马字母进行代码混合的对话,特别是在移民社区中,他们通过在线群组分享相关的本地信息。本文聚焦于从代码混合对话中提取相关信息的挑战,特别是针对罗马字母转写的孟加拉语与英语混合的情况。本研究提出了一种新颖的方法来应对这些挑战,通过开发一种机制来自动识别代码混合对话中最相关的答案。我们使用了一个包含来自Facebook的查询和文档以及查询相关文件(QRels)的数据集进行实验。实验结果表明,我们的方法在从复杂的代码混合数字对话中提取相关信息方面具有有效性,为多语言和非正式文本环境中的自然语言处理领域做出了贡献。我们通过提示GPT-3.5 Turbo,并利用相关文档的顺序特性构建了一个数学模型,以帮助检测与查询相对应的相关文档。

[NLP-23] BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages

【速读】: 该论文试图解决印度语言自动语音翻译(Automatic Speech Translation, AST)数据集稀缺的问题,特别是针对低资源印度语言的AST系统性能落后于高资源语言(如英语)的现状。解决方案的关键在于引入BhasaAnuvaad数据集,这是目前公开的最大规模的AST数据集,涵盖14种印度官方语言,总时长超过44,400小时,包含1700万条文本片段。该数据集通过整合现有资源、大规模网络挖掘和合成数据生成三种方式构建,旨在提升AST系统对自发性和非正式语言模式的处理能力,从而推动低资源印度语言AST技术的发展。

链接: https://arxiv.org/abs/2411.04699
作者: Sparsh Jain,Ashwin Sankar,Devilal Choudhary,Dhairya Suman,Nikhil Narasimhan,Mohammed Safi Ur Rahman Khan,Anoop Kunchukuttan,Mitesh M Khapra,Raj Dabre
关键词-EN: Automatic Speech Translation, remain critically scarce, Indian languages, Indian languages remain, languages remain critically
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. This scarcity has resulted in AST systems for Indian languages lagging far behind those available for high-resource languages like English. In this paper, we first evaluate the performance of widely-used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 14 scheduled Indian languages spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. This dataset comprises three key categories: (1) Curated datasets from existing resources, (2) Large-scale web mining, and (3) Synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for low-resource Indian languages, especially in handling spontaneous and informal speech patterns.
摘要:印度语言的自动语音翻译 (Automatic Speech Translation, AST) 数据集仍然极度匮乏,公开资源仅涵盖不到 22 种官方语言中的 10 种。这种稀缺性导致印度语言的 AST 系统远远落后于英语等高资源语言的系统。本文首先评估了广泛使用的 AST 系统在印度语言上的表现,识别出显著的性能差距和挑战。我们的研究发现,尽管这些系统在朗读语音上表现尚可,但在处理包括停顿和犹豫在内的即兴语音时表现显著不佳。此外,能够准确翻译口语和非正式语言的系统明显缺失,而这是日常交流的关键方面。为此,我们引入了 BhasaAnuvaad,这是目前公开的最大 AST 数据集,涵盖 14 种印度官方语言,总时长超过 44,400 小时,包含 1700 万条文本片段。BhasaAnuvaad 包含英语语音到印度语言文本以及印度语言语音到英语文本的数据。该数据集包括三个关键类别:(1) 从现有资源中精选的数据集,(2) 大规模网络挖掘数据,(3) 合成数据生成。通过提供这一多样且广泛的数据集,我们旨在弥合资源差距,推动低资源印度语言 AST 的发展,特别是在处理即兴和非正式语音模式方面。

[NLP-24] DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models

【速读】: 该论文试图解决现有后验解释性方法在捕捉神经语言模型决策过程中的局限性,特别是它们未能全面捕捉模型基于全局理解的决策,以及难以区分基于虚假相关性的决策。解决方案的关键在于引入了一种名为DISCO的新方法,通过识别与模型预测相关的因果n-gram关联,生成基于规则的全局解释。DISCO利用可扩展的序列挖掘技术从训练数据中提取相关文本片段,并进行因果关系检查,以提炼出解释模型行为的稳健规则。这些规则不仅揭示了潜在的过拟合问题,还提供了对误导性特征组合的洞察,从而在复杂模型行为解释方面优于现有方法。

链接: https://arxiv.org/abs/2411.04649
作者: Zijian Zhang,Vinay Setty,Yumeng Wang,Avishek Anand
关键词-EN: neural language models, interpretable explanations comprehensible, rapid advancement, advancement of neural, neural language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid advancement of neural language models, the deployment of over-parameterized models has surged, increasing the need for interpretable explanations comprehensible to human inspectors. Existing post-hoc interpretability methods, which often focus on unigram features of single input textual instances, fail to capture the models’ decision-making process fully. Additionally, many methods do not differentiate between decisions based on spurious correlations and those based on a holistic understanding of the input. Our paper introduces DISCO, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions. This method employs a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior. These rules expose potential overfitting and provide insights into misleading feature combinations. We validate DISCO through extensive testing, demonstrating its superiority over existing methods in offering comprehensive insights into complex model behaviors. Our approach successfully identifies all shortcuts manually introduced into the training data (100% detection rate on the MultiRC dataset), resulting in an 18.8% regression in model performance – a capability unmatched by any other method. Furthermore, DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output. This alleviates the burden of abundant instance-wise explanations and helps assess the model’s risk when encountering out-of-distribution (OOD) data.
摘要:随着神经语言模型的快速发展,过度参数化模型的部署激增,对人类检查员可理解的解释性解释的需求也随之增加。现有的后验解释性方法,通常专注于单个输入文本实例的单字特征,未能完全捕捉模型的决策过程。此外,许多方法未能区分基于虚假相关性的决策与基于对输入整体理解的决策。本文介绍了DISCO,一种通过识别与模型预测相关的因果n-gram关联来发现基于规则的全局解释的新方法。该方法采用可扩展的序列挖掘技术,从训练数据中提取相关文本片段,将其与模型预测关联,并进行因果关系检查,以提炼出解释模型行为的稳健规则。这些规则揭示了潜在的过拟合现象,并提供了对误导性特征组合的洞察。我们通过广泛的测试验证了DISCO,证明其在提供对复杂模型行为的全面洞察方面优于现有方法。我们的方法成功识别了所有手动引入训练数据的捷径(在MultiRC数据集上检测率为100%),导致模型性能下降18.8%——这是其他任何方法都无法比拟的能力。此外,DISCO支持交互式解释,使人类检查员能够区分基于规则输出的虚假原因。这减轻了大量实例级解释的负担,并有助于评估模型在遇到分布外(OOD)数据时的风险。

[NLP-25] Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop COLING2025

【速读】: 该论文旨在解决机器学习模型训练中数据标注成本高、耗时长的问题。解决方案的关键在于采用多种策略来加速标注过程、降低成本并减轻人工负担,包括生成合成训练数据、主动学习和混合标注(generating synthetic training data, active learning, and hybrid labeling)。论文不仅介绍了这些策略的基本原理、优缺点,还通过实际案例详细讨论了它们的应用,并提供了管理标注人员和控制数据集质量的最佳实践。此外,论文还设计了一个实践工作坊,指导参与者实施混合标注设置,以优化数据标注项目。

链接: https://arxiv.org/abs/2411.04637
作者: Ekaterina Artemova,Akim Tsvigun,Dominik Schlechtweg,Natalia Fedorova,Sergei Tilga,Boris Obmoroshev
关键词-EN: deploying machine learning, machine learning models, learning models relies, synthetic training data, deploying machine
类目: Computation and Language (cs.CL)
备注: To be presented at COLING 2025

点击查看摘要

Abstract:Training and deploying machine learning models relies on a large amount of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss in detail real-life case studies. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.
摘要:训练和部署机器学习模型依赖于大量的人工标注数据。随着人工标注的成本和时间消耗不断增加,近期研究开发了多种策略以加速标注过程、降低成本并减轻人工负担:生成合成训练数据、主动学习和混合标注。本教程面向实际应用:我们将介绍每种策略的基本原理,突出其优势和局限性,并详细讨论实际案例研究。此外,我们还将探讨管理人工标注员的最佳实践以及控制最终数据集质量的方法。教程中包含一个实践工作坊,参与者将在指导下实施混合标注设置。本教程旨在为来自研究机构和工业界的自然语言处理从业者提供指导,这些从业者参与或对优化数据标注项目感兴趣。

[NLP-26] FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis

【速读】: 该论文试图解决阿尔及利亚方言 (Algerian dialect, AD) 在低资源语言环境下缺乏标注语料库的问题,这一问题严重阻碍了机器学习 (Machine Learning, ML) 应用在该语言中的有效处理。解决方案的关键在于开发了一个名为 FASSILA 的专用语料库,用于假新闻 (Fake News, FN) 检测和情感分析 (Sentiment Analysis, SA)。该语料库包含 10,087 个句子,涵盖超过 19,497 个独特的 AD 词汇,并涉及七个不同领域。论文详细描述了数据收集、清洗和标注的过程,并提出了一种适用于 FN 检测和 SA 的标注方案。显著的标注者间一致性 (Inter-Annotator Agreement) 表明该标注方案能够产生高质量且一致的标注结果。随后,论文展示了基于 BERT 模型和 ML 模型的分类实验,结果显示了有希望的成果,并指出了未来研究的方向。该语料库已在 GitHub 上公开,以促进该领域未来的发展。

链接: https://arxiv.org/abs/2411.04604
作者: Amin Abdedaiem,Abdelhalim Hafedh Dahou,Mohamed Amine Cheragui,Brigitte Mathiak
关键词-EN: faces challenges due, Machine Learning, Algerian dialect, notably in Machine, annotated corpora
类目: Computation and Language (cs.CL)
备注: 16 pages, 6 Figuers

点击查看摘要

Abstract:In the context of low-resource languages, the Algerian dialect (AD) faces challenges due to the absence of annotated corpora, hindering its effective processing, notably in Machine Learning (ML) applications reliant on corpora for training and assessment. This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA. This corpus comprises 10,087 sentences, encompassing over 19,497 unique words in AD, and addresses the significant lack of linguistic resources in the language and covers seven distinct domains. We propose an annotation scheme for FN detection and SA, detailing the data collection, cleaning, and labelling process. Remarkable Inter-Annotator Agreement indicates that the annotation scheme produces consistent annotations of high quality. Subsequent classification experiments using BERT-based models and ML models are presented, demonstrate promising results and highlight avenues for further research. The dataset is made freely available on GitHub (this https URL) to facilitate future advancements in the field.
摘要:在低资源语言的背景下,阿尔及利亚方言 (Algerian Dialect, AD) 由于缺乏标注语料库,面临着在依赖语料库进行训练和评估的机器学习 (Machine Learning, ML) 应用中有效处理的挑战。本研究概述了为 AD 中的假新闻 (Fake News, FN) 检测和情感分析 (Sentiment Analysis, SA) 开发专用语料库 FASSILA 的过程。该语料库包含 10,087 个句子,涵盖超过 19,497 个 AD 中的独特词汇,解决了该语言中显著的语言资源匮乏问题,并涵盖了七个不同的领域。我们提出了一种用于 FN 检测和 SA 的标注方案,详细描述了数据收集、清洗和标注过程。显著的标注者间一致性表明,该标注方案能够产生高质量的一致性标注。随后,使用基于 BERT 的模型和 ML 模型进行的分类实验展示了有前景的结果,并指出了进一步研究的方向。该数据集已在 GitHub 上免费提供(此 https URL),以促进该领域的未来进展。

[NLP-27] Self-Calibrated Listwise Reranking with Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在列表式重排序任务中由于上下文窗口限制而导致的计算成本增加和全局比较信息捕捉不足的问题。解决方案的关键在于提出了一种自校准的列表式重排序方法,该方法通过引入相关性感知的列表式重排序框架和自校准训练机制来实现。具体来说,该框架通过显式的列表视图相关性评分来提高重排序效率,并允许在整个候选集上进行全局比较。自校准训练则利用LLM内部生成的点视图相关性评估来校准列表视图相关性评估,从而确保计算得分的可比性。

链接: https://arxiv.org/abs/2411.04602
作者: Ruiyang Ren,Yuhao Wang,Kun Zhou,Wayne Xin Zhao,Wenjie Wang,Jing Liu,Ji-Rong Wen,Tat-Seng Chua
关键词-EN: Large language models, advanced linguistic capabilities, Large language, language models, linguistic capabilities
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs), with advanced linguistic capabilities, have been employed in reranking tasks through a sequence-to-sequence approach. In this paradigm, multiple passages are reranked in a listwise manner and a textual reranked permutation is generated. However, due to the limited context window of LLMs, this reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. This not only increases computational costs but also restricts the LLM from fully capturing all the comparison information for all candidates. To address these challenges, we propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking. To achieve it, we first propose the relevance-aware listwise reranking framework, which incorporates explicit list-view relevance scores to improve reranking efficiency and enable global comparison across the entire candidate set. Second, to ensure the comparability of the computed scores, we propose self-calibrated training that uses point-view relevance assessments generated internally by the LLM itself to calibrate the list-view relevance assessments. Extensive experiments and comprehensive analysis on the BEIR benchmark and TREC Deep Learning Tracks demonstrate the effectiveness and efficiency of our proposed method.
摘要:大语言模型(Large Language Models, LLMs)凭借其先进的语言能力,已被应用于通过序列到序列的方法进行重排序任务。在这种范式中,多个段落以列表方式进行重排序,并生成文本形式的重排序排列。然而,由于LLMs的上下文窗口有限,这种重排序范式需要采用滑动窗口策略来迭代处理更大的候选集。这不仅增加了计算成本,还限制了LLM全面捕捉所有候选者的比较信息。为了应对这些挑战,我们提出了一种新颖的自校准列表重排序方法,旨在利用LLMs生成全局相关性分数进行排序。为此,我们首先提出了相关性感知的列表重排序框架,该框架结合了显式的列表视图相关性分数,以提高重排序效率并实现整个候选集的全局比较。其次,为了确保计算分数的可比性,我们提出了自校准训练,使用LLM内部生成的点视图相关性评估来校准列表视图相关性评估。在BEIR基准和TREC深度学习轨道上的广泛实验和综合分析证明了我们提出方法的有效性和效率。

[NLP-28] byan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction

【速读】: 该论文试图解决阿拉伯语语法错误纠正(Grammatical Error Correction, GEC)领域中数据资源有限的问题。解决方案的关键在于利用生成式 AI 工具 ChatGPT 来扩充阿拉伯语语法错误纠正的训练数据集,创建了一个名为“Tibyan”的语料库。通过从阿拉伯书籍和开放语料库中收集带有语法错误的句子对,并使用 ChatGPT 生成包含多种类型错误的平行语料,结合语言学专家的验证和迭代优化,最终构建了一个包含约 600K 词符的语料库,涵盖了七种错误类型:正字法、形态学、句法、语义、标点、合并和分割。

链接: https://arxiv.org/abs/2411.04588
作者: Ahlam Alrehili,Areej Alhothali
关键词-EN: sample size constraints, Natural language processing, overcome sample size, sample size, Arabic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 11 figures

点击查看摘要

Abstract:Natural language processing (NLP) utilizes text data augmentation to overcome sample size constraints. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. In this study, we chose Arabic to increase the sample size and correct grammatical errors. Arabic is considered one of the languages with limited resources for grammatical error correction (GEC). Furthermore, QALB-14 and QALB-15 are the only datasets used in most Arabic grammatical error correction research, with approximately 20,500 parallel examples, which is considered low compared with other languages. Therefore, this study aims to develop an Arabic corpus called “Tibyan” for grammatical error correction using ChatGPT. ChatGPT is used as a data augmenter tool based on a pair of Arabic sentences containing grammatical errors matched with a sentence free of errors extracted from Arabic books, called guide sentences. Multiple steps were involved in establishing our corpus, including the collection and pre-processing of a pair of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus based on the text collected previously, as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured that they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Our corpus contained 49 of errors, including seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600 K tokens.
摘要:自然语言处理 (NLP) 利用文本数据增强来克服样本量限制。增加样本量是缓解这些挑战的自然且广泛使用的策略。在本研究中,我们选择阿拉伯语来增加样本量并纠正语法错误。阿拉伯语被认为是语法错误纠正 (GEC) 资源有限的语言之一。此外,QALB-14 和 QALB-15 是大多数阿拉伯语语法错误纠正研究中使用的唯一数据集,约有 20,500 个平行示例,与其他语言相比被认为是较低的。因此,本研究旨在使用 ChatGPT 开发一个名为“Tibyan”的阿拉伯语语法错误纠正语料库。ChatGPT 被用作数据增强工具,基于一对包含语法错误的阿拉伯语句子和从阿拉伯语书籍中提取的无错误句子(称为引导句子)。建立我们的语料库涉及多个步骤,包括从各种来源(如书籍和开放访问语料库)收集和预处理一对阿拉伯语文本。然后,我们使用 ChatGPT 基于之前收集的文本生成平行语料库,作为生成包含多种类型错误的句子的引导。通过邀请语言学专家审查和验证自动生成的句子,我们确保它们是正确且无错误的。根据语言学专家提供的反馈,语料库经过迭代验证和改进以提高其准确性。最后,我们使用阿拉伯语错误类型注释工具 (ARETA) 分析了 Tibyan 语料库中的错误类型。我们的语料库包含 49 种错误,包括七种类型:正字法、形态学、句法、语义、标点符号、合并和分割。Tibyan 语料库包含约 600 K Token。

[NLP-29] he State and Fate of Summarization Datasets

【速读】: 该论文试图解决自动摘要领域中存在的数据集标注不一致和缺乏通用术语的问题,导致难以发现现有资源和确定连贯的研究方向。解决方案的关键在于构建了一个涵盖样本属性、收集方法和分布的新型本体(ontology),并通过调查133个数据集(涵盖超过100种语言)来验证其有效性。该本体揭示了低资源语言高质量数据集的缺乏,以及领域过度依赖新闻数据和自动收集的远监督数据的问题。此外,论文还提供了一个网络界面,供用户交互和探索本体及数据集,以及一个用于摘要数据卡的模板,以促进未来研究的一致性和连贯性。

链接: https://arxiv.org/abs/2411.04585
作者: Noam Dahan,Gabriel Stanovsky
关键词-EN: consistently attracted attention, attracted attention, downstream tasks, consistently attracted, versatility and wide
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic summarization has consistently attracted attention, due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed, and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack in accessible high-quality datasets for low-resource languages, and the field’s over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.
摘要:自动摘要技术因其多功能性和在多种下游任务中的广泛应用而持续受到关注。尽管其受欢迎程度高,我们发现标注工作在很大程度上是分散的,缺乏统一的术语体系。因此,发现现有资源或确定连贯的研究方向变得具有挑战性。为解决这一问题,我们对涵盖133个数据集、超过100种语言的大量工作进行了调研,创建了一个新颖的本体论,涵盖样本属性、收集方法和分布情况。基于此本体论,我们提出了关键观察结果,包括低资源语言高质量数据集的缺乏,以及该领域对新闻领域和自动收集的远监督数据的过度依赖。最后,我们提供了一个网络接口,允许用户交互和探索我们的本体论和数据集集合,以及一个用于摘要数据卡的模板,这些工具可用于促进未来研究形成更加连贯的工作体系。

[NLP-30] Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages

【速读】: 该论文试图解决低资源语言(如Malasar语)在自动语音识别(ASR)系统开发中面临的挑战,特别是由于缺乏数字资源和原生文字系统而导致的困难。解决方案的关键在于采用多阶段微调策略,通过先构建一个中间的泰米尔语ASR模型(Tamil ASR),然后在此基础上对Malasar语数据进行微调。这种策略利用了泰米尔语与Malasar语之间的语言相似性,有效弥补了Malasar语数据稀缺的问题。实验结果表明,多阶段微调策略显著提升了ASR性能,相较于直接在Malasar语数据上微调,实现了4.5%的绝对词错误率(WER)降低。此外,通过后处理中的标点符号去除,进一步将WER降低至47.3%,解决了格式不一致对评估的影响。这一方法强调了在低资源语言ASR系统开发中,利用语言相似性和针对性后处理策略的重要性。

链接: https://arxiv.org/abs/2411.04573
作者: Leena G Pillai,Kavya Manohar,Basil K Raju,Elizabeth Sherly
关键词-EN: OpenAI Whisper model, OpenAI Whisper, automatic speech recognition, Whisper model, enhance automatic speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI’s Whisper model. In this approach we aim to build ASR model for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. We experimented this on the Malasar language, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. Malasar language faces critical challenges for technological intervention due to its lack of a native script and absence of digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcription in Tamil script, a closely related major language. In our approach to build ASR model for Malasar, we first build an intermediate Tamil ASR, leveraging higher data availability for Tamil annotated speech. This intermediate model is subsequently fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy demonstrated significant improvements over direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, which is 4.5% absolute reduction when compared to the direct fine-tuning method. Further a WER reduction to 47.3% was achieved through punctuation removal in post-processing, which addresses formatting inconsistencies that impact evaluation. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
摘要:本文提出了一种新颖的多阶段微调策略,旨在利用 OpenAI 的 Whisper 模型提升低资源语言的自动语音识别 (ASR) 性能。该方法通过在语言学上相似的语言之间逐步适应模型,为数字资源有限的语言构建 ASR 模型。我们在 Malasar 语言上进行了实验,这是一种在印度南部西高止山脉约有万人使用的达罗毗荼语。由于缺乏本土文字和数字或语音数据资源,Malasar 语言在技术干预方面面临重大挑战。在与 Wycliffe India 和 Malasar 社区成员的合作中,我们创建了一个包含 Tamil 文字转录的 Malasar 口语语料库,Tamil 是一种密切相关的主要语言。在为 Malasar 构建 ASR 模型的过程中,我们首先利用 Tamil 标注语音的高数据可用性构建了一个中间的 Tamil ASR 模型。随后,该中间模型在 Malasar 数据上进行微调,尽管资源有限,但仍能实现更有效的 ASR 适应。多阶段微调策略相较于直接在 Malasar 数据上进行微调,显著提升了性能,实现了 51.9% 的词错误率 (WER),比直接微调方法降低了 4.5%。此外,通过后处理中的标点符号去除,进一步将 WER 降低至 47.3%,解决了影响评估的格式不一致问题。我们的结果强调了顺序多阶段微调与针对性后处理相结合作为一种可扩展策略,在低资源语言的 ASR 系统开发中的有效性,特别是在可以利用语言相似性来弥补训练数据缺口的情况下。

[NLP-31] Pruning Literals for Highly Efficient Explainability at Word Level

【速读】: 该论文试图解决现有自然语言处理 (NLP) 模型在解释预测结果时透明度不足的问题。解决方案的关键在于设计了一种后处理剪枝方法,用于简化Tsetlin Machine ™ 中的子句结构,从而提高模型的可解释性。具体来说,该方法通过消除子句中随机放置的文字(命题逻辑),使得模型的解释更加高效和易于理解。实验结果表明,经过剪枝的TM在YELP-HAT数据集上的注意力图与人类注意力图更为一致,且在成对相似性度量上也优于基于注意力图的神经网络模型。尽管剪枝方法在准确性上没有显著下降,反而在某些测试数据上提升了4%到9%的性能。

链接: https://arxiv.org/abs/2411.04557
作者: Rohan Kumar Yadav,Bimal Bhattarai,Abhik Jana,Lei Jiao,Seid Muhie Yimam
关键词-EN: Natural Language Processing, Language Processing, Natural Language, Designing an explainable, machine learning models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Designing an explainable model becomes crucial now for Natural Language Processing(NLP) since most of the state-of-the-art machine learning models provide a limited explanation for the prediction. In the spectrum of an explainable model, Tsetlin Machine™ is promising because of its capability of providing word-level explanation using proposition logic. However, concern rises over the elaborated combination of literals (propositional logic) in the clause that makes the model difficult for humans to comprehend, despite having a transparent learning process. In this paper, we design a post-hoc pruning of clauses that eliminate the randomly placed literals in the clause thereby making the model more efficiently interpretable than the vanilla TM. Experiments on the publicly available YELP-HAT Dataset demonstrate that the proposed pruned TM’s attention map aligns more with the human attention map than the vanilla TM’s attention map. In addition, the pairwise similarity measure also surpasses the attention map-based neural network models. In terms of accuracy, the proposed pruning method does not degrade the accuracy significantly but rather enhances the performance up to 4% to 9% in some test data.
摘要:在自然语言处理 (Natural Language Processing, NLP) 领域,设计一个可解释的模型变得至关重要,因为大多数最先进的机器学习模型在预测时提供的解释有限。在可解释模型的范畴中,Tsetlin 机器 (Tsetlin Machine, TM) 因其能够使用命题逻辑提供词级别的解释而显示出潜力。然而,由于子句中命题逻辑的复杂组合,尽管学习过程透明,模型仍然难以被人类理解。本文设计了一种后处理剪枝方法,通过消除子句中随机放置的命题逻辑,使得模型比原始的 TM 更具高效的可解释性。在公开的 YELP-HAT 数据集上的实验表明,所提出的剪枝 TM 的注意力图比原始 TM 的注意力图更符合人类的注意力图。此外,成对相似性度量也超过了基于注意力图的神经网络模型。在准确性方面,所提出的剪枝方法并未显著降低准确性,反而在某些测试数据上提升了 4% 至 9% 的性能。

[NLP-32] Best Practices for Distilling Large Language Models into BERT for Web Search Ranking

【速读】: 该论文试图解决大型语言模型(LLMs)在商业搜索系统中直接应用的高成本问题,解决方案的关键在于将LLMs的排序能力迁移到更紧凑的模型(如BERT)上。具体方法包括通过持续预训练(Continued Pre-Training)增强LLMs的训练,使用查询作为输入,点击的标题和摘要作为输出,并进行监督微调,采用排序损失(rank loss)将最终标记作为整个句子的代表。此外,引入混合点对点和边际均方误差损失(hybrid point-wise and margin MSE loss),以实现从LLMs到BERT等较小模型的排序知识迁移。这种方法在资源受限的环境中提供了一个可行的解决方案,并通过离线和在线评估验证了其有效性。

链接: https://arxiv.org/abs/2411.04539
作者: Dezhi Ye,Junwei Hu,Jiabin Fan,Bowen Tian,Jie Liu,Haijin Liang,Jin Ma
关键词-EN: zero-shot relevance rankers, Large Language Models, Recent studies, Large Language, studies have highlighted
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Arxiv Version

点击查看摘要

Abstract:Recent studies have highlighted the significant potential of Large Language Models (LLMs) as zero-shot relevance rankers. These methods predominantly utilize prompt learning to assess the relevance between queries and documents by generating a ranked list of potential documents. Despite their promise, the substantial costs associated with LLMs pose a significant challenge for their direct implementation in commercial search systems. To overcome this barrier and fully exploit the capabilities of LLMs for text ranking, we explore techniques to transfer the ranking expertise of LLMs to a more compact model similar to BERT, using a ranking loss to enable the deployment of less resource-intensive models. Specifically, we enhance the training of LLMs through Continued Pre-Training, taking the query as input and the clicked title and summary as output. We then proceed with supervised fine-tuning of the LLM using a rank loss, assigning the final token as a representative of the entire sentence. Given the inherent characteristics of autoregressive language models, only the final token /s can encapsulate all preceding tokens. Additionally, we introduce a hybrid point-wise and margin MSE loss to transfer the ranking knowledge from LLMs to smaller models like BERT. This method creates a viable solution for environments with strict resource constraints. Both offline and online evaluations have confirmed the efficacy of our approach, and our model has been successfully integrated into a commercial web search engine as of February 2024.
摘要:近期研究表明,大语言模型 (LLMs) 作为零样本相关性排序器具有显著潜力。这些方法主要通过提示学习来评估查询与文档之间的相关性,生成潜在文档的排序列表。尽管这些方法前景广阔,但与 LLMs 相关的巨大成本成为其在商业搜索系统中直接实施的主要障碍。为克服这一障碍并充分利用 LLMs 在文本排序方面的能力,我们探索了将 LLMs 的排序专长转移至类似 BERT 的更紧凑模型中的技术,通过排序损失来实现资源消耗较少的模型的部署。具体而言,我们通过持续预训练 (Continued Pre-Training) 来增强 LLMs 的训练,以查询为输入,点击的标题和摘要为输出。随后,我们使用排序损失对 LLM 进行监督微调,将最终 Token 作为整个句子的代表。鉴于自回归语言模型的固有特性,只有最终 Token /s 能够封装所有先前的 Token。此外,我们引入了一种混合点对点和边际均方误差损失 (hybrid point-wise and margin MSE loss),以将排序知识从 LLMs 转移至 BERT 等较小模型。该方法为资源严格受限的环境提供了一种可行的解决方案。离线和在线评估均证实了我们方法的有效性,我们的模型已于 2024 年 2 月成功集成到商业网页搜索引擎中。

[NLP-33] Meta-Reasoning Improves Tool Use in Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在特定任务中表现不佳的问题,特别是通过外部工具的使用来提升其性能。解决方案的关键在于提出了一种名为Tool selECTion via meta-reasONing (TECTON)的两阶段系统。首先,通过自定义微调的语言模型头部(LM head)对任务进行推理,生成候选工具;然后,在禁用自定义头部的情况下,进行元推理(meta-reasoning),即对前一步的推理过程进行再推理,以做出最终的工具选择。这种方法利用了冻结模型的泛化能力,显著提升了在数学推理数据集上的表现,无论是在分布内还是分布外任务中。

链接: https://arxiv.org/abs/2411.04535
作者: Lisa Alazraki,Marek Rei
关键词-EN: large language models, External tools, typically fail, large language, External
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:External tools help large language models (LLMs) succeed at tasks where they would otherwise typically fail. In existing frameworks, LLMs learn tool use either by in-context demonstrations or via full model fine-tuning on annotated data. As these approaches do not easily scale, a recent trend is to abandon them in favor of lightweight, parameter-efficient tuning paradigms. These methods allow quickly alternating between the frozen LLM and its specialised fine-tuned version, by switching on or off a handful of additional custom parameters. Hence, we postulate that the generalization ability of the frozen model can be leveraged to improve tool selection. We present Tool selECTion via meta-reasONing (TECTON), a two-phase system that first reasons over a task using a custom fine-tuned LM head and outputs candidate tools. Then, with the custom head disabled, it meta-reasons (i.e., it reasons over the previous reasoning process) to make a final choice. We show that TECTON results in substantial gains - both in-distribution and out-of-distribution - on a range of math reasoning datasets.
摘要:外部工具帮助大语言模型 (LLMs) 在通常会失败的情境中取得成功。在现有的框架中,LLMs 通过上下文演示或通过在标注数据上进行全模型微调来学习使用工具。由于这些方法不易扩展,最近的趋势是放弃它们,转而采用轻量级、参数高效的微调范式。这些方法通过切换少量额外的自定义参数,允许在冻结的 LLM 和其专门的微调版本之间快速切换。因此,我们假设冻结模型的泛化能力可以被利用来改进工具选择。我们提出了通过元推理进行工具选择 (TECTON),这是一个两阶段的系统,首先使用自定义微调的 LM 头对任务进行推理并输出候选工具。然后,在禁用自定义头的情况下,进行元推理(即对之前的推理过程进行推理)以做出最终选择。我们展示了 TECTON 在各种数学推理数据集上,无论是分布内还是分布外,都取得了显著的提升。

[NLP-34] omato Tomahto Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models

【速读】: 该论文试图解决的问题是:在仅使用编码器(encoder-only)的多语言语言模型(multilingual language models, mLMs)中,子词(subwords)之间的共享语义在多大程度上能够影响模型的性能。解决方案的关键在于通过合并语义相似的子词及其嵌入向量(embeddings),形成“语义标记”(semantic tokens),并评估更新后的mLMs在5个异构多语言下游任务中的表现。研究结果表明,共享语义能够显著提升模型在不同分词器(tokenizers)和模型规模下的预测能力。此外,零样本(zero-shot)实验结果显示,使用语义标记的模型在某些分类任务中的表现甚至优于原始模型,这表明子词级别的共享语义可能作为跨语言迁移的锚点(anchors)。

链接: https://arxiv.org/abs/2411.04530
作者: Xinyu Zhang,Jing Lu,Vinh Q. Tran,Tal Schuster,Donald Metzler,Jimmy Lin
关键词-EN: Human understanding, similar semantic concepts, word choices, human intuition transfer, represent similar semantic
类目: Computation and Language (cs.CL)
备注: 8 pages, 9 figures

点击查看摘要

Abstract:Human understanding of language is robust to different word choices as far as they represent similar semantic concepts. To what extent does our human intuition transfer to language models, which represent all subwords as distinct embeddings? In this work, we take an initial step on measuring the role of shared semantics among subwords in the encoder-only multilingual language models (mLMs). To this end, we form “semantic tokens” by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on 5 heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections on the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we found the zero-shot results with semantic tokens are on par or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transferring.
摘要:人类的语言理解能力对于不同的词汇选择具有较强的鲁棒性,只要这些词汇代表相似的语义概念。然而,对于将所有子词表示为不同嵌入的语言模型,这种人类直觉在多大程度上能够迁移?在本研究中,我们首次探讨了仅编码器多语言语言模型 (mLMs) 中子词间共享语义的作用。为此,我们通过合并语义相似的子词及其嵌入,形成“语义 Token (semantic tokens)”,并在 5 个异质多语言下游任务上评估更新后的 mLMs。结果表明,普遍的共享语义能够使模型在不同 Tokenizer 和模型规模下进行预测时取得显著进展。对分组子词的检查显示,它们表现出广泛的语义相似性,包括多种语言和文字中的同义词和翻译。最后,我们发现使用语义 Token 的零样本结果在某些分类任务上与原始模型相当,甚至在某些情况下更优,这表明子词级别的共享语义可能作为跨语言迁移的锚点。

[NLP-35] hanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model

【速读】: 该论文试图解决大型语言模型(LLM)在复杂社交对话中难以像人类一样规划适当的对话技能的问题。解决方案的关键在于提出了一个名为“Multifaceted Skill-of-Mind”的技能标注对话数据集,该数据集包含了多轮和多方面的对话技能,涵盖了各种互动场景(如长期关系、咨询、任务导向),并基于多样化的社交背景(如人口统计学、人物角色、经验法则)。基于此数据集,论文引入了一组名为“Thanos”的技能注入型LLM,模型规模分别为1B、3B和8B参数。通过广泛的实验,这些模型成功展示了技能规划过程,并在多个领域中表现出强大的泛化能力。此外,Thanos显著提升了基于LLM的对话代理生成的回复质量,并在人类评估中促进了亲社会行为。

链接: https://arxiv.org/abs/2411.04496
作者: Young-Jun Lee,Dokyong Lee,Junyoung Youn,Kyeongjin Oh,Ho-Jin Choi
关键词-EN: increase social bonding, humans naturally acquire, bonding with interlocutors, naturally acquire, acquire the ability
类目: Computation and Language (cs.CL)
备注: Code: this https URL

点击查看摘要

Abstract:To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For large language model (LLM)-based conversational agents, planning appropriate conversational skills, as humans do, is challenging due to the complexity of social dialogue, especially in interactive scenarios. To address this, we propose a skill-of-mind-annotated conversation dataset, named Multifaceted Skill-of-Mind, which includes multi-turn and multifaceted conversational skills across various interactive scenarios (e.g., long-term, counseling, task-oriented), grounded in diverse social contexts (e.g., demographics, persona, rules of thumb). This dataset consists of roughly 100K conversations. Using this dataset, we introduce a new family of skill-of-mind-infused LLMs, named Thanos, with model sizes of 1B, 3B, and 8B parameters. With extensive experiments, these models successfully demonstrate the skill-of-mind process and exhibit strong generalizability in inferring multifaceted skills across a variety of domains. Moreover, we show that Thanos significantly enhances the quality of responses generated by LLM-based conversational agents and promotes prosocial behavior in human evaluations.
摘要:为了增强与对话者的社会联系,人类自然地通过考虑哪种对话技巧最适合当前情境来做出适当的回应——这一过程我们称之为“心智技能”。对于基于大语言模型 (LLM) 的对话智能体,像人类一样规划适当的对话技巧是极具挑战性的,尤其是在互动场景中,社交对话的复杂性尤为突出。为了解决这一问题,我们提出了一种名为“多面心智技能”的心智技能标注对话数据集,该数据集包含了多轮和多面的对话技巧,涵盖了各种互动场景(例如,长期关系、咨询、任务导向),并基于多样化的社会背景(例如,人口统计学、人物角色、经验法则)。该数据集包含约 10 万条对话。利用这一数据集,我们引入了一类新型的心智技能融入式 LLM,命名为 Thanos,其模型规模分别为 1B、3B 和 8B 参数。通过广泛的实验,这些模型成功展示了心智技能过程,并在推断跨多个领域的多面技能方面表现出强大的泛化能力。此外,我们证明 Thanos 显著提升了基于 LLM 的对话智能体生成回应的质量,并在人类评估中促进了亲社会行为。

[NLP-36] ML-Promise: A Multilingual Dataset for Corporate Promise Verification

【速读】: 该论文试图解决政治家、企业领袖和公众人物承诺的可信度评估问题,特别是在环境、社会和治理(ESG)报告中的承诺。解决方案的关键在于引入“承诺验证”(Promise Verification)的概念,通过系统化的步骤如承诺识别、证据评估和验证时机的评估来实现。论文提出了首个多语言数据集ML-Promise,涵盖英语、法语、中文、日语和韩语,旨在促进对承诺的深入验证,特别是针对企业环境贡献的承诺,以应对绿色清洗等实践带来的挑战。此外,论文还探讨了基于文本和图像的基线方法,并展示了检索增强生成(RAG)方法的潜力,旨在推动多语言和多领域中公共承诺责任性的进一步讨论。

链接: https://arxiv.org/abs/2411.04473
作者: Yohei Seki,Hakusen Shu,Anaïs Lhuissier,Hanwool Lee,Juyeon Kang,Min-Yuh Day,Chung-Chi Chen
关键词-EN: made by politicians, institutional reputation, significant impact, Promises made, corporate leaders
类目: Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:Promises made by politicians, corporate leaders, and public figures have a significant impact on public perception, trust, and institutional reputation. However, the complexity and volume of such commitments, coupled with difficulties in verifying their fulfillment, necessitate innovative methods for assessing their credibility. This paper introduces the concept of Promise Verification, a systematic approach involving steps such as promise identification, evidence assessment, and the evaluation of timing for verification. We propose the first multilingual dataset, ML-Promise, which includes English, French, Chinese, Japanese, and Korean, aimed at facilitating in-depth verification of promises, particularly in the context of Environmental, Social, and Governance (ESG) reports. Given the growing emphasis on corporate environmental contributions, this dataset addresses the challenge of evaluating corporate promises, especially in light of practices like greenwashing. Our findings also explore textual and image-based baselines, with promising results from retrieval-augmented generation (RAG) approaches. This work aims to foster further discourse on the accountability of public commitments across multiple languages and domains.
摘要:政治家、企业领袖和公众人物的承诺对公众认知、信任和机构声誉具有重大影响。然而,这些承诺的复杂性和数量,加上验证其履行情况的困难,需要创新的方法来评估其可信度。本文提出了承诺验证的概念,这是一种系统化的方法,涉及承诺识别、证据评估和验证时机评估等步骤。我们提出了首个多语言数据集 ML-Promise,该数据集包括英语、法语、中文、日语和韩语,旨在促进对承诺的深入验证,特别是在环境、社会和治理(ESG)报告的背景下。鉴于企业环境贡献日益受到重视,该数据集解决了评估企业承诺的挑战,特别是在面对绿色清洗等实践的情况下。我们的研究还探讨了基于文本和图像的基线方法,并展示了检索增强生成(RAG)方法的潜力。这项工作旨在促进关于多语言和多领域公共承诺责任性的进一步讨论。

[NLP-37] Gradient Localization Improves Lifelong Pretraining of Language Models EMNLP

【速读】: 该论文试图解决大语言模型(LLMs)在存储不同类型知识时,其知识存储机制不明确的问题。特别是,论文关注于时间敏感实体相关的两种知识类型,并发现这些知识类型在LLMs的参数中分布于不同的子集。论文的关键解决方案在于识别出这些知识类型对应的参数子集,并通过针对这些特定层的参数更新来改进持续预训练的效果,从而解决新信息吸收失败和先前学习信息遗忘的问题。

链接: https://arxiv.org/abs/2411.04448
作者: Jared Fernandez,Yonatan Bisk,Emma Strubell
关键词-EN: Large Language Models, web-scale text corpora, capture world knowledge, Large Language, language models store
类目: Computation and Language (cs.CL)
备注: EMNLP Findings 2024

点击查看摘要

Abstract:Large Language Models (LLMs) trained on web-scale text corpora have been shown to capture world knowledge in their parameters. However, the mechanism by which language models store different types of knowledge is poorly understood. In this work, we examine two types of knowledge relating to temporally sensitive entities and demonstrate that each type is localized to different sets of parameters within the LLMs. We hypothesize that the lack of consideration of the locality of knowledge in existing continual learning methods contributes to both: the failed uptake of new information, and catastrophic forgetting of previously learned information. We observe that sequences containing references to updated and newly mentioned entities exhibit larger gradient norms in a subset of layers. We demonstrate that targeting parameter updates to these relevant layers can improve the performance of continually pretraining on language containing temporal drift.
摘要:基于网络规模文本语料库训练的大语言模型 (LLM) 已被证明能够在其参数中捕捉世界知识。然而,语言模型存储不同类型知识的机制尚不明确。在本研究中,我们探讨了与时间敏感实体相关的两种知识类型,并证明每种类型都局限于 LLM 中不同的参数集合。我们假设,现有持续学习方法中对知识局部性缺乏考虑,导致了新信息的吸收失败以及先前学习信息的灾难性遗忘。我们观察到,包含对更新和新增实体引用的序列在某些层中表现出更大的梯度范数。我们证明,将参数更新集中在这些相关层上,可以提高在包含时间漂移的语言上进行持续预训练的性能。

[NLP-38] ACCIO: Table Understanding Enhanced via Contrastive Learning with Aggregations

【速读】: 该论文试图解决通过对比学习增强表格理解的问题。解决方案的关键在于引入了一种名为ACCIO(tAble understanding enhanCed via Contrastive learnIng with aggregatiOns)的新方法,该方法通过对比原始表格与其枢轴摘要(pivot summaries)来进行对比学习,从而训练一个编码器将这些表格对拉近。这种方法首次尝试利用表格对进行表格嵌入,通过列类型注释的验证,ACCIO在宏观F1得分上达到了91.1,与最先进的方法相比表现出色,显示出在表格理解领域的显著进步潜力。

链接: https://arxiv.org/abs/2411.04443
作者: Whanhee Cho
关键词-EN: recent natural language, natural language models, recent natural, natural language, language models
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The attention to table understanding using recent natural language models has been growing. However, most related works tend to focus on learning the structure of the table directly. Just as humans improve their understanding of sentences by comparing them, they can also enhance their understanding by comparing tables. With this idea, in this paper, we introduce ACCIO, tAble understanding enhanCed via Contrastive learnIng with aggregatiOns, a novel approach to enhancing table understanding by contrasting original tables with their pivot summaries through contrastive learning. ACCIO trains an encoder to bring these table pairs closer together. Through validation via column type annotation, ACCIO achieves competitive performance with a macro F1 score of 91.1 compared to state-of-the-art methods. This work represents the first attempt to utilize pairs of tables for table embedding, promising significant advancements in table comprehension. Our code is available at this https URL.
摘要:近年来,利用自然语言模型进行表格理解的关注度逐渐增加。然而,大多数相关研究倾向于直接学习表格的结构。正如人类通过比较句子来提升对句子的理解一样,他们也可以通过比较表格来增强对表格的理解。基于这一思想,本文提出了ACCIO,即通过对比学习与聚合增强表格理解(tAble understanding enhanCed via Contrastive learnIng with aggregatiOns),这是一种通过对比原始表格与其枢轴摘要来增强表格理解的新方法。ACCIO训练一个编码器,使这些表格对更加接近。通过列类型注释的验证,ACCIO在与最先进方法的比较中取得了具有竞争力的表现,宏F1得分为91.1。这项工作首次尝试利用表格对进行表格嵌入,有望在表格理解方面取得显著进展。我们的代码可在以下链接获取:https URL。

[NLP-39] One fish two fish but not the whole sea: Alignment reduces language models conceptual diversity

【速读】: 该论文试图解决的问题是:在行为研究中使用大型语言模型 (LLMs) 替代人类时,这些模型是否能够捕捉到人类概念多样性,以及训练后对齐 (post-training alignment) 是否会影响模型的内部多样性。解决方案的关键在于提出了一种新的方法来测量合成生成的 LLM “群体” 的概念多样性,通过将模拟个体的内部变异性与群体级别的变异性相关联。研究结果表明,尽管没有模型达到人类级别的多样性,但对齐模型通常显示出比指令微调模型更低的多样性,这突显了在增加模型价值对齐与减少概念表示多样性之间的潜在权衡。

链接: https://arxiv.org/abs/2411.04427
作者: Sonia K. Murthy,Tomer Ullman,Jennifer Hu
关键词-EN: Researchers in social, large language models, social science, science and psychology, psychology have recently
类目: Computation and Language (cs.CL)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models’ internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM “populations” by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models’ value alignment and decreasing the diversity of their conceptual representations.
摘要:社会科学和心理学领域的研究人员最近提出,使用大语言模型 (LLMs) 作为行为研究中人类的替代品。除了关于 LLMs 是否准确捕捉群体水平模式的争论外,这一提议还引发了关于 LLMs 是否能捕捉人类概念多样性的问题。此外,关于训练后对齐 (RLHF 或 RLAIF) 是否影响模型内部多样性的讨论也在进行中。受人类研究的启发,我们通过将模拟个体的内部变异性与群体水平的变异性相关联,采用了一种新的方法来测量合成生成 LLM “群体”的概念多样性。我们使用这种方法在两个具有丰富人类行为数据的领域中评估了未对齐和对齐的 LLMs。尽管没有模型达到人类水平的多样性,但对齐模型通常表现出比其指令微调的对应模型更少的多样性。我们的研究结果突显了在增加模型价值对齐与减少其概念表示多样性之间可能存在的权衡。

[NLP-40] DELIFT: Data Efficient Language model Instruction Fine Tuning

【速读】: 该论文试图解决在大语言模型(LLMs)微调过程中,由于数据冗余或不具信息性导致的资源密集问题。解决方案的关键是引入了一种名为DELIFT(Data Efficient Language model Instruction Fine-Tuning)的新算法,该算法通过系统优化数据选择,涵盖微调的三个关键阶段:指令调优、任务特定微调(如推理、问答)和持续微调(如整合新数据版本)。DELIFT的核心在于采用了一种成对效用度量,该度量能够量化数据样本对模型响应其他样本的改进效果,从而有效衡量样本相对于模型当前能力的信息价值。通过应用不同的子模块函数到这一度量上,DELIFT能够选择多样且最优的子集,这些子集在所有微调阶段都具有效用。实验结果表明,DELIFT能够在不牺牲性能的前提下,将微调数据量减少高达70%,显著节省计算资源,并在效率和效果上优于现有方法。

链接: https://arxiv.org/abs/2411.04425
作者: Ishika Agarwal,Krishna Killamsetty,Lucian Popa,Marina Danilevksy
关键词-EN: large language models, Efficient Language model, Fine-tuning large language, Data Efficient Language, Language model Instruction
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual fine-tuning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model’s responses to other samples, effectively measuring the informational value relative to the model’s current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.
摘要:微调大语言模型(Large Language Models, LLMs)对于提升其在特定任务上的表现至关重要,但由于冗余或无信息数据的存在,这一过程通常非常耗费资源。为了解决这一效率问题,我们提出了 DELIFT(数据高效语言模型指令微调),这是一种新颖的算法,系统地优化了微调过程中的三个关键阶段的数据选择:(1)指令微调,(2)任务特定微调(例如,推理、问答),以及(3)持续微调(例如,整合新数据版本)。与现有专注于单一阶段优化或依赖计算密集型梯度计算的方法不同,DELIFT 在所有阶段都能高效运作。我们的方法核心在于一种成对效用度量,该度量量化了数据样本对改善模型对其他样本响应的有益程度,从而有效地衡量了相对于模型当前能力的信息价值。通过利用应用于该度量的不同子模块函数,DELIFT 选择出多样且最优的子集,这些子集在微调的所有阶段都具有实用性。在各种任务和模型规模上的实验表明,DELIFT 能够在不牺牲性能的情况下将微调数据量减少高达 70%,显著节省计算资源,并在效率和效果上优于现有方法。

[NLP-41] Bayesian Calibration of Win Rate Estimation with LLM Evaluators EMNLP2024

【速读】: 该论文试图解决使用大型语言模型(LLMs)作为评估器时,由于其固有的胜率估计偏差而导致的不准确性问题。解决方案的关键在于提出了两种校准方法:贝叶斯胜率采样(Bayesian Win Rate Sampling, BWRS)和贝叶斯Dawid-Skene(Bayesian Dawid-Skene),这两种方法均利用贝叶斯推断来更准确地推断生成式语言模型的真实胜率。通过在涵盖故事生成、摘要生成和指令跟随任务的六个数据集上的实证验证,证明了这两种方法在提高LLMs作为评估器的胜率估计准确性方面的有效性。

链接: https://arxiv.org/abs/2411.04424
作者: Yicheng Gao,Gonghan Xu,Zhe Wang,Arman Cohan
关键词-EN: Recent advances, large language models, win rate, Win Rate Sampling, win rate estimation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024

点击查看摘要

Abstract:Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generations from LLMs. However, applying LLM evaluators naively to compare or judge between different systems can lead to unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. In order to mitigate this problem, we propose two calibration methods, Bayesian Win Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks. We show that both our methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation.
摘要:近年来,大语言模型 (LLM) 的进步展示了将 LLM 用作评估器以评估从 LLM 生成的文本质量的潜力。然而,直接应用 LLM 评估器来比较或判断不同系统可能会导致结果不可靠,这是由于 LLM 评估器固有的胜率估计偏差。为了缓解这一问题,我们提出了两种校准方法:贝叶斯胜率采样 (Bayesian Win Rate Sampling, BWRS) 和贝叶斯 Dawid-Skene,这两种方法都利用贝叶斯推断来更准确地推断生成式语言模型的真实胜率。我们在涵盖故事生成、摘要生成和指令跟随任务的六个数据集上实证验证了我们的方法。结果表明,这两种方法都能有效提高使用 LLM 作为评估器时的胜率估计准确性,为可靠的自动文本质量评估提供了有前景的方向。

[NLP-42] Variational Low-Rank Adaptation Using IVON NEURIPS2024

【速读】: 该论文试图解决在大规模语言模型(如Llama-2)中,使用低秩适应(Low-Rank Adaptation, LoRA)时,如何在不显著增加成本的前提下提高模型精度和校准度的问题。解决方案的关键在于使用改进的变分在线牛顿法(Improved Variational Online Newton, IVON)替代传统的AdamW优化算法进行微调。IVON在保持较低成本和易于实现的同时,显著提升了模型的准确性和预期校准误差,为大规模语言模型的优化提供了新的有效途径。

链接: https://arxiv.org/abs/2411.04421
作者: Bai Cong,Nico Daheim,Yuesong Shen,Daniel Cremers,Rio Yokota,Mohammad Emtiyaz Khan,Thomas Möllenhoff
关键词-EN: Variational Online Newton, Improved Variational Online, Low-Rank Adaptation, learning can significantly, substantial increase
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Published at 38th Workshop on Fine-Tuning in Machine Learning (NeurIPS 2024). Code available at this https URL

点击查看摘要

Abstract:We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models. The code is available at this https URL.
摘要:我们展示了变分学习能够显著提升低秩适应 (LoRA) 的准确性和校准效果,且不会大幅增加成本。我们采用改进的变分在线牛顿 (IVON) 算法替代 AdamW 来微调大语言模型。对于拥有 70 亿参数的 Llama-2,IVON 在准确性上比 AdamW 提升了 2.8%,预期校准误差降低了 4.6%。此外,IVON 的准确性优于其他贝叶斯方法,但成本更低且实现更为简便。我们的研究进一步证明了 IVON 在大语言模型中的有效性。相关代码可在以下链接获取:https URL。

[NLP-43] Measuring short-form factuality in large language models

【速读】: 该论文试图解决的问题是如何有效评估语言模型在回答简短、事实性问题时的能力。解决方案的关键在于设计了一个名为SimpleQA的基准测试,该测试具有两个主要特点:一是测试具有挑战性,因为问题是对抗性地收集的,旨在挑战GPT-4的回答;二是评分简单,因为每个问题只有一个无可争议的正确答案,评分标准为正确、错误或未尝试。SimpleQA的核心目标是评估模型是否“知道自己知道什么”,并希望该基准在未来几代前沿模型中仍然具有相关性。

链接: https://arxiv.org/abs/2411.04368
作者: Jason Wei,Nguyen Karina,Hyung Won Chung,Yunxin Joy Jiao,Spencer Papay,Amelia Glaese,John Schulman,William Fedus
关键词-EN: evaluates the ability, ability of language, SimpleQA, fact-seeking questions, answer short
类目: Computation and Language (cs.CL)
备注: Blog post: this https URL

点击查看摘要

Abstract:We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models “know what they know,” and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at this https URL.
摘要:我们提出了 SimpleQA,这是一个评估语言模型回答简短、事实性问题的能力的基准。在设计这个评估时,我们优先考虑了两个特性。首先,SimpleQA 具有挑战性,因为它是在对抗 GPT-4 响应的情况下收集的。其次,响应易于评分,因为问题的设置使得只有一个无可争议的正确答案。SimpleQA 中的每个答案都被评定为正确、错误或未尝试。一个理想行为的模型应尽可能多地回答正确的问题,同时不尝试那些它不确定知道正确答案的问题。SimpleQA 是一个简单、有针对性的评估,用于判断模型是否“知道自己知道什么”,我们希望这个基准能在未来几代前沿模型中保持相关性。SimpleQA 可以在以下链接找到:https URL。

[NLP-44] Robust and Efficient Fine-tuning of LLM s with Bayesian Reparameterization of Low-Rank Adaptation

【速读】: 该论文试图解决大语言模型(LLMs)在低秩适应(low-rank adaptation)微调过程中由于超参数选择敏感性导致的模型性能不稳定问题。解决方案的关键在于提出了一种名为MonteCLoRA的高效微调技术,通过蒙特卡罗估计(Monte Carlo estimation)学习低秩参数的无偏后验估计,从而在仅增加O(1)额外参数的情况下显著降低估计方差,增强微调后模型的稳定性和性能。该方法在自然语言理解任务和生成任务中均表现出更高的准确性和鲁棒性,特别是在预训练的RoBERTa-base和LLaMA-1-7B模型上,分别实现了高达3.8%的准确性提升和50%的方差降低。

链接: https://arxiv.org/abs/2411.04358
作者: Vaibhav Seth,Arinjay Pathak,Ayan Sengupta,Natraj Raman,Sriram Gopalakrishnan,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, enormous size, highly resource-intensive, resource-intensive to fine-tune
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 48 pages, 10 figures, 10 tables, Code: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are highly resource-intensive to fine-tune due to their enormous size. While low-rank adaptation is a prominent parameter-efficient fine-tuning approach, it suffers from sensitivity to hyperparameter choices, leading to instability in model performance on fine-tuning downstream tasks. This paper highlights the importance of effective parameterization in low-rank fine-tuning to reduce estimator variance and enhance the stability of final model outputs. We propose MonteCLoRA, an efficient fine-tuning technique, employing Monte Carlo estimation to learn an unbiased posterior estimation of low-rank parameters with low expected variance, which stabilizes fine-tuned LLMs with only O(1) additional parameters. MonteCLoRA shows significant improvements in accuracy and robustness, achieving up to 3.8% higher accuracy and 8.6% greater robustness than existing efficient fine-tuning methods on natural language understanding tasks with pre-trained RoBERTa-base. Furthermore, in generative tasks with pre-trained LLaMA-1-7B, MonteCLoRA demonstrates robust zero-shot performance with 50% lower variance than the contemporary efficient fine-tuning methods. The theoretical and empirical results presented in the paper underscore how parameterization and hyperpriors balance exploration-exploitation in the low-rank parametric space, therefore leading to more optimal and robust parameter estimation during efficient fine-tuning.
摘要:大语言模型 (LLM) 由于其庞大的规模,微调过程需要消耗大量资源。尽管低秩适应是一种显著的参数高效微调方法,但它对超参数选择的敏感性导致了在微调下游任务时模型性能的不稳定性。本文强调了在低秩微调中有效参数化的重要性,以减少估计器的方差并增强最终模型输出的稳定性。我们提出了 MonteCLoRA,一种高效的微调技术,采用蒙特卡罗估计来学习低秩参数的无偏后验估计,其期望方差较低,从而在仅增加 O(1) 额外参数的情况下稳定微调后的 LLM。MonteCLoRA 在准确性和鲁棒性方面显示出显著改进,在预训练的 RoBERTa-base 上进行的自然语言理解任务中,其准确性比现有高效微调方法高出 3.8%,鲁棒性提高 8.6%。此外,在预训练的 LLaMA-1-7B 上进行的生成任务中,MonteCLoRA 展示了零样本性能的鲁棒性,方差比当代高效微调方法低 50%。本文的理论和实证结果强调了参数化和超先验在低秩参数空间中平衡探索-利用的重要性,从而在高效微调过程中实现更优和更稳健的参数估计。

[NLP-45] Scaling Laws for Precision

【速读】: 该论文试图解决低精度训练和推理对语言模型质量和成本的影响问题,而现有的缩放法则并未考虑这一点。解决方案的关键在于提出了“精度感知”的缩放法则,用于训练和推理阶段。具体来说,论文指出在低精度下训练会减少模型的“有效参数数量”,从而可以预测由于低精度训练和后训练量化导致的额外损失。对于推理阶段,研究发现后训练量化引入的退化随着模型训练数据的增加而增加,最终使得额外的预训练数据变得有害。论文还提出了一个统一的函数形式,用于预测不同精度下训练和推理的退化,并通过465次预训练运行验证了其预测的准确性,涵盖了高达1.7B参数和26B标记的模型。

链接: https://arxiv.org/abs/2411.04330
作者: Tanishq Kumar,Zachary Ankner,Benjamin F. Spector,Blake Bordelon,Niklas Muennighoff,Mansheej Paul,Cengiz Pehlevan,Christopher Ré,Aditi Raghunathan
关键词-EN: scaling laws, current scaling laws, training, quality and cost, cost of language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise “precision-aware” scaling laws for both training and inference. We propose that training in lower precision reduces the model’s “effective parameter count,” allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
摘要:低精度训练和推理影响语言模型的质量和成本,但当前的缩放法则并未考虑这一点。在本研究中,我们设计了“精度感知”的缩放法则,用于训练和推理。我们提出,在较低精度下训练会减少模型的“有效参数数量”,从而使我们能够预测从低精度训练和训练后量化引入的额外损失。对于推理,我们发现训练后量化引入的退化随着模型训练数据的增加而增加,最终使得额外的预训练数据变得有害。对于训练,我们的缩放法则允许我们预测不同部分在不同精度下的模型损失,并建议在较低精度下训练更大的模型可能是计算优化的。我们将训练后和预训练量化的缩放法则统一为一个单一的函数形式,该形式能够预测在不同精度下训练和推理的退化。我们在超过465次预训练运行中进行了拟合,并在高达1.7亿参数、训练数据量高达260亿Token的模型上验证了我们的预测。

[NLP-46] CodeTree: Agent -guided Tree Search for Code Generation with Large Language Models

【速读】: 该论文试图解决在面对具有极大搜索空间的复杂编程任务时,现有大型语言模型(LLMs)在多阶段规划、代码生成和调试方面存在的困难。解决方案的关键在于提出了CodeTree框架,该框架通过统一的树结构在代码生成的不同阶段高效地探索搜索空间。具体来说,CodeTree在每个阶段通过环境执行反馈和LLM生成的反馈来指导关键决策(如排序、终止和扩展),从而显式地探索不同的编码策略并生成相应的编码解决方案,最终实现代码的自主优化和改进。实验结果表明,CodeTree在多个代码生成基准测试中显著优于现有方法,特别是在复杂任务如SWEBench上取得了显著的性能提升。

链接: https://arxiv.org/abs/2411.04329
作者: Jierui Li,Hung Le,Yinbo Zhou,Caiming Xiong,Silvio Savarese,Doyen Sahoo
关键词-EN: Pre-trained on massive, demonstrated remarkable achievements, text data, large language models, massive amounts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with capabilities to self-refine and improve generated code autonomously. However, on challenging coding tasks with extremely large search space, current agentic approaches still struggle with multi-stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions. In each stage, critical decision-making (ranking, termination, expanding) of the exploration process is guided by both the environmental execution-based feedback and LLM-agent-generated feedback. We comprehensively evaluated CodeTree on 7 code generation benchmarks and demonstrated the significant performance gains of CodeTree against strong baselines. Using GPT-4o as the base model, we consistently achieved top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach led to significant performance gains.
摘要:在大规模代码和文本数据上预训练的大语言模型 (LLM) 在代码生成任务中展现了显著的成就。通过额外的基于执行的反馈,这些模型可以作为具备自主改进生成代码能力的 AI 智能体。然而,在搜索空间极其庞大的复杂编码任务中,当前的智能体方法在多阶段规划、生成和调试方面仍面临挑战。为解决这一问题,我们提出了 CodeTree,这是一个用于 LLM 智能体在代码生成过程的不同阶段高效探索搜索空间的框架。具体而言,我们采用了一种统一的树结构,以显式地探索不同的编码策略,生成相应的编码解决方案,并随后对这些解决方案进行优化。在每个阶段,探索过程的关键决策(排序、终止、扩展)由环境执行反馈和 LLM 智能体生成的反馈共同指导。我们在 7 个代码生成基准上全面评估了 CodeTree,并展示了其相对于强基线的显著性能提升。以 GPT-4o 为基础模型,我们在 HumanEval、MBPP 和 CodeContests 上分别取得了 95.1、98.7 和 43.0 的顶尖成绩。在具有挑战性的 SWEBench 基准上,我们的方法也带来了显著的性能提升。

[NLP-47] Balancing Transparency and Accuracy: A Comparative Analysis of Rule-Based and Deep Learning Models in Political Bias Classification

【速读】: 该论文试图解决自动检测媒体中政治偏见的问题,特别是在美国新闻文章中的应用。解决方案的关键在于比较基于规则的方法和深度学习方法在分类偏见上的效果。研究强调了现代自学习系统在不受约束的数据摄取中的敏感性,并重新考虑了传统基于规则系统的优势。通过将两种模型应用于左倾(CNN)和右倾(FOX)新闻文章,研究评估了它们在超出原始训练数据上的有效性,并分析了每种模型的准确性。结果表明,基于规则的模型在不同数据条件下表现一致且更具透明性,而深度学习模型则依赖于训练集,在处理未见数据时表现不佳。

链接: https://arxiv.org/abs/2411.04328
作者: Manuel Nunez Martinez,Sonja Schmer-Galunder,Zoey Liu,Sangpil Youm,Chathuri Jayaweera,Bonnie J. Dorr
关键词-EN: opposing political viewpoints, increasing political polarization, automatically detecting political, detecting political bias, digital information
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The unchecked spread of digital information, combined with increasing political polarization and the tendency of individuals to isolate themselves from opposing political viewpoints, has driven researchers to develop systems for automatically detecting political bias in media. This trend has been further fueled by discussions on social media. We explore methods for categorizing bias in US news articles, comparing rule-based and deep learning approaches. The study highlights the sensitivity of modern self-learning systems to unconstrained data ingestion, while reconsidering the strengths of traditional rule-based systems. Applying both models to left-leaning (CNN) and right-leaning (FOX) news articles, we assess their effectiveness on data beyond the original training and test this http URL analysis highlights each model’s accuracy, offers a framework for exploring deep-learning explainability, and sheds light on political bias in US news media. We contrast the opaque architecture of a deep learning model with the transparency of a linguistically informed rule-based model, showing that the rule-based model performs consistently across different data conditions and offers greater transparency, whereas the deep learning model is dependent on the training set and struggles with unseen data.
摘要:数字信息的未受控传播,结合日益加剧的政治极化和个体倾向于隔离对立政治观点的趋势,促使研究人员开发自动检测媒体中政治偏见的系统。这一趋势在社交媒体的讨论中得到了进一步推动。我们探讨了分类美国新闻文章中偏见的方法,比较了基于规则和深度学习的方法。研究强调了现代自学习系统对无约束数据摄取的敏感性,同时重新审视了传统基于规则系统的优势。我们将这两种模型应用于左倾(CNN)和右倾(FOX)新闻文章,评估它们在超出原始训练数据上的有效性,并通过此http URL分析突出了每个模型的准确性,提供了一个探索深度学习可解释性的框架,并揭示了美国新闻媒体中的政治偏见。我们对比了深度学习模型的不透明架构与语言学启发的基于规则模型的透明性,表明基于规则模型在不同数据条件下表现一致,并提供更大的透明度,而深度学习模型则依赖于训练集,在处理未见数据时表现不佳。

[NLP-48] A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI

【速读】: 该论文试图解决南非和刚果民主共和国(DRC)由于缺乏准确标注数据而导致的AI驱动翻译和情感分析系统的复杂语言环境问题。解决方案的关键在于开发了一个多语言词典,该词典最初设计用于法语和Tshiluba(Ciluba),现已扩展到包括英语、Afrikaans、Sepedi和Zulu的翻译,并通过整合语言特定的情感评分来增强情感分类的文化相关性。此外,论文还创建了一个全面的测试语料库,并训练了多种机器学习模型(如随机森林、支持向量机(SVM)、决策树和高斯朴素贝叶斯(GNB))来预测低资源语言(LRLs)的情感。其中,随机森林模型表现尤为出色,能够有效捕捉情感极性和处理语言特定的细微差别。同时,使用大型语言模型BERT进行基于上下文的情感预测,达到了99%的准确率和98%的精确度,显著优于其他模型。通过可解释AI(XAI)对BERT预测结果进行解释,提高了透明度并增强了情感分类的信心。总体而言,该论文提出的词典和机器学习模型显著提升了南非和DRC低资源语言的翻译和情感分析能力。

链接: https://arxiv.org/abs/2411.04316
作者: Melusi Malinga,Isaac Lupanda,Mike Wa Nkongolo,Phil van Deventer
关键词-EN: Republic of Congo, Democratic Republic, accurately labeled data, complex linguistic landscape, creates unique challenges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work is part of a PhD proposal in Information Technology at the University of Pretoria, supervised by Dr. Mike Wa Nkongolo and co-supervised by Dr. Phil van Deventer, under the Low-Resource Language Processing Lab in the Department of Informatics

点击查看摘要

Abstract:South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba), which creates unique challenges for AI-driven translation and sentiment analysis systems due to a lack of accurately labeled data. This study seeks to address these challenges by developing a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models such as Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB) trained to predict sentiment across low resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment with high accuracy, achieving 99% accuracy and 98% precision, outperforming other models. The BERT predictions were clarified using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.
摘要:南非和刚果民主共和国(DRC)的语言环境复杂,包括祖鲁语、塞佩迪语、南非荷兰语、法语、英语和奇卢巴语(Ciluba),这为基于AI的翻译和情感分析系统带来了独特的挑战,主要原因是缺乏准确标注的数据。本研究旨在通过开发一种多语言词汇来应对这些挑战,该词汇最初设计用于法语和奇卢巴语,现已扩展至包括英语、南非荷兰语、塞佩迪语和祖鲁语的翻译。该词汇通过整合特定语言的情感评分,增强了情感分类的文化相关性。研究还创建了一个全面的测试语料库,以支持翻译和情感分析任务,并训练了多种机器学习模型,如随机森林、支持向量机(SVM)、决策树和高斯朴素贝叶斯(GNB),以预测低资源语言(LRLs)中的情感。其中,随机森林模型表现尤为出色,能够有效捕捉情感极性并处理语言特定的细微差别。此外,研究还应用了基于Transformer的双向编码表示(BERT),这是一种大语言模型(LLM),用于预测基于上下文的情感,取得了99%的准确率和98%的精确度,优于其他模型。通过可解释AI(XAI)对BERT的预测结果进行解释,提高了透明度,增强了情感分类的信心。总体而言,研究结果表明,所提出的词汇和机器学习模型显著提升了南非和刚果民主共和国低资源语言的翻译和情感分析能力,为未来支持弱势语言的AI模型奠定了基础,并在多语言环境下的教育、治理和商业领域具有广泛应用。

[NLP-49] Improving Bilingual Capabilities of Language Models to Support Diverse Linguistic Practices in Education

【速读】: 该论文试图解决在双语教育环境中,多语言大型语言模型(Multilingual Large Language Models, MLLMs)在评估学生写作时的偏差问题。解决方案的关键在于通过使用合成数据集对模型进行微调,这些数据集包括英语、西班牙语和Spanglish(西班牙语和英语混合)。研究结果表明,经过微调的模型在三种语言的评估中表现显著提升,从而增强了MLLMs在支持双语学习者真实语言实践中的有效性。这一研究强调了在教育领域设计和实施语言模型时,纳入非英语语言的重要性。

链接: https://arxiv.org/abs/2411.04308
作者: Anand Syamkumar,Nora Tseng,Kaycie Barron,Shanglin Yang,Shamya Karumbaiah,Rheeya Uppal,Junjie Hu
关键词-EN: providing instructor feedback, generating educational content, reducing teacher workload, offer promise, educational content
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) offer promise in generating educational content, providing instructor feedback, and reducing teacher workload on assessments. While prior studies have focused on studying LLM-powered learning analytics, limited research has examined how effective LLMs are in a bilingual context. In this paper, we study the effectiveness of multilingual large language models (MLLMs) across monolingual (English-only, Spanish-only) and bilingual (Spanglish) student writing. We present a learning analytics use case that details LLM performance in assessing acceptable and unacceptable explanations of Science and Social Science concepts. Our findings reveal a significant bias in the grading performance of pre-trained models for bilingual writing compared to English-only and Spanish-only writing. Following this, we fine-tune open-source MLLMs including Llama 3.1 and Mistral NeMo using synthetic datasets generated in English, Spanish, and Spanglish. Our experiments indicate that the models perform significantly better for all three languages after fine-tuning with bilingual data. This study highlights the potential of enhancing MLLM effectiveness to support authentic language practices amongst bilingual learners. It also aims to illustrate the value of incorporating non-English languages into the design and implementation of language models in education.
摘要:大语言模型 (LLM) 在生成教育内容、提供教师反馈以及减轻教师在评估工作量方面展现出巨大潜力。尽管先前的研究主要集中在探讨基于 LLM 的学习分析,但关于 LLM 在双语环境中的有效性研究仍较为有限。本文研究了多语言大语言模型 (MLLM) 在单语(仅英语、仅西班牙语)和双语(西班牙英语混合)学生写作中的有效性。我们展示了一个学习分析用例,详细描述了 LLM 在评估科学和社会科学概念的可接受和不可接受解释方面的表现。研究结果显示,预训练模型在双语写作评分方面存在显著偏差,相较于仅英语和仅西班牙语写作。随后,我们使用在英语、西班牙语和西班牙英语混合语中生成的合成数据集,对包括 Llama 3.1 和 Mistral NeMo 在内的开源 MLLM 进行了微调。实验表明,经过双语数据微调后,模型在三种语言上的表现均显著提升。本研究强调了提升 MLLM 有效性以支持双语学习者真实语言实践的潜力,并旨在展示将非英语语言纳入教育领域语言模型设计和实施的价值。

[NLP-50] A Capabilities Approach to Studying Bias and Harm in Language Technologies

【速读】: 该论文试图解决主流自然语言处理 (NLP) 研究中忽视全球大多数语言的问题,并探讨在将新技术引入这些语言环境时可能带来的公平性、偏见和包容性问题。解决方案的关键在于采用能力方法 (Capabilities Approach),该方法关注人们在社会、政治和经济交织背景下的实际能力,而非理论上可获得的资源。通过这种方法,论文强调了与社区成员进行有意义的合作,以定义和衡量语言技术带来的潜在危害,从而实现更公平和包容的多语言和多文化评估。

链接: https://arxiv.org/abs/2411.04298
作者: Hellina Hailu Nigatu,Zeerak Talat
关键词-EN: Mainstream Natural Language, Natural Language Processing, Mainstream Natural, Language Processing, Natural Language
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to the New Perspectives on Bias and Discrimination in Language Technology workshop

点击查看摘要

Abstract:Mainstream Natural Language Processing (NLP) research has ignored the majority of the world’s languages. In moving from excluding the majority of the world’s languages to blindly adopting what we make for English, we first risk importing the same harms we have at best mitigated and at least measured for English. However, in evaluating and mitigating harms arising from adopting new technologies into such contexts, we often disregard (1) the actual community needs of Language Technologies, and (2) biases and fairness issues within the context of the communities. In this extended abstract, we consider fairness, bias, and inclusion in Language Technologies through the lens of the Capabilities Approach. The Capabilities Approach centers on what people are capable of achieving, given their intersectional social, political, and economic contexts instead of what resources are (theoretically) available to them. We detail the Capabilities Approach, its relationship to multilingual and multicultural evaluation, and how the framework affords meaningful collaboration with community members in defining and measuring the harms of Language Technologies.
摘要:主流自然语言处理 (Natural Language Processing, NLP) 研究忽视了世界上大多数语言。从排除大多数世界语言到盲目采用为英语开发的技术,我们首先面临的风险是将我们在英语环境中至少减轻和测量的同样危害引入。然而,在评估和减轻将新技术引入这些环境所产生的危害时,我们往往忽视了(1)语言技术的实际社区需求,以及(2)社区背景中的偏见和公平问题。在本扩展摘要中,我们通过能力方法 (Capabilities Approach) 的视角,考虑了语言技术中的公平性、偏见和包容性问题。能力方法的核心是关注人们在交织的社会、政治和经济背景下能够实现什么,而不是理论上可用的资源。我们详细阐述了能力方法,它与多语言和多文化评估的关系,以及该框架如何促进与社区成员在定义和测量语言技术危害方面的有意义合作。

[NLP-51] Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models

【速读】: 该论文试图解决视觉语言模型(Vision-language models, VLMs)在多模态任务中安全性对齐的挑战,特别是其复杂架构导致的安全性问题。论文揭示了VLM的视觉编码器各层之间安全性的不公平分布,早期和中层比最终层更容易受到恶意输入的影响。解决方案的关键在于识别并解决这种“跨层”脆弱性,即模型在训练时使用的默认架构设置无法泛化到未见或分布外场景,导致某些层暴露于风险中。通过投影激活从不同中间层进行综合分析,论文展示了这些层在面对恶意输入时更可能生成有害输出。实验结果表明,当前基于单一默认层的安全对齐策略不足以应对这种跨层脆弱性。

链接: https://arxiv.org/abs/2411.04291
作者: Saketh Bachu,Erfan Shayegani,Trishna Chakraborty,Rohit Lal,Arindam Dutta,Chengyu Song,Yue Dong,Nael Abu-Ghazaleh,Amit K. Roy-Chowdhury
关键词-EN: complex architecture makes, large language models, Vision-language models, multi-modal tasks, improved significantly
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, Under Review

点击查看摘要

Abstract:Vision-language models (VLMs) have improved significantly in multi-modal tasks, but their more complex architecture makes their safety alignment more challenging than the alignment of large language models (LLMs). In this paper, we reveal an unfair distribution of safety across the layers of VLM’s vision encoder, with earlier and middle layers being disproportionately vulnerable to malicious inputs compared to the more robust final layers. This ‘cross-layer’ vulnerability stems from the model’s inability to generalize its safety training from the default architectural settings used during training to unseen or out-of-distribution scenarios, leaving certain layers exposed. We conduct a comprehensive analysis by projecting activations from various intermediate layers and demonstrate that these layers are more likely to generate harmful outputs when exposed to malicious inputs. Our experiments with LLaVA-1.5 and Llama 3.2 show discrepancies in attack success rates and toxicity scores across layers, indicating that current safety alignment strategies focused on a single default layer are insufficient.
摘要:视觉-语言模型(Vision-language models, VLMs)在多模态任务中取得了显著进展,但其更为复杂的架构使得其安全性对齐比大语言模型(Large Language Models, LLMs)更具挑战性。本文揭示了VLM视觉编码器各层之间安全性的不公平分布,早期和中层相较于更为稳健的最终层,更容易受到恶意输入的影响。这种“跨层”脆弱性源于模型无法将其在训练期间使用的默认架构设置中的安全训练泛化到未见或分布外的场景,从而使得某些层暴露在外。我们通过投影来自不同中间层的激活进行了全面的分析,并证明这些层在面对恶意输入时更可能生成有害输出。我们在LLaVA-1.5和Llama 3.2上的实验显示,各层的攻击成功率和毒性评分存在差异,这表明当前专注于单一默认层的安全对齐策略是不充分的。

[NLP-52] Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

【速读】: 该论文试图解决大型语言模型(LLMs)在复杂推理任务中表现不佳的问题,特别是在需要多步骤推理的情况下。解决方案的关键是引入了一种名为LaTent Reasoning Optimization (LaTRO)的框架,该框架通过将推理过程形式化为从潜在分布中采样,并利用变分方法进行优化,从而在训练过程中提升LLMs的推理能力。LaTRO使LLMs能够在不依赖外部反馈或奖励模型的情况下,同时改进其推理过程和评估推理质量的能力。实验结果表明,LaTRO在GSM8K和ARC-Challenge数据集上显著提升了模型的零样本准确率,验证了其有效性。

链接: https://arxiv.org/abs/2411.04282
作者: Haolin Chen,Yihao Feng,Zuxin Liu,Weiran Yao,Akshara Prabhakar,Shelby Heinecke,Ricky Ho,Phil Mui,Silvio Savarese,Caiming Xiong,Huan Wang
关键词-EN: Large language models, Large language, shown impressive capabilities, complex reasoning tasks, shown impressive
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at \urlthis https URL.
摘要:大语言模型 (LLM) 展示了令人印象深刻的能力,但在需要多步骤的复杂推理任务上仍面临挑战。尽管基于提示的方法如思维链 (Chain-of-Thought, CoT) 可以在推理时提升 LLM 的推理能力,但在训练过程中优化推理能力仍然是一个难题。我们提出了隐式推理优化 (LaTent Reasoning Optimization, LaTRO),这是一个将推理形式化为从隐式分布中采样并通过变分方法进行优化的原则性框架。LaTRO 使 LLM 能够在不依赖外部反馈或奖励模型的情况下,同时提升其推理过程和评估推理质量的能力。我们通过在 GSM8K 和 ARC-Challenge 数据集上使用多种模型架构进行实验,验证了 LaTRO 的有效性。在 GSM8K 数据集上,LaTRO 将零样本准确率在 Phi-3.5-mini、Mistral-7B 和 Llama-3.1-8B 模型上分别平均提升了 12.5% 和 9.6% 相比于基础模型和监督微调模型。我们的研究结果表明,预训练的 LLM 具有潜在的推理能力,可以通过我们提出的优化方法以自我改进的方式解锁和增强。LaTRO 的代码可在 \urlthis https URL 获取。

[NLP-53] Diversity Helps Jailbreak Large Language Models

【速读】: 该论文试图解决大型语言模型(LLM)在安全约束下的漏洞问题,揭示了现有安全训练方法可能仅是掩盖而非消除这些漏洞。解决方案的关键在于利用LLM偏离先前上下文的能力,通过简单的指令使其偏离并混淆先前的攻击,从而显著提高绕过安全约束的成功率。该方法在测试中表现出色,成功率比现有方法高出62%,且仅使用13%的查询量,暴露了当前LLM安全训练的重大缺陷,强调了需要彻底改革测试方法以确保LLM的安全性和可靠性。

链接: https://arxiv.org/abs/2411.04223
作者: Weiliang Zhao,Daniel Ben-Levi,Junfeng Yang,Chengzhi Mao
关键词-EN: generate harmful outputs, powerful jailbreak technique, leverages large language, large language models’, language models’ ability
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2312.02119

点击查看摘要

Abstract:We have uncovered a powerful jailbreak technique that leverages large language models’ ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62% higher success rate in compromising nine leading chatbots, including GPT-4, Gemini, and Llama, while using only 13% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.
摘要:我们发现了一种强大的越狱技术,利用大语言模型(LLM)偏离先前上下文的能力,使其能够绕过安全约束并生成有害输出。通过简单地指示 LLM 偏离并混淆之前的攻击,我们的方法显著优于现有技术,在攻击包括 GPT-4、Gemini 和 Llama 在内的九个领先聊天机器人时,成功率提高了高达 62%,而仅使用了 13% 的查询量。这一发现揭示了当前 LLM 安全训练中的一个关键漏洞,表明现有方法可能只是掩盖了漏洞而非消除它们。我们的研究结果强烈呼吁迫切需要革新测试方法,以确保 LLM 的安全性和可靠性。

[NLP-54] Crystal: Illuminating LLM Abilities on Language and Code

【速读】: 该论文试图解决大型语言模型(LLMs)在代码生成和自然语言处理能力之间的复杂交互问题,特别是在代码生成模型(如StarCoder和Code Llama)中,如何有效地整合这两种能力。解决方案的关键在于提出了一种两阶段的预训练策略,通过调整代码和自然语言的比例来增强单一LLM中自然语言和编码能力的整合。具体来说,第一阶段侧重于代码生成,第二阶段则平衡代码和自然语言的训练。最终模型Crystal在自然语言和代码生成性能上均表现出色,且数据效率优于Llama 2和Code Llama。此外,论文强调了精心设计的数据配比的重要性,并承诺开源所有预训练细节以促进社区研究。

链接: https://arxiv.org/abs/2411.04156
作者: Tianhua Tao,Junbo Li,Bowen Tan,Hongyi Wang,William Marshall,Bhargav M Kanakiya,Joel Hestness,Natalia Vassilieva,Zhiqiang Shen,Eric P. Xing,Zhengzhong Liu
关键词-EN: Large Language Models, play increasingly critical, increasingly critical roles, Large Language, code
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published as a conference paper at COLM 2024

点击查看摘要

Abstract:Large Language Models (LLMs) specializing in code generation (which are also often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. It is also crucial for code LLMs to possess both code generation and natural language abilities for many specific applications, such as code snippet retrieval using natural language or code explanations. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Furthermore, there is a lack of thorough prior studies on the LLM pretraining strategy that mixes code and natural language. In this work, we propose a pretraining strategy to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it includes two phases of training with appropriately adjusted code/language ratios. The resulting model, Crystal, demonstrates remarkable capabilities in both domains. Specifically, it has natural language and coding performance comparable to that of Llama 2 and Code Llama, respectively. Crystal exhibits better data efficiency, using 1.4 trillion tokens compared to the more than 2 trillion tokens used by Llama 2 and Code Llama. We verify our pretraining strategy by analyzing the training process and observe consistent improvements in most benchmarks. We also adopted a typical application adaptation phase with a code-centric data mixture, only to find that it did not lead to enhanced performance or training efficiency, underlining the importance of a carefully designed data recipe. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, loggings and 136 checkpoints throughout the training.
摘要:专注于代码生成的大语言模型(LLMs),例如 StarCoder 和 Code Llama,在各种软件开发场景中扮演着越来越关键的角色。对于许多特定应用,如使用自然语言进行代码片段检索或代码解释,代码 LLMs 不仅需要具备代码生成能力,还需要具备自然语言处理能力。语言和编码技能的复杂交互使得开发强大的代码 LLMs 变得复杂。此外,关于混合代码和自然语言的 LLM 预训练策略的先前研究尚不充分。在本研究中,我们提出了一种预训练策略,旨在增强单一 LLM 中自然语言和编码能力的整合。具体而言,该策略包括两个阶段的训练,并适当调整代码与语言的比例。最终得到的模型 Crystal 在两个领域都表现出卓越的能力。具体来说,它在自然语言和编码性能上分别与 Llama 2 和 Code Llama 相当。Crystal 展示了更好的数据效率,使用了 1.4 万亿个 Token,而 Llama 2 和 Code Llama 则使用了超过 2 万亿个 Token。我们通过分析训练过程验证了预训练策略,并观察到大多数基准测试中持续的改进。我们还采用了典型的应用适应阶段,使用以代码为中心的数据混合,结果发现这并未带来性能或训练效率的提升,这强调了精心设计数据配方的重要性。为了促进社区内的研究,我们承诺公开预训练的所有细节,包括训练数据集、代码、日志记录以及训练过程中的 136 个检查点。

[NLP-55] Software Design Pattern Model and Data Structure Algorithm Abilities on Microservices Architecture Design in High-tech Enterprises

【速读】: 该论文试图解决企业中微服务架构设计中软件设计模型能力和数据结构算法能力的影响问题。解决方案的关键在于强调强大的设计模型和高效的算法,以实现微服务架构的卓越可扩展性、性能和灵活性。研究发现,这些能力有助于更好的服务分解、数据处理优化和系统响应性提升。尽管如此,论文也指出了在整合新兴技术和适应软件设计实践变化方面的研究空白,并建议未来的研究方向以填补这些空白。

链接: https://arxiv.org/abs/2411.04143
作者: Jun Cui
关键词-EN: structure algorithm abilities, design model capabilities, study investigates, investigates the impact, data structure algorithm
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study investigates the impact of software design model capabilities and data structure algorithm abilities on microservices architecture design within enterprises. Utilizing a qualitative methodology, the research involved in-depth interviews with software architects and developers who possess extensive experience in microservices implementation. The findings reveal that organizations emphasizing robust design models and efficient algorithms achieve superior scalability, performance, and flexibility in their microservices architecture. Notably, participants highlighted that a strong foundation in these areas facilitates better service decomposition, optimizes data processing, and enhances system responsiveness. Despite these insights, gaps remain regarding the integration of emerging technologies and the evolving nature of software design practices. This paper contributes to the existing literature by underscoring the critical role of these competencies in fostering effective microservices architectures and suggests avenues for future research to address identified gaps
摘要:本研究探讨了软件设计模型能力和数据结构算法能力对企业内微服务架构设计的影响。采用定性研究方法,研究包括对具有丰富微服务实施经验的软件架构师和开发人员进行深度访谈。研究结果显示,注重强大设计模型和高效算法的组织在微服务架构中实现了卓越的可扩展性、性能和灵活性。特别地,参与者强调,在这些领域打下坚实基础有助于更好地进行服务分解,优化数据处理,并提升系统响应能力。尽管如此,关于新兴技术的整合和软件设计实践的不断演变,仍存在一些未解决的差距。本文通过强调这些能力在促进有效微服务架构中的关键作用,为现有文献做出了贡献,并指出了未来研究的方向,以解决已识别的差距。

[NLP-56] A Comparative Study on the Impact of Test-Driven Development (TDD) and Behavior-Driven Development (BDD) on Enterprise Software Delivery Effectiveness

【速读】: 该论文试图解决的问题是如何在企业环境中评估和选择测试驱动开发 (Test-Driven Development, TDD) 和行为驱动开发 (Behavior-Driven Development, BDD) 对软件交付效果的影响。解决方案的关键在于通过深入访谈收集数据,揭示这两种开发模型在交付速度、软件质量和团队协作方面的不同效果。具体来说,TDD 通过强调早期测试和迭代开发提升代码质量和减少缺陷,而 BDD 通过关注行为规范和直接涉及利益相关者来改善跨职能沟通。然而,TDD 可能需要更高的初始时间投入,而 BDD 可能在需求清晰度方面遇到挑战。这些发现有助于企业根据项目类型和利益相关者的需求选择最适合的开发模型。

链接: https://arxiv.org/abs/2411.04141
作者: Jun Cui
关键词-EN: software delivery effectiveness, paper compares, Test-Driven Development, Behavior-Driven Development, BDD
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper compares the impact of Test-Driven Development (TDD) and Behavior-Driven Development (BDD) on software delivery effectiveness within enterprise environments. Using a qualitative research design, data were collected through in-depth interviews with developers and project managers from enterprises adopting TDD or BDD. Moreover, the findings reveal distinct effects of each model on delivery speed, software quality, and team collaboration. Specifically, TDD emphasizes early testing and iterative development, leading to enhanced code quality and fewer defects, while BDD improves cross-functional communication by focusing on behavior specifications that involve stakeholders directly. However, TDD may create a higher initial time investment, and BDD might encounter challenges in requirement clarity. These differences highlight gaps in understanding how each model aligns with varying project types and stakeholder needs, which can guide enterprises in selecting the most suitable model for their unique requirements. The study contributes to the literature by providing insights into the practical application and challenges of TDD and BDD, suggesting future research on their long-term impacts in diverse settings.
摘要:本文比较了测试驱动开发 (Test-Driven Development, TDD) 和行为驱动开发 (Behavior-Driven Development, BDD) 对企业环境中软件交付效率的影响。通过采用定性研究设计,数据收集自采用 TDD 或 BDD 的企业中的开发人员和项目经理的深度访谈。研究结果揭示了这两种模型在交付速度、软件质量和团队协作方面的不同效果。具体而言,TDD 强调早期测试和迭代开发,从而提高了代码质量和减少了缺陷;而 BDD 通过关注涉及直接利益相关者的行为规范,改善了跨职能沟通。然而,TDD 可能会导致较高的初始时间投入,而 BDD 可能在需求清晰度方面遇到挑战。这些差异突显了理解每种模型如何与不同项目类型和利益相关者需求相匹配的差距,这可以指导企业在选择最适合其独特需求的模型时做出决策。本研究通过提供关于 TDD 和 BDD 实际应用和挑战的见解,为文献做出了贡献,并建议未来研究关注其在不同环境中的长期影响。

[NLP-57] Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在训练过程中数据污染的问题,特别是在性能评估和比较时面临的挑战。解决方案的关键在于引入了一个名为MM-Detect的多模态数据污染检测框架。该框架能够敏感地检测不同程度的污染,并识别出由于训练集泄露导致的显著性能提升。此外,论文还探讨了污染可能源自MLLMs所使用的预训练语言模型(LLMs)的预训练阶段以及MLLMs的微调阶段,为污染引入的阶段提供了新的见解。

链接: https://arxiv.org/abs/2411.03823
作者: Dingjie Song,Sicheng Lai,Shunian Chen,Lichao Sun,Benyou Wang
关键词-EN: demonstrated superior performance, large language models, rapid progression, demonstrated superior, language models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, We also explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 的快速发展在各种多模态基准测试中展现了卓越的性能。然而,训练过程中的数据污染问题给性能评估和比较带来了挑战。尽管在大语言模型 (Large Language Models, LLMs) 中存在多种检测数据集污染的方法,但由于 MLLMs 涉及多种模态和多个训练阶段,这些方法对 MLLMs 的效果较差。在本研究中,我们提出了一种针对 MLLMs 的多模态数据污染检测框架,名为 MM-Detect。我们的实验结果表明,MM-Detect 对不同程度的污染具有敏感性,并且能够突出由于多模态基准测试训练集泄露而带来的显著性能提升。此外,我们还探讨了 MLLMs 所使用的 LLMs 预训练阶段和 MLLMs 微调阶段可能产生的污染来源,为污染可能引入的阶段提供了新的见解。

[NLP-58] Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection

【速读】: 该论文试图解决轻度认知障碍 (Mild Cognitive Impairment, MCI) 的早期检测问题,特别是通过语音助手 (Voice Assistant, VA) 的自发命令来识别 MCI。解决方案的关键在于设计了一种命令生成任务,相较于传统的命令阅读任务,该任务更能反映参与者的认知能力。通过开发基于音频、文本、意图及多模态融合特征的分类和回归模型,研究发现命令生成任务在识别MCI方面表现更优,平均分类准确率达到82%。此外,生成的命令与记忆和注意力子域的相关性更强,这表明命令生成任务在MCI检测中的有效性,并暗示了利用纵向家庭命令进行MCI检测的潜力。

链接: https://arxiv.org/abs/2411.04158
作者: Nana Lin,Youxiang Zhu,Xiaohui Liang,John A. Batsis,Caroline Summerour
关键词-EN: Mild cognitive impairment, major public health, public health concern, health concern due, Mild cognitive
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Mild cognitive impairment (MCI) is a major public health concern due to its high risk of progressing to dementia. This study investigates the potential of detecting MCI with spontaneous voice assistant (VA) commands from 35 older adults in a controlled setting. Specifically, a command-generation task is designed with pre-defined intents for participants to freely generate commands that are more associated with cognitive ability than read commands. We develop MCI classification and regression models with audio, textual, intent, and multimodal fusion features. We find the command-generation task outperforms the command-reading task with an average classification accuracy of 82%, achieved by leveraging multimodal fusion features. In addition, generated commands correlate more strongly with memory and attention subdomains than read commands. Our results confirm the effectiveness of the command-generation task and imply the promise of using longitudinal in-home commands for MCI detection.
摘要:轻度认知障碍 (Mild Cognitive Impairment, MCI) 因其高风险进展为痴呆症而成为重大的公共卫生问题。本研究探讨了在受控环境中,通过35名老年人的自发语音助手 (Voice Assistant, VA) 指令检测 MCI 的潜力。具体而言,设计了一项指令生成任务,参与者根据预定义的意图自由生成指令,这些指令比阅读指令更能反映认知能力。我们开发了基于音频、文本、意图及多模态融合特征的 MCI 分类和回归模型。研究发现,指令生成任务在分类准确率上平均达到82%,优于指令阅读任务,这得益于多模态融合特征的利用。此外,生成的指令与记忆和注意力子领域的相关性比阅读指令更强。本研究结果证实了指令生成任务的有效性,并暗示了利用纵向家庭指令进行 MCI 检测的前景。

[NLP-59] Unified Pathological Speech Analysis with Prompt Tuning

【速读】: 该论文试图解决病理语音分析中针对特定疾病建模的局限性问题,即现有模型通常仅针对单一疾病设计,忽略了疾病之间的关联性,从而限制了性能和训练效率。解决方案的关键在于采用提示调优(prompt tuning)技术,构建一个统一的病理语音分析系统,能够同时检测多达三种疾病(阿尔茨海默病、抑郁症和帕金森病)。该系统利用预训练的口语语言模型,通过提示调优仅调整少量参数,实现跨任务的知识共享,从而提高训练效率、加速收敛并提升F1分数。实验结果表明,该方法在病理语音分析中表现出较强的竞争力。

链接: https://arxiv.org/abs/2411.04142
作者: Fei Yang,Xuenan Xu,Mengyue Wu,Kai Yu
关键词-EN: Pathological speech analysis, Pathological speech, speech analysis, previous pathological speech, speech
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Pathological speech analysis has been of interest in the detection of certain diseases like depression and Alzheimer’s disease and attracts much interest from researchers. However, previous pathological speech analysis models are commonly designed for a specific disease while overlooking the connection between diseases, which may constrain performance and lower training efficiency. Instead of fine-tuning deep models for different tasks, prompt tuning is a much more efficient training paradigm. We thus propose a unified pathological speech analysis system for as many as three diseases with the prompt tuning technique. This system uses prompt tuning to adjust only a small part of the parameters to detect different diseases from speeches of possible patients. Our system leverages a pre-trained spoken language model and demonstrates strong performance across multiple disorders while only fine-tuning a fraction of the parameters. This efficient training approach leads to faster convergence and improved F1 scores by allowing knowledge to be shared across tasks. Our experiments on Alzheimer’s disease, Depression, and Parkinson’s disease show competitive results, highlighting the effectiveness of our method in pathological speech analysis.
摘要:病理语音分析在抑郁症和阿尔茨海默病等疾病的检测中引起了广泛关注,并吸引了众多研究者的兴趣。然而,以往的病理语音分析模型通常针对特定疾病设计,忽略了疾病之间的关联,这可能限制了模型的性能并降低了训练效率。与为不同任务微调深度模型相比,提示调优(prompt tuning)是一种更为高效的训练范式。因此,我们提出了一种基于提示调优技术的统一病理语音分析系统,该系统能够同时检测多达三种疾病。该系统利用提示调优技术,仅调整模型参数的一小部分,即可从可能患者的语音中检测出不同的疾病。我们的系统利用预训练的口语语言模型,在多种疾病检测中表现出强大的性能,同时仅微调了部分参数。这种高效的训练方法通过允许任务间的知识共享,实现了更快的收敛和更高的F1分数。我们在阿尔茨海默病、抑郁症和帕金森病的实验中展示了具有竞争力的结果,突显了我们的方法在病理语音分析中的有效性。

人工智能

[AI-0] ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

链接: https://arxiv.org/abs/2411.05003
作者: David Junhao Zhang,Roni Paiss,Shiran Zada,Nikhil Karnad,David E. Jacobs,Yael Pritch,Inbar Mosseri,Mike Zheng Shou,Neal Wadhwa,Nataniel Ruiz
关键词-EN: controllable camera trajectories, modeling have allowed, allowed for controllable, video, Recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: project page: this https URL

点击查看摘要

Abstract:Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.

[AI-1] HourVideo: 1-Hour Video-Language Understanding NEURIPS2024

链接: https://arxiv.org/abs/2411.04998
作者: Keshigeyan Chandrasegaran,Agrim Gupta,Lea M. Hadzic,Taran Kota,Jimming He,Cristóbal Eyzaguirre,Zane Durante,Manling Li,Jiajun Wu,Li Fei-Fei
关键词-EN: hour-long video-language understanding, video-language understanding, hour-long video-language, present HourVideo, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Datasets and Benchmarks Track; 28 pages

点击查看摘要

Abstract:We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at this https URL

[AI-2] Public Procurement for Responsible AI? Understanding U.S. Cities Practices Challenges and Needs

链接: https://arxiv.org/abs/2411.04994
作者: Nari Johnson,Elise Silva,Harrison Leon,Motahhare Eslami,Beth Schwanke,Ravit Dotan,Hoda Heidari
关键词-EN: developed internally, tools adopted, city employees, process called public, called public procurement
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Preprint, under revision

点击查看摘要

Abstract:Most AI tools adopted by governments are not developed internally, but instead are acquired from third-party vendors in a process called public procurement. While scholars and regulatory proposals have recently turned towards procurement as a site of intervention to encourage responsible AI governance practices, little is known about the practices and needs of city employees in charge of AI procurement. In this paper, we present findings from semi-structured interviews with 18 city employees across 7 US cities. We find that AI acquired by cities often does not go through a conventional public procurement process, posing challenges to oversight and governance. We identify five key types of challenges to leveraging procurement for responsible AI that city employees face when interacting with colleagues, AI vendors, and members of the public. We conclude by discussing recommendations and implications for governments, researchers, and policymakers.

[AI-3] Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations Theory and Alternatives

链接: https://arxiv.org/abs/2411.04991
作者: Hao Sun,Yunyi Shen,Jean-Francois Ton
关键词-EN: Large Language Model, Large Language, Language Model, reward modeling, common and successful
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model – originally developed for multi-player stochastic game matching – can be adopted to convert pairwise response comparisons to reward values and make predictions. Especially given the fact that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling, and establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Despite theoretically sound, we argue that the BT model is not a necessary choice from the perspective of downstream optimization. This is because a reward model only needs to preserve the correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.

[AI-4] Clustering in Causal Attention Masking NEURIPS2024

链接: https://arxiv.org/abs/2411.04990
作者: Nikita Karagodin,Yury Polyanskiy,Philippe Rigollet
关键词-EN: self-attention dynamics proposed, work presents, self-attention dynamics, dynamics proposed, causally masked attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Dynamical Systems (math.DS)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), 22 pages, 6 figures

点击查看摘要

Abstract:This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (arXiv:2312.10794) in this context: While previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states.

[AI-5] Few-Shot Task Learning through Inverse Generative Modeling

链接: https://arxiv.org/abs/2411.04987
作者: Aviv Netanyahu,Yilun Du,Antonia Bronars,Jyothish Pari,Joshua Tenenbaum,Tianmin Shu,Pulkit Agrawal
关键词-EN: Inverse Generative Modeling, Few-Shot Task Learning, task concept learning, extremely challenging, Task Learning
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains – object rearrangement, goal-oriented navigation, motion caption of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts in (1) unseen environments and (2) in composition with training concepts.

[AI-6] DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

链接: https://arxiv.org/abs/2411.04983
作者: Gaoyue Zhou,Hengkai Pan,Yann LeCun,Lerrel Pinto
关键词-EN: predict future outcomes, outcomes given control, fundamental for physical, world models, DINO World Model
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.

[AI-7] Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

链接: https://arxiv.org/abs/2411.04981
作者: Dylan Manuel,Nafis Tanveer Islam,Joseph Khoury,Ana Nunez,Elias Bou-Harb,Peyman Najafirad
关键词-EN: identify critical security, Security experts reverse, critical security vulnerabilities, experts reverse engineer, critical security
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Security experts reverse engineer (decompile) binary code to identify critical security vulnerabilities. The limited access to source code in vital systems - such as firmware, drivers, and proprietary software used in Critical Infrastructures (CI) - makes this analysis even more crucial on the binary level. Even with available source code, a semantic gap persists after compilation between the source and the binary code executed by the processor. This gap may hinder the detection of vulnerabilities in source code. That being said, current research on Large Language Models (LLMs) overlooks the significance of decompiled binaries in this area by focusing solely on source code. In this work, we are the first to empirically uncover the substantial semantic limitations of state-of-the-art LLMs when it comes to analyzing vulnerabilities in decompiled binaries, largely due to the absence of relevant datasets. To bridge the gap, we introduce DeBinVul, a novel decompiled binary code vulnerability dataset. Our dataset is multi-architecture and multi-optimization, focusing on C/C++ due to their wide usage in CI and association with numerous vulnerabilities. Specifically, we curate 150,872 samples of vulnerable and non-vulnerable decompiled binary code for the task of (i) identifying; (ii) classifying; (iii) describing vulnerabilities; and (iv) recovering function names in the domain of decompiled binaries. Subsequently, we fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in the capabilities of CodeLlama, Llama3, and CodeGen2 respectively, in detecting binary code vulnerabilities. Additionally, using DeBinVul, we report a high performance of 80-90% on the vulnerability classification task. Furthermore, we report improved performance in function name recovery and vulnerability description tasks.

[AI-8] Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification

链接: https://arxiv.org/abs/2411.04956
作者: Mischa Dombrowski,Hadrien Reynaud,Bernhard Kainz
关键词-EN: easily deceive casual, deceive casual observers, produced image quality, domain experts alike, Latent Video Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 tables, 6 figures

点击查看摘要

Abstract:Latent Video Diffusion Models can easily deceive casual observers and domain experts alike thanks to the produced image quality and temporal consistency. Beyond entertainment, this creates opportunities around safe data sharing of fully synthetic datasets, which are crucial in healthcare, as well as other domains relying on sensitive personal information. However, privacy concerns with this approach have not fully been addressed yet, and models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly due to the sampling space being a subspace of the training videos, effectively reducing the training data size for downstream models. Additionally, the reduced temporal consistency when generating long videos could be a contributing factor. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalize better. Furthermore, to investigate downstream degradation factors, we propose to use a re-identification model, previously employed as a privacy preservation filter. We demonstrate that it is sufficient to train this model on the latent space of the video generator. Subsequently, we use these models to evaluate the subspace covered by synthetic video datasets and thus introduce a new way to measure the faithfulness of generative machine learning models. We focus on a specific application in healthcare echocardiography to illustrate the effectiveness of our novel methods. Our findings indicate that only up to 30.8% of the training videos are learned in latent video diffusion models, which could explain the lack of performance when training downstream tasks on synthetic data. Comments: 8 pages, 5 tables, 6 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2411.04956 [cs.CV] (or arXiv:2411.04956v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2411.04956 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-9] DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

链接: https://arxiv.org/abs/2411.04928
作者: Wenqiang Sun,Shuo Chen,Fangfu Liu,Zilong Chen,Yueqi Duan,Jun Zhang,Yikai Wang
关键词-EN: generate photorealistic, video diffusion, framework designed, designed to generate, single image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project Page: this https URL

点击查看摘要

Abstract:In this paper, we introduce \textbfDimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.

[AI-10] StoryAgent : Customized Storytelling Video Generation via Multi-Agent Collaboration

链接: https://arxiv.org/abs/2411.04925
作者: Panwen Hu,Jin Jiang,Jianqi Chen,Mingfei Han,Shengcai Liao,Xiaojun Chang,Xiaodan Liang
关键词-EN: streamline conventional processes, AI-Generated Content, conventional processes, AIGC, advent of AI-Generated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The advent of AI-Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.

[AI-11] Evaluating Robustness of Reinforcement Learning Algorithms for Autonomous Shipping

链接: https://arxiv.org/abs/2411.04915
作者: Bavo Lesy,Ali Anwar,Siegfried Mercelis
关键词-EN: improve maritime efficiency, autonomous shipping due, autonomous shipping, efficiency and safety, growing interest
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures. Will be presented at IEEE RAAI 2024

点击查看摘要

Abstract:Recently, there has been growing interest in autonomous shipping due to its potential to improve maritime efficiency and safety. The use of advanced technologies, such as artificial intelligence, can address the current navigational and operational challenges in autonomous shipping. In particular, inland waterway transport (IWT) presents a unique set of challenges, such as crowded waterways and variable environmental conditions. In such dynamic settings, the reliability and robustness of autonomous shipping solutions are critical factors for ensuring safe operations. This paper examines the robustness of benchmark deep reinforcement learning (RL) algorithms, implemented for IWT within an autonomous shipping simulator, and their ability to generate effective motion planning policies. We demonstrate that a model-free approach can achieve an adequate policy in the simulator, successfully navigating port environments never encountered during training. We focus particularly on Soft-Actor Critic (SAC), which we show to be inherently more robust to environmental disturbances compared to MuZero, a state-of-the-art model-based RL algorithm. In this paper, we take a significant step towards developing robust, applied RL frameworks that can be generalized to various vessel types and navigate complex port- and inland environments and scenarios.

[AI-12] GUI Agents with Foundation Models: A Comprehensive Survey

链接: https://arxiv.org/abs/2411.04890
作者: Shuai Wang,Weiwen Liu,Jingxuan Chen,Weinan Gan,Xingshan Zeng,Shuai Yu,Xinlong Hao,Kun Shao,Yasheng Wang,Ruiming Tang
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), facilitate intelligent agents being capable of performing complex tasks. By leveraging the ability of (M)LLMs to process and interpret Graphical User Interfaces (GUIs), these agents can autonomously execute user instructions by simulating human-like interactions such as clicking and typing. This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data, frameworks, and applications. We begin by discussing representative datasets and benchmarks. Next, we summarize a unified framework that captures the essential components used in prior research, accompanied by a taxonomy. Additionally, we explore commercial applications of (M)LLM-based GUI agents. Drawing from existing work, we identify several key challenges and propose future research directions. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.

[AI-13] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

链接: https://arxiv.org/abs/2411.04872
作者: Elliot Glazer,Ege Erdil,Tamay Besiroglu,Diego Chicharro,Evan Chen,Alex Gunning,Caroline Falkman Olsson,Jean-Stanislas Denain,Anson Ho,Emily de Oliveira Santos,Olli Järviniemi,Matthew Barnett,Robert Sandler,Jaime Sevilla,Qiuyu Ren,Elizabeth Pratt,Lionel Levine,Grant Barkley,Natalie Stewart,Bogdan Grechuk,Tetiana Grechuk,Shreepranav Varma Enugandla
关键词-EN: exceptionally challenging mathematics, hundreds of original, exceptionally challenging, expert mathematicians, challenging mathematics problems
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics – from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

[AI-14] hink Smart Act SMARL! Analyzing Probabilistic Logic Driven Safety in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2411.04867
作者: Satchit Chatterji,Erman Acar
关键词-EN: important challenge, challenge for enabling, enabling the deployment, deployment of reinforcement, real world
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 14 figures

点击查看摘要

Abstract:An important challenge for enabling the deployment of reinforcement learning (RL) algorithms in the real world is safety. This has resulted in the recent research field of Safe RL, which aims to learn optimal policies that are safe. One successful approach in that direction is probabilistic logic shields (PLS), a model-based Safe RL technique that uses formal specifications based on probabilistic logic programming, constraining an agent’s policy to comply with those specifications in a probabilistic sense. However, safety is inherently a multi-agent concept, since real-world environments often involve multiple agents interacting simultaneously, leading to a complex system which is hard to control. Moreover, safe multi-agent RL (Safe MARL) is still underexplored. In order to address this gap, in this paper we ( i ) introduce Shielded MARL (SMARL) by extending PLS to MARL – in particular, we introduce Probabilistic Logic Temporal Difference Learning (PLTD) to enable shielded independent Q-learning (SIQL), and introduce shielded independent PPO (SIPPO) using probabilistic logic policy gradients; ( ii ) show its positive effect and use as an equilibrium selection mechanism in various game-theoretic environments including two-player simultaneous games, extensive-form games, stochastic games, and some grid-world extensions in terms of safety, cooperation, and alignment with normative behaviors; and ( iii ) look into the asymmetric case where only one agent is shielded, and show that the shielded agent has a significant influence on the unshielded one, providing further evidence of SMARL’s ability to enhance safety and cooperation in diverse multi-agent environments.

[AI-15] ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset WACV2025 WACV

链接: https://arxiv.org/abs/2411.04865
作者: Olaf Wysocki,Yue Tan,Thomas Froech,Yan Xia,Magdalena Wysocki,Ludwig Hoegner,Daniel Cremers,Christoph Holst
关键词-EN: computer vision, Facade semantic segmentation, Facade, photogrammetry and computer, facade segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to WACV 2025 (IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))

点击查看摘要

Abstract:Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and uniform methods’ comparison. Realizing the LoFG, we present to date the largest semantic 3D facade segmentation dataset, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

[AI-16] A multi-purpose automatic editing system based on lecture semantics for remote education

链接: https://arxiv.org/abs/2411.04859
作者: Panwen Hu,Rui Huang
关键词-EN: popular recently due, Remote teaching, convenience and safety, popular recently, recently due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Remote teaching has become popular recently due to its convenience and safety, especially under extreme circumstances like a pandemic. However, online students usually have a poor experience since the information acquired from the views provided by the broadcast platforms is limited. One potential solution is to show more camera views simultaneously, but it is technically challenging and distracting for the viewers. Therefore, an automatic multi-camera directing/editing system, which aims at selecting the most concerned view at each time instance to guide the attention of online students, is in urgent demand. However, existing systems mostly make simple assumptions and focus on tracking the position of the speaker instead of the real lecture semantics, and therefore have limited capacities to deliver optimal information flow. To this end, this paper proposes an automatic multi-purpose editing system based on the lecture semantics, which can both direct the multiple video streams for real-time broadcasting and edit the optimal video offline for review purposes. Our system directs the views by semantically analyzing the class events while following the professional directing rules, mimicking a human director to capture the regions of interest from the viewpoint of the onsite students. We conduct both qualitative and quantitative analyses to verify the effectiveness of the proposed system and its components.

[AI-17] Plasticity Loss in Deep Reinforcement Learning: A Survey

链接: https://arxiv.org/abs/2411.04832
作者: Timo Klein,Lukas Miklautz,Kevin Sidak,Claudia Plant,Sebastian Tschiatschek
关键词-EN: Akin to neuroplasticity, deep Reinforcement Learning, neural networks enables, human brains, neuroplasticity in human
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Akin to neuroplasticity in human brains, the plasticity of deep neural networks enables their quick adaption to new data. This makes plasticity particularly crucial for deep Reinforcement Learning (RL) agents: Once plasticity is lost, an agent’s performance will inevitably plateau because it cannot improve its policy to account for changes in the data distribution, which are a necessary consequence of its learning process. Thus, developing well-performing and sample-efficient agents hinges on their ability to remain plastic during training. Furthermore, the loss of plasticity can be connected to many other issues plaguing deep RL, such as training instabilities, scaling failures, overestimation bias, and insufficient exploration. With this survey, we aim to provide an overview of the emerging research on plasticity loss for academics and practitioners of deep reinforcement learning. First, we propose a unified definition of plasticity loss based on recent works, relate it to definitions from the literature, and discuss metrics for measuring plasticity loss. Then, we categorize and discuss numerous possible causes of plasticity loss before reviewing currently employed mitigation strategies. Our taxonomy is the first systematic overview of the current state of the field. Lastly, we discuss prevalent issues within the literature, such as a necessity for broader evaluation, and provide recommendations for future research, like gaining a better understanding of an agent’s neural activity and behavior.

[AI-18] D3epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes

链接: https://arxiv.org/abs/2411.04826
作者: Siyu Chen,Hong Liu,Wenhao Li,Ying Zhu,Guoquan Wang,Jianbing Wu
关键词-EN: technology in robotics, Depth estimation, crucial technology, Depth, dynamic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Open sourced

点击查看摘要

Abstract:Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability in dynamic environments. To address this issue, we present D ^3 epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks. This provides guidance for subsequent processes. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at \urlthis https URL.

[AI-19] Defending Deep Regression Models against Backdoor Attacks

链接: https://arxiv.org/abs/2411.04811
作者: Lingyu Du,Yupei Liu,Jinyuan Jia,Guohao Lan
关键词-EN: Deep regression models, regression models, Deep regression, regression, backdoored deep regression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep regression models are used in a wide variety of safety-critical applications, but are vulnerable to backdoor attacks. Although many defenses have been proposed for classification models, they are ineffective as they do not consider the uniqueness of regression models. First, the outputs of regression models are continuous values instead of discretized labels. Thus, the potential infected target of a backdoored regression model has infinite possibilities, which makes it impossible to be determined by existing defenses. Second, the backdoor behavior of backdoored deep regression models is triggered by the activation values of all the neurons in the feature space, which makes it difficult to be detected and mitigated using existing defenses. To resolve these problems, we propose DRMGuard, the first defense to identify if a deep regression model in the image domain is backdoored or not. DRMGuard formulates the optimization problem for reverse engineering based on the unique output-space and feature-space characteristics of backdoored deep regression models. We conduct extensive evaluations on two regression tasks and four datasets. The results show that DRMGuard can consistently defend against various backdoor attacks. We also generalize four state-of-the-art defenses designed for classifiers to regression models, and compare DRMGuard with them. The results show that DRMGuard significantly outperforms all those defenses.

[AI-20] MPVO: Motion-Prior based Visual Odometry for PointGoal Navigation ECCV

链接: https://arxiv.org/abs/2411.04796
作者: Sayan Paul,Ruddra dev Roychoudhury,Brojeshwar Bhowmick
关键词-EN: enabling accurate point-goal, Visual odometry, GPS and compass, accurate point-goal navigation, unreliable and inaccurate
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in 50SFM Workshop of the 18th European Conference on Computer Vision (ECCV) 2024

点击查看摘要

Abstract:Visual odometry (VO) is essential for enabling accurate point-goal navigation of embodied agents in indoor environments where GPS and compass sensors are unreliable and inaccurate. However, traditional VO methods face challenges in wide-baseline scenarios, where fast robot motions and low frames per second (FPS) during inference hinder their performance, leading to drift and catastrophic failures in point-goal navigation. Recent deep-learned VO methods show robust performance but suffer from sample inefficiency during training; hence, they require huge datasets and compute resources. So, we propose a robust and sample-efficient VO pipeline based on motion priors available while an agent is navigating an environment. It consists of a training-free action-prior based geometric VO module that estimates a coarse relative pose which is further consumed as a motion prior by a deep-learned VO model, which finally produces a fine relative pose to be used by the navigation policy. This strategy helps our pipeline achieve up to 2x sample efficiency during training and demonstrates superior accuracy and robustness in point-goal navigation tasks compared to state-of-the-art VO method(s). Realistic indoor environments of the Gibson dataset is used in the AI-Habitat simulator to evaluate the proposed approach using navigation metrics (like success/SPL) and pose metrics (like RPE/ATE). We hope this method further opens a direction of work where motion priors from various sources can be utilized to improve VO estimates and achieve better results in embodied navigation tasks.

[AI-21] Navigating Trade-offs: Policy Summarization for Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2411.04784
作者: Zuzanna Osika,Jazmin Zatarain-Salazar,Frans A. Oliehoek,Pradeep K. Murukannaiah
关键词-EN: solve problems involving, problems involving multiple, involving multiple objectives, MORL, solution set
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-objective reinforcement learning (MORL) is used to solve problems involving multiple objectives. An MORL agent must make decisions based on the diverse signals provided by distinct reward functions. Training an MORL agent yields a set of solutions (policies), each presenting distinct trade-offs among the objectives (expected returns). MORL enhances explainability by enabling fine-grained comparisons of policies in the solution set based on their trade-offs as opposed to having a single policy. However, the solution set is typically large and multi-dimensional, where each policy (e.g., a neural network) is represented by its objective values. We propose an approach for clustering the solution set generated by MORL. By considering both policy behavior and objective values, our clustering method can reveal the relationship between policy behaviors and regions in the objective space. This approach can enable decision makers (DMs) to identify overarching trends and insights in the solution set rather than examining each policy individually. We tested our method in four multi-objective environments and found it outperformed traditional k-medoids clustering. Additionally, we include a case study that demonstrates its real-world application. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2411.04784 [cs.AI] (or arXiv:2411.04784v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2411.04784 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Frontiers in Artificial Intelligence and Applications, vol. 392, ECAI 2024, pp. 2919-2926 Related DOI: https://doi.org/10.3233/FAIA240830 Focus to learn more DOI(s) linking to related resources

[AI-22] Attention Masks Help Adversarial Attacks to Bypass Safety Detectors

链接: https://arxiv.org/abs/2411.04772
作者: Yunfan Shi
关键词-EN: recent research advancements, adversarial attack methods, attention mask generation, current approaches, discoverable and slower
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite recent research advancements in adversarial attack methods, current approaches against XAI monitors are still discoverable and slower. In this paper, we present an adaptive framework for attention mask generation to enable stealthy, explainable and efficient PGD image classification adversarial attack under XAI monitors. Specifically, we utilize mutation XAI mixture and multitask self-supervised X-UNet for attention mask generation to guide PGD attack. Experiments on MNIST (MLP), CIFAR-10 (AlexNet) have shown that our system can outperform benchmark PGD, Sparsefool and SOTA SINIFGSM in balancing among stealth, efficiency and explainability which is crucial for effectively fooling SOTA defense protected classifiers.

[AI-23] Exploring the Stability Gap in Continual Learning: The Role of the Classification Head WACV2025

链接: https://arxiv.org/abs/2411.04723
作者: Wojciech Łapacz,Daniel Marczak,Filip Szatkowski,Tomasz Trzciński
关键词-EN: mitigating catastrophic forgetting, evolving data distributions, enabling neural networks, stability gap, catastrophic forgetting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025

点击查看摘要

Abstract:Continual learning (CL) has emerged as a critical area in machine learning, enabling neural networks to learn from evolving data distributions while mitigating catastrophic forgetting. However, recent research has identified the stability gap – a phenomenon where models initially lose performance on previously learned tasks before partially recovering during training. Such learning dynamics are contradictory to the intuitive understanding of stability in continual learning where one would expect the performance to degrade gradually instead of rapidly decreasing and then partially recovering later. To better understand and alleviate the stability gap, we investigate it at different levels of the neural network architecture, particularly focusing on the role of the classification head. We introduce the nearest-mean classifier (NMC) as a tool to attribute the influence of the backbone and the classification head on the stability gap. Our experiments demonstrate that NMC not only improves final performance, but also significantly enhances training stability across various continual learning benchmarks, including CIFAR100, ImageNet100, CUB-200, and FGVC Aircrafts. Moreover, we find that NMC also reduces task-recency bias. Our analysis provides new insights into the stability gap and suggests that the primary contributor to this phenomenon is the linear head, rather than the insufficient representation learning.

[AI-24] Differential Privacy Overview and Fundamental Techniques

链接: https://arxiv.org/abs/2411.04710
作者: Ferdinando Fioretto,Pascal Van Hentenryck,Juba Ziani
关键词-EN: Artificial Intelligence, Theory to Practice, Differential Privacy, Privacy, implement Differential Privacy
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Chapter 1 of book: “Differential Privacy in Artificial Intelligence: From Theory to Practice”

点击查看摘要

Abstract:This chapter is meant to be part of the book “Differential Privacy in Artificial Intelligence: From Theory to Practice” and provides an introduction to Differential Privacy. It starts by illustrating various attempts to protect data privacy, emphasizing where and why they failed, and providing the key desiderata of a robust privacy definition. It then defines the key actors, tasks, and scopes that make up the domain of privacy-preserving data analysis. Following that, it formalizes the definition of Differential Privacy and its inherent properties, including composition, post-processing immunity, and group privacy. The chapter also reviews the basic techniques and mechanisms commonly used to implement Differential Privacy in its pure and approximate forms.

[AI-25] he Multiple Dimensions of Spuriousness in Machine Learning

链接: https://arxiv.org/abs/2411.04696
作者: Samuel J. Bell,Skyler Wang
关键词-EN: today machine learning, machine learning, Learning correlations, artificial intelligence, forms the foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning correlations from data forms the foundation of today’s machine learning (ML) and artificial intelligence (AI) research. While such an approach enables the automatic discovery of patterned relationships within big data corpora, it is susceptible to failure modes when unintended correlations are captured. This vulnerability has expanded interest in interrogating spuriousness, often critiqued as an impediment to model performance, fairness, and robustness. In this article, we trace deviations from the conventional definition of statistical spuriousness-which denotes a non-causal observation arising from either coincidence or confounding variables-to articulate how ML researchers make sense of spuriousness in practice. Drawing on a broad survey of ML literature, we conceptualize the “multiple dimensions of spuriousness,” encompassing: relevance (“Models should only use correlations that are relevant to the task.”), generalizability (“Models should only use correlations that generalize to unseen data”), human-likeness (“Models should only use correlations that a human would use to perform the same task”), and harmfulness (“Models should only use correlations that are not harmful”). These dimensions demonstrate that ML spuriousness goes beyond the causal/non-causal dichotomy and that the disparate interpretative paths researchers choose could meaningfully influence the trajectory of ML development. By underscoring how a fundamental problem in ML is contingently negotiated in research contexts, we contribute to ongoing debates about responsible practices in AI development.

[AI-26] Reciprocal Point Learning Network with Large Electromagnetic Kernel for SAR Open-Set Recognition

链接: https://arxiv.org/abs/2411.04693
作者: Xiayang Xiao,Zhuoxuan Li,Ruyi Zhang,Jiacheng Chen,Haipeng Wang
关键词-EN: Synthetic Aperture Radar, existing Synthetic Aperture, Automatic Target Recognition, Aperture Radar, Synthetic Aperture
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The limitations of existing Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) methods lie in their confinement by the closed-environment assumption, hindering their effective and robust handling of unknown target categories in open environments. Open Set Recognition (OSR), a pivotal facet for algorithmic practicality, intends to categorize known classes while denoting unknown ones as “unknown.” The chief challenge in OSR involves concurrently mitigating risks associated with generalizing features from a restricted set of known classes to numerous unknown samples and the open space exposure to potential unknown data. To enhance open-set SAR classification, a method called scattering kernel with reciprocal learning network is proposed. Initially, a feature learning framework is constructed based on reciprocal point learning (RPL), establishing a bounded space for potential unknown classes. This approach indirectly introduces unknown information into a learner confined to known classes, thereby acquiring more concise and discriminative representations. Subsequently, considering the variability in the imaging of targets at different angles and the discreteness of components in SAR images, a proposal is made to design convolutional kernels based on large-sized attribute scattering center models. This enhances the ability to extract intrinsic non-linear features and specific scattering characteristics in SAR images, thereby improving the discriminative features of the model and mitigating the impact of imaging variations on classification performance. Experiments on the MSTAR datasets substantiate the superior performance of the proposed approach called ASC-RPL over mainstream methods.

[AI-27] Personalized Federated Learning for Cross-view Geo-localization

链接: https://arxiv.org/abs/2411.04692
作者: Christos Anagnostopoulos,Alexandros Gkillas,Nikos Piperigkos,Aris S. Lalos
关键词-EN: Cross-view Image Geo-localization, Image Geo-localization, Cross-view Image, combining Federated Learning, methodology combining Federated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures, Preprint submitted to the IEEE 26th International Workshop on Multimedia Signal Processing (MMSP)

点击查看摘要

Abstract:In this paper we propose a methodology combining Federated Learning (FL) with Cross-view Image Geo-localization (CVGL) techniques. We address the challenges of data privacy and heterogeneity in autonomous vehicle environments by proposing a personalized Federated Learning scenario that allows selective sharing of model parameters. Our method implements a coarse-to-fine approach, where clients share only the coarse feature extractors while keeping fine-grained features specific to local environments. We evaluate our approach against traditional centralized and single-client training schemes using the KITTI dataset combined with satellite imagery. Results demonstrate that our federated CVGL method achieves performance close to centralized training while maintaining data privacy. The proposed partial model sharing strategy shows comparable or slightly better performance than classical FL, offering significant reduced communication overhead without sacrificing accuracy. Our work contributes to more robust and privacy-preserving localization systems for autonomous vehicles operating in diverse environments

[AI-28] AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data

链接: https://arxiv.org/abs/2411.04691
作者: Tianyi Zhang,Miu Kojima,Simon D’Alfonso
关键词-EN: valuable tools, tools for personal, data, personal sensing, assess mental health
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Smartphones, equipped with an array of sensors, have become valuable tools for personal sensing. Particularly in digital health, smartphones facilitate the tracking of health-related behaviors and contexts, contributing significantly to digital phenotyping, a process where data from digital interactions is analyzed to infer behaviors and assess mental health. Traditional methods process raw sensor data into information features for statistical and machine learning analyses. In this paper, we introduce a novel approach that systematically converts smartphone-collected data into structured, chronological narratives. The AWARE Narrator translates quantitative smartphone sensing data into English language descriptions, forming comprehensive narratives of an individual’s activities. We apply the framework to the data collected from university students over a week, demonstrating the potential of utilizing the narratives to summarize individual behavior, and analyzing psychological states by leveraging large language models.

[AI-29] Solving Generalized Grouping Problems in Cellular Manufacturing Systems Using a Network Flow Model

链接: https://arxiv.org/abs/2411.04685
作者: Md. Kutub Uddin,Md. Saiful Islam,Md Abrar Jahin,Md. Saiful Islam Seam,M. F. Mridha
关键词-EN: cellular manufacturing systems, process route, process route family, route family formation, process
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to a journal

点击查看摘要

Abstract:This paper focuses on the generalized grouping problem in the context of cellular manufacturing systems (CMS), where parts may have more than one process route. A process route lists the machines corresponding to each part of the operation. Inspired by the extensive and widespread use of network flow algorithms, this research formulates the process route family formation for generalized grouping as a unit capacity minimum cost network flow model. The objective is to minimize dissimilarity (based on the machines required) among the process routes within a family. The proposed model optimally solves the process route family formation problem without pre-specifying the number of part families to be formed. The process route of family formation is the first stage in a hierarchical procedure. For the second stage (machine cell formation), two procedures, a quadratic assignment programming (QAP) formulation and a heuristic procedure, are proposed. The QAP simultaneously assigns process route families and machines to a pre-specified number of cells in such a way that total machine utilization is maximized. The heuristic procedure for machine cell formation is hierarchical in nature. Computational results for some test problems show that the QAP and the heuristic procedure yield the same results.

[AI-30] CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

链接: https://arxiv.org/abs/2411.04679
作者: Jie Liu,Pan Zhou,Yingjun Du,Ah-Hwee Tan,Cees G.M. Snoek,Jan-Jakob Sonke,Efstratios Gavves
关键词-EN: large language model, Cooperative Plan Optimization, language model, common goal, problem among large
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注: Under review

点击查看摘要

Abstract:In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long-term strategic and cooperative planning, leading to redundant steps, failures, and even serious repercussions in complex tasks like search-and-rescue missions where discussion and cooperative plan are crucial. To solve this issue, we propose Cooperative Plan Optimization (CaPo) to enhance the cooperation efficiency of LLM-based embodied agents. Inspired by human cooperation schemes, CaPo improves cooperation efficiency with two phases: 1) meta-plan generation, and 2) progress-adaptive meta-plan and execution. In the first phase, all agents analyze the task, discuss, and cooperatively create a meta-plan that decomposes the task into subtasks with detailed steps, ensuring a long-term strategic and coherent plan for efficient coordination. In the second phase, agents execute tasks according to the meta-plan and dynamically adjust it based on their latest progress (e.g., discovering a target object) through multi-turn discussions. This progress-based adaptation eliminates redundant actions, improving the overall cooperation efficiency of agents. Experimental results on the ThreeDworld Multi-Agent Transport and Communicative Watch-And-Help tasks demonstrate that CaPo achieves much higher task completion rate and efficiency compared with state-of-the-arts.

[AI-31] CUIfy the XR: An Open-Source Package to Embed LLM -powered Conversational Agents in XR

链接: https://arxiv.org/abs/2411.04671
作者: Kadir Burak Buldu,Süleyman Özdel,Ka Hei Carrie Lau,Mengdi Wang,Daniel Saad,Sofie Schönborn,Auxane Boch,Enkelejda Kasneci,Efe Bozkir
关键词-EN: sensor technologies enable, technologies enable numerous, enable numerous opportunities, Recent developments, machine learning
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering consumer-grade head-mounted displays (HMDs) in an affordable way, it is likely that XR will become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)–powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. In this paper, we provide the community with an open-source, customizable, extensible, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes the latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: this https URL

[AI-32] EffiCANet: Efficient Time Series Forecasting with Convolutional Attention

链接: https://arxiv.org/abs/2411.04669
作者: Xinxing Zhou,Jiaqi Ye,Shubao Zhao,Ming Jin,Chengyi Yang,Yanlong Wen,Xiaojie Yuan
关键词-EN: multivariate time series, smart cities requires, time series data, cities requires efficient, exponential growth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The exponential growth of multivariate time series data from sensor networks in domains like industrial monitoring and smart cities requires efficient and accurate forecasting models. Current deep learning methods often fail to adequately capture long-range dependencies and complex inter-variable relationships, especially under real-time processing constraints. These limitations arise as many models are optimized for either short-term forecasting with limited receptive fields or long-term accuracy at the cost of efficiency. Additionally, dynamic and intricate interactions between variables in real-world data further complicate modeling efforts. To address these limitations, we propose EffiCANet, an Efficient Convolutional Attention Network designed to enhance forecasting accuracy while maintaining computational efficiency. EffiCANet integrates three key components: (1) a Temporal Large-kernel Decomposed Convolution (TLDC) module that captures long-term temporal dependencies while reducing computational overhead; (2) an Inter-Variable Group Convolution (IVGC) module that captures complex and evolving relationships among variables; and (3) a Global Temporal-Variable Attention (GTVA) mechanism that prioritizes critical temporal and inter-variable features. Extensive evaluations across nine benchmark datasets show that EffiCANet achieves the maximum reduction of 10.02% in MAE over state-of-the-art models, while cutting computational costs by 26.2% relative to conventional large-kernel convolution methods, thanks to its efficient decomposition strategy.

[AI-33] wav2sleep: A Unified Multi-Modal Approach to Sleep Stage Classification from Physiological Signals ML4H ALT

链接: https://arxiv.org/abs/2411.04644
作者: Jonathan F. Carter,Lionel Tarassenko
关键词-EN: obtrusive sensor measurements, enable important applications, Accurate classification, obtrusive sensor, sensor measurements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to Machine Learning for Health (ML4H) 2024

点击查看摘要

Abstract:Accurate classification of sleep stages from less obtrusive sensor measurements such as the electrocardiogram (ECG) or photoplethysmogram (PPG) could enable important applications in sleep medicine. Existing approaches to this problem have typically used deep learning models designed and trained to operate on one or more specific input signals. However, the datasets used to develop these models often do not contain the same sets of input signals. Some signals, particularly PPG, are much less prevalent than others, and this has previously been addressed with techniques such as transfer learning. Additionally, only training on one or more fixed modalities precludes cross-modal information transfer from other sources, which has proved valuable in other problem domains. To address this, we introduce wav2sleep, a unified model designed to operate on variable sets of input signals during training and inference. After jointly training on over 10,000 overnight recordings from six publicly available polysomnography datasets, including SHHS and MESA, wav2sleep outperforms existing sleep stage classification models across test-time input combinations including ECG, PPG, and respiratory signals.

[AI-34] AP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models

链接: https://arxiv.org/abs/2411.04642
作者: Jonathan Fhima,Elad Ben Avraham,Oren Nuriel,Yair Kittenplon,Roy Ganz,Aviad Aberdam,Ron Litman
关键词-EN: considerable research interest, garnered considerable research, effectively handling text, research interest, garnered considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM. Initially, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.

[AI-35] Verification of Neural Networks against Convolutional Perturbations via Parameterised Kernels

链接: https://arxiv.org/abs/2411.04594
作者: Benedikt Brückner,Alessio Lomuscio
关键词-EN: blurring or sharpening, convolutional perturbations, perturbations, define input perturbations, verification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We develop a method for the efficient verification of neural networks against convolutional perturbations such as blurring or sharpening. To define input perturbations we use well-known camera shake, box blur and sharpen kernels. We demonstrate that these kernels can be linearly parameterised in a way that allows for a variation of the perturbation strength while preserving desired kernel properties. To facilitate their use in neural network verification, we develop an efficient way of convolving a given input with these parameterised kernels. The result of this convolution can be used to encode the perturbation in a verification setting by prepending a linear layer to a given network. This leads to tight bounds and a high effectiveness in the resulting verification step. We add further precision by employing input splitting as a branch and bound strategy. We demonstrate that we are able to verify robustness on a number of standard benchmarks where the baseline is unable to provide any safety certificates. To the best of our knowledge, this is the first solution for verifying robustness against specific convolutional perturbations such as camera shake.

[AI-36] On the Inherent Robustness of One-Stage Object Detection against Out-of-Distribution Data

链接: https://arxiv.org/abs/2411.04586
作者: Aitor Martinez-Seras,Javier Del Ser,Alain Andres,Pablo Garcia-Bringas
关键词-EN: unknown objects, open world, fundamental aspect, aspect for developing, developing safe
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 figures, 4 tables, under review

点击查看摘要

Abstract:Robustness is a fundamental aspect for developing safe and trustworthy models, particularly when they are deployed in the open world. In this work we analyze the inherent capability of one-stage object detectors to robustly operate in the presence of out-of-distribution (OoD) data. Specifically, we propose a novel detection algorithm for detecting unknown objects in image data, which leverages the features extracted by the model from each sample. Differently from other recent approaches in the literature, our proposal does not require retraining the object detector, thereby allowing for the use of pretrained models. Our proposed OoD detector exploits the application of supervised dimensionality reduction techniques to mitigate the effects of the curse of dimensionality on the features extracted by the model. Furthermore, it utilizes high-resolution feature maps to identify potential unknown objects in an unsupervised fashion. Our experiments analyze the Pareto trade-off between the performance detecting known and unknown objects resulting from different algorithmic configurations and inference confidence thresholds. We also compare the performance of our proposed algorithm to that of logits-based post-hoc OoD methods, as well as possible fusion strategies. Finally, we discuss on the competitiveness of all tested methods against state-of-the-art OoD approaches for object detection models over the recently published Unknown Object Detection benchmark. The obtained results verify that the performance of avant-garde post-hoc OoD detectors can be further improved when combined with our proposed algorithm.

[AI-37] Interpreting the Learned Model in MuZero Planning TAAI2024

链接: https://arxiv.org/abs/2411.04580
作者: Hung Guei,Yan-Ru Ju,Wei-Yu Chen,Ti-Rong Wu
关键词-EN: predict environment dynamics, achieved superhuman performance, relying on simulators, achieved superhuman, predict environment
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by the 29th International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2024)

点击查看摘要

Abstract:MuZero has achieved superhuman performance in various games by using a dynamics network to predict environment dynamics for planning, without relying on simulators. However, the latent states learned by the dynamics network make its planning process opaque. This paper aims to demystify MuZero’s model by interpreting the learned latent states. We incorporate observation reconstruction and state consistency into MuZero training and conduct an in-depth analysis to evaluate latent states across two board games: 9x9 Go and Outer-Open Gomoku, and three Atari games: Breakout, Ms. Pacman, and Pong. Our findings reveal that while the dynamics network becomes less accurate over longer simulations, MuZero still performs effectively by using planning to correct errors. Our experiments also show that the dynamics network learns better latent states in board games than in Atari games. These insights contribute to a better understanding of MuZero and offer directions for future research to improve the playing performance, robustness, and interpretability of the MuZero algorithm.

[AI-38] Multi-Agents are Social Groups: Investigating Social Influence of Multiple Agents Agent s in Human-Agent Interactions

链接: https://arxiv.org/abs/2411.04578
作者: Tianqi Song,Yugin Tan,Zicheng Zhu,Yibin Feng,Yi-Chieh Lee
关键词-EN: common goal, daily life, achieve a common, increasingly prevalent, prevalent in daily
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Multi-agent systems - systems with multiple independent AI agents working together to achieve a common goal - are becoming increasingly prevalent in daily life. Drawing inspiration from the phenomenon of human group social influence, we investigate whether a group of AI agents can create social pressure on users to agree with them, potentially changing their stance on a topic. We conducted a study in which participants discussed social issues with either a single or multiple AI agents, and where the agents either agreed or disagreed with the user’s stance on the topic. We found that conversing with multiple agents (holding conversation content constant) increased the social pressure felt by participants, and caused a greater shift in opinion towards the agents’ stances on each topic. Our study shows the potential advantages of multi-agent systems over single-agent platforms in causing opinion change. We discuss design implications for possible multi-agent systems that promote social good, as well as the potential for malicious actors to use these systems to manipulate public opinion.

[AI-39] Impact of Label Noise on Learning Complex Features NEURIPS2024

链接: https://arxiv.org/abs/2411.04569
作者: Rahul Vashisht,P. Krishna Kumar,Harsha Vardhan Govind,Harish G. Ramaswamy
关键词-EN: Neural networks trained, simpler decision boundaries, Neural networks, decision boundaries, typically converging
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at Workshop on Scientific Methods for Understanding Deep Learning, NeurIPS 2024

点击查看摘要

Abstract:Neural networks trained with stochastic gradient descent exhibit an inductive bias towards simpler decision boundaries, typically converging to a narrow family of functions, and often fail to capture more complex features. This phenomenon raises concerns about the capacity of deep models to adequately learn and represent real-world datasets. Traditional approaches such as explicit regularization, data augmentation, architectural modifications, etc., have largely proven ineffective in encouraging the models to learn diverse features. In this work, we investigate the impact of pre-training models with noisy labels on the dynamics of SGD across various architectures and datasets. We show that pretraining promotes learning complex functions and diverse features in the presence of noise. Our experiments demonstrate that pre-training with noisy labels encourages gradient descent to find alternate minima that do not solely depend upon simple features, rather learns more complex and broader set of features, without hurting performance.

[AI-40] A Generalisation of Voter Model: Influential Nodes and Convergence Properties

链接: https://arxiv.org/abs/2411.04564
作者: Abhiram Manohara,Ahad N. Zehmakan
关键词-EN: representing a social, social network, positive or negative, blue nodes, negative opinion
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Consider an undirected graph G, representing a social network, where each node is blue or red, corresponding to positive or negative opinion on a topic. In the voter model, in discrete time rounds, each node picks a neighbour uniformly at random and adopts its colour. Despite its significant popularity, this model does not capture some fundamental real-world characteristics such as the difference in the strengths of individuals connections, individuals with neutral opinion on a topic, and individuals who are reluctant to update their opinion. To address these issues, we introduce and study a generalisation of the voter model. Motivating by campaigning strategies, we study the problem of selecting a set of seeds blue nodes to maximise the expected number of blue nodes after some rounds. We prove that the problem is NP- hard and provide a polynomial time approximation algorithm with the best possible approximation guarantee. Our experiments on real-world and synthetic graph data demonstrate that the proposed algorithm outperforms other algorithms. We also investigate the convergence properties of the model. We prove that the process could take an exponential number of rounds to converge. However, if we limit ourselves to strongly connected graphs, the convergence time is polynomial and the period (the number of states in convergence) divides the length of all cycles in the graph.

[AI-41] Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning NEURIPS2024

链接: https://arxiv.org/abs/2411.04562
作者: Marvin Alles,Philip Becker-Ehmck,Patrick van der Smagt,Maximilian Karl
关键词-EN: offline reinforcement learning, absence of costly, costly feedback, static datasets poses, offline reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment. In contrast to the online setting, only using static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. It is beneficial but, with limited datasets, errors in the model and the issue of value overestimation among out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective to always stay within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. Thereby eliminating the need to use additional uncertainty penalties on the Bellman update and significantly decreasing the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmark, and show that C-LAP is competitive to state-of-the-art methods, especially outperforming on datasets with visual observations.

[AI-42] An Axiomatic Study of the Evaluation of Enthymeme Decoding in Weighted Structured Argumentation

链接: https://arxiv.org/abs/2411.04555
作者: Jonathan Ben-Naim,Victor David,Anthony Hunter
关键词-EN: pair consisting, claim supported, decodings, enthymemes, set of premises
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 14 pages

点击查看摘要

Abstract:An argument can be seen as a pair consisting of a set of premises and a claim supported by them. Arguments used by humans are often enthymemes, i.e., some premises are implicit. To better understand, evaluate, and compare enthymemes, it is essential to decode them, i.e., to find the missing premisses. Many enthymeme decodings are possible. We need to distinguish between reasonable decodings and unreasonable ones. However, there is currently no research in the literature on “How to evaluate decodings?”. To pave the way and achieve this goal, we introduce seven criteria related to decoding, based on different research areas. Then, we introduce the notion of criterion measure, the objective of which is to evaluate a decoding with regard to a certain criterion. Since such measures need to be validated, we introduce several desirable properties for them, called axioms. Another main contribution of the paper is the construction of certain criterion measures that are validated by our axioms. Such measures can be used to identify the best enthymemes decodings.

[AI-43] Vision Language Models are In-Context Value Learners

链接: https://arxiv.org/abs/2411.04549
作者: Yecheng Jason Ma,Joey Hejna,Ayzaan Wahid,Chuyuan Fu,Dhruv Shah,Jacky Liang,Zhuo Xu,Sean Kirmani,Peng Xu,Danny Driess,Ted Xiao,Jonathan Tompson,Osbert Bastani,Dinesh Jayaraman,Wenhao Yu,Tingnan Zhang,Dorsa Sadigh,Fei Xia
关键词-EN: Predicting temporal progress, Predicting temporal, visual trajectories, trajectories is important, important for intelligent
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project website and demo: this https URL

点击查看摘要

Abstract:Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (\GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and advantage-weighted regression – all without any model training or finetuning.

[AI-44] Dynamic Detection of Relevant Objectives and Adaptation to Preference Drifts in Interactive Evolutionary Multi-Objective Optimization

链接: https://arxiv.org/abs/2411.04547
作者: Seyed Mahdi Shavarani,Mahmoud Golabi,Richard Allmendinger,Lhassane Idoumghar
关键词-EN: Evolutionary Multi-Objective Optimization, Evolutionary Multi-Objective, Multi-Objective Optimization Algorithms, multiple conflicting objectives, widely employed
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Evolutionary Multi-Objective Optimization Algorithms (EMOAs) are widely employed to tackle problems with multiple conflicting objectives. Recent research indicates that not all objectives are equally important to the decision-maker (DM). In the context of interactive EMOAs, preference information elicited from the DM during the optimization process can be leveraged to identify and discard irrelevant objectives, a crucial step when objective evaluations are computationally expensive. However, much of the existing literature fails to account for the dynamic nature of DM preferences, which can evolve throughout the decision-making process and affect the relevance of objectives. This study addresses this limitation by simulating dynamic shifts in DM preferences within a ranking-based interactive algorithm. Additionally, we propose methods to discard outdated or conflicting preferences when such shifts occur. Building on prior research, we also introduce a mechanism to safeguard relevant objectives that may become trapped in local or global optima due to the diminished correlation with the DM-provided rankings. Our experimental results demonstrate that the proposed methods effectively manage evolving preferences and significantly enhance the quality and desirability of the solutions produced by the algorithm.

[AI-45] GenJoin: Conditional Generative Plan-to-Plan Query Optimizer that Learns from Subplan Hints

链接: https://arxiv.org/abs/2411.04525
作者: Pavel Sulimov,Claude Lehmann,Kurt Stockinger
关键词-EN: Query, learned query optimizers, learned query, query optimizers, machine learning
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Query optimization has become a research area where classical algorithms are being challenged by machine learning algorithms. At the same time, recent trends in learned query optimizers have shown that it is prudent to take advantage of decades of database research and augment classical query optimizers by shrinking the plan search space through different types of hints (e.g. by specifying the join type, scan type or the order of joins) rather than completely replacing the classical query optimizer with machine learning models. It is especially relevant for cases when classical optimizers cannot fully enumerate all logical and physical plans and, as an alternative, need to rely on less robust approaches like genetic algorithms. However, even symbiotically learned query optimizers are hampered by the need for vast amounts of training data, slow plan generation during inference and unstable results across various workload conditions. In this paper, we present GenJoin - a novel learned query optimizer that considers the query optimization problem as a generative task and is capable of learning from a random set of subplan hints to produce query plans that outperform the classical optimizer. GenJoin is the first learned query optimizer that significantly and consistently outperforms PostgreSQL as well as state-of-the-art methods on two well-known real-world benchmarks across a variety of workloads using rigorous machine learning evaluations.

[AI-46] Continuous Sign Language Recognition System using Deep Learning with MediaPipe Holistic

链接: https://arxiv.org/abs/2411.04517
作者: Sharvani Srivastava,Sudhakar Singh,Pooja,Shiv Prakash
关键词-EN: Sign Language, Chinese Sign Language, Indian Sign Language, Sign, American Sign Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 14 pages, 4 figures, Wireless Pers Commun

点击查看摘要

Abstract:Sign languages are the language of hearing-impaired people who use visuals like the hand, facial, and body movements for communication. There are different signs and gestures representing alphabets, words, and phrases. Nowadays approximately 300 sign languages are being practiced worldwide such as American Sign Language (ASL), Chinese Sign Language (CSL), Indian Sign Language (ISL), and many more. Sign languages are dependent on the vocal language of a place. Unlike vocal or spoken languages, there are no helping words in sign language like is, am, are, was, were, will, be, etc. As only a limited population is well-versed in sign language, this lack of familiarity of sign language hinders hearing-impaired people from communicating freely and easily with everyone. This issue can be addressed by a sign language recognition (SLR) system which has the capability to translate the sign language into vocal language. In this paper, a continuous SLR system is proposed using a deep learning model employing Long Short-Term Memory (LSTM), trained and tested on an ISL primary dataset. This dataset is created using MediaPipe Holistic pipeline for tracking face, hand, and body movements and collecting landmarks. The system recognizes the signs and gestures in real-time with 88.23% accuracy.

[AI-47] FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation

链接: https://arxiv.org/abs/2411.04509
作者: Liangrui Pan,Mao Huang,Lian Wang,Pinle Qin,Shaoliang Peng
关键词-EN: Hematoxylin and Eosin, surgical planning, tumor diagnosis, post-operative assessment, considered the gold
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted in BIBM2024

点击查看摘要

Abstract:Hematoxylin and Eosin (HE) staining of whole slide images (WSIs) is considered the gold standard for pathologists and medical practitioners for tumor diagnosis, surgical planning, and post-operative assessment. With the rapid advancement of deep learning technologies, the development of numerous models based on convolutional neural networks and transformer-based models has been applied to the precise segmentation of WSIs. However, due to privacy regulations and the need to protect patient confidentiality, centralized storage and processing of image data are impractical. Training a centralized model directly is challenging to implement in medical settings due to these privacy this http URL paper addresses the dispersed nature and privacy sensitivity of medical image data by employing a federated learning framework, allowing medical institutions to collaboratively learn while protecting patient privacy. Additionally, to address the issue of original data reconstruction through gradient inversion during the federated learning training process, differential privacy introduces noise into the model updates, preventing attackers from inferring the contributions of individual samples, thereby protecting the privacy of the training this http URL results show that the proposed method, FedDP, minimally impacts model accuracy while effectively safeguarding the privacy of cancer pathology image data, with only a slight decrease in Dice, Jaccard, and Acc indices by 0.55%, 0.63%, and 0.42%, respectively. This approach facilitates cross-institutional collaboration and knowledge sharing while protecting sensitive data privacy, providing a viable solution for further research and application in the medical field.

[AI-48] Series-to-Series Diffusion Bridge Model

链接: https://arxiv.org/abs/2411.04491
作者: Hao Yang,Zhanbo Feng,Feng Zhou,Robert C Qiu,Zenan Ling
关键词-EN: complex data distributions, showcasing their robust, time series, risen to prominence, robust capability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have risen to prominence in time series forecasting, showcasing their robust capability to model complex data distributions. However, their effectiveness in deterministic predictions is often constrained by instability arising from their inherent stochasticity. In this paper, we revisit time series diffusion models and present a comprehensive framework that encompasses most existing diffusion-based methods. Building on this theoretical foundation, we propose a novel diffusion-based time series forecasting model, the Series-to-Series Diffusion Bridge Model ( \mathrmS^2DBM ), which leverages the Brownian Bridge process to reduce randomness in reverse estimations and improves accuracy by incorporating informative priors and conditions derived from historical time series data. Experimental results demonstrate that \mathrmS^2DBM delivers superior performance in point-to-point forecasting and competes effectively with other diffusion-based models in probabilistic forecasting.

[AI-49] Magent ic-One: A Generalist Multi-Agent System for Solving Complex Tasks

链接: https://arxiv.org/abs/2411.04468
作者: Adam Fourney,Gagan Bansal,Hussein Mozannar,Cheng Tan,Eduardo Salinas,Erkang(Eric)Zhu,Friederike Niedtner,Grace Proebsting,Griffin Bassman,Jack Gerrits,Jacob Alber,Peter Chang,Ricky Loynd,Robert West,Victor Dibia,Ahmed Awadallah,Ece Kamar,Rafah Hosn,Saleema Amershi
关键词-EN: large foundation models, driven by advances, foundation models, promise to enhance, advances in large
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One’s modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner – which is important when agents’ actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at this https URL

[AI-50] Enabling Adaptive Agent Training in Open-Ended Simulators by Targeting Diversity NEURIPS2024

链接: https://arxiv.org/abs/2411.04466
作者: Robby Costales,Stefanos Nikolaidis
关键词-EN: embodied decision-making domains, decision-making domains remains, domains remains bottlenecked, wider application, embodied decision-making
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:The wider application of end-to-end learning methods to embodied decision-making domains remains bottlenecked by their reliance on a superabundance of training data representative of the target domain. Meta-reinforcement learning (meta-RL) approaches abandon the aim of zero-shot generalization–the goal of standard reinforcement learning (RL)–in favor of few-shot adaptation, and thus hold promise for bridging larger generalization gaps. While learning this meta-level adaptive behavior still requires substantial data, efficient environment simulators approaching real-world complexity are growing in prevalence. Even so, hand-designing sufficiently diverse and numerous simulated training tasks for these complex domains is prohibitively labor-intensive. Domain randomization (DR) and procedural generation (PG), offered as solutions to this problem, require simulators to possess carefully-defined parameters which directly translate to meaningful task diversity–a similarly prohibitive assumption. In this work, we present DIVA, an evolutionary approach for generating diverse training tasks in such complex, open-ended simulators. Like unsupervised environment design (UED) methods, DIVA can be applied to arbitrary parameterizations, but can additionally incorporate realistically-available domain knowledge–thus inheriting the flexibility and generality of UED, and the supervised structure embedded in well-designed simulators exploited by DR and PG. Our empirical results showcase DIVA’s unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These findings highlight the potential of such semi-supervised environment design (SSED) approaches, of which DIVA is the first humble constituent, to enable training in realistic simulated domains, and produce more robust and capable adaptive agents.

[AI-51] Can CDT rationalise the ex ante optimal policy via modified anthropics?

链接: https://arxiv.org/abs/2411.04462
作者: Emery Cooper,Caspar Oesterheld,Vincent Conitzer
关键词-EN: causal decision theory, evidential decision theory, Newcomb problem, decision theory, ante policy optimisation
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:In Newcomb’s problem, causal decision theory (CDT) recommends two-boxing and thus comes apart from evidential decision theory (EDT) and ex ante policy optimisation (which prescribe one-boxing). However, in Newcomb’s problem, you should perhaps believe that with some probability you are in a simulation run by the predictor to determine whether to put a million dollars into the opaque box. If so, then causal decision theory might recommend one-boxing in order to cause the predictor to fill the opaque box. In this paper, we study generalisations of this approach. That is, we consider general Newcomblike problems and try to form reasonable self-locating beliefs under which CDT’s recommendations align with an EDT-like notion of ex ante policy optimisation. We consider approaches in which we model the world as running simulations of the agent, and an approach not based on such models (which we call ‘Generalised Generalised Thirding’, or GGT). For each approach, we characterise the resulting CDT policies, and prove that under certain conditions, these include the ex ante optimal policies.

[AI-52] Scaling Laws for Pre-training Agents and World Models

链接: https://arxiv.org/abs/2411.04434
作者: Tim Pearce,Tabish Rashid,Dave Bignell,Raluca Georgescu,Sam Devlin,Katja Hofmann
关键词-EN: increasing model parameters, performance of embodied, shown to improve, improve by increasing, embodied agents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent’s behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that `bigger is better’, we show that the same types of power laws found in language modeling (e.g. between loss and optimal model size), also arise in world modeling and imitation learning. However, the coefficients of these laws are heavily influenced by the tokenizer, task \ architecture – this has important implications on the optimal sizing of models and data.

[AI-53] owards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers

链接: https://arxiv.org/abs/2411.04403
作者: Zhichao Geng,Dongyu Ru,Yang Yang
关键词-EN: Learned sparse retrieval, mature inverted-index engines, garnered growing attention, efficiently perform retrieval, Learned sparse
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learned sparse retrieval, which can efficiently perform retrieval through mature inverted-index engines, has garnered growing attention in recent years. Particularly, the inference-free sparse retrievers are attractive as they eliminate online model inference in the retrieval phase thereby avoids huge computational cost, offering reasonable throughput and latency. However, even the state-of-the-art (SOTA) inference-free sparse models lag far behind in terms of search relevance when compared to both sparse and dense siamese models. Towards competitive search relevance for inference-free sparse retrievers, we argue that they deserve dedicated training methods other than using same ones with siamese encoders. In this paper, we propose two different approaches for performance improvement. First, we introduce the IDF-aware FLOPS loss, which introduces Inverted Document Frequency (IDF) to the sparsification of representations. We find that it mitigates the negative impact of the FLOPS regularization on search relevance, allowing the model to achieve a better balance between accuracy and efficiency. Moreover, we propose a heterogeneous ensemble knowledge distillation framework that combines siamese dense and sparse retrievers to generate supervisory signals during the pre-training phase. The ensemble framework of dense and sparse retriever capitalizes on their strengths respectively, providing a strong upper bound for knowledge distillation. To concur the diverse feedback from heterogeneous supervisors, we normalize and then aggregate the outputs of the teacher models to eliminate score scale differences. On the BEIR benchmark, our model outperforms existing SOTA inference-free sparse model by \textbf3.3 NDCG@10 score. It exhibits search relevance comparable to siamese sparse retrievers and client-side latency only \textbf1.1x that of BM25.

[AI-54] A Bayesian Mixture Model of Temporal Point Processes with Determinantal Point Process Prior

链接: https://arxiv.org/abs/2411.04397
作者: Yiwei Dong,Shaoxin Ye,Yuwen Cao,Qiyu Han,Hongteng Xu,Hanfang Yang
关键词-EN: temporal point processes, Asynchronous event sequence, temporal point, Asynchronous event, Determinantal Point Process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Asynchronous event sequence clustering aims to group similar event sequences in an unsupervised manner. Mixture models of temporal point processes have been proposed to solve this problem, but they often suffer from overfitting, leading to excessive cluster generation with a lack of diversity. To overcome these limitations, we propose a Bayesian mixture model of Temporal Point Processes with Determinantal Point Process prior (TP ^2 DP ^2 ) and accordingly an efficient posterior inference algorithm based on conditional Gibbs sampling. Our work provides a flexible learning framework for event sequence clustering, enabling automatic identification of the potential number of clusters and accurate grouping of sequences with similar features. It is applicable to a wide range of parametric temporal point processes, including neural network-based models. Experimental results on both synthetic and real-world data suggest that our framework could produce moderately fewer yet more diverse mixture components, and achieve outstanding results across multiple evaluation metrics.

[AI-55] Bridging the Gap: Representation Spaces in Neuro-Symbolic AI

链接: https://arxiv.org/abs/2411.04393
作者: Xin Zhang,Victor S.Sheng
关键词-EN: models by combining, combining the advantages, effective method, data representation methods, Neuro-symbolic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neuro-symbolic AI is an effective method for improving the overall performance of AI models by combining the advantages of neural networks and symbolic learning. However, there are differences between the two in terms of how they process data, primarily because they often use different data representation methods, which is often an important factor limiting the overall performance of the two. From this perspective, we analyzed 191 studies from 2013 by constructing a four-level classification framework. The first level defines five types of representation spaces, and the second level focuses on five types of information modalities that the representation space can represent. Then, the third level describes four symbolic logic methods. Finally, the fourth-level categories propose three collaboration strategies between neural networks and symbolic learning. Furthermore, we conducted a detailed analysis of 46 research based on their representation space.

[AI-56] Neuro-Symbolic AI: Explainability Challenges and Future Trends

链接: https://arxiv.org/abs/2411.04383
作者: Xin Zhang,Victor S. Sheng
关键词-EN: essential reason limiting, explicit intermediate representations, vital fields, essential reason, reason limiting
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainability is an essential reason limiting the application of neural networks in many vital fields. Although neuro-symbolic AI hopes to enhance the overall explainability by leveraging the transparency of symbolic learning, the results are less evident than imagined. This article proposes a classification for explainability by considering both model design and behavior of 191 studies from 2013, focusing on neuro-symbolic AI, hoping to inspire scholars who want to understand the explainability of neuro-symbolic AI. Precisely, we classify them into five categories by considering whether the form of bridging the representation differences is readable as their design factor, if there are representation differences between neural networks and symbolic logic learning, and whether a model decision or prediction process is understandable as their behavior factor: implicit intermediate representations and implicit prediction, partially explicit intermediate representations and partially explicit prediction, explicit intermediate representations or explicit prediction, explicit intermediate representation and explicit prediction, unified representation and explicit prediction. We also analyzed the research trends and three significant challenges: unified representations, explainability and transparency, and sufficient cooperation from neural networks and symbolic learning. Finally, we put forward suggestions for future research in three aspects: unified representations, enhancing model explainability, ethical considerations, and social impact.

[AI-57] Benchmarking Large Language Models with Integer Sequence Generation Tasks

链接: https://arxiv.org/abs/2411.04372
作者: Daniel O’Malley,Manish Bhattarai,Javier Santos
关键词-EN: Online Encyclopedia, computes integer sequences, large language model, integer sequences, paper presents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.

[AI-58] ComFairGNN: Community Fair Graph Neural Network

链接: https://arxiv.org/abs/2411.04371
作者: Yonas Sium,Qi Li
关键词-EN: Graph Neural Networks, Neural Networks, addressing graph analytical, graph analytical problems, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become the leading approach for addressing graph analytical problems in various real-world scenarios. However, GNNs may produce biased predictions against certain demographic subgroups due to node attributes and neighbors surrounding a node. Most current research on GNN fairness focuses predominantly on debiasing GNNs using oversimplified fairness evaluation metrics, which can give a misleading impression of fairness. Understanding the potential evaluation paradoxes due to the complicated nature of the graph structure is crucial for developing effective GNN debiasing mechanisms. In this paper, we examine the effectiveness of current GNN debiasing methods in terms of unfairness evaluation. Specifically, we introduce a community-level strategy to measure bias in GNNs and evaluate debiasing methods at this level. Further, We introduce ComFairGNN, a novel framework designed to mitigate community-level bias in GNNs. Our approach employs a learnable coreset-based debiasing function that addresses bias arising from diverse local neighborhood distributions during GNNs neighborhood aggregation. Comprehensive evaluations on three benchmark datasets demonstrate our model’s effectiveness in both accuracy and fairness metrics.

[AI-59] GaGSL: Global-augmented Graph Structure Learning via Graph Information Bottleneck

链接: https://arxiv.org/abs/2411.04356
作者: Shuangjie Li,Jiangqing Song,Baoming Zhang,Gaoli Ruan,Junyuan Xie,Chongjun Wang
关键词-EN: graph structure, Graph neural networks, processing graph data, Graph, structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are prominent for their effectiveness in processing graph data for semi-supervised node classification tasks. Most works of GNNs assume that the observed structure accurately represents the underlying node relationships. However, the graph structure is inevitably noisy or incomplete in reality, which can degrade the quality of graph representations. Therefore, it is imperative to learn a clean graph structure that balances performance and robustness. In this paper, we propose a novel method named \textitGlobal-augmented Graph Structure Learning (GaGSL), guided by the Graph Information Bottleneck (GIB) principle. The key idea behind GaGSL is to learn a compact and informative graph structure for node classification tasks. Specifically, to mitigate the bias caused by relying solely on the original structure, we first obtain augmented features and augmented structure through global feature augmentation and global structure augmentation. We then input the augmented features and augmented structure into a structure estimator with different parameters for optimization and re-definition of the graph structure, respectively. The redefined structures are combined to form the final graph structure. Finally, we employ GIB based on mutual information to guide the optimization of the graph structure to obtain the minimum sufficient graph structure. Comprehensive evaluations across a range of datasets reveal the outstanding performance and robustness of GaGSL compared with the state-of-the-art methods.

[AI-60] Model and Deep learning based Dynamic Range Compression Inversion

链接: https://arxiv.org/abs/2411.04337
作者: Haoran Sun,Dominique Fourer,Hichem Maaref
关键词-EN: Dynamic Range Compression, Range Compression, Dynamic Range, popular audio effect, audio signal
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Dynamic Range Compression (DRC) is a popular audio effect used to control the dynamic range of a signal. Inverting DRC can also help to restore the original dynamics to produce new mixes and/or to improve the overall quality of the audio signal. Since, state-of-the-art DRC inversion techniques either ignore parameters or require precise parameters that are difficult to estimate, we fill the gap by combining a model-based approach with neural networks for DRC inversion. To this end, depending on the scenario, we use different neural networks to estimate DRC parameters. Then, a model-based inversion is completed to restore the original audio signal. Our experimental results show the effectiveness and robustness of the proposed method in comparison to several state-of-the-art methods, when applied on two music datasets.

[AI-61] Gradient Boosting Trees and Large Language Models for Tabular Data Few-Shot Learning

链接: https://arxiv.org/abs/2411.04324
作者: Carlos Huertas
关键词-EN: Large Language Models, Large Language, Machine Learning, Language Models, brought numerous
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: FedCSIS 2024 - Data Mining Competition - 1st Place Winner

点击查看摘要

Abstract:Large Language Models (LLM) have brought numerous of new applications to Machine Learning (ML). In the context of tabular data (TD), recent studies show that TabLLM is a very powerful mechanism for few-shot-learning (FSL) applications, even if gradient boosting decisions trees (GBDT) have historically dominated the TD field. In this work we demonstrate that although LLMs are a viable alternative, the evidence suggests that baselines used to gauge performance can be improved. We replicated public benchmarks and our methodology improves LightGBM by 290%, this is mainly driven by forcing node splitting with few samples, a critical step in FSL with GBDT. Our results show an advantage to TabLLM for 8 or fewer shots, but as the number of samples increases GBDT provides competitive performance at a fraction of runtime. For other real-life applications with vast number of samples, we found FSL still useful to improve model diversity, and when combined with ExtraTrees it provides strong resilience to overfitting, our proposal was validated in a ML competition setting ranking first place.

[AI-62] A Random-Key Optimizer for Combinatorial Optimization

链接: https://arxiv.org/abs/2411.04293
作者: Antonio A. Chaves,Mauricio G.C. Resende,Edilson F. de Arruda,Ricardo M. A. Silva
关键词-EN: search method tailored, efficient stochastic local, Random-Key Optimizer, stochastic local search, local search method
类目: Artificial Intelligence (cs.AI)
*备注: 54 pages, 16 figures, 8 tables

点击查看摘要

Abstract:This paper presents the Random-Key Optimizer (RKO), a versatile and efficient stochastic local search method tailored for combinatorial optimization problems. Using the random-key concept, RKO encodes solutions as vectors of random keys that are subsequently decoded into feasible solutions via problem-specific decoders. The RKO framework is able to combine a plethora of classic metaheuristics, each capable of operating independently or in parallel, with solution sharing facilitated through an elite solution pool. This modular approach allows for the adaptation of various metaheuristics, including simulated annealing, iterated local search, and greedy randomized adaptive search procedures, among others. The efficacy of the RKO framework, implemented in C++, is demonstrated through its application to three NP-hard combinatorial optimization problems: the alpha-neighborhood p-median problem, the tree of hubs location problem, and the node-capacitated graph partitioning problem. The results highlight the framework’s ability to produce high-quality solutions across diverse problem domains, underscoring its potential as a robust tool for combinatorial optimization.

[AI-63] Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning ALT

链接: https://arxiv.org/abs/2411.04285
作者: Thomas Frost,Kezhi Li,Steve Harris
关键词-EN: learning, supervised machine learning, high variance, long-term patient outcomes, patient outcomes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To be published in the Proceedings of the 4th Machine Learning for Health symposium, Proceedings of Machine Learning Research (PMLR)

点击查看摘要

Abstract:The task of predicting long-term patient outcomes using supervised machine learning is a challenging one, in part because of the high variance of each patient’s trajectory, which can result in the model over-fitting to the training data. Temporal difference (TD) learning, a common reinforcement learning technique, may reduce variance by generalising learning to the pattern of state transitions rather than terminal outcomes. However, in healthcare this method requires several strong assumptions about patient states, and there appears to be limited literature evaluating the performance of TD learning against traditional supervised learning methods for long-term health outcome prediction tasks. In this study, we define a framework for applying TD learning to real-time irregularly sampled time series data using a Semi-Markov Reward Process. We evaluate the model framework in predicting intensive care mortality and show that TD learning under this framework can result in improved model robustness compared to standard supervised learning methods. and that this robustness is maintained even when validated on external datasets. This approach may offer a more reliable method when learning to predict patient outcomes using high-variance irregular time series data.

[AI-64] Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking

链接: https://arxiv.org/abs/2411.04281
作者: Xingran Chen,Zhenke Wu,Xu Shi,Hyunghoon Cho,Bhramar Mukherjee
关键词-EN: proposed open-source software, methods, scoping review, recommendations for practitioners, software to offer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, ``SynthEHRella’', is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We found that method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing fidelity of the synthetic data while controlling privacy exposure, and comprehensive benchmarking of longitudinal or conditional generation methods.

[AI-65] Bayesian Inference in Recurrent Explicit Duration Switching Linear Dynamical Systems

链接: https://arxiv.org/abs/2411.04280
作者: Mikołaj Słupiński,Piotr Lipiński
关键词-EN: Linear Dynamical Systems, Recurrent Explicit Duration, Switching Linear Dynamical, Duration Switching Linear, Explicit Duration Switching
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel model called Recurrent Explicit Duration Switching Linear Dynamical Systems (REDSLDS) that incorporates recurrent explicit duration variables into the rSLDS model. We also propose an inference and learning scheme that involves the use of Pólya-gamma augmentation. We demonstrate the improved segmentation capabilities of our model on three benchmark datasets, including two quantitative datasets and one qualitative dataset.

[AI-66] he Recurrent Sticky Hierarchical Dirichlet Process Hidden Markov Model

链接: https://arxiv.org/abs/2411.04278
作者: Mikołaj Słupiński,Piotr Lipiński
关键词-EN: Process Hidden Markov, Dirichlet Process Hidden, Hierarchical Dirichlet Process, Hidden Markov Model, classical Hidden Markov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) is a natural Bayesian nonparametric extension of the classical Hidden Markov Model for learning from (spatio-)temporal data. A sticky HDP-HMM has been proposed to strengthen the self-persistence probability in the HDP-HMM. Then, disentangled sticky HDP-HMM has been proposed to disentangle the strength of the self-persistence prior and transition prior. However, the sticky HDP-HMM assumes that the self-persistence probability is stationary, limiting its expressiveness. Here, we build on previous work on sticky HDP-HMM and disentangled sticky HDP-HMM, developing a more general model: the recurrent sticky HDP-HMM (RS-HDP-HMM). We develop a novel Gibbs sampling strategy for efficient inference in this model. We show that RS-HDP-HMM outperforms disentangled sticky HDP-HMM, sticky HDP-HMM, and HDP-HMM in both synthetic and real data segmentation.

[AI-67] Object Recognition in Human Computer Interaction:- A Comparative Analysis

链接: https://arxiv.org/abs/2411.04263
作者: Kaushik Ranade,Tanmay Khule,Riddhi More
关键词-EN: widely researched area, Human-computer interaction, widely researched, researched area, continuous advancements
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human-computer interaction (HCI) has been a widely researched area for many years, with continuous advancements in technology leading to the development of new techniques that change the way we interact with computers. With the recent advent of powerful computers, we recognize human actions and interact accordingly, thus revolutionizing the way we interact with computers. The purpose of this paper is to provide a comparative analysis of various algorithms used for recognizing user faces and gestures in the context of computer vision and HCI. This study aims to explore and evaluate the performance of different algorithms in terms of accuracy, robustness, and efficiency. This study aims to provide a comprehensive analysis of algorithms for face and gesture recognition in the context of computer vision and HCI, with the goal of improving the design and development of interactive systems that are more intuitive, efficient, and user-friendly.

[AI-68] Learning Generalizable Policy for Obstacle-Aware Autonomous Drone Racing

链接: https://arxiv.org/abs/2411.04246
作者: Yueqian Liu
关键词-EN: Autonomous drone racing, Autonomous drone, gained attention, potential to push, push the boundaries
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 10 pages, 11 figures. This preprint is part of the author’s this http URL . thesis supervised by Ir. Hang Yu and Dr. Ir. Christophe De Wagter, at MAVLab TU Delft. Full thesis is available at this https URL

点击查看摘要

Abstract:Autonomous drone racing has gained attention for its potential to push the boundaries of drone navigation technologies. While much of the existing research focuses on racing in obstacle-free environments, few studies have addressed the complexities of obstacle-aware racing, and approaches presented in these studies often suffer from overfitting, with learned policies generalizing poorly to new environments. This work addresses the challenge of developing a generalizable obstacle-aware drone racing policy using deep reinforcement learning. We propose applying domain randomization on racing tracks and obstacle configurations before every rollout, combined with parallel experience collection in randomized environments to achieve the goal. The proposed randomization strategy is shown to be effective through simulated experiments where drones reach speeds of up to 70 km/h, racing in unseen cluttered environments. This study serves as a stepping stone toward learning robust policies for obstacle-aware drone racing and general-purpose drone navigation in cluttered environments. Code is available at this https URL.

[AI-69] WiFlexFormer: Efficient WiFi-Based Person-Centric Sensing

链接: https://arxiv.org/abs/2411.04224
作者: Julian Strohmayer,Matthias Wödlinger,Martin Kampel
关键词-EN: Channel State Information, WiFi Channel State, State Information, Transformer-based architecture designed, Channel State
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose WiFlexFormer, a highly efficient Transformer-based architecture designed for WiFi Channel State Information (CSI)-based person-centric sensing. We benchmark WiFlexFormer against state-of-the-art vision and specialized architectures for processing radio frequency data and demonstrate that it achieves comparable Human Activity Recognition (HAR) performance while offering a significantly lower parameter count and faster inference times. With an inference time of just 10 ms on an Nvidia Jetson Orin Nano, WiFlexFormer is optimized for real-time inference. Additionally, its low parameter count contributes to improved cross-domain generalization, where it often outperforms larger models. Our comprehensive evaluation shows that WiFlexFormer is a potential solution for efficient, scalable WiFi-based sensing applications. The PyTorch implementation of WiFlexFormer is publicly available at: this https URL.

[AI-70] Equivariant Graph Network Approximations of High-Degree Polynomials for Force Field Prediction

链接: https://arxiv.org/abs/2411.04219
作者: Zhao Xu,Haiyang Yu,Montgomery Bohde,Shuiwang Ji
关键词-EN: Recent advancements, equivariant deep models, equivariant polynomial functions, deep models, models have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in equivariant deep models have shown promise in accurately predicting atomic potentials and force fields in molecular dynamics simulations. Using spherical harmonics (SH) and tensor products (TP), these equivariant networks gain enhanced physical understanding, like symmetries and many-body interactions. Beyond encoding physical insights, SH and TP are also crucial to represent equivariant polynomial functions. In this work, we analyze the equivariant polynomial functions for the equivariant architecture, and introduce a novel equivariant network, named PACE. The proposed PACE utilizes edge booster and the Atomic Cluster Expansion (ACE) technique to approximate a greater number of SE(3) \times S_n equivariant polynomial functions with enhanced degrees. As experimented in commonly used benchmarks, PACE demonstrates state-of-the-art performance in predicting atomic energy and force fields, with robust generalization capability across various geometric distributions under molecular dynamics (MD) across different temperature conditions. Our code is publicly available as part of the AIRS library this https URL.

[AI-71] Quantum Diffusion Models for Few-Shot Learning

链接: https://arxiv.org/abs/2411.04217
作者: Ruhan Wang,Ye Wang,Jing Liu,Toshiaki Koike-Akino
关键词-EN: Modern quantum machine, parameterized quantum circuits, training datasets, testing datasets, Modern quantum
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Modern quantum machine learning (QML) methods involve the variational optimization of parameterized quantum circuits on training datasets, followed by predictions on testing datasets. Most state-of-the-art QML algorithms currently lack practical advantages due to their limited learning capabilities, especially in few-shot learning tasks. In this work, we propose three new frameworks employing quantum diffusion model (QDM) as a solution for the few-shot learning: label-guided generation inference (LGGI); label-guided denoising inference (LGDI); and label-guided noise addition inference (LGNAI). Experimental results demonstrate that our proposed algorithms significantly outperform existing methods.

[AI-72] DiMSUM: Diffusion Mamba – A Scalable and Unified Spatial-Frequency Method for Image Generation NEURIPS2024

链接: https://arxiv.org/abs/2411.04168
作者: Hao Phung,Quan Dao,Trung Dao,Hoang Phan,Dimitris Metaxas,Anh Tran
关键词-EN: image generation tasks, effectively harnessing spatial, architecture for diffusion, inductive bias, image generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models which is essential for the details and overall quality of image generation. Besides, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The codes and pretrained models are released at this https URL.

[AI-73] Cooperation and Personalization on a Seesaw: Choice-based FL for Safe Cooperation in Wireless Networks

链接: https://arxiv.org/abs/2411.04159
作者: Han Zhang,Medhat Elsayed,Majid Bavand,Raimundas Gaigalas,Yigit Ozcan,Melike Erol-Kantarci
关键词-EN: distributed artificial intelligence, innovative distributed artificial, Federated learning, artificial intelligence, innovative distributed
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is an innovative distributed artificial intelligence (AI) technique. It has been used for interdisciplinary studies in different fields such as healthcare, marketing and finance. However the application of FL in wireless networks is still in its infancy. In this work, we first overview benefits and concerns when applying FL to wireless networks. Next, we provide a new perspective on existing personalized FL frameworks by analyzing the relationship between cooperation and personalization in these frameworks. Additionally, we discuss the possibility of tuning the cooperation level with a choice-based approach. Our choice-based FL approach is a flexible and safe FL framework that allows participants to lower the level of cooperation when they feel unsafe or unable to benefit from the cooperation. In this way, the choice-based FL framework aims to address the safety and fairness concerns in FL and protect participants from malicious attacks.

[AI-74] UnityGraph: Unified Learning of Spatio-temporal features for Multi-person Motion Prediction

链接: https://arxiv.org/abs/2411.04151
作者: Kehua Qu,Rui Ding,Jin Tang
关键词-EN: significant real-world applications, real-world applications, complex and emerging, emerging field, field with significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13pages, 12 figures. arXiv admin note: text overlap with arXiv:2411.03729

点击查看摘要

Abstract:Multi-person motion prediction is a complex and emerging field with significant real-world applications. Current state-of-the-art methods typically adopt dual-path networks to separately modeling spatial features and temporal features. However, the uncertain compatibility of the two networks brings a challenge for spatio-temporal features fusion and violate the spatio-temporal coherence and coupling of human motions by nature. To address this issue, we propose a novel graph structure, UnityGraph, which treats spatio-temporal features as a whole, enhancing model coherence and this http URL-temporal features as a whole, enhancing model coherence and coupling. Specifically, UnityGraph is a hypervariate graph based network. The flexibility of the hypergraph allows us to consider the observed motions as graph nodes. We then leverage hyperedges to bridge these nodes for exploring spatio-temporal features. This perspective considers spatio-temporal dynamics unitedly and reformulates multi-person motion prediction into a problem on a single graph. Leveraging the dynamic message passing based on this hypergraph, our model dynamically learns from both types of relations to generate targeted messages that reflect the relevance among nodes. Extensive experiments on several datasets demonstrates that our method achieves state-of-the-art performance, confirming its effectiveness and innovative design.

[AI-75] Diffusion-based Auction Mechanism for Efficient Resource Management in 6G-enabled Vehicular Metaverses

链接: https://arxiv.org/abs/2411.04139
作者: Jiawen Kang,Yongju Tong,Yue Zhong,Junlong Chen,Minrui Xu,Dusit Niyato,Runrong Deng,Shiwen Mao
关键词-EN: Vehicular Metaverses, model-based Augmented Reality, large Artificial Intelligence, high bandwidth connectivity, real-time vehicular
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of 6G-enable Vehicular Metaverses is transforming the automotive industry by integrating immersive, real-time vehicular services through ultra-low latency and high bandwidth connectivity. In 6G-enable Vehicular Metaverses, vehicles are represented by Vehicle Twins (VTs), which serve as digital replicas of physical vehicles to support real-time vehicular applications such as large Artificial Intelligence (AI) model-based Augmented Reality (AR) navigation, called VT tasks. VT tasks are resource-intensive and need to be offloaded to ground Base Stations (BSs) for fast processing. However, high demand for VT tasks and limited resources of ground BSs, pose significant resource allocation challenges, particularly in densely populated urban areas like intersections. As a promising solution, Unmanned Aerial Vehicles (UAVs) act as aerial edge servers to dynamically assist ground BSs in handling VT tasks, relieving resource pressure on ground BSs. However, due to high mobility of UAVs, there exists information asymmetry regarding VT task demands between UAVs and ground BSs, resulting in inefficient resource allocation of UAVs. To address these challenges, we propose a learning-based Modified Second-Bid (MSB) auction mechanism to optimize resource allocation between ground BSs and UAVs by accounting for VT task latency and accuracy. Moreover, we design a diffusion-based reinforcement learning algorithm to optimize the price scaling factor, maximizing the total surplus of resource providers and minimizing VT task latency. Finally, simulation results demonstrate that the proposed diffusion-based MSB auction outperforms traditional baselines, providing better resource distribution and enhanced service quality for vehicular users.

[AI-76] NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation NEURIPS

链接: https://arxiv.org/abs/2411.04138
作者: Momin Haider,Ming Yin,Menglei Zhang,Arpit Gupta,Jing Zhu,Yu-Xiang Wang
关键词-EN: Mobile devices, multi-access traffic splitting, LTE, Mobile, multiple access networks
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS (Datasets and Benchmarks)

点击查看摘要

Abstract:Mobile devices such as smartphones, laptops, and tablets can often connect to multiple access networks (e.g., Wi-Fi, LTE, and 5G) simultaneously. Recent advancements facilitate seamless integration of these connections below the transport layer, enhancing the experience for apps that lack inherent multi-path support. This optimization hinges on dynamically determining the traffic distribution across networks for each device, a process referred to as \textitmulti-access traffic splitting. This paper introduces \textitNetworkGym, a high-fidelity network environment simulator that facilitates generating multiple network traffic flows and multi-access traffic splitting. This simulator facilitates training and evaluating different RL-based solutions for the multi-access traffic splitting problem. Our initial explorations demonstrate that the majority of existing state-of-the-art offline RL algorithms (e.g. CQL) fail to outperform certain hand-crafted heuristic policies on average. This illustrates the urgent need to evaluate offline RL algorithms against a broader range of benchmarks, rather than relying solely on popular ones such as D4RL. We also propose an extension to the TD3+BC algorithm, named Pessimistic TD3 (PTD3), and demonstrate that it outperforms many state-of-the-art offline RL algorithms. PTD3’s behavioral constraint mechanism, which relies on value-function pessimism, is theoretically motivated and relatively simple to implement.

[AI-77] Generative AI Enabled Matching for 6G Multiple Access

链接: https://arxiv.org/abs/2411.04137
作者: Xudong Wang,Hongyang Du,Dusit Niyato,Lijie Zhou,Lei Feng,Zhixiang Yang,Fanqin Zhou,Wenjing Li
关键词-EN: applying deep learning, deep learning models, applying deep, deep learning, multiple access
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages,5 figures

点击查看摘要

Abstract:In wireless networks, applying deep learning models to solve matching problems between different entities has become a mainstream and effective approach. However, the complex network topology in 6G multiple access presents significant challenges for the real-time performance and stability of matching generation. Generative artificial intelligence (GenAI) has demonstrated strong capabilities in graph feature extraction, exploration, and generation, offering potential for graph-structured matching generation. In this paper, we propose a GenAI-enabled matching generation framework to support 6G multiple access. Specifically, we first summarize the classical matching theory, discuss common GenAI models and applications from the perspective of matching generation. Then, we propose a framework based on generative diffusion models (GDMs) that iteratively denoises toward reward maximization to generate a matching strategy that meets specific requirements. Experimental results show that, compared to decision-based AI approaches, our framework can generate more effective matching strategies based on given conditions and predefined rewards, helping to solve complex problems in 6G multiple access, such as task allocation.

[AI-78] Enhancement of Approximation Spaces by the Use of Primals and Neighborhood

链接: https://arxiv.org/abs/2411.04133
作者: A. Çaksu Güler
关键词-EN: handling incomplete information, Rough set theory, rough set models, generalized rough set, Rough set
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Rough set theory is one of the most widely used and significant approaches for handling incomplete information. It divides the universe in the beginning and uses equivalency relations to produce blocks. Numerous generalized rough set models have been put out and investigated in an effort to increase flexibility and extend the range of possible uses. We introduce four new generalized rough set models that draw inspiration from “neighborhoods and primals” in order to make a contribution to this topic. By minimizing the uncertainty regions, these models are intended to assist decision makers in more effectively analyzing and evaluating the provided data. We verify this goal by demonstrating that the existing models outperform certain current method approaches in terms of improving the approximation operators (upper and lower) and accuracy measurements. We claim that the current models can preserve nearly all significant aspects associated with the rough set model. Preserving the monotonic property, which enables us to assess data uncertainty and boost confidence in outcomes, is one of the intriguing characterizations derived from the existing models. With the aid of specific instances, we also compare the areas of the current approach. Finally, we demonstrate that the new strategy we define for our everyday health-related problem yields more accurate findings.

[AI-79] AmazonQAC: A Large-Scale Naturalistic Query Autocomplete Dataset EMNLP2024

链接: https://arxiv.org/abs/2411.04129
作者: Dante Everaert,Rohit Patki,Tianqi Zheng,Christopher Potts
关键词-EN: Query Autocomplete, modern search engines, predicting search queries, search queries based, facilitating user interaction
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP 2024

点击查看摘要

Abstract:Query Autocomplete (QAC) is a critical feature in modern search engines, facilitating user interaction by predicting search queries based on input prefixes. Despite its widespread adoption, the absence of large-scale, realistic datasets has hindered advancements in QAC system development. This paper addresses this gap by introducing AmazonQAC, a new QAC dataset sourced from Amazon Search logs, comprising 395M samples. The dataset includes actual sequences of user-typed prefixes leading to final search terms, as well as session IDs and timestamps that support modeling the context-dependent aspects of QAC. We assess Prefix Trees, semantic retrieval, and Large Language Models (LLMs) with and without finetuning. We find that finetuned LLMs perform best, particularly when incorporating contextual information. However, even our best system achieves only half of what we calculate is theoretically possible on our test data, which implies QAC is a challenging problem that is far from solved with existing systems. This contribution aims to stimulate further research on QAC systems to better serve user needs in diverse environments. We open-source this data on Hugging Face at this https URL.

[AI-80] Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment

链接: https://arxiv.org/abs/2411.04127
作者: Joshua T. S. Hewson
关键词-EN: everyday life, ensuring its safe, urgent challenges, deeply integrated, integrated into critical
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As artificial intelligence (AI) becomes deeply integrated into critical infrastructures and everyday life, ensuring its safe deployment is one of humanity’s most urgent challenges. Current AI models prioritize task optimization over safety, leading to risks of unintended harm. These risks are difficult to address due to the competing interests of governments, businesses, and advocacy groups, all of which have different priorities in the AI race. Current alignment methods, such as reinforcement learning from human feedback (RLHF), focus on extrinsic behaviors without instilling a genuine understanding of human values. These models are vulnerable to manipulation and lack the social intelligence necessary to infer the mental states and intentions of others, raising concerns about their ability to safely and responsibly make important decisions in complex and novel situations. Furthermore, the divergence between extrinsic and intrinsic motivations in AI introduces the risk of deceptive or harmful behaviors, particularly as systems become more autonomous and intelligent. We propose a novel human-inspired approach which aims to address these various concerns and help align competing objectives.

[AI-81] We Urgently Need Intrinsically Kind Machines NEURIPS2024

链接: https://arxiv.org/abs/2411.04126
作者: Joshua T. S. Hewson
关键词-EN: Artificial Intelligence systems, Artificial Intelligence, Intelligence systems, rapidly evolving, integrating extrinsic
类目: Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 IMOL Workshop Paper

点击查看摘要

Abstract:Artificial Intelligence systems are rapidly evolving, integrating extrinsic and intrinsic motivations. While these frameworks offer benefits, they risk misalignment at the algorithmic level while appearing superficially aligned with human values. In this paper, we argue that an intrinsic motivation for kindness is crucial for making sure these models are intrinsically aligned with human values. We argue that kindness, defined as a form of altruism motivated to maximize the reward of others, can counteract any intrinsic motivations that might lead the model to prioritize itself over human well-being. Our approach introduces a framework and algorithm for embedding kindness into foundation models by simulating conversations. Limitations and future research directions for scalable implementation are discussed.

[AI-82] SPGD: Steepest Perturbed Gradient Descent Optimization

链接: https://arxiv.org/abs/2411.04946
作者: Amir M. Vahedi,Horea T. Ilies
关键词-EN: Steepest Perturbed Gradient, Perturbed Gradient Descent, Gradient Descent, saddle points, flat regions
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 28 pages, 26 figures, submitted to Journal of Mechanical Design

点击查看摘要

Abstract:Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as trapping in local minima, saddle points, and plateaus (flat regions), which makes the convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD’s superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD’s potential as a versatile tool for a wide range of optimization problems.

[AI-83] Machine learning and optimization-based approaches to duality in statistical physics

链接: https://arxiv.org/abs/2411.04838
作者: Andrea E. V. Ferrari,Prateek Gupta,Nabil Iqbal
关键词-EN: modern theoretical physics, mathematical descriptions, theoretical physics, physical system, key idea
类目: atistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 27 pages + appendices, lots of plots

点击查看摘要

Abstract:The notion of duality – that a given physical system can have two different mathematical descriptions – is a key idea in modern theoretical physics. Establishing a duality in lattice statistical mechanics models requires the construction of a dual Hamiltonian and a map from the original to the dual observables. By using simple neural networks to parameterize these maps and introducing a loss function that penalises the difference between correlation functions in original and dual models, we formulate the process of duality discovery as an optimization problem. We numerically solve this problem and show that our framework can rediscover the celebrated Kramers-Wannier duality for the 2d Ising model, reconstructing the known mapping of temperatures. We also discuss an alternative approach which uses known features of the mapping of topological lines to reduce the problem to optimizing the couplings in a dual Hamiltonian, and explore next-to-nearest neighbour deformations of the 2d Ising duality. We discuss future directions and prospects for discovering new dualities within this framework.

[AI-84] Equivariant Graph Attention Networks with Structural Motifs for Predicting Cell Line-Specific Synergistic Drug Combinations

链接: https://arxiv.org/abs/2411.04747
作者: Zachary Schwehr
关键词-EN: forms of treatment, primary forms, drug, drug combination, methods
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 1 figure, Presented at IEEE CIBCB

点击查看摘要

Abstract:Cancer is the second leading cause of death, with chemotherapy as one of the primary forms of treatment. As a result, researchers are turning to drug combination therapy to decrease drug resistance and increase efficacy. Current methods of drug combination screening, such as in vivo and in vitro, are inefficient due to stark time and monetary costs. In silico methods have become increasingly important for screening drugs, but current methods are inaccurate and generalize poorly to unseen anticancer drugs. In this paper, I employ a geometric deep-learning model utilizing a graph attention network that is equivariant to 3D rotations, translations, and reflections with structural motifs. Additionally, the gene expression of cancer cell lines is utilized to classify synergistic drug combinations specific to each cell line. I compared the proposed geometric deep learning framework to current state-of-the-art (SOTA) methods, and the proposed model architecture achieved greater performance on all 12 benchmark tasks performed on the DrugComb dataset. Specifically, the proposed framework outperformed other SOTA methods by an accuracy difference greater than 28%. Based on these results, I believe that the equivariant graph attention network’s capability of learning geometric data accounts for the large performance improvements. The model’s ability to generalize to foreign drugs is thought to be due to the structural motifs providing a better representation of the molecule. Overall, I believe that the proposed equivariant geometric deep learning framework serves as an effective tool for virtually screening anticancer drug combinations for further validation in a wet lab environment. The code for this work is made available online at: this https URL.

[AI-85] Graph neural networks and non-commuting operators NEURIPS2024

链接: https://arxiv.org/abs/2411.04265
作者: Mauricio Velasco,Kaiying O’Hare,Bernardo Rychtenberg,Soledad Villar
关键词-EN: typically involve predicting, involve predicting features, wide variety, typically involve, involve predicting
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Graph neural networks (GNNs) provide state-of-the-art results in a wide variety of tasks which typically involve predicting features at the vertices of a graph. They are built from layers of graph convolutions which serve as a powerful inductive bias for describing the flow of information among the vertices. Often, more than one data modality is available. This work considers a setting in which several graphs have the same vertex set and a common vertex-level learning task. This generalizes standard GNN models to GNNs with several graph operators that do not commute. We may call this model graph-tuple neural networks (GtNN). In this work, we develop the mathematical theory to address the stability and transferability of GtNNs using properties of non-commuting non-expansive operators. We develop a limit theory of graphon-tuple neural networks and use it to prove a universal transferability theorem that guarantees that all graph-tuple neural networks are transferable on convergent graph-tuple sequences. In particular, there is no non-transferable energy under the convergence we consider here. Our theoretical results extend well-known transferability theorems for GNNs to the case of several simultaneous graphs (GtNNs) and provide a strict improvement on what is currently known even in the GNN case. We illustrate our theoretical results with simple experiments on synthetic and real-world data. To this end, we derive a training procedure that provably enforces the stability of the resulting model. Comments: NeurIPS 2024 Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2411.04265 [stat.ML] (or arXiv:2411.04265v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2411.04265 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-86] Bio-xLSTM: Generative modeling representation and in-context learning of biological and chemical sequences

链接: https://arxiv.org/abs/2411.04165
作者: Niklas Schmidinger,Lisa Schneckenreiter,Philipp Seidl,Johannes Schimunek,Pieter-Jan Hoedt,Johannes Brandstetter,Andreas Mayr,Sohvi Luukkonen,Sepp Hochreiter,Günter Klambauer
关键词-EN: enable crucial applications, sequences enable crucial, chemical sequences, chemical sequences enable, drug discovery
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Language models for biological and chemical sequences enable crucial applications such as drug discovery, protein engineering, and precision medicine. Currently, these language models are predominantly based on Transformer architectures. While Transformers have yielded impressive results, their quadratic runtime dependency on the sequence length complicates their use for long genomic sequences and in-context learning on proteins and chemical sequences. Recently, the recurrent xLSTM architecture has been shown to perform favorably compared to Transformers and modern state-space model (SSM) architectures in the natural language domain. Similar to SSMs, xLSTMs have a linear runtime dependency on the sequence length and allow for constant-memory decoding at inference time, which makes them prime candidates for modeling long-range dependencies in biological and chemical sequences. In this work, we tailor xLSTM towards these domains and propose a suite of architectural variants called Bio-xLSTM. Extensive experiments in three large domains, genomics, proteins, and chemistry, were performed to assess xLSTM’s ability to model biological and chemical sequences. The results show that models based on Bio-xLSTM a) can serve as proficient generative models for DNA, protein, and chemical sequences, b) learn rich representations for those modalities, and c) can perform in-context learning for proteins and small molecules.

[AI-87] Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation

链接: https://arxiv.org/abs/2411.03551
作者: Zhiling Yue,Yingying Fang,Liutao Yang,Nikhil Baid,Simon Walsh,Guang Yang
关键词-EN: Fibrotic Lung Disease, severe condition marked, Fibrotic Lung, Lung Disease, lung stiffening
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image-level annotations to generate pixel-level fibrosis segmentation, reducing the need for fine-grained manual labeling. Additionally, our DiffSeg incorporates a diffusion-based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of the fibrosis-injected slices and their paired fibrosis location. Experiments indicate that our method significantly improves the accuracy of pseudo masks generated by existing WSSS methods, greatly reducing the complexity of manual labeling and enhancing the consistency of the generated masks.

计算机视觉

[CV-0] SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

链接: https://arxiv.org/abs/2411.05007
作者: Muyang Li,Yujun Lin,Zhekai Zhang,Tianle Cai,Xiuyu Li,Junxian Guo,Enze Xie,Chenlin Meng,Jun-Yan Zhu,Song Han
关键词-EN: proven highly effective, generating high-quality images, Diffusion models, effective at generating, generating high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Quantization Library: this https URL Inference Engine: this https URL Website: this https URL Demo: this https URL Blog: this https URL

点击查看摘要

Abstract:Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, na"ıvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt- \Sigma , and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5 \times , achieving 3.0 \times speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.

[CV-1] ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing NEURIPS2024

链接: https://arxiv.org/abs/2411.05006
作者: Jun-Kun Chen,Yu-Xiong Wang
关键词-EN: paper proposes ProEdit, progressive manner, scene editing guided, paper proposes, editing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024. Project Page: this https URL

点击查看摘要

Abstract:This paper proposes ProEdit - a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the crucial observation that multi-view inconsistency in scene editing is rooted in the diffusion model’s large feasible output space (FOS), our framework controls the size of FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene. Within this framework, we design a difficulty-aware subtask decomposition scheduler and an adaptive 3D Gaussian splatting (3DGS) training strategy, ensuring high quality and efficiency in performing each subtask. Extensive evaluation shows that our ProEdit achieves state-of-the-art results in various scenes and challenging editing tasks, all through a simple framework without any expensive or sophisticated add-ons like distillation losses, components, or training procedures. Notably, ProEdit also provides a new way to control, preview, and select the “aggressivity” of editing operation during the editing process.

[CV-2] Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

链接: https://arxiv.org/abs/2411.05005
作者: Shuhong Zheng,Zhipeng Bao,Ruoyu Zhao,Martial Hebert,Yu-Xiong Wang
关键词-EN: high-fidelity image synthesis, recently exhibited promising, exhibited promising results, visual perception tasks, dense visual perception
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 26 pages, 14 figures

点击查看摘要

Abstract:Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.

[CV-3] LoFi: Scalable Local Image Reconstruction with Implicit Neural Representation

链接: https://arxiv.org/abs/2411.04995
作者: AmirEhsan Khorashadizadeh,Tobías I. Liaudat,Tianlin Liu,Jason D. McEwen,Ivan Dokmanić
关键词-EN: attracted significant attention, implicit neural representations, signal processing due, Neural fields, implicit neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural fields or implicit neural representations (INRs) have attracted significant attention in machine learning and signal processing due to their efficient continuous representation of images and 3D volumes. In this work, we build on INRs and introduce a coordinate-based local processing framework for solving imaging inverse problems, termed LoFi (Local Field). Unlike conventional methods for image reconstruction, LoFi processes local information at each coordinate \textitseparately by multi-layer perceptrons (MLPs), recovering the object at that specific coordinate. Similar to INRs, LoFi can recover images at any continuous coordinate, enabling image reconstruction at multiple resolutions. With comparable or better performance than standard CNNs for image reconstruction, LoFi achieves excellent generalization to out-of-distribution data and memory usage almost independent of image resolution. Remarkably, training on 1024 \times 1024 images requires just 3GB of memory – over 20 times less than the memory typically needed by standard CNNs. Additionally, LoFi’s local design allows it to train on extremely small datasets with less than 10 samples, without overfitting or the need for regularization or early stopping. Finally, we use LoFi as a denoising prior in a plug-and-play framework for solving general inverse problems to benefit from its continuous image representation and strong generalization. Although trained on low-resolution images, LoFi can be used as a low-dimensional prior to solve inverse problems at any resolution. We validate our framework across a variety of imaging modalities, from low-dose computed tomography to radio interferometric imaging.

[CV-4] SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

链接: https://arxiv.org/abs/2411.04989
作者: Koichi Namekata,Sherwin Bahmani,Ziyi Wu,Yash Kant,Igor Gilitschenski,David B. Lindell
关键词-EN: achieved impressive, photo-realistic quality, Abstract, impressive, photo-realistic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided \unicodex2013 offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while being competitive with supervised models in terms of visual quality and motion fidelity.

[CV-5] Planar Reflection-Aware Neural Radiance Fields

链接: https://arxiv.org/abs/2411.04984
作者: Chen Gao,Yipeng Wang,Changil Kim,Jia-Bin Huang,Johannes Kopf
关键词-EN: Neural Radiance Fields, demonstrated exceptional capabilities, Neural Radiance, high fidelity, demonstrated exceptional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have demonstrated exceptional capabilities in reconstructing complex scenes with high fidelity. However, NeRF’s view dependency can only handle low-frequency reflections. It falls short when handling complex planar reflections, often interpreting them as erroneous scene geometries and leading to duplicated and inaccurate scene representations. To address this challenge, we introduce a reflection-aware NeRF that jointly models planar reflectors, such as windows, and explicitly casts reflected rays to capture the source of the high-frequency reflections. We query a single radiance field to render the primary color and the source of the reflection. We propose a sparse edge regularization to help utilize the true sources of reflections for rendering planar reflections rather than creating a duplicate along the primary ray at the same depth. As a result, we obtain accurate scene geometry. Rendering along the primary ray results in a clean, reflection-free view, while explicitly rendering along the reflected ray allows us to reconstruct highly detailed reflections. Our extensive quantitative and qualitative evaluations of real-world datasets demonstrate our method’s enhanced performance in accurately handling reflections.

[CV-6] AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation NEURIPS2024

链接: https://arxiv.org/abs/2411.04967
作者: Anil Kag,Huseyin Coskun,Jierun Chen,Junli Cao,Willi Menapace,Aliaksandr Siarohin,Sergey Tulyakov,Jian Ren
关键词-EN: Neural network architecture, Neural network, design requires making, transformer blocks, requires making
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024. Project Page: this https URL

点击查看摘要

Abstract:Neural network architecture design requires making many crucial decisions. The common desiderata is that similar decisions, with little modifications, can be reused in a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN – a hybrid architecture, combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective \emphasymmetric architecture, where the distribution of convolutional and transformer blocks is \emphasymmetric, containing more convolutional blocks in the earlier stages, followed by more transformer blocks in later stages. AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation, and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.

[CV-7] VAIR: Visuo-Acoustic Implicit Representations for Low-Cost Multi-Modal Transparent Surface Reconstruction in Indoor Scenes

链接: https://arxiv.org/abs/2411.04963
作者: Advaith V. Sethuraman,Onur Bagoren,Harikrishnan Seetharaman,Dalton Richardson,Joseph Taylor,Katherine A. Skinner
关键词-EN: Mobile robots operating, Mobile robots, robots operating indoors, navigate challenging scenes, robots operating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method’s effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over state-of-the-art for transparent surface reconstruction.

[CV-8] CAD-MLLM : Unifying Multimodality-Conditioned CAD Generation With MLLM

链接: https://arxiv.org/abs/2411.04954
作者: Jingwei Xu,Chenyu Wang,Zibo Zhao,Wen Liu,Yi Ma,Shenghua Gao
关键词-EN: easily generate CAD, unified Computer-Aided Design, generate CAD models, CAD models based, CAD models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user’s inputs in the form of textual description, images, point clouds, or even a combination of them. Towards this goal, we introduce the CAD-MLLM, the first system capable of generating parametric CAD models conditioned on the multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multi-modalities data and CAD models’ vectorized representations. To facilitate the model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains textual description, multi-view images, points, and command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noises and missing points. The project page and more visualizations can be found at: this https URL

[CV-9] A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

链接: https://arxiv.org/abs/2411.04942
作者: Panwen Hu,Nan Xiao,Feifei Li,Yongquan Chen,Rui Huang
关键词-EN: editing techniques attract, editing, techniques attract, attention from industry, industry and academia
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet the automatic systems for general editing, e.g., movie or vlog editing which covers various scenes and events, were rarely studied before, and converting the event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage the pre-trained Vision-Language Model (VLM) to extract the editing-relevant representations as editing context. Moreover, to close the gap between the professional-looking videos and the automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.

[CV-10] SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering EMNLP2024

链接: https://arxiv.org/abs/2411.04933
作者: ianyu Yang,Yiyang Nan,Lisen Dai,Zhenwen Liang,Yapeng Tian,Xiangliang Zhang
关键词-EN: involves answering questions, answering questions based, Audio-Visual Question Answering, involves answering, Semantic Representation Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: EMNLP 2024

点击查看摘要

Abstract:Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.

[CV-11] MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views NEURIPS2024

链接: https://arxiv.org/abs/2411.04924
作者: Yuedong Chen,Chuanxia Zheng,Haofei Xu,Bohan Zhuang,Andrea Vedaldi,Tat-Jen Cham,Jianfei Cai
关键词-EN: diverse real-world scenes, real-world scenes, diverse real-world, Stable Video Diffusion, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024, Project page: this https URL , Code: this https URL

点击查看摘要

Abstract:We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360’s performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. The video results are available on our project page: this https URL.

[CV-12] VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

链接: https://arxiv.org/abs/2411.04923
作者: Shehan Munasinghe,Hanan Gani,Wenqi Zhu,Jiale Cao,Eric Xing,Fahad Shahbaz Khan,Salman Khan
关键词-EN: Large Language Model, due to complex, Large Multimodal Models, spatial and temporal, video-based Large Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report of VideoGLaMM

点击查看摘要

Abstract:Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, a LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.

[CV-13] Stem-OB: Generalizable Visual Imitation Learning with Stem-Like Convergent Observation through Diffusion Inversion

链接: https://arxiv.org/abs/2411.04919
作者: Kaizhe Hu,Zihang Rui,Yao He,Yuyao Liu,Pu Hua,Huazhe Xu
关键词-EN: demonstrate strong performance, visual input perturbations, Visual imitation learning, imitation learning methods, learning methods demonstrate
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Arxiv preprint version

点击查看摘要

Abstract:Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations, including variations in lighting and textures, impeding their real-world application. We propose Stem-OB that utilizes pretrained image diffusion models to suppress low-level visual differences while maintaining high-level scene structures. This image inversion process is akin to transforming the observation into a shared representation, from which other observations stem, with extraneous details removed. Stem-OB contrasts with data-augmentation approaches as it is robust to various unspecified appearance changes without the need for additional training. Our method is a simple yet highly effective plug-and-play solution. Empirical results confirm the effectiveness of our approach in simulated tasks and show an exceptionally significant improvement in real-world applications, with an average increase of 22.2% in success rates compared to the best baseline. See this https URL for more info.

[CV-14] Robust Iris Centre Localisation for Assistive Eye-Gaze Tracking

链接: https://arxiv.org/abs/2411.04912
作者: Nipun Sandamal Ranasekara Pathiranage,Stefania Cristina,Kenneth P. Camilleri
关键词-EN: iris centre localisation, eye-gaze tracking platform, robust iris centre, research work, iris centre
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this research work, we address the problem of robust iris centre localisation in unconstrained conditions as a core component of our eye-gaze tracking platform. We investigate the application of U-Net variants for segmentation-based and regression-based approaches to improve our iris centre localisation, which was previously based on Bayes’ classification. The achieved results are comparable to or better than the state-of-the-art, offering a drastic improvement over those achieved by the Bayes’ classifier, and without sacrificing the real-time performance of our eye-gaze tracking platform.

[CV-15] In the Era of Prompt Learning with Vision-Language Models

链接: https://arxiv.org/abs/2411.04892
作者: Ankit Jha
关键词-EN: Large-scale foundation models, shown strong zero-shot, strong zero-shot generalization, Large-scale foundation, limiting their adaptability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICVGIP 2024, Young Faculty Symposium

点击查看摘要

Abstract:Large-scale foundation models like CLIP have shown strong zero-shot generalization but struggle with domain shifts, limiting their adaptability. In our work, we introduce \textscStyLIP, a novel domain-agnostic prompt learning strategy for Domain Generalization (DG). StyLIP disentangles visual style and content in CLIPs vision encoder by using style projectors to learn domain-specific prompt tokens and combining them with content features. Trained contrastively, this approach enables seamless adaptation across domains, outperforming state-of-the-art methods on multiple DG benchmarks. Additionally, we propose AD-CLIP for unsupervised domain adaptation (DA), leveraging CLIPs frozen vision backbone to learn domain-invariant prompts through image style and content features. By aligning domains in embedding space with entropy minimization, AD-CLIP effectively handles domain shifts, even when only target domain samples are available. Lastly, we outline future work on class discovery using prompt learning for semantic segmentation in remote sensing, focusing on identifying novel or rare classes in unstructured environments. This paves the way for more adaptive and generalizable models in complex, real-world scenarios.

[CV-16] Boosting Latent Diffusion with Perceptual Objectives

链接: https://arxiv.org/abs/2411.04873
作者: Tariq Berrada,Pietro Astolfi,Jakob Verbeek,Melissa Hall,Marton Havasi,Michal Drozdzal,Yohann Benchetrit,Adriana Romero-Soriano,Karteek Alahari
关键词-EN: Latent diffusion models, RGB image space, high-resolution generative image, Latent diffusion, Latent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Pre-print

点击查看摘要

Abstract:Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative – with boosts between 6% and 20% in FID – and qualitative results when using our perceptual loss.

[CV-17] End-to-end Inception-Unet based Generative Adversarial Networks for Snow and Rain Removals

链接: https://arxiv.org/abs/2411.04821
作者: Ibrahim Kajo,Mohamed Kas,Yassine Ruichek
关键词-EN: removing atmospheric particles, deep learning approaches, superior performance introduced, favors their usage, removing atmospheric
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The superior performance introduced by deep learning approaches in removing atmospheric particles such as snow and rain from a single image; favors their usage over classical ones. However, deep learning-based approaches still suffer from challenges related to the particle appearance characteristics such as size, type, and transparency. Furthermore, due to the unique characteristics of rain and snow particles, single network based deep learning approaches struggle in handling both degradation scenarios simultaneously. In this paper, a global framework that consists of two Generative Adversarial Networks (GANs) is proposed where each handles the removal of each particle individually. The architectures of both desnowing and deraining GANs introduce the integration of a feature extraction phase with the classical U-net generator network which in turn enhances the removal performance in the presence of severe variations in size and appearance. Furthermore, a realistic dataset that contains pairs of snowy images next to their groundtruth images estimated using a low-rank approximation approach; is presented. The experiments show that the proposed desnowing and deraining approaches achieve significant improvements in comparison to the state-of-the-art approaches when tested on both synthetic and realistic datasets.

[CV-18] GANESH: Generalizable NeRF for Lensless Imaging

链接: https://arxiv.org/abs/2411.04810
作者: Rakesh Raj Madavan,Akshat Kaimal,Badhrinarayanan K V,Vinayak Gupta,Rohit Choudhary,Chandrakala Shanmuganathan,Kaushik Mitra
关键词-EN: bulky lens system, develop ultra-compact cameras, conventional bulky lens, Lensless imaging offers, lens system
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Lensless imaging offers a significant opportunity to develop ultra-compact cameras by removing the conventional bulky lens system. However, without a focusing element, the sensor’s output is no longer a direct image but a complex multiplexed scene representation. Traditional methods have attempted to address this challenge by employing learnable inversions and refinement models, but these methods are primarily designed for 2D reconstruction and do not generalize well to 3D reconstruction. We introduce GANESH, a novel framework designed to enable simultaneous refinement and novel view synthesis from multi-view lensless images. Unlike existing methods that require scene-specific training, our approach supports on-the-fly inference without retraining on each scene. Moreover, our framework allows us to tune our model to specific scenes, enhancing the rendering and refinement quality. To facilitate research in this area, we also present the first multi-view lensless dataset, LenslessScenes. Extensive experiments demonstrate that our method outperforms current approaches in reconstruction accuracy and refinement quality. Code and video results are available at this https URL

[CV-19] aming Rectified Flow for Inversion and Editing

链接: https://arxiv.org/abs/2411.04746
作者: Jiangshan Wang,Junfu Pu,Zhongang Qi,Jiayi Guo,Yue Ma,Nisha Huang,Yuxin Chen,Xiu Li,Ying Shan
关键词-EN: FLUX and OpenSora, demonstrated exceptional performance, diffusion transformers, demonstrated exceptional, video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Rectified-flow-based diffusion transformers, such as FLUX and OpenSora, have demonstrated exceptional performance in the field of image and video generation. Despite their robust generative capabilities, these models often suffer from inaccurate inversion, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that enhances inversion precision by reducing errors in the process of solving rectified flow ODEs. Specifically, we derive the exact formulation of the rectified flow ODE and perform a high-order Taylor expansion to estimate its nonlinear components, significantly decreasing the approximation error at each timestep. Building upon RF-Solver, we further design RF-Edit, which comprises specialized sub-modules for image and video editing. By sharing self-attention layer features during the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments on text-to-image generation, image video inversion, and image video editing demonstrate the robust performance and adaptability of our methods. Code is available at this https URL.

[CV-20] Convolutional Differentiable Logic Gate Networks NEURIPS2024

链接: https://arxiv.org/abs/2411.04732
作者: Felix Petersen,Hilde Kuehne,Christian Borgelt,Julian Welzel,Stefano Ermon
关键词-EN: machine learning models, increasing inference cost, logic gate, logic gate networks, cost of machine
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at NeurIPS 2024 (Oral)

点击查看摘要

Abstract:With the increasing inference cost of machine learning models, there is a growing interest in models with fast and efficient inference. Recently, an approach for learning logic gate networks directly via a differentiable relaxation was proposed. Logic gate networks are faster than conventional neural network approaches because their inference only requires logic gate operators such as NAND, OR, and XOR, which are the underlying building blocks of current hardware and can be efficiently executed. We build on this idea, extending it by deep logic gate tree convolutions, logical OR pooling, and residual initializations. This allows scaling logic gate networks up by over one order of magnitude and utilizing the paradigm of convolution. On CIFAR-10, we achieve an accuracy of 86.29% using only 61 million logic gates, which improves over the SOTA while being 29x smaller.

[CV-21] Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation

链接: https://arxiv.org/abs/2411.04724
作者: Benito Buchheim,Max Reimann,Jürgen Döllner
关键词-EN: present a methodology, diffusion models, human parametric model, parametric model, synthetic data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D human parametric model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets and high-quality annotations, which can be more cost-effectively acquired through synthetic data generation rather than real-world data. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model’s visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network to adapt the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2d pose-based ControlNet, while maintaining the visual fidelity and improving stability, proving its usefulness for downstream tasks such as human animation.

[CV-22] Subspace-Constrained Quadratic Matrix Factorization: Algorithm and Applications

链接: https://arxiv.org/abs/2411.04717
作者: Zheng Zhai,Xiaohui Li
关键词-EN: widely adopted framework, modeling data exhibiting, data exhibiting low-rank, exhibiting low-rank structures, quadratic matrix factorization
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Matrix Factorization has emerged as a widely adopted framework for modeling data exhibiting low-rank structures. To address challenges in manifold learning, this paper presents a subspace-constrained quadratic matrix factorization model. The model is designed to jointly learn key low-dimensional structures, including the tangent space, the normal subspace, and the quadratic form that links the tangent space to a low-dimensional representation. We solve the proposed factorization model using an alternating minimization method, involving an in-depth investigation of nonlinear regression and projection subproblems. Theoretical properties of the quadratic projection problem and convergence characteristics of the alternating strategy are also investigated. To validate our approach, we conduct numerical experiments on synthetic and real-world datasets. Results demonstrate that our model outperforms existing methods, highlighting its robustness and efficacy in capturing core low-dimensional structures.

[CV-23] NeuroFly: A framework for whole-brain single neuron reconstruction

链接: https://arxiv.org/abs/2411.04715
作者: Rubin Zhao,Yang Liu,Shiqi Zhang,Zijian Yi,Yanyang Xiao,Fang Xu,Yi Yang,Pencheng Zhou
关键词-EN: enable efficient signal, efficient signal integration, tree-like dendritic, enable efficient, dendritic and axonal
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Neurons, with their elongated, tree-like dendritic and axonal structures, enable efficient signal integration and long-range communication across brain regions. By reconstructing individual neurons’ morphology, we can gain valuable insights into brain connectivity, revealing the structure basis of cognition, movement, and perception. Despite the accumulation of extensive 3D microscopic imaging data, progress has been considerably hindered by the absence of automated tools to streamline this process. Here we introduce NeuroFly, a validated framework for large-scale automatic single neuron reconstruction. This framework breaks down the process into three distinct stages: segmentation, connection, and proofreading. In the segmentation stage, we perform automatic segmentation followed by skeletonization to generate over-segmented neuronal fragments without branches. During the connection stage, we use a 3D image-based path following approach to extend each fragment and connect it with other fragments of the same neuron. Finally, human annotators are required only to proofread the few unresolved positions. The first two stages of our process are clearly defined computer vision problems, and we have trained robust baseline models to solve them. We validated NeuroFly’s efficiency using in-house datasets that include a variety of challenging scenarios, such as dense arborizations, weak axons, images with contamination. We will release the datasets along with a suite of visualization and annotation tools for better reproducibility. Our goal is to foster collaboration among researchers to address the neuron reconstruction challenge, ultimately accelerating advancements in neuroscience research. The dataset and code are available at this https URL

[CV-24] Revisiting Disparity from Dual-Pixel Images: Physics-Informed Lightweight Depth Estimation WACV

链接: https://arxiv.org/abs/2411.04714
作者: Teppei Kurita,Yuhi Kondo,Legong Sun,Takayuki Sasaki,Sho Nitta,Yasuhiro Hashimoto,Yoshinori Muramatsu,Yusuke Moriuchi
关键词-EN: disparity, propose a high-performance, estimation method, disparity estimation method, method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to IEEE Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:In this study, we propose a high-performance disparity (depth) estimation method using dual-pixel (DP) images with few parameters. Conventional end-to-end deep-learning methods have many parameters but do not fully exploit disparity constraints, which limits their performance. Therefore, we propose a lightweight disparity estimation method based on a completion-based network that explicitly constrains disparity and learns the physical and systemic disparity properties of DP. By modeling the DP-specific disparity error parametrically and using it for sampling during training, the network acquires the unique properties of DP and enhances robustness. This learning also allows us to use a common RGB-D dataset for training without a DP dataset, which is labor-intensive to acquire. Furthermore, we propose a non-learning-based refinement framework that efficiently handles inherent disparity expansion errors by appropriately refining the confidence map of the network output. As a result, the proposed method achieved state-of-the-art results while reducing the overall system size to 1/5 of that of the conventional method, even without using the DP dataset for training, thereby demonstrating its effectiveness. The code and dataset are available on our project site.

[CV-25] Multi-Reward as Condition for Instruction-based Image Editing

链接: https://arxiv.org/abs/2411.04713
作者: Xin Gu,Ming Li,Libo Zhang,Fan Chen,Longyin Wen,Tiejian Luo,Sijie Zhu
关键词-EN: Stable Diffusion, instruction-based image editing, essential for instruction-based, Large Vision Language, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E) which are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preserving, and generation artifacts. In this paper, we propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) we first design a quantitative metric system based on best-in-class LVLM (Large Vision Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from 3 perspectives, namely, instruction following, detail preserving, and generation quality. For each perspective, we collected quantitative score in 0\sim 5 and text descriptive feedback on the specific failure points in ground-truth edited images, resulting in a high-quality editing reward dataset, i.e., RewardEdit20K. 2) We further proposed a novel training framework to seamlessly integrate the metric output, regarded as multi-reward, into editing models to learn from the imperfect training triplets. During training, the reward scores and text descriptions are encoded as embeddings and fed into both the latent space and the U-Net of the editing models as auxiliary conditions. During inference, we set these additional conditions to the highest score with no text description for failure points, to aim at the best generation outcome. Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. The code and dataset will be released.

[CV-26] SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

链接: https://arxiv.org/abs/2411.04712
作者: Shivanshu Shekhar,Shreyas Singh,Tong Zhang
关键词-EN: Direct Preference Optimization, Direct Preference, Preference Optimization, align large language, large language models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.

[CV-27] Progressive Multi-Level Alignments for Semi-Supervised Domain Adaptation SAR Target Recognition Using Simulated Data

链接: https://arxiv.org/abs/2411.04711
作者: Xinzheng Zhang,Hui Zhu,Hongqian Zhuang
关键词-EN: intriguing research trend, train ATR models, synthetic aperture radar, ATR models, construct ATR models
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recently, an intriguing research trend for automatic target recognition (ATR) from synthetic aperture radar (SAR) imagery has arisen: using simulated data to train ATR models is a feasible solution to the issue of inadequate measured data. To close the domain gap that exists between the real and simulated data, the unsupervised domain adaptation (UDA) techniques are frequently exploited to construct ATR models. However, for UDA, the target domain lacks labeled data to direct the model training, posing a great challenge to ATR performance. To address the above problem, a semi-supervised domain adaptation (SSDA) framework has been proposed adopting progressive multi-level alignments for simulated data-aided SAR ATR. First, a progressive wavelet transform data augmentation (PWTDA) is presented by analyzing the discrepancies of wavelet decomposition sub-bands of two domain images, obtaining the domain-level alignment. Specifically, the domain gap is narrowed by mixing the wavelet transform high-frequency sub-band components. Second, we develop an asymptotic instance-prototype alignment (AIPA) strategy to push the source domain instances close to the corresponding target prototypes, aiming to achieve category-level alignment. Moreover, the consistency alignment is implemented by excavating the strong-weak augmentation consistency of both individual samples and the multi-sample relationship, enhancing the generalization capability of the model. Extensive experiments on the Synthetic and Measured Paired Labeled Experiment (SAMPLE) dataset, indicate that our approach obtains recognition accuracies of 99.63% and 98.91% in two common experimental settings with only one labeled sample per class of the target domain, outperforming the most advanced SSDA techniques.

[CV-28] IP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

链接: https://arxiv.org/abs/2411.04709
作者: Wenhao Wang,Yi Yang
关键词-EN: revolutionizing content creation, drawing increasing attention, increasing attention due, models drawing increasing, text and image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The project is publicly available at this https URL

点击查看摘要

Abstract:Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at this https URL.

[CV-29] From CNN to ConvRNN: Adapting Visualization Techniques for Time-Series Anomaly Detection

链接: https://arxiv.org/abs/2411.04707
作者: Fabien Poirier
关键词-EN: neural networks, solve various problems, networks are commonly, Nowadays, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Nowadays, neural networks are commonly used to solve various problems. Unfortunately, despite their effectiveness, they are often perceived as black boxes capable of providing answers without explaining their decisions, which raises numerous ethical and legal concerns. Fortunately, the field of explainability helps users understand these results. This aspect of machine learning allows users to grasp the decision-making process of a model and verify the relevance of its outcomes. In this article, we focus on the learning process carried out by a time distributed convRNN, which performs anomaly detection from video data.

[CV-30] ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing

链接: https://arxiv.org/abs/2411.04706
作者: Zhihui Zhang,Jinhui Pang,Jianan Li,Xiaoshuai Hao
关键词-EN: remote sensing community, remote sensing, challenging research task, sensing community, spatial correlations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-Image Super-Resolution (MISR) is a crucial yet challenging research task in the remote sensing community. In this paper, we address the challenging task of Multi-Image Super-Resolution in Remote Sensing (MISR-RS), aiming to generate a High-Resolution (HR) image from multiple Low-Resolution (LR) images obtained by satellites. Recently, the weak temporal correlations among LR images have attracted increasing attention in the MISR-RS task. However, existing MISR methods treat the LR images as sequences with strong temporal correlations, overlooking spatial correlations and imposing temporal dependencies. To address this problem, we propose a novel end-to-end framework named Enhancing Spatial Correlations in MISR (ESC-MISR), which fully exploits the spatial-temporal relations of multiple images for HR image reconstruction. Specifically, we first introduce a novel fusion module named Multi-Image Spatial Transformer (MIST), which emphasizes parts with clearer global spatial features and enhances the spatial correlations between LR images. Besides, we perform a random shuffle strategy for the sequential inputs of LR images to attenuate temporal dependencies and capture weak temporal correlations in the training stage. Compared with the state-of-the-art methods, our ESC-MISR achieves 0.70dB and 0.76dB cPSNR improvements on the two bands of the PROBA-V dataset respectively, demonstrating the superiority of our method.

[CV-31] Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion IJCAI2024

链接: https://arxiv.org/abs/2411.04697
作者: Yiming Sun,Bing Cao,Pengfei Zhu,Qinghua Hu
关键词-EN: integrate modality strengths, visually enhanced, Infrared and visible, aim to integrate, integrate modality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IJCAI 2024

点击查看摘要

Abstract:Infrared and visible image fusion aim to integrate modality strengths for visually enhanced, informative images. Visible imaging in real-world scenarios is susceptible to dynamic environmental brightness fluctuations, leading to texture degradation. Existing fusion methods lack robustness against such brightness perturbations, significantly compromising the visual fidelity of the fused imagery. To address this challenge, we propose the Brightness Adaptive multimodal dynamic fusion framework (BA-Fusion), which achieves robust image fusion despite dynamic brightness fluctuations. Specifically, we introduce a Brightness Adaptive Gate (BAG) module, which is designed to dynamically select features from brightness-related channels for normalization, while preserving brightness-independent structural information within the source images. Furthermore, we propose a brightness consistency loss function to optimize the BAG module. The entire framework is tuned via alternating training strategies. Extensive experiments validate that our method surpasses state-of-the-art methods in preserving multi-modal image information and visual fidelity, while exhibiting remarkable robustness across varying brightness levels. Our code is available: this https URL.

[CV-32] DNN-based 3D Cloud Retrieval for Variable Solar Illumination and Multiview Spaceborne Imaging

链接: https://arxiv.org/abs/2411.04682
作者: Tamar Klein,Tom Aizenberg,Roi Ronen
关键词-EN: retrieve two-dimensional maps, remotely sensed images, studies often rely, rely on remotely, remotely sensed
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 4 pages, 4 figures

点击查看摘要

Abstract:Climate studies often rely on remotely sensed images to retrieve two-dimensional maps of cloud properties. To advance volumetric analysis, we focus on recovering the three-dimensional (3D) heterogeneous extinction coefficient field of shallow clouds using multiview remote sensing data. Climate research requires large-scale worldwide statistics. To enable scalable data processing, previous deep neural networks (DNNs) can infer at spaceborne remote sensing downlink rates. However, prior methods are limited to a fixed solar illumination direction. In this work, we introduce the first scalable DNN-based system for 3D cloud retrieval that accommodates varying camera poses and solar directions. By integrating multiview cloud intensity images with camera poses and solar direction data, we achieve greater flexibility in recovery. Training of the DNN is performed by a novel two-stage scheme to address the high number of degrees of freedom in this problem. Our approach shows substantial improvements over previous state-of-the-art, particularly in handling variations in the sun’s zenith angle.

[CV-33] Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models

链接: https://arxiv.org/abs/2411.04663
作者: Taylor Arnold,Lauren Tilton
关键词-EN: permissible re-use licences, made large digitized, re-use licences, cultural institutions, institutions have made
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, CHR 2024: Computational Humanities Research Conference, December 4 - 6, 2024, Aarhus University, Denmark

点击查看摘要

Abstract:Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efficacy and possibilities of our approach.

[CV-34] Automated Image Color Mapping for a Historic Photographic Collection

链接: https://arxiv.org/abs/2411.04659
作者: Taylor Arnold,Lauren Tilton
关键词-EN: United States Environmental, States Environmental Protection, Environmental Protection Agency, Protection Agency sponsored, environmental subjects nation-wide
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
*备注: 11 pages, CHR 2024: Computational Humanities Research Conference, December 4 - 6, 2024, Aarhus University, Denmark

点击查看摘要

Abstract:In the 1970s, the United States Environmental Protection Agency sponsored Documerica, a large-scale photography initiative to document environmental subjects nation-wide. While over 15,000 digitized public-domain photographs from the collection are available online, most of the images were scanned from damaged copies of the original prints. We present and evaluate a modified histogram matching technique based on the underlying chemistry of the prints for correcting the damaged images by using training data collected from a small set of undamaged prints. The entire set of color-adjusted Documerica images is made available in an open repository.

[CV-35] ICH-SCNet: Intracerebral Hemorrhage Segmentation and Prognosis Classification Network Using CLIP-guided SAM mechanism

链接: https://arxiv.org/abs/2411.04656
作者: Xinlei Yu,Ahmed Elazab,Ruiquan Ge,Hui Jin,Xinchen Jiang,Gangyong Jia,Qing Wu,Qinglei Shi,Changmiao Wang
关键词-EN: Intracerebral hemorrhage, incidence of disability, fatal subtype, subtype of stroke, high incidence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 2 figures, 3 tables, published to BIBM 2024

点击查看摘要

Abstract:Intracerebral hemorrhage (ICH) is the most fatal subtype of stroke and is characterized by a high incidence of disability. Accurate segmentation of the ICH region and prognosis prediction are critically important for developing and refining treatment plans for post-ICH patients. However, existing approaches address these two tasks independently and predominantly focus on imaging data alone, thereby neglecting the intrinsic correlation between the tasks and modalities. This paper introduces a multi-task network, ICH-SCNet, designed for both ICH segmentation and prognosis classification. Specifically, we integrate a SAM-CLIP cross-modal interaction mechanism that combines medical text and segmentation auxiliary information with neuroimaging data to enhance cross-modal feature recognition. Additionally, we develop an effective feature fusion module and a multi-task loss function to improve performance further. Extensive experiments on an ICH dataset reveal that our approach surpasses other state-of-the-art methods. It excels in the overall performance of classification tasks and outperforms competing models in all segmentation task metrics.

[CV-36] DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction

链接: https://arxiv.org/abs/2411.04646
作者: Li Zhao,Zhengmin Lu
关键词-EN: Skeleton Diffusion Transformer, Spatio-Temporal Skeleton Diffusion, Spatio-Temporal Skeleton, dance movements synchronized, Diffusion Transformer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces DanceFusion, a novel framework for reconstructing and generating dance movements synchronized to music, utilizing a Spatio-Temporal Skeleton Diffusion Transformer. The framework adeptly handles incomplete and noisy skeletal data common in short-form dance videos on social media platforms like TikTok. DanceFusion incorporates a hierarchical Transformer-based Variational Autoencoder (VAE) integrated with a diffusion model, significantly enhancing motion realism and accuracy. Our approach introduces sophisticated masking techniques and a unique iterative diffusion process that refines the motion sequences, ensuring high fidelity in both motion generation and synchronization with accompanying audio cues. Comprehensive evaluations demonstrate that DanceFusion surpasses existing methods, providing state-of-the-art performance in generating dynamic, realistic, and stylistically diverse dance motions. Potential applications of this framework extend to content creation, virtual reality, and interactive entertainment, promising substantial advancements in automated dance generation. Visit our project page at this https URL.

[CV-37] Improved Multi-Task Brain Tumour Segmentation with Synthetic Data Augmentation

链接: https://arxiv.org/abs/2411.04632
作者: André Ferreira,Tiago Jesus,Behrus Puladi,Jens Kleesiek,Victor Alves,Jan Egger
关键词-EN: presents the winning, winning solution, third-placed solution, synthetic data, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the winning solution of task 1 and the third-placed solution of task 3 of the BraTS challenge. The use of automated tools in clinical practice has increased due to the development of more and more sophisticated and reliable algorithms. However, achieving clinical standards and developing tools for real-life scenarios is a major challenge. To this end, BraTS has organised tasks to find the most advanced solutions for specific purposes. In this paper, we propose the use of synthetic data to train state-of-the-art frameworks in order to improve the segmentation of adult gliomas in a post-treatment scenario, and the segmentation of meningioma for radiotherapy planning. Our results suggest that the use of synthetic data leads to more robust algorithms, although the synthetic data generation pipeline is not directly suited to the meningioma task. The code for these tasks is available at this https URL.

[CV-38] Brain Tumour Removing and Missing Modality Generation using 3D WDM

链接: https://arxiv.org/abs/2411.04630
作者: André Ferreira,Gijs Luijten,Behrus Puladi,Jens Kleesiek,Victor Alves,Jan Egger
关键词-EN: second-placed solution, participation solution, paper presents, presents the second-placed, MRI modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the second-placed solution for task 8 and the participation solution for task 7 of BraTS 2024. The adoption of automated brain analysis algorithms to support clinical practice is increasing. However, many of these algorithms struggle with the presence of brain lesions or the absence of certain MRI modalities. The alterations in the brain’s morphology leads to high variability and thus poor performance of predictive models that were trained only on healthy brains. The lack of information that is usually provided by some of the missing MRI modalities also reduces the reliability of the prediction models trained with all modalities. In order to improve the performance of these models, we propose the use of conditional 3D wavelet diffusion models. The wavelet transform enabled full-resolution image training and prediction on a GPU with 48 GB VRAM, without patching or downsampling, preserving all information for prediction. For the inpainting task of BraTS 2024, the use of a large and variable number of healthy masks and the stability and efficiency of the 3D wavelet diffusion model resulted in 0.007, 22.61 and 0.842 in the validation set and 0.07 , 22.8 and 0.91 in the testing set (MSE, PSNR and SSIM respectively). The code for these tasks is available at this https URL.

[CV-39] Multi-temporal crack segmentation in concrete structure using deep learning approaches

链接: https://arxiv.org/abs/2411.04620
作者: Said Harb,Pedro Achanccaray,Mehdi Maboudi,Markus Gerke
关键词-EN: earliest indicators, indicators of deterioration, segmentation quality, Swin UNETR trained, multi-temporal
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Cracks are among the earliest indicators of deterioration in concrete structures. Early automatic detection of these cracks can significantly extend the lifespan of critical infrastructures, such as bridges, buildings, and tunnels, while simultaneously reducing maintenance costs and facilitating efficient structural health monitoring. This study investigates whether leveraging multi-temporal data for crack segmentation can enhance segmentation quality. Therefore, we compare a Swin UNETR trained on multi-temporal data with a U-Net trained on mono-temporal data to assess the effect of temporal information compared with conventional single-epoch approaches. To this end, a multi-temporal dataset comprising 1356 images, each with 32 sequential crack propagation images, was created. After training the models, experiments were conducted to analyze their generalization ability, temporal consistency, and segmentation quality. The multi-temporal approach consistently outperformed its mono-temporal counterpart, achieving an IoU of 82.72% and a F1-score of 90.54% , representing a significant improvement over the mono-temporal model’s IoU of 76.69% and F1-score of 86.18% , despite requiring only half of the trainable parameters. The multi-temporal model also displayed a more consistent segmentation quality, with reduced noise and fewer errors. These results suggest that temporal information significantly enhances the performance of segmentation models, offering a promising solution for improved crack detection and the long-term monitoring of concrete structures, even with limited sequential data.

[CV-40] Population estimation using 3D city modelling and Carto2S datasets – A case study

链接: https://arxiv.org/abs/2411.04612
作者: Jai G Singla
关键词-EN: resolution Digital Elevation, Digital Elevation Model, high resolution images, High resolution Digital, high resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the launch of Carto2S series of satellites, high resolution images (0.6-1.0 meters) are acquired and available for use. High resolution Digital Elevation Model (DEM) with better accuracies can be generated using C2S multi-view and multi date datasets. DEMs are further used as an input to derive Digital terrain models (DTMs) and to extract accurate heights of the objects (building and tree) over the surface of the Earth. Extracted building heights are validated with ground control points and can be used for generation of city modelling and resource estimation like population estimation, health planning, water and transport resource estimations. In this study, an attempt is made to assess the population of a township using high-resolution Indian remote sensing satellite datasets. We used Carto 2S multi-view data and generated a precise DEM and DTM over a city area. Using DEM and DTM datasets, accurate heights of the buildings are extracted which are further validated with ground data. Accurate building heights and high resolution imagery are used for generating accurate virtual 3D city model and assessing the number of floor and carpet area of the houses/ flats/ apartments. Population estimation of the area is made using derived information of no of houses/ flats/ apartments from the satellite datasets. Further, information about number of hospital and schools around the residential area is extracted from open street maps (OSM). Population estimation using satellite data and derived information from OSM datasets can prove to be very good tool for local administrator and decision makers.

[CV-41] Solar potential analysis over Indian cities using high-resolution satellite imagery and DEM

链接: https://arxiv.org/abs/2411.04610
作者: Jai Singla
关键词-EN: utilizing aerial imagery, performed utilizing aerial, satellite imagery, utilizing aerial, aerial imagery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most of the research work in the solar potential analysis is performed utilizing aerial imagery, LiDAR data, and satellite imagery. However, in the existing studies using satellite data, parameters such as trees/ vegetation shadow, adjacent higher architectural structures, and eccentric roof structures in urban areas were not considered, and relatively coarser-resolution datasets were used for analysis. In this work, we have implemented a novel approach to estimate rooftop solar potential using inputs of high-resolution satellite imagery (0.5 cm), a digital elevation model (1m), along with ground station radiation data. Solar radiation analysis is performed using the diffusion proportion and transmissivity ratio derived from the ground station data hosted by IMD. It was observed that due to seasonal variations, environmental effects and technical reasons such as solar panel structure etc., there can be a significant loss of electricity generation up to 50%. Based on the results, it is also understood that using 1m DEM and 50cm satellite imagery, more authentic results are produced over the urban areas.

[CV-42] Cross- and Intra-image Prototypical Learning for Multi-label Disease Diagnosis and Interpretation

链接: https://arxiv.org/abs/2411.04607
作者: Chong Wang,Fengbei Liu,Yuanhong Chen,Helen Frazer,Gustavo Carneiro
关键词-EN: shown remarkable potential, Recent advances, associating activation maps, class-specific training prototypes, prototypical learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in prototypical learning have shown remarkable potential to provide useful decision interpretations associating activation maps and predictions with class-specific training prototypes. Such prototypical learning has been well-studied for various single-label diseases, but for quite relevant and more challenging multi-label diagnosis, where multiple diseases are often concurrent within an image, existing prototypical learning models struggle to obtain meaningful activation maps and effective class prototypes due to the entanglement of the multiple diseases. In this paper, we present a novel Cross- and Intra-image Prototypical Learning (CIPL) framework, for accurate multi-label disease diagnosis and interpretation from medical images. CIPL takes advantage of common cross-image semantics to disentangle the multiple diseases when learning the prototypes, allowing a comprehensive understanding of complicated pathological lesions. Furthermore, we propose a new two-level alignment-based regularisation strategy that effectively leverages consistent intra-image information to enhance interpretation robustness and predictive performance. Extensive experiments show that our CIPL attains the state-of-the-art (SOTA) classification accuracy in two public multi-label benchmarks of disease diagnosis: thoracic radiography and fundus images. Quantitative interpretability results show that CIPL also has superiority in weakly-supervised thoracic disease localisation over other leading saliency- and prototype-based explanation methods.

[CV-43] Social EgoMesh Estimation

链接: https://arxiv.org/abs/2411.04598
作者: Luca Scofano,Alessio Sampieri,Edoardo De Matteis,Indro Spinelli,Fabio Galasso
关键词-EN: augmented reality applications, modeling human behavior, egocentric video sequences, Accurately estimating, reality applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurately estimating the 3D pose of the camera wearer in egocentric video sequences is crucial to modeling human behavior in virtual and augmented reality applications. The task presents unique challenges due to the limited visibility of the user’s body caused by the front-facing camera mounted on their head. Recent research has explored the utilization of the scene and ego-motion, but it has overlooked humans’ interactive nature. We propose a novel framework for Social Egocentric Estimation of body MEshes (SEE-ME). Our approach is the first to estimate the wearer’s mesh using only a latent probabilistic diffusion model, which we condition on the scene and, for the first time, on the social wearer-interactee interactions. Our in-depth study sheds light on when social interaction matters most for ego-mesh estimation; it quantifies the impact of interpersonal distance and gaze direction. Overall, SEE-ME surpasses the current best technique, reducing the pose estimation error (MPJPE) by 53%. The code is available at this https URL.

[CV-44] he Impact of Semi-Supervised Learning on Line Segment Detection

链接: https://arxiv.org/abs/2411.04596
作者: Johanna Engman,Karl Åström,Magnus Oskarsson
关键词-EN: paper we present, line segment detection, semi-supervised framework, consistency loss based, perturbed unlabeled images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 7 tables

点击查看摘要

Abstract:In this paper we present a method for line segment detection in images, based on a semi-supervised framework. Leveraging the use of a consistency loss based on differently augmented and perturbed unlabeled images with a small amount of labeled data, we show comparable results to fully supervised methods. This opens up application scenarios where annotation is difficult or expensive, and for domain specific adaptation of models. We are specifically interested in real-time and online applications, and investigate small and efficient learning backbones. Our method is to our knowledge the first to target line detection using modern state-of-the-art methodologies for semi-supervised learning. We test the method on both standard benchmarks and domain specific scenarios for forestry applications, showing the tractability of the proposed method.

[CV-45] PASSION for Dermatology: Bridging the Diversity Gap with Pigmented Skin Images from Sub-Saharan Africa MICCAI2024

链接: https://arxiv.org/abs/2411.04584
作者: Philippe Gottfrois,Fabian Gröger,Faly Herizo Andriambololoniaina,Ludovic Amruthalingam,Alvaro Gonzalez-Jimenez,Christophe Hsu,Agnes Kessy,Simone Lionetti,Daudi Mavura,Wingston Ng’ambi,Dingase Faith Ngongonda,Marc Pouly,Mendrika Fifaliana Rakotoarisaona,Fahafahantsoa Rapelanoro Rabenja,Ibrahima Traoré,Alexander A. Navarini
关键词-EN: Africa faces, shortage of dermatologists, million people, faces a huge, huge shortage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024

点击查看摘要

Abstract:Africa faces a huge shortage of dermatologists, with less than one per million people. This is in stark contrast to the high demand for dermatologic care, with 80% of the paediatric population suffering from largely untreated skin conditions. The integration of AI into healthcare sparks significant hope for treatment accessibility, especially through the development of AI-supported teledermatology. Current AI models are predominantly trained on white-skinned patients and do not generalize well enough to pigmented patients. The PASSION project aims to address this issue by collecting images of skin diseases in Sub-Saharan countries with the aim of open-sourcing this data. This dataset is the first of its kind, consisting of 1,653 patients for a total of 4,901 images. The images are representative of telemedicine settings and encompass the most common paediatric conditions: eczema, fungals, scabies, and impetigo. We also provide a baseline machine learning model trained on the dataset and a detailed performance analysis for the subpopulations represented in the dataset. The project website can be found at this https URL.

[CV-46] DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning NEURIPS2024

链接: https://arxiv.org/abs/2411.04571
作者: Yuxuan Duan,Yan Hong,Bo Zhang,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang,Li Niu,Liqing Zhang
关键词-EN: text prompt describing, recent progress, provide a text, text prompt, prompt describing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:The recent progress in text-to-image models pretrained on large-scale datasets has enabled us to generate various images as long as we provide a text prompt describing what we want. Nevertheless, the availability of these models is still limited when we expect to generate images that fall into a specific domain either hard to describe or just unseen to the models. In this work, we propose DomainGallery, a few-shot domain-driven image generation method which aims at finetuning pretrained Stable Diffusion on few-shot target datasets in an attribute-centric manner. Specifically, DomainGallery features prior attribute erasure, attribute disentanglement, regularization and enhancement. These techniques are tailored to few-shot domain-driven generation in order to solve key issues that previous works have failed to settle. Extensive experiments are given to validate the superior performance of DomainGallery on a variety of domain-driven generation scenarios. Codes are available at this https URL.

[CV-47] Neural Fingerprints for Adversarial Attack Detection

链接: https://arxiv.org/abs/2411.04533
作者: Haim Fisher,Moni Shahar,Yehezkel S. Resheff
关键词-EN: Deep learning models, Deep learning, recent years, standard tools, tools in recent
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 14 pages

点击查看摘要

Abstract:Deep learning models for image classification have become standard tools in recent years. A well known vulnerability of these models is their susceptibility to adversarial examples. These are generated by slightly altering an image of a certain class in a way that is imperceptible to humans but causes the model to classify it wrongly as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers (ii) directly detecting attacked images. Despite the good performance of these detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, they can overcome the detector by running many examples on a local copy, and sending only those that were not detected to the actual model. This problem is common in security applications where even a very good model is not sufficient to ensure safety. In this paper we propose to overcome this inherent limitation of any static defence with randomization. To do so, one must generate a very large family of detectors with consistent performance, and select one or more of them randomly for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network, and if their average is sufficiently different between clean and attacked images of the focal class they are considered a fingerprint and added to the detector bank. During test time, we sample fingerprints from the bank associated with the label predicted by the model, and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.

[CV-48] 0-Regularized Sparse Coding-based Interpretable Network for Multi-Modal Image Fusion

链接: https://arxiv.org/abs/2411.04519
作者: Gargi Panda,Soumitra Kundu,Saumik Bhattacharya,Aurobinda Routray
关键词-EN: improving visualization, MMIF task, MMIF tasks, multi-modal convolutional sparse, common features obtained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an l0-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, for solving the l0-regularized CSC problem, we develop an algorithm unrolling-based l0-regularized sparse coding (LZSC) block. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block and then these features are combined to generate the final fused image. Additionally, we propose an l0-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet’s training. Extensive experiments show that FNet achieves high-quality fusion results across five different MMIF tasks. Furthermore, we show that FNet enhances downstream object detection in visible-thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrates the good interpretability of our network.

[CV-49] Pose2Trajectory: Using Transformers on Body Pose to Predict Tennis Players Trajectory

链接: https://arxiv.org/abs/2411.04501
作者: Ali K. AlShami,Terrance Boult,Jugal Kalita
关键词-EN: player future trajectory, tennis player, operators in production, tennis, trajectory
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Tracking the trajectory of tennis players can help camera operators in production. Predicting future movement enables cameras to automatically track and predict a player’s future trajectory without human intervention. Predicting future human movement in the context of complex physical tasks is also intellectually satisfying. Swift advancements in sports analytics and the wide availability of videos for tennis have inspired us to propose a novel method called Pose2Trajectory, which predicts a tennis player’s future trajectory as a sequence derived from their body joints’ data and ball position. Demonstrating impressive accuracy, our approach capitalizes on body joint information to provide a comprehensive understanding of the human body’s geometry and motion, thereby enhancing the prediction of the player’s trajectory. We use encoder-decoder Transformer architecture trained on the joints and trajectory information of the players with ball positions. The predicted sequence can provide information to help close-up cameras to keep tracking the tennis player, following centroid coordinates. We generate a high-quality dataset from multiple videos to assist tennis player movement prediction using object detection and human pose estimation methods. It contains bounding boxes and joint information for tennis players and ball positions in singles tennis games. Our method shows promising results in predicting the tennis player’s movement trajectory with different sequence prediction lengths using the joints and trajectory information with the ball position.

[CV-50] Synergy-Guided Regional Supervision of Pseudo Labels for Semi-Supervised Medical Image Segmentation

链接: https://arxiv.org/abs/2411.04493
作者: Tao Wang,Xinlin Zhang,Yuanbin Chen,Yuanbo Zhou,Longxuan Zhao,Tao Tan,Tong Tong
关键词-EN: received considerable attention, leverage abundant unlabeled, Semi-supervised learning, enhance model robustness, received considerable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised learning has received considerable attention for its potential to leverage abundant unlabeled data to enhance model robustness. Pseudo labeling is a widely used strategy in semi supervised learning. However, existing methods often suffer from noise contamination, which can undermine model performance. To tackle this challenge, we introduce a novel Synergy-Guided Regional Supervision of Pseudo Labels (SGRS-Net) framework. Built upon the mean teacher network, we employ a Mix Augmentation module to enhance the unlabeled data. By evaluating the synergy before and after augmentation, we strategically partition the pseudo labels into distinct regions. Additionally, we introduce a Region Loss Evaluation module to assess the loss across each delineated area. Extensive experiments conducted on the LA dataset have demonstrated superior performance over state-of-the-art techniques, underscoring the efficiency and practicality of our framework.

[CV-51] CFPNet: Improving Lightweight ToF Depth Completion via Cross-zone Feature Propagation

链接: https://arxiv.org/abs/2411.04480
作者: Laiyan Ding,Hualie Jiang,Rui Xu,Rui Huang
关键词-EN: lightweight ToF sensors, low cost, zone area, attractive due, Depth completion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Depth completion using lightweight time-of-flight (ToF) depth sensors is attractive due to their low cost. However, lightweight ToF sensors usually have a limited field of view (FOV) compared with cameras. Thus, only pixels in the zone area of the image can be associated with depth signals. Previous methods fail to propagate depth features from the zone area to the outside-zone area effectively, thus suffering from degraded depth completion performance outside the zone. To this end, this paper proposes the CFPNet to achieve cross-zone feature propagation from the zone area to the outside-zone area with two novel modules. The first is a direct-attention-based propagation module (DAPM), which enforces direct cross-zone feature acquisition. The second is a large-kernel-based propagation module (LKPM), which realizes cross-zone feature propagation by utilizing convolution layers with kernel sizes up to 31. CFPNet achieves state-of-the-art (SOTA) depth completion performance by combining these two modules properly, as verified by extensive experimental results on the ZJU-L5 dataset. The code will be made public.

[CV-52] Deep Learning Models for UAV-Assisted Bridge Inspection: A YOLO Benchmark Analysis

链接: https://arxiv.org/abs/2411.04475
作者: Trong-Nhan Phan,Hoang-Hai Nguyen,Thi-Thu-Hien Ha,Huy-Tan Thai,Kim-Hung Le
关键词-EN: potential failures early, identify potential failures, Visual inspections, failures early, critical to ensure
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual inspections of bridges are critical to ensure their safety and identify potential failures early. This inspection process can be rapidly and accurately automated by using unmanned aerial vehicles (UAVs) integrated with deep learning models. However, choosing an appropriate model that is lightweight enough to integrate into the UAV and fulfills the strict requirements for inference time and accuracy is challenging. Therefore, our work contributes to the advancement of this model selection process by conducting a benchmark of 23 models belonging to the four newest YOLO variants (YOLOv5, YOLOv6, YOLOv7, YOLOv8) on COCO-Bridge-2021+, a dataset for bridge details detection. Through comprehensive benchmarking, we identify YOLOv8n, YOLOv7tiny, YOLOv6m, and YOLOv6m6 as the models offering an optimal balance between accuracy and processing speed, with mAP@50 scores of 0.803, 0.837, 0.853, and 0.872, and inference times of 5.3ms, 7.5ms, 14.06ms, and 39.33ms, respectively. Our findings accelerate the model selection process for UAVs, enabling more efficient and reliable bridge inspections.

[CV-53] FreeCap: Hybrid Calibration-Free Motion Capture in Open Environments

链接: https://arxiv.org/abs/2411.04469
作者: Aoru Xue,Yiming Ren,Zining Song,Mao Ye,Xinge Zhu,Yuexin Ma
关键词-EN: accurately capture global, hybrid calibration-free method, calibration-free method FreeCap, open environments, hybrid calibration-free
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a novel hybrid calibration-free method FreeCap to accurately capture global multi-person motions in open environments. Our system combines a single LiDAR with expandable moving cameras, allowing for flexible and precise motion estimation in a unified world coordinate. In particular, We introduce a local-to-global pose-aware cross-sensor human-matching module that predicts the alignment among each sensor, even in the absence of calibration. Additionally, our coarse-to-fine sensor-expandable pose optimizer further optimizes the 3D human key points and the alignments, it is also capable of incorporating additional cameras to enhance accuracy. Extensive experiments on Human-M3 and FreeMotion datasets demonstrate that our method significantly outperforms state-of-the-art single-modal methods, offering an expandable and efficient solution for multi-person motion capture across various applications.

[CV-54] Efficient single image non-uniformity correction algorithm

链接: https://arxiv.org/abs/2411.04457
作者: Yohann Tendero,Jerome Gilles,Stephane Landeau,Jean-Michel Morel
关键词-EN: uncooled infrared-type images, correct the non-uniformity, paper introduces, uncooled infrared-type, method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2411.03615

点击查看摘要

Abstract:This paper introduces a new way to correct the non-uniformity (NU) in uncooled infrared-type images. The main defect of these uncooled images is the lack of a column (resp. line) time-dependent cross-calibration, resulting in a strong column (resp. line) and time dependent noise. This problem can be considered as a 1D flicker of the columns inside each frame. Thus, classic movie deflickering algorithms can be adapted, to equalize the columns (resp. the lines). The proposed method therefore applies to the series formed by the columns of an infrared image a movie deflickering algorithm. The obtained single image method works on static images, and therefore requires no registration, no camera motion compensation, and no closed aperture sensor equalization. Thus, the method has only one camera dependent parameter, and is landscape independent. This simple method will be compared to a state of the art total variation single image correction on raw real and simulated images. The method is real time, requiring only two operations per pixel. It involves no test-pattern calibration and produces no “ghost artifacts”.

[CV-55] BendVLM: Test-Time Debiasing of Vision-Language Embeddings

链接: https://arxiv.org/abs/2411.04420
作者: Walter Gerych,Haoran Zhang,Kimia Hamidieh,Eileen Pan,Maanas Sharma,Thomas Hartvigsen,Marzyeh Ghassemi
关键词-EN: encode biases present, prescribe negative characteristics, Vision-language model, training data, gender identities
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making debiasing VLM embeddings crucial. Debiasing approaches that fine-tune the VLM often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically utilize a “one-size-fits-all” approach that assumes that correlation with the spurious attribute can be explained using a single linear direction across all possible inputs. In this work, we propose Bend-VLM, a nonlinear, fine-tuning-free approach for VLM embedding debiasing that tailors the debiasing operation to each unique input. This allows for a more flexible debiasing approach. Additionally, we do not require knowledge of the set of inputs a priori to inference time, making our method more appropriate for online, open-set tasks such as retrieval and text guided image generation.

[CV-56] Image Understanding Makes for A Good Tokenizer for Image Generation NEURIPS2024

链接: https://arxiv.org/abs/2411.04406
作者: Luting Wang,Yang Zhao,Zijian Zhang,Jiashi Feng,Si Liu,Bingyi Kang
关键词-EN: Abstract Modern image, Abstract Modern, Modern image generation, Modern image, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Abstract Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to project images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for image tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks. Notably, VQ-KD CLIP achieves 4.10 FID on ImageNet-1k (IN-1k). Visualization suggests that the superiority of VQ-KD can be partly attributed to the rich semantics within the VQ-KD codebook. We further introduce a straightforward pipeline to directly transform IU encoders into tokenizers, demonstrating exceptional effectiveness for IG tasks. These discoveries may energize further exploration into image tokenizer research and inspire the community to reassess the relationship between IU and IG. The code is released at this https URL.

[CV-57] ProGraph: Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction

链接: https://arxiv.org/abs/2411.04399
作者: Hongsheng Wang,Zehui Feng,Tong Xiao,Genfan Yang,Shengyu Zhang,Fei Wu,Feng Lin
关键词-EN: current reconstruction window, monocular videos rely, Graph Topological Modeling, Temporally-alignable Probability Guided, Guided Graph Topological
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current 3D human motion reconstruction methods from monocular videos rely on features within the current reconstruction window, leading to distortion and deformations in the human structure under local occlusions or blurriness in video frames. To estimate realistic 3D human mesh sequences based on incomplete features, we propose Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction (ProGraph). For missing parts recovery, we exploit the explicit topological-aware probability distribution across the entire motion sequence. To restore the complete human, Graph Topological Modeling (GTM) learns the underlying topological structure, focusing on the relationships inherent in the individual parts. Next, to generate blurred motion parts, Temporal-alignable Probability Distribution (TPDist) utilizes the GTM to predict features based on distribution. This interactive mechanism facilitates motion consistency, allowing the restoration of human parts. Furthermore, Hierarchical Human Loss (HHLoss) constrains the probability distribution errors of inter-frame features during topological structure variation. Our Method achieves superior results than other SOTA methods in addressing occlusions and blurriness on 3DPW.

[CV-58] MegaPortrait: Revisiting Diffusion Control for High-fidelity Portrait Generation

链接: https://arxiv.org/abs/2411.04357
作者: Han Yang,Sotiris Anagnostidis,Enis Simsar,Thomas Hofmann
关键词-EN: Net, Shading Net, Harmonization Net, Identity Net, Shading Net re-renders
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:We propose MegaPortrait. It’s an innovative system for creating personalized portrait images in computer vision. It has three modules: Identity Net, Shading Net, and Harmonization Net. Identity Net generates learned identity using a customized model fine-tuned with source images. Shading Net re-renders portraits using extracted representations. Harmonization Net fuses pasted faces and the reference image’s body for coherent results. Our approach with off-the-shelf Controlnets is better than state-of-the-art AI portrait products in identity preservation and image fidelity. MegaPortrait has a simple but effective design and we compare it with other methods and products to show its superiority.

[CV-59] LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers

链接: https://arxiv.org/abs/2411.04351
作者: Yeong-Seung Baek,Heung-Seon Oh
关键词-EN: locate relevant objects, aims to locate, locate relevant, based on natural, LiDAR point clouds
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:3D visual grounding (VG) aims to locate relevant objects or regions within 3D scenes based on natural language descriptions. Although recent methods for indoor 3D VG have successfully transformer-based architectures to capture global contextual information and enable fine-grained cross-modal fusion, they are unsuitable for outdoor environments due to differences in the distribution of point clouds between indoor and outdoor settings. Specifically, first, extensive LiDAR point clouds demand unacceptable computational and memory resources within transformers due to the high-dimensional visual features. Second, dominant background points and empty spaces in sparse LiDAR point clouds complicate cross-modal fusion owing to their irrelevant visual information. To address these challenges, we propose LidaRefer, a transformer-based 3D VG framework designed for large-scale outdoor scenes. Moreover, during training, we introduce a simple and effective localization method, which supervises the decoder’s queries to localize not only a target object but also ambiguous objects that might be confused as the target due to the exhibition of similar attributes in a scene or the incorrect understanding of a language description. This supervision enhances the model’s ability to distinguish ambiguous objects from a target by learning the differences in their spatial relationships and attributes. LidaRefer achieves state-of-the-art performance on Talk2Car-3D, a 3D VG dataset for autonomous driving, with significant improvements under various evaluation settings.

[CV-60] UEVAVD: A Dataset for Developing UAVs Eye View Active Object Detection

链接: https://arxiv.org/abs/2411.04348
作者: Xinhua Jiang,Tianpeng Liu,Li Liu,Zhen Liu,Yongxiang Liu
关键词-EN: UAV-based object detection, UAV AOD, UAV AOD problem, UAV AOD method, object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Occlusion is a longstanding difficulty that challenges the UAV-based object detection. Many works address this problem by adapting the detection model. However, few of them exploit that the UAV could fundamentally improve detection performance by changing its viewpoint. Active Object Detection (AOD) offers an effective way to achieve this purpose. Through Deep Reinforcement Learning (DRL), AOD endows the UAV with the ability of autonomous path planning to search for the observation that is more conducive to target identification. Unfortunately, there exists no available dataset for developing the UAV AOD method. To fill this gap, we released a UAV’s eye view active vision dataset named UEVAVD and hope it can facilitate research on the UAV AOD problem. Additionally, we improve the existing DRL-based AOD method by incorporating the inductive bias when learning the state representation. First, due to the partial observability, we use the gated recurrent unit to extract state representations from the observation sequence instead of the single-view observation. Second, we pre-decompose the scene with the Segment Anything Model (SAM) and filter out the irrelevant information with the derived masks. With these practices, the agent could learn an active viewing policy with better generalization capability. The effectiveness of our innovations is validated by the experiments on the UEVAVD dataset. Our dataset will soon be available at this https URL.

[CV-61] GazeGen: Gaze-Driven User Interaction for Visual Content Generation

链接: https://arxiv.org/abs/2411.04335
作者: He-Yen Hsieh,Ziyun Li,Sai Qian Zhang,Wei-Te Mark Ting,Kao-Den Chang,Barbara De Salvo,Chiao Liu,H. T. Kung
关键词-EN: DFT Gaze, gaze, visual content, visual content generation, DFT
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user’s eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface material changes of image objects, and converts static images into videos. Central to GazeGen is the DFT Gaze (Distilled and Fine-Tuned Gaze) agent, an ultra-lightweight model with only 281K parameters, performing accurate real-time gaze predictions tailored to individual users’ eyes on small edge devices. GazeGen is the first system to combine visual content generation with real-time gaze estimation, made possible exclusively by DFT Gaze. This real-time gaze estimation enables various visual content generation tasks, all controlled by the user’s gaze. The input for DFT Gaze is the user’s eye images, while the inputs for visual content generation are the user’s view and the predicted gaze point from DFT Gaze. To achieve efficient gaze predictions, we derive the small model from a large model (10x larger) via novel knowledge distillation and personal adaptation techniques. We integrate knowledge distillation with a masked autoencoder, developing a compact yet powerful gaze estimation model. This model is further fine-tuned with Adapters, enabling highly accurate and personalized gaze predictions with minimal user input. DFT Gaze ensures low-latency and precise gaze tracking, supporting a wide range of gaze-driven tasks. We validate the performance of DFT Gaze on AEA and OpenEDS2020 benchmarks, demonstrating low angular gaze error and low latency on the edge device (Raspberry Pi 4). Furthermore, we describe applications of GazeGen, illustrating its versatility and effectiveness in various usage scenarios.

[CV-62] HandCraft: Anatomically Correct Restoration of Malformed Hands in Diffusion Generated Images WACV2025

链接: https://arxiv.org/abs/2411.04332
作者: Zhenyue Qin,Yiqun Zhang,Yang Liu,Dylan Campbell
关键词-EN: Stable Diffusion, generate diverse, demonstrated a remarkable, remarkable ability, ability to generate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025

点击查看摘要

Abstract:Generative text-to-image models, such as Stable Diffusion, have demonstrated a remarkable ability to generate diverse, high-quality images. However, they are surprisingly inept when it comes to rendering human hands, which are often anatomically incorrect or reside in the “uncanny valley”. In this paper, we propose a method HandCraft for restoring such malformed hands. This is achieved by automatically constructing masks and depth images for hands as conditioning signals using a parametric model, allowing a diffusion-based image editor to fix the hand’s anatomy and adjust its pose while seamlessly integrating the changes into the original image, preserving pose, color, and style. Our plug-and-play hand restoration solution is compatible with existing pretrained diffusion models, and the restoration process facilitates adoption by eschewing any fine-tuning or training requirements for the diffusion models. We also contribute MalHand datasets that contain generated images with a wide variety of malformed hands in several styles for hand detector training and hand restoration benchmarking, and demonstrate through qualitative and quantitative evaluation that HandCraft not only restores anatomical correctness but also maintains the integrity of the overall image.

[CV-63] Increasing the scalability of graph convolution for FPGA-implemented event-based vision

链接: https://arxiv.org/abs/2411.04269
作者: Piotr Wzorek,Kamil Jeziorek,Tomasz Kryjak,Andrea Pinna
关键词-EN: frame-based vision sensors, traditional frame-based vision, vision sensors, mobile robotics, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for the PhD forum during FPT 2024 (International Conference on Field Programmable Technology), 10-12 December 2024, Sydney, Australia

点击查看摘要

Abstract:Event cameras are becoming increasingly popular as an alternative to traditional frame-based vision sensors, especially in mobile robotics. Taking full advantage of their high temporal resolution, high dynamic range, low power consumption and sparsity of event data, which only reflects changes in the observed scene, requires both an efficient algorithm and a specialised hardware platform. A recent trend involves using Graph Convolutional Neural Networks (GCNNs) implemented on a heterogeneous SoC FPGA. In this paper we focus on optimising hardware modules for graph convolution to allow flexible selection of the FPGA resource (BlockRAM, DSP and LUT) for their implementation. We propose a ‘‘two-step convolution’’ approach that utilises additional BRAM buffers in order to reduce up to 94% of LUT usage for multiplications. This method significantly improves the scalability of GCNNs, enabling the deployment of models with more layers, larger graphs sizes and their application for more dynamic scenarios.

[CV-64] Pose-Transformation and Radial Distance Clustering for Unsupervised Person Re-identification

链接: https://arxiv.org/abs/2411.04255
作者: Siddharth Seth,Akash Sonth,Anirban Chakraborty
关键词-EN: aims to tackle, non-overlapping cameras, tackle the problem, problem of matching, matching identities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Person re-identification (re-ID) aims to tackle the problem of matching identities across non-overlapping cameras. Supervised approaches require identity information that may be difficult to obtain and are inherently biased towards the dataset they are trained on, making them unscalable across domains. To overcome these challenges, we propose an unsupervised approach to the person re-ID setup. Having zero knowledge of true labels, our proposed method enhances the discriminating ability of the learned features via a novel two-stage training strategy. The first stage involves training a deep network on an expertly designed pose-transformed dataset obtained by generating multiple perturbations for each original image in the pose space. Next, the network learns to map similar features closer in the feature space using the proposed discriminative clustering algorithm. We introduce a novel radial distance loss, that attends to the fundamental aspects of feature learning - compact clusters with low intra-cluster and high inter-cluster variation. Extensive experiments on several large-scale re-ID datasets demonstrate the superiority of our method compared to state-of-the-art approaches.

[CV-65] PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing WACV2025

链接: https://arxiv.org/abs/2411.04249
作者: Siddharth Seth,Rishabh Dabral,Diogo Luvizon,Marc Habermann,Ming-Hsuan Yang,Christian Theobalt,Adam Kortylewski
关键词-EN: loose clothing, area of research, plausibly deform, deform to articulations, active area
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: WACV 2025

点击查看摘要

Abstract:Modeling a human avatar that can plausibly deform to articulations is an active area of research. We present PocoLoco – the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization and need to address the challenging problem of explicitly estimating correspondences for the deforming clothes. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing – important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at this https URL .

[CV-66] PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes

链接: https://arxiv.org/abs/2411.04227
作者: Kebin Peng,John Quarles,Kevin Desai
关键词-EN: monocular depth estimation, dynamic scenes theoretically, dynamic scenes, method for monocular, estimation in dynamic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel method for monocular depth estimation in dynamic scenes. We first explore the arbitrariness of object’s movement trajectory in dynamic scenes theoretically. To overcome the arbitrariness, we use assume that points move along a straight line over short distances and then summarize it as a triangular constraint loss in two dimensional Euclidean space. To overcome the depth inconsistency problem around the edges, we propose a deformable support window module that learns features from different shapes of objects, making depth value more accurate around edge area. The proposed model is trained and tested on two outdoor datasets - KITTI and Make3D, as well as an indoor dataset - NYU Depth V2. The quantitative and qualitative results reported on these datasets demonstrate the success of our proposed model when compared against other approaches. Ablation study results on the KITTI dataset also validate the effectiveness of the proposed pixel movement prediction module as well as the deformable support window module.

[CV-67] Differentiable Gaussian Representation for Incomplete CT Reconstruction

链接: https://arxiv.org/abs/2411.04844
作者: Shaokai Wu,Yuxiang Lu,Wei Ji,Suizhi Huang,Fengyu Yang,Shalayiding Sirejiding,Qichen He,Jing Tong,Yanbiao Ji,Yue Ding,Hongtao Lu
关键词-EN: Incomplete Computed Tomography, Computed Tomography, reducing radiation exposure, Incomplete Computed, benefits patients
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Incomplete Computed Tomography (CT) benefits patients by reducing radiation exposure. However, reconstructing high-fidelity images from limited views or angles remains challenging due to the ill-posed nature of the problem. Deep Learning Reconstruction (DLR) methods have shown promise in enhancing image quality, but the paradox between training data diversity and high generalization ability remains unsolved. In this paper, we propose a novel Gaussian Representation for Incomplete CT Reconstruction (GRCT) without the usage of any neural networks or full-dose CT data. Specifically, we model the 3D volume as a set of learnable Gaussians, which are optimized directly from the incomplete sinogram. Our method can be applied to multiple views and angles without changing the architecture. Additionally, we propose a differentiable Fast CT Reconstruction method for efficient clinical usage. Extensive experiments on multiple datasets and settings demonstrate significant improvements in reconstruction quality metrics and high efficiency. We plan to release our code as open-source.

[CV-68] An Effective Pipeline for Whole-Slide Image Glomerulus Segmentation

链接: https://arxiv.org/abs/2411.04782
作者: Quan Huu Cap
关键词-EN: diagnosing kidney diseases, accurately diagnosing kidney, Whole-slide images, kidney diseases, essential for accurately
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Whole-slide images (WSI) glomerulus segmentation is essential for accurately diagnosing kidney diseases. In this work, we propose a practical pipeline for glomerulus segmentation that effectively enhances both patch-level and WSI-level segmentation tasks. Our approach leverages stitching on overlapping patches, increasing the detection coverage, especially when glomeruli are located near patch image borders. In addition, we conduct comprehensive evaluations from different segmentation models across two large and diverse datasets with over 30K glomerulus annotations. Experimental results demonstrate that models using our pipeline outperform the previous state-of-the-art method, achieving superior results across both datasets and setting a new benchmark for glomerulus segmentation in WSIs. The code and pre-trained models are available at this https URL.

[CV-69] xLiverNet: Leveraging Medical Knowledge and Spatial-Frequency Perception for Enhanced Liver Tumor Segmentation

链接: https://arxiv.org/abs/2411.04595
作者: Xiaoyan Jiang,Zhi Zhou,Hailing Wang,Guozhong Wang,Zhijun Fang
关键词-EN: Integrating textual data, enhancing diagnostic accuracy, Integrating textual, diagnostic accuracy, textual data
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Integrating textual data with imaging in liver tumor segmentation is essential for enhancing diagnostic accuracy. However, current multi-modal medical datasets offer only general text annotations, lacking lesion-specific details critical for extracting nuanced features, especially for fine-grained segmentation of tumor boundaries and small lesions. To address these limitations, we developed datasets with lesion-specific text annotations for liver tumors and introduced the TexLiverNet model. TexLiverNet employs an agent-based cross-attention module that integrates text features efficiently with visual features, significantly reducing computational costs. Additionally, enhanced spatial and adaptive frequency domain perception is proposed to precisely delineate lesion boundaries, reduce background interference, and recover fine details in small lesions. Comprehensive evaluations on public and private datasets demonstrate that TexLiverNet achieves superior performance compared to current state-of-the-art methods.

[CV-70] Properties of BV-G structures textures decomposition models. Application to road detection in satellite images

链接: https://arxiv.org/abs/2411.04456
作者: Jerome Gilles,Yves Meyer
关键词-EN: structures-textures image decomposition, image decomposition model, paper we present, present some theoretical, theoretical results
类目: Functional Analysis (math.FA); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper we present some theoretical results about a structures-textures image decomposition model which was proposed by the second author. We prove a theorem which gives the behavior of this model in different cases. Finally, as a consequence of the theorem we derive an algorithm for the detection of long and thin objects applied to a road networks detection application in aerial or satellite images.

[CV-71] Enhancing Bronchoscopy Depth Estimation through Synthetic-to-Real Domain Adaptation

链接: https://arxiv.org/abs/2411.04404
作者: Qingyao Tian,Huai Liao,Xinyan Huang,Lujie Li,Hongbin Liu
关键词-EN: general imaging tasks, Monocular depth estimation, imaging tasks, aiding in localization, Monocular depth
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monocular depth estimation has shown promise in general imaging tasks, aiding in localization and 3D reconstruction. While effective in various domains, its application to bronchoscopic images is hindered by the lack of labeled data, challenging the use of supervised learning methods. In this work, we propose a transfer learning framework that leverages synthetic data with depth labels for training and adapts domain knowledge for accurate depth estimation in real bronchoscope data. Our network demonstrates improved depth prediction on real footage using domain adaptation compared to training solely on synthetic data, validating our approach.

[CV-72] MINDSETS: Multi-omics Integration with Neuroimaging for Dementia Subtyping and Effective Temporal Study

链接: https://arxiv.org/abs/2411.04155
作者: Salma Hassan,Dawlat Akaila,Maryam Arjemandi,Vijay Papineni,Mohammad Yaqub
关键词-EN: presenting entangled symptoms, distinct treatment approaches, requiring distinct treatment, Alzheimer disease, prevalent dementia types
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the complex realm of cognitive disorders, Alzheimer’s disease (AD) and vascular dementia (VaD) are the two most prevalent dementia types, presenting entangled symptoms yet requiring distinct treatment approaches. The crux of effective treatment in slowing neurodegeneration lies in early, accurate diagnosis, as this significantly assists doctors in determining the appropriate course of action. However, current diagnostic practices often delay VaD diagnosis, impeding timely intervention and adversely affecting patient prognosis. This paper presents an innovative multi-omics approach to accurately differentiate AD from VaD, achieving a diagnostic accuracy of 89.25%. The proposed method segments the longitudinal MRI scans and extracts advanced radiomics features. Subsequently, it synergistically integrates the radiomics features with an ensemble of clinical, cognitive, and genetic data to provide state-of-the-art diagnostic accuracy, setting a new benchmark in classification accuracy on a large public dataset. The paper’s primary contribution is proposing a comprehensive methodology utilizing multi-omics data to provide a nuanced understanding of dementia subtypes. Additionally, the paper introduces an interpretable model to enhance clinical decision-making coupled with a novel model architecture for evaluating treatment efficacy. These advancements lay the groundwork for future work not only aimed at improving differential diagnosis but also mitigating and preventing the progression of dementia.

[CV-73] Urban Flood Mapping Using Satellite Synthetic Aperture Radar Data: A Review of Characteristics Approaches and Datasets

链接: https://arxiv.org/abs/2411.04153
作者: Jie Zhao,Ming Li,Yu Li,Patrick Matgen,Marco Chini
关键词-EN: assessing building damage, urban flood mapping, Understanding the extent, Synthetic Aperture Radar, urban flood
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Geoscience and Remote Sensing Magazine

点击查看摘要

Abstract:Understanding the extent of urban flooding is crucial for assessing building damage, casualties and economic losses. Synthetic Aperture Radar (SAR) technology offers significant advantages for mapping flooded urban areas due to its ability to collect data regardless weather and solar illumination conditions. However, the wide range of existing methods makes it difficult to choose the best approach for a specific situation and to identify future research directions. Therefore, this study provides a comprehensive review of current research on urban flood mapping using SAR data, summarizing key characteristics of floodwater in SAR images and outlining various approaches from scientific articles. Additionally, we provide a brief overview of the advantages and disadvantages of each method category, along with guidance on selecting the most suitable approach for different scenarios. This study focuses on the challenges and advancements in SAR-based urban flood mapping. It specifically addresses the limitations of spatial and temporal resolution in SAR data and discusses the essential pre-processing steps. Moreover, the article explores the potential benefits of Polarimetric SAR (PolSAR) techniques and uncertainty analysis for future research. Furthermore, it highlights a lack of open-access SAR datasets for urban flood mapping, hindering development in advanced deep learning-based methods. Besides, we evaluated the Technology Readiness Levels (TRLs) of urban flood mapping techniques to identify challenges and future research areas. Finally, the study explores the practical applications of SAR-based urban flood mapping in both the private and public sectors and provides a comprehensive overview of the benefits and potential impact of these methods.

机器学习

[LG-0] DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

链接: https://arxiv.org/abs/2411.04999
作者: Peiqi Liu,Zhanqiu Guo,Mohit Warke,Soumith Chintala,Chris Paxton,Nur Muhammad Mahi Shafiullah,Lerrel Pinto
关键词-EN: natural language description, Significant progress, language description, perform tasks, natural language
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system’s applicability in real-world scenarios where environments frequently change due to human intervention or the robot’s own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot’s environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: this https URL

[LG-1] Which bits went where? Past and future transfer entropy decomposition with the information bottleneck NEURIPS2024

链接: https://arxiv.org/abs/2411.04992
作者: Kieran A. Murphy,Zhuowen Yin,Dani S. Bassett
关键词-EN: transfer entropy measures, transfer entropy, shoal of fish, collection of neurons, causal relationships
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: NeurIPS 2024 workshop “Machine learning and the physical sciences” Camera ready

点击查看摘要

Abstract:Whether the system under study is a shoal of fish, a collection of neurons, or a set of interacting atmospheric and oceanic processes, transfer entropy measures the flow of information between time series and can detect possible causal relationships. Much like mutual information, transfer entropy is generally reported as a single value summarizing an amount of shared variation, yet a more fine-grained accounting might illuminate much about the processes under study. Here we propose to decompose transfer entropy and localize the bits of variation on both sides of information flow: that of the originating process’s past and that of the receiving process’s future. We employ the information bottleneck (IB) to compress the time series and identify the transferred entropy. We apply our method to decompose the transfer entropy in several synthetic recurrent processes and an experimental mouse dataset of concurrent behavioral and neural activity. Our approach highlights the nuanced dynamics within information flow, laying a foundation for future explorations into the intricate interplay of temporal processes in complex systems.

[LG-2] Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games

链接: https://arxiv.org/abs/2411.04976
作者: Usman Anwar,Ashish Pandian,Jia Wan,David Krueger,Jakob Foerster
关键词-EN: NZSC, reinforcement learning, ZSC, studying the ability, ability of reinforcement
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-shot coordination (ZSC) is a popular setting for studying the ability of reinforcement learning (RL) agents to coordinate with novel partners. Prior ZSC formulations assume the \textitproblem setting is common knowledge: each agent knows the underlying Dec-POMDP, knows others have this knowledge, and so on ad infinitum. However, this assumption rarely holds in complex real-world settings, which are often difficult to fully and correctly specify. Hence, in settings where this common knowledge assumption is invalid, agents trained using ZSC methods may not be able to coordinate well. To address this limitation, we formulate the \textitnoisy zero-shot coordination (NZSC) problem. In NZSC, agents observe different noisy versions of the ground truth Dec-POMDP, which are assumed to be distributed according to a fixed noise model. Only the distribution of ground truth Dec-POMDPs and the noise model are common knowledge. We show that a NZSC problem can be reduced to a ZSC problem by designing a meta-Dec-POMDP with an augmented state space consisting of all the ground-truth Dec-POMDPs. For solving NZSC problems, we propose a simple and flexible meta-learning method called NZSC training, in which the agents are trained across a distribution of coordination problems - which they only get to observe noisy versions of. We show that with NZSC training, RL agents can be trained to coordinate well with novel partners even when the (exact) problem setting of the coordination is not common knowledge.

[LG-3] Fed-LDR: Federated Local Data-infused Graph Creation with Node-centric Model Refinement

链接: https://arxiv.org/abs/2411.04936
作者: Jiechao Gao,Yuangang Li,Syeda Faiza Ahmed
关键词-EN: enhancing urban infrastructure, Node-centric Model Refinement, Local Data-Infused Graph, Data-Infused Graph Creation, infrastructure and services
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The rapid acceleration of global urbanization has introduced novel challenges in enhancing urban infrastructure and services. Spatio-temporal data, integrating spatial and temporal dimensions, has emerged as a critical tool for understanding urban phenomena and promoting sustainability. In this context, Federated Learning (FL) has gained prominence as a distributed learning paradigm aligned with the privacy requirements of urban IoT environments. However, integrating traditional and deep learning models into the FL framework poses significant challenges, particularly in capturing complex spatio-temporal dependencies and adapting to diverse urban conditions. To address these challenges, we propose the Federated Local Data-Infused Graph Creation with Node-centric Model Refinement (Fed-LDR) algorithm. Fed-LDR leverages FL and Graph Convolutional Networks (GCN) to enhance spatio-temporal data analysis in urban environments. The algorithm comprises two key modules: (1) the Local Data-Infused Graph Creation (LDIGC) module, which dynamically reconfigures adjacency matrices to reflect evolving spatial relationships within urban environments, and (2) the Node-centric Model Refinement (NoMoR) module, which customizes model parameters for individual urban nodes to accommodate heterogeneity. Evaluations on the PeMSD4 and PeMSD8 datasets demonstrate Fed-LDR’s superior performance over six baseline methods. Fed-LDR achieved the lowest Mean Absolute Error (MAE) values of 20.15 and 17.30, and the lowest Root Mean Square Error (RMSE) values of 32.30 and 27.15, respectively, while maintaining a high correlation coefficient of 0.96 across both datasets. Notably, on the PeMSD4 dataset, Fed-LDR reduced MAE and RMSE by up to 81% and 78%, respectively, compared to the best-performing baseline FedMedian.

[LG-4] Structure Matters: Dynamic Policy Gradient

链接: https://arxiv.org/abs/2411.04913
作者: Sara Klein,Xiangyuan Zhang,Tamer Başar,Simon Weissmann,Leif Döring
关键词-EN: Markov decision processes, tabular Markov decision, Markov decision, infinite-horizon tabular Markov, framework called dynamic
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注: 46 pages, 4 figures

点击查看摘要

Abstract:In this work, we study \gamma -discounted infinite-horizon tabular Markov decision processes (MDPs) and introduce a framework called dynamic policy gradient (DynPG). The framework directly integrates dynamic programming with (any) policy gradient method, explicitly leveraging the Markovian property of the environment. DynPG dynamically adjusts the problem horizon during training, decomposing the original infinite-horizon MDP into a sequence of contextual bandit problems. By iteratively solving these contextual bandits, DynPG converges to the stationary optimal policy of the infinite-horizon MDP. To demonstrate the power of DynPG, we establish its non-asymptotic global convergence rate under the tabular softmax parametrization, focusing on the dependencies on salient but essential parameters of the MDP. By combining classical arguments from dynamic programming with more recent convergence arguments of policy gradient schemes, we prove that softmax DynPG scales polynomially in the effective horizon (1-\gamma)^-1 . Our findings contrast recent exponential lower bound examples for vanilla policy gradient.

[LG-5] Enhancing Missing Data Imputation through Combined Bipartite Graph and Complete Directed Graph

链接: https://arxiv.org/abs/2411.04907
作者: Zhaoyang Zhang,Hongtu Zhu,Ziqi Chen,Yingjie Zhang,Hai Shu
关键词-EN: Graph Neural Network, Complete Directed Graph, Directed Graph Neural, identifying and leveraging, aim to address
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we aim to address a significant challenge in the field of missing data imputation: identifying and leveraging the interdependencies among features to enhance missing data imputation for tabular data. We introduce a novel framework named the Bipartite and Complete Directed Graph Neural Network (BCGNN). Within BCGNN, observations and features are differentiated as two distinct node types, and the values of observed features are converted into attributed edges linking them. The bipartite segment of our framework inductively learns embedding representations for nodes, efficiently utilizing the comprehensive information encapsulated in the attributed edges. In parallel, the complete directed graph segment adeptly outlines and communicates the complex interdependencies among features. When compared to contemporary leading imputation methodologies, BCGNN consistently outperforms them, achieving a noteworthy average reduction of 15% in mean absolute error for feature imputation tasks under different missing mechanisms. Our extensive experimental investigation confirms that an in-depth grasp of the interdependence structure substantially enhances the model’s feature embedding ability. We also highlight the model’s superior performance in label prediction tasks involving missing data, and its formidable ability to generalize to unseen data points.

[LG-6] Sampling-guided Heterogeneous Graph Neural Network with Temporal Smoothing for Scalable Longitudinal Data Imputation

链接: https://arxiv.org/abs/2411.04899
作者: Zhaoyang Zhang,Ziqi Chen,Qiao Liu,Jinhan Xie,Hongtu Zhu
关键词-EN: Graph Neural Network, Sampling-guided Heterogeneous Graph, Heterogeneous Graph Neural, Neural Network, Sampling-guided Heterogeneous
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel framework, the Sampling-guided Heterogeneous Graph Neural Network (SHT-GNN), to effectively tackle the challenge of missing data imputation in longitudinal studies. Unlike traditional methods, which often require extensive preprocessing to handle irregular or inconsistent missing data, our approach accommodates arbitrary missing data patterns while maintaining computational efficiency. SHT-GNN models both observations and covariates as distinct node types, connecting observation nodes at successive time points through subject-specific longitudinal subnetworks, while covariate-observation interactions are represented by attributed edges within bipartite graphs. By leveraging subject-wise mini-batch sampling and a multi-layer temporal smoothing mechanism, SHT-GNN efficiently scales to large datasets, while effectively learning node representations and imputing missing data. Extensive experiments on both synthetic and real-world datasets, including the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, demonstrate that SHT-GNN significantly outperforms existing imputation methods, even with high missing data rates. The empirical results highlight SHT-GNN’s robust imputation capabilities and superior performance, particularly in the context of complex, large-scale longitudinal data.

[LG-7] Non-Euclidean Mixture Model for Social Network Embedding

链接: https://arxiv.org/abs/2411.04876
作者: Roshni G. Iyer,Yewen Wang,Wei Wang,Yizhou Sun
关键词-EN: largely agreed, model, mixture model, Non-Euclidean, link generation
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It is largely agreed that social network links are formed due to either homophily or social influence. Inspired by this, we aim at understanding the generation of links via providing a novel embedding-based graph formation model. Different from existing graph representation learning, where link generation probabilities are defined as a simple function of the corresponding node embeddings, we model the link generation as a mixture model of the two factors. In addition, we model the homophily factor in spherical space and the influence factor in hyperbolic space to accommodate the fact that (1) homophily results in cycles and (2) influence results in hierarchies in networks. We also design a special projection to align these two spaces. We call this model Non-Euclidean Mixture Model, i.e., NMM. We further integrate NMM with our non-Euclidean graph variational autoencoder (VAE) framework, NMM-GNN. NMM-GNN learns embeddings through a unified framework which uses non-Euclidean GNN encoders, non-Euclidean Gaussian priors, a non-Euclidean decoder, and a novel space unification loss component to unify distinct non-Euclidean geometric spaces. Experiments on public datasets show NMM-GNN significantly outperforms state-of-the-art baselines on social network generation and classification tasks, demonstrating its ability to better explain how the social network is formed.

[LG-8] OneProt: Towards Multi-Modal Protein Foundation Models

链接: https://arxiv.org/abs/2411.04863
作者: Klemens Flöge,Srisruthi Udayakumar,Johanna Sommer,Marie Piraud,Stefan Kesselheim,Vincent Fortuin,Stephan Günneman,Karel J van der Weg,Holger Gohlke,Alina Bazarova,Erinc Merdivan
关键词-EN: translate diverse information, Recent AI advances, enabled multi-modal systems, diverse information spaces, advances have enabled
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 28 pages, 15 figures, 7 tables

点击查看摘要

Abstract:Recent AI advances have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of modality encoders along protein sequences. It demonstrates strong performance in retrieval tasks and surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.

[LG-9] Clinicians Voice: Fundamental Considerations for XAI in Healthcare

链接: https://arxiv.org/abs/2411.04855
作者: T. E. Röber,R. Goedhart,S. İ. Birbil
关键词-EN: holds the promise, promise of advancing, advancing the implementation, implementation and adoption, high-stakes environments
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) holds the promise of advancing the implementation and adoption of AI-based tools in practice, especially in high-stakes environments like healthcare. However, most of the current research is disconnected from its practical applications and lacks input of end users. To address this, we conducted semi-structured interviews with clinicians to discuss their thoughts, hopes, and concerns. We find that clinicians generally think positively about developing AI-based tools for clinical practice, but they have concerns about how these will fit into their workflow and how it will impact clinician-patient relations. We further identify education of clinicians on AI as a crucial factor for the success of AI in healthcare and highlight aspects clinicians are looking for in (X)AI-based tools. In contrast to other studies, we take on a holistic and exploratory perspective to identify general requirements, which is necessary before moving on to testing specific (X)AI products for healthcare.

[LG-10] Learning in Budgeted Auctions with Spacing Objectives

链接: https://arxiv.org/abs/2411.04843
作者: Giannis Fikioris,Robert Kleinberg,Yoav Kolumbus,Raunak Kumar,Yishay Mansour,Éva Tardos
关键词-EN: participants care, winnings are distributed, repeated auction settings, time, repeated auction
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many repeated auction settings, participants care not only about how frequently they win but also how their winnings are distributed over time. This problem arises in various practical domains where avoiding congested demand is crucial, such as online retail sales and compute services, as well as in advertising campaigns that require sustained visibility over time. We introduce a simple model of this phenomenon, modeling it as a budgeted auction where the value of a win is a concave function of the time since the last win. This implies that for a given number of wins, even spacing over time is optimal. We also extend our model and results to the case when not all wins result in “conversions” (realization of actual gains), and the probability of conversion depends on a context. The goal is to maximize and evenly space conversions rather than just wins. We study the optimal policies for this setting in second-price auctions and offer learning algorithms for the bidders that achieve low regret against the optimal bidding policy in a Bayesian online setting. Our main result is a computationally efficient online learning algorithm that achieves \tilde O(\sqrt T) regret. We achieve this by showing that an infinite-horizon Markov decision process (MDP) with the budget constraint in expectation is essentially equivalent to our problem, even when limiting that MDP to a very small number of states. The algorithm achieves low regret by learning a bidding policy that chooses bids as a function of the context and the system’s state, which will be the time elapsed since the last win (or conversion). We show that state-independent strategies incur linear regret even without uncertainty of conversions. We complement this by showing that there are state-independent strategies that, while still having linear regret, achieve a (1-\frac 1 e) approximation to the optimal reward. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2411.04843 [cs.GT] (or arXiv:2411.04843v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2411.04843 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-11] A Simple Packing Algorithm for Optimized Mapping of Artificial Neural Networks onto Non-Volatile Memory Cross-Bar Arrays

链接: https://arxiv.org/abs/2411.04814
作者: W. Haensch
关键词-EN: improve computing efficiency, Neuromorphic computing, improve computing, computing efficiency, machine learning
类目: Machine Learning (cs.LG)
*备注: 24 pages, 10 figures

点击查看摘要

Abstract:Neuromorphic computing with crossbar arrays has emerged as a promising alternative to improve computing efficiency for machine learning. Previous work has focused on implementing crossbar arrays to perform basic mathematical operations. However, in this paper, we explore the impact of mapping the layers of an artificial neural network onto physical cross-bar arrays arranged in tiles across a chip. We have developed a simplified mapping algorithm to determine the number of physical tiles, with fixed optimal array dimensions, and to estimate the minimum area occupied by these tiles for a given design objective. This simplified algorithm is compared with conventional binary linear optimization, which solves the equivalent bin-packing problem. We have found that the optimum solution is not necessarily related to the minimum number of tiles; rather, it is shown to be an interaction between tile array capacity and the scaling properties of its peripheral circuits. Additionally, we have discovered that square arrays are not always the best choice for optimal mapping, and that performance optimization comes at the cost of total tile area

[LG-12] Soft Hoeffding Tree: A Transparent and Differentiable Model on Data Streams

链接: https://arxiv.org/abs/2411.04812
作者: Kirsten Köbschall,Lisa Hartung,Stefan Kramer
关键词-EN: Hoeffding trees, soft Hoeffding trees, propose soft Hoeffding, Hoeffding, changing data streams
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose soft Hoeffding trees (SoHoT) as a new differentiable and transparent model for possibly infinite and changing data streams. Stream mining algorithms such as Hoeffding trees grow based on the incoming data stream, but they currently lack the adaptability of end-to-end deep learning systems. End-to-end learning can be desirable if a feature representation is learned by a neural network and used in a tree, or if the outputs of trees are further processed in a deep learning model or workflow. Different from Hoeffding trees, soft trees can be integrated into such systems due to their differentiability, but are neither transparent nor explainable. Our novel model combines the extensibility and transparency of Hoeffding trees with the differentiability of soft trees. We introduce a new gating function to regulate the balance between univariate and multivariate splits in the tree. Experiments are performed on 20 data streams, comparing SoHoT to standard Hoeffding trees, Hoeffding trees with limited complexity, and soft trees applying a sparse activation function for sample routing. The results show that soft Hoeffding trees outperform Hoeffding trees in estimating class probabilities and, at the same time, maintain transparency compared to soft trees, with relatively small losses in terms of AUROC and cross-entropy. We also demonstrate how to trade off transparency against performance using a hyperparameter, obtaining univariate splits at one end of the spectrum and multivariate splits at the other.

[LG-13] Learn to Solve Vehicle Routing Problems ASAP: A Neural Optimization Approach for Time-Constrained Vehicle Routing Problems with Finite Vehicle Fleet

链接: https://arxiv.org/abs/2411.04777
作者: Elija Deineko,Carina Kehrt
关键词-EN: Vehicle Routing Problem, Routing Problem, Finding a feasible, seamless logistics, Vehicle Routing
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: Affiliation: German Aerospace Center (DLR), Institute of Transport Research, Rudower Chaussee 7, 12489 Berlin Correspondence: this http URL @dlr.de

点击查看摘要

Abstract:Finding a feasible and prompt solution to the Vehicle Routing Problem (VRP) is a prerequisite for efficient freight transportation, seamless logistics, and sustainable mobility. Traditional optimization methods reach their limits when confronted with the real-world complexity of VRPs, which involve numerous constraints and objectives. Recently, the ability of generative Artificial Intelligence (AI) to solve combinatorial tasks, known as Neural Combinatorial Optimization (NCO), demonstrated promising results, offering new perspectives. In this study, we propose an NCO approach to solve a time-constrained capacitated VRP with a finite vehicle fleet size. The approach is based on an encoder-decoder architecture, formulated in line with the Policy Optimization with Multiple Optima (POMO) protocol and trained via a Proximal Policy Optimization (PPO) algorithm. We successfully trained the policy with multiple objectives (minimizing the total distance while maximizing vehicle utilization) and evaluated it on medium and large instances, benchmarking it against state-of-the-art heuristics. The method is able to find adequate and cost-efficient solutions, showing both flexibility and robust generalization. Finally, we provide a critical analysis of the solution generated by NCO and discuss the challenges and opportunities of this new branch of intelligent learning algorithms emerging in optimization science, focusing on freight transportation.

[LG-14] Mining the Minoria: Unknown Under-represented and Under-performing Minority Groups VLDB2025

链接: https://arxiv.org/abs/2411.04761
作者: Mohsen Dehghankar,Abolfazl Asudeh
关键词-EN: grouping information required, variety of reasons, wild often misses, required for identifying, grouping information
类目: Machine Learning (cs.LG)
*备注: This paper is currently under review at VLDB 2025

点击查看摘要

Abstract:Due to a variety of reasons, such as privacy, data in the wild often misses the grouping information required for identifying minorities. On the other hand, it is known that machine learning models are only as good as the data they are trained on and, hence, may underperform for the under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists who find themselves in an unknown-unknown situation, where not only do they not have access to the grouping attributes but do not also know what groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically speaking, we propose a geometric transformation of data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to the higher dimensions is cursed by dimensionality. Therefore, we propose a solution based on smart exploration of the search space for such cases. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. Our experiment results demonstrate the effectiveness of our proposed solutions in mining the unknown, under-represented, and under-performing minorities. Comments: This paper is currently under review at VLDB 2025 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2411.04761 [cs.LG] (or arXiv:2411.04761v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2411.04761 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-15] Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

链接: https://arxiv.org/abs/2411.04760
作者: Sanja Karilanova,Maxime Fabre,Emre Neftci,Ayça Özçelikkale
关键词-EN: Spiking Neural Networks, deep neural networks, biologically-inspired deep neural, Neural Networks, Spiking Neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are biologically-inspired deep neural networks that efficiently extract temporal information while offering promising gains in terms of energy efficiency and latency when deployed on neuromorphic devices. However, SNN model parameters are sensitive to temporal resolution, leading to significant performance drops when the temporal resolution of target data at the edge is not the same with that of the pre-deployment source data used for training, especially when fine-tuning is not possible at the edge. To address this challenge, we propose three novel domain adaptation methods for adapting neuron parameters to account for the change in time resolution without re-training on target time-resolution. The proposed methods are based on a mapping between neuron dynamics in SNNs and State Space Models (SSMs); and are applicable to general neuron models. We evaluate the proposed methods under spatio-temporal data tasks, namely the audio keyword spotting datasets SHD and MSWC as well as the image classification NMINST dataset. Our methods provide an alternative to - and in majority of the cases significantly outperform - the existing reference method that simply scales the time constant. Moreover, our results show that high accuracy on high temporal resolution data can be obtained by time efficient training on lower temporal resolution data and model adaptation.

[LG-16] Respecting the limit:Bayesian optimization with a bound on the optimal value

链接: https://arxiv.org/abs/2411.04744
作者: Hanyang Wang,Juergen Branke,Matthias Poloczek
关键词-EN: real-world optimization problems, prior information, Bayesian optimization, optimization problems, surrogate model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many real-world optimization problems, we have prior information about what objective function values are achievable. In this paper, we study the scenario that we have either exact knowledge of the minimum value or a, possibly inexact, lower bound on its value. We propose bound-aware Bayesian optimization (BABO), a Bayesian optimization method that uses a new surrogate model and acquisition function to utilize such prior information. We present SlogGP, a new surrogate model that incorporates bound information and adapts the Expected Improvement (EI) acquisition function accordingly. Empirical results on a variety of benchmarks demonstrate the benefit of taking prior information about the optimal value into account, and that the proposed approach significantly outperforms existing techniques. Furthermore, we notice that even in the absence of prior information on the bound, the proposed SlogGP surrogate model still performs better than the standard GP model in most cases, which we explain by its larger expressiveness.

[LG-17] Neuromorphic Wireless Split Computing with Multi-Level Spikes

链接: https://arxiv.org/abs/2411.04728
作者: Dengyu Wu,Jiechen Chen,Bipin Rajendran,H. Vincent Poor,Osvaldo Simeone
关键词-EN: offering significant efficiency, involving sequential data, workloads involving sequential, perform inference tasks, Inspired by biological
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Inspired by biological processes, neuromorphic computing utilizes spiking neural networks (SNNs) to perform inference tasks, offering significant efficiency gains for workloads involving sequential data. Recent advances in hardware and software have demonstrated that embedding a few bits of payload in each spike exchanged between the spiking neurons can further enhance inference accuracy. In a split computing architecture, where the SNN is divided across two separate devices, the device storing the first layers must share information about the spikes generated by the local output neurons with the other device. Consequently, the advantages of multi-level spikes must be balanced against the challenges of transmitting additional bits between the two devices. This paper addresses these challenges by investigating a wireless neuromorphic split computing architecture employing multi-level SNNs. For this system, we present the design of digital and analog modulation schemes optimized for an orthogonal frequency division multiplexing (OFDM) radio interface. Simulation and experimental results using software-defined radios provide insights into the performance gains of multi-level SNN models and the optimal payload size as a function of the quality of the connection between a transmitter and receiver. Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2411.04728 [cs.LG] (or arXiv:2411.04728v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2411.04728 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-18] Exploring Hierarchical Molecular Graph Representation in Multimodal LLM s

链接: https://arxiv.org/abs/2411.04708
作者: Chengxin Hu,Hao Li
关键词-EN: large language models, language models, feature levels, graph features, milestones in large
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Following the milestones in large language models (LLMs) and multimodal models, we have seen a surge in applying LLMs to biochemical tasks. Leveraging graph features and molecular text representations, LLMs can tackle various tasks, such as predicting chemical reaction outcomes and describing molecular properties. However, most current work overlooks the multi-level nature of graph features. The impact of different feature levels on LLMs and the importance of each level remain unexplored, and it is possible that different chemistry tasks require different feature levels. In this work, we first investigate the effect of feature granularity by fusing GNN-generated feature tokens, discovering that even reducing all tokens to a single token does not significantly impact performance. We then explore the effect of various feature levels on performance, finding that both the quality of LLM-generated molecules and performance on different tasks benefit from different feature levels. We conclude with two key insights: (1) current molecular Multimodal LLMs(MLLMs) lack a comprehensive understanding of graph features, and (2) static processing is not sufficient for hierarchical graph feature. Our code will be publicly available soon.

[LG-19] Field Assessment of Force Torque Sensors for Planetary Rover Navigation

链接: https://arxiv.org/abs/2411.04700
作者: Levin Gerdes,Carlos Pérez del Pulgar,Raúl Castilla Arquillo,Martin Azkarate
关键词-EN: Proprioceptive sensors, planetary rovers serve, serve for state, state estimation, locomotion performance
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Proprioceptive sensors on planetary rovers serve for state estimation and for understanding terrain and locomotion performance. While inertial measurement units (IMUs) are widely used to this effect, force-torque sensors are less explored for planetary navigation despite their potential to directly measure interaction forces and provide insights into traction performance. This paper presents an evaluation of the performance and use cases of force-torque sensors based on data collected from a six-wheeled rover during tests over varying terrains, speeds, and slopes. We discuss challenges, such as sensor signal reliability and terrain response accuracy, and identify opportunities regarding the use of these sensors. The data is openly accessible and includes force-torque measurements from each of the six-wheel assemblies as well as IMU data from within the rover chassis. This paper aims to inform the design of future studies and rover upgrades, particularly in sensor integration and control algorithms, to improve navigation capabilities.

[LG-20] Is network fragmentation a useful complexity measure?

链接: https://arxiv.org/abs/2411.04695
作者: Coenraad Mouton,Randle Rabe,Daniël G. Haasbroek,Marthinus W. Theunissen,Hermanus L. Potgieter,Marelie H. Davel
关键词-EN: model function rapidly, classifiers can exhibit, model function, function rapidly, rapidly changes class
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It has been observed that the input space of deep neural network classifiers can exhibit `fragmentation’, where the model function rapidly changes class as the input space is traversed. The severity of this fragmentation tends to follow the double descent curve, achieving a maximum at the interpolation regime. We study this phenomenon in the context of image classification and ask whether fragmentation could be predictive of generalization performance. Using a fragmentation-based complexity measure, we show this to be possible by achieving good performance on the PGDL (Predicting Generalization in Deep Learning) benchmark. In addition, we report on new observations related to fragmentation, namely (i) fragmentation is not limited to the input space but occurs in the hidden representations as well, (ii) fragmentation follows the trends in the validation error throughout training, and (iii) fragmentation is not a direct result of increased weight norms. Together, this indicates that fragmentation is a phenomenon worth investigating further when studying the generalization ability of deep neural networks.

[LG-21] Differentially Private Continual Learning using Pre-Trained Models NEURIPS2024

链接: https://arxiv.org/abs/2411.04680
作者: Marlon Tobaben,Marcus Klasson,Rui Li,Arno Solin,Antti Honkela
关键词-EN: continual learning, work explores, explores the intersection, differential privacy, continual learning models
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 15 pages, 3 figures, Accepted at Scalable Continual Learning for Lifelong Foundation Models Workshop at 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:This work explores the intersection of continual learning (CL) and differential privacy (DP). Crucially, continual learning models must retain knowledge across tasks, but this conflicts with the differential privacy requirement of restricting individual samples to be memorised in the model. We propose using pre-trained models to address the trade-offs between privacy and performance in a continual learning this http URL specifically, we present necessary assumptions to enable privacy-preservation and propose combining pre-trained models with parameter-free classifiers and parameter-efficient adapters that are learned under differential privacy. Our experiments demonstrate their effectiveness and provide insights into balancing the competing demands of continual learning and privacy.

[LG-22] Semantic-Aware Resource Management for C-V2X Platooning via Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2411.04672
作者: Zhiyu Shao,Qiong Wu,Pingyi Fan,Kezhi Wang,Qiang Fan,Wen Chen,Khaled B. Letaief
关键词-EN: semantic-aware multi-modal resource, multi-agent reinforcement learning, multi-modal resource allocation, systems where cellular, paper presents
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

点击查看摘要

Abstract:This paper presents a semantic-aware multi-modal resource allocation (SAMRA) for multi-task using multi-agent reinforcement learning (MARL), termed SAMRAMARL, utilizing in platoon systems where cellular vehicle-to-everything (C-V2X) communication is employed. The proposed approach leverages the semantic information to optimize the allocation of communication resources. By integrating a distributed multi-agent reinforcement learning (MARL) algorithm, SAMRAMARL enables autonomous decision-making for each vehicle, channel assignment optimization, power allocation, and semantic symbol length based on the contextual importance of the transmitted information. This semantic-awareness ensures that both vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications prioritize data that is critical for maintaining safe and efficient platoon operations. The framework also introduces a tailored quality of experience (QoE) metric for semantic communication, aiming to maximize QoE in V2V links while improving the success rate of semantic information transmission (SRS). Extensive simulations has demonstrated that SAMRAMARL outperforms existing methods, achieving significant gains in QoE and communication efficiency in C-V2X platooning scenarios.

[LG-23] Enhancing Trust in Clinically Significant Prostate Cancer Prediction with Multiple Magnetic Resonance Imaging Modalities ML4H ALT

链接: https://arxiv.org/abs/2411.04662
作者: Benjamin Ng,Chi-en Amy Tai,E. Zhixuan Zeng,Alexander Wong
关键词-EN: United States, deaths in males, multiple MRI modalities, prostate cancer, MRI modalities
类目: Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 6 pages

点击查看摘要

Abstract:In the United States, prostate cancer is the second leading cause of deaths in males with a predicted 35,250 deaths in 2024. However, most diagnoses are non-lethal and deemed clinically insignificant which means that the patient will likely not be impacted by the cancer over their lifetime. As a result, numerous research studies have explored the accuracy of predicting clinical significance of prostate cancer based on magnetic resonance imaging (MRI) modalities and deep neural networks. Despite their high performance, these models are not trusted by most clinical scientists as they are trained solely on a single modality whereas clinical scientists often use multiple magnetic resonance imaging modalities during their diagnosis. In this paper, we investigate combining multiple MRI modalities to train a deep learning model to enhance trust in the models for clinically significant prostate cancer prediction. The promising performance and proposed training pipeline showcase the benefits of incorporating multiple MRI modalities for enhanced trust and accuracy.

[LG-24] Centrality Graph Shift Operators for Graph Neural Networks

链接: https://arxiv.org/abs/2411.04655
作者: Yassine Abbahaddou,Fragkiskos D. Malliaros,Johannes F. Lutzeyer,Michalis Vazirgiannis
关键词-EN: Graph Shift Operators, graph Laplacian matrices, graph representation learning, Shift Operators, Graph Shift
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Spectral Theory (math.SP); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, k -core or count of fixed length walks. We study spectral properties of the CGSOs, allowing us to get an understanding of their action on graph signals. We confirm this understanding by defining and running the spectral clustering algorithm based on different CGSOs on several synthetic and real-world datasets. We furthermore outline how our CGSO can act as the message passing operator in any Graph Neural Network and in particular demonstrate strong performance of a variant of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world benchmark datasets.

[LG-25] IGDrivSim: A Benchmark for the Imitation Gap in Autonomous Driving

链接: https://arxiv.org/abs/2411.04653
作者: Clémence Grislain,Risto Vuorio,Cong Lu,Shimon Whiteson
关键词-EN: Developing autonomous vehicles, navigate complex environments, Developing autonomous, navigate complex, complex environments
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 1 table

点击查看摘要

Abstract:Developing autonomous vehicles that can navigate complex environments with human-level safety and efficiency is a central goal in self-driving research. A common approach to achieving this is imitation learning, where agents are trained to mimic human expert demonstrations collected from real-world driving scenarios. However, discrepancies between human perception and the self-driving car’s sensors can introduce an \textitimitation gap, leading to imitation learning failures. In this work, we introduce \textbfIGDrivSim, a benchmark built on top of the Waymax simulator, designed to investigate the effects of the imitation gap in learning autonomous driving policy from human expert demonstrations. Our experiments show that this perception gap between human experts and self-driving agents can hinder the learning of safe and effective driving behaviors. We further show that combining imitation with reinforcement learning, using a simple penalty reward for prohibited behaviors, effectively mitigates these failures. Our code is open-sourced at: this https URL.

[LG-26] Cybercrime Prediction via Geographically Weighted Learning

链接: https://arxiv.org/abs/2411.04635
作者: Muhammad Al-Zafar Khan,Jamal Al-Karaki,Emad Mahafzah
关键词-EN: Geographically Weighted Regression, Gulf Cooperation Council, success of Geographically, Cooperation Council region, Geographically Weighted
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures, Submitted to the International Jordanian Cybersecurity Conference 2024 (IJCC24)

点击查看摘要

Abstract:Inspired by the success of Geographically Weighted Regression and its accounting for spatial variations, we propose GeogGNN – A graph neural network model that accounts for geographical latitude and longitudinal points. Using a synthetically generated dataset, we apply the algorithm for a 4-class classification problem in cybersecurity with seemingly realistic geographic coordinates centered in the Gulf Cooperation Council region. We demonstrate that it has higher accuracy than standard neural networks and convolutional neural networks that treat the coordinates as features. Encouraged by the speed-up in model accuracy by the GeogGNN model, we provide a general mathematical result that demonstrates that a geometrically weighted neural network will, in principle, always display higher accuracy in the classification of spatially dependent data by making use of spatial continuity and local averaging features.

[LG-27] Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

链接: https://arxiv.org/abs/2411.04625
作者: Heyang Zhao,Chenlu Ye,Quanquan Gu,Tong Zhang
关键词-EN: enhance policy optimization, reinforcement learning, RLHF, sample complexity, regularization has emerged
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same \mathcalO(1 / \epsilon^2) sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an \mathcalO(1 / \epsilon) sample complexity when \epsilon is sufficiently small. We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2411.04625 [cs.LG] (or arXiv:2411.04625v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2411.04625 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-28] owards Robust Federated Analytics via Differentially Private Measurements of Statistical Heterogeneity

链接: https://arxiv.org/abs/2411.04579
作者: Mary Scott,Graham Cormode,Carsten Maple
关键词-EN: Statistical heterogeneity, measure statistical heterogeneity, heterogeneity, Statistical, analytic mechanism
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 26 pages, 6 tables, 1 figure

点击查看摘要

Abstract:Statistical heterogeneity is a measure of how skewed the samples of a dataset are. It is a common problem in the study of differential privacy that the usage of a statistically heterogeneous dataset results in a significant loss of accuracy. In federated scenarios, statistical heterogeneity is more likely to happen, and so the above problem is even more pressing. We explore the three most promising ways to measure statistical heterogeneity and give formulae for their accuracy, while simultaneously incorporating differential privacy. We find the optimum privacy parameters via an analytic mechanism, which incorporates root finding methods. We validate the main theorems and related hypotheses experimentally, and test the robustness of the analytic mechanism to different heterogeneity levels. The analytic mechanism in a distributed setting delivers superior accuracy to all combinations involving the classic mechanism and/or the centralized setting. All measures of statistical heterogeneity do not lose significant accuracy when a heterogeneous sample is used.

[LG-29] Higher-Order GNNs Meet Efficiency: Sparse Sobolev Graph Neural Networks

链接: https://arxiv.org/abs/2411.04570
作者: Jhony H. Giraldo,Aref Einizade,Andjela Todorovic,Jhon A. Castro-Correa,Mohsen Badiey,Thierry Bouwmans,Fragkiskos D. Malliaros
关键词-EN: Graph Neural Networks, Neural Networks, shown great promise, higher-order relationships remains, large-scale networks
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown great promise in modeling relationships between nodes in a graph, but capturing higher-order relationships remains a challenge for large-scale networks. Previous studies have primarily attempted to utilize the information from higher-order neighbors in the graph, involving the incorporation of powers of the shift operator, such as the graph Laplacian or adjacency matrix. This approach comes with a trade-off in terms of increased computational and memory demands. Relying on graph spectral theory, we make a fundamental observation: the regular and the Hadamard power of the Laplacian matrix behave similarly in the spectrum. This observation has significant implications for capturing higher-order information in GNNs for various tasks such as node classification and semi-supervised learning. Consequently, we propose a novel graph convolutional operator based on the sparse Sobolev norm of graph signals. Our approach, known as Sparse Sobolev GNN (S2-GNN), employs Hadamard products between matrices to maintain the sparsity level in graph representations. S2-GNN utilizes a cascade of filters with increasing Hadamard powers to generate a diverse set of functions. We theoretically analyze the stability of S2-GNN to show the robustness of the model against possible graph perturbations. We also conduct a comprehensive evaluation of S2-GNN across various graph mining, semi-supervised node classification, and computer vision tasks. In particular use cases, our algorithm demonstrates competitive performance compared to state-of-the-art GNNs in terms of performance and running time.

[LG-30] Uncertainty Prediction Neural Network (UpNet): Embedding Artificial Neural Network in Bayesian Inversion Framework to Quantify the Uncertainty of Remote Sensing Retrieval

链接: https://arxiv.org/abs/2411.04556
作者: Dasheng Fan,Xihan Mu,Yongkang Lai,Donghui Xie,Guangjian Yan
关键词-EN: radiative transfer models, large-scale vegetation biophysical, Artificial Neural Network, Prediction Neural Network, transfer models
类目: Machine Learning (cs.LG)
*备注: 24 pages, f figures

点击查看摘要

Abstract:For the retrieval of large-scale vegetation biophysical parameters, the inversion of radiative transfer models (RTMs) is the most commonly used approach. In recent years, Artificial Neural Network (ANN)-based methods have become the mainstream for inverting RTMs due to their high accuracy and computational efficiency. It has been widely used in the retrieval of biophysical variables (BV). However, due to the lack of the Bayesian inversion theory interpretation, it faces challenges in quantifying the retrieval uncertainty, a crucial metric for product quality validation and downstream applications such as data assimilation or ecosystem carbon cycling modeling. This study proved that the ANN trained with squared loss outputs the posterior mean, providing a rigorous foundation for its uncertainty quantification, regularization, and incorporation of prior information. A Bayesian theoretical framework was subsequently proposed for ANN-based methods. Using this framework, we derived a new algorithm called Uncertainty Prediction Neural Network (UpNet), which enables the simultaneous training of two ANNs to retrieve BV and provide retrieval uncertainty. To validate our method, we compared UpNet with the standard Bayesian inference method, i.e., Markov Chain Monte Carlo (MCMC), in the inversion of a widely used RTM called ProSAIL for retrieving BVs and estimating uncertainty. The results demonstrated that the BVs retrieved and the uncertainties estimated by UpNet were highly consistent with those from MCMC, achieving over a million-fold acceleration. These results indicated that UpNet has significant potential for fast retrieval and uncertainty quantification of BVs or other parameters with medium and high-resolution remote sensing data. Our Python implementation is available at: this https URL.

[LG-31] Peri-midFormer: Periodic Pyramid Transformer for Time Series Analysis NEURIPS2024

链接: https://arxiv.org/abs/2411.04554
作者: Qiang Wu,Gechang Yao,Zhixi Feng,Shuyuan Yang
关键词-EN: Time series, finds wide applications, Time series analysis, analysis finds wide, series analysis finds
类目: Machine Learning (cs.LG)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Time series analysis finds wide applications in fields such as weather forecasting, anomaly detection, and behavior recognition. Previous methods attempted to model temporal variations directly using 1D time series. However, this has been quite challenging due to the discrete nature of data points in time series and the complexity of periodic variation. In terms of periodicity, taking weather and traffic data as an example, there are multi-periodic variations such as yearly, monthly, weekly, and daily, etc. In order to break through the limitations of the previous methods, we decouple the implied complex periodic variations into inclusion and overlap relationships among different level periodic components based on the observation of the multi-periodicity therein and its inclusion relationships. This explicitly represents the naturally occurring pyramid-like properties in time series, where the top level is the original time series and lower levels consist of periodic components with gradually shorter periods, which we call the periodic pyramid. To further extract complex temporal variations, we introduce self-attention mechanism into the periodic pyramid, capturing complex periodic relationships by computing attention between periodic components based on their inclusion, overlap, and adjacency relationships. Our proposed Peri-midFormer demonstrates outstanding performance in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection.

[LG-32] Hypercube Policy Regularization Framework for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2411.04534
作者: Yi Shen,Hanyan Huang
关键词-EN: Offline reinforcement learning, reinforcement learning, Offline reinforcement, received extensive attention, reinforcement learning methods
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning has received extensive attention from scholars because it avoids the interaction between the agent and the environment by learning a policy through a static dataset. However, general reinforcement learning methods cannot get satisfactory results in offline reinforcement learning due to the out-of-distribution state actions that the dataset cannot cover during training. To solve this problem, the policy regularization method that tries to directly clone policies used in static datasets has received numerous studies due to its simplicity and effectiveness. However, policy constraint methods make the agent choose the corresponding actions in the static dataset. This type of constraint is usually over-conservative, which results in suboptimal policies, especially in low-quality static datasets. In this paper, a hypercube policy regularization framework is proposed, this method alleviates the constraints of policy constraint methods by allowing the agent to explore the actions corresponding to similar states in the static dataset, which increases the effectiveness of algorithms in low-quality datasets. It was also theoretically demonstrated that the hypercube policy regularization framework can effectively improve the performance of original algorithms. In addition, the hypercube policy regularization framework is combined with TD3-BC and Diffusion-QL for experiments on D4RL datasets which are called TD3-BC-C and Diffusion-QL-C. The experimental results of the score demonstrate that TD3-BC-C and Diffusion-QL-C perform better than state-of-the-art algorithms like IQL, CQL, TD3-BC and Diffusion-QL in most D4RL environments in approximate time.

[LG-33] Real-time stress detection on social network posts using big data technology

链接: https://arxiv.org/abs/2411.04532
作者: Hai-Yen Phan Nguyen,Phi-Lan Ly,Duc-Manh Le,Trong-Hop Do
关键词-EN: social media, modern life, emotions and moods, social media posts, context of modern
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:In the context of modern life, particularly in Industry 4.0 within the online space, emotions and moods are frequently conveyed through social media posts. The trend of sharing stories, thoughts, and feelings on these platforms generates a vast and promising data source for Big Data. This creates both a challenge and an opportunity for research in applying technology to develop more automated and accurate methods for detecting stress in social media users. In this study, we developed a real-time system for stress detection in online posts, using the “Dreaddit: A Reddit Dataset for Stress Analysis in Social Media,” which comprises 187,444 posts across five different Reddit domains. Each domain contains texts with both stressful and non-stressful content, showcasing various expressions of stress. A labeled dataset of 3,553 lines was created for training. Apache Kafka, PySpark, and AirFlow were utilized to build and deploy the model. Logistic Regression yielded the best results for new streaming data, achieving 69,39% for measuring accuracy and 68,97 for measuring F1-scores.

[LG-34] Normalized Space Alignment: A Versatile Metric for Representation Analysis

链接: https://arxiv.org/abs/2411.04512
作者: Danish Ebadulla,Aditya Gulati,Ambuj Singh
关键词-EN: manifold analysis technique, Normalized Space Alignment, introduce a manifold, loss function, network loss function
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:We introduce a manifold analysis technique for neural network representations. Normalized Space Alignment (NSA) compares pairwise distances between two point clouds derived from the same source and having the same size, while potentially possessing differing dimensionalities. NSA can act as both an analytical tool and a differentiable loss function, providing a robust means of comparing and aligning representations across different layers and models. It satisfies the criteria necessary for both a similarity metric and a neural network loss function. We showcase NSA’s versatility by illustrating its utility as a representation space analysis metric, a structure-preserving loss function, and a robustness analysis tool. NSA is not only computationally efficient but it can also approximate the global structural discrepancy during mini-batching, facilitating its use in a wide variety of neural network training paradigms.

[LG-35] LLM -R: A Framework for Domain-Adaptive Maintenance Scheme Generation Combining Hierarchical Agents and RAG

链接: https://arxiv.org/abs/2411.04476
作者: Laifa Tao,Qixuan Huang,Xianjun Wu,Weiwei Zhang,Yunlong Wu,Bin Li,Chen Lu,Xingshuo Hai
关键词-EN: Language User Interfaces, Graphical User Interfaces, Electronic Technical Manuals, User Interfaces, maintenance
类目: Machine Learning (cs.LG)
*备注: 30 pages, 7 figures

点击查看摘要

Abstract:The increasing use of smart devices has emphasized the critical role of maintenance in production activities. Interactive Electronic Technical Manuals (IETMs) are vital tools that support the maintenance of smart equipment. However, traditional IETMs face challenges such as transitioning from Graphical User Interfaces (GUIs) to natural Language User Interfaces (LUIs) and managing complex logical relationships. Additionally, they must meet the current demands for higher intelligence. This paper proposes a Maintenance Scheme Generation Method based on Large Language Models (LLM-R). The proposed method includes several key innovations: We propose the Low Rank Adaptation-Knowledge Retention (LORA-KR) loss technology to proportionally adjust mixed maintenance data for fine-tuning the LLM. This method prevents knowledge conflicts caused by mixed data, improving the model’s adaptability and reasoning ability in specific maintenance domains, Besides, Hierarchical Task-Based Agent and Instruction-level Retrieval-Augmented Generation (RAG) technologies are adopted to optimize the generation steps and mitigate the phenomenon of hallucination caused by the model’s Inability to access contextual information. This enhancement improves the model’s flexibility and accuracy in handling known or unknown maintenance objects and maintenance scheme scenarios. To validate the proposed method’s effectiveness in maintenance tasks, a maintenance scheme dataset was constructed using objects from different fields. The experimental results show that the accuracy of the maintenance schemes generated by the proposed method reached 91.59%, indicating which improvement enhances the intelligence of maintenance schemes and introduces novel technical approaches for equipment maintenance.

[LG-36] GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection

链接: https://arxiv.org/abs/2411.04459
作者: Prashank Kadam
关键词-EN: increasing number, services available online, financial services, Symbolic Regression MCTS, decision-making process
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: ACM International Conference on Information and Knowledge Management 2024 RAG - Enterprise

点击查看摘要

Abstract:With the increasing number of financial services available online, the rate of financial fraud has also been increasing. The traffic and transaction rates on the internet have increased considerably, leading to a need for fast decision-making. Financial institutions also have stringent regulations that often require transparency and explainability of the decision-making process. However, most state-of-the-art algorithms currently used in the industry are highly parameterized black-box models that rely on complex computations to generate a score. These algorithms are inherently slow and lack the explainability and speed of traditional rule-based learners. This work introduces SR-MCTS (Symbolic Regression MCTS), which utilizes a foundational GPT model to guide the MCTS, significantly enhancing its convergence speed and the quality of the generated expressions which are further extracted to rules. Our experiments show that SR-MCTS can detect fraud more efficiently than widely used methods in the industry while providing substantial insights into the decision-making process.

[LG-37] Comparing Fairness of Generative Mobility Models

链接: https://arxiv.org/abs/2411.04453
作者: Daniel Wang,Jack McFarland,Afra Mashhadi,Ekin Ugurel
关键词-EN: Deep Gravity, work examines, overlooked dimension, Gravity, geographic regions
类目: Machine Learning (cs.LG)
*备注: 2 pages, Accepted at the Network Mobility (NetMob) 2024 conference

点击查看摘要

Abstract:This work examines the fairness of generative mobility models, addressing the often overlooked dimension of equity in model performance across geographic regions. Predictive models built on crowd flow data are instrumental in understanding urban structures and movement patterns; however, they risk embedding biases, particularly in spatiotemporal contexts where model performance may reflect and reinforce existing inequities tied to geographic distribution. We propose a novel framework for assessing fairness by measuring the utility and equity of generated traces. Utility is assessed via the Common Part of Commuters (CPC), a similarity metric comparing generated and real mobility flows, while fairness is evaluated using demographic parity. By reformulating demographic parity to reflect the difference in CPC distribution between two groups, our analysis reveals disparities in how various models encode biases present in the underlying data. We utilized four models (Gravity, Radiation, Deep Gravity, and Non-linear Gravity) and our results indicate that traditional gravity and radiation models produce fairer outcomes, although Deep Gravity achieves higher CPC. This disparity underscores a trade-off between model accuracy and equity, with the feature-rich Deep Gravity model amplifying pre-existing biases in community representations. Our findings emphasize the importance of integrating fairness metrics in mobility modeling to avoid perpetuating inequities.

[LG-38] owards Unifying Interpretability and Control: Evaluation via Intervention

链接: https://arxiv.org/abs/2411.04430
作者: Usha Bhalla,Suraj Srinivas,Asma Ghandeharioun,Himabindu Lakkaraju
关键词-EN: large language models, understand model reasoning, reasoning has emerged, growing complexity, complexity and capability
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both, with the connection between interpretation and control more broadly remaining tenuous. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess these methods’ practical utility and efficacy. To address this, we propose intervention as a fundamental goal of interpretability and introduce success criteria to evaluate how well methods are able to control model behavior through interventions. We unify and extend four popular interpretability methods–sparse autoencoders, logit lens, tuned lens, and probing–into an abstract encoder-decoder framework. This framework maps intermediate latent representations to human-interpretable feature spaces, enabling interventions on these interpretable features, which can then be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics: intervention success rate and the coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) although current methods allow for intervention, they are inconsistent across models and features, (2) lens-based methods outperform others in achieving simple, concrete interventions, and (3) interventions often compromise model performance and coherence, underperforming simpler alternatives, such as prompting, for steering model behavior and highlighting a critical shortcoming of current interpretability approaches in real-world applications requiring control.

[LG-39] Unsupervised Abnormal Stop Detection for Long Distance Coaches with Low-Frequency GPS

链接: https://arxiv.org/abs/2411.04422
作者: Jiaxin Deng,Junbiao Pang,Jiayu Xu,Haitao Yu
关键词-EN: Abnormal Stop Detection, abnormal stop, distance coaches supply, long distance coaches, urban life
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In our urban life, long distance coaches supply a convenient yet economic approach to the transportation of the public. One notable problem is to discover the abnormal stop of the coaches due to the important reason, i.e., illegal pick up on the way which possibly endangers the safety of passengers. It has become a pressing issue to detect the coach abnormal stop with low-quality GPS. In this paper, we propose an unsupervised method that helps transportation managers to efficiently discover the Abnormal Stop Detection (ASD) for long distance coaches. Concretely, our method converts the ASD problem into an unsupervised clustering framework in which both the normal stop and the abnormal one are decomposed. Firstly, we propose a stop duration model for the low frequency GPS based on the assumption that a coach changes speed approximately in a linear approach. Secondly, we strip the abnormal stops from the normal stop points by the low rank assumption. The proposed method is conceptually simple yet efficient, by leveraging low rank assumption to handle normal stop points, our approach enables domain experts to discover the ASD for coaches, from a case study motivated by traffic managers. Datset and code are publicly available at: this https URL.

[LG-40] Remote Sensing-Based Assessment of Economic Development

链接: https://arxiv.org/abs/2411.04396
作者: Yijian Pan,Yongchang Ma,Bolin Shen,Linyang He
关键词-EN: including nighttime light, remote sensing images, nighttime light data, economic development level, including nighttime
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The goal of our project is to use satellite data (including nighttime light data and remote sensing images) to give us some statistical estimation of the economic development level of a selected area (Singapore). Findings from the project could inform policymakers about areas needing intervention or support for economic development initiatives. Insights gained might aid in targeted policy formulation for infrastructure, agriculture, urban planning, or resource management.

[LG-41] Unlearning in- vs. out-of-distribution data in LLM s under gradient-based method NEURIPS2024

链接: https://arxiv.org/abs/2411.04388
作者: Teodora Baluta,Pascal Lamblin,Daniel Tarlow,Fabian Pedregosa,Gintare Karolina Dziugaite
关键词-EN: Machine unlearning aims, Machine unlearning, aims to solve, removing the influence, influence of selected
类目: Machine Learning (cs.LG)
*备注: Accepted at Safe Generative AI Workshop @ NeurIPS 2024

点击查看摘要

Abstract:Machine unlearning aims to solve the problem of removing the influence of selected training examples from a learned model. Despite the increasing attention to this problem, it remains an open research question how to evaluate unlearning in large language models (LLMs), and what are the critical properties of the data to be unlearned that affect the quality and efficiency of unlearning. This work formalizes a metric to evaluate unlearning quality in generative models, and uses it to assess the trade-offs between unlearning quality and performance. We demonstrate that unlearning out-of-distribution examples requires more unlearning steps but overall presents a better trade-off overall. For in-distribution examples, however, we observe a rapid decay in performance as unlearning progresses. We further evaluate how example’s memorization and difficulty affect unlearning under a classical gradient ascent-based approach.

[LG-42] rajGPT: Controlled Synthetic Trajectory Generation Using a Multitask Transformer-Based Spatiotemporal Model

链接: https://arxiv.org/abs/2411.04381
作者: Shang-Ling Hsu,Emmanuel Tung,John Krumm,Cyrus Shahabi,Khurram Shafique
关键词-EN: Human mobility modeling, synthetic trajectory generation, Human mobility, trajectory generation, synthetic trajectory
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2024)

点击查看摘要

Abstract:Human mobility modeling from GPS-trajectories and synthetic trajectory generation are crucial for various applications, such as urban planning, disaster management and epidemiology. Both of these tasks often require filling gaps in a partially specified sequence of visits - a new problem that we call “controlled” synthetic trajectory generation. Existing methods for next-location prediction or synthetic trajectory generation cannot solve this problem as they lack the mechanisms needed to constrain the generated sequences of visits. Moreover, existing approaches (1) frequently treat space and time as independent factors, an assumption that fails to hold true in real-world scenarios, and (2) suffer from challenges in accuracy of temporal prediction as they fail to deal with mixed distributions and the inter-relationships of different modes with latent variables (e.g., day-of-the-week). These limitations become even more pronounced when the task involves filling gaps within sequences instead of solely predicting the next visit. We introduce TrajGPT, a transformer-based, multi-task, joint spatiotemporal generative model to address these issues. Taking inspiration from large language models, TrajGPT poses the problem of controlled trajectory generation as that of text infilling in natural language. TrajGPT integrates the spatial and temporal models in a transformer architecture through a Bayesian probability model that ensures that the gaps in a visit sequence are filled in a spatiotemporally consistent manner. Our experiments on public and private datasets demonstrate that TrajGPT not only excels in controlled synthetic visit generation but also outperforms competing models in next-location prediction tasks - Relatively, TrajGPT achieves a 26-fold improvement in temporal accuracy while retaining more than 98% of spatial accuracy on average. Comments: 10 pages, 3 figures, 32nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2024) Subjects: Machine Learning (cs.LG) ACMclasses: I.6.5; I.2.6; G.3 Cite as: arXiv:2411.04381 [cs.LG] (or arXiv:2411.04381v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2411.04381 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3678717.3691303 Focus to learn more DOI(s) linking to related resources

[LG-43] Game-Theoretic Defenses for Robust Conformal Prediction Against Adversarial Attacks in Medical Imaging

链接: https://arxiv.org/abs/2411.04376
作者: Rui Luo,Jie Bao,Zhixin Zhou,Chuangyin Dang
关键词-EN: pose significant threats, attacks pose significant, Adversarial attacks pose, medical imaging, pose significant
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Adversarial attacks pose significant threats to the reliability and safety of deep learning models, especially in critical domains such as medical imaging. This paper introduces a novel framework that integrates conformal prediction with game-theoretic defensive strategies to enhance model robustness against both known and unknown adversarial perturbations. We address three primary research questions: constructing valid and efficient conformal prediction sets under known attacks (RQ1), ensuring coverage under unknown attacks through conservative thresholding (RQ2), and determining optimal defensive strategies within a zero-sum game framework (RQ3). Our methodology involves training specialized defensive models against specific attack types and employing maximum and minimum classifiers to aggregate defenses effectively. Extensive experiments conducted on the MedMNIST datasets, including PathMNIST, OrganAMNIST, and TissueMNIST, demonstrate that our approach maintains high coverage guarantees while minimizing prediction set sizes. The game-theoretic analysis reveals that the optimal defensive strategy often converges to a singular robust model, outperforming uniform and simple strategies across all evaluated datasets. This work advances the state-of-the-art in uncertainty quantification and adversarial robustness, providing a reliable mechanism for deploying deep learning models in adversarial environments.

[LG-44] Impact of white noise in artificial neural networks trained for classification: performance and noise mitigation strategies

链接: https://arxiv.org/abs/2411.04354
作者: Nadezhda Semenova,Daniel Brunner
关键词-EN: leveraging physical coupling, recent years, increased in relevance, hardware implementation, coupling and analog
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:In recent years, the hardware implementation of neural networks, leveraging physical coupling and analog neurons has substantially increased in relevance. Such nonlinear and complex physical networks provide significant advantages in speed and energy efficiency, but are potentially susceptible to internal noise when compared to digital emulations of such networks. In this work, we consider how additive and multiplicative Gaussian white noise on the neuronal level can affect the accuracy of the network when applied for specific tasks and including a softmax function in the readout layer. We adapt several noise reduction techniques to the essential setting of classification tasks, which represent a large fraction of neural network computing. We find that these adjusted concepts are highly effective in mitigating the detrimental impact of noise.

[LG-45] Classification with Conceptual Safeguards

链接: https://arxiv.org/abs/2411.04342
作者: Hailey Joren,Charles Marx,Berk Ustun
关键词-EN: established concepts, promote safety, classification tasks, approach, concepts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new approach to promote safety in classification tasks with established concepts. Our approach – called a conceptual safeguard – acts as a verification layer for models that predict a target outcome by first predicting the presence of intermediate concepts. Given this architecture, a safeguard ensures that a model meets a minimal level of accuracy by abstaining from uncertain predictions. In contrast to a standard selective classifier, a safeguard provides an avenue to improve coverage by allowing a human to confirm the presence of uncertain concepts on instances on which it abstains. We develop methods to build safeguards that maximize coverage without compromising safety, namely techniques to propagate the uncertainty in concept predictions and to flag salient concepts for human review. We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks.

[LG-46] Enhancing classroom teaching with LLM s and RAG

链接: https://arxiv.org/abs/2411.04341
作者: Elizabeth A Mullins,Adrian Portillo,Kristalys Ruiz-Rohena,Aritran Piplai
关键词-EN: Large Language Models, Large Language, Language Models, daily inquiries, data source
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models have become a valuable source of information for our daily inquiries. However, after training, its data source quickly becomes out-of-date, making RAG a useful tool for providing even more recent or pertinent data. In this work, we investigate how RAG pipelines, with the course materials serving as a data source, might help students in K-12 education. The initial research utilizes Reddit as a data source for up-to-date cybersecurity information. Chunk size is evaluated to determine the optimal amount of context needed to generate accurate answers. After running the experiment for different chunk sizes, answer correctness was evaluated using RAGAs with average answer correctness not exceeding 50 percent for any chunk size. This suggests that Reddit is not a good source to mine for data for questions about cybersecurity threats. The methodology was successful in evaluating the data source, which has implications for its use to evaluate educational resources for effectiveness.

[LG-47] Efficient Symmetry-Aware Materials Generation via Hierarchical Generative Flow Networks

链接: https://arxiv.org/abs/2411.04323
作者: Tri Minh Nguyen,Sherif Abdulkader Tawfik,Truyen Tran,Sunil Gupta,Santu Rana,Svetha Venkatesh
关键词-EN: requires rapidly exploring, locating stable regions, Discovering new solid-state, solid-state materials requires, materials requires rapidly
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Discovering new solid-state materials requires rapidly exploring the vast space of crystal structures and locating stable regions. Generating stable materials with desired properties and compositions is extremely difficult as we search for very small isolated pockets in the exponentially many possibilities, considering elements from the periodic table and their 3D arrangements in crystal lattices. Materials discovery necessitates both optimized solution structures and diversity in the generated material structures. Existing methods struggle to explore large material spaces and generate diverse samples with desired properties and requirements. We propose the Symmetry-aware Hierarchical Architecture for Flow-based Traversal (SHAFT), a novel generative model employing a hierarchical exploration strategy to efficiently exploit the symmetry of the materials space to generate crystal structures given desired properties. In particular, our model decomposes the exponentially large materials space into a hierarchy of subspaces consisting of symmetric space groups, lattice parameters, and atoms. We demonstrate that SHAFT significantly outperforms state-of-the-art iterative generative methods, such as Generative Flow Networks (GFlowNets) and Crystal Diffusion Variational AutoEncoders (CDVAE), in crystal structure generation tasks, achieving higher validity, diversity, and stability of generated structures optimized for target properties and requirements.

[LG-48] owards Optimizing SQL Generation via LLM Routing NEURIPS2024

链接: https://arxiv.org/abs/2411.04319
作者: Mohammadhossein Malekpour,Nour Shaheen,Foutse Khomh,Amine Mhedhbi
关键词-EN: enables users, simplifying access, structured data, users to interact, interact with databases
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Table Representation Learning Workshop at NeurIPS 2024

点击查看摘要

Abstract:Text-to-SQL enables users to interact with databases through natural language, simplifying access to structured data. Although highly capable large language models (LLMs) achieve strong accuracy for complex queries, they incur unnecessary latency and dollar cost for simpler ones. In this paper, we introduce the first LLM routing approach for Text-to-SQL, which dynamically selects the most cost-effective LLM capable of generating accurate SQL for each query. We present two routing strategies (score- and classification-based) that achieve accuracy comparable to the most capable LLM while reducing costs. We design the routers for ease of training and efficient inference. In our experiments, we highlight a practical and explainable accuracy-cost trade-off on the BIRD dataset.

[LG-49] heoretically informed selection of latent activation in autoencoder based recommender systems

链接: https://arxiv.org/abs/2411.04315
作者: Aviad Susman
关键词-EN: computationally efficient recommender, distilling sparse high-dimensional, sparse high-dimensional data, efficient recommender systems, lower-dimensional latent representations
类目: Machine Learning (cs.LG)
*备注: 2 pages, 1 figure

点击查看摘要

Abstract:Autoencoders may lend themselves to the design of more accurate and computationally efficient recommender systems by distilling sparse high-dimensional data into dense lower-dimensional latent representations. However, designing these systems remains challenging due to the lack of theoretical guidance. This work addresses this by identifying three key mathematical properties that the encoder in an autoencoder should exhibit to improve recommendation accuracy: (1) dimensionality reduction, (2) preservation of similarity ordering in dot product comparisons, and (3) preservation of non-zero vectors. Through theoretical analysis, we demonstrate that common activation functions, such as ReLU and tanh, cannot fulfill these properties jointly within a generalizable framework. In contrast, sigmoid-like activations emerge as suitable choices for latent activations. This theoretically informed approach offers a more systematic method for hyperparameter selection, enhancing the efficiency of model design.

[LG-50] Fair Exploration and Exploitation

链接: https://arxiv.org/abs/2411.04295
作者: Stephen Pasteris,Chris Hicks,Vasilios Mavroudis
关键词-EN: contextual bandit problem, context set, infinite and clustered, contextual bandit, context
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we consider the contextual bandit problem with a finite (or infinite and clustered) context set. We consider the fully adversarial problem in which, apart from having bounded losses, there are no assumptions whatsoever on the generation of the contexts and losses. In our problem we assume that the context set is partitioned into a set of protected groups. At the start of each trial we are given a probability distribution over the context set and are required (on that trial) to be fair with respect to that distribution, in that if the context (for that trial) was drawn from the distribution then our choice of action would be unbiased towards any protected group. We develop an algorithm FexEx for this problem which has remarkable efficiency, having a space and per-trial time complexity at most linear in the dimensionality of the policy space. FexEx can handle non-stationarity, in that its regret can be bounded with respect to any sequence of policies satisfying the fairness constraints. For such a sequence the regret bound of FexEx is essentially the same as that of running Exp3.S for each context independently (an approach that does not satisfy the fairness constraints).

[LG-51] Enhancing Security Control Production With Generative AI

链接: https://arxiv.org/abs/2411.04284
作者: Chen Ling,Mina Ghashami,Vianne Gao,Ali Torkamani,Ruslan Vaulin,Nivedita Mangam,Bhavya Jain,Farhan Diwan,Malini SS,Mingrui Cheng,Shreya Tarur Kumar,Felix Candelario
关键词-EN: Security controls, protect information, Security, mechanisms or policies, policies designed
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Security controls are mechanisms or policies designed for cloud based services to reduce risk, protect information, and ensure compliance with security regulations. The development of security controls is traditionally a labor-intensive and time-consuming process. This paper explores the use of Generative AI to accelerate the generation of security controls. We specifically focus on generating Gherkin codes which are the domain-specific language used to define the behavior of security controls in a structured and understandable format. By leveraging large language models and in-context learning, we propose a structured framework that reduces the time required for developing security controls from 2-3 days to less than one minute. Our approach integrates detailed task descriptions, step-by-step instructions, and retrieval-augmented generation to enhance the accuracy and efficiency of the generated Gherkin code. Initial evaluations on AWS cloud services demonstrate promising results, indicating that GenAI can effectively streamline the security control development process, thus providing a robust and dynamic safeguard for cloud-based infrastructures.

[LG-52] Labels in Extremes: How Well Calibrated are Extreme Multi-label Classifiers?

链接: https://arxiv.org/abs/2411.04276
作者: Nasib Ullah,Erik Schultheis,Jinbin Zhang,Rohit Babbar
关键词-EN: large-scale document tagging, related product recommendation, Extreme multilabel classification, multilabel classification, problems occur
类目: Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:Extreme multilabel classification (XMLC) problems occur in settings such as related product recommendation, large-scale document tagging, or ad prediction, and are characterized by a label space that can span millions of possible labels. There are two implicit tasks that the classifier performs: \emphEvaluating each potential label for its expected worth, and then \emphselecting the best candidates. For the latter task, only the relative order of scores matters, and this is what is captured by the standard evaluation procedure in the XMLC literature. However, in many practical applications, it is important to have a good estimate of the actual probability of a label being relevant, e.g., to decide whether to pay the fee to be allowed to display the corresponding ad. To judge whether an extreme classifier is indeed suited to this task, one can look, for example, to whether it returns \emphcalibrated probabilities, which has hitherto not been done in this field. Therefore, this paper aims to establish the current status quo of calibration in XMLC by providing a systematic evaluation, comprising nine models from four different model families across seven benchmark datasets. As naive application of Expected Calibration Error (ECE) leads to meaningless results in long-tailed XMC datasets, we instead introduce the notion of \emphcalibration@k (e.g., ECE@k), which focusses on the top- k probability mass, offering a more appropriate measure for evaluating probability calibration in XMLC scenarios. While we find that different models can exhibit widely varying reliability plots, we also show that post-training calibration via a computationally efficient isotonic regression method enhances model calibration without sacrificing prediction accuracy. Thus, the practitioner can choose the model family based on accuracy considerations, and leave calibration to isotonic regression.

[LG-53] Generative Discrete Event Process Simulation for Hidden Markov Models to Predict Competitor Time-to-Market

链接: https://arxiv.org/abs/2411.04266
作者: Nandakishore Santhi,Stephan Eidenbenz,Brian Key,George Tompkins
关键词-EN: Firm, estimate is revised, Hidden Markov Model, information is obtained, challenge of predicting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the challenge of predicting the time at which a competitor product, such as a novel high-capacity EV battery or a new car model, will be available to customers; as new information is obtained, this time-to-market estimate is revised. Our scenario is as follows: We assume that the product is under development at a Firm B, which is a competitor to Firm A; as they are in the same industry, Firm A has a relatively good understanding of the processes and steps required to produce the product. While Firm B tries to keep its activities hidden (think of stealth-mode for start-ups), Firm A is nevertheless able to gain periodic insights by observing what type of resources Firm B is using. We show how Firm A can build a model that predicts when Firm B will be ready to sell its product; the model leverages knowledge of the underlying processes and required resources to build a Parallel Discrete Simulation (PDES)-based process model that it then uses as a generative model to train a Hidden Markov Model (HMM). We study the question of how many resource observations Firm A requires in order to accurately assess the current state of development at Firm B. In order to gain general insights into the capabilities of this approach, we study the effect of different process graph densities, different densities of the resource-activity maps, etc., and also scaling properties as we increase the number resource counts. We find that in most cases, the HMM achieves a prediction accuracy of 70 to 80 percent after 20 (daily) observations of a production process that lasts 150 days on average and we characterize the effects of different problem instance densities on this prediction accuracy. Our results give insight into the level of market knowledge required for accurate and early time-to-market prediction.

[LG-54] LSHBloom: Memory-efficient Extreme-scale Document Deduplication

链接: https://arxiv.org/abs/2411.04257
作者: Arham Khan,Robert Underwood,Carlo Siebenschuh,Yadu Babuji,Aswathy Ajith,Kyle Hippe,Ozan Gokdemir,Alexander Brace,Kyle Chard,Ian Foster
关键词-EN: eliminating additional instances, large language models, curating training datasets, detecting and eliminating, large language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deduplication is a major focus for assembling and curating training datasets for large language models (LLM) – detecting and eliminating additional instances of the same content – in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same deduplication performance as MinhashLSH with only a marginal increase in false positives (as low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster than MinhashLSH on peS2o); and, crucially, uses just 0.6% of the disk space required by MinhashLSH to deduplicate peS2o. We demonstrate that this space advantage scales with increased dataset size – at the extreme scale of several billion documents, LSHBloom promises a 250% speedup and a 54 \times space advantage over traditional MinHashLSH scaling deduplication of text datasets to many billions of documents.

[LG-55] Multimodal Structure-Aware Quantum Data Processing

链接: https://arxiv.org/abs/2411.04242
作者: Hala Hawashin,Mehrnoosh Sadrzadeh
关键词-EN: black box nature, box nature obscures, black box, decision-making processes, advanced the field
类目: Machine Learning (cs.LG)
*备注: 10 Pages, 16 Figures

点击查看摘要

Abstract:While large language models (LLMs) have advanced the field of natural language processing (NLP), their black box'' nature obscures their decision-making processes. To address this, researchers developed structured approaches using higher order tensors. These are able to model linguistic relations, but stall when training on classical computers due to their excessive size. Tensors are natural inhabitants of quantum systems and training on quantum computers provides a solution by translating text to variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware data processing with multimodal text+image data. Here, structure’’ refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a main stream image classification task (SVO Probes), our best model showed a par performance with the state of the art classical models; moreover the best model was fully structured.

[LG-56] Approximate Equivariance in Reinforcement Learning

链接: https://arxiv.org/abs/2411.04225
作者: Jung Yeon Park,Sujay Bhatt,Sihan Zeng,Lawson L.S. Wong,Alec Koppel,Sumitra Ganesh,Robin Walters
关键词-EN: improving sample efficiency, shown great success, Equivariant neural networks, improving sample, shown great
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Equivariant neural networks have shown great success in reinforcement learning, improving sample efficiency and generalization when there is symmetry in the task. However, in many problems, only approximate symmetry is present, which makes imposing exact symmetry inappropriate. Recently, approximately equivariant networks have been proposed for supervised classification and modeling physical systems. In this work, we develop approximately equivariant algorithms in reinforcement learning (RL). We define approximately equivariant MDPs and theoretically characterize the effect of approximate equivariance on the optimal Q function. We propose novel RL architectures using relaxed group convolutions and experiment on several continuous control domains and stock trading with real financial data. Our results demonstrate that approximate equivariance matches prior work when exact symmetries are present, and outperforms them when domains exhibit approximate symmetry. As an added byproduct of these techniques, we observe increased robustness to noise at test time.

[LG-57] Scalable DP-SGD: Shuffling vs. Poisson Subsampling NEURIPS2024

链接: https://arxiv.org/abs/2411.04205
作者: Lynn Chua,Badih Ghazi,Pritish Kamath,Ravi Kumar,Pasin Manurangsi,Amer Sinha,Chiyuan Zhang
关键词-EN: Batch Linear Queries, Adaptive Batch Linear, multi-epoch Adaptive Batch, shuffled batch sampling, Linear Queries
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
*备注: To appear at NeurIPS 2024

点击查看摘要

Abstract:We provide new lower bounds on the privacy guarantee of the multi-epoch Adaptive Batch Linear Queries (ABLQ) mechanism with shuffled batch sampling, demonstrating substantial gaps when compared to Poisson subsampling; prior analysis was limited to a single epoch. Since the privacy analysis of Differentially Private Stochastic Gradient Descent (DP-SGD) is obtained by analyzing the ABLQ mechanism, this brings into serious question the common practice of implementing shuffling-based DP-SGD, but reporting privacy parameters as if Poisson subsampling was used. To understand the impact of this gap on the utility of trained machine learning models, we introduce a practical approach to implement Poisson subsampling at scale using massively parallel computation, and efficiently train models with the same. We compare the utility of models trained with Poisson-subsampling-based DP-SGD, and the optimistic estimates of utility when using shuffling, via our new lower bounds on the privacy guarantee of ABLQ with shuffling.

[LG-58] Online Budgeted Matching with General Bids NEURIPS2024

链接: https://arxiv.org/abs/2411.04204
作者: Jianyi Yang,Pengfei Li,Adam Wierman,Shaolei Ren
关键词-EN: Online Budgeted Matching, Budgeted Matching, online service matching, Online Budgeted, revenue management
类目: Computer Science and Game Theory (cs.GT); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Online Budgeted Matching (OBM) is a classic problem with important applications in online advertising, online service matching, revenue management, and beyond. Traditional online algorithms typically assume a small bid setting, where the maximum bid-to-budget ratio (\kappa) is infinitesimally small. While recent algorithms have tried to address scenarios with non-small or general bids, they often rely on the Fractional Last Matching (FLM) assumption, which allows for accepting partial bids when the remaining budget is insufficient. This assumption, however, does not hold for many applications with indivisible bids. In this paper, we remove the FLM assumption and tackle the open problem of OBM with general bids. We first establish an upper bound of 1-\kappa on the competitive ratio for any deterministic online algorithm. We then propose a novel meta algorithm, called MetaAd, which reduces to different algorithms with first known provable competitive ratios parameterized by the maximum bid-to-budget ratio \kappa \in [0, 1]. As a by-product, we extend MetaAd to the FLM setting and get provable competitive algorithms. Finally, we apply our competitive analysis to the design learning-augmented algorithms.

[LG-59] Joint torques prediction of a robotic arm using neural networks

链接: https://arxiv.org/abs/2405.00695
作者: Giulia d’Addato,Ruggero Carli,Eurico Pedrosa,Artur Pereira,Luigi Palopoli,Daniele Fontanelli
关键词-EN: Accurate dynamic models, Accurate dynamic, robotic applications, dynamic models, Lagrangian or Newtonian
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 5 figures, submitted to CASE 2024

点击查看摘要

Abstract:Accurate dynamic models are crucial for many robotic applications. Traditional approaches to deriving these models are based on the application of Lagrangian or Newtonian mechanics. Although these methods provide a good insight into the physical behaviour of the system, they rely on the exact knowledge of parameters such as inertia, friction and joint flexibility. In addition, the system is often affected by uncertain and nonlinear effects, such as saturation and dead zones, which can be difficult to model. A popular alternative is the application of Machine Learning (ML) techniques - e.g., Neural Networks (NNs) - in the context of a “black-box” methodology. This paper reports on our experience with this approach for a real-life 6 degrees of freedom (DoF) manipulator. Specifically, we considered several NN architectures: single NN, multiple NNs, and cascade NN. We compared the performance of the system by using different policies for selecting the NN hyperparameters. Our experiments reveal that the best accuracy and performance are obtained by a cascade NN, in which we encode our prior physical knowledge about the dependencies between joints, complemented by an appropriate optimisation of the hyperparameters.

[LG-60] Pareto Set Identification With Posterior Sampling

链接: https://arxiv.org/abs/2411.04939
作者: Cyrille Kone,Marc Jourdan,Emilie Kaufmann
关键词-EN: Toggle, Toggle Hugging Face, Code, Code Toggle Papers, Papers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The problem of identifying the best answer among a collection of items having real-valued distribution is well-understood. Despite its practical relevance for many applications, fewer works have studied its extension when multiple and potentially conflicting metrics are available to assess an item’s quality. Pareto set identification (PSI) aims to identify the set of answers whose means are not uniformly worse than another. This paper studies PSI in the transductive linear setting with potentially correlated objectives. Building on posterior sampling in both the stopping and the sampling rules, we propose the PSIPS algorithm that deals simultaneously with structure and correlation without paying the computational cost of existing oracle-based algorithms. Both from a frequentist and Bayesian perspective, PSIPS is asymptotically optimal. We demonstrate its good empirical performance in real-world and synthetic instances. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2411.04939 [stat.ML] (or arXiv:2411.04939v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2411.04939 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Cyrille Kone [view email] [v1] Thu, 7 Nov 2024 18:15:38 UTC (394 KB) Full-text links: Access Paper: View a PDF of the paper titled Pareto Set Identification With Posterior Sampling, by Cyrille Kone and 2 other authorsView PDFTeX SourceOther Formats view license Current browse context: stat.ML prev | next new | recent | 2024-11 Change to browse by: cs cs.LG stat References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-61] Conformalized Credal Regions for Classification with Ambiguous Ground Truth

链接: https://arxiv.org/abs/2411.04852
作者: Michele Caprio,David Stutz,Shuo Li,Arnaud Doucet
关键词-EN: Imprecise Probabilistic Machine, Probabilistic Machine Learning, Imprecise Probabilistic, Machine Learning, Probabilistic Machine
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An open question in \emphImprecise Probabilistic Machine Learning is how to empirically derive a credal region (i.e., a closed and convex family of probabilities on the output space) from the available data, without any prior knowledge or assumption. In classification problems, credal regions are a tool that is able to provide provable guarantees under realistic assumptions by characterizing the uncertainty about the distribution of the labels. Building on previous work, we show that credal regions can be directly constructed using conformal methods. This allows us to provide a novel extension of classical conformal prediction to problems with ambiguous ground truth, that is, when the exact labels for given inputs are not exactly known. The resulting construction enjoys desirable practical and theoretical properties: (i) conformal coverage guarantees, (ii) smaller prediction sets (compared to classical conformal prediction regions) and (iii) disentanglement of uncertainty sources (epistemic, aleatoric). We empirically verify our findings on both synthetic and real datasets.

[LG-62] Asymptotic regularity of a generalised stochastic Halpern scheme with applications

链接: https://arxiv.org/abs/2411.04845
作者: Nicholas Pischke,Thomas Powell
关键词-EN: generalized stochastic Halpern-style, stochastic Halpern-style iteration, highly uniform rates, stochastic Halpern-style, asymptotic regularity
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注: 29 pages

点击查看摘要

Abstract:We provide abstract, general and highly uniform rates of asymptotic regularity for a generalized stochastic Halpern-style iteration, which incorporates a second mapping in the style of a Krasnoselskii-Mann iteration. This iteration is general in two ways: First, it incorporates stochasticity in a completely abstract way rather than fixing a sampling method; secondly, it includes as special cases stochastic versions of various schemes from the optimization literature, including Halpern’s iteration as well as a Krasnoselskii-Mann iteration with Tikhonov regularization terms in the sense of Boţ, Csetnek and Meier. For these particular cases, we in particular obtain linear rates of asymptotic regularity, matching (or improving) the currently best known rates for these iterations in stochastic optimization, and quadratic rates of asymptotic regularity are obtained in the context of inner product spaces for the general iteration. We utilize these rates to give bounds on the oracle complexity of such iterations under suitable variance assumptions and batching strategies, again presented in an abstract style. Finally, we sketch how the schemes presented here can be instantiated in the context of reinforcement learning to yield novel methods for Q-learning.

[LG-63] Learning dynamical systems from data: Gradient-based dictionary optimization

链接: https://arxiv.org/abs/2411.04775
作者: Mohammad Tabish,Neil K. Chada,Stefan Klus
关键词-EN: Koopman operator plays, Koopman operator, plays a crucial, crucial role, role in analyzing
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Koopman operator plays a crucial role in analyzing the global behavior of dynamical systems. Existing data-driven methods for approximating the Koopman operator or discovering the governing equations of the underlying system typically require a fixed set of basis functions, also called dictionary. The optimal choice of basis functions is highly problem-dependent and often requires domain knowledge. We present a novel gradient descent-based optimization framework for learning suitable and interpretable basis functions from data and show how it can be used in combination with EDMD, SINDy, and PDE-FIND. We illustrate the efficacy of the proposed approach with the aid of various benchmark problems such as the Ornstein-Uhlenbeck process, Chua’s circuit, a nonlinear heat equation, as well as protein-folding data.

[LG-64] Measure-to-measure interpolation using Transformers

链接: https://arxiv.org/abs/2411.04551
作者: Borjan Geshkovski,Philippe Rigollet,Domènec Ruiz-Balet
关键词-EN: large language models, deep neural network, neural network architectures, language models, deep neural
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map implemented as specific interacting particle system on the unit sphere: the input is the empirical measure of tokens in a prompt and its evolution is governed by the continuity equation. In fact, Transformers are not limited to empirical measures and can in principle process any input measure. As the nature of data processed by Transformers is expanding rapidly, it is important to investigate their expressive power as maps from an arbitrary measure to another arbitrary measure. To that end, we provide an explicit choice of parameters that allows a single Transformer to match N arbitrary input measures to N arbitrary target measures, under the minimal assumption that every pair of input-target measures can be matched by some transport map.

[LG-65] Improve the Fitting Accuracy of Deep Learning for the Nonlinear Schr"odinger Equation Using Linear Feature Decoupling Method

链接: https://arxiv.org/abs/2411.04511
作者: Yunfan Zhang,Zekun Niu,Minghui Shi,Weisheng Hu,Lilin Yi
关键词-EN: Nonlinear Schrodinger Equation, Feature Decoupling Distributed, NLSE loss compared, Schrodinger Equation, Nonlinear Schrodinger
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We utilize the Feature Decoupling Distributed (FDD) method to enhance the capability of deep learning to fit the Nonlinear Schrodinger Equation (NLSE), significantly reducing the NLSE loss compared to non decoupling model.

[LG-66] Statistical-Computational Trade-offs for Greedy Recursive Partitioning Estimators

链接: https://arxiv.org/abs/2411.04394
作者: Yan Shuo Tan,Jason M. Klusowski,Krishnakumar Balasubramanian
关键词-EN: curse of dimensionality, Merged Staircase Property, decision trees, ensembles are popular, popular for high-dimensional
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Models based on recursive partitioning such as decision trees and their ensembles are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over d binary features, showing that when the true regression function f^* does not satisfy the so-called Merged Staircase Property (MSP), greedy training requires \exp(\Omega(d)) to achieve low estimation error. Conversely, when f^* does satisfy MSP, greedy training can attain small estimation error with only O(\log d) samples. This performance mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with O(\log d) samples irrespective of whether f^* satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.

[LG-67] Approximate Frank-Wolfe Algorithm over Graph-structured Support Set

链接: https://arxiv.org/abs/2411.04389
作者: Yijian Pan
关键词-EN: graph-structured convex optimization, deals graph-structured convex, convex optimization, approximate Frank-Wolfe, reviewed a paper
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this project, we reviewed a paper that deals graph-structured convex optimization (GSCO) problem with the approximate Frank-Wolfe (FW) algorithm. We analyzed and implemented the original algorithm and introduced some extensions based on that. Then we conducted experiments to compare the results and concluded that our backtracking line-search method effectively reduced the number of iterations, while our new DMO method (Top-g+ optimal visiting) did not make satisfying enough improvements.

[LG-68] ION-C: Integration of Overlapping Networks via Constraints

链接: https://arxiv.org/abs/2411.04243
作者: Praveen Nair,Payal Bhandari,Mohammadsajad Abavisani,Sergey Plis,David Danks
关键词-EN: causal learning problems, causal learning, distributed across multiple, multiple datasets, variables of interest
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:In many causal learning problems, variables of interest are often not all measured over the same observations, but are instead distributed across multiple datasets with overlapping variables. Tillman et al. (2008) presented the first algorithm for enumerating the minimal equivalence class of ground-truth DAGs consistent with all input graphs by exploiting local independence relations, called ION. In this paper, this problem is formulated as a more computationally efficient answer set programming (ASP) problem, which we call ION-C, and solved with the ASP system clingo. The ION-C algorithm was run on random synthetic graphs with varying sizes, densities, and degrees of overlap between subgraphs, with overlap having the largest impact on runtime, number of solution graphs, and agreement within the output set. To validate ION-C on real-world data, we ran the algorithm on overlapping graphs learned from data from two successive iterations of the European Social Survey (ESS), using a procedure for conducting joint independence tests to prevent inconsistencies in the input.

[LG-69] dsld: A Socially Relevant Tool for Teaching Statistics

链接: https://arxiv.org/abs/2411.04228
作者: Taha Abdullah,Arjun Ashok,Brandon Estrada,Norman Matloff,Aditya Mittal
关键词-EN: necessitating nuanced understanding, effective mitigation strategies, addressing social discrimination, data science, Data Science Education
类目: Methodology (stat.ME); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP)
*备注: To be submitted to the Journal of Statistics and Data Science Education

点击查看摘要

Abstract:The growing power of data science can play a crucial role in addressing social discrimination, necessitating nuanced understanding and effective mitigation strategies of potential biases. Data Science Looks At Discrimination (dsld) is an R and Python package designed to provide users with a comprehensive toolkit of statistical and graphical methods for assessing possible discrimination related to protected groups, such as race, gender, and age. Our software offers techniques for discrimination analysis by identifying and mitigating confounding variables, along with methods for reducing bias in predictive models. In educational settings, dsld offers instructors powerful tools to teach important statistical principles through motivating real world examples of discrimination analysis. The inclusion of an 80-page Quarto book further supports users, from statistics educators to legal professionals, in effectively applying these analytical tools to real world scenarios. Comments: To be submitted to the Journal of Statistics and Data Science Education Subjects: Methodology (stat.ME); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2411.04228 [stat.ME] (or arXiv:2411.04228v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2411.04228 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-70] Debiasing Synthetic Data Generated by Deep Generative Models NEURIPS2024

链接: https://arxiv.org/abs/2411.04216
作者: Alexander Decruyenaere,Heidelinde Dehaene,Paloma Rabaey,Christiaan Polet,Johan Decruyenaere,Thomas Demeester,Stijn Vansteelandt
关键词-EN: necessitate innovative solutions, hold great promise, data hold great, poses significant challenges, analysis poses significant
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted for the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), joint first authors

点击查看摘要

Abstract:While synthetic data hold great promise for privacy protection, their statistical analysis poses significant challenges that necessitate innovative solutions. The use of deep generative models (DGMs) for synthetic data generation is known to induce considerable bias and imprecision into synthetic data analyses, compromising their inferential utility as opposed to original data analyses. This bias and uncertainty can be substantial enough to impede statistical convergence rates, even in seemingly straightforward analyses like mean calculation. The standard errors of such estimators then exhibit slower shrinkage with sample size than the typical 1 over root- n rate. This complicates fundamental calculations like p-values and confidence intervals, with no straightforward remedy currently available. In response to these challenges, we propose a new strategy that targets synthetic data created by DGMs for specific data analyses. Drawing insights from debiased and targeted machine learning, our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances. We exemplify our proposal through a simulation study on toy data and two case studies on real-world data, highlighting the importance of tailoring DGMs for targeted data analysis. This debiasing strategy contributes to advancing the reliability and applicability of synthetic data in statistical inference.

[LG-71] Machine Learning Mutation-Acyclicity of Quivers

链接: https://arxiv.org/abs/2411.04209
作者: Kymani T. K. Armstrong-Williams,Edward Hirst,Blake Jackson,Kyu-Hwan Lee
关键词-EN: recent years, powerful tool, research in recent, mathematical research, Machine learning
类目: Combinatorics (math.CO); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Representation Theory (math.RT)
*备注: 30 pages, 14 figures, 8 tables

点击查看摘要

Abstract:Machine learning (ML) has emerged as a powerful tool in mathematical research in recent years. This paper applies ML techniques to the study of quivers–a type of directed multigraph with significant relevance in algebra, combinatorics, computer science, and mathematical physics. Specifically, we focus on the challenging problem of determining the mutation-acyclicity of a quiver on 4 vertices, a property that is pivotal since mutation-acyclicity is often a necessary condition for theorems involving path algebras and cluster algebras. Although this classification is known for quivers with at most 3 vertices, little is known about quivers on more than 3 vertices. We give a computer-assisted proof of a theorem to prove that mutation-acyclicity is decidable for quivers on 4 vertices with edge weight at most 2. By leveraging neural networks (NNs) and support vector machines (SVMs), we then accurately classify more general 4-vertex quivers as mutation-acyclic or non-mutation-acyclic. Our results demonstrate that ML models can efficiently detect mutation-acyclicity, providing a promising computational approach to this combinatorial problem, from which the trained SVM equation provides a starting point to guide future theoretical development.

[LG-72] BAPULM: Binding Affinity Prediction using Language Models

链接: https://arxiv.org/abs/2411.04150
作者: Radheesh Sharma Meda,Amir Barati Farimani
关键词-EN: Identifying drug-target interactions, developing effective therapeutics, Identifying drug-target, effective therapeutics, essential for developing
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying drug-target interactions is essential for developing effective therapeutics. Binding affinity quantifies these interactions, and traditional approaches rely on computationally intensive 3D structural data. In contrast, language models can efficiently process sequential data, offering an alternative approach to molecular representation. In the current study, we introduce BAPULM, an innovative sequence-based framework that leverages the chemical latent representations of proteins via ProtT5-XL-U50 and ligands through MolFormer, eliminating reliance on complex 3D configurations. Our approach was validated extensively on benchmark datasets, achieving scoring power ® values of 0.925 \pm 0.043, 0.914 \pm 0.004, and 0.8132 \pm 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively. These findings indicate the robustness and accuracy of BAPULM across diverse datasets and underscore the potential of sequence-based models in-silico drug discovery, offering a scalable alternative to 3D-centric methods for screening potential ligands.

[LG-73] ShEPhERD: Diffusing shape electrostatics and pharmacophores for bioisosteric drug design

链接: https://arxiv.org/abs/2411.04130
作者: Keir Adams,Kento Abeywardane,Jenna Fromer,Connor W. Coley
关键词-EN: Engineering molecules, exhibit precise, environment forms, forms the basis, Engineering
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Engineering molecules to exhibit precise 3D intermolecular interactions with their environment forms the basis of chemical design. In ligand-based drug design, bioisosteric analogues of known bioactive hits are often identified by virtually screening chemical libraries with shape, electrostatic, and pharmacophore similarity scoring functions. We instead hypothesize that a generative model which learns the joint distribution over 3D molecular structures and their interaction profiles may facilitate 3D interaction-aware chemical design. We specifically design ShEPhERD, an SE(3)-equivariant diffusion model which jointly diffuses/denoises 3D molecular graphs and representations of their shapes, electrostatic potential surfaces, and (directional) pharmacophores to/from Gaussian noise. Inspired by traditional ligand discovery, we compose 3D similarity scoring functions to assess ShEPhERD’s ability to conditionally generate novel molecules with desired interaction profiles. We demonstrate ShEPhERD’s potential for impact via exemplary drug design tasks including natural product ligand hopping, protein-blind bioactive hit diversification, and bioisosteric fragment merging.

[LG-74] On the analysis of saturated pressure to detect fatigue

链接: https://arxiv.org/abs/2411.04128
作者: Marcos Faundez-Zanuy,Josep Lopez-Xarbau,Moises Diaz,Manuel Garnacho-Castaño
关键词-EN: capital words text, cursive text, including drawings, words text, capital words
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 12 pages. arXiv admin note: substantial text overlap with arXiv:2203.14782

点击查看摘要

Abstract:This paper examines the saturation of pressure signals during various handwriting tasks, including drawings, cursive text, capital words text, and signature, under different levels of fatigue. Experimental results demonstrate a significant rise in the proportion of saturated samples following strenuous exercise in tasks performed without resting wrist. The analysis of saturation highlights significant differences when comparing the results to the baseline situation and strenuous fatigue.

信息检索

[IR-0] Orbit: A Framework for Designing and Evaluating Multi-objective Rankers

链接: https://arxiv.org/abs/2411.04798
作者: Chenyang Yang,Tesi Xiao,Michael Shavlovsky,Christian Kästner,Tongshuang Wu
关键词-EN: Machine learning, learning in production, evident in ranking, ranking or recommendation, Machine
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Machine learning in production needs to balance multiple objectives: This is particularly evident in ranking or recommendation models, where conflicting objectives such as user engagement, satisfaction, diversity, and novelty must be considered at the same time. However, designing multi-objective rankers is inherently a dynamic wicked problem – there is no single optimal solution, and the needs evolve over time. Effective design requires collaboration between cross-functional teams and careful analysis of a wide range of information. In this work, we introduce Orbit, a conceptual framework for Objective-centric Ranker Building and Iteration. The framework places objectives at the center of the design process, to serve as boundary objects for communication and guide practitioners for design and evaluation. We implement Orbit as an interactive system, which enables stakeholders to interact with objective spaces directly and supports real-time exploration and evaluation of design trade-offs. We evaluate Orbit through a user study involving twelve industry practitioners, showing that it supports efficient design space exploration, leads to more informed decision-making, and enhances awareness of the inherent trade-offs of multiple objectives. Orbit (1) opens up new opportunities of an objective-centric design process for any multi-objective ML models, as well as (2) sheds light on future designs that push practitioners to go beyond a narrow metric-centric or example-centric mindset.

[IR-1] Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval WSDM’25

链接: https://arxiv.org/abs/2411.04677
作者: Ferdinand Schlatt,Maik Fröbe,Matthias Hagen
关键词-EN: transformer-based language models, information retrieval tasks, wide range, transformer-based language, information retrieval
类目: Information Retrieval (cs.IR)
*备注: Accepted as a demo at WSDM’25

点击查看摘要

Abstract:A wide range of transformer-based language models have been proposed for information retrieval tasks. However, fine-tuning and inference of these models is often complex and requires substantial engineering effort. This paper introduces Lightning IR, a PyTorch Lightning-based framework for fine-tuning and inference of transformer-based language models for information retrieval. Lightning IR provides a modular and extensible architecture that supports all stages of an information retrieval pipeline: from fine-tuning and indexing to searching and re-ranking. It is designed to be straightforward to use, scalable, and reproducible. Lightning IR is available as open-source: this https URL.

[IR-2] he Concatenator: A Bayesian Approach To Real Time Concatenative Musaicing

链接: https://arxiv.org/abs/2411.04366
作者: Christopher Tralie,Ben Cantil
关键词-EN: target audio stream, Driedger NMF-based technique, audio-guided concatenative synthesis, target audio, audio mosaicing technique
类目: ound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 6 figures, Accepted for Publication in The International Society for Music Information Retrieval Proceedings, 2024

点击查看摘要

Abstract:We present The Concatenator,'' a real time system for audio-guided concatenative synthesis. Similarly to Driedger et al.'s musaicing’’ (or ``audio mosaicing’') technique, we concatenate a set number of windows within a corpus of audio to re-create the harmonic and percussive aspects of a target audio stream. Unlike Driedger’s NMF-based technique, however, we instead use an explicitly Bayesian point of view, where corpus window indices are hidden states and the target audio stream is an observation. We use a particle filter to infer the best hidden corpus states in real-time. Our transition model includes a tunable parameter to control the time-continuity of corpus grains, and our observation model allows users to prioritize how quickly windows change to match the target. Because the computational complexity of the system is independent of the corpus size, our system scales to corpora that are hours long, which is an important feature in the age of vast audio data collections. Within The Concatenator module itself, composers can vary grain length, fit to target, and pitch shift in real time while reacting to the sounds they hear, enabling them to rapidly iterate ideas. To conclude our work, we evaluate our system with extensive quantitative tests of the effects of parameters, as well as a qualitative evaluation with artistic insights. Based on the quality of the results, we believe the real-time capability unlocks new avenues for musical expression and control, suitable for live performance and modular synthesis integration, which furthermore represents an essential breakthrough in concatenative synthesis technology.

附件下载

点击下载今日全部论文列表