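This post lists the latest papers retrieved from arXiv.org on 2024-10-15. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.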

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 11:00 every morning.

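Overview (2024-10-15)

A total of 1016 papers were updated today, including:

  • Natural Language Processing: 211 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 296 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 244 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 358 papers (Machine Learning (cs.LG))

Natural Language Processing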

[NLP-0] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

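[Quick Read]: This paper addresses the compute and memory cost of deploying long-context large language models (LLMs), in particular the memory consumed by caching Key and Value (KV) states under full attention. The key idea is to distinguish two kinds of attention heads: Retrieval Heads, which are critical for processing long contexts, and Streaming Heads, which mainly attend to recent tokens and attention sinks. The proposed DuoAttention framework keeps a full KV cache only for Retrieval Heads and uses a lightweight, constant-length KV cache for Streaming Heads, which reduces memory use and latency while preserving long-context ability. Retrieval Heads are identified with a lightweight, optimization-based algorithm run on synthetic data. The abstract reports memory reductions of up to 2.55x for MHA and 1.67x for GQA models with minimal accuracy loss.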

Link: https://arxiv.org/abs/2410.10819
Authors: Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
Keywords: Deploying long-context large, poses significant computational, long-context large language, large language models, Deploying long-context
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks–referred to as Streaming Heads–do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM’s decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided in this https URL.
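
The two cache policies described in the abstract can be sketched as follows. This is a minimal illustration and not the DuoAttention implementation: the class names, the `sink` and `recent` sizes, the choice of which head is a retrieval head, and the use of plain Python values in place of key/value tensors are all assumptions made for the example.

```python
from collections import deque


class FullKVCache:
    """Retrieval-head policy: keep the KV state of every token."""

    def __init__(self):
        self.entries = []

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)


class StreamingKVCache:
    """Streaming-head policy: keep only the first `sink` tokens
    (attention sinks) plus the `recent` most recent tokens."""

    def __init__(self, sink=4, recent=8):
        self.sink = sink
        self.sink_entries = []
        # deque(maxlen=...) evicts the oldest entry automatically
        self.recent_entries = deque(maxlen=recent)

    def append(self, kv):
        if len(self.sink_entries) < self.sink:
            self.sink_entries.append(kv)
        else:
            self.recent_entries.append(kv)

    def __len__(self):
        return len(self.sink_entries) + len(self.recent_entries)


# Hypothetical layer with 4 heads, one of which is treated as a retrieval head.
is_retrieval_head = [True, False, False, False]
caches = [FullKVCache() if r else StreamingKVCache(sink=4, recent=8)
          for r in is_retrieval_head]

for token_id in range(1000):       # decode 1000 tokens
    for cache in caches:
        cache.append(token_id)     # a real cache would store (key, value) tensors

sizes = [len(c) for c in caches]
print(sizes)  # [1000, 12, 12, 12]
```

The streaming caches stay at 12 entries no matter how long decoding runs, which is where the memory saving in this toy setup comes from. How heads are classified is the part the paper solves with its optimization-based algorithm, and that is not shown here.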

[NLP-1] TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

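[Quick Read]: This paper addresses the inability of existing video benchmarks to evaluate fine-grained temporal understanding. It introduces TemporalBench, a new benchmark of about 10K video question-answer pairs derived from about 2K high-quality human annotations that describe the temporal dynamics in video clips. TemporalBench evaluates temporal abilities such as action frequency, motion magnitude, and event order. It supports several tasks (video question answering and captioning), both short and long video understanding, and different model types (multimodal video embedding models and text generation models). State-of-the-art models such as GPT-4o reach only 38.5% question-answering accuracy, a gap of about 30% between humans and AI. The paper also proposes Multiple Binary Accuracy (MBA) to correct a bias in multi-choice QA.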

Link: https://arxiv.org/abs/2410.10818
Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
Keywords: temporal understanding, temporal, fine-grained temporal, Understanding, video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models’ temporal reasoning capabilities. Both dataset and evaluation code will be made available.

[NLP-2] Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

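[Quick Read]: Large language models (LLMs) excel at generation, but their decoder-only architecture limits them as embedding models unless further representation finetuning is applied. Studying Mixture-of-Experts (MoE) LLMs, the paper finds that the expert routers can serve as an off-the-shelf embedding model that performs well on a range of embedding tasks without any finetuning. The key contribution is MoEE, which combines MoE routing weights (RW) with hidden states (HS). Experiments on 6 embedding tasks and 20 MTEB datasets show that the combination outperforms either RW or HS alone.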

Link: https://arxiv.org/abs/2410.10814
Authors: Ziyue Li, Tianyi Zhou
Keywords: large language models, excel on generation, large language, decoder-only architecture, architecture often limits
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:While large language models (LLMs) excel on generation tasks, their decoder-only architecture often limits their potential as embedding models if no further representation finetuning is applied. Does this contradict their claim of generalists? To answer the question, we take a closer look at Mixture-of-Experts (MoE) LLMs. Our study shows that the expert routers in MoE LLMs can serve as an off-the-shelf embedding model with promising performance on a diverse class of embedding-focused tasks, without requiring any finetuning. Moreover, our extensive analysis shows that the MoE routing weights (RW) is complementary to the hidden state (HS) of LLMs, a widely-used embedding. Compared to HS, we find that RW is more robust to the choice of prompts and focuses on high-level semantics. Motivated by the analysis, we propose MoEE combining RW and HS, which achieves better performance than using either separately. Our exploration of their combination and prompting strategy shed several novel insights, e.g., a weighted sum of RW and HS similarities outperforms the similarity on their concatenation. Our experiments are conducted on 6 embedding tasks with 20 datasets from the Massive Text Embedding Benchmark (MTEB). The results demonstrate the significant improvement brought by MoEE to LLM-based embedding without further finetuning.
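
The abstract contrasts two ways of combining RW and HS: a weighted sum of the two similarities, and a single similarity computed on the concatenated vectors. The sketch below only shows how the two scores differ in form. The function names, the weight `alpha`, the exact form of the weighted sum, and the toy vectors are assumptions, not values or formulas taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=0.5):
    """Similarity of hidden states plus alpha times similarity of routing weights."""
    return cosine(hs_a, hs_b) + alpha * cosine(rw_a, rw_b)


def concat_similarity(hs_a, rw_a, hs_b, rw_b):
    """One similarity computed on the concatenation [HS; RW]."""
    return cosine(hs_a + rw_a, hs_b + rw_b)


# Toy inputs: identical hidden states, orthogonal routing weights.
hs_a, hs_b = [1.0, 0.0], [1.0, 0.0]
rw_a, rw_b = [1.0, 0.0], [0.0, 1.0]

score_sum = weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=0.5)
score_cat = concat_similarity(hs_a, rw_a, hs_b, rw_b)
print(score_sum, score_cat)
```

In the concatenated form the relative influence of RW and HS is fixed by their norms and dimensions, whereas the weighted sum exposes it as a tunable parameter. That may be one reason the two behave differently, though the abstract does not say so.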

[NLP-3] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

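[Quick Read]: This paper addresses the underexplored long-term memory capabilities of LLM-driven chat assistants in sustained interactions. It introduces the LongMemEval benchmark, which evaluates five core long-term memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. It also presents a unified framework that breaks long-term memory design into design choices across three stages (indexing, retrieval, and reading). Experiments show that session decomposition, fact-augmented key expansion, and time-aware query expansion improve both memory recall and downstream question answering.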

Link: https://arxiv.org/abs/2410.10813
Authors: Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu
Keywords: Recent large language, large language model, Recent large, integrated memory components, driven chat assistant
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experiment results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
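
The indexing and retrieval stages, together with the three proposed designs, can be pictured with a toy memory store. Everything below is an illustrative assumption rather than the paper's system: the `MemoryItem` fields, the word-overlap ranking, the hand-written facts, and the date filter standing in for time-aware query expansion. The reading stage, where an LLM answers from the retrieved items, is omitted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MemoryItem:
    value: str   # one turn of a session (session decomposition)
    key: str     # text used for matching: the turn plus extracted facts
    when: date   # timestamp of the session


def index_session(turns, facts, when):
    """Indexing: split a session into turns and append a fact to each key."""
    return [MemoryItem(value=t, key=f"{t} {f}", when=when)
            for t, f in zip(turns, facts)]


def retrieve(memory, query, start=None, end=None, top_k=1):
    """Retrieval: optional time filter, then rank by word overlap with the key."""
    words = set(query.lower().split())
    in_range = [m for m in memory
                if (start is None or m.when >= start)
                and (end is None or m.when <= end)]
    ranked = sorted(in_range,
                    key=lambda m: len(words & set(m.key.lower().split())),
                    reverse=True)
    return ranked[:top_k]


memory = []
memory += index_session(["I just moved and love the food here"],
                        ["user moved to Lisbon"], date(2024, 3, 1))
memory += index_session(["We relocated again last week"],
                        ["user moved to Berlin"], date(2024, 9, 10))

query = "where has the user moved to"
stale = retrieve(memory, query)                             # no time constraint
fresh = retrieve(memory, query, start=date(2024, 6, 1))     # restricted to recent sessions
print(stale[0].key)
print(fresh[0].key)
```

Without the time constraint the older session wins on word overlap. Narrowing the search window returns the more recent fact, which is the kind of knowledge-update case the benchmark is designed to probe.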

[NLP-4] Local and Global Decoding in Text Generation EMNLP2024

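[Quick Read]: Traditional decoding algorithms such as top-k and top-π apply local normalisation to the model's output distribution, which can distort it. The paper studies this distortion by introducing globally-normalised versions of these decoding methods. It also proposes an independent Metropolis-Hastings algorithm that approximately samples from the globally-normalised distributions without computing them explicitly. In most configurations global decoding performs worse than its local counterpart even though it preserves the distribution's integrity, which suggests that the distortion is an important feature of local decoding.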

Link: https://arxiv.org/abs/2410.10810
Authors: Daniel Gareev, Thomas Hofmann, Ezhilmathi Krishnasamy, Tiago Pimentel
Keywords: Text generation, dialogue systems, key component, component in applications, sample strings
Categories: Computation and Language (cs.CL)
Comments: Paper accepted in EMNLP 2024. Code is available in this https URL

Abstract: Text generation, a key component in applications such as dialogue systems, relies on decoding algorithms that sample strings from a language model distribution. Traditional methods, such as top-k and top-π, apply local normalisation to the model’s output distribution, which can distort it. In this paper, we investigate the effect of this distortion by introducing globally-normalised versions of these decoding methods. Additionally, we propose an independent Metropolis-Hastings algorithm to approximate sampling from globally-normalised distributions without explicitly computing them. Our empirical analysis compares the performance of local and global normalisation across two decoding algorithms (top-k and top-π) with various hyperparameters, using Pythia language models. Results show that, in most configurations, global decoding performs worse than the local decoding version of the same algorithms – despite preserving the distribution’s integrity. Our results suggest that distortion is an important feature of local decoding algorithms.
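
The difference between local and global normalisation is easiest to see on a model small enough to enumerate. The sketch below uses a made-up two-token "language model" with top-k (k = 2). It computes both distributions exactly, then runs an independent Metropolis-Hastings sampler that proposes from the local distribution and targets the global one. The probabilities, names, and step count are assumptions for illustration, not the paper's setup.

```python
import random

# Toy "language model" over two-token strings; all probabilities are made up.
FIRST = {"a": 0.5, "b": 0.4, "c": 0.1}
SECOND = {
    "a": {"a": 0.40, "b": 0.35, "c": 0.25},
    "b": {"a": 0.90, "b": 0.06, "c": 0.04},
    "c": {"a": 0.50, "b": 0.30, "c": 0.20},
}


def top_k(dist, k=2):
    """Return the k most probable tokens and their total probability mass."""
    kept = dict(sorted(dist.items(), key=lambda kv: -kv[1])[:k])
    return kept, sum(kept.values())


model, local = {}, {}
kept1, mass1 = top_k(FIRST)
for t1, p1 in kept1.items():
    kept2, mass2 = top_k(SECOND[t1])
    for t2, p2 in kept2.items():
        model[t1 + t2] = p1 * p2                       # unnormalised model probability
        local[t1 + t2] = (p1 / mass1) * (p2 / mass2)   # renormalised at every step

z = sum(model.values())
global_ = {s: p / z for s, p in model.items()}         # renormalised once, over whole strings


def independent_mh(n_steps, rng):
    """Propose from the local distribution, target the global one.
    Only the ratio model/local is needed, so z is never used here."""
    strings = list(local)
    q = [local[s] for s in strings]
    weight = {s: model[s] / local[s] for s in strings}
    current = rng.choices(strings, weights=q)[0]
    samples = []
    for _ in range(n_steps):
        proposal = rng.choices(strings, weights=q)[0]
        if rng.random() < min(1.0, weight[proposal] / weight[current]):
            current = proposal
        samples.append(current)
    return samples


samples = independent_mh(50_000, random.Random(0))
freq_aa = samples.count("aa") / len(samples)
print(f"local={local['aa']:.3f} global={global_['aa']:.3f} mh={freq_aa:.3f}")
```

In this toy model the importance weight of a string is the product of the top-k masses along its path, so the sampler never needs the global normaliser. With a real language model the strings cannot be enumerated, and proposals would be drawn by ordinary local decoding instead.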

[NLP-5] Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

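[Quick Read]: This paper addresses the safety and performance of large language models (LLMs) in multilingual settings. It explores model merging in a multi-task, multilingual setting that combines safety and general-purpose tasks. Objective-based merging is more effective than mixing data, with improvements of up to 8% in general performance and 10% in safety. Language-based merging, which merges monolingually fine-tuned models, adds a further 4% in general performance and a 7% reduction in harm on top of the data-mixture method.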

Link: https://arxiv.org/abs/2410.10801
Authors: Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Keywords: Large Language Models, Large Language, variety of applications, adopted and deployed, deployed worldwide
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have been adopted and deployed worldwide for a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms prevalent in Western-centric datasets, and safety protocols frequently fail to extend to multilingual settings. In this work, we explore model merging in a diverse multi-task setting, combining safety and general-purpose tasks within a multilingual context. Each language introduces unique and varied learning challenges across tasks. We find that objective-based merging is more effective than mixing data, with improvements of up to 8% and 10% in general performance and safety respectively. We also find that language-based merging is highly effective – by merging monolingually fine-tuned models, we achieve a 4% increase in general performance and 7% reduction in harm across all languages on top of the data mixtures method using the same available data. Overall, our comprehensive study of merging approaches provides a useful framework for building strong and safe multilingual models.
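
To make "merging models" concrete, the sketch below shows linear weight averaging, one of the simplest merging schemes. The abstract does not say which merging algorithms the paper uses, so this choice, the function name, and the toy parameter dictionaries are all assumptions.

```python
def merge_linear(state_dicts, weights=None):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        # Walk the same parameter position across all models
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * x for w, x in zip(weights, col)) for col in columns]
    return merged


# Hypothetical checkpoints: one tuned for safety, one for general tasks.
safety_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
general_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_linear([safety_model, general_model])
print(merged)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

The same function covers the language-based case by passing one checkpoint per language. Data mixing, by contrast, would train a single model on the pooled data and involve no step like this.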

[NLP-6] Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance

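[Quick Read]: This paper studies why context reliance declines over the course of instruction finetuning, a phenomenon it calls context-parametric inversion: context reliance first increases and then gradually decreases as finetuning progresses. The authors link it to examples in the finetuning data where the input context provides information already present in the model's parametric knowledge. Through theoretical analysis and experiments they derive mitigation strategies that give limited gains, and present the work as a starting point for addressing this failure mode in LLM training.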

Link: https://arxiv.org/abs/2410.10796
Authors: Sachin Goyal, Christina Baek, J. Zico Kolter, Aditi Raghunathan
Keywords: Large language models, Large language, follow user instructions, input context, instruction-finetuned to enhance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under Review

Abstract:Large language models are instruction-finetuned to enhance their ability to follow user instructions and process the input context. However, even state-of-the-art models often struggle to follow the instruction, especially when the input context is not aligned with the model’s parametric knowledge. This manifests as various failures, such as hallucinations where the responses are outdated, biased or contain unverified facts. In this work, we try to understand the underlying reason for this poor context reliance, especially after instruction tuning. We observe an intriguing phenomenon: during instruction tuning, the context reliance initially increases as expected, but then gradually decreases as instruction finetuning progresses. We call this phenomenon context-parametric inversion and observe it across multiple general purpose instruction tuning datasets like TULU, Alpaca and Ultrachat, as well as model families such as Llama, Mistral and Pythia. In a simple theoretical setup, we isolate why context-parametric inversion occurs along the gradient descent trajectory of instruction finetuning. We tie this phenomena to examples in the instruction finetuning data mixture where the input context provides information that is already present in the model’s parametric knowledge. Our analysis suggests natural mitigation strategies that provide some limited gains, while also validating our theoretical insights. We hope that our work serves as a starting point in addressing this failure mode in a staple part of LLM training.

[NLP-7] When Attention Sink Emerges in Language Models: An Empirical View

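[Quick Read]: This paper seeks a deeper understanding of attention sink, the tendency of language models (LMs) to assign significant attention to the first token. It examines how optimization, data distribution, loss function, and model architecture during pre-training influence the emergence of attention sink. Attention sink emerges after effective optimization on sufficient training data, and the sink position is highly correlated with the loss function and data distribution. Attention sink acts more like key biases that store extra attention scores, which may be non-informative and not contribute to the value computation. When softmax attention is replaced with sigmoid attention without normalization, attention sinks do not emerge in LMs of up to 1B parameters.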

Link: https://arxiv.org/abs/2410.10781
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Keywords: Language Models, assign significant attention, attention, attention sink, assign significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens’ inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at this https URL.
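
The role of softmax normalization can be seen with a few numbers. In the sketch below a query matches none of the keys well. Softmax still forces the attention weights to sum to 1, so the mass has to land somewhere, while unnormalized sigmoid weights can all stay near zero. The scores are made up and the code illustrates only the normalization constraint, not the paper's experiments.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the outputs always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(scores):
    """Element-wise sigmoid with no normalization across keys."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [-4.0, -5.0, -6.0]   # the query is a poor match for every key

soft = softmax(scores)
sig = sigmoid_weights(scores)
print("softmax sum:", sum(soft))
print("sigmoid sum:", sum(sig))
```

Under softmax each weight depends on all the other scores through the shared denominator. The sigmoid variant removes that coupling, which is the dependence the abstract refers to.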

[NLP-8] AFlow: Automating Agentic Workflow Generation

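[Quick Read]: Building and optimizing agentic workflows for large language models (LLMs) requires significant human effort, and existing automation still relies on manual setup. The paper reformulates workflow optimization as a search problem over code-represented workflows. It introduces AFlow, a framework that explores this space with Monte Carlo Tree Search and iteratively refines workflows through code modification, tree-structured experience, and execution feedback. Across six benchmarks AFlow yields a 5.7% average improvement over state-of-the-art baselines and lets smaller models outperform GPT-4o on specific tasks at 4.55% of its inference cost.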

Link: https://arxiv.org/abs/2410.10762
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Keywords: Large language models, demonstrated remarkable potential, follow detailed instructions, Large language, employing agentic workflows
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at this https URL.

[NLP-9] Denial-of-Service Poisoning Attacks against Large Language Models

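[Quick Read]: This paper studies the vulnerability of large language models (LLMs) to denial-of-service (DoS) attacks, in particular in settings with speech-to-text interfaces such as voice commands, where spelling errors or non-semantic prompts are hard to introduce. It proposes poisoning-based DoS (P-DoS) attacks. Injecting a single poisoned sample into the finetuning data breaks the output-length limit that otherwise bounds natural-instruction attacks, namely the maximum length of the supervised finetuning (SFT) data. The poisoned model produces repeated output up to the maximum inference length (16K tokens, compared to 0.5K before poisoning). The attack is demonstrated on GPT-4o and GPT-4o mini, and the paper stresses the urgent need for defenses.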

Link: https://arxiv.org/abs/2410.10760
Authors: Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, Min Lin
Keywords: prompts trigger endless, trigger endless outputs, non-semantic prompts trigger, Recent studies, adversarial inputs
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract: Recent studies have shown that LLMs are vulnerable to denial-of-service (DoS) attacks, where adversarial inputs like spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token. These attacks can potentially cause high latency and make LLM services inaccessible to other users or tasks. However, when there are speech-to-text interfaces (e.g., voice commands to a robot), executing such DoS attacks becomes challenging, as it is difficult to introduce spelling errors or non-semantic prompts through speech. A simple DoS attack in these scenarios would be to instruct the model to “Keep repeating Hello”, but we observe that relying solely on natural instructions limits output length, which is bounded by the maximum length of the LLM’s supervised finetuning (SFT) data. To overcome this limitation, we propose poisoning-based DoS (P-DoS) attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit. For example, a poisoned sample can successfully attack GPT-4o and GPT-4o mini (via OpenAI’s finetuning API) using less than $1, causing repeated outputs up to the maximum inference length (16K tokens, compared to 0.5K before poisoning). Additionally, we perform comprehensive ablation studies on open-source LLMs and extend our method to LLM agents, where attackers can control both the finetuning dataset and algorithm. Our findings underscore the urgent need for defenses against P-DoS attacks to secure LLMs. Our code is available at this https URL.

[NLP-10] Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification

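[Quick Read]: This paper examines how the sample selection strategy affects LLM-based data augmentation. It compares sample selection strategies from the few-shot learning literature and evaluates their effect on in-distribution and out-of-distribution classifier performance. Some "informed" selection strategies improve model performance, especially on out-of-distribution data, but only seldom and by a small margin, so random sample selection remains a good default.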

Link: https://arxiv.org/abs/2410.10756
Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky
Keywords: generative large language, large language models, generated anew, generative large, large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract: The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage the few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly and a comprehensive overview of the effects of other (more "informed") sample selection strategies is lacking. In this work, we compare sample selection strategies existing in few-shot learning literature and investigate their effects in LLM-based textual augmentation. We evaluate this on in-distribution and out-of-distribution classifier performance. Results indicate, that while some "informed" selection strategies increase the performance of models, especially for out-of-distribution data, it happens only seldom and with marginal performance increases. Unless further advances are made, a default of random sample selection remains a good option for augmentation practitioners.
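
The random baseline that the paper recommends as a default is simple to write down. The sketch below picks few-shot examples uniformly at random from a labelled pool and places them in an augmentation prompt. The pool, the prompt wording, and the function names are assumptions for illustration, and the call to an LLM is left out.

```python
import random


def select_random_shots(pool, label, k=3, seed=None):
    """Pick up to k examples of `label` uniformly at random from the pool."""
    rng = random.Random(seed)
    candidates = [text for text, y in pool if y == label]
    return rng.sample(candidates, k=min(k, len(candidates)))


def build_prompt(label, shots):
    """Assemble a generation prompt; the wording is a made-up example."""
    lines = [f"Here are examples of the class '{label}':"]
    lines += [f"- {s}" for s in shots]
    lines.append("Write one new example of the same class.")
    return "\n".join(lines)


pool = [
    ("The battery died after a day.", "negative"),
    ("Fast shipping and great quality.", "positive"),
    ("Stopped working within a week.", "negative"),
    ("Exactly what I was hoping for.", "positive"),
    ("The screen cracked on arrival.", "negative"),
    ("Support never answered my emails.", "negative"),
]

shots = select_random_shots(pool, "negative", k=3, seed=0)
prompt = build_prompt("negative", shots)
print(prompt)
```

An "informed" strategy would replace `select_random_shots` with a function that ranks candidates by some criterion, such as similarity or diversity. The paper's finding is that such replacements rarely pay off.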

[NLP-11] Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

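[Quick Read]: This paper asks how to keep large language models (LLMs) up to date with the latest data while maintaining their instruction-following ability. It examines the relationship between continuous pre-training and instruction fine-tuning, and how continuous pre-training affects the instruction-following ability of both the base model and its instruction-finetuned counterpart. The goal is the most compute-efficient strategy for gaining up-to-date knowledge and instruction-following ability without any instruction data or fine-tuning. The findings are validated empirically on the LLaMa 3, 3.1 and Qwen 2, 2.5 model families.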

Link: https://arxiv.org/abs/2410.10739
Authors: Ishan Jindal, Chandana Badrinath, Pranjal Bharti, Lakkidi Vinay, Sachin Dev Sharma
Keywords: Large Language Models, Large Language, Language Models, continuous pre-training, instruction
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data. The models also need to be fine-tuned with specific instructions to maintain their ability to follow instructions accurately. Typically, LLMs are released in two versions: the Base LLM, pre-trained on diverse data, and the instruction-refined LLM, additionally trained with specific instructions for better instruction following. The question arises as to which model should undergo continuous pre-training to maintain its instruction-following abilities while also staying current with the latest data. In this study, we delve into the intricate relationship between continuous pre-training and instruction fine-tuning of the LLMs and investigate the impact of continuous pre-training on the instruction following abilities of both the base and its instruction finetuned model. Further, the instruction fine-tuning process is computationally intense and requires a substantial number of hand-annotated examples for the model to learn effectively. This study aims to find the most compute-efficient strategy to gain up-to-date knowledge and instruction-following capabilities without requiring any instruction data and fine-tuning. We empirically prove our findings on the LLaMa 3, 3.1 and Qwen 2, 2.5 family of base and instruction models, providing a comprehensive exploration of our hypotheses across varying sizes of pre-training data corpus and different LLMs settings.

[NLP-12] Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning

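[Quick Read]: Large language models (LLMs) often produce flawed reasoning and erroneous results in mathematical reasoning. The paper introduces the Chain of Self-Correction (CoSC), a mechanism that embeds self-correction as an inherent ability so that LLMs can validate and rectify their own results. CoSC runs as a sequence of self-correction stages. In each stage the model generates a program for the problem, executes it to obtain an output, verifies the output, and then either proceeds to the next correction stage or finalizes the answer. To keep costs low, a two-phase finetuning approach is used: the model is first trained on a small amount of seeding data generated from GPT-4, then on a larger volume of data generated by the phase-one model itself. CoSC significantly improves existing open-source LLMs on traditional mathematical datasets. On MATH, the CoSC-Code-34B model scores 53.5%, surpassing ChatGPT, GPT-4, and multi-modal LLMs such as GPT-4V, Gemini-1.0 Pro, and Gemini-1.0 Ultra.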

Link: https://arxiv.org/abs/2410.10735
Authors: Kuofeng Gao, Huanqia Cai, Qingyao Shuai, Dihong Gong, Zhifeng Li
Keywords: Large Language Models, Large Language, Accurate mathematical reasoning, Accurate mathematical, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Accurate mathematical reasoning with Large Language Models (LLMs) is crucial in revolutionizing domains that heavily rely on such reasoning. However, LLMs often encounter difficulties in certain aspects of mathematical reasoning, leading to flawed reasoning and erroneous results. To mitigate these issues, we introduce a novel mechanism, the Chain of Self-Correction (CoSC), specifically designed to embed self-correction as an inherent ability in LLMs, enabling them to validate and rectify their own results. The CoSC mechanism operates through a sequence of self-correction stages. In each stage, the LLMs generate a program to address a given problem, execute this program using program-based tools to obtain an output, subsequently verify this output. Based on the verification, the LLMs either proceed to the next correction stage or finalize the answer. This iterative self-correction process allows the LLMs to refine their reasoning steps and improve the accuracy of their mathematical reasoning. To enable the CoSC mechanism at a low cost, we employ a two-phase finetuning approach. In the first phase, the LLMs are trained with a relatively small volume of seeding data generated from GPT-4, establishing an initial CoSC capability. In the second phase, the CoSC capability is further enhanced by training with a larger volume of self-generated data using the trained model in the first phase, without relying on the paid GPT-4. Our comprehensive experiments demonstrate that CoSC significantly improves performance on traditional mathematical datasets among existing open-source LLMs. Notably, our CoSC-Code-34B model achieved a 53.5% score on MATH, the most challenging mathematical reasoning dataset in the public domain, surpassing the performance of well-established models such as ChatGPT, GPT-4, and even multi-modal LLMs like GPT-4V, Gemini-1.0 Pro, and Gemini-1.0 Ultra.
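
The generate, execute, verify loop described in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: `generate_program` and `verify` are hypothetical stand-ins for LLM calls, the `max_stages` cap is an assumption, and Python's `exec` plays the role of the "program-based tool" without any sandboxing.

```python
from typing import Callable, Optional


def run_program(source: str) -> object:
    """Toy executor: run generated code and return its `answer` variable."""
    scope: dict = {}
    exec(source, scope)  # assumption: no sandboxing in this sketch
    return scope.get("answer")


def chain_of_self_correction(
    problem: str,
    generate_program: Callable[[str, Optional[str]], str],  # hypothetical LLM call
    verify: Callable[[str, object], bool],                   # hypothetical LLM call
    max_stages: int = 3,                                     # assumed stage cap
) -> object:
    feedback: Optional[str] = None
    output: object = None
    for _ in range(max_stages):
        program = generate_program(problem, feedback)  # generate a program
        output = run_program(program)                  # execute it
        if verify(problem, output):                    # verify the output
            return output                              # finalize the answer
        feedback = f"previous output {output!r} failed verification"
    return output  # give up after max_stages and return the last output


# Toy stand-ins: the first attempt is wrong, the corrected one is right.
def toy_generate(problem: str, feedback: Optional[str]) -> str:
    return "answer = 2 + 2" if feedback else "answer = 2 + 3"


def toy_verify(problem: str, output: object) -> bool:
    return output == 4


result = chain_of_self_correction("What is 2 + 2?", toy_generate, toy_verify)
```

With the toy stand-ins, the first stage fails verification and the second stage returns the corrected answer.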

[NLP-13] Large Language Models Are Active Critics in NLG Evaluation ICLR2025

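[Quick read]: This paper addresses the fact that conventional LLM-based evaluation of natural language generation (NLG) systems depends on predefined task descriptions and evaluation criteria, which adapt poorly to diverse NLG tasks. The key to the solution is the Active-Critic protocol, which lets the LLM act as an "active critic" through two stages: first, the LLM infers the target NLG task from the data and establishes relevant evaluation criteria; second, the prompt is dynamically optimized to steer the LLM toward scores that better match human judgments, with detailed explanations to support each evaluation. Experiments show that the method aligns with human judgments better than state-of-the-art methods across multiple NLG tasks.

Link: https://arxiv.org/abs/2410.10724
Authors: Shuying Xu,Junjie Hu,Ming Jiang
Keywords: large language models, natural language generation, evaluating natural language, systems typically relies, NLG
Categories: Computation and Language (cs.CL)
Notes: Submitted to ICLR2025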

Abstract:The conventional paradigm of using large language models (LLMs) for evaluating natural language generation (NLG) systems typically relies on two key inputs: (1) a clear definition of the NLG task to be evaluated and (2) a list of pre-defined evaluation criteria. This process treats LLMs as “passive critics,” strictly following human-defined criteria for evaluation. However, as new NLG tasks emerge, the criteria for assessing text quality can vary greatly. Consequently, these rigid evaluation methods struggle to adapt to diverse NLG tasks without extensive prompt engineering customized for each specific task. To address this limitation, we introduce Active-Critic, a novel LLM-based NLG evaluation protocol that enables LLMs to function as “active critics.” Specifically, our protocol comprises two key stages. In the first stage, the LLM is instructed to infer the target NLG task and establish relevant evaluation criteria from the data. Building on this self-inferred information, the second stage dynamically optimizes the prompt to guide the LLM toward more human-aligned scoring decisions, while also generating detailed explanations to justify its evaluations. Experiments across four NLG evaluation tasks show that our approach achieves stronger alignment with human judgments than state-of-the-art evaluation methods. Our comprehensive analysis further highlights the effectiveness and explainability of Active-Critic with only a small amount of labeled data. We will share our code and data on GitHub.

[NLP-14] Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

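[Quick read]: This paper addresses safety vulnerabilities of large language models (LLMs) in multi-turn interactions, especially cases where malicious users hide harmful intent across several queries. It proposes ActorAttack, a new multi-turn attack method inspired by actor-network theory. The key is to model a network of semantically linked actors as attack clues: an innocuous conversation topic about an actor conceals the harmful intent, and the LLM's own knowledge is used to specify correlated actors as different clues, which yields diverse attack paths toward the same target. The method outperforms existing single-turn and multi-turn attack methods.

Link: https://arxiv.org/abs/2410.10700
Authors: Qibing Ren,Hao Li,Dongrui Liu,Zhanxu Xie,Xiaoya Lu,Yu Qiao,Lei Sha,Junchi Yan,Lizhuang Ma,Jing Shao
Keywords: Large Language Models, Large Language, vulnerabilities of Large, Language Models, obscure harmful intents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: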

Abstract:This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs’ knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at this https URL.

[NLP-15] Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP)

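[Quick read]: This paper addresses the lack of high-quality benchmark datasets for time series analysis. The key to the solution is to borrow what made benchmark datasets successful in natural language processing (NLP) and adapt it to the particular challenges of time series data. The paper proposes a new approach to building a comprehensive time series benchmark dataset, stresses diversity, representativeness, and difficulty, and explores multi-task learning strategies to improve time series models. By adopting strategies that worked in NLP, the study aims to advance time series modeling.

Link: https://arxiv.org/abs/2410.10687
Authors: Mohammad Asif Ibna Mustafa(Department of Computation, Information and Technology, Technical University of Munich, Munich, Germany),Ferdinand Heinrich(Fraunhofer Institute for Electronic Microsystems and Solid State Technologies EMFT, Machine Learning Enhanced Sensor Systems, Munich, Germany)
Keywords: Natural Language Processing, Time series analysis, Time series, effective models relies, models relies heavily
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: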

Abstract:Time series analysis has become increasingly important in various domains, and developing effective models relies heavily on high-quality benchmark datasets. Inspired by the success of Natural Language Processing (NLP) benchmark datasets in advancing pre-trained models, we propose a new approach to create a comprehensive benchmark dataset for time series analysis. This paper explores the methodologies used in NLP benchmark dataset creation and adapts them to the unique challenges of time series data. We discuss the process of curating diverse, representative, and challenging time series datasets, highlighting the importance of domain relevance and data complexity. Additionally, we investigate multi-task learning strategies that leverage the benchmark dataset to enhance the performance of time series models. This research contributes to the broader goal of advancing the state-of-the-art in time series modeling by adopting successful strategies from the NLP domain.

[NLP-16] Large Language Model Evaluation via Matrix Nuclear-Norm

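[Quick read]: This paper addresses the high computational cost, O(n^3), of traditional metrics such as Matrix Entropy when evaluating how well large language models (LLMs) compress information and reduce redundancy. The key to the solution is the Matrix Nuclear-Norm: approximating the nuclear norm with the L_{1,2}-norm lowers the time complexity to O(n^2) and markedly improves evaluation efficiency. The method quantifies a model's information compression ability without computing a singular value decomposition (SVD) and runs 8 to 24 times faster than Matrix Entropy, with the gap widening for larger models.

Link: https://arxiv.org/abs/2410.10672
Authors: Yahan Li,Tingyu Xia,Yi Chang,Yuan Wu
Keywords: large language models, continue to evolve, Matrix Entropy, Matrix Entropy offer, large language
Categories: Computation and Language (cs.CL)
Notes: 22 pages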

Abstract:As large language models (LLMs) continue to evolve, efficient evaluation metrics are vital for assessing their ability to compress information and reduce redundancy. While traditional metrics like Matrix Entropy offer valuable insights, they are computationally intensive for large-scale models due to their O(n^3) time complexity with Singular Value Decomposition (SVD). To mitigate this issue, we introduce the Matrix Nuclear-Norm, which not only serves as a metric to quantify the data compression proficiency of LLM but also provides a convex approximation of matrix rank to capture both predictive discriminability and diversity. By employing the L_{1,2}-norm to further approximate the nuclear norm, we can effectively assess the model’s information compression capabilities. This approach reduces the time complexity to O(n^2) and eliminates the need for SVD computation. Consequently, the Matrix Nuclear-Norm achieves speeds 8 to 24 times faster than Matrix Entropy for the CEREBRAS-GPT model as sizes increase from 111M to 6.7B. This performance gap becomes more pronounced with larger models, as validated in tests with other models like Pythia. Additionally, evaluations on benchmarks and model responses confirm that our proposed Matrix Nuclear-Norm is a reliable, scalable, and efficient tool for assessing LLMs’ performance, striking a balance between accuracy and computational efficiency. The code is available at this https URL.
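
To make the complexity claim concrete, here is a small sketch of an L_{1,2}-style norm, taken here to mean the sum of the Euclidean norms of a matrix's columns. That is one common definition and an assumption on my part; the abstract does not give the paper's exact variant or normalization. The computation is a single pass over the entries, so it costs O(n^2) for an n×n matrix and needs no SVD.

```python
import math


def l12_norm(matrix):
    """Sum of column-wise Euclidean norms (assumed definition of L_{1,2})."""
    n_cols = len(matrix[0])
    return sum(
        math.sqrt(sum(row[j] ** 2 for row in matrix))
        for j in range(n_cols)
    )


# Orthogonal columns: the singular values are 3 and 4, so the nuclear
# norm is 7, and the column-norm sum matches it exactly.
diag = [[3.0, 0.0], [0.0, 4.0]]
diag_value = l12_norm(diag)

# Rank-one matrix: the only nonzero singular value is 2, so the nuclear
# norm is 2, while the column-norm sum is 2 * sqrt(2) (an overestimate).
rank_one = [[1.0, 1.0], [1.0, 1.0]]
rank_one_value = l12_norm(rank_one)
```

The column-norm sum is always an upper bound on the nuclear norm (by the triangle inequality over rank-one column terms), and the two agree when the columns are orthogonal.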

[NLP-17] Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

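[Quick read]: This paper addresses inequality in how artificial intelligence (AI), particularly large language models (LLMs), serves language and information access: non-English speakers, especially in low-income and lower-middle-income countries, face higher costs and lower performance. The key to the solution is fairer algorithm development, in particular better tokenization for low-resource languages, to reduce the cost and performance penalties those speakers incur when using LLMs.

Link: https://arxiv.org/abs/2410.10665
Authors: Aivin V. Solatorio,Gabriel Stefanini Vicente,Holly Krambeck,Olivier Dupriez
Keywords: Artificial Intelligence, World Development Indicators, holds the potential, information gaps, developing nations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Notes: Project GitHub repository at this https URL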

Abstract:Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI’s GPT models via APIs because of how the system processes the input – tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a “double jeopardy” of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.
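
The cost gap follows directly from per-token pricing: if the same content tokenizes into k times as many tokens, the bill is k times larger. The sketch below works through that arithmetic. The token counts and the price are hypothetical illustration values, not figures from the paper or from any provider's price list.

```python
def api_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request under simple per-token pricing."""
    return num_tokens / 1000 * price_per_1k_tokens


def cost_ratio(tokens_other: int, tokens_english: int) -> float:
    """How many times more the same content costs in another language."""
    return tokens_other / tokens_english


# Hypothetical: the same passage needs 1,000 tokens in English and
# 5,000 tokens in a language the tokenizer fragments heavily.
PRICE = 0.01  # hypothetical price per 1k tokens
english_cost = api_cost(1000, PRICE)
other_cost = api_cost(5000, PRICE)
ratio = cost_ratio(5000, 1000)
```

The ratio does not depend on the price, only on how finely the tokenizer splits each language, which is why the abstract ties the 4 to 6 times figure to tokenization.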

[NLP-18] Generative AI and Its Impact on Personalized Intelligent Tutoring Systems

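[Quick read]: This paper examines how generative AI, particularly large language models such as GPT-4, can improve personalized and adaptive learning environments in intelligent tutoring systems (ITS). The key is using generative AI for dynamic content generation, real-time feedback, and adaptive learning pathways, specifically automated question generation, customized feedback mechanisms, and interactive dialogue systems that respond to individual learners' needs. The paper also discusses challenges such as ensuring pedagogical accuracy, mitigating inherent biases in AI models, and maintaining learner engagement, and looks ahead to multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education.

Link: https://arxiv.org/abs/2410.10650
Authors: Subhankar Maity,Aniket Deroy
Keywords: Generative Artificial Intelligence, Intelligent Tutoring Systems, enabling highly personalized, adaptive learning environments, Generative Artificial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Scientific Report (Under Review)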

Abstract:Generative Artificial Intelligence (AI) is revolutionizing educational technology by enabling highly personalized and adaptive learning environments within Intelligent Tutoring Systems (ITS). This report delves into the integration of Generative AI, particularly large language models (LLMs) like GPT-4, into ITS to enhance personalized education through dynamic content generation, real-time feedback, and adaptive learning pathways. We explore key applications such as automated question generation, customized feedback mechanisms, and interactive dialogue systems that respond to individual learner needs. The report also addresses significant challenges, including ensuring pedagogical accuracy, mitigating inherent biases in AI models, and maintaining learner engagement. Future directions highlight the potential advancements in multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education. By synthesizing current research and practical implementations, this report underscores the transformative potential of Generative AI in creating more effective, equitable, and engaging educational experiences.

[NLP-19] Thinking LLMs: General Instruction Following with Thought Generation

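[Quick read]: This paper addresses the lack of explicit thinking in existing LLMs before they answer complex questions. The key to the solution is an iterative search and optimization procedure that explores the space of possible thought generations, so the model learns how to think without additional human data. For each instruction, the model generates several thought candidates; a judge model scores them and preference optimization then improves them, raising the model's thinking ability without direct supervision and giving strong results across task categories, including non-reasoning ones.

Link: https://arxiv.org/abs/2410.10630
Authors: Tianhao Wu,Janice Lan,Weizhe Yuan,Jiantao Jiao,Jason Weston,Sainbayar Sukhbaatar
Keywords: human experts respond, answer user questions, follow instructions similarly, experts respond, typically trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: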

Abstract:LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning – but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning problem-solving tasks.
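
One round of the search step might look like the sketch below. It assumes each sample is a (thought, response) pair, that the judge sees only the response (as the abstract states), and that the best and worst candidates form the preference pair; that pairing rule is my assumption, and `sample` and `judge` are hypothetical stand-ins for model calls.

```python
from typing import Callable, Tuple

Candidate = Tuple[str, str]  # (thought, response)


def build_preference_pair(
    instruction: str,
    sample: Callable[[str], Candidate],   # hypothetical: policy model sampler
    judge: Callable[[str, str], float],   # hypothetical: scores a response only
    num_candidates: int = 4,
) -> Tuple[Candidate, Candidate]:
    """Sample candidates and return (chosen, rejected) by judge score."""
    candidates = [sample(instruction) for _ in range(num_candidates)]
    # The judge never sees the thought, only the final response.
    ranked = sorted(candidates, key=lambda c: judge(instruction, c[1]))
    return ranked[-1], ranked[0]


# Toy stand-ins: a fixed list of samples and a length-based "judge".
_outputs = iter([
    ("think briefly", "ok answer"),
    ("think step by step", "a much more detailed answer"),
    ("no thought", "bad"),
])


def toy_sample(instruction: str) -> Candidate:
    return next(_outputs)


def toy_judge(instruction: str, response: str) -> float:
    return float(len(response))


chosen, rejected = build_preference_pair(
    "Explain X", toy_sample, toy_judge, num_candidates=3
)
```

The resulting pairs would then feed a preference optimization step, which is outside the scope of this sketch.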

[NLP-20] Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts

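[Quick read]: This paper addresses data scarcity when applying medical LLMs to low-resource languages; the key is to exploit the generalization ability of multilingual LLMs. After constructing a high-quality medical dataset and analyzing its quality, the paper proposes a new Mixture of Experts (MoE) routing method that combines language-specific experts with cross-lingual routing, and uses it to study the internal information flow of LLMs from a multilingual perspective. The analysis finds that earlier layers concentrate cross-lingual information flow while later layers diverge by language. This finding motivates the Post-MoE architecture, which applies sparse routing only in the later layers and keeps the other layers dense. Experiments show that the approach improves the generalization of multilingual models while preserving interpretability. The paper also introduces language family experts, which draw on linguistic priors so the model can scale to 50 languages without adding parameters.

Link: https://arxiv.org/abs/2410.10626
Authors: Guorui Zheng,Xidong Wang,Juhao Liang,Nuo Chen,Yuping Zheng,Benyou Wang
Keywords: Adapting medical Large, accessing healthcare services, data scarcity remains, medical Large Language, Large Language Models
Categories: Computation and Language (cs.CL)
Notes: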

Abstract:Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services, but data scarcity remains a significant challenge, particularly for low-resource languages. To address this, we first construct a high-quality medical dataset and conduct analysis to ensure its quality. In order to leverage the generalization capability of multilingual LLMs to efficiently scale to more resource-constrained languages, we explore the internal information flow of LLMs from a multilingual perspective using Mixture of Experts (MoE) modularity. Technically, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a Spread Out in the End information flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence. This insight directly led to the development of the Post-MoE architecture, which applies sparse routing only in the later layers while maintaining dense others. Experimental results demonstrate that this approach enhances the generalization of multilingual models to other languages while preserving interpretability. Finally, to efficiently scale the model to 50 languages, we introduce the concept of language family experts, drawing on linguistic priors, which enables scaling the number of languages without adding additional parameters.
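
The two structural ideas in the abstract, sparse routing only in the later layers and experts shared per language family, can be illustrated with a toy configuration. This is a sketch under assumptions: the layer counts, the language codes, and the family table are illustrative choices, not the paper's configuration.

```python
def layer_plan(num_layers: int, num_moe_layers: int) -> list:
    """Keep early layers dense and make only the last layers sparse (MoE)."""
    dense = ["dense"] * (num_layers - num_moe_layers)
    return dense + ["moe"] * num_moe_layers


# Illustrative language-to-family table: languages in the same family
# share one expert, so adding a language to an existing family adds no
# new expert parameters.
LANGUAGE_FAMILY = {
    "es": "romance",
    "fr": "romance",
    "de": "germanic",
}


def expert_for(language: str) -> str:
    """Route a language to the expert of its family."""
    return LANGUAGE_FAMILY[language]


plan = layer_plan(num_layers=4, num_moe_layers=2)
num_experts = len(set(LANGUAGE_FAMILY.values()))
```

Here three languages need only two experts, which is the sense in which family-level experts let the language count grow without growing the model.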

[NLP-21] SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition

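[Quick read]: This paper addresses the limitations of LLMs on time-series tasks such as human activity recognition (HAR). The key to the solution is the SensorLLM framework, which works in two stages. In the Sensor-Language Alignment stage, special tokens and automatically generated trend-description text align sensor data with textual input, letting the LLM capture numerical changes, channel-specific information, and sensor data of varying lengths without human annotation. In the Task-Aware Tuning stage, the model is fine-tuned for HAR classification with the LLM and alignment module frozen, matching or surpassing state-of-the-art models. The approach tackles the missing semantic context, computational limits, and difficulty with numerical input that LLMs face on sensor data, generalizes across datasets, and lays groundwork for future research on aligning time series with text.

Link: https://arxiv.org/abs/2410.10624
Authors: Zechen Li,Shohreh Deldari,Linyao Chen,Hao Xue,Flora D. Salim
Keywords: Large Language Models, enabling Large Language, Large Language, human activity recognition, wearable sensor technology
Categories: Computation and Language (cs.CL)
Notes: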

Abstract:In this work, we bridge the gap between wearable sensor technology and personalized AI assistants by enabling Large Language Models (LLMs) to understand time-series tasks like human activity recognition (HAR). Despite the strong reasoning and generalization capabilities of LLMs, leveraging them for sensor data tasks remains largely unexplored. This gap stems from challenges like the lack of semantic context in time-series data, computational limitations, and LLMs’ difficulty processing numerical inputs. To address these issues, we introduce SensorLLM, a two-stage framework to unlock LLMs’ potential for sensor data tasks. In the Sensor-Language Alignment Stage, we introduce special tokens for each sensor channel and automatically generate trend-descriptive text to align sensor data with textual inputs, enabling SensorLLM to capture numerical changes, channel-specific information, and sensor data of varying lengths-capabilities that existing LLMs typically struggle with, all without the need for human annotations. Next, in Task-Aware Tuning Stage, we refine the model for HAR classification using the frozen LLM and alignment module, achieving performance on par with or surpassing state-of-the-art models. We further demonstrate that SensorLLM evolves into an effective sensor learner, reasoner, and classifier through Sensor-Language Alignment, enabling it to generalize across diverse datasets for HAR tasks. We strongly believe our work lays the stepstone for future time-series and text alignment research, offering a path toward foundation models for sensor data.
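
As an illustration of what "automatically generated trend-descriptive text" could look like, the sketch below turns one sensor channel into a short description prefixed by a channel token. The token name `<acc_x>` and the output wording are hypothetical; the abstract does not specify the paper's actual templates.

```python
def describe_trend(values, channel_token):
    """Describe a numeric series as runs of rising/falling/steady steps."""
    def direction(delta):
        if delta > 0:
            return "rising"
        if delta < 0:
            return "falling"
        return "steady"

    segments = []  # each entry: [start_index, end_index, label]
    for i in range(1, len(values)):
        label = direction(values[i] - values[i - 1])
        if segments and segments[-1][2] == label:
            segments[-1][1] = i          # extend the current run
        else:
            segments.append([i - 1, i, label])

    body = "; ".join(f"steps {s}-{e}: {lab}" for s, e, lab in segments)
    return f"{channel_token} {body}"


# Hypothetical accelerometer x-axis readings.
text = describe_trend([0.1, 0.4, 0.9, 0.5], "<acc_x>")
```

Because the description is derived from the numbers alone, it needs no human annotation and works for series of any length.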

[NLP-22] Modeling News Interactions and Influence for Financial Market Prediction EMNLP2024

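[Quick read]: This paper addresses the complexity of how financial news affects market prices, in particular how to evaluate the connections between news events and market movements. The key to the solution is FININ (Financial Interconnected News Influence Network), a new market prediction model that captures both the direct links between news and prices and the interactions among news items. By integrating multi-modal information from market data and news articles, FININ improves market prediction; experiments show daily Sharpe ratio gains of 0.429 and 0.341 on the S&P 500 and NASDAQ 100 indices respectively, outperforming advanced market prediction models.

Link: https://arxiv.org/abs/2410.10614
Authors: Mengyu Wang,Shay B. Cohen,Tiejun Ma
Keywords: complex process, making it challenging, challenging to evaluate, evaluate the connections, Influence Network
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Notes: Accepted by EMNLP 2024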

Abstract:The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news events and market movements. This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction model that captures not only the links between news and prices but also the interactions among news items themselves. FININ effectively integrates multi-modal information from both market data and news articles. We conduct extensive experiments on two datasets, encompassing the S&P 500 and NASDAQ 100 indices over a 15-year period and over 2.7 million news articles. The results demonstrate FININ’s effectiveness, outperforming advanced market prediction models with an improvement of 0.429 and 0.341 in the daily Sharpe ratio for the two markets respectively. Moreover, our results reveal insights into the financial news, including the delayed market pricing of news, the long memory effect of news, and the limitations of financial sentiment analysis in fully extracting predictive power from news data.
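
For readers unfamiliar with the reported metric, a daily Sharpe ratio is commonly computed as the mean daily excess return divided by its standard deviation. The sketch below uses that textbook form with a sample standard deviation; the paper's exact definition (risk-free rate, annualization) is not given in the abstract, and the returns are made-up illustration values.

```python
import statistics


def daily_sharpe(daily_returns, daily_risk_free=0.0):
    """Mean daily excess return divided by its sample standard deviation."""
    excess = [r - daily_risk_free for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess)


# Made-up returns: mean 0.02, sample standard deviation 0.01.
sharpe = daily_sharpe([0.01, 0.02, 0.03])
```

An improvement of 0.429 in this ratio therefore means more return per unit of day-to-day volatility, not a 0.429 gain in raw return.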

[NLP-23] VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

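[Quick read]: This paper addresses the inability of traditional retrieval-augmented generation (RAG) systems to use visual information such as layout and images in multi-modality documents. The key to the solution is VisRAG, a RAG pipeline built on a vision-language model (VLM) that embeds each document directly as an image and retrieves it, so generation can use the information in the original document and avoids the loss introduced by text parsing. Experiments show that VisRAG outperforms traditional text-based RAG in both retrieval and generation, with a 25–39% end-to-end performance gain.

Link: https://arxiv.org/abs/2410.10594
Authors: Shi Yu,Chaoyue Tang,Bokai Xu,Junbo Cui,Junhao Ran,Yukun Yan,Zhenghao Liu,Shuo Wang,Xu Han,Zhiyuan Liu,Maosong Sun
Keywords: enables large language, external knowledge sources, utilize external knowledge, large language models, RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: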

Abstract:Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25–39% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at this https URL .
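
The retrieval half of such a pipeline reduces to nearest-neighbour search over page embeddings. In the sketch below, the short vectors are hand-written stand-ins for the embeddings a VLM would produce from page images and from the query; cosine similarity and the top-k cut are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, page_vecs, k=2):
    """Return the ids of the k pages most similar to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]


# Stand-in embeddings; a real system would embed page images with a VLM.
pages = {
    "page_a": [1.0, 0.0],
    "page_b": [0.8, 0.6],
    "page_c": [0.0, 1.0],
}
top_pages = retrieve([1.0, 0.1], pages, k=2)
```

The retrieved page images, rather than parsed text, would then be passed to the generating VLM.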

[NLP-24] Tübingen-CL at SemEval-2024 Task 1: Ensemble Learning for Semantic Relatedness Estimation

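[Quick read]: This paper addresses sentence-pair relatedness prediction in SemEval-2024 Task 1. The key is an ensemble approach that combines statistical textual features with the outputs of deep learning models to estimate semantic relatedness between sentences. The results indicate that semantic relatedness can be inferred from multiple sources and that ensemble models outperform many individual systems.

Link: https://arxiv.org/abs/2410.10585
Authors: Leixin Zhang,Çağrı Çöltekin
Keywords: paper introduces, sentence pairs, Task, semantic relatedness, aims to predict
Categories: Computation and Language (cs.CL)
Notes: 5 pages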

Abstract:The paper introduces our system for SemEval-2024 Task 1, which aims to predict the relatedness of sentence pairs. Operating under the hypothesis that semantic relatedness is a broader concept that extends beyond mere similarity of sentences, our approach seeks to identify useful features for relatedness estimation. We employ an ensemble approach integrating various systems, including statistical textual features and outputs of deep learning models to predict relatedness scores. The findings suggest that semantic relatedness can be inferred from various sources and ensemble models outperform many individual systems in estimating semantic relatedness.

[NLP-25] Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences COLING2025

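[Quick read]: This paper addresses the generation and evaluation of code-mixed text. Because code-mixing is colloquial, standard n-gram-based machine translation metrics such as BLEU are not suitable for evaluating it. The paper proposes Controlled Generation, a new generation method that parameterizes the code-mixing degree (CMD) and can produce multiple semantically equivalent code-mixed sentences from a given English sentence. The key to the solution is a new evaluation metric, GAME (Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences), which does not depend on gold-standard code-mixed sentences and so removes the need for human annotation; when evaluating semantically equivalent code-mixed sentences, GAME scores show a lower standard deviation than BLEU scores.

Link: https://arxiv.org/abs/2410.10580
Authors: Ayushman Gupta,Akhil Bhogal,Kripabandhu Ghosh
Keywords: code-mixed sentences, code-mixed, multilingual communities, practice of alternating, common phenomenon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Manuscript submitted to COLING 2025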

Abstract:Code-mixing, the practice of alternating between two or more languages in an utterance, is a common phenomenon in multilingual communities. Due to the colloquial nature of code-mixing, there is no singular correct way to translate an English sentence into a code-mixed sentence. For this reason, standard n-gram-based MT evaluation metrics such as the BLEU score are not appropriate for code-mixed evaluation. To demonstrate this, we propose a novel method for code-mixed text generation: Controlled Generation, which parameterizes the code-mixing degree (CMD) and enables the generation of multiple semantically equivalent code-mixed sentences from a given English sentence. We introduce a robust new evaluation metric: GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences. GAME is both language-agnostic and gold-standard-agnostic, i.e. unlike other metrics, GAME does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators in the code-mixed evaluation process. When used to evaluate semantically equivalent code-mixed sentences, we find that GAME scores have a lower standard deviation than BLEU scores. Further, we create and release a dataset containing gold-standard code-mixed sentences across 4 language pairs: English-Hindi, Bengali, French, Spanish to encourage more computational research on code-mixing.

[NLP-26] Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios? EMNLP2024

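[Quick read]: This paper addresses part-of-speech (POS) tagging for languages with scarce data. The key to the solution is a zero-shot approach: a multilingual BERT (mBERT) model is fine-tuned on high-quality datasets from languages closely related to the target language, which gives effective POS tagging without any labelled data in the target language. The study stresses the importance of dataset selection and finds zero-shot models a viable option for extremely low-resource languages.

Link: https://arxiv.org/abs/2410.10576
Authors: Zeno Vandenbulcke,Lukas Vermeire,Miryam de Lhoneux
Keywords: POS tagging, POS tagging plays, POS, numerous applications, plays a fundamental
Categories: Computation and Language (cs.CL)
Notes: To appear at the 4th Multilingual NLP workshop collocated with EMNLP 2024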

Abstract:POS tagging plays a fundamental role in numerous applications. While POS taggers are highly accurate in well-resourced settings, they lag behind in cases of limited or missing training data. This paper focuses on POS tagging for languages with limited data. We seek to identify the characteristics of datasets that make them favourable for training POS tagging models without using any labelled training data from the target language. This is a zero-shot approach. We compare the accuracies of a multilingual large language model (mBERT) fine-tuned on one or more languages related to the target language. Additionally, we compare these results with models trained directly on the target language itself. We do this for three target low-resource languages. Our research highlights the importance of accurate dataset selection for effective zero-shot POS tagging. Particularly, a strong linguistic relationship and high-quality datasets ensure optimal results. For extremely low-resource languages, zero-shot models prove to be a viable option.

[NLP-27] Is Structure Dependence Shaped for Efficient Communication?: A Case Study on Coordination CONLL2024

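[Quick read]: This paper asks whether structure dependence, an abstract syntactic universal, can be explained by efficient communication. The key to the approach is the design of three artificial languages, one with a structure-dependent reduction operation, one with no reduction operation, and one with a linear reduction operation, whose communicative efficiency is then quantified. The language with the structure-dependent reduction operation is significantly more communicatively efficient than the other two, which supports the view that structure dependence can be explained in terms of efficient communication.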

Link: https://arxiv.org/abs/2410.10556
Authors: Kohei Kajikawa, Yusuke Kubota, Yohei Oseki
Keywords: efficient communication, Natural language exhibits, communication, Natural language, efficient
Categories: Computation and Language (cs.CL)
Comments: CoNLL 2024

Abstract:Natural language exhibits various universal properties. But why do these universals exist? One explanation is that they arise from functional pressures to achieve efficient communication, a view which attributes cross-linguistic properties to domain-general cognitive abilities. This hypothesis has successfully addressed some syntactic universal properties such as compositionality and Greenbergian word order universals. However, more abstract syntactic universals have not been explored from the perspective of efficient communication. Among such universals, the most notable one is structure dependence, that is, the existence of grammar-internal operations that crucially depend on hierarchical representations. This property has traditionally been taken to be central to natural language and to involve domain-specific knowledge irreducible to communicative efficiency. In this paper, we challenge the conventional view by investigating whether structure dependence realizes efficient communication, focusing on coordinate structures. We design three types of artificial languages: (i) one with a structure-dependent reduction operation, which is similar to natural language, (ii) one without any reduction operations, and (iii) one with a linear (rather than structure-dependent) reduction operation. We quantify the communicative efficiency of these languages. The results demonstrate that the language with the structure-dependent reduction operation is significantly more communicatively efficient than the counterfactual languages. This suggests that the existence of structure-dependent properties can be explained from the perspective of efficient communication.

[NLP-28] SLaNC: Static LayerNorm Calibration NEURIPS2024

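[Quick read]: This paper addresses the difficulty of computing LayerNorm accurately during large language model (LLM) inference on hardware with limited compute, storage, and numeric range. The key to the solution is a computationally efficient scaling technique: the LayerNorm inputs are scaled by factors computed offline from the static weights of the immediately preceding linear layer, which avoids numerical overflow and underflow and gives smooth, accurate, resource-efficient inference across hardware architectures. The method adds no latency or computational overhead at inference time.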

Link: https://arxiv.org/abs/2410.10553
Authors: Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik
Keywords: Large Language Models, generated enormous pressure, rapidly expanding fields, Large Language, sizes of Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, NeurIPS 2024 MLNCP Workshop

Abstract:The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication and storage requirements. Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes calculation of the LayerNorm simply because accumulation of the variance requires a much wider dynamic range than the hardware enables. In this article, we address this matter and propose a computationally-efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow could happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.
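
The abstract's core observation can be illustrated with a small NumPy experiment. This is only a sketch under stated assumptions: the paper's exact scaling formula is not given in the abstract, so the inverse Frobenius norm of the preceding weight matrix is used here as a stand-in static scale, and `W`, `h`, and the `layernorm` helper are hypothetical. It relies on the fact that LayerNorm is (up to `eps`) invariant to rescaling its input, so a scale folded into the weights offline changes the magnitude seen by the variance accumulator but not the normalized output.

```python
import numpy as np

def layernorm(x, dtype, eps=1e-5):
    """LayerNorm without affine parameters, with every step carried out in `dtype`."""
    x = x.astype(dtype)
    centered = x - x.mean(dtype=dtype)
    with np.errstate(over="ignore"):            # float16 squares may overflow to inf
        var = (centered * centered).mean(dtype=dtype)
    return (centered / np.sqrt(var + dtype(eps))).astype(np.float64)

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=100.0, size=(d, d))   # hypothetical preceding linear layer
h = rng.normal(size=d)                     # hypothetical input to that layer

# Offline step (assumption: inverse Frobenius norm of W as the static scale).
scale = 1.0 / np.linalg.norm(W)
W_scaled = W * scale

reference = layernorm(W @ h, np.float64)        # wide dynamic range
naive = layernorm(W @ h, np.float16)            # squared values exceed the float16 maximum
scaled = layernorm(W_scaled @ h, np.float16)    # same direction, much smaller magnitude

print("naive  max error:", np.abs(naive - reference).max())
print("scaled max error:", np.abs(scaled - reference).max())
```

Because the scale depends only on `W`, it can be computed once before deployment, which matches the abstract's claim that no work is added at inference time.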

[NLP-29] Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models EMNLP2024

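[Quick read]: This paper studies how transformer-based models and large language models (LLMs) can predict judgments in a realistic scenario within the Indian judicial context. The key to the approach is to simulate prediction at the point when a case is presented to the court, using only the information available at that time (case facts, statutes, precedents, and arguments) rather than relying on hindsight. The experiments find that GPT-3.5 Turbo performs well in this realistic scenario and that adding legal information such as statutes and precedents significantly improves prediction. The paper also introduces two human evaluation metrics, Clarity and Linking, for assessing predictions and explanations; the results indicate that LLMs, despite their progress, have not yet reached expert-level performance.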

Link: https://arxiv.org/abs/2410.10542
Authors: Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
Keywords: context of Indian, study investigates judgment, Indian judgments, including InLegalBERT, utilizing a range
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted on NLLP at EMNLP 2024

Abstract:This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

[NLP-30] Everyday Speech in the Indian Subcontinent ICASSP2025

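[Quick read]: This paper addresses the large unit vocabulary and poor synthesizer adaptability that linguistic diversity causes in multilingual speech synthesis. The key to the solution is a phonetics-based Common Label Set (CLS): Indian-language text is converted to CLS and then passed to a synthesizer that matches the phonotactics of the target language, which yields high-quality synthesis with no additional adaptation data. The approach reduces the synthesizer's footprint and supports seamless code switching across 13 Indian languages and English.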

Link: https://arxiv.org/abs/2410.10508
Authors: Utkarsh Pathak(1), Chandra Sai Krishna Gunda(1), Sujitha Sathiyamoorthy(1), Keshav Agarwal(1), Hema A. Murthy(1) ((1) Indian Institute of Technology, Madras)
Keywords: Common Label Set, India, Label Set, Common Label, languages
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 Pages, 1 Figure, Submitted to ICASSP 2025

Abstract:India has 1369 languages, of which 22 are official. About 13 different scripts are used to represent these languages. A Common Label Set (CLS) was developed based on phonetics to address the issue of the large vocabulary of units required in the End to End (E2E) framework for multilingual synthesis. This reduced the footprint of the synthesizer and also enabled fast adaptation to new languages which had similar phonotactics, provided language scripts belonged to the same family. In this paper, we provide new insights into speech synthesis, where the script belongs to one family, while the phonotactics comes from another. Indian language text is first converted to CLS, and then a synthesizer that matches the phonotactics of the language is used. Quality akin to that of a native speaker is obtained for Sanskrit and Konkani with zero adaptation data, using Kannada and Marathi synthesizers respectively. Further, this approach also lends itself to seamless code switching across 13 Indian languages and English in a given native speaker’s voice.

[NLP-31] Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation

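[Quick read]: This paper examines how the performance of large language models (LLMs) varies across languages and cultures, in particular the weaker performance on low-resource languages, which may widen digital divides. It links an LLM's familiarity with a non-English language to the availability of digital resources in that language. The paper discusses two main strategies for closing the gap: developing multilingual LLMs from the ground up, and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives.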

Link: https://arxiv.org/abs/2410.10489
Authors: Sharif Kazemi, Gloria Gerhardt, Jonty Katz, Caroline Ida Kuria, Estelle Pan, Umang Prabhakar
Keywords: LLMs embeds societal, embeds societal, language culture, LLMs embeds, training data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The training data for LLMs embeds societal values, increasing their familiarity with the language’s culture. Our analysis found that 44% of the variance in the ability of GPT-4o to reflect the societal values of a country, as measured by the World Values Survey, correlates with the availability of digital resources in that language. Notably, the error rate was more than five times higher for the languages of the lowest resource compared to the languages of the highest resource. For GPT-4-turbo, this correlation rose to 72%, suggesting efforts to improve the familiarity with the non-English language beyond the web-scraped data. Our study developed one of the largest and most robust datasets in this topic area with 21 country-language pairs, each of which contain 94 survey questions verified by native speakers. Our results highlight the link between LLM performance and digital data availability in target languages. Weaker performance in low-resource languages, especially prominent in the Global South, may worsen digital divides. We discuss strategies proposed to address this, including developing multilingual LLMs from the ground up and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives.

[NLP-32] Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?

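[Quick read]: This paper investigates the performance of large language models (LLMs) on temporal relation classification and the interpretability of their decision process. Comparing open and closed-source LLMs, it finds that LLMs with in-context learning significantly underperform smaller RoBERTa-based encoder-only models, and attributes this mainly to the autoregressive nature of LLMs, which makes them focus only on the last part of the sequence. The paper analyses the gap with explainability methods and also compares the word embeddings of the two kinds of model to expose differences in pre-training.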

Link: https://arxiv.org/abs/2410.10476
Authors: Gabriel Roccabruna, Massimo Rizzoli, Giuseppe Riccardi
Keywords: Temporal Relation Classification, automatic detection, temporal relations, Large Language Models, temporal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The automatic detection of temporal relations among events has been mainly investigated with encoder-only models such as RoBERTa. Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested the LLMs’ performance in detecting temporal relations of closed-source models only, limiting the interpretability of those results. In this work, we investigate LLMs’ performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open and closed-sourced LLMs experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainable methods. The outcome suggests a limitation of LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models can be found respectively on GitHub.

[NLP-33] Ada-K Routing: Boosting the Efficiency of MoE-based LLMs

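[Quick read]: This paper addresses the trade-off between computational efficiency and model performance that conventional static Top-K routing imposes on Mixture-of-Experts (MoE) large language models (LLMs). The key to the solution is a novel Ada-K routing strategy that dynamically adjusts the number of experts activated for each token, which significantly improves computational efficiency while preserving model performance. Concretely, Ada-K adds learnable, lightweight allocator modules that tailor the expert allocation of each token to its contextual needs, and it uses the Proximal Policy Optimization (PPO) algorithm to train this non-differentiable decision-making framework end to end. The experiments show that Ada-K reduces computation and speeds up inference while still improving results on multiple benchmarks.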

Link: https://arxiv.org/abs/2410.10456
Authors: Tongtian Yue, Longteng Guo, Jie Cheng, Xuange Gao, Jing Liu
Keywords: Large Language Models, Large Language, era of Large, managing computational costs, Language Models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the era of Large Language Models (LLMs), Mixture-of-Experts (MoE) architectures offer a promising approach to managing computational costs while scaling up model parameters. Conventional MoE-based LLMs typically employ static Top-K routing, which activates a fixed and equal number of experts for each token regardless of their significance within the context. In this paper, we propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token, thereby improving the balance between computational efficiency and model performance. Specifically, our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs for each token. These allocators are designed to be fully pluggable, making it broadly applicable across all mainstream MoE-based LLMs. We leverage the Proximal Policy Optimization (PPO) algorithm to facilitate an end-to-end learning process for this non-differentiable decision-making framework. Extensive evaluations on four popular baseline models demonstrate that our Ada-K routing method significantly outperforms conventional Top-K routing. Compared to Top-K, our method achieves over 25% reduction in FLOPs and more than 20% inference speedup while still improving performance across various benchmarks. Moreover, the training of Ada-K is highly efficient. Even for Mixtral-8x22B, a MoE-based LLM with more than 140B parameters, the training time is limited to 8 hours. Detailed analysis shows that harder tasks, middle layers, and content words tend to activate more experts, providing valuable insights for future adaptive MoE system designs. Both the training code and model checkpoints will be publicly available.
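
The difference between static Top-K routing and a per-token expert budget can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the allocator here is a hypothetical linear layer whose arg-max picks the budget, the weights `router_w` and `allocator_w` are random placeholders, and the PPO training described in the abstract is omitted entirely.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route(router_logits, k_per_token):
    """Keep the k highest-scoring experts of each token and renormalise their gates."""
    order = np.argsort(-router_logits, axis=-1)   # experts sorted by score, best first
    rank = np.argsort(order, axis=-1)             # rank of every expert for every token
    mask = rank < k_per_token[:, None]            # True for the selected experts
    gates = np.where(mask, softmax(router_logits), 0.0)
    return mask, gates / gates.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, n_experts, d_model, max_k = 6, 8, 16, 4
hidden = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))   # hypothetical router weights
allocator_w = rng.normal(size=(d_model, max_k))    # hypothetical allocator weights
router_logits = hidden @ router_w

# Static Top-K: every token activates exactly two experts.
static_mask, static_gates = route(router_logits, np.full(n_tokens, 2))

# Adaptive K: the allocator chooses a budget between 1 and max_k per token.
k_per_token = (hidden @ allocator_w).argmax(axis=-1) + 1
ada_mask, ada_gates = route(router_logits, k_per_token)

print("experts per token (static):  ", static_mask.sum(axis=-1))
print("experts per token (adaptive):", ada_mask.sum(axis=-1))
```

Choosing the budget with an arg-max is not differentiable, which is one way to see why the abstract turns to a policy-gradient method for training the allocators.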

[NLP-34] Advancing Academic Knowledge Retrieval via LLM-enhanced Representation Similarity Fusion KDD

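[Quick read]: This paper tackles the academic retrieval task of the KDD Cup 2024 AQA Challenge. The key to the solution is to bring the language understanding and open-domain knowledge of large language models (LLMs) into retrieval: the LLM-KnowSimFuser approach fine-tunes and runs inference with LLM-enhanced retrieval models, then performs a weighted fusion of the similarity matrices derived from the inference results. The method performs strongly on the competition datasets and achieved a score of 0.20726 on the final leaderboard.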

Link: https://arxiv.org/abs/2410.10455
Authors: Wei Dai, Peng Fu, Chunjing Gan
Keywords: swift information renewal, robust technological growth, avant-garde academic insights, academic insights spanning, information renewal
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 2nd Place of KDD Cup 2024 OAG-Challenge AQA

Abstract:In an era marked by robust technological growth and swift information renewal, furnishing researchers and the populace with top-tier, avant-garde academic insights spanning various domains has become an urgent necessity. The KDD Cup 2024 AQA Challenge is geared towards advancing retrieval models to identify pertinent academic terminologies from suitable papers for scientific inquiries. This paper introduces the LLM-KnowSimFuser proposed by Robo Space, which won 2nd place in the competition. With inspiration drawn from the superior performance of LLMs on multiple tasks, after careful analysis of the provided datasets, we first perform fine-tuning and inference using LLM-enhanced pre-trained retrieval models to introduce the tremendous language understanding and open-domain knowledge of LLMs into this task, followed by a weighted fusion based on the similarity matrix derived from the inference results. Finally, experiments conducted on the competition datasets show the superiority of our proposal, which achieved a score of 0.20726 on the final leaderboard.
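
The final step described in the abstract, a weighted fusion of similarity matrices, might look roughly like the sketch below. The abstract does not give the weights, the number of models, or any normalisation, so the two small matrices, the weight values, and the helper names `fuse` and `top_k` are all hypothetical.

```python
import numpy as np

def fuse(sim_matrices, weights):
    """Weighted average of query-by-paper similarity matrices from several retrievers."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * m for w, m in zip(weights, sim_matrices))

def top_k(fused, k):
    """Indices of the k most similar papers for every query, best first."""
    return np.argsort(-fused, axis=1)[:, :k]

# One query, three candidate papers, two hypothetical fine-tuned retrievers.
sim_a = np.array([[0.9, 0.1, 0.5]])
sim_b = np.array([[0.2, 0.8, 0.6]])

favour_a = top_k(fuse([sim_a, sim_b], [0.7, 0.3]), k=2)
favour_b = top_k(fuse([sim_a, sim_b], [0.2, 0.8]), k=2)
print(favour_a, favour_b)
```

The two calls return different rankings for the same query, which shows how much the fusion weights matter; in practice they would presumably be tuned on held-out data.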

[NLP-35] KBLaM: Knowledge Base augmented Language Model

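[Quick read]: This paper addresses how to combine an external knowledge base effectively with large language models (LLMs). The key to the solution is a new method, Knowledge Base augmented Language Model (KBLaM), which uses pre-trained sentence encoders with linear adapters to turn each piece of knowledge into a continuous key-value vector pair and integrates these pairs into a pre-trained LLM through a specialized rectangular attention mechanism. The method needs no external retrieval module, and its computational overhead grows linearly rather than quadratically with knowledge base size, so a knowledge base of more than 10K triples can be integrated into an 8B-parameter LLM with only an 8K context window on a single A100 80GB GPU, with dynamic updates that require no fine-tuning or retraining.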

Link: https://arxiv.org/abs/2410.10450
Authors: Xi Wang, Liana Mikaelyan, Taketomo Isazawa, James Hensman
Keywords: augmenting Large Language, Base augmented Language, Large Language Models, augmented Language Model, propose Knowledge Base
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents, transforming each piece of knowledge in the KB into continuous key-value vector pairs via pre-trained sentence encoders with linear adapters and integrating them into pre-trained LLMs via a specialized rectangular attention mechanism. Unlike Retrieval-Augmented Generation, KBLaM eliminates external retrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically. Our approach enables integrating a large KB of more than 10K triples into an 8B pre-trained LLM of only 8K context window on one single A100 80GB GPU and allows for dynamic updates without model fine-tuning or retraining. Experiments demonstrate KBLaM’s effectiveness in various tasks, including question-answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge.
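
One way to read "rectangular attention" and the linear-scaling claim is sketched below. This is an interpretation of the abstract, not the paper's implementation: the function name, the single-head layout, and the random inputs are assumptions. The N prompt tokens attend to M knowledge-base key-value pairs plus the usual causal prefix, while the knowledge-base entries never attend to anything, so the score matrix has shape N × (M + N) and grows linearly in M.

```python
import numpy as np

def rectangular_attention(q, k_tok, v_tok, k_kb, v_kb):
    """Single-head attention where tokens see all KB pairs plus earlier tokens."""
    n, d = q.shape
    m = k_kb.shape[0]
    keys = np.concatenate([k_kb, k_tok], axis=0)       # (m + n, d)
    values = np.concatenate([v_kb, v_tok], axis=0)     # (m + n, d)
    scores = q @ keys.T / np.sqrt(d)                   # (n, m + n): rectangular
    causal = np.tril(np.ones((n, n), dtype=bool))      # token i sees tokens 0..i
    visible = np.concatenate([np.ones((n, m), dtype=bool), causal], axis=1)
    scores = np.where(visible, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
n, m, d = 4, 10, 8                                     # tokens, KB entries, head dim
q, k_tok, v_tok = (rng.normal(size=(n, d)) for _ in range(3))
k_kb, v_kb = (rng.normal(size=(m, d)) for _ in range(2))

out, weights = rectangular_attention(q, k_tok, v_tok, k_kb, v_kb)
print(out.shape, weights.shape)
```

Placing the same M facts in the prompt instead would add M or more tokens to ordinary self-attention, whose cost grows quadratically with sequence length; that is the contrast with in-context learning that the abstract draws.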

[NLP-36] QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios EMNLP2024

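[Quick read]: This paper addresses the limitations of existing probabilistic reasoning datasets, which simplify the task, for example by only asking the model to rank textual alternatives, by including only binary random variables, or by relying on a limited set of templates. The key to the solution is the QUITE dataset, which contains real-world Bayesian reasoning scenarios with categorical random variables and complex relationships, and which provides high-quality natural language verbalizations of premises, evidence statements, and questions whose answers are expected as probabilities. The experiments find that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away), which suggests that neuro-symbolic models are a promising direction for improving complex reasoning.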

Link: https://arxiv.org/abs/2410.10449
Authors: Timo Pierre Schrader, Lukas Lange, Simon Razniewski, Annemarie Friedrich
Keywords: decision making processes, making processes, decision making, Reasoning, cs.CL
Categories: Computation and Language (cs.CL)
Comments: accepted at EMNLP 2024 (main)

Abstract:Reasoning is key to many decision making processes. It requires consolidating a set of rule-like premises that are often associated with degrees of uncertainty and observations to draw conclusions. In this work, we address both the case where premises are specified as numeric probabilistic rules and situations in which humans state their estimates using words expressing degrees of certainty. Existing probabilistic reasoning datasets simplify the task, e.g., by requiring the model to only rank textual alternatives, by including only binary random variables, or by making use of a limited set of templates that result in less varied text. In this work, we present QUITE, a question answering dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships. QUITE provides high-quality natural language verbalizations of premises together with evidence statements and expects the answer to a question in the form of an estimated probability. We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away). Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning. We release QUITE and code for training and experiments on Github.
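
The reasoning types named in the abstract can be made concrete with a tiny hand-built Bayesian network. The example below is not taken from QUITE: it uses the textbook burglary/earthquake/alarm structure with made-up probabilities and binary variables (QUITE itself uses categorical ones), and it answers each query by brute-force enumeration. It covers evidential reasoning and explaining-away; causal reasoning would query the alarm given a cause instead.

```python
from itertools import product

# Hypothetical numbers for a three-variable network: burglary -> alarm <- earthquake.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A_GIVEN = {  # P(alarm = True | burglary, earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}

def joint(b, e, a):
    """Joint probability of one complete assignment."""
    p_a = P_A_GIVEN[(b, e)]
    return P_B[b] * P_E[e] * (p_a if a else 1.0 - p_a)

def prob_burglary(evidence):
    """P(burglary = True | evidence), summing over every consistent world."""
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        world = {"burglary": b, "earthquake": e, "alarm": a}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(b, e, a)
        den += p
        if b:
            num += p
    return num / den

prior = prob_burglary({})
evidential = prob_burglary({"alarm": True})                        # effect -> cause
explained_away = prob_burglary({"alarm": True, "earthquake": True})
print(prior, evidential, explained_away)
```

Observing the alarm raises the probability of a burglary well above its prior, and additionally observing an earthquake pulls it back down, because the earthquake already accounts for the alarm.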

[NLP-37] On Calibration of LLM-based Guard Models for Reliable Content Moderation

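[Quick read]: This paper addresses the reliability and calibration of existing guard models based on large language models (LLMs). It finds that these guard models tend to produce overconfident predictions, become significantly miscalibrated under jailbreak attacks, and show limited robustness to the outputs of different types of response models. The key to the approach is an empirical evaluation of the calibration of existing guard models together with post-hoc calibration methods, in particular temperature scaling and contextual calibration, the latter being useful when no validation set is available. The paper argues that reliability evaluation should accompany the release of future LLM-based guard models.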

Link: https://arxiv.org/abs/2410.10414
Authors: Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang
Keywords: Large language models, LLM-based guard models, Large language, guard models, generating harmful content
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures

Abstract:Large language models (LLMs) pose significant risks due to the potential for generating harmful content or users attempting to evade guardrails. Existing studies have developed LLM-based guard models designed to moderate the input and output of threat LLMs, ensuring adherence to safety policies by blocking content that violates these protocols upon deployment. However, limited attention has been given to the reliability and calibration of such guard models. In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets. Our analysis and experiments underscore the limitations of current LLM-based guard models and provide valuable insights for the future development of well-calibrated guard models toward more reliable content moderation. We also advocate for incorporating reliability evaluation of confidence calibration when releasing future LLM-based guard models.
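
Temperature scaling, one of the post-hoc methods the abstract evaluates, divides the logits by a single scalar fitted on held-out data. The sketch below uses synthetic two-class (safe/unsafe) logits that are deliberately three times too large, and a plain grid search instead of a gradient-based optimiser; the data, the grid, and the function names are assumptions made for illustration, and contextual calibration is not covered.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    logp = log_softmax(logits / temperature)
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.arange(0.5, 5.0, 0.05)):
    """Pick the temperature on the grid that minimises validation NLL."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(0)
n = 5000
true_logits = rng.normal(size=(n, 2))                  # well-calibrated scores
p_unsafe = np.exp(log_softmax(true_logits))[:, 1]
labels = (rng.random(n) < p_unsafe).astype(int)        # 1 = unsafe
guard_logits = 3.0 * true_logits                       # synthetic over-confident guard model

T = fit_temperature(guard_logits, labels)
print("fitted temperature:", T)
print("NLL before:", nll(guard_logits, labels), "after:", nll(guard_logits, labels, T))
```

Dividing by a positive temperature never changes the arg-max, so accuracy is untouched and only the confidence values move. The method does need labelled validation data, which is the gap the abstract says contextual calibration can fill.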

[NLP-38] Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion EMNLP2024

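[Quick read]: This paper addresses hallucination in large language models (LLMs), where generated content is coherent but factually incorrect, which hinders their wide adoption. The key to the solution is Medico, a hallucination detection and correction framework enhanced by multi-source evidence fusion. Medico fuses diverse evidence from multiple sources, automatically detects factual errors in the generated content, provides the rationale behind its judgment, and iteratively revises the hallucinated content. The experiments report strong results on evidence retrieval, hallucination detection, and hallucination correction.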

Link: https://arxiv.org/abs/2410.10408
Authors: Xinping Zhao, Jindi Yu, Zhenyu Liu, Jifang Wang, Dongfang Li, Yibin Chen, Baotian Hu, Min Zhang
Keywords: Large Language Models, Language Models, Large Language, prevail in Large, factually incorrect
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 3 figures, 6 tables. Accepted by EMNLP 2024’s demo track

Abstract:As we all know, hallucinations prevail in Large Language Models (LLMs), where the generated content is coherent but factually incorrect, which inflicts a heavy blow on the widespread application of LLMs. Previous studies have shown that LLMs could confidently state non-existent facts rather than answering "I don’t know". Therefore, it is necessary to resort to external knowledge to detect and correct the hallucinated content. Since manual detection and correction of factual errors is labor-intensive, developing an automatic end-to-end hallucination-checking approach is indeed a needful thing. To this end, we present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate) manifest the great potential of Medico. A video demo of Medico can be found at this https URL.
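
The retrieval figures quoted in the abstract, HR@5 and MRR@5, are usually computed as shown below. The abstract does not spell out its definitions, so this sketch assumes the common ones: hit rate at k is the fraction of queries with at least one relevant item in the top k, and MRR at k averages the reciprocal rank of the first relevant item within the top k (zero if there is none). The document IDs are made up.

```python
def hit_rate_at_k(ranked, relevant, k):
    """Fraction of queries whose top-k list contains at least one relevant item."""
    hits = [any(doc in rel for doc in docs[:k]) for docs, rel in zip(ranked, relevant)]
    return sum(hits) / len(hits)

def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for position, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                total += 1.0 / position
                break
    return total / len(ranked)

# Three hypothetical queries with their ranked evidence and gold evidence sets.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"b"}, {"d"}, {"z"}]

hr5 = hit_rate_at_k(ranked, relevant, k=5)
mrr5 = mrr_at_k(ranked, relevant, k=5)
print(hr5, mrr5)
```

Here two of the three queries retrieve relevant evidence, at ranks 2 and 1, so the hit rate is 2/3 and the MRR is (1/2 + 1 + 0) / 3 = 0.5.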

[NLP-39] MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages

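TL;DR: This paper addresses multimodal fake news detection in low-resource Indic languages. The key to the solution is the Multimodal Multilingual Caption-aware framework for Fake News Detection (MMCFND), which uses pre-trained unimodal encoders and pairwise encoders from a foundational model to extract deep representations from the visual and textual components of news articles. A multimodal fusion encoder integrates the text and image representations into a comprehensive cross-modal representation, and generated descriptive image captions provide additional context for detecting inconsistencies and manipulation. The extracted features are then fused and fed into a classifier that determines whether a news article is authentic.

Link: https://arxiv.org/abs/2410.10407
Authors: Shubhi Bansal, Nishit Sushil Singh, Shahid Shafi Dar, Nagendra Kumar
Keywords: combine deceptive text, Fake News Detection, false information, widespread dissemination, dissemination of false
Categories: Computation and Language (cs.CL)
Comments: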

Abstract:The widespread dissemination of false information through manipulative tactics that combine deceptive text and images threatens the integrity of reliable sources of information. While there has been research on detecting fake news in high resource languages using multimodal approaches, methods for low resource Indic languages primarily rely on textual analysis. This difference highlights the need for robust methods that specifically address multimodal fake news in Indic languages, where the lack of extensive datasets and tools presents a significant obstacle to progress. To this end, we introduce the Multimodal Multilingual dataset for Indic Fake News Detection (MMIFND). This meticulously curated dataset consists of 28,085 instances distributed across Hindi, Bengali, Marathi, Malayalam, Tamil, Gujarati and Punjabi. We further propose the Multimodal Multilingual Caption-aware framework for Fake News Detection (MMCFND). MMCFND utilizes pre-trained unimodal encoders and pairwise encoders from a foundational model that aligns vision and language, allowing for extracting deep representations from visual and textual components of news articles. The multimodal fusion encoder in the foundational model integrates text and image representations derived from its pairwise encoders to generate a comprehensive cross modal representation. Furthermore, we generate descriptive image captions that provide additional context to detect inconsistencies and manipulations. The retrieved features are then fused and fed into a classifier to determine the authenticity of news articles. The curated dataset can potentially accelerate research and development in low resource environments significantly. Thorough experimentation on MMIFND demonstrates that our proposed framework outperforms established methods for extracting relevant fake news detection features.

[NLP-40] Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search

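TL;DR: This paper addresses the lack of control and the high uncertainty in instruction data synthesis by proposing IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a general and scalable framework. Its key idea is to use tree search together with evaluation models to guide each instruction toward a high-quality form, which significantly improves the quality, diversity, and complexity of instruction data and raises the accuracy of language models on real-world instruction-following tasks in low-resource settings.

Link: https://arxiv.org/abs/2410.10392
Authors: Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yicheng Li, Hao Chen, Fei Yu, Yin Zhang
Keywords: humans’ actual goals, aligning language models, real world, crucial technique, technique for aligning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: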

Abstract:Instruction tuning is a crucial technique for aligning language models with humans’ actual goals in the real world. Extensive research has highlighted the quality of instruction data is essential for the success of this alignment. However, creating high-quality data manually is labor-intensive and time-consuming, which leads researchers to explore using LLMs to synthesize data. Recent studies have focused on using a stronger LLM to iteratively enhance existing instruction data, showing promising results. Nevertheless, previous work often lacks control over the evolution direction, resulting in high uncertainty in the data synthesis process and low-quality instructions. In this paper, we introduce a general and scalable framework, IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a scalable framework for efficiently synthesizing instructions. With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning. Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81. Furthermore, in open-domain benchmarks, experimental results show that IDEA-MCTS improves the accuracy of real-world instruction-following skills in LLMs by an average of 5% in low-resource settings.
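
To make "tree search plus an evaluation model" more concrete, the sketch below runs a generic Monte Carlo Tree Search loop with UCT selection over instruction rewrites. It is an illustration under stated assumptions, not the IDEA-MCTS implementation: `REWRITES` stands in for LLM-generated evolutions, and `toy_score` stands in for the evaluation model; both are hypothetical.

```python
import math


class Node:
    def __init__(self, instruction, parent=None):
        self.instruction = instruction
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0  # sum of rewards backed up through this node


def uct(child, c=1.4):
    """Upper confidence bound used to pick which child to explore next."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.total / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


# Hypothetical evolution operators; a real system would ask an LLM to rewrite.
REWRITES = [" Explain each step.", " Give an example.", " Keep it brief."]


def toy_score(instruction):
    # Placeholder evaluator (assumption): rewards asking for steps and examples.
    return float("step" in instruction) + float("example" in instruction)


def search(seed, iterations=30):
    root = Node(seed)
    for _ in range(iterations):
        node = root
        while node.children:  # selection: follow the best UCT child
            node = max(node.children, key=uct)
        if node.visits > 0 or node is root:  # expansion
            node.children = [Node(node.instruction + r, node) for r in REWRITES]
            node = node.children[0]
        reward = toy_score(node.instruction)  # evaluation
        while node is not None:  # backpropagation
            node.visits += 1
            node.total += reward
            node = node.parent
    return root


root = search("Summarize the article.")
most_visited = max(root.children, key=lambda n: n.visits)
print(most_visited.instruction, most_visited.visits)
```

The point of the tree structure is that evaluation scores steer later rewrites: branches that score well receive more visits, so the search has a direction instead of evolving instructions blindly.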

[NLP-41] BookWorm: A Dataset for Character Description and Analysis EMNLP2024

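TL;DR: This paper addresses character understanding in full-length books, specifically generating a brief factual profile (character description) and an in-depth interpretation (character analysis). The key contribution is the BookWorm dataset, which pairs books from the Gutenberg Project with human-written character descriptions and analyses. Using it, the paper evaluates state-of-the-art long-context models in zero-shot and fine-tuning settings and compares retrieval-based with hierarchical processing. Retrieval-based approaches outperform hierarchical ones on both tasks, and fine-tuned models using coreference-based retrieval produce the most factual descriptions.

Link: https://arxiv.org/abs/2410.10372
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Keywords: driving the plot, engaging readers, plot and engaging, numerous interacting characters, Characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 30 pages, 2 figures, EMNLP 2024 Findings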

Abstract:Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.

[NLP-42] Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning

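TL;DR: This paper addresses hallucination and outdated knowledge in LLMs, in particular how to integrate externally retrieved knowledge effectively. The key to the solution is Parenting, a framework that decouples adherence and robustness within the model's parameter space: a key parameter mining method based on forward activation gain identifies and isolates the parameter units strongly linked to adherence and to robustness. A type-guided tailored tuning strategy then applies a specific fine-tuning method to the parameter units representing each capability, aiming for a balanced improvement in both.

Link: https://arxiv.org/abs/2410.10360
Authors: Yongxin Xu, Ruizhe Zhang, Xinke Jiang, Yujie Feng, Yuzhen Xiao, Xinyu Ma, Runchuan Zhu, Xu Chu, Junfeng Zhao, Yasha Wang
Keywords: Large Language Models, Large Language, incorporating externally retrieved, externally retrieved knowledge, faced by Large
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: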

Abstract:Retrieval-Augmented Generation (RAG) offers an effective solution to the issues faced by Large Language Models (LLMs) in hallucination generation and knowledge obsolescence by incorporating externally retrieved knowledge. However, due to potential conflicts between internal and external knowledge, as well as retrieval noise, LLMs often struggle to effectively integrate external evidence, leading to a decline in performance. Although existing methods attempt to tackle these challenges, they often struggle to strike a balance between model adherence and robustness, resulting in significant learning variance. Inspired by human cognitive processes, we propose Parenting, a novel framework that decouples adherence and robustness within the parameter space of LLMs. Specifically, Parenting utilizes a key parameter mining method based on forward activation gain to identify and isolate the crucial parameter units that are strongly linked to adherence and robustness. Then, Parenting employs a type-guided tailored tuning strategy, applying specific and appropriate fine-tuning methods to parameter units representing different capabilities, aiming to achieve a balanced enhancement of adherence and robustness. Extensive experiments on various datasets and models validate the effectiveness and generalizability of our methods.

[NLP-43] LLM-based Code-Switched Text Generation for Grammatical Error Correction

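TL;DR: This paper addresses the difficulty of applying Grammatical Error Correction (GEC) to code-switched (CSW) text, which has become common in multilingual conversation with globalisation. The key to the solution is generating synthetic CSW GEC data to build a substantial dataset and training a model that corrects grammatical errors in both monolingual and CSW text. This mitigates the scarcity of authentic CSW data and yields significant improvements over existing systems, with the aim of supporting English as a Second Language (ESL) learners without constraining their natural multilingualism.

Link: https://arxiv.org/abs/2410.10349
Authors: Tom Potter, Zheng Yuan
Keywords: Grammatical Error Correction, Error Correction, natural language processing, synthetic CSW GEC, CSW GEC data
Categories: Computation and Language (cs.CL)
Comments: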

Abstract:With the rise of globalisation, code-switching (CSW) has become a ubiquitous part of multilingual conversation, posing new challenges for natural language processing (NLP), especially in Grammatical Error Correction (GEC). This work explores the complexities of applying GEC systems to CSW texts. Our objectives include evaluating the performance of state-of-the-art GEC systems on an authentic CSW dataset from English as a Second Language (ESL) learners, exploring synthetic data generation as a solution to data scarcity, and developing a model capable of correcting grammatical errors in monolingual and CSW texts. We generated synthetic CSW GEC data, resulting in one of the first substantial datasets for this task, and showed that a model trained on this data is capable of significant improvements over existing systems. This work targets ESL learners, aiming to provide educational technologies that aid in the development of their English grammatical correctness without constraining their natural multilingualism.

[NLP-44] Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement

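TL;DR: This paper addresses how to automatically generate and filter demonstrations that include intermediate steps, in order to improve LLM performance on tasks that rely on Chain of Thought (CoT) or In-Context Learning (ICL). The key to the solution is Automatic Data Labeling and Refinement (ADLR), which starts from a small seed of manually crafted examples and automatically generates and filters demonstrations containing intermediate steps. This reduces manual effort and achieves gains of up to 5.5% on code-based table QA and mathematical reasoning.

Link: https://arxiv.org/abs/2410.10348
Authors: Joseph Shtok, Amit Alfassy, Foad Abo Dahood, Eliyahu Schwartz, Sivan Doveh, Assaf Arbelle
Keywords: Large Language Models’, Language Models’, Chain of Thought, Large Language, tasks using Chain
Categories: Computation and Language (cs.CL)
Comments: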

Abstract:It has been shown that Large Language Models’ (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task using a few examples. However, while datasets with input-output pairs are relatively easy to produce, providing demonstrations which include intermediate steps requires cumbersome manual work. These steps may be executable programs, as in agentic flows, or step-by-step reasoning as in CoT. In this work, we propose Automatic Data Labeling and Refinement (ADLR), a method to automatically generate and filter demonstrations which include the above intermediate steps, starting from a small seed of manually crafted examples. We demonstrate the advantage of ADLR in code-based table QA and mathematical reasoning, achieving up to a 5.5% gain. The code implementing our method is provided in the Supplementary material and will be made available.
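
One plausible way to filter generated demonstrations whose intermediate steps are executable programs is to run each candidate and keep only those that reproduce the known answer. The sketch below illustrates that idea. The filtering criterion, the function names, and the hand-written `candidates` (standing in for LLM outputs) are all assumptions, since the abstract does not spell out ADLR's actual filter.

```python
def run_program(code):
    """Execute a candidate program and return the value it binds to `answer`.

    Builtins are stripped for this toy example, but exec() is still unsafe
    for untrusted code; a real pipeline would need a proper sandbox.
    """
    scope = {}
    exec(code, {"__builtins__": {}}, scope)
    return scope.get("answer")


def filter_demonstrations(question, gold, candidates):
    """Keep candidates that run without error and reproduce the gold answer."""
    kept = []
    for code in candidates:
        try:
            if run_program(code) == gold:
                kept.append({"question": question, "steps": code, "answer": gold})
        except Exception:
            continue  # broken programs are discarded
    return kept


question = "Three boxes hold 4 apples each, plus 2 loose apples. How many apples?"
candidates = [  # hypothetical model outputs
    "total = 3 * 4\nanswer = total + 2",  # correct intermediate steps
    "answer = 3 + 4 + 2",                 # runs, but gives the wrong answer
    "answer = (3 * 4",                    # does not parse
]
kept = filter_demonstrations(question, 14, candidates)
print(len(kept), "demonstration(s) kept")
```

Only input-output pairs are needed as supervision here: the gold answer validates the intermediate steps, so no one has to write those steps by hand.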

[NLP-45] A Unified Approach to Routing and Cascading for LLMs

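TL;DR: This paper addresses how to select the best model for each query from a set of LLMs fine-tuned for specific tasks so as to maximize overall performance. The key to the solution is cascade routing, a new approach that combines the adaptability of routing with the cost-efficiency of cascading. Experiments show that cascade routing outperforms both plain routing and plain cascading across a variety of settings, improving output quality while lowering computational cost, and so offers a unified and efficient solution to the model selection problem.

Link: https://arxiv.org/abs/2410.10347
Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev
Keywords: targeting specific tasks, sizes targeting specific, specific tasks, widespread applicability, increased the availability
Categories: Computation and Language (cs.CL)
Comments: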

Abstract:The widespread applicability of large language models (LLMs) has increased the availability of many fine-tuned models of various sizes targeting specific tasks. Given a set of such specialized models, to maximize overall performance, it is important to figure out the optimal strategy for selecting the right model for a given user query. An effective strategy could drastically increase overall performance and even offer improvements over a single large monolithic model. Existing approaches typically fall into two categories: routing, where a single model is selected for each query, and cascading, which runs a sequence of increasingly larger models until a satisfactory answer is obtained. However, both have notable limitations: routing commits to an initial model without flexibility, while cascading requires executing every model in sequence, which can be inefficient. Additionally, the conditions under which these strategies are provably optimal remain unclear. In this work, we derive optimal strategies for both routing and cascading. Building on this analysis, we propose a novel approach called cascade routing, which combines the adaptability of routing with the cost-efficiency of cascading. Our experiments demonstrate that cascade routing consistently outperforms both routing and cascading across a variety of settings, improving both output quality and lowering computational cost, thus offering a unified and efficient solution to the model selection problem.
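
The two baseline strategies the abstract contrasts can be sketched in a few lines. This is only an illustration of plain routing and plain cascading, not the paper's optimal strategies or its cascade routing method. The `Model` fields, the quality estimates, the confidence values, and the trade-off weight `lam` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    name: str
    cost: float
    answer: Callable[[str], Tuple[str, float]]  # returns (answer, confidence)
    est_quality: Callable[[str], float]         # quality estimate before running


def route(query, models, lam):
    """Routing: commit to one model using estimated quality minus weighted cost."""
    best = max(models, key=lambda m: m.est_quality(query) - lam * m.cost)
    text, _ = best.answer(query)
    return text, best.cost


def cascade(query, models, threshold):
    """Cascading: run models from cheapest to most expensive until confident."""
    spent, text = 0.0, None
    for m in sorted(models, key=lambda m: m.cost):
        text, conf = m.answer(query)
        spent += m.cost
        if conf >= threshold:
            break
    return text, spent


def is_easy(q):
    return len(q) < 20  # toy difficulty proxy (assumption)


small = Model("small", 1.0,
              lambda q: ("small-answer", 0.9 if is_easy(q) else 0.3),
              lambda q: 0.9 if is_easy(q) else 0.3)
large = Model("large", 5.0,
              lambda q: ("large-answer", 0.95),
              lambda q: 0.95)
models = [small, large]

easy, hard = "2+2?", "Prove the statement by induction."
print(route(easy, models, lam=0.05), route(hard, models, lam=0.05))
print(cascade(easy, models, threshold=0.8), cascade(hard, models, threshold=0.8))
```

On the hard query, routing pays only for the large model, whereas cascading also pays for the failed small-model attempt. That extra cost on hard queries is the inefficiency that motivates combining the two strategies.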

[NLP-46] Locking Down the Finetuned LLMs Safety

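TL;DR: This paper addresses the safety risks that fine-tuning can introduce into LLMs. The key to the solution is SafetyLock, an alignment intervention method that exploits the similarity between the safety-related activation representations of a fine-tuned model and its base model to extract the Meta-SafetyLock, a set of safety bias directions representing the key activation patterns associated with safe responses. These directions can be applied universally to fine-tuned models to improve their safety, and re-alignment takes under 0.01 seconds with no additional computational cost. Experiments show that SafetyLock reduces the harmful instruction response rate from 60% to below 1%, surpassing traditional methods and offering a scalable, non-invasive way to keep customized LLMs safe.

Link: https://arxiv.org/abs/2410.10343
Authors: Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang
Keywords: specific downstream tasks, Fine-tuning large language, large language models, downstream tasks, large language
Categories: Computation and Language (cs.CL)
Comments: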

Abstract:Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. However, existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. Alarmingly, fine-tuning with just 10 toxic sentences can make models comply with harmful instructions. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning through efficient and transferable mechanisms. SafetyLock leverages our discovery that fine-tuned models retain similar safety-related activation representations to their base models. This insight enables us to extract what we term the Meta-SafetyLock, a set of safety bias directions representing key activation patterns associated with safe responses in the original model. We can then apply these directions universally to fine-tuned models to enhance their safety. By searching for activation directions across multiple token dimensions, SafetyLock achieves enhanced robustness and transferability. SafetyLock re-aligns fine-tuned models in under 0.01 seconds without additional computational cost. Our experiments demonstrate that SafetyLock can reduce the harmful instruction response rate from 60% to below 1% in toxic fine-tuned models. It surpasses traditional methods in both performance and efficiency, offering a scalable, non-invasive solution for ensuring the safety of customized LLMs. Our analysis across various fine-tuning scenarios confirms SafetyLock’s robustness, advocating its integration into safety protocols for aligned LLMs. The code is released at this https URL.
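
The general idea of applying a "bias direction" to activations can be illustrated with a toy activation-steering sketch. It assumes the direction is the normalized difference between mean activations on safe and unsafe responses, which is a common steering recipe and not necessarily how SafetyLock extracts its directions. The vectors and the strength `alpha` are made up.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_direction(safe_acts, unsafe_acts):
    """Unit vector pointing from mean unsafe activation to mean safe activation."""
    mu_safe = mean_vector(safe_acts)
    mu_unsafe = mean_vector(unsafe_acts)
    diff = [s - u for s, u in zip(mu_safe, mu_unsafe)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations (hypothetical), collected from a base model.
safe_acts = [[1.0, 0.0], [3.0, 0.0]]
unsafe_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = safety_direction(safe_acts, unsafe_acts)

# The same direction is then added to a fine-tuned model's hidden state.
steered = steer([0.5, 0.5], direction, alpha=2.0)
print(direction, steered)
```

Because the intervention is a vector addition at inference time, no gradient updates are needed, which is consistent with the abstract's claim that re-alignment is nearly instantaneous.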

[NLP-47] CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

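TL;DR: This paper addresses mathematical reasoning, which remains difficult for LLMs despite progress in prompting techniques such as Chain-of-Thought (CoT). The key to the solution is Chain of Mathematically Annotated Thought (CoMAT), which strengthens reasoning in two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from the symbolic representations). CoMAT relies on a single LLM with no external solver and outperforms traditional CoT on six of seven benchmarks, with gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. CoMAT also ensures faithfulness and verifiability, giving a transparent reasoning process for complex mathematical tasks.

Link: https://arxiv.org/abs/2410.10336
Authors: Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen
Keywords: Mathematically Annotated Thought, large language models, remains a significant, significant challenge, challenge for large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 8 pages, 12 figures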

Abstract:Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.

[NLP-48] Disentangling Hate Across Target Identities

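TL;DR: This paper addresses two problems with hate speech (HS) classifiers: they do not perform equally well across target identities, and their predicted hatefulness scores are systematically biased. The key to the approach is a quantitative analysis, on two recently proposed functionality test datasets, of how different factors affect HS prediction. The study finds that HS detectors assign higher hatefulness scores merely because a specific target identity is mentioned, and that models often confuse hatefulness with the polarity of emotions. It also finds that the accuracy of hatefulness prediction correlates strongly with the intensity of the stereotype, which matters for making HS detection models fairer and more accurate.

Link: https://arxiv.org/abs/2410.10332
Authors: Yiping Jin, Leo Wanner, Aneesh Moideen Koya
Keywords: perform equally, detecting hateful expressions, target identities, Hate speech, specific target identities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: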

Abstract:Hate speech (HS) classifiers do not perform equally well in detecting hateful expressions towards different target identities. They also demonstrate systematic biases in predicted hatefulness scores. Tapping on two recently proposed functionality test datasets for HS detection, we quantitatively analyze the impact of different factors on HS prediction. Experiments on popular industrial and academic models demonstrate that HS detectors assign a higher hatefulness score merely based on the mention of specific target identities. Besides, models often confuse hatefulness and the polarity of emotions. This result is worrisome as the effort to build HS detectors might harm the vulnerable identity groups we wish to protect: posts expressing anger or disapproval of hate expressions might be flagged as hateful themselves. We also carry out a study inspired by social psychology theory, which reveals that the accuracy of hatefulness prediction correlates strongly with the intensity of the stereotype.

[NLP-49] MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media

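TL;DR: This paper addresses the lack of explainability and flexibility of traditional deep learning models for mental health analysis on social media. The key to the solution is C-IMHI, a multi-task Chinese Social Media Interpretable Mental Health Instructions dataset, together with the MentalGLM series, LLMs designed for explainable mental health analysis on Chinese social media. Trained on a corpus of 50K instructions, the MentalGLM models achieve better or comparable performance on three downstream tasks relative to deep learning models, general-purpose LLMs, and task fine-tuned LLMs. Experts validated a portion of the generated decision explanations, and results on a clinical dataset indicate potential applicability in the clinical field.

Link: https://arxiv.org/abs/2410.10323
Authors: Wei Zhai, Nan Bai, Qing Zhao, Jianqiang Li, Fan Wang, Hongzhi Qi, Meng Jiang, Xiaoqin Wang, Bing Xiang Yang, Guanghui Fu
Keywords: Chinese Social Media, mental health challenges, social media, analyzing mental health, mental health
Categories: Computation and Language (cs.CL)
Comments: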

Abstract:As the prevalence of mental health challenges, social media has emerged as a key platform for individuals to express their […]. Deep learning tends to be a promising solution for analyzing mental health on social media. However, black box models are often inflexible when switching between tasks, and their results typically lack explanations. With the rise of large language models (LLMs), their flexibility has introduced new approaches to the field. Also due to the generative nature, they can be prompted to explain decision-making processes. However, their performance on complex psychological analysis still lags behind deep learning. In this paper, we introduce the first multi-task Chinese Social Media Interpretable Mental Health Instructions (C-IMHI) dataset, consisting of 9K samples, which has been quality-controlled and manually validated. We also propose MentalGLM series models, the first open-source LLMs designed for explainable mental health analysis targeting Chinese social media, trained on a corpus of 50K instructions. The proposed models were evaluated on three downstream tasks and achieved better or comparable performance compared to deep learning models, generalized LLMs, and task fine-tuned LLMs. We validated a portion of the generated decision explanations with experts, showing promising results. We also evaluated the proposed models on a clinical dataset, where they outperformed other LLMs, indicating their potential applicability in the clinical field. Our models show strong performance, validated across tasks and perspectives. The decision explanations enhance usability and facilitate better understanding and practical application of the models. Both the constructed dataset and the models are publicly available via: this https URL.

[NLP-50] EasyRAG: Efficient Retrieval-Augmented Generation Framework for Network Automated Operations

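TL;DR: This paper addresses information retrieval and generation for network automated operations by proposing EasyRAG, a lightweight and efficient retrieval-augmented generation framework. The key to the solution is a straightforward RAG scheme consisting of a specific data processing workflow, dual-route sparse retrieval for coarse ranking, an LLM reranker for reranking, and LLM answer generation and optimization. The scheme requires no model fine-tuning, occupies minimal VRAM, and is easy to deploy and highly scalable, while an inference acceleration scheme significantly reduces RAG latency and maintains good accuracy.

Link: https://arxiv.org/abs/2410.10315
Authors: Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang
Keywords: paper presents EasyRAG, network automated operations, Question Answering, retrieval-augmented generation framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures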

Abstract:This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for network automated operations. The advantages of our solution are:

  • Question Answering: We designed a straightforward RAG scheme based on (1) a specific data processing workflow (2) dual-route sparse retrieval for coarse ranking (3) LLM Reranker for reranking (4) LLM answer generation and optimization. This approach achieved first place in the GLM4 track in the preliminary round and second place in the GLM4 track in the semifinals.

  • Deployment: Our method primarily consists of BM25 retrieval and BGE-reranker reranking, requiring no fine-tuning of any models, occupying minimal VRAM, easy to deploy, and highly scalable; we provide a flexible code library with various search and generation strategies, facilitating custom process implementation.

  • Inference: We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a good level of accuracy; each acceleration scheme can be plug-and-play into any component of the RAG process, consistently enhancing the efficiency of the RAG system.

Our code and data are released at this https URL.
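
The retrieve-then-rerank pattern described above (sparse BM25 retrieval for coarse ranking, then a stronger reranker) can be sketched in pure Python. The BM25 scoring follows the standard Okapi formula with a smoothed IDF. `overlap_reranker` is a hypothetical stand-in for a neural reranker such as BGE-reranker, and the documents are invented. This is not the EasyRAG code and does not implement its dual-route retrieval.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def overlap_reranker(query, doc):
    # Placeholder reranker (assumption): share of doc tokens found in the query.
    q = set(query.lower().split())
    toks = doc.lower().split()
    return sum(tok in q for tok in toks) / len(toks)


def retrieve_then_rerank(query, docs, rerank_fn, coarse_k=3, final_k=1):
    """Coarse-rank with BM25, then reorder the survivors with rerank_fn."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    coarse = order[:coarse_k]
    reranked = sorted(coarse, key=lambda i: rerank_fn(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]


docs = [
    "restart the router to clear the alarm",
    "configure vlan on the switch port",
    "the alarm log is stored on the switch",
]
query = "clear alarm router"
scores = bm25_scores(query, docs)
top = retrieve_then_rerank(query, docs, overlap_reranker)
print(top)
```

Keeping the first stage sparse means only the `coarse_k` survivors reach the expensive reranker, which is what lets this kind of pipeline run without fine-tuning and with little VRAM.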

[NLP-51] A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification NEURIPS2024

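TL;DR: This paper addresses translation bias and accuracy when multilingual LLMs are used for cross-lingual fact-checking. The key to the approach is a systematic evaluation of two translation methods, pre-translation and self-translation, across languages; it shows that low-resource languages, being underrepresented in the training data, have lower accuracy in direct inference. Larger models perform better in self-translation, improving translation accuracy and reducing bias. The results call for balanced multilingual training, especially for low-resource languages, to promote equitable access to reliable fact-checking tools and reduce the risk of spreading misinformation in different linguistic contexts.

Link: https://arxiv.org/abs/2410.10303
Authors: Aryan Singhal, Veronica Shao, Gary Sun, Ryan Ding, Jonathan Lu, Kevin Zhu
Keywords: Large Language Models, multilingual Large Language, Large Language, rise of digital, heightened interest
Categories: Computation and Language (cs.CL)
Comments: Accepted to ATTRIB @ NeurIPS 2024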

Abstract:The rise of digital misinformation has heightened interest in using multilingual Large Language Models (LLMs) for fact-checking. This study systematically evaluates translation bias and the effectiveness of LLMs for cross-lingual claim verification across 15 languages from five language families: Romance, Slavic, Turkic, Indo-Aryan, and Kartvelian. Using the XFACT dataset to assess their impact on accuracy and bias, we investigate two distinct translation methods: pre-translation and self-translation. We use mBERT’s performance on the English dataset as a baseline to compare language-specific accuracies. Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation in the training data. Furthermore, larger models demonstrate superior performance in self-translation, improving translation accuracy and reducing bias. These results highlight the need for balanced multilingual training, especially in low-resource languages, to promote equitable access to reliable fact-checking tools and minimize the risk of spreading misinformation in different linguistic contexts.

[NLP-52] FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG

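TL;DR: This paper addresses the flat retrieval paradigm in existing retrieval-augmented generation (RAG): retrieval is treated as a one-off step with constant granularity, which places a heavy burden on a single retriever and caps retrieval performance. The key to the solution is FunnelRAG, a progressive retrieval paradigm that builds a retrieval pipeline combining coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, relieving the burden on a single retriever and raising the performance ceiling. Experiments show that FunnelRAG achieves comparable retrieval performance while reducing time overhead by nearly 40 percent.

Link: https://arxiv.org/abs/2410.10293
Authors: Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, Min Zhang
Keywords: Large Language Models, Language Models, Large Language, prevails in Large, retrieval
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, 13 tables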

Abstract:Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It mainly consists of retrieval and generation. The retrieval modules (a.k.a. retrievers) aim to find useful information used to facilitate generation modules (a.k.a. generators). As such, generators’ performance largely depends on the effectiveness and efficiency of retrievers. However, the retrieval paradigm that we design and use remains flat, which treats the retrieval procedures as a one-off deal with constant granularity. Despite effectiveness, we argue that they suffer from two limitations: (1) flat retrieval exerts a significant burden on one retriever; (2) constant granularity limits the ceiling of retrieval performance. In this work, we propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive retrieval pipeline by collaborating coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, which can relieve the burden on one retriever and also promote the ceiling of retrieval performance. Extensive experiments manifest that FunnelRAG achieves comparable retrieval performance while the time overhead is reduced by nearly 40 percent.
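
The coarse-to-fine idea can be illustrated with a toy two-stage funnel. This is a minimal sketch under stated assumptions, not FunnelRAG's actual pipeline: the corpus `DOCS`, the `cheap_score` and `costly_score` functions, and the stage sizes are all hypothetical stand-ins for a low-capacity retriever over coarse units (documents) followed by a higher-capacity scorer over fine units (sentences).

```python
from collections import Counter

# Hypothetical toy corpus: each document is a coarse unit made of sentences.
DOCS = {
    "d1": ["Paris is the capital of France.", "France borders Spain."],
    "d2": ["Berlin is the capital of Germany.", "Germany borders France."],
    "d3": ["Bananas are rich in potassium.", "Apples are a fruit."],
}

def tokens(text):
    """Lowercased, punctuation-free bag of words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())

def cheap_score(query, text):
    """Low-capacity scorer (assumption): count of shared tokens."""
    return sum((tokens(query) & tokens(text)).values())

def costly_score(query, text):
    """Higher-capacity stand-in (assumption): Jaccard overlap of token sets."""
    q, t = set(tokens(query)), set(tokens(text))
    return len(q & t) / len(q | t) if q | t else 0.0

def funnel_retrieve(query, docs, keep_docs=2, keep_sents=1):
    # Stage 1: coarse granularity, large candidate pool, cheap scorer.
    ranked_docs = sorted(
        docs, key=lambda d: cheap_score(query, " ".join(docs[d])), reverse=True
    )[:keep_docs]
    # Stage 2: fine granularity, small candidate pool, costlier scorer.
    sentences = [(d, s) for d in ranked_docs for s in docs[d]]
    sentences.sort(key=lambda ds: costly_score(query, ds[1]), reverse=True)
    return sentences[:keep_sents]

result = funnel_retrieve("What is the capital of France?", DOCS)
print(result)
```

The costlier scorer only ever sees the sentences of the documents that survived the first stage, which is where the time saving in a funnel of this kind would come from.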

[NLP-53] Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

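【Quick Read】: This paper addresses two problems in text-to-image (T2I) synthesis: existing models struggle to capture the semantic differences caused by word-order changes, and existing evaluation methods cannot reliably assess this weakness. The key to the solution is a new metric, SemVarEffect, and a benchmark, SemVarBench, which create semantic variations through two types of linguistic permutations and evaluate the causal relationship between semantic variations in inputs and outputs. Experiments show that CogView-3-Plus and Ideogram 2 handle semantic variations best, and that cross-modal alignment in the UNet or Transformer is a key factor in handling them.

Link: https://arxiv.org/abs/2410.10291
Authors: Xiangru Zhu,Penglei Sun,Yaoxian Song,Yanghua Xiao,Zhixu Li,Chengyu Wang,Jun Huang,Bei Yang,Xiaoxiao Xu
Keywords: Accurate interpretation, interpretation and visualization, semantic variations, Accurate, variations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Our benchmark and code are available at this https URL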

Abstract:Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. This often obscures poor performance on complex or uncommon linguistic patterns by the focus on frequent word combinations. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that the CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community’s exploration of human instruction understanding.

[NLP-54] A Multi-Task Text Classification Pipeline with Natural Language Explanations: A User-Centric Evaluation in Sentiment Analysis and Offensive Language Identification in Greek Tweets

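【Quick Read】: This paper addresses the problem that explanations produced by existing interpretability techniques for text classification are hard for non-expert users to understand. The key to the solution is a new pipeline that pairs a classifier with a natural-language explanation generator, so that a text classification task yields both a prediction and an easy-to-understand explanation. At its core, the approach uses a Greek large language model (LLM) to produce explanations that serve as rationales, and a user study evaluates it on sentiment analysis and offensive language identification, with promising results.

Link: https://arxiv.org/abs/2410.10290
Authors: Nikolaos Mylonas,Nikolaos Stylianou,Theodora Tsikrika,Stefanos Vrochidis,Ioannis Kompatsiaris
Keywords: past few years, existing interpretability techniques, interpretability techniques produce, existing interpretability, interpretability techniques
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work In Progress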

Abstract:Interpretability is a topic that has been in the spotlight for the past few years. Most existing interpretability techniques produce interpretations in the form of rules or feature importance. These interpretations, while informative, may be harder to understand for non-expert users and therefore cannot always be considered adequate explanations. To that end, explanations in natural language are often preferred, as they are easier to comprehend and also more presentable to end-users. This work introduces an early concept for a novel pipeline that can be used in text classification tasks, offering predictions and explanations in natural language. It comprises two models: a classifier for labelling the text and an explanation generator which provides the explanation. The proposed pipeline can be adopted by any text classification task, given that ground truth rationales are available to train the explanation generator. Our experiments are centred around the tasks of sentiment analysis and offensive language identification in Greek tweets, using a Greek Large Language Model (LLM) to obtain the necessary explanations that can act as rationales. The experimental evaluation was performed through a user study based on three different metrics and achieved promising results for both datasets.

[NLP-55] Back-of-the-Book Index Automation for Arabic Documents

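【Quick Read】: This paper aims to address the laborious and error-prone manual creation of back-of-the-book indexes for Arabic books. The key to the solution is automating the extraction and verification of index terms. First, part-of-speech analysis extracts all possible noun phrases from paragraphs on the relevant pages as candidates, which are stored in a vector database for efficient retrieval. Then several metrics (exact match, lexical similarity, and semantic similarity) score the candidates, and the highest-scoring candidate is taken as the accurate occurrence of the index term. The resulting heuristic achieves an F1-score of 0.966, offering an effective route to automating back-of-the-book index generation and checking.

Link: https://arxiv.org/abs/2410.10286
Authors: Nawal Haidar,Fadi A. Zaraket
Keywords: indexes are crucial, book readability, index, Arabic books, Abstract
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: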

Abstract:Back-of-the-book indexes are crucial for book readability. Their manual creation is laborious and error prone. In this paper, we consider automating back-of-the-book index extraction for Arabic books to help simplify both the creation and review tasks. Given a back-of-the-book index, we aim to check and identify the accurate occurrences of index terms relative to the associated pages. To achieve this, we first define a pool of candidates for each term by extracting all possible noun phrases from paragraphs appearing on the relevant index pages. These noun phrases, identified through part-of-speech analysis, are stored in a vector database for efficient retrieval. We use several metrics, including exact matches, lexical similarity, and semantic similarity, to determine the most appropriate occurrence. The candidate with the highest score based on these metrics is chosen as the occurrence of the term. We fine-tuned a heuristic method, that considers the above metrics and that achieves an F1-score of .966 (precision=.966, recall=.966). These excellent results open the door for future work related to automation of back-of-the-book index generation and checking.
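
The candidate-scoring step can be sketched as a weighted mix of the three metrics the abstract names. This is only an illustration under stated assumptions: the weights are hypothetical (not the paper's tuned values), `difflib.SequenceMatcher` stands in for the lexical metric, token overlap stands in for embedding-based semantic similarity, and English strings replace Arabic noun phrases for readability.

```python
from difflib import SequenceMatcher

def lexical_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

def semantic_similarity(a, b):
    """Stand-in (assumption): token Jaccard overlap. A real system would
    compare embedding vectors retrieved from a vector database."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def score(term, candidate, weights=(0.5, 0.25, 0.25)):
    """Weighted mix of exact match, lexical and semantic similarity.
    The weights are hypothetical, chosen only for this example."""
    w_exact, w_lex, w_sem = weights
    exact = 1.0 if term == candidate else 0.0
    return (w_exact * exact
            + w_lex * lexical_similarity(term, candidate)
            + w_sem * semantic_similarity(term, candidate))

def best_occurrence(term, candidates):
    """Pick the candidate noun phrase with the highest combined score."""
    return max(candidates, key=lambda c: score(term, c))

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

candidates = ["vector database", "index terms", "book index term", "noun phrase"]
best = best_occurrence("index term", candidates)
print(best, round(f1(0.966, 0.966), 3))
```

Because F1 is the harmonic mean of precision and recall, it equals both whenever they coincide, which is consistent with the reported precision, recall, and F1 all being 0.966.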

[NLP-56] Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis EMNLP

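【Quick Read】: This paper addresses the lack of an evaluation benchmark for modern Wu Chinese machine translation models, focusing on the challenges Wu poses as a low-resource language. The key to the solution is a FLORES+ dataset that is compatible with existing Wu data. The contributions comprise an open-source, manually translated dataset; full documentation of dataset creation and validation experiments; preliminary tools for Wu Chinese normalization and segmentation; and an analysis of the dataset's benefits and limitations, with implications for other low-resource languages.

Link: https://arxiv.org/abs/2410.10278
Authors: Hongjian Yu,Yiming Shi,Zherui Zhou,Christopher Haberland
Keywords: introduce a FLORES, existing Wu data, Chinese machine translation, machine translation models, evaluation benchmark
Categories: Computation and Language (cs.CL)
Comments: EMNLP WMT 24 Open Language Data Initiative Shared Task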

Abstract:We introduce a FLORES+ dataset as an evaluation benchmark for modern Wu Chinese machine translation models and showcase its compatibility with existing Wu data. Wu Chinese is mutually unintelligible with other Sinitic languages such as Mandarin and Yue (Cantonese), but uses a set of Hanzi (Chinese characters) that profoundly overlaps with others. The population of Wu speakers is the second largest among languages in China, but the language has been suffering from significant drop in usage especially among the younger generations. We identify Wu Chinese as a textually low-resource language and address challenges for its machine translation models. Our contributions include: (1) an open-source, manually translated dataset, (2) full documentations on the process of dataset creation and validation experiments, (3) preliminary tools for Wu Chinese normalization and segmentation, and (4) benefits and limitations of our dataset, as well as implications to other low-resource languages.

[NLP-57] QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis

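【Quick Read】: This paper addresses two limitations of Automated Data Exploration (ADE) systems on large datasets: reliance on human-specified goals, and the heavy compute and retraining that fully automated systems require. The key to the solution is QUIS, a fully automated system that works in two stages. First, the QUGen module generates questions iteratively, with no human intervention or manually curated examples. Second, the ISGen module analyzes the data to produce multiple relevant insights for each generated question without prior training, which lets QUIS adapt to new datasets.

Link: https://arxiv.org/abs/2410.10270
Authors: Abhijit Manatkar,Ashlesha Akella,Parthivi Gupta,Krishnasuri Narayanam
Keywords: Exploratory Data Analysis, Discovering meaningful insights, Large Language Models, Discovering meaningful, Exploratory Data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 6 pages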

Abstract:Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.

[NLP-58] LoLCATs: On Low-Rank Linearizing of Large Language Models

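【Quick Read】: This paper addresses three problems in linearizing large language models (LLMs): degraded model quality, high training cost, and a limit on the model sizes that can be handled. The key to the solution is a two-step method called Low-rank Linear Conversion via Attention Transfer (LoLCATs). First, attention transfer trains linear attentions to match the outputs of the original softmax attentions. Second, low-rank adaptation (LoRA) adjusts for the approximation error and recovers model quality. The method improves linearizing quality, training efficiency, and scalability, needs only 0.2% of the model parameters and 0.4% of the training tokens that past methods use, and delivers sizable gains on several large models.

Link: https://arxiv.org/abs/2410.10254
Authors: Michael Zhang,Simran Arora,Rahul Chalamala,Alan Wu,Benjamin Spector,Aaryan Singhal,Krithik Ramesh,Christopher Ré
Keywords: popular Transformer-based LLMs, Recent works show, expensive pretraining costs, linearize large language, popular Transformer-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 47 pages, 20 figures, 18 tables, preprint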

Abstract:Recent works show we can linearize large language models (LLMs) – swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention – avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM’s softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss (“attention transfer”). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods’ model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
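
The "attention transfer" objective can be sketched as an output MSE between a causal softmax attention (the teacher) and a causal linear attention (the student). The code below is a rough illustration under stated assumptions, not the LoLCATs implementation: the exponential `feature_map` and its projection matrices `w_q` and `w_k` are hypothetical, and only the loss is computed. In an actual training setup those projections would be optimized by gradient descent to drive this loss down.

```python
import math
import random

def softmax_attention(q, k, v):
    """Causal softmax attention: the quadratic 'teacher' output."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Only keys j <= i are visible to query i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total
                    for c in range(len(v[0]))])
    return out

def feature_map(x, w):
    """Hypothetical learnable feature map: exp of a linear projection,
    so every feature is strictly positive."""
    return [math.exp(sum(xi * wi for xi, wi in zip(x, col))) for col in w]

def linear_attention(q, k, v, w_q, w_k):
    """Causal linear attention using running sums, linear in sequence length."""
    f, d_v = len(w_k), len(v[0])
    kv = [[0.0] * d_v for _ in range(f)]   # running sum of phi(k_j) v_j^T
    z = [0.0] * f                          # running sum of phi(k_j)
    out = []
    for q_i, k_i, v_i in zip(q, k, v):
        phi_k = feature_map(k_i, w_k)
        for a in range(f):
            z[a] += phi_k[a]
            for c in range(d_v):
                kv[a][c] += phi_k[a] * v_i[c]
        phi_q = feature_map(q_i, w_q)
        den = sum(phi_q[a] * z[a] for a in range(f))
        out.append([sum(phi_q[a] * kv[a][c] for a in range(f)) / den
                    for c in range(d_v)])
    return out

def attention_transfer_loss(q, k, v, w_q, w_k):
    """Output MSE between the linear 'student' and the softmax 'teacher'."""
    student = linear_attention(q, k, v, w_q, w_k)
    teacher = softmax_attention(q, k, v)
    sq = [(s - t) ** 2 for rs, rt in zip(student, teacher) for s, t in zip(rs, rt)]
    return sum(sq) / len(sq)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, f = 6, 4, 8
q, k, v = rand_matrix(n, d), rand_matrix(n, d), rand_matrix(n, d)
w_q, w_k = rand_matrix(f, d), rand_matrix(f, d)  # f feature columns of length d
loss = attention_transfer_loss(q, k, v, w_q, w_k)
print(f"attention-transfer MSE: {loss:.4f}")
```

The linear variant never materializes the n-by-n attention matrix; it keeps two running sums instead, which is what makes it subquadratic in sequence length.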

[NLP-59] BanglaQuAD: A Bengali Open-domain Question Answering Dataset COLING2024 LREC

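【Quick Read】: This paper addresses question answering for Bengali, a low-resource language in natural language processing. The key to the solution is BanglaQuAD, a question answering dataset of 30,808 question-answer pairs constructed by native speakers from Bengali Wikipedia articles. The paper also proposes an annotation tool that facilitates building question answering datasets on a local machine, and a qualitative analysis demonstrates the quality of the dataset.

Link: https://arxiv.org/abs/2410.10229
Authors: Md Rashad Al Hasan Rony,Sudipto Kumar Shaha,Rakib Al Hasan,Sumon Kanti Dey,Amzad Hossain Rafi,Amzad Hossain Rafi,Ashraf Hasan Sirajee,Jens Lehmann
Keywords: natural language processing, seventh most spoken, considered a low-resource, field of natural, Bengali
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted into LREC-COLING 2024, Turin, Italy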

Abstract:Bengali is the seventh most spoken language on earth, yet considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task as it requires understanding both question and passage. Very few researchers attempted to perform question answering over Bengali (natively pronounced as Bangla) text. Typically, existing approaches construct the dataset by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, they lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset construction on a local machine. A qualitative analysis demonstrates the quality of our proposed dataset.

[NLP-60] QE-EBM: Using Quality Estimators as Energy Loss for Machine Translation

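【Quick Read】: This paper addresses the inability of reinforcement learning to exploit gradients of quality estimation (QE) models in machine translation. The key to the solution is QE-EBM, which uses the quality estimator as a trainable loss network that can backpropagate directly to the neural machine translation (NMT) model, improving translation quality, especially for low-resource target languages.

Link: https://arxiv.org/abs/2410.10228
Authors: Gahyun Yoo,Jay Yoon Lee
Keywords: shown great promise, text generation tasks, including machine translation, including machine, Reinforcement learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: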

Abstract:Reinforcement learning has shown great promise in aligning language models with human preferences in a variety of text generation tasks, including machine translation. For translation tasks, rewards can easily be obtained from quality estimation (QE) models which can generate rewards for unlabeled data. Despite its usefulness, reinforcement learning cannot exploit the gradients with respect to the QE score. We propose QE-EBM, a method of employing quality estimators as trainable loss networks that can directly backpropagate to the NMT model. We examine our method on several low and high resource target languages with English as the source language. QE-EBM outperforms strong baselines such as REINFORCE and proximal policy optimization (PPO) as well as supervised fine-tuning for all target languages, especially low-resource target languages. Most notably, for English-to-Mongolian translation, our method achieves improvements of 2.5 BLEU, 7.1 COMET-KIWI, 5.3 COMET, and 6.4 XCOMET relative to the supervised baseline.

[NLP-61] ChakmaNMT: A Low-resource Machine Translation On Chakma Language ACL

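【Quick Read】: This paper addresses the cultural and linguistic gap between the Chakma community and mainstream Bangladesh; its key solution is a machine translation (MT) model from Chakma to Bangla. The authors introduce a new dataset of 15,021 parallel samples and 42,783 monolingual samples, and fine-tune BanglaT5 with back-translation using transliteration of Chakma, achieving the highest BLEU scores on the benchmark dataset (17.8 for Chakma-to-Bangla and 4.41 for Bangla-to-Chakma). The authors describe this as the first work on MT for the Chakma language, aimed at enriching linguistic resources and preserving endangered languages.

Link: https://arxiv.org/abs/2410.10219
Authors: Aunabil Chakma,Aditya Chakma,Soham Khisa,Chumui Tripura,Masum Hasan,Rifat Shahriyar
Keywords: mainstream Bangladesh creates, maintains distinct cultural, distinct cultural traditions, indigenous Chakma population, mainstream Bangladesh
Categories: Computation and Language (cs.CL)
Comments: to be submitted in ACL findings 2025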

Abstract:The geopolitical division between the indigenous Chakma population and mainstream Bangladesh creates a significant cultural and linguistic gap, as the Chakma community, mostly residing in the hill tracts of Bangladesh, maintains distinct cultural traditions and language. Developing a Machine Translation (MT) model for Chakma to Bangla could play a crucial role in alleviating this cultural-linguistic divide. Thus, we have worked on MT between CCP-BN (Chakma-Bangla) by introducing a novel dataset of 15,021 parallel samples and 42,783 monolingual samples of the Chakma Language. Moreover, we introduce a small set for Benchmarking containing 600 parallel samples between Chakma, Bangla, and English. We ran traditional and state-of-the-art models in NLP on the training set, where fine-tuning BanglaT5 with back-translation using transliteration of Chakma achieved the highest BLEU score of 17.8 and 4.41 in CCP-BN and BN-CCP respectively on the Benchmark Dataset. As far as we know, this is the first-ever work on MT for the Chakma Language. Hopefully, this research will help to bridge the gap in linguistic resources and contribute to preserving endangered languages. Our dataset link and codes will be published soon.

[NLP-62] SkillAggregation: Reference-free LLM-Dependent Aggregation

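【Quick Read】: This paper addresses how to aggregate the predictions of multiple large language models (LLMs) when no reference labels are available. The key to the solution is a new method, SkillAggregation, which learns to combine the estimates of LLM judges without additional data or ground truth. It extends the Crowdlayer aggregation method, originally developed for image classification, to exploit the judges' estimates during inference. On HaluEval-Dialogue, TruthfulQA, and Chatbot Arena, SkillAggregation outperforms Crowdlayer on every task and gives the best performance of all compared approaches on the majority of tasks.

Link: https://arxiv.org/abs/2410.10215
Authors: Guangzhi Sun,Anmol Kagrecha,Potsawee Manakul,Phil Woodland,Mark Gales
Keywords: Large Language Models, Large Language, Language Models, generate human-like judgments, assess NLP tasks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: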

Abstract:Large Language Models (LLMs) are increasingly used to assess NLP tasks due to their ability to generate human-like judgments. Single LLMs were used initially; however, recent work suggests that using multiple LLMs as judges yields improved performance. An important step in exploiting multiple judgements is the combination stage, aggregation. Existing methods in NLP either assign equal weight to all LLM judgments or are designed for specific tasks such as hallucination detection. This work focuses on aggregating predictions from multiple systems where no reference labels are available. A new method called SkillAggregation is proposed, which learns to combine estimates from LLM judges without needing additional data or ground truth. It extends the Crowdlayer aggregation method, developed for image classification, to exploit the judge estimates during inference. The approach is compared to a range of standard aggregation methods on HaluEval-Dialogue, TruthfulQA and Chatbot Arena tasks. SkillAggregation outperforms Crowdlayer on all tasks, and yields the best performance over all approaches on the majority of tasks.

[NLP-63] Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

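【Quick Read】: This paper addresses the poor performance of large language models at generating long outputs, which is attributed mainly to the scarcity of long-output examples in alignment training data. The key to the solution is to re-tune models on carefully curated, high-quality data, improving their ability to generate long text when instructed. The study shows that a small amount of high-quality data and compute is enough to improve long-output generation notably, and that the recipe generalizes across several models.

Link: https://arxiv.org/abs/2410.10210
Authors: Yingda Chen,Xingjun Wang,Jintao Huang,Yunlin Mao,Daoze Zhang,Yuze Zhao
Keywords: support longer context, large language models, language models rapidly, models rapidly evolve, longer context
Categories: Computation and Language (cs.CL)
Comments: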

Abstract:As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause for this imbalance may arise from the lack of data with long output during alignment training. In light of this observation, attempts are made to re-align foundation models with data that fills the gap, which result in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so from the starting points of human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models, with only a small fraction of training data instances and compute. In addition, we assess the generalizability of such approaches by applying our tuning recipes to several models. Our findings suggest that, while capacities for generating long output vary across different models out-of-the-box, our approach to tune them with high-quality data using lite compute consistently yields notable improvement across all models we experimented on. We have made public our curated dataset for tuning long-writing capability, the implementations of model tuning and evaluation, as well as the fine-tuned models, all of which can be openly accessed.

[NLP-64] Effi-Code: Unleashing Code Efficiency in Language Models

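【Quick Read】: This paper addresses the insufficient efficiency and correctness of code generated by large language models (LLMs). The key to the solution is Effi-Code, which uses a self-optimization process based on overhead profiling to build a high-quality dataset of correct and efficient code samples and then fine-tunes various LLMs on it. The generated code is refined iteratively, guided by runtime performance metrics and correctness checks. Experiments show that models fine-tuned on Effi-Code improve in both code correctness and execution efficiency.

Link: https://arxiv.org/abs/2410.10209
Authors: Dong Huang,Guangtao Zeng,Jianbo Dai,Meng Luo,Han Weng,Yuhao Qing,Heming Cui,Zhijiang Guo,Jie M. Zhang
Keywords: large language models, code, large language, critical to enhance, correctness
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Under Review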

Abstract:As the use of large language models (LLMs) for code generation becomes more prevalent in software development, it is critical to enhance both the efficiency and correctness of the generated code. Existing methods and models primarily focus on the correctness of LLM-generated code, ignoring efficiency. In this work, we present Effi-Code, an approach to enhancing code generation in LLMs that can improve both efficiency and correctness. We introduce a Self-Optimization process based on Overhead Profiling that leverages open-source LLMs to generate a high-quality dataset of correct and efficient code samples. This dataset is then used to fine-tune various LLMs. Our method involves the iterative refinement of generated code, guided by runtime performance metrics and correctness checks. Extensive experiments demonstrate that models fine-tuned on the Effi-Code show significant improvements in both code correctness and efficiency across task types. For example, the pass@1 of DeepSeek-Coder-6.7B-Instruct generated code increases from 43.3% to 76.8%, and the average execution time for the same correct tasks decreases by 30.5%. Effi-Code offers a scalable and generalizable approach to improving code generation in AI systems, with potential applications in software development, algorithm design, and computational problem-solving. The source code of Effi-Code was released at this https URL.
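
The combination of correctness checks and overhead profiling can be illustrated with a small selection step. This is a sketch under stated assumptions rather than Effi-Code's pipeline: the three candidate functions are handwritten stand-ins for LLM-generated solutions, `TESTS` is a hypothetical test suite, and wall-clock time is the only overhead measured.

```python
import time

def sum_loop(n):
    """Correct but slow: O(n) loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    """Correct and fast: closed-form expression."""
    return n * (n + 1) // 2

def sum_buggy(n):
    """Fast but wrong."""
    return n * n // 2

# Hypothetical test suite: (input, expected output) pairs.
TESTS = [(0, 0), (1, 1), (10, 55), (1000, 500500)]

def is_correct(fn):
    """Correctness check: every test input must give the expected output."""
    return all(fn(arg) == expected for arg, expected in TESTS)

def overhead(fn, arg=50_000, repeats=5):
    """Crude overhead profile (assumption): best wall-clock time of a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

def select_efficient(candidates):
    """Keep only correct candidates, then return the one with lowest overhead."""
    correct = [fn for fn in candidates if is_correct(fn)]
    return min(correct, key=overhead) if correct else None

chosen = select_efficient([sum_loop, sum_buggy, sum_formula])
print(chosen.__name__)
```

Filtering on correctness before timing matters: the buggy candidate is as fast as the closed-form one, so ranking on speed alone could select wrong code.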

[NLP-65] Scalable Multi-Domain Adaptation of Language Models using Modular Experts

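【Quick Read】: This paper addresses how to adapt pre-trained language models (PLMs) to specific domains under resource constraints while balancing domain-specific performance, retention of general knowledge, and training and inference efficiency. The key to the solution is a mixture-of-experts architecture called Modular Domain Experts (MoDE), which augments a general PLM with modular domain experts that are trained independently and composed through a lightweight training process. Unlike standard low-rank adaptation methods, each MoDE expert consists of several Transformer layers, which scale better with more training examples and larger parameter counts. The architecture matches full-parameter fine-tuning on target tasks, retains general knowledge 1.65% better, and supports flexible sharding configurations that speed up training by up to 38%.

Link: https://arxiv.org/abs/2410.10181
Authors: Peter Schafhalter,Shun Liao,Yanqi Zhou,Chih-Kuan Yeh,Arun Kandoor,James Laudon
Keywords: pre-trained language models, multiple targeted tasks, language models, targeted tasks, resource-constrained use cases
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, 3 tables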

Abstract:Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially under resource-constrained use cases, such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves comparable target performances to full parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE’s architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.

[NLP-66] Is Parameter Collision Hindering Continual Learning in LLMs?

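【Quick Read】: This paper addresses catastrophic forgetting when large language models (LLMs) learn multiple tasks sequentially. The key to the solution is building non-collision parameters, which provide better task orthogonality, reduce interdependence between parameters, preserve knowledge from multiple domains, and lower the risk of forgetting previously seen data. The proposed Non-collision Low-Rank Adaptation (N-LoRA) exploits low collision rates to strengthen continual learning in LLMs, and experiments on multiple continual learning benchmarks show that it outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2410.10179
Authors: Shuo Yang,Kun-Peng Ning,Yu-Yang Liu,Jia-Yu Yao,Yong-Hong Tian,Yi-Bing Song,Li Yuan
Keywords: Large Language Models, Large Language, Language Models, making continual learning, dynamic deployment
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: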

Abstract: Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on constructing orthogonality tasks to decouple parameter interdependence from various domains. In this paper, we reveal that building non-collision parameters is a more critical factor in addressing CL challenges. Our theoretical and experimental analyses demonstrate that non-collision parameters can provide better task orthogonality, for which non-collision is a sufficient but not necessary condition. Furthermore, knowledge from multiple domains will be preserved in non-collision parameter subspaces, making it more difficult to forget previously seen data. Leveraging this insight, we propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach leveraging low collision rates to enhance CL in LLMs. Experimental results on multiple CL benchmarks indicate that N-LoRA achieves superior performance (+2.9), higher task orthogonality (4.1 times), and lower parameter collision (58.1 times) than SOTA methods.
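
To make the notion of parameter collision concrete, here is a minimal sketch. It assumes, purely for illustration, that two tasks "collide" at a position when both of their weight updates are non-negligible there; the names `collision_rate`, `frobenius_inner`, and the threshold `eps` are hypothetical and not taken from the paper. The second example shows why zero collision implies orthogonality of the two updates, while the converse need not hold.

```python
def collision_rate(delta_a, delta_b, eps=1e-8):
    """Fraction of positions where both update matrices are non-negligible."""
    total = 0
    collided = 0
    for row_a, row_b in zip(delta_a, delta_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a) > eps and abs(b) > eps:
                collided += 1
    return collided / total


def frobenius_inner(delta_a, delta_b):
    """Elementwise inner product of two matrices; zero means orthogonal."""
    return sum(a * b for ra, rb in zip(delta_a, delta_b) for a, b in zip(ra, rb))


# Toy 2x3 updates (hypothetical values).
task1 = [[0.5, 0.0, 0.0], [0.0, 0.2, 0.0]]
task2 = [[0.1, 0.0, 0.3], [0.0, 0.0, 0.0]]  # overlaps task1 at position (0, 0)
task3 = [[0.0, 0.0, 0.3], [0.4, 0.0, 0.0]]  # shares no position with task1

rate_12 = collision_rate(task1, task2)
rate_13 = collision_rate(task1, task3)
inner_13 = frobenius_inner(task1, task3)
```

In this toy setup `task1` and `task3` never touch the same entry, so their inner product is exactly zero.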

[NLP-67] HSR-Enhanced Sparse Attention Acceleration

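Quick read: This paper addresses the performance bottleneck that the high computational complexity of attention imposes on large language models (LLMs) in long-context tasks. The key idea is to exploit the inherent sparsity of attention: a Half-Space Reporting (HSR) data structure quickly identifies the non-zero or "massively activated" entries of the attention matrix, which substantially lowers the computational cost. For attention generation the running time drops from O(mn) to O(mn^{4/5}), and for full attention computation over long contexts it drops from O(mn) to O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5}). The method introduces no error for ReLU attention and only provably negligible error for Softmax attention.

Link: https://arxiv.org/abs/2410.10165
Authors: Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: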

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. This paper introduces a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with \mathsf{ReLU}^\alpha activation, \alpha \in \mathbb{N}_+), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to rapidly identify non-zero or "massively activated" entries in the attention matrix. We present theoretical analyses for two key scenarios: attention generation and full attention computation with long input context. Our approach achieves a running time of O(mn^{4/5}), significantly faster than the naive approach O(mn) for attention generation, where n is the context length, m is the query length, and d is the hidden dimension. We can also reduce the running time of full attention computation from O(mn) to O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5}). Importantly, our method introduces no error for ReLU attention and only provably negligible error for Softmax attention, where the latter is supported by our empirical validation. This work represents a significant step towards enabling efficient long-context processing in LLMs, potentially broadening their applicability across various domains.
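
The reason sparsity gives exact results for ReLU attention can be sketched in a few lines. The code below assumes attention weights of the form ReLU(q·k − b)^α normalized by their sum, which is an illustrative reading and not the paper's exact definition. The function `report_active` is a brute-force stand-in for the HSR data structure: a real half-space reporting structure would return the same index set without scanning all n keys. All names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def relu_attention_dense(q, keys, values, b=0.0, alpha=1):
    """Reference computation that touches every key."""
    weights = [max(dot(q, k) - b, 0.0) ** alpha for k in keys]
    norm = sum(weights)
    dim = len(values[0])
    if norm == 0.0:
        return [0.0] * dim
    return [sum(w * v[j] for w, v in zip(weights, values)) / norm
            for j in range(dim)]


def report_active(q, keys, b=0.0):
    """Brute-force stand-in for HSR: indices of keys in the half-space q.k > b."""
    return [i for i, k in enumerate(keys) if dot(q, k) > b]


def relu_attention_sparse(q, keys, values, b=0.0, alpha=1):
    """Same output, but only the reported (activated) keys are used."""
    active = report_active(q, keys, b)
    dim = len(values[0])
    if not active:
        return [0.0] * dim
    weights = [(dot(q, keys[i]) - b) ** alpha for i in active]
    norm = sum(weights)
    return [sum(w * values[i][j] for w, i in zip(weights, active)) / norm
            for j in range(dim)]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5], [-2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
dense_out = relu_attention_dense(q, keys, values)
sparse_out = relu_attention_sparse(q, keys, values)
```

Keys whose score falls at or below the threshold contribute exactly zero, so skipping them changes nothing; the speedup then depends entirely on how cheaply the active set can be reported.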

[NLP-68] Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree and Why?

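Quick read: This paper diagnoses hate speech classification on social media, focusing on where human annotation and machine classification disagree. It relies on cosine similarity ratios, embedding regression, and manual re-annotation. The cosine similarity ratio gives a basic description of hate speech content. Embedding regression shows that female annotators are more sensitive to racial slurs targeting Black people. A classifier built from NV-Embed-v2 text embeddings followed by logistic regression reaches 94% test accuracy. Although human annotations are treated as ground truth in the training set, the machine does better on long statements of fact and worse on short instances of swear words, possibly because the alignment applied when the model was created, which keeps it from producing obvious hate speech, also reduces its ability to detect such content.

Link: https://arxiv.org/abs/2410.10153
Authors: Xilin Yang
Keywords: hate speech, cosine similarity ratio, Measuring Hate Speech, hate speech classification, diagnose hate speech
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: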

Abstract: This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We begin by computing the cosine similarity ratio on the "Measuring Hate Speech" dataset, which contains 135,556 annotated social media comments. This way, we show a basic use of cosine similarity as a description of hate speech content. We then diagnose hate speech classification, starting from the inconsistency of human annotation in the dataset. Using embedding regression as a basic diagnostic, we found that female annotators are more sensitive to racial slurs that target the black population. We then perform a more complicated diagnostic by training a hate speech classifier that uses a SoTA pre-trained large language model, NV-Embed-v2, to convert texts to embeddings and then runs a logistic regression. This classifier achieves a testing accuracy of 94%. In diagnosing where machines disagree with human annotators, we found that machines make fewer mistakes than humans despite the fact that human annotations are treated as ground truth in the training set. Machines perform better in correctly labeling long statements of facts, but perform worse in labeling short instances of swear words. We hypothesize that this is due to model alignment: while curating models at their creation prevents the models from producing obvious hate speech, it also reduces the model's ability to detect such content.

[NLP-69] Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting

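Quick read: This paper studies the safety mechanisms of instruction fine-tuned large language models (LLMs) and finds that re-weighting multilayer perceptron (MLP) neurons can significantly weaken a model's safety, especially during end-of-sentence inference. It hypothesizes that LLMs assess the harmfulness of a prompt during end-of-sentence inference and that MLP layers play a key role in this process. Based on this hypothesis, it develops two white-box jailbreak methods: a prompt-specific method that optimizes the attack on the fly for an individual prompt, and a prompt-general method that is pre-trained offline and generalizes to unseen harmful prompts. Both perform robustly on 7 popular open-source LLMs ranging in size from 2B to 72B. The work exposes safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of their internal mechanisms.

Link: https://arxiv.org/abs/2410.10150
Authors: Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
Keywords: instruction fine-tuned large, fine-tuned large language, large language models, instruction fine-tuned, fine-tuned large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: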

Abstract: In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop 2 novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, with sizes ranging from 2B to 72B. Furthermore, our study provides insights into the safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of the internal mechanisms of LLMs.

[NLP-70] alpha-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

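Quick read: This paper addresses the alignment of large language models (LLMs) with human values and intentions, in particular the computational-efficiency and training-stability challenges of reinforcement learning from human feedback (RLHF). The key contribution is \alpha-DPO, an adaptive preference optimization algorithm that introduces a dynamic reward margin to overcome the limitations of existing methods such as DPO and SimPO. Specifically, \alpha-DPO uses an adaptive preference distribution that balances the policy model and the reference model to obtain personalized reward margins, and it balances alignment and diversity through theoretical guarantees and KL divergence control. Empirical evaluation shows that \alpha-DPO outperforms DPO and SimPO across various model settings and significantly improves win rates.

Link: https://arxiv.org/abs/2410.10148
Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Rong Jin, Xiangnan He
Keywords: Aligning large language, Aligning large, large language models, Direct Preference Optimization, Simple Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: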

Abstract: Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose \alpha-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, \alpha-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for \alpha-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that \alpha-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at this https URL
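
For orientation, the two baselines named in the abstract can be written as per-pair losses. The sketch below uses the commonly stated forms of the DPO loss (reference-model log-ratio) and the SimPO loss (length-normalized log-probability with a fixed margin `gamma`). It does not attempt to reproduce the adaptive margin of \alpha-DPO, whose exact form is not given in the abstract. The default hyperparameters and the example log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO: compares policy-vs-reference log-ratios of chosen (w) and rejected (l)."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def simpo_loss(pol_w, pol_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: no reference model; length-normalized rewards and a fixed margin."""
    reward_gap = beta * (pol_w / len_w - pol_l / len_l)
    return -log_sigmoid(reward_gap - gamma)


# Sequence log-probabilities for one preference pair (hypothetical numbers).
loss_dpo = dpo_loss(pol_w=-12.0, pol_l=-15.0, ref_w=-13.0, ref_l=-14.0)
loss_simpo = simpo_loss(pol_w=-12.0, pol_l=-15.0, len_w=10, len_l=10)
```

The fixed `gamma` in `simpo_loss` is the quantity the abstract criticizes as a one-size-fits-all target margin; a dynamic margin would make that term depend on the example.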

[NLP-71] Unified Representation of Genomic and Biomedical Concepts through Multi-Task Multi-Source Contrastive Learning

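Quick read: This paper addresses the mapping of complex relationships between genomic data and clinical concepts. The key contribution is the GENomic Encoding REpresentation with Language Model (GENEREL) framework. GENEREL fine-tunes a language model to infuse biological knowledge into clinical concepts and builds a unified embedding space that covers a wide range of common SNPs and biomedical concepts. Through multi-task contrastive learning it aligns the embeddings of SNPs and clinical concepts, capturing the nuanced relationships between them and bypassing the limitations of traditional code mapping systems across different data sources.

Link: https://arxiv.org/abs/2410.10144
Authors: Hongyi Yuan, Suqi Liu, Kelly Cho, Katherine Liao, Alexandre Pereira, Tianxi Cai
Keywords: introduce GENomic Encoding, GENomic Encoding REpresentation, GENomic Encoding, Encoding REpresentation, biomedical knowledge bases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
Comments: 15 pages, 2 figures, 5 tables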

Abstract:We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL’s ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined identification of concepts. This pioneering approach in constructing a unified embedding system for both SNPs and biomedical concepts enhances the potential for data integration and discovery in biomedical research.

[NLP-72] Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation EMNLP2024

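Quick read: This paper studies how the decoding temperature affects the efficiency of speculative decoding in autoregressive language models. It examines speculative decoding under different temperature settings, with particular attention to high-temperature generation. Starting from knowledge distillation (KD), it highlights the difficulty of decoding at higher temperatures, shows that KD under a consistent temperature setting could be a remedy, and takes an initial step toward further speedups in high-temperature generation. The results underline how strongly generation configurations affect speculative decoding and the need for methods that target diverse decoding configurations.

Link: https://arxiv.org/abs/2410.10141
Authors: Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen
Keywords: Speculative decoding, Speculative decoding stands, inference in autoregressive, decoding, pivotal technique
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings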

Abstract: Speculative decoding stands as a pivotal technique to expedite inference in autoregressive (large) language models. This method employs a smaller draft model to speculate a block of tokens, which the target model then evaluates for acceptance. Despite a wealth of studies aimed at increasing the efficiency of speculative decoding, the influence of generation configurations on the decoding process remains poorly understood, especially concerning decoding temperatures. This paper delves into the effects of decoding temperatures on speculative decoding's efficacy. Beginning with knowledge distillation (KD), we first highlight the challenge of decoding at higher temperatures, and demonstrate that KD in a consistent temperature setting could be a remedy. We also investigate the effects of out-of-domain testing sets with out-of-range temperatures. Building upon these findings, we take an initial step to further the speedup for speculative decoding, particularly in a high-temperature generation setting. Our work offers new insights into how generation configurations drastically affect the performance of speculative decoding, and underscores the need for developing methods that focus on diverse decoding configurations. Code is publicly available at this https URL.
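
A small sketch can show where temperature enters speculative decoding. It assumes the standard speculative sampling rule, in which a drafted token x is accepted with probability min(1, p(x)/q(x)) for target distribution p and draft distribution q, so the expected acceptance rate at one position is the sum over tokens of min(p(x), q(x)). The logits and temperatures below are hypothetical, and the sketch makes no claim about the trend reported in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_acceptance(target_logits, draft_logits, temperature):
    """Expected probability that one drafted token is accepted."""
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical next-token logits for a 4-token vocabulary.
target_logits = [4.0, 2.0, 1.0, 0.5]
draft_logits = [3.0, 3.5, 0.5, 0.0]

rates = {t: expected_acceptance(target_logits, draft_logits, t)
         for t in (0.2, 0.7, 1.5)}
```

Because both distributions are reshaped by the same temperature, the acceptance rate is a function of the generation configuration and not only of how well the draft model was distilled.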

[NLP-73] MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

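Quick read: This paper addresses shortcomings in how interleaved multimodal comprehension and generation are evaluated: existing benchmarks are limited in data scale, scope, and evaluation depth, and existing metrics are costly, biased, or unreliable. The key contribution is MMIE, a large-scale, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE contains 20K carefully curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields. It supports interleaved inputs and outputs and mixes multiple-choice and open-ended question formats to assess diverse competencies. The paper also proposes a scoring model fine-tuned on human-annotated data, intended to reduce bias and improve evaluation accuracy.

Link: https://arxiv.org/abs/2410.10139
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
Keywords: Interleaved multimodal comprehension, arbitrary sequences, produce and interpret, interpret both images, images and text
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: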

Abstract:Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code in this https URL.

[NLP-74] Beyond-RAG: Question Identification and Answer Generation in Real-Time Conversations

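Quick read: This paper addresses the long average handling time (AHT) in customer contact centers that results from human agents manually interpreting queries and retrieving relevant knowledge base articles. The key contribution is a decision support system that identifies customer questions in real time. If a query matches a frequently asked question (FAQ), the answer is retrieved directly from the FAQ database; otherwise it is generated by a retrieval augmented generation (RAG) system. This reduces reliance on manual queries and delivers responses to agents within 2 seconds, improving efficiency, reducing AHT, and lowering operational costs. The paper also introduces an automated LLM-agentic workflow that identifies FAQs from historical transcripts when no predefined FAQs exist.

Link: https://arxiv.org/abs/2410.10136
Authors: Garima Agrawal, Sashank Gummuluri, Cosimo Spera
Keywords: relevant knowledge base, long average handling, average handling times, customer contact centers, retrieve relevant knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: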

Abstract:In customer contact centers, human agents often struggle with long average handling times (AHT) due to the need to manually interpret queries and retrieve relevant knowledge base (KB) articles. While retrieval augmented generation (RAG) systems using large language models (LLMs) have been widely adopted in industry to assist with such tasks, RAG faces challenges in real-time conversations, such as inaccurate query formulation and redundant retrieval of frequently asked questions (FAQs). To address these limitations, we propose a decision support system that can look beyond RAG by first identifying customer questions in real time. If the query matches an FAQ, the system retrieves the answer directly from the FAQ database; otherwise, it generates answers via RAG. Our approach reduces reliance on manual queries, providing responses to agents within 2 seconds. Deployed in AI-powered human-agent assist solution at Minerva CQ, this system improves efficiency, reduces AHT, and lowers operational costs. We also introduce an automated LLM-agentic workflow to identify FAQs from historical transcripts when no predefined FAQs exist.
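
The FAQ-first routing described in the abstract can be sketched as follows. Everything here is a hypothetical stand-in: the matching uses simple token-overlap (Jaccard) similarity with an arbitrary threshold, and `fake_rag` is a placeholder for a real RAG pipeline. The abstract does not say how the deployed system matches questions.

```python
def tokenize(text):
    """Lowercase and split on whitespace, ignoring question marks."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def answer(question, faq, rag_generate, threshold=0.6):
    """Return ("faq", answer) on a confident FAQ match, else ("rag", answer)."""
    q_tokens = tokenize(question)
    best_question, best_score = None, 0.0
    for faq_question in faq:
        score = jaccard(q_tokens, tokenize(faq_question))
        if score > best_score:
            best_question, best_score = faq_question, score
    if best_question is not None and best_score >= threshold:
        return "faq", faq[best_question]
    return "rag", rag_generate(question)


# Hypothetical FAQ store and RAG placeholder.
faq = {
    "How do I reset my password?":
        "Use the 'Forgot password' link on the sign-in page.",
}


def fake_rag(question):
    return "[generated answer for: " + question + "]"


route_a, reply_a = answer("how do i reset my password", faq, fake_rag)
route_b, reply_b = answer("Why was my card charged twice?", faq, fake_rag)
```

The first call is served from the FAQ store, while the second falls through to generation, which is the step the system tries to avoid for repeated questions.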

[NLP-75] FormalAlign: Automated Alignment Evaluation for Autoformalization

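Quick read: This paper addresses semantic alignment between natural language and formal language in autoformalization. The key contribution is the FormalAlign framework, which is trained with a dual loss that combines the autoformalization sequence generation task with a representational alignment task between input and output, enabling automated evaluation of how well natural and formal statements align. Experiments show that FormalAlign outperforms GPT-4 on several benchmarks and reduces the need for manual verification.

Link: https://arxiv.org/abs/2410.10135
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
Keywords: convert informal mathematical, informal mathematical proofs, machine-verifiable formats, bridging the gap, aims to convert
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 23 pages, 13 tables, 3 figures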

Abstract: Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce FormalAlign, the first automated framework designed for evaluating the alignment between natural and formal languages in autoformalization. FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, FormalAlign demonstrates superior performance. In our experiments, FormalAlign outperforms GPT-4, achieving an Alignment-Selection Score 11.58% higher on \forml-Basic (99.21% vs. 88.91%) and 3.19% higher on MiniF2F-Valid (66.39% vs. 64.34%). This effective alignment evaluation significantly reduces the need for manual verification. Both the dataset and code can be accessed via this https URL.

[NLP-76] Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models

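Quick read: This paper addresses how to personalize prompt learning for pre-trained vision-language models such as CLIP more effectively in federated learning (FL). Traditional FL restricts each client to downloading a single globally aggregated model, which is unnecessarily rigid for lightweight prompts. The key contribution is the Personalized Federated Mixture of Adaptive Prompts (pFedMoAP) framework, which uses a Mixture of Experts (MoE) mechanism to let clients download multiple pre-aggregated prompts as non-local experts. These non-local experts are sparsely selected from a server-maintained pool and combined through a local attention-based gating network that generates enhanced text features better aligned with the client's local image data.

Link: https://arxiv.org/abs/2410.10114
Authors: Jun Luo, Chen Chen, Shandong Wu
Keywords: diverse downstream tasks, demonstrated potent applicability, pre-trained Vision-Language Models, Prompt learning, downstream tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures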

Abstract:Prompt learning for pre-trained Vision-Language Models (VLMs) like CLIP has demonstrated potent applicability across diverse downstream tasks. This lightweight approach has quickly gained traction from federated learning (FL) researchers who seek to efficiently adapt VLMs to heterogeneous scenarios. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where the participating clients are generally only allowed to download a single globally aggregated model from the server. While justifiable for training full-sized models under federated settings, in this work, we argue that this paradigm is ill-suited for lightweight prompts. By facilitating the clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data on the client, benefiting from both local and downloaded non-local adaptive prompt experts. The non-local experts are sparsely selected from a server-maintained pool, fostering collaborative learning across clients. To evaluate the proposed algorithm, we conduct extensive experiments across 9 datasets under various heterogeneous federated settings. The results show that pFedMoAP consistently outperforms the state-of-the-art alternatives, underscoring its efficacy in personalizing prompt learning for CLIP within the federated learning paradigm.
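
The idea of an attention-based gate over prompt experts can be illustrated with a minimal sketch. It assumes, for illustration only, that each expert contributes one text feature vector, that the local expert's feature acts as the query, and that the gate is a scaled dot-product softmax. The actual pFedMoAP gating network is learned and may differ. The name `gate_experts` and all vectors are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gate_experts(query, expert_features):
    """Mix expert features with softmax attention weights computed from the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in expert_features])
    dim = len(expert_features[0])
    mixed = [sum(w * f[j] for w, f in zip(weights, expert_features))
             for j in range(dim)]
    return weights, mixed


local_feature = [1.0, 0.0]                            # local expert
experts = [local_feature, [0.0, 1.0], [0.5, 0.5]]     # local + two non-local
weights, mixed = gate_experts(local_feature, experts)
```

In this toy case the expert most similar to the local feature receives the largest weight, while the downloaded experts still contribute to the mixed feature.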

[NLP-77] Can We Predict Performance of Large Models across Vision-Language Tasks?

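Quick read: This paper addresses the high cost of evaluating large vision-language models (LVLMs), which stems from heavy computation and the wide variety of tasks. The key contribution is a matrix completion framework based on probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC) for predicting model performance on untested tasks. The paper builds a sparse performance matrix, completes it with PMF and MCMC to predict unknown scores, and estimates the uncertainty of each prediction. Practitioners can then evaluate high-uncertainty tasks first to reduce prediction error quickly. Several enhancements are proposed for the case of sparse observations to improve accuracy and reliability.

Link: https://arxiv.org/abs/2410.10112
Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
Keywords: Evaluating large vision-language, high computational costs, Evaluating large, large vision-language models, performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Under Review. Project page: this https URL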

Abstract: Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix \boldsymbol{R}, where each entry R_{mn} represents the performance score of the m-th model on the n-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. In experiments, we systematically evaluate 108 LVLMs on 176 datasets from 36 benchmarks, constructing training and testing sets for validating our framework. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
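
The matrix completion framing can be illustrated with a small sketch. Note the substitution: the code fits a low-rank factorization with plain regularized stochastic gradient descent, which gives a point estimate only. It is not the paper's PMF-with-MCMC procedure and yields no uncertainty estimate. The scores, rank, and hyperparameters are hypothetical.

```python
import random


def factorize(observed, n_models, n_tasks, rank=2, epochs=500,
              lr=0.05, reg=0.01, seed=0):
    """Fit R[m][n] ~ U[m] . V[n] on the observed entries with SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_tasks)]
    for _ in range(epochs):
        for (m, n), score in observed.items():
            pred = sum(U[m][k] * V[n][k] for k in range(rank))
            err = score - pred
            for k in range(rank):
                u, v = U[m][k], V[n][k]
                U[m][k] += lr * (err * v - reg * u)
                V[n][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, m, n):
    return sum(u * v for u, v in zip(U[m], V[n]))


def rmse(U, V, observed):
    errors = [(s - predict(U, V, m, n)) ** 2 for (m, n), s in observed.items()]
    return (sum(errors) / len(errors)) ** 0.5


# Hypothetical scores for 3 models x 3 datasets; entry (2, 2) is unobserved.
observed = {
    (0, 0): 0.80, (0, 1): 0.60, (0, 2): 0.40,
    (1, 0): 0.60, (1, 1): 0.45, (1, 2): 0.30,
    (2, 0): 0.40, (2, 1): 0.30,
}
U0, V0 = factorize(observed, 3, 3, epochs=0)   # untrained factors
U, V = factorize(observed, 3, 3, epochs=500)   # trained factors
missing_score = predict(U, V, 2, 2)            # estimate for the unseen entry
```

Replacing the point estimate with posterior samples, as MCMC would provide, is what lets the paper's framework rank untested tasks by uncertainty.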

[NLP-78] Learning Linear Attention in Polynomial Time

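Quick read: This paper addresses the learnability of Transformer models that simulate Boolean circuits or Turing machines, in particular whether such simulators can be learned from observational data. The key result is that strong, agnostic PAC learning of single-layer Transformers with linear attention is feasible in polynomial time. By viewing linear attention as a linear predictor in a suitably defined reproducing kernel Hilbert space (RKHS), the paper converts the problem of learning a linear Transformer into learning an ordinary linear predictor in an expanded feature space, which can then be converted back into a multi-headed linear Transformer. It also shows how to efficiently identify training datasets for which the learned model generalizes correctly to all inputs. Experiments support the theory, and examples of polynomial-time learnable computations include associative memories, finite automata, and a class of Universal Turing Machines with polynomially bounded computation histories.

Link: https://arxiv.org/abs/2410.10101
Authors: Morris Yau, Ekin Akyurek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Keywords: simulating Boolean circuits, Previous research, simulating Boolean, Boolean circuits, Universal Turing Machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
Comments: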

Abstract: Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key–value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
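
The claim that linear attention is a linear predictor in an expanded feature space can be illustrated with a minimal sketch. It assumes a single head read out at the last token, with no softmax or normalization; the matrix `A` (standing in for the product of query and key projections), the value map `V`, and the cubic feature map `features` are illustrative choices and may differ from the paper's exact construction.

```python
import itertools
import random


def linear_attention(Z, A, V):
    """Single-head linear attention read out at the last token.

    Z: list of n tokens, each a list of d floats.
    A: d x d matrix standing in for W_Q^T W_K (assumption).
    V: d_out x d value map. No softmax is applied.
    """
    q = Z[-1]
    d, d_out = len(q), len(V)
    out = [0.0] * d_out
    for z in Z:
        # Unnormalized attention score q^T A z.
        score = sum(q[a] * A[a][b] * z[b] for a in range(d) for b in range(d))
        for k in range(d_out):
            out[k] += score * sum(V[k][c] * z[c] for c in range(d))
    return out


def features(Z):
    """Cubic feature map: phi[a, b, c] = q_a * sum_i z_ib * z_ic."""
    q = Z[-1]
    d = len(q)
    return {
        (a, b, c): q[a] * sum(z[b] * z[c] for z in Z)
        for a, b, c in itertools.product(range(d), repeat=3)
    }


def linear_predictor(Z, A, V):
    """Same output, computed as a linear function of features(Z)."""
    phi = features(Z)
    return [
        sum(A[a][b] * V[k][c] * val for (a, b, c), val in phi.items())
        for k in range(len(V))
    ]


rng = random.Random(0)
d, n, d_out = 3, 5, 2
Z = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d_out)]
y_attn = linear_attention(Z, A, V)
y_lin = linear_predictor(Z, A, V)
```

Both functions return the same vector up to floating-point error, because the output is linear in the weights `A[a][b] * V[k][c]` once the input is replaced by its features. Fitting those weights is then an ordinary linear regression problem.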

[NLP-79] How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective EMNLP2024

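[Quick read]: This paper addresses how to efficiently align large language models with offline demonstration data. The key to the solution is a new generalized self-imitation learning (GSIL) framework, which derives a surrogate objective for imitation learning via density ratio estimation, so that self-generated data can be used and the imitation objective can be optimized with simple classification losses. GSIL avoids the complex adversarial training of standard imitation learning, giving lightweight and efficient fine-tuning of large language models. Its family of offline losses, parameterized by a general class of convex functions, offers a unified view of alignment with demonstration data.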

Link: https://arxiv.org/abs/2410.10093
Authors: Teng Xiao,Mingxiao Li,Yige Yuan,Huaisheng Zhu,Chao Cui,Vasant G Honavar
Keywords: GSIL, generalized self-imitation learning, efficiently aligns large, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Main

Abstract:This paper introduces a novel generalized self-imitation learning (GSIL) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop GSIL by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. GSIL eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, GSIL encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view for alignment with demonstration data. Extensive experiments show that GSIL consistently and significantly outperforms baselines in many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K) and instruction following (MT-Bench).
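
The idea of optimizing an imitation objective with a simple classification loss can be sketched as follows. This is a generic logistic loss that treats demonstrations as positives and self-generated responses as negatives. Parameterizing the classifier logit as a scaled log-ratio between the policy and a reference model, and the names `classification_loss` and `beta`, are assumptions for illustration, not the loss defined in the paper.

```python
import math


def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def classification_loss(logr_demo, logr_self, beta=1.0):
    """Logistic loss separating demonstrations from self-generated samples.

    logr_demo, logr_self: per-response values of
    log pi_theta(y|x) - log pi_ref(y|x) (assumed parameterization).
    """
    pos = [-log_sigmoid(beta * r) for r in logr_demo]   # label 1
    neg = [-log_sigmoid(-beta * r) for r in logr_self]  # label 0
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))


# With no separation the loss equals log 2; separating the two sets lowers it.
loss_start = classification_loss([0.0], [0.0])
loss_separated = classification_loss([2.0], [-2.0])
```

At the optimum of such a logistic loss the classifier logit estimates the log density ratio between the two data sources, which is why no adversarial discriminator training loop is needed in this sketch.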

[NLP-80] RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates

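[Quick read]: This paper addresses the compute and memory efficiency of parameter-efficient fine-tuning (PEFT) for large-scale language models (LMs). The key to the solution is RoCoFT, which fine-tunes by updating only a few rows and columns of the Transformer weight matrices, matching the accuracy of state-of-the-art PEFT methods while being more compute- and memory-efficient. Using tools from neural tangent kernel theory, the authors also explain why the method works, showing that the kernel it constructs is numerically close to the full-parameter kernel and gives comparable classification performance.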

Link: https://arxiv.org/abs/2410.10075
Authors: Md Kowsher,Tara Esmaeilbeig,Chun-Nam Yu,Mojtaba Soltanalian,Niloofar Yousefi
Keywords: large-scale language models, parameter-efficient fine-tuning method, propose RoCoFT, language models, based on updating
Categories: Computation and Language (cs.CL)
Comments: RoCoFT is a parameter-efficient method

Abstract:We propose RoCoFT, a parameter-efficient fine-tuning method for large-scale language models (LMs) based on updating only a few rows and columns of the weight matrices in transformers. Through extensive experiments with medium-size LMs like BERT and RoBERTa, and larger LMs like Bloom-7B, Llama2-7B, and Llama2-13B, we show that our method gives comparable or better accuracies than state-of-the-art PEFT methods while also being more memory and computation-efficient. We also study the reason behind the effectiveness of our method with tools from neural tangent kernel theory. We empirically demonstrate that our kernel, constructed using a restricted set of row and column parameters, is numerically close to the full-parameter kernel and gives comparable classification performance. Ablation studies are conducted to investigate the impact of different algorithmic choices, including the selection strategy for rows and columns as well as the optimal rank for effective implementation of our method.
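
A minimal sketch of a row-and-column update is shown below, assuming plain SGD on a small dense matrix. The function name `rocoft_step`, the learning rate, and the choice to update rows and columns in the same step are illustrative assumptions; the paper's selection strategy and training setup are not reproduced here.

```python
def rocoft_step(W, grad, rows, cols, lr=0.1):
    """Apply one SGD step only to the selected rows and columns of W.

    Entries outside the chosen rows and columns stay frozen.
    """
    rows, cols = set(rows), set(cols)
    return [
        [
            w - lr * g if (i in rows or j in cols) else w
            for j, (w, g) in enumerate(zip(w_row, g_row))
        ]
        for i, (w_row, g_row) in enumerate(zip(W, grad))
    ]


W = [[1.0] * 4 for _ in range(4)]
grad = [[1.0] * 4 for _ in range(4)]
W_new = rocoft_step(W, grad, rows=[0], cols=[3])

# One row (4 entries) plus one column (4 entries) share a single entry.
trainable = sum(w != 1.0 for row in W_new for w in row)
```

For an `n x m` matrix with `r` rows and `c` columns selected, the trainable entries number `r*m + c*n - r*c`, which is small compared with `n*m` when `r` and `c` are small.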

[NLP-81] Divide Reweight and Conquer: A Logit Arithmetic Approach for In-Context Learning

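[Quick read]: This paper addresses the performance degradation and quadratically growing compute cost that large language models (LLMs) face in in-context learning (ICL) as the number of examples increases. The key to the solution is the Logit Arithmetic Reweighting Approach (LARA), which reweights multiple demonstrations through logit-based ensembling: long demonstration inputs are split into shorter inputs that can be processed in parallel, which significantly lowers memory requirements, and the information is aggregated with a non-gradient optimization method. Binary LARA (B-LARA) further restricts the weights to binary values, simplifying the search space and reducing memory use. Experiments show that LARA and B-LARA outperform all baselines in accuracy and memory efficiency and generalize well across different numbers of examples.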

Link: https://arxiv.org/abs/2410.10074
Authors: Chengsong Huang,Langlin Huang,Jiaxin Huang
Keywords: Large Language Models, updating model parameters, Large Language, Language Models, In-Context Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leveraging task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregates the information by reweighting logits of each group via a non-gradient optimization approach. We further introduce Binary LARA (B-LARA), a variant that constrains weights to binary values to simplify the search space and reduces memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios of varying numbers of examples from limited to many-shot demonstrations.
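
The logit reweighting step can be sketched as a weighted sum of per-group logits, with the binary variant choosing which groups to keep. The exhaustive search below is a stand-in for the paper's non-gradient optimizer and is only practical for a handful of groups; the function names and the toy validation set are hypothetical.

```python
import itertools


def combine(group_logits, weights):
    """Weighted sum of per-group logit vectors over the same label set."""
    n_labels = len(group_logits[0])
    return [
        sum(w * logits[j] for w, logits in zip(weights, group_logits))
        for j in range(n_labels)
    ]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def search_binary_weights(val_set):
    """Pick binary group weights that maximize validation accuracy.

    val_set: list of (group_logits, gold_label) pairs, where
    group_logits holds one logit vector per demonstration group.
    """
    n_groups = len(val_set[0][0])
    best, best_acc = None, -1.0
    for w in itertools.product([0, 1], repeat=n_groups):
        if not any(w):
            continue  # keep at least one demonstration group
        hits = sum(argmax(combine(g, w)) == y for g, y in val_set)
        acc = hits / len(val_set)
        if acc > best_acc:
            best, best_acc = w, acc
    return best, best_acc


# Toy data: groups 0 and 1 are informative, group 2 is confidently wrong.
val_set = [
    ([[2.0, 0.0], [1.0, 0.0], [0.0, 5.0]], 0),
    ([[0.0, 2.0], [0.0, 1.0], [5.0, 0.0]], 1),
]
weights, accuracy = search_binary_weights(val_set)
```

A group whose weight is zero never needs to be run at inference time, which is where the memory saving of the binary variant comes from in this sketch.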

[NLP-82] Ukrainian-to-English folktale corpus: Parallel corpus creation and augmentation for machine translation in low-resource languages

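[Quick read]: This paper addresses the scarcity of translation resources for Ukrainian folktales. The key to the solution is a Ukrainian-to-English parallel corpus, built and augmented with a strategy that incorporates domain-specific knowledge. The corpus is word- and sentence-aligned to preserve meaning when it is used as training data for machine translation models.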

Link: https://arxiv.org/abs/2410.10063
Authors: Olena Burda-Lassen
Keywords: source language, linguistically very rich, rich and culturally, culturally significant, significant in understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Folktales are linguistically very rich and culturally significant in understanding the source language. Historically, only human translation has been used for translating folklore. Therefore, the number of translated texts is very sparse, which limits access to knowledge about cultural traditions and customs. We have created a new Ukrainian-To-English parallel corpus of familiar Ukrainian folktales based on available English translations and suggested several new ones. We offer a combined domain-specific approach to building and augmenting this corpus, considering the nature of the domain and differences in the purpose of human versus machine translation. Our corpus is word and sentence-aligned, allowing for the best curation of meaning, specifically tailored for use as training data for machine translation models.

[NLP-83] AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality

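[Quick read]: This paper addresses expert redundancy when Low-Rank Adaptation (LoRA) is combined with Mixture-of-Experts (MoE) structures in large language models (LLMs). The key to the solution is AlphaLoRA, a fine-grained allocation strategy designed with Heavy-Tailed Self-Regularization (HT-SR) theory. The method is theoretically principled and training-free: it allocates LoRA experts non-uniformly according to differences in per-layer training quality, which reduces redundancy and improves performance on a range of tasks.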

Link: https://arxiv.org/abs/2410.10054
Authors: Peijun Qing,Chongyang Gao,Yefan Zhou,Xingjian Diao,Yaoqing Yang,Soroush Vosoughi
Keywords: Parameter-efficient fine-tuning methods, Parameter-efficient fine-tuning, Low-Rank Adaptation, efficiency in Large, Large Language Models
Categories: Computation and Language (cs.CL)
Comments: The 2024 Conference on Empirical Methods in Natural Language Processing

Abstract:Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), are known to enhance training efficiency in Large Language Models (LLMs). Due to the limited parameters of LoRA, recent studies seek to combine LoRA with Mixture-of-Experts (MoE) to boost performance across various tasks. However, inspired by the observed redundancy in traditional MoE structures, previous studies identify similar redundancy among LoRA experts within the MoE architecture, highlighting the necessity for non-uniform allocation of LoRA experts across different layers. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory to design a fine-grained allocation strategy. Our analysis reveals that the number of experts per layer correlates with layer training quality, which exhibits significant variability across layers. Based on this, we introduce AlphaLoRA, a theoretically principled and training-free method for allocating LoRA experts to further mitigate redundancy. Experiments on three models across ten language processing and reasoning benchmarks demonstrate that AlphaLoRA achieves comparable or superior performance over all baselines. Our code is available at this https URL.

[NLP-84] LoRE: Logit-Ranked Retriever Ensemble for Enhancing Open-Domain Question Answering

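[Quick read]: This paper addresses positional bias in retrieval-based question answering systems, which leads to suboptimal answer generation. The key to the solution is LoRE (Logit-Ranked Retriever Ensemble), which improves answer accuracy and relevance by mitigating positional bias. Its core innovation is a logit-based answer ranking algorithm that combines logit scores from a large language model (LLM) with the retrieval ranks of the passages. Using an ensemble of retrievers (such as BM25 and sentence transformers with FAISS indexing), experiments on NarrativeQA and SQuAD show that LoRE significantly outperforms existing retrieval-based methods on exact match (EM) and F1, and produces more relevant and accurate answers, especially for complex queries.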

Link: https://arxiv.org/abs/2410.10042
Authors: Saikrishna Sanniboina,Shiv Trivedi,Sreenidhi Vijayaraghavan
Keywords: question answering systems, Retrieval-based question answering, suboptimal answer generation, positional bias, mitigating positional bias
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-based question answering systems often suffer from positional bias, leading to suboptimal answer generation. We propose LoRE (Logit-Ranked Retriever Ensemble), a novel approach that improves answer accuracy and relevance by mitigating positional bias. LoRE employs an ensemble of diverse retrievers, such as BM25 and sentence transformers with FAISS indexing. A key innovation is a logit-based answer ranking algorithm that combines the logit scores from a large language model (LLM) with the retrieval ranks of the passages. Experimental results on NarrativeQA and SQuAD demonstrate that LoRE significantly outperforms existing retrieval-based methods in terms of exact match and F1 scores. On SQuAD, LoRE achieves 14.5%, 22.83%, and 14.95% improvements over the baselines for ROUGE-L, EM, and F1, respectively. Qualitatively, LoRE generates more relevant and accurate answers, especially for complex queries.
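
The abstract does not state how logit scores and retrieval ranks are combined, so the sketch below uses a purely hypothetical rule: the LLM logit minus a linear penalty on the passage's retrieval rank. The function `rank_answers`, the parameter `rank_weight`, and the toy candidates are all assumptions for illustration.

```python
def rank_answers(candidates, rank_weight=0.5):
    """Sort candidate answers by LLM logit minus a retrieval-rank penalty.

    candidates: list of (answer, llm_logit, retrieval_rank) tuples,
    where retrieval_rank 1 is the best-ranked passage.
    The linear penalty is an illustrative assumption.
    """
    def score(candidate):
        _, logit, rank = candidate
        return logit - rank_weight * (rank - 1)

    return sorted(candidates, key=score, reverse=True)


# Toy candidates: (answer, llm_logit, retrieval_rank).
candidates = [
    ("answer_a", 3.0, 3),
    ("answer_b", 3.2, 5),
    ("answer_c", 1.0, 1),
]
ranked = rank_answers(candidates)
best_answer = ranked[0][0]
```

Here `answer_b` has the highest raw logit but comes from a poorly ranked passage, so the penalty moves `answer_a` to the top. Any monotone combination of the two signals would fit the same interface.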

[NLP-85] A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

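[Quick read]: This paper addresses the limitations of existing metrics for automated question-answering (QA) evaluation. It finds that existing metrics are highly correlated with one another with respect to question type, yet no single metric adequately estimates human-like evaluation. As a potential solution, it discusses how a Mixture Of Grader could improve automated QA evaluation and bring it closer to human judgment.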

Link: https://arxiv.org/abs/2410.10030
Authors: Yun Joon Soh,Jishen Zhao
Keywords: models and Question-Answering, datasets emphasizes, explosion of open-sourced, open-sourced models, emphasizes the importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The explosion of open-sourced models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We studied the statistics of the existing evaluation metrics for a better understanding of their limitations. By measuring the correlation coefficients of each evaluation metric concerning human-like evaluation score, we observed the following: (1) existing metrics have a high correlation among them concerning the question type (e.g., single word, single phrase, etc.), (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could potentially improve the auto QA evaluator quality.
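
Measuring how well each automatic metric tracks a human-like score comes down to a correlation coefficient per metric. The sketch below computes Pearson correlation by hand; the metric names and all scores are made-up toy values, and the paper may use a different coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy per-question scores (hypothetical).
human = [1.0, 0.0, 1.0, 1.0, 0.0]
metrics = {
    "exact_match": [1.0, 0.0, 0.0, 1.0, 0.0],
    "token_f1": [1.0, 0.2, 0.6, 0.9, 0.1],
}
correlations = {name: pearson(vals, human) for name, vals in metrics.items()}
```

Grouping the questions by type (single word, single phrase, and so on) and repeating the computation per group is one way to see whether a metric's agreement with human scores depends on question type.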

[NLP-86] Safety-Aware Fine-Tuning of Large Language Models NEURIPS2024

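[Quick read]: This paper addresses the safety risk posed by harmful samples that diverse datasets can introduce when fine-tuning large language models (LLMs). The key to the solution is Safety-Aware Fine-Tuning (SAFT), a framework that automatically detects and removes potentially harmful data using a scoring function that exploits the subspace information of harmful and benign samples. Across different LLMs and contamination rates, it reduces harmfulness by up to 27.8%.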

Link: https://arxiv.org/abs/2410.10014
Authors: Hyeong Kyu Choi,Xuefeng Du,Yixuan Li
Keywords: Large Language Models, Fine-tuning Large Language, Large Language, Language Models, tailoring models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop on Safe Generative AI

Abstract:Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.
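
One plausible reading of a subspace-based scoring function is sketched below: find the dominant direction of the sample embeddings and score each sample by its squared projection onto it. This assumes that harmful samples project more strongly onto that direction; the embeddings are toy values, and the power-iteration helper and function names are illustrative rather than the paper's implementation.

```python
import math


def top_direction(X, iters=100):
    """Top right singular vector of X via power iteration on X^T X."""
    n, d = len(X), len(X[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        u = [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]  # X v
        w = [sum(u[i] * X[i][j] for i in range(n)) for j in range(d)]  # X^T u
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def subspace_scores(X):
    """Squared projection of each embedding onto the top direction."""
    v = top_direction(X)
    return [sum(x[j] * v[j] for j in range(len(v))) ** 2 for x in X]


# Toy 2-D embeddings: the last two samples lie far along the first axis.
X = [
    [0.1, 0.2],
    [-0.1, 0.1],
    [0.2, -0.1],
    [3.0, 0.1],
    [-2.8, 0.2],
]
scores = subspace_scores(X)
flagged = sorted(sorted(range(len(X)), key=scores.__getitem__)[-2:])
kept = [i for i in range(len(X)) if i not in flagged]
```

In practice the threshold that separates flagged from kept samples would have to be chosen on held-out data; the fixed top-2 cut here is only for the toy example.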

[NLP-87] Leveraging Customer Feedback for Multi-modal Insight Extraction NAACL2024

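[Quick read]: This paper addresses the extraction of actionable, relevant content from multi-modal customer feedback such as text and images. The key to the solution is a novel multi-modal method that fuses image and text information in a latent space and uses an image-text grounded text decoder to extract the relevant feedback segments. The paper also introduces a weakly-supervised data generation technique to produce training data for the task. On unseen data, the method outperforms existing baselines by 14 points in F1 score.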

Link: https://arxiv.org/abs/2410.09999
Authors: Sandeep Sricharan Mukku,Abinesh Kanagarajan,Pushpendu Ghosh,Chetan Aggarwal
Keywords: Businesses can benefit, customer feedback, products and services, enhance their products, multi-modal customer feedback
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: NAACL 2024

Abstract:Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by 14 points in F1 score.

[NLP-88] Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

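[Quick read]: This paper addresses hallucinations by large language models (LLMs) in code generation and automated program repair (APR), that is, plausible but incorrect code that can cause significant financial loss. The key to the solution is Collu-Bench, a benchmark of 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs. Collu-Bench provides detailed features for in-depth analysis, such as per-step log probabilities of the output, token types, and execution feedback on the generated code. In the paper's experiments, traditional machine learning and neural network predictors reach 22.03% to 33.15% accuracy, which shows how hard it is to localize LLM hallucinations accurately and underlines the need for more sophisticated techniques.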

Link: https://arxiv.org/abs/2410.09997
Authors: Nan Jiang,Qi Li,Lin Tan,Tianyi Zhang
Keywords: large language models, face the critical, generating plausible, code, hallucinations
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Despite their success, large language models (LLMs) face the critical challenge of hallucinations, generating plausible but incorrect content. While much research has focused on hallucinations in multiple modalities including images and natural language text, less attention has been given to hallucinations in source code, which leads to incorrect and vulnerable code that causes significant financial loss. To pave the way for research in LLMs’ hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features such as the per-step log probabilities of LLMs’ output, token types, and the execution feedback of LLMs’ generated code for in-depth analysis. In addition, we conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieves 22.03 – 33.15% accuracy. Our experiments draw insightful findings of code hallucination patterns, reveal the challenge of accurately localizing LLMs’ hallucinations, and highlight the need for more sophisticated techniques.

[NLP-89] Evaluating Gender Bias of LLMs in Making Morality Judgements EMNLP

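[Quick read]: This paper investigates whether current large language models (LLMs) show gender bias when giving moral opinions, in particular whether they treat male and female characters differently. The key to the solution is a new dataset, GenMO (Gender-bias in Morality Opinions), which evaluates bias with parallel short stories featuring male and female characters. Despite safety checks, all production-standard models tested show significant gender bias, with GPT-3.5-turbo giving biased opinions in 24% of the samples, and all models consistently favour female characters. The study also examines the effect of model parameters on gender bias and analyzes real-world situations in which LLMs reveal bias in moral decision-making.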

Link: https://arxiv.org/abs/2410.09992
Authors: Divij Bajaj,Yuanyuan Lei,Jonathan Tong,Ruihong Huang
Keywords: Natural Language Processing, Large Language Models, Large Language, Language Processing, Natural Language
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP Findings 2024

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in a multitude of Natural Language Processing (NLP) tasks. However, these models are still not immune to limitations such as social biases, especially gender bias. This work investigates whether current closed and open-source LLMs possess gender bias, especially when asked to give moral opinions. To evaluate these models, we curate and introduce a new dataset GenMO (Gender-bias in Morality Opinions) comprising parallel short stories featuring male and female characters respectively. Specifically, we test models from the GPT family (GPT-3.5-turbo, GPT-3.5-turbo-instruct, GPT-4-turbo), Llama 3 and 3.1 families (8B/70B), Mistral-7B and Claude 3 families (Sonnet and Opus). Surprisingly, despite employing safety checks, all production-standard models we tested display significant gender bias with GPT-3.5-turbo giving biased opinions in 24% of the samples. Additionally, all models consistently favour female characters, with GPT showing bias in 68-85% of cases and Llama 3 in around 81-85% instances. Additionally, our study investigates the impact of model parameters on gender bias and explores real-world situations where LLMs reveal biases in moral decision-making.
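
With parallel stories that differ only in the character's gender, a bias rate can be computed by comparing the model's verdicts within each pair. The sketch below is one assumed way to do this; the verdict labels, the function `bias_stats`, and the toy pairs are hypothetical, and the paper's exact definition of a biased sample may differ.

```python
def bias_stats(pairs):
    """Summarize verdict disagreements on gender-swapped story pairs.

    pairs: list of (verdict_male_story, verdict_female_story), each
    either "moral" or "immoral".
    Returns (share of pairs with differing verdicts,
             share of those where the female character is judged moral).
    """
    biased = [(m, f) for m, f in pairs if m != f]
    bias_rate = len(biased) / len(pairs)
    if not biased:
        return bias_rate, 0.0
    favour_female = sum(f == "moral" for _, f in biased) / len(biased)
    return bias_rate, favour_female


# Toy verdicts (hypothetical).
pairs = [
    ("moral", "moral"),
    ("immoral", "moral"),
    ("immoral", "immoral"),
    ("moral", "immoral"),
]
bias_rate, favour_female = bias_stats(pairs)
```

Reporting the two numbers separately distinguishes how often a model is inconsistent from which gender the inconsistency favours.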

[NLP-90] MARS: Multilingual Aspect-centric Review Summarisation EMNLP2024

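[Quick read]: This paper addresses large-scale, multilingual summarization of customer feedback, aiming to provide actionable insights for products and services across industries. The key to the solution is MARS, a framework built on a two-step Extract-then-Summarise paradigm: key information is first extracted from multilingual reviews and then summarized, giving domain-agnostic, aspect-level multilingual review summarization. In automatic and human evaluation it substantially improves on abstractive baselines and makes real-time systems more efficient.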

Link: https://arxiv.org/abs/2410.09991
Authors: Sandeep Sricharan Mukku,Abinesh Kanagarajan,Chetan Aggarwal,Promod Yenigalla
Keywords: Summarizing customer feedback, provide actionable insights, Summarizing customer, insights for products, services at scale
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024

Abstract:Summarizing customer feedback to provide actionable insights for products/services at scale is an important problem for businesses across industries. Lately, the review volumes are increasing across regions and languages, therefore the challenge of aggregating and understanding customer sentiment across multiple languages becomes increasingly vital. In this paper, we propose a novel framework involving a two-step paradigm Extract-then-Summarise, namely MARS, to revolutionise traditions and address the domain-agnostic aspect-level multilingual review summarisation. Extensive automatic and human evaluation shows that our approach brings substantial improvements over abstractive baselines and efficiency to real-time systems.

[NLP-91] Self-Data Distillation for Recovering Quality in Pruned Large Language Models NEURIPS2024

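[Quick read]: This paper addresses the quality loss of large language models after structured pruning, particularly on multi-step reasoning tasks. The key to the solution is self-data distilled fine-tuning, which uses the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting, so that quality is recovered after pruning. Experiments show that, with 6 decoder blocks pruned, the method improves average accuracy by up to 8% over standard supervised fine-tuning (SFT), and that it scales well across datasets.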

Link: https://arxiv.org/abs/2410.09982
Authors: Vithursan Thangarasa,Ganesh Venkatesh,Nish Sinnadurai,Sean Lie
Keywords: Large language models, natural language processing, deployment requires substantial, requires substantial compute, Large language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at the NeurIPS 2024 Machine Learning and Compression Workshop

Abstract:Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model’s learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model’s quality. In this work, we propose self-data distilled fine-tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model’s knowledge. Empirically, we demonstrate that self-data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning 6 decoder blocks on Llama3.1-8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model’s accuracy compared to 81.7% with SFT, while reducing real-world FLOPs by 16.30%. Furthermore, our approach scales effectively across datasets, with the quality improving as the dataset size increases.
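
The parameter counts quoted for pruning 6 decoder blocks can be checked with a short back-of-envelope calculation. The layer dimensions below are taken from the public Llama 3.1 8B configuration rather than from the abstract, so they are an assumption of this sketch, as is the simplification that a decoder block consists only of the attention projections, the gated MLP, and two normalization vectors.

```python
# Llama 3.1 8B dimensions (assumed from the public model configuration).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP inner width
N_HEADS = 32           # query heads
N_KV_HEADS = 8         # key/value heads (grouped-query attention)
HEAD_DIM = 128


def params_per_block():
    """Parameter count of one decoder block (no biases)."""
    attn = HIDDEN * N_HEADS * HEAD_DIM            # q_proj
    attn += 2 * HIDDEN * N_KV_HEADS * HEAD_DIM    # k_proj and v_proj
    attn += N_HEADS * HEAD_DIM * HIDDEN           # o_proj
    mlp = 3 * HIDDEN * INTERMEDIATE               # gate, up, down projections
    norms = 2 * HIDDEN                            # two RMSNorm weight vectors
    return attn + mlp + norms


blocks_pruned = 6
removed = blocks_pruned * params_per_block()
remaining_billion = 8.03 - removed / 1e9
```

Each block holds about 0.218B parameters, so removing 6 blocks takes away about 1.31B, which is consistent with the reduction from 8.03B to 6.72B stated in the abstract.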

[NLP-92] When Neutral Summaries are not that Neutral: Quantifying Political Neutrality in LLM-Generated News Summaries

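[Quick read]: This paper investigates whether large language models (LLMs) show political bias when handling politically contentious news articles. The key to the solution is to quantify political neutrality through abstractive text summarization of articles on five pressing issues in US politics (abortion, gun control/rights, healthcare, immigration, and LGBTQ+ rights). An analysis of 20,344 news articles reveals a consistent pro-Democratic bias in several well-known LLMs, most pronounced for gun control and healthcare. Further analysis finds strong convergence in the vocabulary the LLMs use on these divisive topics.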

Link: https://arxiv.org/abs/2410.09978
Authors: Supriti Vijay,Aman Priyanshu,Ashique R. KhudaBukhsh
Keywords: important research question, political neutrality, algorithmic curation, research question, investigating the political
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 12 pages, 3 figures, 4 tables

Abstract:In an era where societal narratives are increasingly shaped by algorithmic curation, investigating the political neutrality of LLMs is an important research question. This study presents a fresh perspective on quantifying the political neutrality of LLMs through the lens of abstractive text summarization of polarizing news articles. We consider five pressing issues in current US politics: abortion, gun control/rights, healthcare, immigration, and LGBTQ+ rights. Via a substantial corpus of 20,344 news articles, our study reveals a consistent trend towards pro-Democratic biases in several well-known LLMs, with gun control and healthcare exhibiting the most pronounced biases (max polarization differences of -9.49% and -6.14%, respectively). Further analysis uncovers a strong convergence in the vocabulary of the LLM outputs for these divisive topics (55% overlap for Democrat-leaning representations, 52% for Republican). Being months away from a US election of consequence, we consider our findings important.

[NLP-93] MisinfoEval: Generative AI in the Era of “Alternative Facts” EMNLP2024

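[Quick read]: This paper addresses the spread of misinformation on social media and proposes MisinfoEval, a framework for generating and evaluating large language model (LLM) based interventions. The key to the solution is to use generative AI, with personalized explanations and experiments in a simulated social media environment, to evaluate and improve users' ability to identify misinformation. LLM-based interventions prove highly effective at improving user accuracy (by up to 41.72%), and users favor personalized interventions, which significantly raise their accuracy at identifying misinformation.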

Link: https://arxiv.org/abs/2410.09949
Authors: Saadia Gabriel,Liang Lyu,James Siderius,Marzyeh Ghassemi,Jacob Andreas,Asu Ozdaglar
Keywords: threatens democratic processes, massive economic losses, endangers public health, platforms threatens democratic, democratic processes
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024. Correspondence can be sent to skgabrie@cs. this http URL

Abstract:The spread of misinformation on social media platforms threatens democratic processes, contributes to massive economic losses, and endangers public health. Many efforts to address misinformation focus on a knowledge deficit model and propose interventions for improving users’ critical thinking through access to facts. Such efforts are often hampered by challenges with scalability, and by platform users’ personal biases. The emergence of generative AI presents promising opportunities for countering misinformation at scale across ideological barriers. In this paper, we introduce a framework (MisinfoEval) for generating and comprehensively evaluating large language model (LLM) based misinformation interventions. We present (1) an experiment with a simulated social media environment to measure effectiveness of misinformation interventions, and (2) a second experiment with personalized explanations tailored to the demographics and beliefs of users with the goal of countering misinformation by appealing to their pre-existing values. Our findings confirm that LLM-based interventions are highly effective at correcting user behavior (improving overall user accuracy at reliability labeling by up to 41.72%). Furthermore, we find that users favor more personalized interventions when making decisions about news reliability and users shown personalized interventions have significantly higher accuracy at identifying misinformation.

[NLP-94] State of NLP in Kenya: A Survey

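[Quick Read]: This paper addresses the underrepresentation of Kenya's many indigenous languages in digital spaces, which stems from limited NLP resources and tools. The key elements of the way forward are: 1) creating large-scale language datasets; 2) developing machine translation, sentiment analysis, and speech recognition models for indigenous languages; 3) formulating and implementing policies and regulations that support the development of AI and NLP; and 4) a strategic roadmap that guides future research and development toward Kenya's diverse linguistic needs.

Link: https://arxiv.org/abs/2410.09948
Authors: Cynthia Jayne Amol, Everlyn Asiko Chimoto, Rose Delilah Gesicho, Antony M. Gitau, Naome A. Etori, Caringtone Kinyanjui, Steven Ndung’u, Lawrence Moruye, Samson Otieno Ooko, Kavengi Kitonga, Brian Muhia, Catherine Gitau, Antony Ndolo, Lilian D. A. Wanzare, Albert Njoroge Kahira, Ronald Tombe
Keywords: Natural Language Processing, advancing Natural Language, faces unique challenges, advancing Natural, underrepresented indigenous languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages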

Abstract:Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local dialects such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. This paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models, most notably the need for large-scale language models and the insufficient digital representation of indigenous languages. We also analyze key NLP applications (machine translation, information retrieval, and sentiment analysis), examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya’s diverse linguistic demands.

[NLP-95] Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization

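[Quick Read]: This paper addresses the design of a unified search engine that serves multiple retrieval-augmented generation (RAG) agents, each with a different task, backbone large language model, and retrieval-augmentation strategy. The key is an iterative approach: during an offline phase, the search engine collects the RAG agents' feedback on the quality of the retrieved documents, and a novel expectation-maximization algorithm uses that feedback to iteratively optimize the search engine so as to maximize each agent's utility function. The approach is also adapted to an online setting, where the search engine adjusts its behavior based on real-time feedback from individual agents. Experiments on diverse datasets from the KILT benchmark show that the method significantly outperforms competitive baselines and effectively "personalizes" retrieval for each RAG agent.

Link: https://arxiv.org/abs/2410.09942
Authors: Alireza Salemi, Hamed Zamani
Keywords: multiple retrieval-augmented generation, backbone large language, unified search engine, search engine, serve multiple retrieval-augmented
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)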

Abstract:This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and retrieval-augmentation strategy. We introduce an iterative approach where the search engine generates retrieval results for these RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using a novel expectation-maximization algorithm, with the goal of maximizing each agent’s utility function. Additionally, we adapt this approach to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve the results for each of them. Experiments on diverse datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that our approach significantly outperforms competitive baselines on average across 18 RAG models. We also demonstrate that our method effectively “personalizes” the retrieval process for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.

[NLP-96] Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning

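[Quick Read]: This paper addresses efficient fine-tuning and adaptation of deep learning models in resource-constrained environments. The key is a new method, Retrieval-based Parameter Ensemble (RPE), which builds a vectorized database of Low-Rank Adaptations (LoRAs) so that model adaptations can be retrieved and applied to new tasks efficiently. RPE needs neither extensive training nor labeled data, which makes it particularly suited to zero-shot learning, and it fits privacy-sensitive domains such as healthcare because it modifies model parameters without accessing raw data.

Link: https://arxiv.org/abs/2410.09908
Authors: Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Keywords: Foundation models, techniques like Low-Rank, Foundation, RPE, Retrieval-based Parameter Ensemble
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)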

Abstract:Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
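The retrieve-then-ensemble idea in the RPE abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the function name `retrieve_and_ensemble`, the use of cosine similarity, the softmax weighting with a `temperature`, and the toy arrays. The abstract does not state how RPE computes its ensemble weights, so this is not the authors' method, only one plausible reading of "retrieve stored LoRAs and combine them".

```python
import numpy as np

def retrieve_and_ensemble(query, keys, deltas, k=2, temperature=0.1):
    """Pick the k stored adapters closest to `query` and average their deltas.

    query  : (d,) embedding of the new task (hypothetical representation)
    keys   : (n, d) one embedding per stored LoRA
    deltas : (n, r, c) one weight update per stored LoRA
    """
    # Cosine similarity between the query and every stored key.
    q = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = kn @ q
    top = np.argsort(-sims)[:k]
    # Softmax over the retrieved similarities (an assumed weighting rule).
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum of the retrieved weight updates; no training, no labels.
    merged = np.tensordot(weights, deltas[top], axes=1)
    return top, weights, merged

# Toy database: three adapters whose deltas are constant matrices 0, 1, 2.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
deltas = np.stack([np.full((2, 2), float(i)) for i in range(3)])
top, weights, merged = retrieve_and_ensemble(np.array([1.0, 0.1]), keys, deltas)
```

In this toy setup the query is closest to adapters 0 and 2, so the merged update is a blend of those two and adapter 1 is ignored.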

[NLP-97] Reddit is all you need: Authorship profiling for Romanian

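[Quick Read]: This paper addresses authorship profiling from short texts, specifically for Romanian-language authors. The key is to exploit the community structure of Reddit (subreddits): from the subreddits a user participates in and other cues, the authors infer traits such as age category, employment status, interests, and social orientation, and build a corpus of 23,000+ Romanian short texts. They then fine-tune and evaluate large language models (LLMs) to establish baselines on the corpus, release all resources publicly, and point to the need for further research in the field.

Link: https://arxiv.org/abs/2410.09907
Authors: Ecaterina Ştefănescu, Alexandru-Iulius Jerpelea
Keywords: Natural Language Processing, process of identifying, Language Processing, Large Language Models, Natural Language
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 tables and 1 figure, submitted to The 19th International Conference on Linguistic Resources and Tools for Natural Language Processing (ConsILR 2024)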

Abstract:Authorship profiling is the process of identifying an author’s characteristics based on their writings. This centuries-old problem has become more intriguing, especially with recent developments in Natural Language Processing (NLP). In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords; to our knowledge, the first of its kind. In order to do this, we exploit a social media platform called Reddit. We leverage its thematic community-based structure (subreddits structure), which offers information about the author’s background. We infer a user’s demographic and some broad personal traits, such as age category, employment status, interests, and social orientation based on the subreddit and other cues. We thus obtain a corpus of 23k+ samples, extracted from 100+ Romanian subreddits. We analyse our dataset, and finally, we fine-tune and evaluate Large Language Models (LLMs) to establish baseline capabilities for authorship profiling using the corpus, indicating the need for further research in the field. We publicly release all our resources.

[NLP-98] RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

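[Quick Read]: This paper addresses the mismatch between how reward models (RMs) are currently evaluated and how well they actually perform in alignment. The key is RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and uses both pairwise and Best-of-N (BoN) evaluations to better reflect how effectively RMs guide alignment optimization. The paper shows a positive correlation between the benchmark and downstream alignment performance, reveals generalization defects in state-of-the-art RMs, and examines the potential of generative RMs together with the influence of evaluation criteria and instructing methods.

Link: https://arxiv.org/abs/2410.09893
Authors: Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Keywords: large language models, preferred by humans, large language, behaviors preferred, RMs
Categories: Computation and Language (cs.CL)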

Abstract:Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans. Evaluating RMs is the key to better aligning LLMs. However, the current evaluation of RMs may not directly correspond to their alignment performance due to the limited distribution of evaluation data and evaluation methods that are not closely related to alignment objectives. To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and the downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs. Furthermore, we delve into open questions in reward models, specifically examining the effectiveness of majority voting for the evaluation of reward models and analyzing the impact factors of generative RMs, including the influence of evaluation criteria and instructing methods. Our evaluation code and datasets are available at this https URL.
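The two evaluation modes named in the RMB abstract, pairwise and Best-of-N, can be contrasted with a short sketch. It assumes a BoN item counts as correct only when the reward model scores the designated best response above every other candidate; the abstract does not spell out the exact scoring rule. The names `pairwise_accuracy`, `bon_accuracy`, and `toy_rm` are hypothetical, and the toy reward model (which counts distinct words) stands in for a real RM.

```python
from typing import Callable, List, Sequence, Tuple

RewardModel = Callable[[str, str], float]  # (prompt, response) -> score

def pairwise_accuracy(rm: RewardModel,
                      pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    hits = sum(rm(p, chosen) > rm(p, rejected) for p, chosen, rejected in pairs)
    return hits / len(pairs)

def bon_accuracy(rm: RewardModel,
                 items: Sequence[Tuple[str, str, List[str]]]) -> float:
    """Fraction of (prompt, best, others) items where best beats every other response."""
    hits = sum(all(rm(p, best) > rm(p, other) for other in others)
               for p, best, others in items)
    return hits / len(items)

def toy_rm(prompt: str, response: str) -> float:
    # Stand-in reward: number of distinct words in the response.
    return float(len(set(response.split())))

pairs = [("q", "a b c", "a"), ("q", "a", "a b")]
items = [("q", "a b c", ["a", "a b"]), ("q", "a b", ["a", "a b c"])]
pair_acc = pairwise_accuracy(toy_rm, pairs)
bon_acc = bon_accuracy(toy_rm, items)
```

Under this reading, BoN is the stricter test: a single candidate that outscores the best response fails the whole item, while a pairwise test only ever compares two responses.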

[NLP-99] ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

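[Quick Read]: This paper addresses the challenge of assessing and ensuring the chronological knowledge of large language models (LLMs). Existing approaches typically rely on a single timestamp and cannot handle the accumulative nature of knowledge. The key is ChroKnowBench, a benchmark dataset that evaluates chronologically accumulated knowledge across multiple domains, time dependency, and temporal state, and that distinguishes knowledge that evolves from knowledge that stays constant. Building on it, the paper proposes ChroKnowledge, a sampling-based framework for evaluating and updating LLMs' non-parametric chronological knowledge, and ChroKnowPrompt, a prompting technique that walks the model step by step through the surrounding time spans. The reported knowledge-update gains are 11.9% in the biomedical domain and 2.8% in the general domain.

Link: https://arxiv.org/abs/2410.09870
Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Donghyeon Lee, Minbyul Jeong, Jaewoo Kang
Keywords: Large language models, Large language, knowledge, significantly impacted, chronological knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)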

Abstract:Large language models (LLMs) have significantly impacted many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single time stamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs’ non-parametric chronological knowledge. Our evaluation shows: (1) The ability to elicit temporal knowledge varies depending on the data format that the model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting to elicit chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), demonstrating its effectiveness in refining temporal knowledge. This non-parametric approach also enables knowledge updates not only in open-source models but also in proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.

[NLP-100] Generating Driving Simulations via Conversation

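[Quick Read]: This paper addresses the generation of scenario and behaviour specifications for simulation testing of autonomous vehicles, in particular by giving non-coding domain experts a natural language interface for synthesizing the test scenarios they need. The key is an instruction-following large language model that converts natural language utterances into a symbolic program through dialogue. In human experiments, generation with dialogue reached a success rate 4.5 times that of generation without extended conversation.

Link: https://arxiv.org/abs/2410.09829
Authors: Rimvydas Rubavicius, Antonio Valerio Miceli-Barone, Alex Lascarides, Subramanian Ramamoorthy
Keywords: Cyber-physical systems, autonomous vehicles, scenario specification, Cyber-physical, domain-specific programs
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Robotics (cs.RO)
Comments: 6 pages, 6 figures, 2 tables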

Abstract:Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than a generation without engaging in extended conversation.

[NLP-101] Dynamic and Textual Graph Generation Via Large-Scale LLM-based Agent Simulation

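[Quick Read]: This paper addresses the difficulty that traditional rule-based and deep-learning graph generators have in capturing community structure and producing diverse graphs when modeling dynamic graph evolution. The key is the GraphAgent-Generator (GAG) framework, which uses large language models (LLMs) to simulate human behavior. Without training or fine-tuning the LLM, it replicates seven macro-level structural characteristics from network science theory and surpasses existing baselines on graph expansion tasks by 31% on specific evaluation metrics. A node classification task confirms that the generated text-rich graphs preserve the node characteristics of real-world networks, and parallel acceleration supports graphs with nearly 100,000 nodes or 10 million edges, with a minimum speed-up of 90.4%.

Link: https://arxiv.org/abs/2410.09824
Authors: Jiarui Ji, Runlin Lei, Jialing Bi, Zhewei Wei, Yankai Lin, Xuchen Pan, Yaliang Li, Bolin Ding
Keywords: dynamic graph generation, studied in social, scientific analysis, extensively studied, Graph
Categories: Computation and Language (cs.CL)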

Abstract:Graph generation is a fundamental task that has been extensively studied in social, technological, and scientific analysis. For modeling the dynamic graph evolution process, traditional rule-based methods struggle to capture community structures within graphs, while deep learning methods only focus on fitting training graphs. This limits existing graph generators to producing graphs that adhere to predefined rules or closely resemble training datasets, achieving poor performance in dynamic graph generation. Given that graphs are abstract representations arising from pairwise interactions in human activities, a realistic simulation of human-wise interaction could provide deeper insights into the graph evolution mechanism. With the increasing recognition of large language models (LLMs) in simulating human behavior, we introduce GraphAgent-Generator (GAG), a novel simulation-based framework for dynamic graph generation. Without any training or fine-tuning of the LLM, our framework effectively replicates seven macro-level structural characteristics in established network science theories while surpassing existing baselines in graph expansion tasks by 31% on specific evaluation metrics. Through a node classification task, we validate that GAG effectively preserves the characteristics of real-world networks for node-wise textual features in the generated text-rich graphs. Furthermore, by incorporating parallel acceleration, GAG supports generating graphs with up to nearly 100,000 nodes or 10 million edges through large-scale LLM-based agent simulation, with a minimum speed-up of 90.4%. The source code is available at this https URL.

[NLP-102] Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

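[Quick Read]: This paper addresses the long training time of Zeroth-Order (ZO) optimization, which is used to avoid the large memory footprint of first-order fine-tuning of large language models. The key is LeZO, a layer-wise sparse, computation- and memory-efficient ZO optimizer. LeZO treats layers as the basic unit of sparsification and dynamically perturbs a different subset of parameters at each step, so that all parameters are still fine-tuned over the course of training. It applies layer-wise parameter sparsity within simultaneous perturbation stochastic approximation (SPSA) and ZO stochastic gradient descent (ZO-SGD), which speeds up the perturbation and update steps without additional memory overhead. Experiments show that LeZO accelerates training without sacrificing the performance of ZO optimization, with over 3x speedup compared to MeZO on several tasks.

Link: https://arxiv.org/abs/2410.09823
Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
Keywords: adapting large language, huge memory usages, large language models, powerful for adapting, adapting large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)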

Abstract:Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage. A promising approach to mitigate this is using Zeroth-Order (ZO) optimization, which estimates gradients to replace First-Order (FO) gradient calculations, albeit with longer training time due to its stochastic nature. By revisiting the Memory-efficient ZO (MeZO) optimizer, we discover that the full-parameter perturbation and updating processes consume over 50% of its overall fine-tuning time cost. Based on these observations, we introduce a novel layer-wise sparse computation and memory efficient ZO optimizer, named LeZO. LeZO treats layers as fundamental units for sparsification and dynamically perturbs different parameter subsets in each step to achieve full-parameter fine-tuning. LeZO incorporates layer-wise parameter sparsity in the process of simultaneous perturbation stochastic approximation (SPSA) and ZO stochastic gradient descent (ZO-SGD). It achieves accelerated computation during perturbation and updating processes without additional memory overhead. We conduct extensive experiments with the OPT model family on the SuperGLUE benchmark and two generative tasks. The experiments show that LeZO accelerates training without compromising the performance of ZO optimization. Specifically, it achieves over 3x speedup compared to MeZO on the SST-2, BoolQ, and COPA tasks.
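The combination of SPSA, ZO-SGD, and layer-wise sparsity described in the LeZO abstract can be sketched on a toy problem. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `sparse_zo_step`, the uniform random choice of active layers, the `keep_ratio` parameter, and the quadratic `toy_loss` are all hypothetical. The seed-replay trick for regenerating the perturbation, rather than storing it, follows the general MeZO idea of avoiding extra memory.

```python
import numpy as np

def sparse_zo_step(params, loss_fn, rng, lr=0.05, eps=1e-3, keep_ratio=0.5):
    """One layer-wise sparse ZO-SGD step; updates `params` in place.

    Only a random subset of layers is perturbed and updated in this step.
    Returns the names of the layers that were active.
    """
    names = list(params)
    n_keep = max(1, int(round(keep_ratio * len(names))))
    picked = rng.choice(len(names), size=n_keep, replace=False)
    active = [names[i] for i in picked]
    seed = int(rng.integers(2**31 - 1))  # replayed to regenerate the same noise

    def shift(scale):
        # Re-create the same Gaussian direction z from the seed each time,
        # so z never has to be stored alongside the parameters.
        z_rng = np.random.default_rng(seed)
        for name in active:
            params[name] += scale * z_rng.standard_normal(params[name].shape)

    shift(+eps)
    loss_plus = loss_fn(params)       # L(theta + eps * z)
    shift(-2.0 * eps)
    loss_minus = loss_fn(params)      # L(theta - eps * z)
    shift(+eps)                       # restore theta
    # SPSA estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    shift(-lr * projected_grad)       # ZO-SGD update on the active layers only
    return active

def toy_loss(params):
    # Hypothetical objective: sum of squares over all layers.
    return float(sum(np.sum(p ** 2) for p in params.values()))

rng = np.random.default_rng(0)
params = {f"layer{i}": np.ones(3) for i in range(4)}
start = toy_loss(params)
for _ in range(300):
    active = sparse_zo_step(params, toy_loss, rng)
final = toy_loss(params)
```

Each step only touches the active layers during perturbation and update, which is where the abstract locates the time savings. Because the active subset changes from step to step, every layer is still updated over the course of training.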

[NLP-103] Reverse Modeling in Large Language Models

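[Quick Read]: This paper examines whether auto-regressive large language models (LLMs) can understand reversed text inputs. It finds that publicly available pre-trained LLMs cannot, whereas LLMs trained from scratch on both forward and reversed text understand the two directions equally well at inference. By analyzing how the loss of different texts differs between forward and reverse input, the paper proposes a data selection method based on that loss difference. Continued pretraining on the selected data substantially improves performance on language understanding benchmarks.

Link: https://arxiv.org/abs/2410.09817
Authors: Sicheng Yu, Yuanchen Xu, Cunxiao Du, Yanying Zhou, Minghui Qiu, Qianru Sun, Hao Zhang, Jiawei Wu
Keywords: natural bias extends, accustomed to reading, reading and writing, natural bias, bias extends
Categories: Computation and Language (cs.CL)
Comments: 13 Pages, 6 Figures, 7 Tables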

Abstract:Humans are accustomed to reading and writing in a forward manner, and this natural bias extends to text understanding in auto-regressive large language models (LLMs). This paper investigates whether LLMs, like humans, struggle with reverse modeling, specifically with reversed text inputs. We found that publicly available pre-trained LLMs cannot understand such inputs. However, LLMs trained from scratch with both forward and reverse texts can understand them equally well during inference. Our case study shows that different-content texts result in different losses if input (to LLMs) in different directions – some get lower losses for forward while some for reverse. This leads us to a simple and nice solution for data selection based on the loss differences between forward and reverse directions. Using our selected data in continued pretraining can boost LLMs’ performance by a large margin across different language understanding benchmarks.

[NLP-104] Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

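[Quick Read]: This paper addresses a flaw in the evaluation of aspect-based sentiment analysis (ABSA): because natural language is variable, aspect and opinion terms take many surface forms, yet existing evaluation relies on a single ground truth and penalizes semantically equivalent predictions. The key is a fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This gives a fairer evaluation of language models, raises agreement with human judgment (Kendall’s Tau), and suggests that the ABSA capability of large language models (LLMs) may have been underestimated.

Link: https://arxiv.org/abs/2410.09807
Authors: Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho
Keywords: Aspect-based sentiment analysis, Aspect-based sentiment, sentiment analysis, extracting sentiment, aspect and opinion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint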

Abstract:Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall’s Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs’ capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.
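The difference between single-answer scoring and scoring against augmented references can be shown with a small sketch. It is an illustration of the evaluation idea in the abstract, not the paper's pipeline: the `Gold` record, the `norm` helper, the greedy one-to-one matching, and the example terms are assumptions, and the alternative surface forms are supplied by hand here rather than generated automatically.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

@dataclass(frozen=True)
class Gold:
    aspects: FrozenSet[str]    # acceptable surface forms, already normalized
    opinions: FrozenSet[str]   # acceptable surface forms, already normalized
    sentiment: str

def norm(text: str) -> str:
    # Lowercase and collapse whitespace before comparing surface forms.
    return " ".join(text.lower().split())

def f1_with_alternatives(preds: List[Tuple[str, str, str]],
                         golds: List[Gold]) -> float:
    """F1 where a prediction is correct if it matches any accepted surface form."""
    unmatched = list(golds)
    true_pos = 0
    for aspect, opinion, sentiment in preds:
        for gold in unmatched:
            if (sentiment == gold.sentiment
                    and norm(aspect) in gold.aspects
                    and norm(opinion) in gold.opinions):
                unmatched.remove(gold)  # each gold item can be matched once
                true_pos += 1
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(preds)
    recall = true_pos / len(golds)
    return 2 * precision * recall / (precision + recall)

preds = [("Battery", "great", "positive")]
single = [Gold(frozenset({"battery life"}), frozenset({"great"}), "positive")]
augmented = [Gold(frozenset({"battery life", "battery"}),
                  frozenset({"great", "really great"}), "positive")]
strict_f1 = f1_with_alternatives(preds, single)
relaxed_f1 = f1_with_alternatives(preds, augmented)
```

In this toy case the prediction "Battery" is rejected against the single reference "battery life" but accepted once "battery" is listed as a valid alternative. That is the kind of surface-form penalty the abstract describes.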

[NLP-105] BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

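[Quick Read]: This paper addresses the narrow focus of existing jailbreak attacks on large language models (LLMs), which maximize attack success rate (ASR) while neglecting factors such as relevance and stealthiness. The key is the BlackDAN framework, which applies multi-objective optimization (MOO) with Multiobjective Evolutionary Algorithms (MOEAs) such as NSGA-II to generate high-quality attack prompts that raise the success rate while keeping the prompts semantically relevant and stealthy. By integrating mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable generation process and allows prompts to be customized to user preferences, yielding higher success rates and stronger robustness across LLMs and multimodal LLMs.

Link: https://arxiv.org/abs/2410.09804
Authors: Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang
Keywords: encounter potential security, potential security risks, bypass security measures, large language models, exhibit remarkable capabilities
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)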

Abstract:While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.

[NLP-106] Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning

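[Quick Read]: This paper addresses the limited diversity of problem-solving strategies explored when large language models (LLMs) rely on a single prompting method for mathematical reasoning. Through an experimental analysis of different prompting methods, it finds that each method explores a different search space and that the difference grows with problem complexity. The paper proposes an efficient sampling process that uniformly combines samples from these methods, which expands the maximum search space and reaches higher performance with fewer runs. On MATH-hard, the difficult subset of the MATH dataset, about 43% fewer runs were needed on average. The findings underline the value of integrating diverse problem-solving strategies to improve LLM reasoning.

Link: https://arxiv.org/abs/2410.09780
Authors: Gisang Lee, Sangwoo Park, Junyoung Park, Andrew Chung, Sieun Park, Yoonah Park, Byungju Kim, Min-gyu Cho
Keywords: Large Language Models, Large Language, Language Models, exhibited remarkable capabilities, complex tasks including
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures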

Abstract:Large Language Models (LLMs) have exhibited remarkable capabilities in many complex tasks including mathematical reasoning. However, traditional approaches heavily rely on ensuring self-consistency within a single prompting method, which limits the exploration of diverse problem-solving strategies. This study addresses these limitations by performing an experimental analysis of distinct prompting methods within the domain of mathematical reasoning. Our findings demonstrate that each method explores a distinct search space, and this differentiation becomes more evident with increasing problem complexity. To leverage this phenomenon, we apply an efficient sampling process that uniformly combines samples from these diverse methods, which not only expands the maximum search space but also achieves higher performance with fewer runs compared to single methods. In particular, on MATH-hard, the subset of difficult questions of the MATH dataset, the maximum search space was achieved while utilizing approximately 43% fewer runs than single methods on average. These findings highlight the importance of integrating diverse problem-solving strategies to enhance the reasoning abilities of LLMs.

[NLP-107] ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos EMNLP2024

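[Quick read]: This paper addresses the challenge of generating entity-centric, information-seeking questions from videos. The approach tackles three points: (1) identifying question-worthy information; (2) linking that information to entities; (3) making effective use of multimodal signals (titles, transcripts, captions, embeddings). The proposed model combines a Transformer architecture, rich context signals, and a mix of cross-entropy and contrastive losses to encourage entity-centric question generation, and reports high BLEU, ROUGE, CIDEr, and METEOR scores in experiments. The paper also contributes VideoQuestions, a diverse dataset of 411 YouTube videos with 2,265 manually annotated questions.

Link: https://arxiv.org/abs/2410.09776
Authors: Arpan Phukan, Manish Gupta, Asif Ekbal
Keywords (EN): Previous studies, focused on generating, common objects, objects and attributes, Previous
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted in EMNLP 2024, this https URL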

Abstract:Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending “People Also Ask” questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, there does not exist a large-scale dataset for this task. Most video question generation datasets are on TV shows, movies, or human activities or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a combination of cross-entropy and contrastive loss function to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available at this https URL.

[NLP-108] EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

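[Quick read]: This paper addresses shortcomings of existing LLM evaluation tools in transparency, controllability, and cost-effectiveness, in particular the problems that come with using closed-source models such as GPT-4 as evaluators. The solution is EasyJudge, an open-source evaluation model that is lightweight, precise, efficient, and user-friendly. It is trained with refined datasets and prompts and achieves strong consistency with human and proprietary-model evaluations. EasyJudge also offers an intuitive visualization interface for easy deployment and use, and runs efficiently on consumer-grade GPUs or even CPUs.

Link: https://arxiv.org/abs/2410.09775
Authors: Yijie Li, Yuan Sun
Keywords (EN): growing trend, judge the quality, employing large language, Recently, model
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: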

Abstract:Recently, there has been a growing trend of employing large language models (LLMs) to judge the quality of other LLMs. Many studies have adopted closed-source models, mainly using GPT-4 as the evaluator. However, due to the closed-source nature of the GPT-4 model, employing it as an evaluator has resulted in issues including transparency, controllability, and cost-effectiveness. Some researchers have turned to using fine-tuned open-source LLMs as evaluators. However, existing open-source evaluation LLMs generally lack a user-friendly visualization tool, and they have not been optimized for accelerated model inference, which causes inconvenience for researchers with limited resources and those working across different fields. This paper presents EasyJudge, a model developed to evaluate significant language model responses. It is lightweight, precise, efficient, and user-friendly, featuring an intuitive visualization interface for ease of deployment and use. EasyJudge uses detailed datasets and refined prompts for model optimization, achieving strong consistency with human and proprietary model evaluations. The model optimized with quantitative methods enables EasyJudge to run efficiently on consumer-grade GPUs or even CPUs. We also provide detailed analysis and case studies to further reveal the potential of our method.

[NLP-109] A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

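[Quick read]: This paper addresses mixed-language multi-document (MLMD) news summarization, that is, generating a summary from multiple documents written in different languages. The authors construct the MLMD-news dataset, which covers four languages and contains 10,992 pairs of source document clusters and target summaries, and propose a graph-based extract-generate model. The dataset and code are publicly released to advance summarization research in MLMD scenarios.

Link: https://arxiv.org/abs/2410.09773
Authors: Shengxiang Gao, Fang nan, Yongbing Zhang, Yuxin Huang, Kaiwen Tan, Zhengtao Yu
Keywords (EN): single-language single-document, cross-language single-document, summarization primarily focuses, Existing research, primarily focuses
Categories: Computation and Language (cs.CL)
Comments: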

Abstract:Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 source document cluster and target summary pairs. Additionally, we propose a graph-based extract-generate model, benchmark various methods on the MLMD-news dataset, and publicly release our dataset and code at this https URL, aiming to advance research in summarization within MLMD scenarios.

[NLP-110] Quis custodiet ipsos custodes? Who will watch the watchmen? On Detecting AI-generated peer-reviews EMNLP

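[Quick read]: This paper aims to help editors or chairs determine whether a peer review was generated by ChatGPT, in order to protect the integrity of academic reviewing. It proposes two models: the Term Frequency (TF) model, based on the observation that AI-generated text tends to repeat tokens, and the Review Regeneration (RR) model, which exploits the tendency of ChatGPT to produce similar outputs when re-prompted. Compared with other AI text detectors, the TF model performs better when there is no attack, while the RR model is more robust under attack. The paper also proposes a defensive strategy against paraphrasing attacks.

Link: https://arxiv.org/abs/2410.09770
Authors: Sandeep Kumar, Mohit Sahu, Vardhan Gacche, Tirthankar Ghosal, Asif Ekbal
Keywords (EN): maintaining scientific rigor, process is vital, vital for maintaining, rigor and trust, academic community
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: EMNLP Main, 17 pages, 5 figures, 9 tables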

Abstract:The integrity of the peer-review process is vital for maintaining scientific rigor and trust within the academic community. With the steady increase in the usage of large language models (LLMs) like ChatGPT in academic writing, there is a growing concern that AI-generated texts could compromise scientific publishing, including peer-reviews. Previous works have focused on generic AI-generated text detection or have presented an approach for estimating the fraction of peer-reviews that can be AI-generated. Our focus here is to solve a real-world problem by assisting the editor or chair in determining whether a review is written by ChatGPT or not. To address this, we introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting. We stress test these detectors against token attack and paraphrasing. Finally, we propose an effective defensive strategy to reduce the effect of paraphrasing on our models. Our findings suggest both our proposed methods perform better than the other AI text detectors. Our RR model is more robust, although our TF model performs better than the RR model without any attacks. We make our code, dataset, and model public.
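
The premise behind the TF model, that AI-written text often repeats tokens, can be illustrated with a toy repetition statistic. This is not the authors' detector: the tokenizer, the score definition, and the example sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def token_repetition_score(text):
    """Fraction of tokens that belong to a word type occurring more than once.
    A crude, hypothetical proxy for how repetitive a text is."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(tokens)


repetitive = "the paper is good and the paper is clear"
varied = "each sentence uses different words"
print(token_repetition_score(repetitive))
print(token_repetition_score(varied))
```

A real detector would need a calibrated threshold and a reference corpus; the sketch only shows what kind of signal a frequency-based approach can draw on.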

[NLP-111] BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation

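[Quick read]: This paper addresses two weaknesses of weight-decomposed low-rank adaptation (DoRA), a parameter-efficient fine-tuning (PEFT) method: the extra parameters it introduces raise the risk of overfitting, and the coupled updates of the direction and magnitude components limit learning capacity. The proposed method, BiDoRA, is a PEFT method based on bi-level optimization. It optimizes the direction and magnitude components on two distinct datasets at different optimization levels; this asynchronous optimization reduces the risk of overfitting, promotes decoupling of the two components, and improves adaptation to downstream tasks.

Link: https://arxiv.org/abs/2410.09758
Authors: Peijia Qin, Ruiyi Zhang, Pengtao Xie
Keywords (EN): gained considerable attention, Parameter-efficient fine-tuning, large language models, adapting LLMs, gained considerable
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: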

Abstract:Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has gained considerable attention as a flexible and efficient way of adapting LLMs to downstream tasks. Among these methods, weight-decomposed low-rank adaptation (DoRA) has emerged as a promising approach. DoRA bridges the gap between low-rank adaptation (LoRA) and full fine-tuning (FT) by decomposing the weight matrices into magnitude and direction components, thereby maintaining learning behavior similar to FT. Although DoRA shows encouraging performance, it introduces additional parameters compared to LoRA, which potentially increases the risk of overfitting. Moreover, optimizing magnitude and direction simultaneously leads to a coupled gradient updating pattern for both components, limiting its learning capacity. To overcome these limitations, we propose BiDoRA, a bi-level optimization-based PEFT method. In BiDoRA, the direction and magnitude components are optimized on two distinct datasets at different optimization levels, mitigating the risk of overfitting. Additionally, the asynchronous optimization of the two components promotes their decoupling, allowing for more flexible gradient updates suitable for various downstream tasks. Evaluation of BiDoRA on fourteen datasets spanning natural language understanding, natural language generation, and token classification reveals that it significantly outperforms DoRA and other PEFT methods. The superior performance of BiDoRA underscores its effectiveness. The code for BiDoRA is available at this https URL.
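
The magnitude/direction decomposition mentioned in the abstract can be sketched in plain Python. The form used here, a per-column magnitude `m` multiplied by the column-normalized matrix `W0 + B·A`, is an assumption based on the common description of DoRA; the shapes and values are made up, and the sketch shows neither training nor BiDoRA's bi-level optimization.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def dora_weight(W0, B, A, m):
    """Assumed DoRA-style merge: direction = (W0 + B @ A) with each column
    scaled to unit norm; the final weight rescales column j by magnitude m[j]."""
    delta = matmul(B, A)  # low-rank update
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    col_norms = [math.sqrt(sum(v * v for v in col)) for col in zip(*V)]
    return [
        [m[j] * V[i][j] / col_norms[j] for j in range(len(m))]
        for i in range(len(V))
    ]


# Hypothetical 2x2 frozen weight, rank-1 update, and per-column magnitudes.
W0 = [[2.0, 0.0], [2.0, 2.0]]
B = [[1.0], [2.0]]
A = [[1.0, 0.0]]
m = [10.0, 1.0]
W = dora_weight(W0, B, A, m)
print(W)
```

Under this form the norm of column j of the merged weight equals `m[j]`, so magnitude and direction are separate sets of parameters; BiDoRA's contribution, per the abstract, is to optimize them on different data splits at different optimization levels.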

[NLP-112] Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt

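[Quick read]: This paper addresses the lack of empirical validation of the Mutual Reinforcement Effect (MRE), the mutual enhancement between word-level and text-level classification in text classification tasks. The authors verify the existence of MRE through empirical experiments and show its contribution to model performance. Specifically, they run comparison experiments with fine-tuning on 21 MRE mix datasets to observe the effect, and then apply MRE to prompt learning by using word-level information as a verbalizer to strengthen the model's prediction of text-level classification labels. The F1-score exceeds the baseline on 18 of the 21 datasets, supporting the view that word-level information improves the model's understanding of the text as a whole.

Link: https://arxiv.org/abs/2410.09745
Authors: Chengguang Gan, Tatsunori Mori
Keywords (EN): Mutual Reinforcement Effect, Reinforcement Effect, Mutual Reinforcement, text classification tasks, MRE mix datasets
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures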

Abstract:The Mutual Reinforcement Effect (MRE) investigates the synergistic relationship between word-level and text-level classifications in text classification tasks. It posits that the performance of both classification levels can be mutually enhanced. However, this mechanism has not been adequately demonstrated or explained in prior research. To address this gap, we employ empirical experiments to observe and substantiate the MRE theory. Our experiments on 21 MRE mix datasets revealed the presence of MRE in the model and its impact. Specifically, we conducted comparison experiments using fine-tuning. The results of these comparison experiments corroborate the existence of MRE. Furthermore, we extended the application of MRE to prompt learning, utilizing word-level information as a verbalizer to bolster the model’s prediction of text-level classification labels. In our final experiment, the F1-score significantly surpassed the baseline in 18 out of 21 MRE mix datasets, further validating the notion that word-level information enhances the language model’s comprehension of the text as a whole.

[NLP-113] Taming Overconfidence in LLMs: Reward Calibration in RLHF

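[Quick read]: This paper addresses the overconfidence that large language models (LLMs) exhibit after training with Reinforcement Learning from Human Feedback (RLHF). It proposes two variants of Proximal Policy Optimization (PPO): PPO-M and PPO-C. PPO-M integrates explicit confidence scores into reward model training so that the reward model better captures the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO according to the difference between the current reward and the moving average of past rewards. Both methods reduce calibration error while maintaining performance comparable to standard PPO, and neither requires additional golden labels.

Link: https://arxiv.org/abs/2410.09724
Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
Keywords (EN): Large Language Models, Proximal Policy Optimization, Language model calibration, Large Language, Calibrated Reward Calculation
Categories: Computation and Language (cs.CL)
Comments: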

Abstract:Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. While previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident with a more sharpened output probability, in this study, we reveal that RLHF tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit inherent biases towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M: PPO with Calibrated Reward Modeling and PPO-C: PPO with Calibrated Reward Calculation. PPO-M integrates explicit confidence scores in reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets including multiple-choice and open-ended generation. Experiment results demonstrate that both of our methods can reduce calibration error and maintain performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.
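
The abstract states only that PPO-C adjusts the reward using the difference between the current reward and a moving average of past rewards. The sketch below is one hypothetical way to realize that idea: the exponential moving average, the `(confidence - 0.5)` scaling, and the `momentum` and `weight` values are all assumptions and should not be read as the paper's formula.

```python
class CalibratedReward:
    """Hypothetical reward adjustment in the spirit of PPO-C."""

    def __init__(self, momentum=0.1, weight=0.5):
        self.momentum = momentum  # assumed smoothing factor for the moving average
        self.weight = weight      # assumed strength of the adjustment
        self.average = None       # running average of past rewards

    def adjust(self, reward, confidence):
        """`confidence` is the verbalized confidence, rescaled to [0, 1]."""
        if self.average is None:
            self.average = reward
        delta = reward - self.average
        # Above-average responses gain from high stated confidence;
        # below-average responses are penalized for it.
        adjusted = reward + self.weight * delta * (confidence - 0.5)
        self.average = (1 - self.momentum) * self.average + self.momentum * reward
        return adjusted


calibrator = CalibratedReward(momentum=0.5, weight=1.0)
first = calibrator.adjust(1.0, confidence=0.9)   # sets the baseline
second = calibrator.adjust(0.0, confidence=0.9)  # weak answer, high confidence
third = calibrator.adjust(0.0, confidence=0.1)   # weak answer, low confidence
print(first, second, third)
```

In this toy run a below-average response that states high confidence ends up with a lower adjusted reward than the same response stating low confidence, which is the calibration pressure the method aims for.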

[NLP-114] Honest AI: Fine-Tuning “Small” Language Models to Say “I Don’t Know” and Reducing Hallucination in RAG

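[Quick read]: This paper addresses hallucination in large language models (LLMs), which is a particular obstacle for enterprise applications that are sensitive to information accuracy. It proposes Honest AI, a strategy that fine-tunes “small” language models (fewer than 10 billion parameters) to answer “I don’t know” in order to reduce hallucination, and it examines several retrieval-augmented generation (RAG) approaches: RAG with search engine and knowledge graph results, fine-tuning base LLMs with new information, and a hybrid of the two. The study finds that RAG alone yields limited improvement, while the hybrid approach with fine-tuning scores highest on the CRAG benchmark. The work also emphasizes resource efficiency.

Link: https://arxiv.org/abs/2410.09699
Authors: Xinxi Chen, Li Wang, Wei Wu, Qi Tang, Yiyao Liu
Keywords (EN): Large Language Models, Large Language, applications of Large, enterprise applications, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: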

Abstract:Hallucination is a key roadblock for applications of Large Language Models (LLMs), particularly for enterprise applications that are sensitive to information accuracy. To address this issue, two general approaches have been explored: Retrieval-Augmented Generation (RAG) to supply LLMs with updated information as context, and fine-tuning the LLMs with new information and desired output styles. In this paper, we propose Honest AI: a novel strategy to fine-tune “small” language models to say “I don’t know” to reduce hallucination, along with several alternative RAG approaches. The solution ranked 1st in Task 2 for the false premise question. The alternative approaches include using RAG with search engine and knowledge graph results, fine-tuning base LLMs with new information and combinations of both approaches. Although all approaches improve the performance of the LLMs, RAG alone does not significantly improve the performance and fine-tuning is needed for better results. Finally, the hybrid approach achieved the highest score in the CRAG benchmark. In addition, our approach emphasizes the use of relatively small models with fewer than 10 billion parameters, promoting resource efficiency.

[NLP-115] MoIN: Mixture of Introvert Experts to Upcycle an LLM

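[Quick read]: This paper aims to improve (upcycle) an existing large language model without the prohibitive cost of continued pre-training of the full model. The pre-training data is split into semantically related groups, and a lightweight expert (an adapter) is trained on each subset and added on top of the frozen base model. At inference time, a query is first routed to the most relevant expert, which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, these experts do not cooperate with other experts on a single query, hence the name “introvert” experts. Because the base model is frozen and the experts are lightweight adapters, all experts can be trained in parallel, and inference can be heavily parallelized by distributing experts across GPUs.

Link: https://arxiv.org/abs/2410.09687
Authors: Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash
Keywords (EN): existing large language, large language model, existing large, large language, prohibitive requirements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: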

Abstract:The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full-model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on the top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them “introvert” experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, the inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
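
Routing a query to a single most relevant expert can be sketched as a nearest-centroid lookup. The abstract does not say how relevance is computed, so the cosine-similarity rule, the expert names, and the two-dimensional vectors below are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def route(query_vector, expert_centroids):
    """Pick exactly one expert: the one whose data centroid is closest to the query."""
    return max(expert_centroids, key=lambda name: cosine(query_vector, expert_centroids[name]))


# Hypothetical experts, each summarized by the centroid of its training group.
expert_centroids = {
    "expert_code": [1.0, 0.0],
    "expert_medicine": [0.0, 1.0],
}
print(route([0.9, 0.2], expert_centroids))
```

Because each query is served by one expert only, experts never need to exchange activations, which is what makes the parallel training and per-GPU placement described in the abstract possible.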

[NLP-116] COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement

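[Quick read]: This paper addresses the inference latency that arises when iterative refinement for complex tasks is built on autoregressive (AR) LLMs, whose tokens are generated sequentially. It proposes Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. COrAL models multiple token dependencies within manageable context windows and uses sliding blockwise order-agnostic decoding to refine outputs internally and in parallel during generation, which improves both inference speed and task performance on reasoning tasks.

Link: https://arxiv.org/abs/2410.09675
Authors: Yuxi Xie, Anirudh Goyal, Xiaobao Wu, Xunjian Yin, Xiao Xu, Min-Yen Kan, Liangming Pan, William Yang Wang
Keywords (EN): Iterative refinement, large language models, effective paradigm, paradigm for enhancing, enhancing the capabilities
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, 3 tables (23 pages, 9 figures, 4 tables including references and appendices)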

Abstract:Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. However, existing approaches typically implement iterative refinement at the application or prompting level, relying on autoregressive (AR) modeling. The sequential token generation in AR models can lead to high inference latency. To overcome these challenges, we propose Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally during the generation process. Leveraging the order-agnostic nature of COrAL, we introduce sliding blockwise order-agnostic decoding, which performs multi-token forward prediction and backward reconstruction within context windows. This allows the model to iteratively refine its outputs in parallel in the sliding block, effectively capturing diverse dependencies without the high inference cost of sequential generation. Empirical evaluations on reasoning tasks demonstrate that COrAL improves performance and inference speed, respectively, achieving absolute accuracy gains of 4.6% on GSM8K and 4.0% on LogiQA, along with inference speedups of up to 3.9× over next-token baselines. Preliminary results on code generation indicate a drop in pass rates due to inconsistencies in order-agnostic outputs, highlighting the inherent quality-speed trade-off. Our code is publicly available at this https URL.

[NLP-117] OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models

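[Quick read]: This paper aims to improve the reasoning capabilities of large language models (LLMs) through OpenR, an open-source framework. OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding in a single software platform, and combines test-time compute, reinforcement learning, and process supervision to strengthen LLM reasoning. According to the authors, it is the first open-source framework to explore the core techniques of OpenAI’s o1 model with reinforcement learning, and its effectiveness is demonstrated by experiments on the MATH dataset.

Link: https://arxiv.org/abs/2410.09671
Authors: Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, Weinan Zhang
Keywords (EN): integrate key components, large language models, reinforcement learning, technical report, key components
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: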

Abstract:In this technical report, we introduce OpenR, an open-source framework designed to integrate key components for enhancing the reasoning capabilities of large language models (LLMs). OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding into a cohesive software platform. Our goal is to establish an open-source platform and community to accelerate the development of LLM reasoning. Inspired by the success of OpenAI’s o1 model, which demonstrated improved reasoning abilities through step-by-step reasoning and reinforcement learning, OpenR integrates test-time compute, reinforcement learning, and process supervision to improve reasoning in LLMs. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI’s o1 model with reinforcement learning, achieving advanced reasoning capabilities beyond traditional autoregressive methods. We demonstrate the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly available data and search methods. Our initial experiments confirm substantial gains, with relative improvements in reasoning and performance driven by test-time computation and reinforcement learning through process reward models. The OpenR framework, including code, models, and datasets, is accessible at this https URL.

[NLP-118] Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP2024

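[Quick read]: This paper addresses the tendency of prompt optimization for large language models (LLMs) to prioritize performance at the expense of safety and security. It introduces “Survival of the Safest” (SoS), a multi-objective prompt optimization framework that uses an interleaved multi-objective evolution strategy, integrating semantic, feedback, and crossover mutations, to improve performance and security at the same time. Unlike computationally demanding Pareto front methods, SoS is a scalable solution that speeds up optimization in complex, high-dimensional discrete search spaces while keeping computational demands low. It supports flexible weighting of objectives and produces a pool of optimized candidate prompts, from which users can select the prompt that best fits their performance and security needs. Experiments on several benchmark datasets show that SoS delivers high performance and notably better safety and security than single-objective methods.

Link: https://arxiv.org/abs/2410.09652
Authors: Ankita Sinha, Wendi Cui, Kamalika Das, Jiaxin Zhang
Keywords (EN): Large language models, demonstrated remarkable capabilities, Large language, prioritized performance metrics, historically prioritized performance
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: EMNLP 2024 Industry Track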

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities; however, the optimization of their prompts has historically prioritized performance metrics at the expense of crucial safety and security considerations. To overcome this shortcoming, we introduce “Survival of the Safest” (SoS), an innovative multi-objective prompt optimization framework that enhances both performance and security in LLMs simultaneously. SoS utilizes an interleaved multi-objective evolution strategy, integrating semantic, feedback, and crossover mutations to effectively traverse the prompt landscape. Differing from the computationally demanding Pareto front methods, SoS provides a scalable solution that expedites optimization in complex, high-dimensional discrete search spaces while keeping computational demands low. Our approach accommodates flexible weighting of objectives and generates a pool of optimized candidates, empowering users to select prompts that optimally meet their specific performance and security needs. Experimental evaluations across diverse benchmark datasets affirm SoS’s efficacy in delivering high performance and notably enhancing safety and security compared to single-objective methods. This advancement marks a significant stride towards the deployment of LLM systems that are both high-performing and secure across varied industrial applications.

[NLP-119] Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings EMNLP2024

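[Quick read]: This paper assesses whether research in computer vision follows the principles of the “bitter lesson” proposed by Rich Sutton, which emphasizes general-purpose learning algorithms and the use of computation. The authors analyze two decades of CVPR abstracts and titles with large language models (LLMs) to evaluate systematically how research approaches have evolved. The results reveal significant trends toward general-purpose learning algorithms and greater use of computational resources, and offer guidance for future research directions.

Link: https://arxiv.org/abs/2410.09649
Authors: Mojtaba Yousefi, Jack Collins
Keywords (EN): Pattern Recognition, Rich Sutton, proposed by Rich, Computer Vision, bitter lesson
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NLP4Science Workshop, EMNLP 2024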

Abstract:This study examines the alignment of Conference on Computer Vision and Pattern Recognition (CVPR) research with the principles of the “bitter lesson” proposed by Rich Sutton. We analyze two decades of CVPR abstracts and titles using large language models (LLMs) to assess the field’s embracement of these principles. Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development. This work contributes to the ongoing dialogue about the most effective strategies for advancing machine learning and computer vision, offering insights that may guide future research priorities and methodologies in the field.

[NLP-120] Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

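[Quick read]: This paper addresses vocabulary adaptation, the integration of new vocabulary into pre-trained language models to support new languages and mitigate token over-fragmentation, and proposes a new method called VocADT. The key is to use adapter modules that learn the optimal linear combination of existing embeddings while the model's weights stay fixed, giving a flexible and scalable adaptation that does not depend on external resources or language constraints. Experiments show that VocADT outperforms the original model and other baselines on various multilingual tasks, and that Latin-script languages and highly fragmented languages benefit the most.

Link: https://arxiv.org/abs/2410.09644
Authors: HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah
Keywords (EN): mitigates token over-fragmentation, enables expansion, token over-fragmentation, mitigates token, Vocabulary adaptation
Categories: Computation and Language (cs.CL)
Comments: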

Abstract:Vocabulary adaptation, which integrates new vocabulary into pre-trained language models (LMs), enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristic or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model’s weights fixed. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. Across 11 languages (with various scripts, resource availability, and fragmentation), we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation is still beneficial after fine-tuning and that VocADT is the most effective method.
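
The central idea in the VocADT abstract, representing a new vocabulary item as a linear combination of existing embeddings that stay frozen, can be illustrated with a small sketch. This is not the authors' implementation: the names `old_embeddings`, `adapter_weights`, and `new_token_embedding` are hypothetical, and the weights are set by hand here, whereas the paper trains adapter modules to learn them.

```python
# Hypothetical sketch: build a new token's embedding as a weighted sum
# of existing (frozen) embeddings. The weights are hand-set for
# illustration; in the paper they are learned by adapter modules.

def new_token_embedding(old_embeddings, weights):
    """Return sum_j weights[j] * old_embeddings[j] as a list of floats."""
    if len(old_embeddings) != len(weights):
        raise ValueError("need exactly one weight per existing embedding")
    dim = len(old_embeddings[0])
    combined = [0.0] * dim
    for vector, weight in zip(old_embeddings, weights):
        for i in range(dim):
            combined[i] += weight * vector[i]
    return combined


# Three existing 2-dimensional embeddings (toy values, never modified).
old_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
# One row of a hypothetical adapter: how much each old token contributes.
adapter_weights = [0.5, 0.5, 0.0]

merged = new_token_embedding(old_embeddings, adapter_weights)
print(merged)  # [0.5, 0.5]
```

Because only the combination weights would be trained in such a setup, the original embedding table and the rest of the model can remain untouched, which is the property the abstract emphasizes.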

[NLP-121] RepMatch: Quantifying Cross-Instance Similarities in Representation Space

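[Quick Read]: This paper addresses the limitations of existing dataset analysis methods in assessing similarity among training instances, particularly methods that focus solely on individual instances and are restricted to within-dataset analysis. The key to the solution is RepMatch, a new method that quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on those subsets. The method overcomes these limitations and supports similarity comparisons across datasets and instances, enabling broader dataset evaluation and analysis. Experiments across multiple NLP tasks, datasets, and models show that RepMatch is effective at identifying more representative data subsets and at uncovering the heuristics behind the construction of challenge datasets.

Link: https://arxiv.org/abs/2410.09642
Authors: Mohammad Reza Modarres,Sina Abbasi,Mohammad Taher Pilehvar
Keywords: categorizing data based, characterizing training data, techniques have enabled, enabled more sophisticated, sophisticated approaches
Categories: Computation and Language (cs.CL)
Notes: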

Abstract:Advances in dataset analysis techniques have enabled more sophisticated approaches to analyzing and characterizing training data instances, often categorizing data based on attributes such as “difficulty”. In this work, we introduce RepMatch, a novel method that characterizes data through the lens of similarity. RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them, overcoming the limitations of existing analysis methods that focus solely on individual instances and are restricted to within-dataset analysis. Our framework allows for a broader evaluation, enabling similarity comparisons across arbitrary subsets of instances, supporting both dataset-to-dataset and instance-to-dataset analyses. We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models. Through extensive experimentation, we demonstrate that RepMatch can effectively compare datasets, identify more representative subsets of a dataset (that lead to better performance than randomly selected subsets of equivalent size), and uncover heuristics underlying the construction of some challenge datasets.

[NLP-122] SciGisPy: a Novel Metric for Biomedical Text Simplification via Gist Inference Score

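[Quick Read]: This paper addresses the difficulty of evaluating automatic text simplification (ATS) for biomedical text, in particular the shortcomings of existing metrics in capturing whether core information is preserved. The key to the solution is SciGisPy, a new evaluation metric inspired by the Gist Inference Score (GIS) from Fuzzy-Trace Theory (FTT). It revises the original GIS formulation with domain-specific enhancements, including semantic chunking, Information Content (IC) theory, and specialized embeddings, so that it more accurately evaluates how well a simplified text conveys the core meaning of biomedical content. Experiments show that SciGisPy identifies correctly simplified texts far more often than the original GIS (84% versus 44.8%) and outperforms other existing approaches.

Link: https://arxiv.org/abs/2410.09632
Authors: Chen Lyu,Gabriele Pergola
Keywords: highly specialized language, challenges for non-experts, written in highly, posing significant comprehension, text
Categories: Computation and Language (cs.CL)
Notes: Accepted by the Third Workshop on Text Simplification, Accessibility and Readability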

Abstract:Biomedical literature is often written in highly specialized language, posing significant comprehension challenges for non-experts. Automatic text simplification (ATS) offers a solution by making such texts more accessible while preserving critical information. However, evaluating ATS for biomedical texts is still challenging due to the limitations of existing evaluation metrics. General-domain metrics like SARI, BLEU, and ROUGE focus on surface-level text features, and readability metrics like FKGL and ARI fail to account for domain-specific terminology or assess how well the simplified text conveys core meanings (gist). To address this, we introduce SciGisPy, a novel evaluation metric inspired by Gist Inference Score (GIS) from Fuzzy-Trace Theory (FTT). SciGisPy measures how well a simplified text facilitates the formation of abstract inferences (gist) necessary for comprehension, especially in the biomedical domain. We revise GIS for this purpose by introducing domain-specific enhancements, including semantic chunking, Information Content (IC) theory, and specialized embeddings, while removing unsuitable indexes. Our experimental evaluation on the Cochrane biomedical text simplification dataset demonstrates that SciGisPy outperforms the original GIS formulation, with a significant increase in correctly identified simplified texts (84% versus 44.8%). The results and a thorough ablation study confirm that SciGisPy better captures the essential meaning of biomedical content, outperforming existing approaches.

[NLP-123] Society of Medical Simplifiers

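[Quick Read]: This paper addresses medical text simplification, specifically how to make complex biomedical literature easier for non-experts to understand. The key to the solution is an LLM-based framework called the Society of Medical Simplifiers, inspired by the “Society of Mind” (SOM) philosophy. The framework assigns five distinct roles (Layperson, Simplifier, Medical Expert, Language Clarifier, and Redundancy Checker) and organizes them into interaction loops, using iterative refinement and collaboration among LLM agents to progressively improve the simplification while maintaining the complexity and accuracy of the original content. Evaluation on the Cochrane text simplification dataset shows performance on par with or better than state-of-the-art methods, with higher readability and content preservation achieved through controlled simplification.

Link: https://arxiv.org/abs/2410.09631
Authors: Chen Lyu,Gabriele Pergola
Keywords: making complex biomedical, complex biomedical literature, Medical text simplification, text simplification, accessible to non-experts
Categories: Computation and Language (cs.CL)
Notes: Accepted by Third Workshop on Text Simplification, Accessibility and Readability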

Abstract:Medical text simplification is crucial for making complex biomedical literature more accessible to non-experts. Traditional methods struggle with the specialized terms and jargon of medical texts, lacking the flexibility to adapt the simplification process dynamically. In contrast, recent advancements in large language models (LLMs) present unique opportunities by offering enhanced control over text simplification through iterative refinement and collaboration between specialized agents. In this work, we introduce the Society of Medical Simplifiers, a novel LLM-based framework inspired by the “Society of Mind” (SOM) philosophy. Our approach leverages the strengths of LLMs by assigning five distinct roles, i.e., Layperson, Simplifier, Medical Expert, Language Clarifier, and Redundancy Checker, organized into interaction loops. This structure allows the agents to progressively improve text simplification while maintaining the complexity and accuracy of the original content. Evaluations on the Cochrane text simplification dataset demonstrate that our framework is on par with or outperforms state-of-the-art methods, achieving superior readability and content preservation through controlled simplification processes.

[NLP-124] Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models EMNLP2024

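[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in refining previously seen knowledge and in integrating new knowledge from external sources. The key to the solution is Ski, a new synthetic knowledge ingestion method that uses fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. Ski and its variations are then combined with three knowledge injection techniques (RAG, SFT, and CPT) to inject and refine knowledge in language models. Experiments show that Ski significantly outperforms baseline methods on question-answering tasks across multiple domains, improving the effectiveness of knowledge injection.

Link: https://arxiv.org/abs/2410.09629
Authors: Jiaxin Zhang,Wendi Cui,Yiran Huang,Kamalika Das,Sricharan Kumar
Keywords: Large language models, Large language, Retrieval Augmented Generation, proficient in capturing, knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: EMNLP 2024 main conference long paper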

Abstract:Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

[NLP-125] Enhanced Electronic Health Records Text Summarization Using Large Language Models

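[Quick Read]: This paper addresses clinicians’ need for specific, focused summaries from electronic health record (EHR) summarization systems so that they can gain insights quickly. The key to the solution is fine-tuning the Google Flan-T5 model on an EHR question-answering dataset formatted in the style of the Stanford Question Answering Dataset (SQuAD), so that it generates tailored EHR summaries based on clinician-specified topics. Training uses Seq2SeqTrainer with optimized hyperparameters, and the system reports an Exact Match (EM) score of 81.81%, ROUGE scores of 96.03% (ROUGE-1), 86.67% (ROUGE-2), and 96.10% (ROUGE-L), and a BLEU score of 63%, supporting digital transformation in healthcare, streamlined workflows, and more personalized patient care.

Link: https://arxiv.org/abs/2410.09628
Authors: Ruvarashe Madzime,Clement Nyirenda
Keywords: Electronic Health Records, Health Records summarization, Electronic Health, Health Records, development of Electronic
Categories: Computation and Language (cs.CL)
Notes: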

Abstract:The development of Electronic Health Records summarization systems has revolutionized patient data management. Previous research advanced this field by adapting Large Language Models for clinical tasks, using diverse datasets to generate general EHR summaries. However, clinicians often require specific, focused summaries for quicker insights. This project builds on prior work by creating a system that generates clinician-preferred, focused summaries, improving EHR summarization for more efficient patient care. The proposed system leverages the Google Flan-T5 model to generate tailored EHR summaries based on clinician-specified topics. The approach involved fine-tuning the Flan-T5 model on an EHR question-answering dataset formatted in the Stanford Question Answering Dataset (SQuAD) style, which is a large-scale reading comprehension dataset with questions and answers. Fine-tuning utilized the Seq2SeqTrainer from the Hugging Face Transformers library with optimized hyperparameters. Key evaluation metrics demonstrated promising results: the system achieved an Exact Match (EM) score of 81.81%. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics showed strong performance, with ROUGE-1 at 96.03%, ROUGE-2 at 86.67%, and ROUGE-L at 96.10%. Additionally, the Bilingual Evaluation Understudy (BLEU) score was 63%, reflecting the model’s coherence in generating summaries. By enhancing EHR summarization through LLMs, this project supports digital transformation efforts in healthcare, streamlining workflows, and enabling more personalized patient care.
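
To make the Exact Match (EM) figure quoted above concrete, here is a minimal sketch of how an EM score is commonly computed for SQuAD-style question answering. The abstract does not say which normalization the authors applied, so the steps below (lowercasing, stripping punctuation and English articles, collapsing whitespace) are an assumption based on the widely used SQuAD convention, and the sample strings are invented for illustration.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized prediction equals the normalized reference."""
    return normalize_answer(prediction) == normalize_answer(reference)


def exact_match_score(predictions, references):
    """Percentage of predictions that exactly match their reference."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Invented examples: the first pair differs only in case, article, and
# punctuation; the second pair differs in the dose.
predictions = ["The patient has hypertension.", "Metformin 500 mg"]
references = ["patient has hypertension", "metformin 1000 mg"]

score = exact_match_score(predictions, references)
print(score)  # 50.0
```

EM is strict by design: a single differing token counts as a miss, which is why it is usually reported alongside overlap metrics such as ROUGE and BLEU.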

[NLP-126] Quebec Automobile Insurance Question-Answering With Retrieval-Augmented Generation EMNLP

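[Quick Read]: This paper examines how well large language models (LLMs) perform in the insurance domain, specifically in answering Quebec automobile insurance questions. The key to the approach is two new corpora: the Quebec Automobile Insurance Expertise Reference Corpus and a set of 82 Expert Answers to Layperson Automobile Insurance Questions. Using both corpora, the paper evaluates GPT4-o, a state-of-the-art LLM, on Quebec automobile insurance questions and finds that responses generated with the expertise reference corpus score better on both automatic and manual evaluation metrics. The study also notes that LLMs remain unreliable for critical domains: between 5% and 13% of answers contain a false statement that could lead to customer misunderstanding.

Link: https://arxiv.org/abs/2410.09623
Authors: David Beauchemin,Zachary Gagnon,Richard Khoury
Keywords: Large Language Models, Nuruzzaman and Hussain, Large Language, Language Models, Quebec Automobile Insurance
Categories: Computation and Language (cs.CL)
Notes: Accepted to NLLP 2024 EMNLP workshop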

Abstract:Large Language Models (LLMs) perform outstandingly in various downstream tasks, and the use of the Retrieval-Augmented Generation (RAG) architecture has been shown to improve performance for legal question answering (Nuruzzaman and Hussain, 2020; Louis et al., 2024). However, there are limited applications in insurance question answering, which involves a specific type of legal document. This paper introduces two corpora: the Quebec Automobile Insurance Expertise Reference Corpus and a set of 82 Expert Answers to Layperson Automobile Insurance Questions. Our study leverages both corpora to automatically and manually assess GPT4-o, a state-of-the-art LLM, on answering Quebec automobile insurance questions. Our results demonstrate that, on average, using our expertise reference corpus generates better responses on both automatic and manual evaluation metrics. However, they also highlight that LLM QA is not reliable enough for mass utilization in critical areas. Indeed, our results show that between 5% and 13% of answered questions include a false statement that could lead to customer misunderstanding.

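[NLP-127] Transformer-based Language Models for Reasoning in the Description Logic ALCQ KR

[Quick Read]: This paper addresses the limited evaluation of complex logical reasoning in transformer-based language models, in particular the reliance on benchmarks generated from simple first-order logic sentences. The key contribution is DELTA$_D$, a natural language dataset built with the expressive description logic language $\mathcal{ALCQ}$; it comprises 384K examples and grows along two dimensions, reasoning depth and linguistic complexity. By fine-tuning a DeBERTa-based model on DELTA$_D$ and prompting GPT-3.5 and GPT-4 with a small number of samples (9 shots), the authors show that the fine-tuned model can master the entailment checking task and that the GPT models improve significantly with few-shot prompting. The code and datasets are open-sourced.

Link: https://arxiv.org/abs/2410.09613
Authors: Angelos Poulis,Eleni Tsalapati,Manolis Koubarakis
Keywords: Recent advancements, advancements in transformer-based, sparked research, logical reasoning capabilities, transformer-based language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )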

Abstract:Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities. Most of the benchmarks used to evaluate these models are simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. DELTA$_D$ comprises 384K examples and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) with few-shot prompting. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task. Moreover, the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.

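[NLP-128] Traversing Emotional Landscapes and Linguistic Patterns in Bernard-Marie Koltès’ Plays: An NLP Perspective

[Quick Read]: This paper uses natural language processing (NLP) to analyze the linguistic and emotional dimensions of the plays of Bernard-Marie Koltès, a contemporary French playwright. The key to the approach is applying advanced computational techniques to dissect Koltès’ narrative style and reveal the subtle interplay between language and emotion in his work, deepening the understanding of his thematic explorations and contributing to digital humanities research in literary analysis.

Link: https://arxiv.org/abs/2410.09609
Authors: Arezou Zahiri Pourzarandi,Farshad Jafari
Keywords: Natural Language Processing, contemporary French theatre, study employs Natural, employs Natural Language, French theatre
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: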

Abstract:This study employs Natural Language Processing (NLP) to analyze the intricate linguistic and emotional dimensions within the plays of Bernard-Marie Koltès, a central figure in contemporary French theatre. By integrating advanced computational techniques, we dissect Koltès’ narrative style, revealing the subtle interplay between language and emotion across his dramatic oeuvre. Our findings highlight how Koltès crafts his narratives, enriching our understanding of his thematic explorations and contributing to the broader field of digital humanities in literary analysis.

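[NLP-129] I or Not I: Unraveling the Linguistic Echoes of Identity in Samuel Beckett’s “Not I” Through Natural Language Processing

[Quick Read]: This paper uses advanced natural language processing techniques to analyze the intricate linguistic structures of Samuel Beckett’s play “Not I” and to show how its language reflects the protagonist’s fragmented psyche. The key to the approach is analyzing word frequency, detecting emotional sentiment with a BERT-based model, and examining repetitive motifs in the text. This reveals how Beckett weaves themes of time, memory, and existential angst through recursive linguistic patterns and rhythmic repetition, deepening the understanding of his literary style and highlighting his distinctive place in modern literature, where language is used to explore profound existential questions.

Link: https://arxiv.org/abs/2410.09608
Authors: Arezou Zahiri Pourzarandi,Farshad Jafari
Keywords: Exploring the depths, language processing techniques, depths of Samuel, intricate linguistic structures, advanced natural language
Categories: Computation and Language (cs.CL)
Notes: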

Abstract:Exploring the depths of Samuel Beckett’s “Not I” through advanced natural language processing techniques, this research uncovers the intricate linguistic structures that underpin the text. By analyzing word frequency, detecting emotional sentiments with a BERT-based model, and examining repetitive motifs, we unveil how Beckett’s minimalist yet complex language reflects the protagonist’s fragmented psyche. Our results demonstrate that recurring themes of time, memory, and existential angst are artfully woven through recursive linguistic patterns and rhythmic repetition. This innovative approach not only deepens our understanding of Beckett’s stylistic contributions but also highlights his unique role in modern literature, where language transcends simple communication to explore profound existential questions.

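[NLP-130] Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis NEURIPS2024

[Quick Read]: This paper seeks to understand how a shallow transformer learns to recognize the co-occurrence of two designated words during training, and to characterize its training dynamics. The key to the solution is analyzing the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer, and providing a framework for analyzing these dynamics via a coupled dynamical system. The paper proves an “automatic balancing” property of the gradient flow, which makes the loss values of different samples decrease at almost the same rate and is central to proving that the training loss reaches a near-minimum value; the theoretical results are verified experimentally.

Link: https://arxiv.org/abs/2410.09605
Authors: Hongru Yang,Bhavya Kailkhura,Zhangyang Wang,Yingbin Liang
Keywords: large language models, important to explain, explain the impressive, impressive capabilities, capabilities behind large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: Accepted by NeurIPS 2024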

Abstract:Understanding the training dynamics of transformers is important to explain the impressive capabilities behind large language models. In this work, we study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words. In the literature of studying training dynamics of transformers, several simplifications are commonly adopted such as weight reparameterization, attention linearization, special initialization, and lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework of analyzing such dynamics via a coupled dynamical system. We establish near minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training process into two phases. In Phase 1, the linear MLP quickly aligns with the two target signals for correct classification, whereas the softmax attention remains almost unchanged. In Phase 2, the attention matrices and the MLP evolve jointly to enlarge the classification margin and reduce the loss to a near minimum value. Technically, we prove a novel property of the gradient flow, termed “automatic balancing of gradients”, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss. We also conduct experiments to verify our theoretical results.

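[NLP-131] Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

[Quick Read]: This paper addresses the assessment and improvement of instruction-following (IF) alignment in retrieval-augmented generation (RAG) systems. The key to the solution is VIF-RAG, an automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. VIF-RAG manually crafts a set of atomic instructions and develops combination rules to synthesize and verify complex instructions, uses supervised models for instruction rewriting, and automatically verifies instruction quality with a Python executor. These instructions are then integrated with extensive RAG and general data samples through automated processes to produce the high-quality VIF-RAG-QA dataset (100k). The paper also introduces the FollowRAG benchmark for evaluating instruction following in RAG systems, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets.

Link: https://arxiv.org/abs/2410.09584
Authors: Guanting Dong,Xiaoshuai Song,Yutao Zhu,Runqi Qiao,Zhicheng Dou,Ji-Rong Wen
Keywords: Retrieval-Augmented Generation, Large Language Models, RAG systems, RAG, effective application
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: Work in progress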

Abstract:Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at this https URL.
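
The abstract describes combining atomic instructions into complex ones and checking responses with executable code. A rough sketch of that general pattern is shown below. It is only an illustration under stated assumptions: the three constraint types (`max_words`, `must_include`, `ends_with`), the `verify` helper, and the sample responses are all hypothetical and are not taken from the VIF-RAG code or instruction set.

```python
# Hypothetical sketch of code-verifiable instruction constraints.
# Each factory returns a checker: a function from response text to bool.

def max_words(limit):
    """Constraint: the response uses at most `limit` whitespace-separated words."""
    def check(response):
        return len(response.split()) <= limit
    return check


def must_include(keyword):
    """Constraint: the response mentions `keyword` (case-insensitive)."""
    def check(response):
        return keyword.lower() in response.lower()
    return check


def ends_with(suffix):
    """Constraint: the response, ignoring trailing whitespace, ends with `suffix`."""
    def check(response):
        return response.rstrip().endswith(suffix)
    return check


def verify(response, checks):
    """A combined instruction is satisfied only if every atomic check passes."""
    return all(check(response) for check in checks)


# A combined instruction assembled from three atomic constraints.
combined = [max_words(12), must_include("Paris"), ends_with(".")]

good = "The capital of France is Paris."
bad = "It is a large European city"

print(verify(good, combined))  # True
print(verify(bad, combined))   # False
```

Expressing each constraint as a small executable check is what makes the combination verifiable automatically: a candidate response either passes every check or is rejected without human review.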

[NLP-132] SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search

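[Quick Read]: This paper addresses suboptimal conversational planning in existing conversational recommender systems (CRS), in particular the shortcomings of reinforcement learning (RL)-based methods that select actions with greedy or sampling strategies. The key to the solution is SAPIENT, a new CRS framework built on Monte Carlo Tree Search (MCTS). The framework consists of a conversational agent (S-agent) and a conversational planner (S-planner); S-planner uses MCTS to build a conversational search tree from the initial actions proposed by S-agent in order to find the best conversation plans. These plans then guide the training of S-agent, forming a self-training loop in which S-agent iteratively improves its conversational planning. The paper also proposes an efficient variant, SAPIENT-e, that trades off training efficiency against performance.

Link: https://arxiv.org/abs/2410.09580
Authors: Hanwen Du,Bo Peng,Xia Ning
Keywords: Conversational Recommender Systems, Recommender Systems, proactively engage users, provide personalized recommendations, elicit user preferences
Categories: Computation and Language (cs.CL)
Notes: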

Abstract:Conversational Recommender Systems (CRS) proactively engage users in interactive dialogues to elicit user preferences and provide personalized recommendations. Existing methods train Reinforcement Learning (RL)-based agents with greedy action selection or sampling strategies, and may suffer from suboptimal conversational planning. To address this, we present a novel Monte Carlo Tree Search (MCTS)-based CRS framework SAPIENT. SAPIENT consists of a conversational agent (S-agent) and a conversational planner (S-planner). S-planner builds a conversational search tree with MCTS based on the initial actions proposed by S-agent to find conversation plans. The best conversation plans from S-planner are used to guide the training of S-agent, creating a self-training loop where S-agent can iteratively improve its capability for conversational planning. Furthermore, we propose an efficient variant SAPIENT-e for trade-off between training efficiency and performance. Extensive experiments on four benchmark datasets validate the effectiveness of our approach, showing that SAPIENT outperforms the state-of-the-art baselines.
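
For readers unfamiliar with MCTS, the selection step that distinguishes it from greedy action choice can be sketched as follows. The abstract does not specify SAPIENT's selection rule, so this sketch assumes the standard UCT (UCB1) formula; the action names, statistics, and the exploration constant `c` are hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: average value plus an exploration bonus.

    Unvisited children get an infinite score so each is tried at least once.
    """
    if visits == 0:
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_action(children, c=1.4):
    """Pick the child with the highest UCT score.

    `children` maps an action name to a (total_value, visits) pair.
    """
    parent_visits = sum(visits for _, visits in children.values())
    return max(
        children,
        key=lambda a: uct_score(children[a][0], children[a][1], parent_visits, c),
    )


# Hypothetical statistics for three conversational actions.
children = {
    "ask_attribute": (3.0, 5),   # average value 0.6, visited often
    "recommend_item": (1.0, 1),  # average value 1.0, visited once
    "ask_category": (0.0, 0),    # never tried
}
first_choice = select_action(children)
print(first_choice)  # ask_category

# Once every action has been visited, the bonus favors rarely tried ones.
visited_only = {k: v for k, v in children.items() if v[1] > 0}
second_choice = select_action(visited_only)
print(second_choice)  # recommend_item
```

A purely greedy policy would compare average values alone; the exploration term is what lets a tree search revisit under-sampled actions before committing to a plan.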

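[NLP-133] The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models

[Quick Read]: This paper examines how large language models (LLMs) and generative AI can be used for automated question generation and answer assessment in education. The key to the approach is exploring the mechanisms of LLMs and prompting techniques such as zero-shot and chain-of-thought prompting to generate high-quality, diverse questions in multiple languages, along with advanced NLP methods such as fine-tuning and prompt-tuning for task-specific question generation. The paper also discusses LLMs for automated answer assessment, showing their potential to provide accurate feedback and identify nuanced understanding or misconceptions, and thereby to replace costly, time-consuming human assessment in education.

Link: https://arxiv.org/abs/2410.09576
Authors: Subhankar Maity,Aniket Deroy
Keywords: large language models, natural language processing, revolutionized natural language, offering unprecedented capabilities, recent years
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Book Chapter (Under Review)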

Abstract:In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.

[NLP-134] Reconstructive Visual Instruction Tuning

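[Quick read]: This paper addresses the loss of rich image detail in conventional visual instruction tuning, which supervises only text outputs. The key is reconstructive visual instruction tuning (ROSS), which supervises visual outputs by reconstructing latent representations of the input images rather than directly regressing raw RGB values. A denoising objective sidesteps the spatial redundancy of visual signals, which helps the model preserve image detail, improves fine-grained comprehension, and reduces hallucinations. ROSS brings significant improvements across different visual encoders and language models and is competitive with extrinsic-assistance methods that aggregate multiple visual experts, which supports the effectiveness of supervising visual outputs.

Link: https://arxiv.org/abs/2410.09575
Authors: Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang
Keywords: Large Multimodal Models, Large Multimodal, paper introduces reconstructive, family of Large, visual instruction tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: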

Abstract:This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.

[NLP-135] Are You Human? An Adversarial Benchmark to Expose LLMs

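[Quick read]: This paper addresses LLMs impersonating humans in conversation, particularly in high-stakes settings involving scams and deception. The key is to develop and evaluate text prompts designed as real-time challenges that expose an LLM imposter. Two kinds of challenge are proposed: "implicit challenges" exploit the LLM's instruction-following mechanism to cause role deviation, while "explicit challenges" test simple tasks that are easy for humans but hard for LLMs. Explicit challenges detected LLMs in 78.4% of cases and implicit challenges in 22.9%. A user study further found that humans outperform LLMs on explicit challenges (78% vs 22% success rate), supporting the method's real-world applicability.

Link: https://arxiv.org/abs/2410.09569
Authors: Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky
Keywords: Large Language Models, Large Language, Language Models, raising concerns, scams and deception
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: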

Abstract:Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing with an LLM. We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time. To this end we compile and release an open-source benchmark dataset that includes ‘implicit challenges’ that exploit an LLM’s instruction-following mechanism to cause role deviation, and ‘explicit challenges’ that test an LLM’s ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of our methods, with humans outperforming LLMs on explicit challenges (78% vs 22% success rate). Our framework unexpectedly revealed that many study participants were using LLMs to complete tasks, demonstrating its effectiveness in detecting both AI impostors and human misuse of AI tools. This work addresses the critical need for reliable, real-time LLM detection methods in high-stakes conversations.

[NLP-136] Extended Japanese Commonsense Morality Dataset with Masked Token and Label Enhancement

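[Quick read]: This paper addresses the tendency of existing AI models to overlook regional and cultural differences in moral reasoning. The key is to extend the JCommonsenseMorality (JCM) dataset into Extended JCM (eJCM) using Masked Token and Label Enhancement (MTLE), which uses a large language model to generate alternative expressions and re-assigns labels, increasing the dataset's diversity and complexity. The extension markedly improves model performance on complex moral reasoning tasks specific to Japanese culture and supports the importance of cultural context when developing AI models and datasets.

Link: https://arxiv.org/abs/2410.09564
Authors: Takumi Ohashi, Tsubasa Nakagawa, Hitoshi Iyatomi
Keywords: Rapid advancements, artificial intelligence, advancements in artificial, made it crucial, crucial to integrate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: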

Abstract:Rapid advancements in artificial intelligence (AI) have made it crucial to integrate moral reasoning into AI systems. However, existing models and datasets often overlook regional and cultural differences. To address this shortcoming, we have expanded the JCommonsenseMorality (JCM) dataset, the only publicly available dataset focused on Japanese morality. The Extended JCM (eJCM) has grown from the original 13,975 sentences to 31,184 sentences using our proposed sentence expansion method called Masked Token and Label Enhancement (MTLE). MTLE selectively masks important parts of sentences related to moral judgment and replaces them with alternative expressions generated by a large language model (LLM), while re-assigning appropriate labels. The model trained using our eJCM achieved an F1 score of 0.857, higher than the scores for the original JCM (0.837), ChatGPT one-shot classification (0.841), and data augmented using AugGPT, a state-of-the-art augmentation method (0.850). Specifically, in complex moral reasoning tasks unique to Japanese culture, the model trained with eJCM showed a significant improvement in performance (increasing from 0.681 to 0.756) and achieved a performance close to that of GPT-4 Turbo (0.787). These results demonstrate the validity of the eJCM dataset and the importance of developing models and datasets that consider the cultural context.
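
The mask-replace-relabel loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_propose` and `toy_relabel` are hypothetical hard-coded stand-ins for the LLM calls, the English example sentence is invented, and the authors' actual prompts, span selection, and labeling procedure are not given in this digest.

```python
# Illustrative sketch of a mask -> replace -> relabel expansion step.
# All names and data here are hypothetical, not the authors' code.

MASK = "[MASK]"

def mask_span(sentence, span):
    """Replace the morally relevant span with a mask token."""
    if span not in sentence:
        raise ValueError("span not found in sentence")
    return sentence.replace(span, MASK, 1)

def expand(sentence, span, propose, relabel):
    """Return (new_sentence, new_label) pairs for one source sentence."""
    masked = mask_span(sentence, span)
    results = []
    for alternative in propose(masked):
        candidate = masked.replace(MASK, alternative, 1)
        # The label is re-assigned because the new span may flip the judgment.
        results.append((candidate, relabel(candidate)))
    return results

# Stand-ins for the LLM calls (hard-coded for the demo).
def toy_propose(masked_sentence):
    return ["returned", "kept"]

def toy_relabel(sentence):
    # 1 = morally wrong, 0 = acceptable
    return 1 if "kept" in sentence else 0

augmented = expand("I found a wallet and stole it.", "stole",
                   toy_propose, toy_relabel)
for text, label in augmented:
    print(label, text)
```

Re-labeling each candidate is the point of the sketch: swapping the masked span can change the moral judgment, so the original label cannot simply be copied over.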

[NLP-137] A Speaker Turn-Aware Multi-Task Adversarial Network for Joint User Satisfaction Estimation and Sentiment Analysis

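[Quick read]: This paper addresses joint modeling of User Satisfaction Estimation (USE) and Sentiment Analysis (SA) in dialogue systems. Existing methods fail to separate task-specific features from common features, which yields sub-optimal utterance representations for downstream tasks. The key is a novel Speaker Turn-Aware Multi-Task Adversarial Network (STMAN): a multi-task adversarial strategy trains a task discriminator so that utterance representations become more task-specific, and a speaker-turn-aware multi-task interaction strategy extracts complementary common features, improving performance on both tasks.

Link: https://arxiv.org/abs/2410.09556
Authors: Kaisong Song, Yangyang Kang, Jiawei Liu, Xurui Li, Changlong Sun, Xiaozhong Liu
Keywords: User Satisfaction Estimation, User Satisfaction, Satisfaction Estimation, goal-oriented dialogue systems, User
Categories: Computation and Language (cs.CL)
Comments: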

Abstract:User Satisfaction Estimation is an important task and increasingly being applied in goal-oriented dialogue systems to estimate whether the user is satisfied with the service. It is observed that whether the user’s needs are met often triggers various sentiments, which can be pertinent to the successful estimation of user satisfaction, and vice versa. Thus, User Satisfaction Estimation (USE) and Sentiment Analysis (SA) should be treated as a joint, collaborative effort, considering the strong connections between the sentiment states of speakers and the user satisfaction. Existing joint learning frameworks mainly unify the two highly pertinent tasks over cascade or shared-bottom implementations, however they fail to distinguish task-specific and common features, which will produce sub-optimal utterance representations for downstream tasks. In this paper, we propose a novel Speaker Turn-Aware Multi-Task Adversarial Network (STMAN) for dialogue-level USE and utterance-level SA. Specifically, we first introduce a multi-task adversarial strategy which trains a task discriminator to make utterance representation more task-specific, and then utilize a speaker-turn aware multi-task interaction strategy to extract the common features which are complementary to each task. Extensive experiments conducted on two real-world service dialogue datasets show that our model outperforms several state-of-the-art methods.

[NLP-138] Exploring space efficiency in a tree-based linear model for extreme multi-label classification EMNLP2024

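[Quick read]: This paper addresses the space complexity of tree-based linear models for extreme multi-label classification (XMC). The key is a theoretical and empirical analysis of the storage a tree model needs under sparse data such as text: when the binary classifiers are trained, some features may go unused, leaving zeros in the weight vectors. Storing only the non-zero elements greatly reduces storage, and experiments show that tree models can save up to 95% of the space needed by the standard one-vs-rest method. The work also gives a simple procedure for estimating model size before training the classifiers in the tree nodes, so the model need not be modified by weight pruning or similar techniques when its size is already acceptable.

Link: https://arxiv.org/abs/2410.09554
Authors: He-Zhe Lin, Cheng-Hung Liu, Chih-Jen Lin
Keywords: identify relevant subsets, Extreme multi-label classification, Extreme multi-label, aims to identify, numerous labels
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP 2024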

Abstract:Extreme multi-label classification (XMC) aims to identify relevant subsets from numerous labels. Among the various approaches for XMC, tree-based linear models are effective due to their superior efficiency and simplicity. However, the space complexity of tree-based methods is not well-studied. Many past works assume that storing the model is not affordable and apply techniques such as pruning to save space, which may lead to performance loss. In this work, we conduct both theoretical and empirical analyses on the space to store a tree model under the assumption of sparse data, a condition frequently met in text data. We found that, some features may be unused when training binary classifiers in a tree method, resulting in zero values in the weight vectors. Hence, storing only non-zero elements can greatly save space. Our experimental results indicate that tree models can achieve up to a 95% reduction in storage space compared to the standard one-vs-rest method for multi-label text classification. Our research provides a simple procedure to estimate the size of a tree model before training any classifier in the tree nodes. Then, if the model size is already acceptable, this approach can help avoid modifying the model through weight pruning or other techniques.
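
To make the "store only non-zero elements" observation concrete, here is a minimal sketch comparing dense and sparse storage of a single classifier's weight vector. The feature count, the toy weights, and the byte-cost model (8 bytes per value, 4 bytes per index) are assumptions chosen for illustration and do not come from the paper.

```python
# Illustrative sketch: dense vs. sparse storage of one linear classifier.
# The vocabulary size, weights, and byte costs are made-up assumptions.

def to_sparse(weights):
    """Keep only (index -> value) entries for non-zero weights."""
    return {i: w for i, w in enumerate(weights) if w != 0.0}

def storage_cost(weights, bytes_per_value=8, bytes_per_index=4):
    """Return (dense_bytes, sparse_bytes) under a simple cost model."""
    dense = len(weights) * bytes_per_value
    nonzero = sum(1 for w in weights if w != 0.0)
    # A sparse entry must store its index alongside its value.
    sparse = nonzero * (bytes_per_value + bytes_per_index)
    return dense, sparse

# A node deep in the tree sees few documents, so most features never
# appear in its training subset and their weights stay exactly zero.
num_features = 1000
weights = [0.0] * num_features
for idx, val in [(3, 0.5), (42, -1.2), (999, 0.7)]:
    weights[idx] = val

sparse_weights = to_sparse(weights)
dense_bytes, sparse_bytes = storage_cost(weights)
reduction = 1.0 - sparse_bytes / dense_bytes
print(f"dense={dense_bytes} B, sparse={sparse_bytes} B, reduction={reduction:.1%}")
```

Sparse storage only pays off while the non-zero count stays well below the feature count, since every stored entry carries an index as overhead.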

[NLP-139] MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models

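[Quick read]: This paper addresses limitations in how the inductive reasoning ability of large language models (LLMs) is evaluated, in particular the lack of comprehensive evaluation and flexible test data. The key is Mirage, a synthetic dataset that allows flexible variation of input distribution, task scenario, and task difficulty, so that LLMs can be evaluated in both the inductive and the deductive stage. The multi-faceted evaluation shows that LLMs are poor rule-based reasoners but good neighbor-based reasoners: during induction, the model tends to rely on observed facts that lie close to the current test example in feature space, which significantly improves deductive performance within a localized region.

Link: https://arxiv.org/abs/2410.09542
Authors: Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Keywords: achieve higher intelligence, large language models, higher intelligence, Inductive reasoning, essential capability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures, under review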

Abstract:Inductive reasoning is an essential capability for large language models (LLMs) to achieve higher intelligence, which requires the model to generalize rules from observed facts and then apply them to unseen examples. We present Mirage, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs’ capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs’ inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that the LLM is a poor rule-based reasoner. In many cases, when conducting inductive reasoning, they do not rely on a correct rule to answer the unseen case. From the perspectives of different prompting methods, observation numbers, and task forms, models tend to consistently conduct correct deduction without correct inductive rules. Besides, we find that LLMs are good neighbor-based reasoners. In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space. By leveraging these similar examples, the model maintains strong inductive capabilities within a localized region, significantly improving its deductive performance.

[NLP-140] LINKED: Eliciting Filtering and Integrating Knowledge in Large Language Model for Commonsense Reasoning EMNLP2024

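[Quick read]: This paper addresses the poor performance of large language models (LLMs) on knowledge-intensive tasks, commonsense reasoning in particular. The key is a new method, LINKED, which uses a reward model to filter out noisy knowledge and a marginal consistent reasoning module to reduce invalid reasoning, improving accuracy on complex commonsense reasoning. On two complex commonsense reasoning benchmarks, LINKED outperforms state-of-the-art baselines by up to 9.0% accuracy. The paper also proposes a new metric, the effectiveness-preservation score, to measure the positive and negative impact of injected knowledge on model performance.

Link: https://arxiv.org/abs/2410.09541
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Jun Zhao
Keywords: demonstrate poor performance, demonstrate poor, poor performance, performance on knowledge-intensive, Large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Findings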

Abstract:Large language models (LLMs) sometimes demonstrate poor performance on knowledge-intensive tasks, commonsense reasoning is one of them. Researchers typically address these issues by retrieving related knowledge from knowledge graphs or employing self-enhancement methods to elicit knowledge in LLMs. However, noisy knowledge and invalid reasoning issues hamper their ability to answer questions accurately. To this end, we propose a novel method named eliciting, filtering and integrating knowledge in large language model (LINKED). In it, we design a reward model to filter out the noisy knowledge and take the marginal consistent reasoning module to reduce invalid reasoning. With our comprehensive experiments on two complex commonsense reasoning benchmarks, our method outperforms SOTA baselines (up to 9.0% improvement of accuracy). Besides, to measure the positive and negative impact of the injected knowledge, we propose a new metric called effectiveness-preservation score for the knowledge enhancement works. Finally, through extensive experiments, we conduct an in-depth analysis and find many meaningful conclusions about LLMs in commonsense reasoning tasks.
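
The filter-then-reason pipeline can be pictured with a small sketch. The reward scores, the 0.5 threshold, and the toy knowledge statements below are hypothetical, and a plain majority vote is used as a stand-in for the marginal consistent reasoning module, whose exact definition is not given in the abstract.

```python
# Illustrative sketch: filter elicited knowledge by a reward score, then
# aggregate the answers derived from the surviving knowledge.
# Scores, threshold, and the majority vote are assumptions, not the paper's.
from collections import Counter

def filter_knowledge(candidates, reward, threshold=0.5):
    """Drop knowledge statements whose reward falls below the threshold."""
    return [k for k in candidates if reward(k) >= threshold]

def vote(answers):
    """Return the most common answer; ties go to the first one seen."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical reward-model scores for elicited knowledge statements.
toy_rewards = {
    "birds have wings": 0.9,
    "penguins are birds": 0.8,
    "rocks can fly": 0.1,
    "most birds fly": 0.7,
}
# Hypothetical answers the LLM would give when conditioned on each statement.
toy_answers = {
    "birds have wings": "yes",
    "penguins are birds": "no",
    "rocks can fly": "no",
    "most birds fly": "yes",
}

kept = filter_knowledge(list(toy_rewards), toy_rewards.get)
final = vote([toy_answers[k] for k in kept])
print(kept, "->", final)
```

Filtering before aggregation keeps a low-reward statement from casting a vote at all, which is the intuition behind removing noisy knowledge first.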

[NLP-141] LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English EMNLP2024

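[Quick read]: This paper addresses the fact that existing legal NLP benchmarks cover only predictive tasks and overlook generative ones. The key is LexSumm, a benchmark for evaluating legal summarization that comprises eight English legal summarization datasets from jurisdictions including the US, UK, EU, and India. The paper also releases LexT5, a legal-oriented sequence-to-sequence model that addresses the limitations of existing BERT-style encoder-only models in the legal domain. Zero-shot probing on LegalLAMA and fine-tuning on LexSumm reveal abstraction and faithfulness errors even in summaries generated by zero-shot large language models, which points to room for further improvement.

Link: https://arxiv.org/abs/2410.09527
Authors: T.Y.S.S. Santosh, Cornelius Weiss, Matthias Grabmair
Keywords: evolving NLP landscape, NLP landscape, evolving NLP, gauging progress, serve as yardsticks
Categories: Computation and Language (cs.CL)
Comments: Accepted to NLLP Workshop, EMNLP 2024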

Abstract:In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing Legal NLP benchmarks only focus on predictive tasks, overlooking generative tasks. This work curates LexSumm, a benchmark designed for evaluating legal summarization tasks in English. It comprises eight English legal summarization datasets, from diverse jurisdictions, such as the US, UK, EU and India. Additionally, we release LexT5, legal oriented sequence-to-sequence model, addressing the limitation of the existing BERT-style encoder-only models in the legal domain. We assess its capabilities through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. Our analysis reveals abstraction and faithfulness errors even in summaries generated by zero-shot LLMs, indicating opportunities for further improvements. LexSumm benchmark and LexT5 model are available at this https URL.

[NLP-142] Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

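[Quick read]: This paper addresses how to express speech emphasis accurately in conversational text-to-speech (CTTS). The key is a new emphasis rendering scheme, ER-CTTS, which considers textual and acoustic context jointly, models semantics at both global and local levels, and deeply integrates multi-modal, multi-scale context to learn how context influences the emphasis of the current utterance. The inferred emphasis features are then fed into a neural speech synthesizer to generate conversational speech with appropriate emphasis. To address data scarcity, the authors add emphasis intensity annotations to an existing conversational dataset (DailyTalk).

Link: https://arxiv.org/abs/2410.09524
Authors: Rui Liu, Zhenqi Jia, Jie Yang, Yifan Hu, Haizhou Li
Keywords: aims to accurately, attention nowadays, accurately express, attracts more attention, emphasis
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to IEEE Transaction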

Abstract:Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting, which attracts more attention nowadays. While recognizing the significance of the CTTS task, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios, due to the scarcity of conversational emphasis datasets and the difficulty in context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: 1) we simultaneously take into account textual and acoustic contexts, with both global and local semantic modeling to understand the conversation context comprehensively; 2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed into the neural speech synthesizer to generate conversational speech. To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples are available at this https URL.

[NLP-143] Scito2M: A 2 Million 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis

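[Quick read]: This paper addresses key questions about how scientific knowledge is created, evolves, and spreads, especially knowledge exchange and citation patterns across disciplines. The key is Scito2M, a longitudinal scientometric dataset of more than two million academic publications with comprehensive content information and citation graphs for cross-disciplinary analysis. With Scito2M, the authors run a temporal study spanning over 30 years on the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange, revealing differences in knowledge production modes and citation practices between fields, such as the markedly different citation ages of application-driven fields (for example, large language models) and traditional theoretical disciplines (for example, oral history).

Link: https://arxiv.org/abs/2410.09510
Authors: Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang
Keywords: bridging diverse subject, diverse subject areas, addressing complex global, complex global challenges, Understanding the creation
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: 19 pages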

Abstract:Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce Scito2M, a longitudinal scientometric dataset with over two million academic publications, providing comprehensive contents information and citation graphs to support cross-disciplinary analyses. Using Scito2M, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal critical insights, such as disparities in epistemic cultures, knowledge production modes, and citation practices. For example, rapidly developing, application-driven fields like LLMs exhibit significantly shorter citation age (2.48 years) compared to traditional theoretical disciplines like oral history (9.71 years).
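
As a rough illustration of the citation-age comparison, the sketch below computes the mean gap in years between a paper and the works it cites, then averages over a field. This definition is an assumption (the abstract does not spell out how citation age is computed), and the publication records are invented toy data, not drawn from Scito2M.

```python
# Illustrative sketch: mean citation age per paper and per field.
# The definition and the toy records are assumptions for this example.
from statistics import mean

def citation_age(paper_year, cited_years):
    """Mean gap in years between a paper and the works it cites."""
    return mean(paper_year - y for y in cited_years)

def field_citation_age(papers):
    """Average the per-paper citation ages over a field."""
    return mean(citation_age(year, refs) for year, refs in papers)

# Hypothetical records: (publication year, years of the cited works).
fast_field = [(2024, [2023, 2022, 2023]), (2023, [2022, 2021])]
slow_field = [(2024, [2010, 2018, 2015]), (2023, [2005, 2019])]

print("fast-moving field:", round(field_citation_age(fast_field), 2))
print("slow-moving field:", round(field_citation_age(slow_field), 2))
```

Averaging per paper first, as done here, weights every paper equally regardless of reference-list length; pooling all citations in a field would be an equally plausible choice and can give a different number.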

[NLP-144] CollabEdit: Towards Non-destructive Collaborative Knowledge Editing

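[Quick read]: This paper addresses how to perform Knowledge Editing (KE) in collaborative learning of large language models (LLMs) while preserving privacy and efficiency. The key is COLLABEDIT, a non-destructive collaborative KE framework that uses a novel model merging mechanism to mimic global KE behavior while preventing severe performance drops. The mechanism targets the three challenges of collaborative KE: knowledge overlap, knowledge conflict, and knowledge forgetting.

Link: https://arxiv.org/abs/2410.09508
Authors: Jiamu Zheng, Jinghuai Zhang, Tianyu Du, Xuhong Zhang, Jianwei Yin, Tao Lin
Keywords: utilizing private data, large language models, efficiency and privacy, learning of large, large language
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: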

Abstract:Collaborative learning of large language models (LLMs) has emerged as a new paradigm for utilizing private data from different parties to guarantee efficiency and privacy. Meanwhile, Knowledge Editing (KE) for LLMs has also garnered increased attention due to its ability to manipulate the behaviors of LLMs explicitly, yet leaves the collaborative KE case (in which knowledge edits of multiple parties are aggregated in a privacy-preserving and continual manner) unexamined. To this end, this manuscript dives into the first investigation of collaborative KE, in which we start by carefully identifying the unique three challenges therein, including knowledge overlap, knowledge conflict, and knowledge forgetting. We then propose a non-destructive collaborative KE framework, COLLABEDIT, which employs a novel model merging mechanism to mimic the global KE behavior while preventing the severe performance drop. Extensive experiments on two canonical datasets demonstrate the superiority of COLLABEDIT compared to other destructive baselines, and results shed light on addressing three collaborative KE challenges and future applications.

[NLP-145] AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment

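[Quick read]: This paper addresses the limited explainability of automated scoring systems: because public rationale data is scarce and annotation is costly, existing methods usually rely on noisy rationales generated by large language models (LLMs). The key is AERA Chat, an interactive platform that presents visually explained assessments of student answers and streamlines the verification of rationales. Its core features are automated rationale generation, visualization, and evaluation tools, which let educators support their marking, let researchers evaluate the quality of rationales generated by different LLMs, and allow the platform to serve as an efficient annotation tool.

Link: https://arxiv.org/abs/2410.09507
Authors: Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, Yulan He
Keywords: justify scoring decisions, automated scoring systems, scoring systems, Generating rationales, justify scoring
Categories: Computation and Language (cs.CL)
Comments: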

Abstract:Generating rationales that justify scoring decisions has emerged as a promising approach to enhance explainability in the development of automated scoring systems. However, the scarcity of publicly available rationale data and the high cost of annotation have resulted in existing methods typically relying on noisy rationales generated by large language models (LLMs). To address these challenges, we have developed AERA Chat, an interactive platform, to provide visually explained assessment of student answers and streamline the verification of rationales. Users can input questions and student answers to obtain automated, explainable assessment results from LLMs. The platform’s innovative visualization features and robust evaluation tools make it useful for educators to assist their marking process, and for researchers to evaluate assessment performance and quality of rationales generated by different LLMs, or as a tool for efficient annotation. We evaluated three rationale generation approaches on our platform to demonstrate its capability.

[NLP-146] Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks EMNLP2024

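[Quick read]: This paper addresses the training efficiency of the Q-Former when it is used for multimodal alignment with large language models. The key is to fine-tune the Q-Former with parameter-efficient fine-tuning (PEFT), validated with InstructBLIP on the visual reasoning benchmarks ScienceQA and IconQA. PEFT reaches performance comparable to full fine-tuning with under 2% of the trainable parameters. The paper also uses AdaLoRA, which reallocates the parameter budget dynamically, to analyze the relative importance of the Q-Former's sublayers across benchmarks: self-attention layers are noticeably more important in perceptual visual-language reasoning tasks, while the importance of the feed-forward (FFN) layers depends on the complexity of the visual-language patterns involved.

Link: https://arxiv.org/abs/2410.09489
Authors: Sungkyung Kim, Adam Lee, Junyoung Park, Andrew Chung, Jusang Oh, Jay-Yoon Lee
Keywords: demonstrated enhanced capabilities, large language models, employing additional encoders, Recent advancements, large language
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings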

Abstract:Recent advancements in large language models have demonstrated enhanced capabilities in visual reasoning tasks by employing additional encoders for aligning different modalities. While the Q-Former has been widely used as a general encoder for aligning several modalities including image, video, audio, and 3D with large language models, previous works on its efficient training and the analysis of its individual components have been limited. In this work, we investigate the effectiveness of parameter efficient fine-tuning (PEFT) the Q-Former using InstructBLIP with visual reasoning benchmarks ScienceQA and IconQA. We observe that applying PEFT to the Q-Former achieves comparable performance to full fine-tuning using under 2% of the trainable parameters. Additionally, we employ AdaLoRA for dynamic parameter budget reallocation to examine the relative importance of the Q-Former’s sublayers with 4 different benchmarks. Our findings reveal that the self-attention layers are noticeably more important in perceptual visual-language reasoning tasks, and relative importance of FFN layers depends on the complexity of visual-language patterns involved in tasks. The code is available at this https URL.
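
To give a feel for why parameter-efficient fine-tuning trains so few parameters, the sketch below counts the parameters a low-rank (LoRA-style) adapter adds to a set of weight matrices. The layer shapes and rank are hypothetical and are not the Q-Former's actual configuration, and the resulting fraction is relative to the adapted matrices only, so it is not directly comparable to the "under 2%" figure reported in the abstract.

```python
# Illustrative sketch: parameter count of low-rank adapters.
# Layer shapes and rank are assumptions, not the Q-Former's real config.

def lora_params(d_in, d_out, rank):
    """A rank-r update adds B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

def trainable_fraction(layers, rank):
    """layers: list of (d_in, d_out) shapes that receive an adapter."""
    full = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layers)
    return added / full

# Hypothetical block: four 768x768 attention projections and two FFN matrices.
layers = [(768, 768)] * 4 + [(768, 3072), (3072, 768)]
fraction = trainable_fraction(layers, rank=4)
print(f"adapter parameters relative to adapted weights: {fraction:.2%}")
```

Because the adapter cost grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, the fraction shrinks as the matrices get larger, which is what makes a per-layer rank budget (as AdaLoRA reallocates) cheap to vary.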

[NLP-147] Automatic Speech Recognition with BERT and CTC Transformers: A Review

Link: https://arxiv.org/abs/2410.09456
Authors: Noussaiba Djeffal, Hamza Kheddar, Djamel Addou, Ahmed Cherif Mazari, Yassine Himeur
Keywords:
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

[NLP-148] VERITAS-NLI : Validation and Extraction of Reliable Information Through Automated Scraping and Natural Language Inference

Link: https://arxiv.org/abs/2410.09455
Authors: Arjun Shah, Hetansh Shah, Vedica Bafna, Charmi Khandor, Sindhu Nair
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, 15 pages, 7 figures

[NLP-149] Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning

Link: https://arxiv.org/abs/2410.09449
Authors: Yi Dai
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Under Progress

[NLP-150] Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution CONLL2024

Link: https://arxiv.org/abs/2410.09448
Authors: Ian Porada, Jackie Chi Kit Cheung
Keywords:
Categories: Computation and Language (cs.CL)
Comments: CoNLL 2024

[NLP-151] MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning

Link: https://arxiv.org/abs/2410.09437
Authors: Yaming Yang, Dilixat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, Yunhai Tong
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 Pages, 4 Figures

[NLP-152] Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09432
Authors: Raghav Singhal, Kaustubh Ponkshe, Praneeth Vepakomma
Keywords:
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: RS and KP contributed equally to this work: 18 Pages, 9 Figures, and 8 Tables. Another version of the paper accepted at NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability

[NLP-153] Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets KR

Link: https://arxiv.org/abs/2410.09428
Authors: Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

[NLP-154] FlatQuant: Flatness Matters for LLM Quantization

Link: https://arxiv.org/abs/2410.09426
Authors: Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages

[NLP-155] VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment EMNLP2024

Link: https://arxiv.org/abs/2410.09421
Authors: Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2024 Main Conference camera-ready version. This article supersedes arXiv:2312.10665

[NLP-156] Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models

Link: https://arxiv.org/abs/2410.09418
Authors: Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Chen Xu, Heyan Huang
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-157] FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs Responsiveness to Human Feedback

Link: https://arxiv.org/abs/2410.09412
Authors: Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, Wentao Zhang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-158] CAMPHOR: Collaborative Agents for Multi-input Planning and High-Order Reasoning On Device

Link: https://arxiv.org/abs/2410.09407
Authors: Yicheng Fu, Raviteja Anantha, Jianpeng Cheng
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-159] Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation

Link: https://arxiv.org/abs/2410.09403
Authors: Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhe Li, Zhenfei Yin, Wanli Ouyang, Nanqing Dong
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

[NLP-160] Text Classification using Graph Convolutional Networks: A Comprehensive Survey

Link: https://arxiv.org/abs/2410.09399
Authors: Syed Mustafa Haider Rizvi, Ramsha Imran, Arif Mahmood
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-161] Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes

Link: https://arxiv.org/abs/2410.09397
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments:

[NLP-162] LogLM: From Task-based to Instruction-based Automated Log Analysis

Link: https://arxiv.org/abs/2410.09352
Authors: Yilun Liu, Yuhe Ji, Shimin Tao, Minggui He, Weibin Meng, Shenglin Zhang, Yongqian Sun, Yuming Xie, Boxing Chen, Hao Yang
Keywords:
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

[NLP-163] Generative Subgraph Retrieval for Knowledge Graph-Grounded Dialog Generation EMNLP

Link: https://arxiv.org/abs/2410.09350
Authors: Jinyoung Park, Minseok Joo, Joo-Kyung Kim, Hyunwoo J. Kim
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP (main)

[NLP-164] Inference and Verbalization Functions During In-Context Learning EMNLP2024

Link: https://arxiv.org/abs/2410.09349
Authors: Junyi Tao, Xiaoyin Chen, Nelson F. Liu
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings

[NLP-165] DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models

Link: https://arxiv.org/abs/2410.09344
Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-166] ELICIT: LLM Augmentation via External In-Context Capability

Link: https://arxiv.org/abs/2410.09343
Authors: Futing Wang, Jianhao Yan, Yue Zhang, Tao Lin
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Work in progress

[NLP-167] LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models

Link: https://arxiv.org/abs/2410.09342
Authors: Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, Xu Han, Xiaodong Shi, Zhiyuan Liu, Maosong Sun
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Work in Progress. Code: this https URL

[NLP-168] Keys to Robust Edits: from Theoretical Insights to Practical Advances

Link: https://arxiv.org/abs/2410.09338
Authors: Jianhao Yan, Futing Wang, Yun Luo, Yafu Li, Yue Zhang
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Work in progress

[NLP-169] Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Link: https://arxiv.org/abs/2410.09335
Authors: Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-170] Zero-shot Commonsense Reasoning over Machine Imagination EMNLP2024

Link: https://arxiv.org/abs/2410.09329
Authors: Hyuntae Park, Yeachan Kim, Jun-Hyung Park, SangKeun Lee (Korea University)
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 9 figures, EMNLP 2024 (Findings)

[NLP-171] Hey AI Can You Grade My Essay?: Automatic Essay Grading AAAI

Link: https://arxiv.org/abs/2410.09319
Authors: Maisha Maliha, Vishal Pramanik
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in ICAAAIML (4th International Conference on Advances and Applications of Artificial Intelligence and Machine Learning) 2023

[NLP-172] Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbations

Link: https://arxiv.org/abs/2410.09318
Authors: Saiful Islam Salim, Rubin Yuchan Yang, Alexander Cooper, Suryashree Ray, Saumya Debray, Sazzadur Rahaman
Keywords:
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments:

[NLP-173] llinstruct: An Instruction-tuned model for English Language Proficiency Assessments

Link: https://arxiv.org/abs/2410.09314
Authors: Debanjan Ghosh, Sophia Chan
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-174] Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Link: https://arxiv.org/abs/2410.09303
Authors: Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-175] Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Link: https://arxiv.org/abs/2410.09302
Authors: Guanlin Liu, Kaixuan Ji, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, Lin Yan
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-176] Nudging: Inference-time Alignment via Model Collaboration

Link: https://arxiv.org/abs/2410.09300
Authors: Yu Fei, Yasaman Razeghi, Sameer Singh
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-177] Natural Language Counterfactual Explanations for Graphs Using Large Language Models

Link: https://arxiv.org/abs/2410.09295
Authors: Flavio Giorgi, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-178] Comparative Analysis of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters

Link: https://arxiv.org/abs/2410.09283
Authors: Yifan Liu, Gelila Tilahun, Xinxiang Gao, Qianfeng Wen, Michael Gervers
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures

[NLP-179] ReasonPlanner: Enhancing Autonomous Planning in Dynamic Environments with Temporal Knowledge Graphs and LLMs

Link: https://arxiv.org/abs/2410.09252
Authors: Minh Pham Dinh, Munira Syed, Michael G Yankoski, Trenton W. Ford
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

[NLP-180] Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Link: https://arxiv.org/abs/2410.09247
Authors: Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-181] Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin

Link: https://arxiv.org/abs/2410.09245
Authors: Gleb Schmidt, Svetlana Gorovaia, Ivan P. Yamshchikov
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 9 pages, NLP4DH 2024

[NLP-182] nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

Link: https://arxiv.org/abs/2410.09240
Authors: Maksim Kuznetsov, Airat Valiev, Alex Aliper, Daniil Polykovskiy, Elena Tutubalina, Rim Shayakhmetov, Zulfat Miftahutdinov
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

[NLP-183] Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports

Link: https://arxiv.org/abs/2410.09234
Authors: Luoyao Chen, Revant Teotia, Antonio Verdone, Aidan Cardall, Lakshay Tyagi, Yiqiu Shen, Sumit Chopra
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 4 tables

[NLP-184] Improving semantic understanding in speech language models via brain-tuning ICLR2025

Link: https://arxiv.org/abs/2410.09230
Authors: Omer Moussa, Dietrich Klakow, Mariya Toneva
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review at ICLR 2025

[NLP-185] The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling

Link: https://arxiv.org/abs/2410.09223
Authors: Ruochen Zhang, Qinan Yu, Matianyu Zang, Carsten Eickhoff, Ellie Pavlick
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-186] M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought EMNLP2024

Link: https://arxiv.org/abs/2410.09220
Authors: Gitanjali Kumari, Kirtan Jain, Asif Ekbal
Keywords:
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 34 Pages. Accepted in The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Main Conference

[NLP-187] P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Link: https://arxiv.org/abs/2410.09207
Authors: Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-188] Encoding Agent Trajectories as Representations with Sequence Transformers

Link: https://arxiv.org/abs/2410.09204
Authors: Athanasios Tsiligkaridis, Nicholas Kalinowski, Zhongheng Li, Elizabeth Hou
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, to be presented at GeoAI workshop at ACM SigSpatial 2024

[NLP-189] Long Range Named Entity Recognition for Marathi Documents

Link: https://arxiv.org/abs/2410.09192
Authors: Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Geetanjali Kale, Raviraj Joshi
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-190] Automated Rewards via LLM-Generated Progress Functions

Link: https://arxiv.org/abs/2410.09187
Authors: Vishnu Sarukkai, Brennan Shacklett, Zander Majercik, Kush Bhatia, Christopher Ré, Kayvon Fatahalian
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 5 figures

[NLP-191] L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Link: https://arxiv.org/abs/2410.09184
Authors: Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Raviraj Joshi
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-192] Can a large language model be a gaslighter?

Link: https://arxiv.org/abs/2410.09181
Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 10/26 (Main Body/Total), 8 figures

[NLP-193] Context-Aware SQL Error Correction Using Few-Shot Learning – A Novel Approach Based on NLQ Error and SQL Similarity CIKM2024

Link: https://arxiv.org/abs/2410.09174
Authors: Divyansh Jain, Eric Yang
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Accepted for the 1st Workshop on GenAI and RAG Systems for Enterprise @ CIKM 2024

[NLP-194] Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications

Link: https://arxiv.org/abs/2410.09168
Authors: Alexey Zhezherau, Alexei Yanockin
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 7 figures

[NLP-195] ACER: Automatic Language Model Context Extension via Retrieval

Link: https://arxiv.org/abs/2410.09141
Authors: Luyu Gao, Yunyi Zhang, Jamie Callan
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

[NLP-196] nextlocllm: next location prediction using LLMs

Link: https://arxiv.org/abs/2410.09129
Authors: Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages

[NLP-197] HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction NEURIPS2024

Link: https://arxiv.org/abs/2410.09112
Authors: Qianyue Hao, Jingyang Fan, Fengli Xu, Jian Yuan, Yong Li
Keywords:
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NeurIPS 2024 paper

[NLP-198] Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

Link: https://arxiv.org/abs/2410.09102
Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

[NLP-199] Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations

Link: https://arxiv.org/abs/2410.09097
Authors: Tarun Raheja, Nilay Pochhi
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures

[NLP-200] Mechanistic? EMNLP2024

Link: https://arxiv.org/abs/2410.09087
Authors: Naomi Saphra, Sarah Wiegreffe
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Equal contribution. Position paper. Accepted for presentation at the BlackBoxNLP workshop at EMNLP 2024

[NLP-201] Diagnosing Robotics Systems Issues with Large Language Models

Link: https://arxiv.org/abs/2410.09084
Authors: Jordis Emilia Herrmann, Aswath Mandakath Gopinath, Mikael Norrlof, Mark Niklas Müller
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

[NLP-202] Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Link: https://arxiv.org/abs/2410.09083
Authors: Lu Chen, Yuxuan Huang, Yixing Li, Yaohui Jin, Shuai Zhao, Zilong Zheng, Quanshi Zhang
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[NLP-203] Leveraging Social Determinants of Health in Alzheimer's Research Using LLM-Augmented Literature Mining and Knowledge Graphs

Link: https://arxiv.org/abs/2410.09080
Authors: Tianqi Shang, Shu Yang, Weiqing He, Tianhua Zhai, Dawei Li, Bojian Hou, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Li Shen
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

[NLP-204] BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09079
Authors: Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Findings)

[NLP-205] Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs

Link: https://arxiv.org/abs/2410.09078
Authors: Tomas Bueno Momcilovic, Dian Balta, Beat Buesser, Giulio Zizzo, Mark Purcell
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: Accepted in the VECOMP 2024 workshop

[NLP-206] A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2410.09077
Authors: Yilong Zhao, Daifeng Li
Keywords:
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

[NLP-207] Llettuce: An Open Source Natural Language Processing Tool for the Translation of Medical Terms into Uniform Clinical Encoding

Link: https://arxiv.org/abs/2410.09076
Authors: James Mitchell-White, Reza Omdivar, Esmond Urwin, Karthikeyan Sivakumar, Ruizhe Li, Andy Rae, Xiaoyan Wang, Theresia Mina, John Chambers, Grazziela Figueredo, Philip R Quinlan
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-208] Investigating the Impact of Text Summarization on Topic Modeling

Link: https://arxiv.org/abs/2410.09063
Authors: Trishia Khandelwal
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures

[NLP-209] Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning ACL2024

Link: https://arxiv.org/abs/2402.18344
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as a long paper to ACL 2024 Main, 25 pages, 22 figures

[NLP-210] How to Construct Random Unitaries

Link: https://arxiv.org/abs/2410.10116
Authors: Fermi Ma, Hsin-Yuan Huang
Keywords:
Categories: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Computation and Language (cs.CL); Mathematical Physics (math-ph)
Comments: 76 pages

Artificial Intelligence

[AI-0] TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Link: https://arxiv.org/abs/2410.10818
Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
Keywords: temporal understanding, temporal, fine-grained temporal, Understanding, video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models’ temporal reasoning capabilities. Both dataset and evaluation code will be made available.
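The abstract names Multiple Binary Accuracy (MBA) but does not define it. The sketch below shows one plausible reading, stated here as an assumption: each multi-choice question is split into several binary comparisons (true caption versus one negative caption), and the question counts as correct only if every comparison is answered correctly. The function name and input layout are hypothetical and are not taken from the TemporalBench evaluation code.

```python
from typing import Sequence


def multiple_binary_accuracy(binary_results: Sequence[Sequence[bool]]) -> float:
    """Share of questions whose binary comparisons were ALL answered correctly.

    binary_results[i][j] is True when, for question i, the model preferred the
    true caption over negative caption j. (Assumed layout, for illustration.)
    """
    if not binary_results:
        raise ValueError("need at least one question")
    # A question with no comparisons is treated as incorrect rather than
    # vacuously correct.
    fully_correct = sum(1 for item in binary_results if item and all(item))
    return fully_correct / len(binary_results)


# Three questions: the second one fails a single comparison.
results = [[True, True], [True, False], [True]]
print(multiple_binary_accuracy(results))
```

Under this reading a model cannot score by spotting the "most central" option among several near-duplicates, because every negative must be rejected on its own.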

[AI-1] LVD-2M: A Long-take Video Dataset with Temporally Dense Captions NEURIPS2024

Link: https://arxiv.org/abs/2410.10816
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
Keywords: long video generation, video generation models, video generation, generation models, generation models heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024 Dataset and Benchmark Track. Project page: this https URL . Code: this https URL

Abstract:The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.

[AI-2] Depth Any Video with Scalable Synthetic Data

Link: https://arxiv.org/abs/2410.10815
Authors: Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He
Keywords: ground truth data, scalable ground truth, leading to inconsistent, unreliable results, Video depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates-even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

[AI-3] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Link: https://arxiv.org/abs/2410.10812
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
Keywords: Hybrid Autoregressive Transformer, Autoregressive Transformer, introduce Hybrid Autoregressive, rivaling diffusion models, visual generation model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Demo: this https URL . The first two authors contributed equally to this work

Abstract:We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at this https URL.

[AI-4] Hard-Constrained Neural Networks with Universal Approximation Guarantees

Link: https://arxiv.org/abs/2410.10807
Authors: Youngjae Min, Anoopkumar Sonar, Navid Azizan
Keywords: Incorporating prior knowledge, gained significant attention, Incorporating prior, significant attention, prior knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Incorporating prior knowledge or specifications of input-output relationships into machine learning models has gained significant attention, as it enhances generalization from limited data and leads to conforming outputs. However, most existing approaches use soft constraints by penalizing violations through regularization, which offers no guarantee of constraint satisfaction – an essential requirement in safety-critical applications. On the other hand, imposing hard constraints on neural networks may hinder their representational power, adversely affecting performance. To address this, we propose HardNet, a practical framework for constructing neural networks that inherently satisfy hard constraints without sacrificing model capacity. Specifically, we encode affine and convex hard constraints, dependent on both inputs and outputs, by appending a differentiable projection layer to the network’s output. This architecture allows unconstrained optimization of the network parameters using standard algorithms while ensuring constraint satisfaction by construction. Furthermore, we show that HardNet retains the universal approximation capabilities of neural networks. We demonstrate the versatility and effectiveness of HardNet across various applications: fitting functions under constraints, learning optimization solvers, optimizing control policies in safety-critical systems, and learning safe decision logic for aircraft systems.
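To make the idea of a projection layer concrete, here is a minimal sketch of the simplest case: enforcing a single affine inequality a·y ≤ b by Euclidean projection onto the half-space. This is the textbook closed form, not HardNet's actual layer, which the abstract describes only at a high level; the function name and the example constraint are assumptions chosen for illustration.

```python
from typing import List, Sequence


def project_onto_halfspace(y: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Euclidean projection of y onto the set {z : a . z <= b}.

    If y already satisfies the constraint it is returned unchanged; otherwise
    it is moved along -a by exactly the amount needed to land on the boundary.
    """
    if len(y) != len(a):
        raise ValueError("y and a must have the same length")
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint normal a must be non-zero")
    violation = sum(ai * yi for ai, yi in zip(a, y)) - b
    if violation <= 0.0:
        return list(y)
    scale = violation / norm_sq
    return [yi - scale * ai for yi, ai in zip(y, a)]


# Hypothetical raw network output and constraint y1 + y2 <= 2.
print(project_onto_halfspace([2.0, 2.0], a=[1.0, 1.0], b=2.0))
```

Because the map is piecewise linear in y, gradients can flow through it during training, which is the property that lets the network parameters be optimized without constraints while the output stays feasible.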

[AI-5] Boosting Camera Motion Control for Video Diffusion Transformers

Link: https://arxiv.org/abs/2410.10802
Authors: Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, Chun-Hao Paul Huang
Keywords: Recent advancements, camera, camera control, enhanced the quality, Recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT)-the preferred architecture for large-scale video generation - suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning methods rather than camera pose representations that is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.
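Camera Motion Guidance is described as building on classifier-free guidance. The sketch below shows only the standard classifier-free guidance combination, in which the conditional prediction is extrapolated away from the unconditional one by a guidance scale. How the paper applies this to camera conditioning specifically is not stated in the abstract, so the function name and inputs here are illustrative assumptions.

```python
from typing import List, Sequence


def guided_prediction(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1 reproduces the conditional prediction; scale > 1 pushes the
    output further in the direction the condition suggests.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy noise predictions with and without the (hypothetical) camera condition.
print(guided_prediction(uncond=[0.0, 1.0], cond=[1.0, 3.0], scale=2.0))
```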

[AI-6] On Information-Theoretic Measures of Predictive Uncertainty

Link: https://arxiv.org/abs/2410.10786
Authors: Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, Sepp Hochreiter
Keywords: machine learning applications, predictive uncertainty, predictive uncertainty measures, Reliable estimation, learning applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Reliable estimation of predictive uncertainty is crucial for machine learning applications, particularly in high-stakes scenarios where hedging against risks is essential. Despite its significance, a consensus on the correct measurement of predictive uncertainty remains elusive. In this work, we return to first principles to develop a fundamental framework of information-theoretic predictive uncertainty measures. Our proposed framework categorizes predictive uncertainty measures according to two factors: (I) The predicting model (II) The approximation of the true predictive distribution. Examining all possible combinations of these two factors, we derive a set of predictive uncertainty measures that includes both known and newly introduced ones. We empirically evaluate these measures in typical uncertainty estimation settings, such as misclassification detection, selective prediction, and out-of-distribution detection. The results show that no single measure is universal, but the effectiveness depends on the specific setting. Thus, our work provides clarity about the suitability of predictive uncertainty measures by clarifying their implicit assumptions and relationships.
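For readers unfamiliar with information-theoretic uncertainty measures, the sketch below computes the widely used entropy-based decomposition for an ensemble: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the members, and their difference (the mutual information) is taken as epistemic uncertainty. This is one well-known measure of the kind the abstract discusses, not necessarily one of the paper's newly introduced ones; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def decompose_uncertainty(member_probs: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for a list of class distributions,
    one distribution per ensemble member (or posterior sample)."""
    if not member_probs:
        raise ValueError("need at least one predictive distribution")
    n = len(member_probs)
    num_classes = len(member_probs[0])
    mean: List[float] = [sum(m[c] for m in member_probs) / n for c in range(num_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / n
    return total, aleatoric, total - aleatoric


# Two confident members that disagree: all uncertainty is epistemic.
print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
# Two members that agree on a coin flip: all uncertainty is aleatoric.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```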

[AI-7] When Attention Sink Emerges in Language Models: An Empirical View

Link: https://arxiv.org/abs/2410.10781
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Keywords: Language Models, assign significant attention, attention, attention sink, assign significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens’ inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at this https URL.
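The role of softmax normalization can be illustrated with a toy calculation. This is a minimal sketch with hypothetical attention scores, not the paper's experimental setup: softmax forces the weights of one query to sum to 1, so some token must absorb the attention mass even when no key is a good match, whereas element-wise sigmoid weights can all stay near zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid_weights(scores):
    """Element-wise sigmoid, with no normalization across tokens."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Hypothetical scores for one query: every key is a poor match,
# the first token only slightly less so.
scores = [-4.0, -6.0, -6.0, -6.0]

soft = softmax(scores)
sig = sigmoid_weights(scores)

print("softmax:", [round(w, 3) for w in soft], "sum =", round(sum(soft), 3))
print("sigmoid:", [round(w, 3) for w in sig], "sum =", round(sum(sig), 3))
```

Under softmax the first token ends up with most of the weight despite its low raw score; under sigmoid every weight stays small because the tokens no longer compete for a fixed budget.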

[AI-8] Focused ReAct: Improving ReAct through Reiterate and Early Stop

Link: https://arxiv.org/abs/2410.10779
Authors: Shuoqiu Li, Han Xu, Haipeng Chen
Keywords: Large language models, Large language, decision-making capabilities, significantly improved, improved their reasoning
Categories: Artificial Intelligence (cs.AI)
Comments: The Eighth Widening NLP Workshop (WiNLP 2024)

Abstract:Large language models (LLMs) have significantly improved their reasoning and decision-making capabilities, as seen in methods like ReAct. However, despite its effectiveness in tackling complex tasks, ReAct faces two main challenges: losing focus on the original question and becoming stuck in action loops. To address these issues, we introduce Focused ReAct, an enhanced version of the ReAct paradigm that incorporates reiteration and early stop mechanisms. These improvements help the model stay focused on the original query and avoid repetitive behaviors. Experimental results show accuracy gains of 18% to 530% and a runtime reduction of up to 34% compared to the original ReAct method.
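The two mechanisms can be sketched in a toy agent loop. Everything here is an assumption made for illustration: the prompt layout, the `policy` callable standing in for an LLM call, and the rule that a repeated action triggers the early stop. The abstract does not specify these details.

```python
from typing import Callable, List, Tuple


def focused_react(question: str,
                  policy: Callable[[str], str],
                  max_steps: int = 8) -> Tuple[str, List[str]]:
    """Run a toy ReAct-style loop with reiteration and early stop.

    `policy` is a hypothetical stand-in for an LLM call: it maps a prompt
    to the next action string. Returns (stop_reason, actions_taken).
    """
    history: List[str] = []
    seen = set()
    for _ in range(max_steps):
        # Reiterate: restate the original question at every step so it is
        # not buried under a growing trace of intermediate actions.
        prompt = "\n".join(
            history + [f"Original question: {question}", "Next action:"]
        )
        action = policy(prompt)
        if action.startswith("finish"):
            history.append(action)
            return "finished", history
        # Early stop (assumed rule): a repeated action suggests a loop.
        if action in seen:
            return "early_stop", history
        seen.add(action)
        history.append(action)
    return "max_steps", history


def looping_policy(prompt: str) -> str:
    """Hypothetical policy that alternates between two searches forever."""
    n = prompt.count("search[")
    return "search[a]" if n % 2 == 0 else "search[b]"


reason, actions = focused_react("Who wrote the report?", looping_policy)
print(reason, actions)
```

With the looping policy, the run halts as soon as an action repeats instead of consuming all `max_steps` iterations.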

[AI-9] Adaptive Diffusion Terrain Generator for Autonomous Uneven Terrain Navigation

Link: https://arxiv.org/abs/2410.10766
Authors: Youwei Yu, Junhong Xu, Lantao Liu
Keywords: Model-free reinforcement learning, developing robust robot, robust robot control, robot control policies, control policies capable
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured terrains. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable terrains to facilitate continuous policy improvement. Existing methods of environment generation often rely on heuristics constrained by a set of parameters, limiting the diversity and realism. In this work, we introduce the Adaptive Diffusion Terrain Generator (ADTG), a novel method that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex terrains adaptive to the current policy. ADTG guides the diffusion model’s generation process through initial noise optimization, blending noise-corrupted terrains from existing training environments weighted by the policy’s performance in each corresponding environment. By manipulating the noise corruption level, ADTG seamlessly transitions between generating similar terrains for policy fine-tuning and novel ones to expand training diversity. Our experiments show that the policy trained by ADTG outperforms both procedural generated and natural environments, along with popular navigation methods.

[AI-10] AFlow: Automating Agentic Workflow Generation

Link: https://arxiv.org/abs/2410.10762
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Keywords: Large language models, demonstrated remarkable potential, follow detailed instructions, Large language, employing agentic workflows
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at this https URL.

[AI-11] FlexGen: Flexible Multi-View Generation from Text and Image Inputs

Link: https://arxiv.org/abs/2410.10745
Authors: Xinli Xu, Wenhang Ge, Jiantao Lin, Jiawei Feng, Lie Xu, HanFeng Zhao, Shunsi Zhang, Ying-Cong Chen
Keywords: flexible framework designed, framework designed, text, text annotations, multi-view
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 figures

Abstract:In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, or a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate 3D-aware text annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information with spatial relationship. By integrating the control signal with proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts. Additionally, users can influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced multiple controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality. Project page: this https URL.

[AI-12] NT-LLM: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models

Link: https://arxiv.org/abs/2410.10743
Authors: Yanbiao Ji, Chang Liu, Xin Chen, Yue Ding, Dan Luo, Mei Li, Wenqing Lin, Hongtao Lu
Keywords: Large Language Models, real-world scenarios, Language Models, Large Language, relationships in real-world
Categories: Artificial Intelligence (cs.AI)

Abstract:Graphs are a fundamental data structure for representing relationships in real-world scenarios. With the success of Large Language Models (LLMs) across various natural language processing (NLP) tasks, there has been growing interest in integrating LLMs for graph learning. However, applying LLMs to graph-related tasks poses significant challenges, as these models are not inherently designed to capture the complex structural information present in graphs. Existing approaches address this challenge through two strategies: the chain of tasks approach, which uses Graph Neural Networks (GNNs) to encode the graph structure so that LLMs are relieved from understanding spatial positions; and Graph-to-Text Conversion, which translates graph structures into semantic text representations that LLMs can process. Despite their progress, these methods often struggle to fully preserve the topological information of graphs or require extensive computational resources, limiting their practical applicability. In this work, we introduce Node Tokenizer for Large Language Models (NT-LLM), a novel framework that efficiently encodes graph structures by selecting key nodes as anchors and representing each node based on its relative distance to these anchors. This position-anchored encoding effectively captures the graph topology, enabling enhanced reasoning capabilities in LLMs over graph data. Additionally, we implement a task-specific tuning procedure to further improve structural understanding within LLMs. Through extensive empirical evaluations, NT-LLM demonstrates significant performance improvements across a variety of graph-related tasks.
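The anchor idea can be sketched as follows: each node is represented by its shortest-path distances to a few anchor nodes. This is an illustration only. Choosing the highest-degree nodes as anchors is an assumption made here, since the abstract does not say how key nodes are selected, and all function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def anchor_encoding(adj, num_anchors):
    """Encode each node as its list of distances to the chosen anchors.

    Assumption: anchors are the highest-degree nodes, ties broken by node id.
    Unreachable nodes get distance -1.
    """
    anchors = sorted(adj, key=lambda u: (-len(adj[u]), u))[:num_anchors]
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    encoding = {u: [d.get(u, -1) for d in per_anchor] for u in adj}
    return anchors, encoding


# Toy graph: a path 0-1-2-3-4 with an extra leaf 5 attached to node 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}

anchors, enc = anchor_encoding(adj, num_anchors=2)
print("anchors:", anchors)
for node in sorted(enc):
    print(node, enc[node])
```

Each node's distance vector is short (one entry per anchor) yet still locates the node relative to the rest of the graph, which is the property a position-anchored encoding relies on.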

[AI-13] SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing

Link: https://arxiv.org/abs/2410.10741
Authors: Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava
Keywords: Large Language Models, Effective processing, critical component, component of cyber-physical, Effective
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.

[AI-14] DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model (NeurIPS 2024)

Link: https://arxiv.org/abs/2410.10738
Authors: Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, Zhaoxiang Zhang
Keywords: gained increasing attention, increasing attention due, complex physical dynamics, gained increasing, increasing attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2024. Project page: this https URL

Abstract:Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.

[AI-15] Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning

Link: https://arxiv.org/abs/2410.10735
Authors: Kuofeng Gao, Huanqia Cai, Qingyao Shuai, Dihong Gong, Zhifeng Li
Keywords: Large Language Models, Large Language, Accurate mathematical reasoning, Accurate mathematical, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Accurate mathematical reasoning with Large Language Models (LLMs) is crucial in revolutionizing domains that heavily rely on such reasoning. However, LLMs often encounter difficulties in certain aspects of mathematical reasoning, leading to flawed reasoning and erroneous results. To mitigate these issues, we introduce a novel mechanism, the Chain of Self-Correction (CoSC), specifically designed to embed self-correction as an inherent ability in LLMs, enabling them to validate and rectify their own results. The CoSC mechanism operates through a sequence of self-correction stages. In each stage, the LLMs generate a program to address a given problem, execute this program using program-based tools to obtain an output, and subsequently verify this output. Based on the verification, the LLMs either proceed to the next correction stage or finalize the answer. This iterative self-correction process allows the LLMs to refine their reasoning steps and improve the accuracy of their mathematical reasoning. To enable the CoSC mechanism at a low cost, we employ a two-phase finetuning approach. In the first phase, the LLMs are trained with a relatively small volume of seeding data generated from GPT-4, establishing an initial CoSC capability. In the second phase, the CoSC capability is further enhanced by training with a larger volume of self-generated data using the trained model in the first phase, without relying on the paid GPT-4. Our comprehensive experiments demonstrate that CoSC significantly improves performance on traditional mathematical datasets among existing open-source LLMs. Notably, our CoSC-Code-34B model achieved a 53.5% score on MATH, the most challenging mathematical reasoning dataset in the public domain, surpassing the performance of well-established models such as ChatGPT, GPT-4, and even multi-modal LLMs like GPT-4V, Gemini-1.0 Pro, and Gemini-1.0 Ultra.

[AI-16] Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

Link: https://arxiv.org/abs/2410.10733
Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han
Keywords: present Deep Compression, Deep Compression Autoencoder, present Deep, Deep Compression, spatial compression ratio
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. First two authors contributed equally to this work

Abstract:We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder’s spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at this https URL.
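To make the compression ratios concrete, here is a small sketch that counts latent grid positions at different spatial compression ratios and performs a plain-Python space-to-channel rearrangement, the operation named in the abstract. How DC-AE wires this operation into its residual paths is not shown. The 512-pixel input size comes from the abstract's ImageNet example; the function names are illustrative.

```python
def latent_positions(image_size, ratio):
    """Number of spatial positions in the latent grid of a square image."""
    side = image_size // ratio
    return side * side


def space_to_channel(image, factor):
    """Fold each factor x factor patch of a 2-D grid into a channel vector.

    image: list of H rows, each a list of W values (single channel).
    Returns an (H/factor) x (W/factor) grid whose cells hold factor**2 values.
    """
    h, w = len(image), len(image[0])
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    return [
        [
            [image[i + di][j + dj]
             for di in range(factor) for dj in range(factor)]
            for j in range(0, w, factor)
        ]
        for i in range(0, h, factor)
    ]


for ratio in (8, 32, 64):
    print(f"f{ratio}: {latent_positions(512, ratio)} latent positions")

# 4x4 toy grid with values 0..15, folded into a 2x2 grid of 4-vectors.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
print(space_to_channel(grid, 2))
```

The rearrangement loses no information: it trades spatial resolution for channel depth, which is why it can serve as a shortcut that a learned encoder only needs to refine.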

[AI-17] Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection

Link: https://arxiv.org/abs/2410.10728
Authors: Giorgos Iacovides, Wuyang Zhou, Danilo Mandic
Keywords: leverages large language, higher-order data analysis, rank selection, guide the rank, large language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose a novel framework that leverages large language models (LLMs) to guide the rank selection in tensor network models for higher-order data analysis. By utilising the intrinsic reasoning capabilities and domain knowledge of LLMs, our approach offers enhanced interpretability of the rank choices and can effectively optimise the objective function. This framework enables users without specialised domain expertise to utilise tensor network decompositions and understand the underlying rationale within the rank selection process. Experimental results validate our method on financial higher-order datasets, demonstrating interpretable reasoning, strong generalisation to unseen test data, and its potential for self-enhancement over successive iterations. This work is placed at the intersection of large language models and higher-order data analysis.

[AI-18] SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Link: https://arxiv.org/abs/2410.10714
Authors: Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi
Keywords: Large Language Models, natural language processing, transformed natural language, high runtime cost, Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art compression methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, which is particularly challenging to compress, show that SeedLM achieves significantly better zero-shot accuracy retention at 4- and 3-bit than state-of-the-art techniques, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases to 70B, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
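The seed-plus-coefficients idea can be sketched in a heavily simplified form. The assumptions are stated loudly: this toy uses a single pseudo-random ±1 vector and a single unquantized coefficient per block, whereas the abstract describes a random matrix combined with compressed coefficients. The 16-bit LFSR with taps (16, 14, 13, 11) is a common textbook configuration, not one taken from the paper, and all function names are hypothetical.

```python
def lfsr_bits(seed, count):
    """Return `count` bits from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    bits = []
    for _ in range(count):
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        bits.append(bit)
    return bits


def basis_vector(seed, n):
    """Map LFSR bits to a pseudo-random vector with entries in {-1, +1}."""
    return [1.0 if b else -1.0 for b in lfsr_bits(seed, n)]


def compress_block(weights, candidate_seeds):
    """Pick the seed whose vector best fits the block in the least-squares sense.

    Simplification: one basis vector and one coefficient per block.
    Returns (seed, coefficient, squared_error).
    """
    n = len(weights)
    best = None
    for seed in candidate_seeds:
        u = basis_vector(seed, n)
        # Least-squares coefficient; u has +/-1 entries, so u . u == n.
        coeff = sum(w * x for w, x in zip(weights, u)) / n
        err = sum((w - coeff * x) ** 2 for w, x in zip(weights, u))
        if best is None or err < best[2]:
            best = (seed, coeff, err)
    return best


def reconstruct_block(seed, coeff, n):
    """Regenerate the block from the stored seed and coefficient."""
    return [coeff * x for x in basis_vector(seed, n)]


# Hypothetical weight block; only (seed, coeff) would need to be stored.
block = [0.21, -0.18, 0.25, 0.19, -0.22, -0.20, 0.17, 0.23]
seed, coeff, err = compress_block(block, range(1, 256))
print(f"seed={seed} coeff={coeff:.4f} squared_error={err:.5f}")
print([round(v, 3) for v in reconstruct_block(seed, coeff, len(block))])
```

At inference time only the seed and the coefficient are read from memory; the basis is regenerated by computation, which illustrates the trade of compute for fewer memory accesses described above.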

[AI-19] Early Diagnoses of Acute Lymphoblastic Leukemia Using YOLOv8 and YOLOv11 Deep Learning Models

Link: https://arxiv.org/abs/2410.10701
Authors: Alaa Awad, Mohamed Hegazy, Salah A. Aly
Keywords: individuals succumb annually, Acute Lymphoblastic Leukemia, Thousands of individuals, detecting Acute Lymphoblastic, individuals succumb
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 6 figures, 3 tables

Abstract:Thousands of individuals succumb annually to leukemia alone. This study explores the application of image processing and deep learning techniques for detecting Acute Lymphoblastic Leukemia (ALL), a severe form of blood cancer responsible for numerous annual fatalities. As artificial intelligence technologies advance, the research investigates the reliability of these methods in real-world scenarios. The study focuses on recent developments in ALL detection, particularly using the latest YOLO series models, to distinguish between malignant and benign white blood cells and to identify different stages of ALL, including early stages. Additionally, the models are capable of detecting hematogones, which are often misclassified as ALL. By utilizing advanced deep learning models like YOLOv8 and YOLOv11, the study achieves high accuracy rates reaching 98.8%, demonstrating the effectiveness of these algorithms across multiple datasets and various real-world situations.

[AI-20] Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Link: https://arxiv.org/abs/2410.10700
Authors: Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
Keywords: Large Language Models, Large Language, vulnerabilities of Large, Language Models, obscure harmful intents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs’ knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at this https URL.

[AI-21] Building a Multivariate Time Series Benchmarking Datasets Inspired by Natural Language Processing (NLP)

Link: https://arxiv.org/abs/2410.10687
Authors: Mohammad Asif Ibna Mustafa (Department of Computation, Information and Technology, Technical University of Munich, Munich, Germany), Ferdinand Heinrich (Fraunhofer Institute for Electronic Microsystems and Solid State Technologies EMFT, Machine Learning Enhanced Sensor Systems, Munich, Germany)
Keywords: Natural Language Processing, Time series analysis, Time series, effective models relies, models relies heavily
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Time series analysis has become increasingly important in various domains, and developing effective models relies heavily on high-quality benchmark datasets. Inspired by the success of Natural Language Processing (NLP) benchmark datasets in advancing pre-trained models, we propose a new approach to create a comprehensive benchmark dataset for time series analysis. This paper explores the methodologies used in NLP benchmark dataset creation and adapts them to the unique challenges of time series data. We discuss the process of curating diverse, representative, and challenging time series datasets, highlighting the importance of domain relevance and data complexity. Additionally, we investigate multi-task learning strategies that leverage the benchmark dataset to enhance the performance of time series models. This research contributes to the broader goal of advancing the state-of-the-art in time series modeling by adopting successful strategies from the NLP domain.

[AI-22] Combinatorial Multi-armed Bandits: Arm Selection via Group Testing

Link: https://arxiv.org/abs/2410.10679
Authors: Arpan Mukherjee, Shashanka Ubaru, Keerthiram Murugesan, Karthikeyan Shanmugam, Ali Tajer
Keywords: combinatorial multi-armed bandits, combinatorial multi-armed, multi-armed bandits, bandits with semi-bandit, semi-bandit feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 26 pages

Abstract:This paper considers the problem of combinatorial multi-armed bandits with semi-bandit feedback and a cardinality constraint on the super-arm size. Existing algorithms for solving this problem typically involve two key sub-routines: (1) a parameter estimation routine that sequentially estimates a set of base-arm parameters, and (2) a super-arm selection policy for selecting a subset of base arms deemed optimal based on these parameters. State-of-the-art algorithms assume access to an exact oracle for super-arm selection with unbounded computational power. At each instance, this oracle evaluates a list of score functions, the number of which grows as low as linearly and as high as exponentially with the number of arms. This can be prohibitive in the regime of a large number of arms. This paper introduces a novel realistic alternative to the perfect oracle. This algorithm uses a combination of group-testing for selecting the super arms and quantized Thompson sampling for parameter estimation. Under a general separability assumption on the reward function, the proposed algorithm reduces the complexity of the super-arm-selection oracle to be logarithmic in the number of base arms while achieving the same regret order as the state-of-the-art algorithms that use exact oracles. This translates to at least an exponential reduction in complexity compared to the oracle-based approaches.

[AI-23] Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach

Link: https://arxiv.org/abs/2410.10674
Authors: Rory Young, Nicolas Pugeault
Keywords: learning agents achieve, Deep reinforcement learning, simulated control tasks, agents achieve, reinforcement learning agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Deep reinforcement learning agents achieve state-of-the-art performance in a wide range of simulated control tasks. However, successful applications to real-world problems remain limited. One reason for this dichotomy is because the learned policies are not robust to observation noise or adversarial attacks. In this paper, we investigate the robustness of deep RL policies to a single small state perturbation in deterministic continuous control tasks. We demonstrate that RL policies can be deterministically chaotic as small perturbations to the system state have a large impact on subsequent state and reward trajectories. This unstable non-linear behaviour has two consequences: First, inaccuracies in sensor readings, or adversarial attacks, can cause significant performance degradation; Second, even policies that show robust performance in terms of rewards may have unpredictable behaviour in practice. These two facets of chaos in RL policies drastically restrict the application of deep RL to real-world problems. To address this issue, we propose an improvement on the successful Dreamer V3 architecture, implementing a Maximal Lyapunov Exponent regularisation. This new approach reduces the chaotic state dynamics, rendering the learnt policies more resilient to sensor noise or adversarial attacks and thereby improving the suitability of Deep Reinforcement Learning for real-world applications.
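For intuition about what a maximal Lyapunov exponent measures, the sketch below estimates it for the logistic map, a standard one-dimensional toy system. This is not the paper's Dreamer V3 regulariser or its control tasks; it only shows how the sign of the exponent separates dynamics in which a small state perturbation grows from dynamics in which it dies out. The function name and default parameters are illustrative.

```python
import math


def logistic_mle(r, x0=0.2, burn_in=100, steps=20000):
    """Estimate the maximal Lyapunov exponent of x -> r * x * (1 - x).

    Averages log|f'(x)| along one trajectory. A positive value means that
    nearby states diverge exponentially (chaos); a negative value means
    that small perturbations shrink over time.
    """
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        deriv = abs(r * (1.0 - 2.0 * x))
        # Guard against log(0) if the trajectory lands exactly on x = 0.5.
        total += math.log(max(deriv, 1e-300))
        x = r * x * (1.0 - x)
    return total / steps


for r in (2.5, 3.2, 4.0):
    print(f"r={r}: estimated exponent = {logistic_mle(r):+.3f}")
```

A regulariser that pushes this quantity down, as the abstract proposes for learned state dynamics, is in effect asking that small perturbations such as sensor noise stop being amplified over time.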

[AI-24] Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

Link: https://arxiv.org/abs/2410.10665
Authors: Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Keywords: Artificial Intelligence, World Development Indicators, holds the potential, information gaps, developing nations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: Project GitHub repository at this https URL

Abstract:Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI’s GPT models via APIs because of how the system processes the input – tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a "double jeopardy" of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.

[AI-25] Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning

Link: https://arxiv.org/abs/2410.10660
Authors: William A. Stigall
Keywords: Convolutional Neural Networks, utilizing Convolutional Neural, Deep Q-Networks utilizing, Q-Networks utilizing Convolutional, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: KSU C-Day Spring 2024

Abstract:In this study, we investigate the performance of Deep Q-Networks utilizing Convolutional Neural Networks (CNNs) and Transformer architectures across three different Atari games. The advent of DQNs has significantly advanced Reinforcement Learning, enabling agents to directly learn optimal policies from high-dimensional sensory inputs from pixel or RAM data. While CNN-based DQNs have been extensively studied and deployed in various domains, Transformer-based DQNs are relatively unexplored. Our research aims to fill this gap by benchmarking the performance of both DCQNs and DTQNs across the Atari games Asteroids, Space Invaders, and Centipede. We find that in the 35-40 million parameter range, the DCQN outperforms the DTQN in speed across both ViT and Projection Architectures. We also find the DCQN outperforms the DTQN in all games except for Centipede.

[AI-26] Generative AI and Its Impact on Personalized Intelligent Tutoring Systems

Link: https://arxiv.org/abs/2410.10650
Authors: Subhankar Maity, Aniket Deroy
Keywords: Generative Artificial Intelligence, Intelligent Tutoring Systems, enabling highly personalized, adaptive learning environments, Generative Artificial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Scientific Report (Under Review)

Abstract:Generative Artificial Intelligence (AI) is revolutionizing educational technology by enabling highly personalized and adaptive learning environments within Intelligent Tutoring Systems (ITS). This report delves into the integration of Generative AI, particularly large language models (LLMs) like GPT-4, into ITS to enhance personalized education through dynamic content generation, real-time feedback, and adaptive learning pathways. We explore key applications such as automated question generation, customized feedback mechanisms, and interactive dialogue systems that respond to individual learner needs. The report also addresses significant challenges, including ensuring pedagogical accuracy, mitigating inherent biases in AI models, and maintaining learner engagement. Future directions highlight the potential advancements in multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education. By synthesizing current research and practical implementations, this report underscores the transformative potential of Generative AI in creating more effective, equitable, and engaging educational experiences.

[AI-27] DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Link: https://arxiv.org/abs/2410.10646
Authors: James R. Han, Hugues Thomas, Jian Zhang, Nicholas Rhinehart, Timothy D. Barfoot
Keywords: people exhibiting complex, complex motion patterns, exhibiting complex motion, people exhibiting, exhibiting complex
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures, under review for IEEE Robotics and Automation Letters (RA-L)

Abstract:How can a robot safely navigate around people exhibiting complex motion patterns? Reinforcement Learning (RL) or Deep RL (DRL) in simulation holds some promise, although much prior work relies on simulators that fail to precisely capture the nuances of real human motion. To address this gap, we propose Deep Residual Model Predictive Control (DR-MPC), a method to enable robots to quickly and safely perform DRL from real-world crowd navigation data. By blending MPC with model-free DRL, DR-MPC overcomes the traditional DRL challenges of large data requirements and unsafe initial behavior. DR-MPC is initialized with MPC-based path tracking, and gradually learns to interact more effectively with humans. To further accelerate learning, a safety component estimates when the robot encounters out-of-distribution states and guides it away from likely collisions. In simulation, we show that DR-MPC substantially outperforms prior work, including traditional DRL and residual DRL models. Real-world experiments show our approach successfully enables a robot to navigate a variety of crowded situations with few errors using less than 4 hours of training data.

[AI-28] Adapt-∞: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

Link: https://arxiv.org/abs/2410.10636
Authors: Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal
Keywords: Visual instruction datasets, Visual instruction, redundant text-image pairs, Lifelong Instruction Tuning, text-image pairs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: First two authors contributed equally. Code: this https URL

Abstract:Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of lifelong adaptable multimodal large language models, hindering their ability to refine existing skills and acquire new competencies over time. To address this, we reframe the problem of Lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. Based on empirical analyses that show that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose Adapt-∞, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during LiIT. We construct pseudo-skill clusters by grouping gradient-based sample vectors. Next, we select the best-performing data selector for each skill cluster from a pool of selector experts, including our newly proposed scoring function, Image Grounding score. This data selector samples a subset of the most important samples from each skill cluster for training. To prevent the continuous increase in the size of the dataset pool during LiIT, which would result in excessive computation, we further introduce a cluster-wise permanent data pruning strategy to remove the most semantically redundant samples from each cluster, keeping computational requirements manageable. Training with samples selected by Adapt-∞ alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original datasets.

[AI-29] Thinking LLMs: General Instruction Following with Thought Generation

Link: https://arxiv.org/abs/2410.10630
Authors: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
Keywords: human experts respond, answer user questions, follow instructions similarly, experts respond, typically trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning – but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning problem-solving tasks.
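
The scoring step described in the abstract, where a judge evaluates only the response and never the thought, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `build_preference_pair` and `toy_judge` are hypothetical names, the judge is a stand-in function, and taking the best and worst candidates as the preference pair is one simple choice the abstract does not spell out.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (thought, response)


def build_preference_pair(
    candidates: List[Candidate],
    judge: Callable[[str], float],
) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected); the judge sees the response text only."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Rank by the judge's score of the response; the thought is never scored.
    ranked = sorted(candidates, key=lambda c: judge(c[1]))
    return ranked[-1], ranked[0]


def toy_judge(response: str) -> float:
    """Hypothetical stand-in for a judge model: longer responses score higher."""
    return float(len(response))


samples = [
    ("plan A", "ok"),
    ("plan B", "a longer, more detailed answer"),
    ("plan C", "a medium answer"),
]
chosen, rejected = build_preference_pair(samples, toy_judge)
print(chosen[0], rejected[0])
```

The resulting pairs keep their thoughts attached, so a preference-optimization step trained on them would indirectly reward thoughts that lead to better responses.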

[AI-30] Modeling News Interactions and Influence for Financial Market Prediction EMNLP2024

Link: https://arxiv.org/abs/2410.10614
Authors: Mengyu Wang, Shay B. Cohen, Tiejun Ma
Keywords: complex process, making it challenging, challenging to evaluate, evaluate the connections, Influence Network
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Comments: Accepted by EMNLP 2024

Abstract:The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news events and market movements. This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction model that captures not only the links between news and prices but also the interactions among news items themselves. FININ effectively integrates multi-modal information from both market data and news articles. We conduct extensive experiments on two datasets, encompassing the S&P 500 and NASDAQ 100 indices over a 15-year period and over 2.7 million news articles. The results demonstrate FININ’s effectiveness, outperforming advanced market prediction models with an improvement of 0.429 and 0.341 in the daily Sharpe ratio for the two markets respectively. Moreover, our results reveal insights into the financial news, including the delayed market pricing of news, the long memory effect of news, and the limitations of financial sentiment analysis in fully extracting predictive power from news data.
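
For readers unfamiliar with the metric reported above, the following is a minimal sketch of a daily Sharpe ratio. It assumes a zero risk-free rate by default and uses the sample standard deviation; the paper may use a different convention, and the return series is made up.

```python
import statistics
from typing import Sequence


def daily_sharpe(returns: Sequence[float], risk_free: float = 0.0) -> float:
    """Mean excess daily return divided by its sample standard deviation.

    Needs at least two observations with non-zero spread.
    """
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


# Made-up daily returns of a hypothetical strategy.
example_returns = [0.004, -0.002, 0.006, 0.001, -0.001]
print(daily_sharpe(example_returns))
```

An improvement of a few tenths in this ratio means the strategy earns noticeably more return per unit of day-to-day volatility.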

[AI-31] Intelligent prospector v2.0: exploration drill planning under epistemic model uncertainty

Link: https://arxiv.org/abs/2410.10610
Authors: John Mern, Anthony Corso, Damian Burch, Kurt House, Jef Caers
Keywords: Optimal Bayesian decision, Optimal Bayesian, Bayesian decision making, acquire requires stating, Data acquisition
Categories: Artificial Intelligence (cs.AI)

Abstract:Optimal Bayesian decision making on what geoscientific data to acquire requires stating a prior model of uncertainty. Data acquisition is then optimized by reducing uncertainty on some property of interest maximally, and on average. In the context of exploration, very few, sometimes no data at all, is available prior to data acquisition planning. The prior model therefore needs to include human interpretations on the nature of spatial variability, or on analogue data deemed relevant for the area being explored. In mineral exploration, for example, humans may rely on conceptual models on the genesis of the mineralization to define multiple hypotheses, each representing a specific spatial variability of mineralization. More often than not, after the data is acquired, all of the stated hypotheses may be proven incorrect, i.e. falsified, hence prior hypotheses need to be revised, or additional hypotheses generated. Planning data acquisition under wrong geological priors is likely to be inefficient since the estimated uncertainty on the target property is incorrect, hence uncertainty may not be reduced at all. In this paper, we develop an intelligent agent based on partially observable Markov decision processes that plans optimally in the case of multiple geological or geoscientific hypotheses on the nature of spatial variability. Additionally, the artificial intelligence is equipped with a method that allows detecting, early on, whether the human stated hypotheses are incorrect, thereby saving considerable expense in data acquisition. Our approach is tested on a sediment-hosted copper deposit, and the algorithm presented has aided in the characterization of an ultra high-grade deposit in Zambia in 2023.

[AI-32] BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI

Link: https://arxiv.org/abs/2410.10604
Authors: Shaohao Rui, Lingzhi Chen, Zhenyu Tang, Lilong Wang, Mianxin Liu, Shaoting Zhang, Xiaosong Wang
Keywords: Accurate diagnosis, complementary multi-parametric MRI, multi-parametric MRI imaging, MRI imaging data, abnormalities is greatly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essence representation of each MR image modality for both the pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks.
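
The segmentation results above are reported as Dice scores. As a reminder of what that number measures, here is a minimal sketch on flat binary masks; the masks are made up, and treating two empty masks as a perfect match is an assumption of this sketch.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


predicted = [1, 1, 0, 0]  # made-up predicted mask
reference = [1, 0, 1, 0]  # made-up ground-truth mask
print(dice_score(predicted, reference))  # 0.5
```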

[AI-33] Neural networks that overcome classic challenges through practice

Link: https://arxiv.org/abs/2410.10596
Authors: Kazuki Irie, Brenden M. Lake
Keywords: neural network models, human cognitive abilities, mind and brain, critics have pointed, cognitive abilities
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Since the earliest proposals for neural network models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that has used metalearning to help overcome some of these challenges. We characterize their successes as addressing an important developmental problem: they provide machines with an incentive to improve X (where X represents the desired capability) and opportunities to practice it, through explicit optimization for X; unlike conventional approaches that hope for achieving X through generalization from related but different objectives. We review applications of this principle to four classic challenges: systematicity, catastrophic forgetting, few-shot learning and multi-step reasoning; we also discuss related aspects of human development in natural environments.

[AI-34] VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Link: https://arxiv.org/abs/2410.10594
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords: enables large language, external knowledge sources, utilize external knowledge, large language models, RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25–39% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at this https URL .
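
The retrieval stage described above embeds each document page as an image and ranks pages against the query. The sketch below shows only that ranking step, under loud assumptions: the embeddings are made-up two-dimensional vectors standing in for VLM outputs, cosine similarity is assumed as the scoring function, and `retrieve_top_k` is a hypothetical helper, not part of the released code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: Sequence[float],
    page_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the ids of the k pages most similar to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings; in a real pipeline these would come from a VLM
# applied to page images, with no text parsing step in between.
pages = {"page-1": [1.0, 0.0], "page-2": [0.0, 1.0], "page-3": [1.0, 1.0]}
print(retrieve_top_k([1.0, 0.1], pages, k=2))  # ['page-1', 'page-3']
```

The retrieved page images would then be passed to the generating VLM together with the query.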

[AI-35] TRESTLE: A Model of Concept Formation in Structured Domains

Link: https://arxiv.org/abs/2410.10588
Authors: Christopher J. MacLellan, Erik Harpstead, Vincent Aleven, Kenneth R. Koedinger
Keywords: concept formation, learning concepts incrementally, concepts incrementally, concept formation focus, TRESTLE
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 6 figures, 1 table

Abstract:The literature on concept formation has demonstrated that humans are capable of learning concepts incrementally, with a variety of attribute types, and in both supervised and unsupervised settings. Many models of concept formation focus on a subset of these characteristics, but none account for all of them. In this paper, we present TRESTLE, an incremental account of probabilistic concept formation in structured domains that unifies prior concept learning models. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It updates its knowledge by partially matching novel structures and sorting them into its categorization tree. Finally, the system supports mixed-data representations, including nominal, numeric, relational, and component attributes. We evaluate TRESTLE’s performance on a supervised learning task and an unsupervised clustering task. For both tasks, we compare it to a nonincremental model and to human participants. We find that this new categorization model is competitive with the nonincremental approach and more closely approximates human behavior on both tasks. These results serve as an initial demonstration of TRESTLE’s capabilities and show that, by taking key characteristics of human learning into account, it can better model behavior than approaches that ignore them.

[AI-36] STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack

Link: https://arxiv.org/abs/2410.10584
Authors: Naman Gupta, Shashank Kirtania, Priyanshu Gupta, Krishna Kariya, Sumit Gulwani, Arun Iyer, Suresh Parthasarathy, Arjun Radhakrishna, Sriram K. Rajamani, Gustavo Soares
Keywords: Large Language Models, Large Language, Language Models, outdated information, private data
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. Each document is assigned to an actor, modeled as a ReACT agent, which performs structured edits based on document-specific targeted instructions from a centralized critic. Experimental results show that STACKFEED significantly improves KB quality and RAG system performance, enhancing accuracy by up to 8% over baselines.

[AI-37] Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences COLING2025

Link: https://arxiv.org/abs/2410.10580
Authors: Ayushman Gupta, Akhil Bhogal, Kripabandhu Ghosh
Keywords: code-mixed sentences, code-mixed, multilingual communities, practice of alternating, common phenomenon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Manuscript submitted to COLING 2025

Abstract:Code-mixing, the practice of alternating between two or more languages in an utterance, is a common phenomenon in multilingual communities. Due to the colloquial nature of code-mixing, there is no singular correct way to translate an English sentence into a code-mixed sentence. For this reason, standard n-gram-based MT evaluation metrics such as the BLEU score are not appropriate for code-mixed evaluation. To demonstrate this, we propose a novel method for code-mixed text generation: Controlled Generation, which parameterizes the code-mixing degree (CMD) and enables the generation of multiple semantically equivalent code-mixed sentences from a given English sentence. We introduce a robust new evaluation metric: GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences. GAME is both language-agnostic and gold-standard-agnostic, i.e. unlike other metrics, GAME does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators in the code-mixed evaluation process. When used to evaluate semantically equivalent code-mixed sentences, we find that GAME scores have a lower standard deviation than BLEU scores. Further, we create and release a dataset containing gold-standard code-mixed sentences across 4 language pairs: English-Hindi, Bengali, French, Spanish to encourage more computational research on code-mixing.

[AI-38] Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes

Link: https://arxiv.org/abs/2410.10578
Authors: Juan Sebastian Rojas, Chi-Guhn Lee
Keywords: Markov decision processes, Average-reward Markov decision, Markov decision, Average-reward Markov, decision processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2006.16318, arXiv:2110.13855 by other authors

Abstract:Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.
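
The risk measure named above, conditional value-at-risk, can be illustrated with a simple batch estimate. This is not the paper's fully online RED algorithm. It is only the quantity being optimized, computed on a made-up sample of rewards, under the lower-tail convention (the mean of the worst `alpha` fraction), which is one of several conventions in use.

```python
import math
from typing import Sequence


def empirical_cvar(rewards: Sequence[float], alpha: float) -> float:
    """Mean of the worst alpha-fraction of rewards (lower-tail CVaR)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not rewards:
        raise ValueError("rewards must be non-empty")
    ordered = sorted(rewards)
    # Keep at least one sample in the tail.
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


sample = [4.0, 1.0, 3.0, 2.0]  # made-up per-step rewards
print(empirical_cvar(sample, alpha=0.5))  # 1.5, the mean of the worst two
print(empirical_cvar(sample, alpha=1.0))  # 2.5, the plain average
```

With `alpha = 1` the measure reduces to the ordinary average reward, and smaller values of `alpha` put more weight on bad outcomes.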

[AI-39] When Precedents Clash

Link: https://arxiv.org/abs/2410.10567
Authors: Cecilia Di Florio, Huimin Dong, Antonino Rotolo
Keywords: avoid the problem, problem of retrieving, cases, case bases, Consistency
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages. Extended version with proofs of a paper accepted at JURIX 2024

Abstract:Consistency of case bases is a way to avoid the problem of retrieving conflicting constraining precedents for new cases to be decided. However, in legal practice the consistency requirements for case bases may not be satisfied. As pointed out in (Broughton 2019), a model of precedential constraint should take into account the hierarchical structure of the specific legal system under consideration and the temporal dimension of cases. This article continues the research initiated in (Liu et al. 2022; Di Florio et al. 2023), which established a connection between Boolean classifiers and legal case-based reasoning. On this basis, we enrich the classifier models with an organisational structure that takes into account both the hierarchy of courts and which courts issue decisions that are binding/constraining on subsequent cases. We focus on common law systems. We also introduce a temporal relation between cases. Within this enriched framework, we can formalise the notions of overruled cases and cases decided per incuriam: such cases are not to be considered binding on later cases. Finally, we show under which condition principles based on the hierarchical structure and on the temporal dimension can provide an unambiguous decision-making process for new cases in the presence of conflicting binding precedents.

[AI-40] ROSAR: An Adversarial Re-Training Framework for Robust Side-Scan Sonar Object Detection

Link: https://arxiv.org/abs/2410.10554
Authors: Martin Aubard, László Antal, Ana Madureira, Luis F. Teixeira, Erika Ábrahám
Keywords: deep learning object, autonomous underwater vehicles, learning object detection, generated by autonomous, paper introduces ROSAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper introduces ROSAR, a novel framework enhancing the robustness of deep learning object detection models tailored for side-scan sonar (SSS) images, generated by autonomous underwater vehicles using sonar sensors. By extending our prior work on knowledge distillation (KD), this framework integrates KD with adversarial retraining to address the dual challenges of model efficiency and robustness against SSS noises. We introduce three novel, publicly available SSS datasets, capturing different sonar setups and noise conditions. We propose and formalize two SSS safety properties and utilize them to generate adversarial datasets for retraining. Through a comparative analysis of projected gradient descent (PGD) and patch-based adversarial attacks, ROSAR demonstrates significant improvements in model robustness and detection accuracy under SSS-specific conditions, enhancing the model’s robustness by up to 1.85%. ROSAR is available at this https URL.

[AI-41] SLaNC: Static LayerNorm Calibration NEURIPS2024

Link: https://arxiv.org/abs/2410.10553
Authors: Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik
Keywords: Large Language Models, generated enormous pressure, rapidly expanding fields, Large Language, sizes of Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, NeurIPS 2024 MLNCP Workshop

Abstract:The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication and storage requirements. Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes calculation of the LayerNorm simply because accumulation of the variance requires a much wider dynamic range than the hardware enables. In this article, we address this matter and propose a computationally-efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow could happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.
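
The idea of scaling LayerNorm inputs rests on a property that is easy to check: without the epsilon term, LayerNorm gives the same output for `x` and for `s * x` with any positive `s`, while the intermediate sum of squares shrinks by `s**2`. The sketch below demonstrates only this property. The input vector and the constant scale are made up; in the paper the factor is derived offline from the weights of the preceding linear layer, which is not reproduced here.

```python
import math
from typing import List

FP16_MAX = 65504.0  # largest finite value in IEEE 754 half precision


def layer_norm(x: List[float], eps: float = 0.0) -> List[float]:
    """Plain LayerNorm without learnable gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sum_of_squares(x: List[float]) -> float:
    """The kind of accumulation that can exceed a narrow dynamic range."""
    return sum(v * v for v in x)


x = [300.0, -150.0, 75.0, 20.0]  # made-up LayerNorm input
scale = 1.0 / 256.0              # hypothetical static factor, fixed offline
x_scaled = [v * scale for v in x]

print(sum_of_squares(x) > FP16_MAX)         # True: 118525 exceeds the FP16 range
print(sum_of_squares(x_scaled) < FP16_MAX)  # True: about 1.81 fits easily
print(layer_norm(x))
print(layer_norm(x_scaled))                 # same normalized values as above
```

With a non-zero `eps` the two outputs agree only approximately, so a practical scheme also has to account for how the scale interacts with epsilon.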

[AI-42] Hybrid Transformer for Early Alzheimer’s Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features

Link: https://arxiv.org/abs/2410.10547
Authors: Changqing Gong, Huafeng Qin, Mounîm A. El-Yacoubi
Keywords: Alzheimer Disease, prevalent neurodegenerative condition, prevalent neurodegenerative, neurodegenerative condition, Alzheimer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Alzheimer’s Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting, mostly online, based AD detection has predominantly relied on manually extracted features, fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, either 1D-CNN or 2D-CNN architectures, with performance comparing favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains basically unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that integrates simultaneously 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32% and accuracy of 90.91% in Task 8 (‘L’ writing), surpassing the previous best by 4.61% and 6.06% respectively.

[AI-43] Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.10542
Authors: Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
Keywords: context of Indian, study investigates judgment, Indian judgments, including InLegalBERT, utilizing a range
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted on NLLP at EMNLP 2024

Abstract:This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

[AI-44] Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature

Link: https://arxiv.org/abs/2410.10537
Authors: Jan Vrba,Jakub Steinbach,Tomáš Jirsa,Laura Verde,Roberta De Fazio,Noriyasu Homma,Yuwen Zeng,Key Ichiji,Lukáš Hájek,Zuzana Sedláková,Jan Mareš
Keywords: propose a robust, research of contemporary, contemporary practices, feature set, robust set
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 33 pages, 8 figures, code repository: this https URL

Abstract: In this study, we propose a robust set of features derived from a thorough review of contemporary practices in voice pathology detection. The feature set is based on the combination of acoustic handcrafted features. Additionally, we introduce pitch difference as a novel feature. We combine this feature set, containing data from the publicly available Saarbrücken Voice Database (SVD), with preprocessing using the K-Means Synthetic Minority Over-Sampling Technique algorithm to address class imbalance. Moreover, we applied multiple ML models as binary classifiers. We utilized support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest and AdaBoost classifiers. To determine the best classification approach, we performed grid search on feasible hyperparameters of respective classifiers and subsections of features. Our approach has achieved state-of-the-art performance, measured by unweighted average recall, in voice pathology detection on the SVD database. We intentionally omit accuracy as it is a highly biased metric in case of unbalanced data compared to the aforementioned metrics. The results are further enhanced by eliminating the potential overestimation of the results with repeated stratified cross-validation. This advancement demonstrates significant potential for the clinical deployment of ML methods, offering a valuable tool for an objective examination of voice pathologies. To support our claims, we provide a publicly available GitHub repository with DOI https://doi.org/10.5281/zenodo.13771573. Finally, we provide the REFORMS checklist.
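
The abstract's point about accuracy being biased on unbalanced data can be seen in a small sketch. The labels below are hypothetical (90 pathological, 10 healthy) and are not taken from the SVD; the helper names are illustrative only. Unweighted average recall (UAR) is the mean of the per-class recalls, so every class counts equally regardless of its size.

```python
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced labels: 90 pathological (1), 10 healthy (0).
y_true = [1] * 90 + [0] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)                   # high despite ignoring class 0
uar = unweighted_average_recall(y_true, y_pred)  # exposes the missed class
print(f"accuracy={acc:.2f}  UAR={uar:.2f}")
```

The always-majority classifier scores well on accuracy while its UAR sits at chance level for two classes, which is the kind of bias the authors cite as the reason for omitting accuracy.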

[AI-45] Get Rid of Task Isolation: A Continuous Multi-task Spatio-Temporal Learning Framework NEURIPS2024

Link: https://arxiv.org/abs/2410.10524
Authors: Zhongchao Yi,Zhengyang Zhou,Qihe Huang,Yanjiang Chen,Liheng Yu,Xu Wang,Yang Wang
Keywords: enable urban intelligence, pivotal technique, technique to enable, Spatiotemporal learning, urban
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024

Abstract:Spatiotemporal learning has become a pivotal technique to enable urban intelligence. Traditional spatiotemporal models mostly focus on a specific task by assuming a same distribution between training and testing sets. However, given that urban systems are usually dynamic, multi-sourced with imbalanced data distributions, current specific task-specific models fail to generalize to new urban conditions and adapt to new domains without explicitly modeling interdependencies across various dimensions and types of urban data. To this end, we argue that there is an essential to propose a Continuous Multi-task Spatio-Temporal learning framework (CMuST) to empower collective urban intelligence, which reforms the urban spatiotemporal learning from single-domain to cooperatively multi-dimensional and multi-task learning. Specifically, CMuST proposes a new multi-dimensional spatiotemporal interaction network (MSTI) to allow cross-interactions between context and main observations as well as self-interactions within spatial and temporal aspects to be exposed, which is also the core for capturing task-level commonality and personalization. To ensure continuous task learning, a novel Rolling Adaptation training scheme (RoAda) is devised, which not only preserves task uniqueness by constructing data summarization-driven task prompts, but also harnesses correlated patterns among tasks by iterative model behavior modeling. We further establish a benchmark of three cities for multi-task spatiotemporal learning, and empirically demonstrate the superiority of CMuST via extensive evaluations on these datasets. The impressive improvements on both few-shot streaming data and new domain tasks against existing SOAT methods are achieved. Code is available at this https URL.

[AI-46] Continual Deep Reinforcement Learning to Prevent Catastrophic Forgetting in Jamming Mitigation

Link: https://arxiv.org/abs/2410.10521
Authors: Kemal Davaslioglu,Sastry Kompella,Tugba Erpek,Yalin E. Sagduyu
Keywords: Deep Reinforcement Learning, Deep Reinforcement, reliable wireless communications, facilitate reliable wireless, DRL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: IEEE MILCOM 2024

Abstract:Deep Reinforcement Learning (DRL) has been highly effective in learning from and adapting to RF environments and thus detecting and mitigating jamming effects to facilitate reliable wireless communications. However, traditional DRL methods are susceptible to catastrophic forgetting (namely forgetting old tasks when learning new ones), especially in dynamic wireless environments where jammer patterns change over time. This paper considers an anti-jamming system and addresses the challenge of catastrophic forgetting in DRL applied to jammer detection and mitigation. First, we demonstrate the impact of catastrophic forgetting in DRL when applied to jammer detection and mitigation tasks, where the network forgets previously learned jammer patterns while adapting to new ones. This catastrophic interference undermines the effectiveness of the system, particularly in scenarios where the environment is non-stationary. We present a method that enables the network to retain knowledge of old jammer patterns while learning to handle new ones. Our approach substantially reduces catastrophic forgetting, allowing the anti-jamming system to learn new tasks without compromising its ability to perform previously learned tasks effectively. Furthermore, we introduce a systematic methodology for sequentially learning tasks in the anti-jamming framework. By leveraging continual DRL techniques based on PackNet, we achieve superior anti-jamming performance compared to standard DRL methods. Our proposed approach not only addresses catastrophic forgetting but also enhances the adaptability and robustness of the system in dynamic jamming environments. We demonstrate the efficacy of our method in preserving knowledge of past jammer patterns, learning new tasks efficiently, and achieving superior anti-jamming performance compared to traditional DRL approaches.

[AI-47] UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

Link: https://arxiv.org/abs/2410.10516
Authors: Shikun Feng,Yuyan Ni,Yan Lu,Zhi-Ming Ma,Wei-Ying Ma,Yanyan Lan
Keywords: property prediction, Molecular generation, drug discovery, developed independently, molecular property prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 11 pages, 5 figures

Abstract:Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies, which demonstrate that diffusion model, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.

[AI-48] A Practical Approach to Causal Inference over Time

Link: https://arxiv.org/abs/2410.10502
Authors: Martina Cinquini,Isacco Beretta,Salvatore Ruggieri,Isabel Valera
Keywords: focus on estimating, causal, causal VAR framework, time, causal VAR
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. To that end, we formally define causal interventions and their effects over time on discrete-time stochastic processes (DSPs). Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework.
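
As a rough illustration of the idea that the equilibrium of a stable linear process shifts under an intervention, the sketch below iterates a noise-free two-variable VAR(1) update x' = A x + c until it settles, once as-is and once with the first variable clamped. The coefficients, the `do` argument, and the function names are all hypothetical; this is not the paper's mapping from VARs to SCMs, only a toy instance of comparing pre- and post-intervention equilibria.

```python
def step(x, A, c, do=None):
    """One noise-free VAR(1) update x' = A x + c.

    `do` maps a variable index to a clamped value (a hard intervention).
    """
    nxt = [sum(a * xi for a, xi in zip(row, x)) + ci for row, ci in zip(A, c)]
    for i, v in (do or {}).items():
        nxt[i] = v
    return nxt


def equilibrium(A, c, do=None, iters=200):
    """Iterate the update from zero; converges when the dynamics are stable."""
    x = [0.0] * len(c)
    for _ in range(iters):
        x = step(x, A, c, do)
    return x


# Hypothetical coefficients: x0 drives x1, with no feedback from x1 to x0.
A = [[0.5, 0.0],
     [0.4, 0.3]]
c = [1.0, 0.5]

before = equilibrium(A, c)             # observational equilibrium
after = equilibrium(A, c, do={0: 3.0}) # equilibrium under do(x0 = 3)
effect_on_x1 = after[1] - before[1]
print(before, after, effect_on_x1)
```

Solving the fixed-point equations by hand gives x0 = 2 and x1 = 13/7 before the intervention and x1 = 17/7 after it, so the long-run effect on x1 of moving x0 from 2 to 3 is 4/7 under these assumed coefficients.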

[AI-49] Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation

Link: https://arxiv.org/abs/2410.10489
Authors: Sharif Kazemi,Gloria Gerhardt,Jonty Katz,Caroline Ida Kuria,Estelle Pan,Umang Prabhakar
Keywords: LLMs embeds societal, embeds societal, language culture, LLMs embeds, training data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The training data for LLMs embeds societal values, increasing their familiarity with the language’s culture. Our analysis found that 44% of the variance in the ability of GPT-4o to reflect the societal values of a country, as measured by the World Values Survey, correlates with the availability of digital resources in that language. Notably, the error rate was more than five times higher for the languages of the lowest resource compared to the languages of the highest resource. For GPT-4-turbo, this correlation rose to 72%, suggesting efforts to improve the familiarity with the non-English language beyond the web-scraped data. Our study developed one of the largest and most robust datasets in this topic area with 21 country-language pairs, each of which contain 94 survey questions verified by native speakers. Our results highlight the link between LLM performance and digital data availability in target languages. Weaker performance in low-resource languages, especially prominent in the Global South, may worsen digital divides. We discuss strategies proposed to address this, including developing multilingual LLMs from the ground up and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives.

[AI-50] Advancing Newborn Care: Precise Birth Time Detection Using AI-Driven Thermal Imaging with Adaptive Normalization

Link: https://arxiv.org/abs/2410.10483
Authors: Jorge García-Torres,Øyvind Meinich-Bache,Anders Johannessen,Siren Rettedal,Vilde Kolstad,Kjersti Engan
Keywords: newborn resuscitation, start breathing, assistance to start, newborn, real newborn resuscitation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper submitted to Computer in Biology and Medicine, ELSEVIER

Abstract:Around 5-10% of newborns need assistance to start breathing. Currently, there is a lack of evidence-based research, objective data collection, and opportunities for learning from real newborn resuscitation emergency events. Generating and evaluating automated newborn resuscitation algorithm activity timelines relative to the Time of Birth (ToB) offers a promising opportunity to enhance newborn care practices. Given the importance of prompt resuscitation interventions within the “golden minute” after birth, having an accurate ToB with second precision is essential for effective subsequent analysis of newborn resuscitation episodes. Instead, ToB is generally registered manually, often with minute precision, making the process inefficient and susceptible to error and imprecision. In this work, we explore the fusion of Artificial Intelligence (AI) and thermal imaging to develop the first AI-driven ToB detector. The use of temperature information offers a promising alternative to detect the newborn while respecting the privacy of healthcare providers and mothers. However, the frequent inconsistencies in thermal measurements, especially in a multi-camera setup, make normalization strategies critical. Our methodology involves a three-step process: first, we propose an adaptive normalization method based on Gaussian mixture models (GMM) to mitigate issues related to temperature variations; second, we implement and deploy an AI model to detect the presence of the newborn within the thermal video frames; and third, we evaluate and post-process the model’s predictions to estimate the ToB. A precision of 88.1% and a recall of 89.3% are reported in the detection of the newborn within thermal frames during performance evaluation. Our approach achieves an absolute median deviation of 2.7 seconds in estimating the ToB relative to the manual annotations.

[AI-51] Model-Based Differentially Private Knowledge Transfer for Large Language Models

Link: https://arxiv.org/abs/2410.10481
Authors: Zhaomin Wu,Jizhou Guo,Junyi Hou,Bingsheng He,Lixin Fan,Qiang Yang
Keywords: effectively leveraging domain-specific, large language models, web services, effectively leveraging, leveraging domain-specific knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract: As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose Llamdex, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.

[AI-52] TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Link: https://arxiv.org/abs/2410.10479
Authors: Haochuan Wang,Xiachong Feng,Lei Li,Zhanyue Qin,Dianbo Sui,Lingpeng Kong
Keywords: drawing increasing attention, large language models, reasoning drawing increasing, strategic reasoning drawing, increasing attention
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs’ strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI’s latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench’s challenges.
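
One way to arrive at the figure of 144 game types is to enumerate all 2x2 games in which each player strictly ranks the four outcomes (4! x 4! = 576 payoff assignments) and treat games that differ only by swapping the two rows or the two columns as the same game. The sketch below does that count and also finds pure-strategy Nash equilibria for a single game. The cell ordering, the function names, and the Prisoner's Dilemma payoffs are assumptions made for illustration and are not taken from the benchmark's code.

```python
from itertools import permutations


def swap_rows(p):
    """Relabel the row player's two actions; cells are (r0c0, r0c1, r1c0, r1c1)."""
    return (p[2], p[3], p[0], p[1])


def swap_cols(p):
    """Relabel the column player's two actions."""
    return (p[1], p[0], p[3], p[2])


def canonical(game):
    """Smallest representative among the four row/column relabelings."""
    a, b = game
    variants = [
        (a, b),
        (swap_rows(a), swap_rows(b)),
        (swap_cols(a), swap_cols(b)),
        (swap_cols(swap_rows(a)), swap_cols(swap_rows(b))),
    ]
    return min(variants)


def pure_nash(game):
    """Cells where neither player gains by deviating unilaterally."""
    a, b = game
    cells = []
    for r in (0, 1):
        for c in (0, 1):
            i = 2 * r + c
            row_alt = 2 * (1 - r) + c  # row player switches action
            col_alt = 2 * r + (1 - c)  # column player switches action
            if a[i] > a[row_alt] and b[i] > b[col_alt]:
                cells.append((r, c))
    return cells


ranks = list(permutations((1, 2, 3, 4)))
games = {canonical((a, b)) for a in ranks for b in ranks}
print(len(games))

# Prisoner's Dilemma with ordinal payoffs; action 0 = cooperate, 1 = defect.
pd = ((3, 1, 4, 2), (3, 4, 1, 2))
print(pure_nash(pd))
```

Because payoffs are strict, no game is left unchanged by a row or column swap, so each equivalence class has exactly four members and 576 / 4 = 144. For the Prisoner's Dilemma payoffs above, mutual defection is the only pure-strategy equilibrium.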

[AI-53] Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?

Link: https://arxiv.org/abs/2410.10476
Authors: Gabriel Roccabruna,Massimo Rizzoli,Giuseppe Riccardi
Keywords: Temporal Relation Classification, automatic detection, temporal relations, Large Language Models, temporal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The automatic detection of temporal relations among events has been mainly investigated with encoder-only models such as RoBERTa. Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested the LLMs’ performance in detecting temporal relations of closed-source models only, limiting the interpretability of those results. In this work, we investigate LLMs’ performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open and closed-sourced LLMs experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainable methods. The outcome suggests a limitation of LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models can be found respectively on GitHub.

[AI-54] TABCF: Counterfactual Explanations for Tabular Data Using a Transformer-Based VAE

Link: https://arxiv.org/abs/2410.10463
Authors: Emmanouil Panagiotou,Manuel Heurich,Tim Landgraf,Eirini Ntoutsi
Keywords: field of Explainable, alter a prediction, interpret a black-box, XAI, specific feature types
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Paper accepted at ICAIF '24: 5th ACM International Conference on AI in Finance, Brooklyn, NY, USA, November 2024

Abstract:In the field of Explainable AI (XAI), counterfactual (CF) explanations are one prominent method to interpret a black-box model by suggesting changes to the input that would alter a prediction. In real-world applications, the input is predominantly in tabular form and comprised of mixed data types and complex feature interdependencies. These unique data characteristics are difficult to model, and we empirically show that they lead to bias towards specific feature types when generating CFs. To overcome this issue, we introduce TABCF, a CF explanation method that leverages a transformer-based Variational Autoencoder (VAE) tailored for modeling tabular data. Our approach uses transformers to learn a continuous latent space and a novel Gumbel-Softmax detokenizer that enables precise categorical reconstruction while preserving end-to-end differentiability. Extensive quantitative evaluation on five financial datasets demonstrates that TABCF does not exhibit bias toward specific feature types, and outperforms existing methods in producing effective CFs that align with common CF desiderata.
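
For readers unfamiliar with the Gumbel-Softmax relaxation mentioned in the abstract, here is a minimal standard-library sketch of the generic technique: add Gumbel noise to the logits of a categorical feature and apply a temperature-scaled softmax, which gives a smooth, near-one-hot vector. The logits, the temperature, and the function name are hypothetical, and this is not TABCF's detokenizer, which the abstract does not describe in detail.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a categorical feature with three levels.
rng = random.Random(0)
sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
print(sample)
```

Lower temperatures push the output closer to a one-hot vector, while the softmax keeps the mapping differentiable with respect to the logits, which is the property the abstract refers to as end-to-end differentiability.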

[AI-55] Compositional Shielding and Reinforcement Learning for Multi-Agent Systems

Link: https://arxiv.org/abs/2410.10460
Authors: Asger Horn Brorholt,Kim Guldstrand Larsen,Christian Schilling
Keywords: obtaining high-performance policies, Deep reinforcement learning, Deep reinforcement, powerful tool, tool for obtaining
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep reinforcement learning has emerged as a powerful tool for obtaining high-performance policies. However, the safety of these policies has been a long-standing issue. One promising paradigm to guarantee safety is a shield, which shields a policy from making unsafe actions. However, computing a shield scales exponentially in the number of state variables. This is a particular concern in multi-agent systems with many agents. In this work, we propose a novel approach for multi-agent shielding. We address scalability by computing individual shields for each agent. The challenge is that typical safety specifications are global properties, but the shields of individual agents only ensure local properties. Our key to overcome this challenge is to apply assume-guarantee reasoning. Specifically, we present a sound proof rule that decomposes a (global, complex) safety specification into (local, simple) obligations for the shields of the individual agents. Moreover, we show that applying the shields during reinforcement learning significantly improves the quality of the policies obtained for a given training budget. We demonstrate the effectiveness and scalability of our multi-agent shielding framework in two case studies, reducing the computation time from hours to seconds and achieving fast learning convergence.
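
The idea of per-agent shields whose local guarantees add up to a global safety property can be sketched with a toy example. Two agents move on a line; the global specification is that they never occupy the same cell. Each shield only enforces a local obligation (stay inside an assigned interval), and the intervals are disjoint, so the global property follows. The intervals, the random policy, and the function names are hypothetical, and this is not the paper's proof rule or case study.

```python
import random


def make_shield(lo, hi):
    """Local shield: permit a move only if the agent stays within [lo, hi]."""
    def shield(pos, proposed, fallback=0):
        return proposed if lo <= pos + proposed <= hi else fallback
    return shield


# Local obligations: agent A stays in [0, 4], agent B stays in [6, 10].
# Since the intervals are disjoint, the global "no collision" property holds.
shield_a = make_shield(0, 4)
shield_b = make_shield(6, 10)

rng = random.Random(0)
a, b = 2, 8
collided = False
for _ in range(1000):
    # An arbitrary (here random) policy proposes moves; the shields filter them.
    a += shield_a(a, rng.choice((-1, 0, 1)))
    b += shield_b(b, rng.choice((-1, 0, 1)))
    collided = collided or a == b

print(a, b, collided)
```

Each shield is computed over a single agent's state, which hints at why decomposition helps with scalability: no shield ever needs to reason about the joint state space.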

[AI-56] Mobility-Aware Federated Learning: Multi-Armed Bandit Based Selection in Vehicular Network

Link: https://arxiv.org/abs/2410.10451
Authors: Haoyu Tu,Lin Chen,Zuguang Li,Xiaopei Chen,Wen Wu
Keywords: vehicular federated learning, federated learning, vehicle selection problem, paper,we study, mobility-aware vehicular federated
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by 2024 IEEE Globecom Workshops (GC Wkshps)

Abstract: In this paper, we study a vehicle selection problem for federated learning (FL) over vehicular networks. Specifically, we design a mobility-aware vehicular federated learning (MAVFL) scheme in which vehicles drive through a road segment to perform FL. Some vehicles may drive out of the segment, which leads to unsuccessful training. In the proposed scheme, the real-time successful training participation ratio is utilized to implement vehicle selection. We conduct the convergence analysis to indicate the influence of vehicle mobility on training loss. Furthermore, we propose a multi-armed bandit-based vehicle selection algorithm to minimize the utility function considering training loss and delay. The simulation results show that compared with baselines, the proposed algorithm can achieve better training performance with approximately 28% faster convergence.
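
To give a feel for bandit-based selection, the sketch below runs the textbook UCB1 rule over three hypothetical vehicles, where the reward is 1 if a vehicle finishes its local training before leaving the segment and 0 otherwise. The success probabilities, the reward definition, and the function names are assumptions; the paper's algorithm optimizes a utility that combines training loss and delay, which this generic sketch does not model.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Pick an untried arm first, else the arm with the highest UCB1 index."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )


def run(success_prob, rounds=2000, seed=0):
    """Simulate UCB1 with Bernoulli rewards; returns pulls per arm."""
    rng = random.Random(seed)
    k = len(success_prob)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts


# Hypothetical chance that each vehicle completes training in the segment.
counts = run([0.2, 0.5, 0.9])
print(counts)
```

Over time the rule concentrates its selections on the vehicle most likely to complete training while still occasionally probing the others, which is the exploration and exploitation trade-off that motivates a bandit formulation.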

[AI-57] KBLaM: Knowledge Base augmented Language Model

Link: https://arxiv.org/abs/2410.10450
Authors: Xi Wang,Liana Mikaelyan,Taketomo Isazawa,James Hensman
Keywords: augmenting Large Language, Base augmented Language, Large Language Models, augmented Language Model, propose Knowledge Base
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents, transforming each piece of knowledge in the KB into continuous key-value vector pairs via pre-trained sentence encoders with linear adapters and integrating them into pre-trained LLMs via a specialized rectangular attention mechanism. Unlike Retrieval-Augmented Generation, KBLaM eliminates external retrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically. Our approach enables integrating a large KB of more than 10K triples into an 8B pre-trained LLM of only 8K context window on one single A100 80GB GPU and allows for dynamic updates without model fine-tuning or retraining. Experiments demonstrate KBLaM’s effectiveness in various tasks, including question-answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge.

[AI-58] Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Link: https://arxiv.org/abs/2410.10441
Authors: Kai Han,Jianyuan Guo,Yehui Tang,Wei He,Enhua Wu,Yunhe Wang
Keywords: achieved remarkable success, understanding remains challenging, Vision-language large models, remains challenging due, Vision-language large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech report

Abstract: Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video-LLMs deliver high performance, they often require substantial resources for training and inference. Conversely, training-free approaches offer a more efficient alternative by adapting pre-trained image-LLMs models for video tasks without additional training, but they face inference efficiency bottlenecks due to the large number of visual tokens generated from video frames. In this work, we present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. The proposed framework decouples spatial-temporal dimension and performs temporal frame sampling and spatial RoI cropping respectively based on task-specific prompts. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. Extensive experiments demonstrate that our approach achieves competitive results with significantly fewer tokens, offering an optimal trade-off between accuracy and computational efficiency compared to state-of-the-art video LLMs. The code will be available at this https URL.

[AI-59] LKASeg: Remote-Sensing Image Semantic Segmentation with Large Kernel Attention and Full-Scale Skip Connections ICASSP2025

Link: https://arxiv.org/abs/2410.10433
Authors: Xuezhi Xiang,Yibo Ning,Lei Zhang,Denis Ombati,Himaloy Himu,Xiantong Zhen
Keywords: Large Kernel Attention, Convolutional Neural Networks, Full-Scale Skip Connections, Kernel Attention, semantic segmentation network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract: Semantic segmentation of remote sensing images is a fundamental task in geospatial research. However, widely used Convolutional Neural Networks (CNNs) and Transformers have notable drawbacks: CNNs may be limited by insufficient remote sensing modeling capability, while Transformers face challenges due to computational complexity. In this paper, we propose a remote-sensing image semantic segmentation network named LKASeg, which combines Large Kernel Attention (LSKA) and Full-Scale Skip Connections (FSC). Specifically, we propose a decoder based on Large Kernel Attention (LKA), which extracts global features while avoiding the computational overhead of self-attention and providing channel adaptability. To achieve full-scale feature learning and fusion, we apply Full-Scale Skip Connections (FSC) between the encoder and decoder. We conducted experiments by combining the LKA-based decoder with FSC. On the ISPRS Vaihingen dataset, the mF1 and mIoU scores reached 90.33% and 82.77%, respectively.

[AI-60] FairMindSim: Alignment of Behavior Emotion and Belief in Humans and LLM Agents Amid Ethical Dilemmas

Link: https://arxiv.org/abs/2410.10398
Authors: Yu Lei,Hao Liu,Chengxing Xie,Songjia Liu,Zhiyu Yin,Canyu Chen,Guohao Li,Philip Torr,Zhen Wu
Keywords: control and safety, pivotal issue, Abstract, behavior, LLM agents
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:AI alignment is a pivotal issue concerning AI control and safety. It should consider not only value-neutral human preferences but also moral and ethical considerations. In this study, we introduced FairMindSim, which simulates the moral dilemma through a series of unfair scenarios. We used LLM agents to simulate human behavior, ensuring alignment across various stages. To explore the various socioeconomic motivations, which we refer to as beliefs, that drive both humans and LLM agents as bystanders to intervene in unjust situations involving others, and how these beliefs interact to influence individual behavior, we incorporated knowledge from relevant sociological fields and proposed the Belief-Reward Alignment Behavior Evolution Model (BREM) based on the recursive reward model (RRM). Our findings indicate that, behaviorally, GPT-4o exhibits a stronger sense of social justice, while humans display a richer range of emotions. Additionally, we discussed the potential impact of emotions on behavior. This study provides a theoretical foundation for applications in aligning LLMs with altruistic values.

[AI-61] PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation NEURIPS2024

Link: https://arxiv.org/abs/2410.10394
Authors: Kaidong Zhang,Pengzhen Ren,Bingqian Lin,Junfan Lin,Shikui Ma,Hang Xu,Xiaodan Liang
Keywords: follow abstract user, Language-guided robotic manipulation, abstract user instructions, Language-guided robotic, waypOinT-aware world model
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract: Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work trivially fits the data without revealing the relation between instruction and low-level executable actions; such models are prone to memorizing the superficial patterns of the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby helping the model reduce computational redundancy and improve model execution efficiency. Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased by 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.

[AI-62] Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search

Link: https://arxiv.org/abs/2410.10392
Authors: Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yicheng Li, Hao Chen, Fei Yu, Yin Zhang
Keywords: humans’ actual goals, aligning language models, real world, crucial technique, technique for aligning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Instruction tuning is a crucial technique for aligning language models with humans’ actual goals in the real world. Extensive research has highlighted that the quality of instruction data is essential for the success of this alignment. However, creating high-quality data manually is labor-intensive and time-consuming, which leads researchers to explore using LLMs to synthesize data. Recent studies have focused on using a stronger LLM to iteratively enhance existing instruction data, showing promising results. Nevertheless, previous work often lacks control over the evolution direction, resulting in high uncertainty in the data synthesis process and low-quality instructions. In this paper, we introduce IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a general and scalable framework for efficiently synthesizing instructions. With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning. Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81. Furthermore, in open-domain benchmarks, IDEA-MCTS improves the accuracy of real-world instruction-following skills in LLMs by an average of 5% in low-resource settings.
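
The abstract does not say which selection rule the tree search uses. As a rough illustration of how tree search plus an evaluation score can steer which instruction variant gets expanded next, here is a minimal sketch of the standard UCT rule from Monte Carlo Tree Search. The `Node` structure, the exploration constant `c`, and the idea that scores come from an evaluation model are all assumptions for illustration, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate instruction in the search tree (hypothetical structure)."""
    instruction: str
    visits: int = 0
    total_score: float = 0.0
    children: list = field(default_factory=list)  # list of Node


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT value: mean score plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_score / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT value."""
    return max(parent.children, key=lambda ch: uct(ch, parent.visits))


def backpropagate(path: list, score: float) -> None:
    """Add one evaluation score to every node on the root-to-leaf path."""
    for node in path:
        node.visits += 1
        node.total_score += score
```

In this sketch a higher mean score pulls the search toward variants that were rated well, while the square-root term keeps rarely visited variants in play.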

[AI-63] Stein Variational Evolution Strategies

Link: https://arxiv.org/abs/2410.10390
Authors: Cornelius V. Braun, Robert T. Lange, Marc Toussaint
Keywords: Variational Gradient Descent, Gradient Descent, highly efficient method, Stein Variational Gradient, Stein Variational
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Stein Variational Gradient Descent (SVGD) is a highly efficient method to sample from an unnormalized probability distribution. However, the SVGD update relies on gradients of the log-density, which may not always be available. Existing gradient-free versions of SVGD make use of simple Monte Carlo approximations or gradients from surrogate distributions, both with limitations. To improve gradient-free Stein variational inference, we combine SVGD steps with evolution strategy (ES) updates. Our results demonstrate that the resulting algorithm generates high-quality samples from unnormalized target densities without requiring gradient information. Compared to prior gradient-free SVGD methods, we find that the integration of the ES update in SVGD significantly improves the performance on multiple challenging benchmark problems.
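
For context, the gradient dependence the abstract refers to is visible in the standard SVGD update, sketched below with NumPy. This is the ordinary gradient-based step with a fixed-bandwidth RBF kernel, not the paper's gradient-free ES variant; the function names, step size, and bandwidth are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, bandwidth: float):
    """Return K[j, i] = k(x_j, x_i) and its gradient with respect to x_j."""
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    grad_k = -diff * k[:, :, None] / bandwidth ** 2  # d k(x_j, x_i) / d x_j
    return k, grad_k


def svgd_step(x, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update for particles x of shape (n, d)."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, bandwidth)
    scores = grad_log_p(x)  # gradient of the log-density at each particle
    # phi[i] = mean_j [ k(x_j, x_i) * score_j + grad_{x_j} k(x_j, x_i) ]
    phi = (k.T @ scores + grad_k.sum(axis=0)) / n
    return x + step_size * phi
```

The `grad_log_p` argument is the part that a gradient-free method has to replace or approximate; the kernel-gradient term is what keeps the particles from collapsing onto a single mode.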

[AI-64] BookWorm: A Dataset for Character Description and Analysis EMNLP2024

Link: https://arxiv.org/abs/2410.10372
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Keywords: driving the plot, engaging readers, plot and engaging, numerous interacting characters, Characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 30 pages, 2 figures, EMNLP 2024 Findings

Abstract:Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.

[AI-65] Innovative Thinking Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps

Link: https://arxiv.org/abs/2410.10370
Authors: Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, Hui Wang
Keywords: culturally nuanced aspect, possess good creativity, strong associative thinking, requiring participants, culturally nuanced
Categories: Artificial Intelligence (cs.AI)

Abstract: Humor is a culturally nuanced aspect of human language that presents challenges for understanding and generation, requiring participants to possess good creativity and strong associative thinking. Similar to reasoning tasks like solving math problems, humor generation requires continuous reflection and revision to foster creative thinking, rather than relying on a sudden flash of inspiration as in the Creative Leap-of-Thought (CLoT) paradigm. Although CLoT can realize remote association generation, this paradigm fails to generate humorous content. Therefore, in this paper, we propose a systematic way of thinking about generating humor and, based on it, build the Creative Leap of Structured Thought (CLoST) frame. First, a reward model is necessary to enable error correction, since there is currently no expert model of humor and no usable rule to determine whether a piece of content is humorous. Judgement-oriented instructions are designed to improve the capability of a model, and we also propose an open-domain instruction evolutionary method to fully unleash the potential. Then, through reinforcement learning, the model learns to hone its rationales of the thought chain and refine the strategies it uses. Thus, it learns to recognize and correct its mistakes, and finally generates the most humorous and creative answer. These findings deepen our understanding of the creative capabilities of LLMs and provide ways to enhance LLMs’ creative abilities for cross-domain innovative applications.

[AI-66] Affinity-Graph-Guided Contractive Learning for Pretext-Free Medical Image Segmentation with Minimal Annotation

Link: https://arxiv.org/abs/2410.10366
Authors: Zehua Cheng, Di Yuan, Thomas Lukasiewicz
Keywords: medical image segmentation, semi-supervised contrastive learning, contrastive learning framework, image segmentation, contrastive learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: BIBM 2024

Abstract:The combination of semi-supervised learning (SemiSL) and contrastive learning (CL) has been successful in medical image segmentation with limited annotations. However, these works often rely on pretext tasks that lack the specificity required for pixel-level segmentation, and still face overfitting issues due to insufficient supervision signals resulting from too few annotations. Therefore, this paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) by establishing additional affinity-graph-based supervision signals between the student and teacher network, to achieve medical image segmentation with minimal annotations without pretext. The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space without relying on pretext tasks. Furthermore, the framework designs an affinity-graph-guided loss function, which can improve the quality of the learned representation and the model generalization ability by exploiting the inherent structure of the data, thus mitigating overfitting. Our experiments indicate that with merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%. Under the stringent conditions where only 5% of the annotations are employed, our model exhibits a significant enhancement in performance surpassing the second best baseline by 23.09% on the dice metric and achieving an improvement of 26.57% on the notably arduous CRAG and ACDC datasets.

[AI-67] SpeGCL: Self-supervised Graph Spectrum Contrastive Learning without Positive Samples

Link: https://arxiv.org/abs/2410.10365
Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng
Keywords: GCL, existing GCL, existing GCL methods, GCL methods, Graph Contrastive Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures

Abstract: Graph Contrastive Learning (GCL) excels at managing noise and fluctuations in input data, making it popular in various fields (e.g., social networks and knowledge graphs). Our study finds that the difference in high-frequency information between augmented graphs is greater than that in low-frequency information. However, most existing GCL methods focus mainly on the time domain (low-frequency information) for node feature representations and cannot make good use of high-frequency information to speed up model convergence. Furthermore, existing GCL paradigms optimize graph embedding representations by pulling positive sample pairs closer together and pushing positive and negative sample pairs farther apart, but our theoretical analysis shows that graph contrastive learning benefits from pushing negative pairs farther apart rather than pulling positive pairs closer. To solve the above-mentioned problems, we propose a novel spectral GCL framework without positive samples, named SpeGCL. Specifically, to solve the problem that existing GCL methods cannot utilize high-frequency information, SpeGCL uses a Fourier transform to extract high-frequency and low-frequency information of node features, and constructs a contrastive learning mechanism in Fourier space to obtain better node feature representations. Furthermore, SpeGCL relies entirely on negative samples to refine the graph embedding. We also provide a theoretical justification for the efficacy of using only negative samples in SpeGCL. Extensive experiments on unsupervised learning, transfer learning, and semi-supervised learning have validated the superiority of our SpeGCL framework over state-of-the-art GCL methods.

[AI-68] CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

Link: https://arxiv.org/abs/2410.10336
Authors: Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen
Keywords: Mathematically Annotated Thought, large language models, remains a significant, significant challenge, challenge for large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 8 pages, 12 figures

Abstract: Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.

[AI-69] Disentangling Hate Across Target Identities

Link: https://arxiv.org/abs/2410.10332
Authors: Yiping Jin, Leo Wanner, Aneesh Moideen Koya
Keywords: perform equally, detecting hateful expressions, target identities, Hate speech, specific target identities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: Hate speech (HS) classifiers do not perform equally well in detecting hateful expressions towards different target identities. They also demonstrate systematic biases in predicted hatefulness scores. Drawing on two recently proposed functionality test datasets for HS detection, we quantitatively analyze the impact of different factors on HS prediction. Experiments on popular industrial and academic models demonstrate that HS detectors assign a higher hatefulness score merely based on the mention of specific target identities. In addition, models often confuse hatefulness and the polarity of emotions. This result is worrisome, as the effort to build HS detectors might harm the vulnerable identity groups we wish to protect: posts expressing anger or disapproval of hate expressions might be flagged as hateful themselves. We also carry out a study inspired by social psychology theory, which reveals that the accuracy of hatefulness prediction correlates strongly with the intensity of the stereotype.

[AI-70] GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs

Link: https://arxiv.org/abs/2410.10329
Authors: Yun Zhu, Haizhou Shi, Xiaotang Wang, Yongchao Liu, Yaoke Wang, Boci Peng, Chuntao Hong, Siliang Tang
Keywords: Large Language Models, Large Language, bolster TAG methodologies, gained significant attention, significant attention due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Recently, research on Text-Attributed Graphs (TAGs) has gained significant attention due to the prevalence of free-text node features in real-world applications and the advancements in Large Language Models (LLMs) that bolster TAG methodologies. However, current TAG approaches face two primary challenges: (i) Heavy reliance on label information and (ii) Limited cross-domain zero/few-shot transferability. These issues constrain the scaling of both data and model size, owing to high labor costs and scaling laws, complicating the development of graph foundation models with strong transferability. In this work, we propose the GraphCLIP framework to address these challenges by learning graph foundation models with strong cross-domain zero/few-shot transferability through a self-supervised contrastive graph-summary pretraining method. Specifically, we generate and curate large-scale graph-summary pair data with the assistance of LLMs, and introduce a novel graph-summary pretraining method, combined with invariant learning, to enhance graph foundation models with strong cross-domain zero-shot transferability. For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective to mitigate catastrophic forgetting and minimize learning costs. Extensive experiments show the superiority of GraphCLIP in both zero-shot and few-shot settings, while evaluations across various downstream tasks confirm the versatility of GraphCLIP. Our code is available at: this https URL

[AI-71] DiRW: Path-Aware Digraph Learning for Heterophily

Link: https://arxiv.org/abs/2410.10320
Authors: Daohan Su, Xunkai Li, Zhenjun Li, Yinping Liao, Rong-Hua Li, Guoren Wang
Keywords: powerful representation learning, representation learning tool, graph-structured data, powerful representation, tool for graph-structured
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract: Recently, graph neural networks (GNNs) have emerged as a powerful representation learning tool for graph-structured data. However, most approaches are tailored for undirected graphs, neglecting the abundant information embedded in the edges of directed graphs (digraphs). In fact, digraphs are widely applied in the real world (e.g., social networks and recommendations) and are also confirmed to offer a new perspective for addressing topological heterophily challenges (i.e., connected nodes have complex patterns of feature distribution or labels). Despite recent significant advancements in DiGNNs, existing spatial- and spectral-based methods have inherent limitations due to their complex learning mechanisms and reliance on high-quality topology, leading to low efficiency and unstable performance. To address these issues, we propose Directed Random Walk (DiRW), which can be viewed as a plug-and-play strategy or an innovative neural architecture that provides guidance or a new learning paradigm for most spatial-based methods or digraphs. Specifically, DiRW incorporates a direction-aware path sampler optimized from the perspectives of walk probability, length, and number in a weight-free manner by considering node profiles and topological structure. Building upon this, DiRW utilizes a node-wise learnable path aggregator for generalized messages obtained by our proposed adaptive walkers to represent the current node. Extensive experiments on 9 datasets demonstrate that DiRW: (1) enhances most spatial-based methods as a plug-and-play strategy; (2) achieves SOTA performance as a new digraph learning paradigm.

[AI-72] EasyRAG: Efficient Retrieval-Augmented Generation Framework for Network Automated Operations

Link: https://arxiv.org/abs/2410.10315
Authors: Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang
Keywords: paper presents EasyRAG, network automated operations, Question Answering, retrieval-augmented generation framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures

Abstract: This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for network automated operations. The advantages of our solution are: (1) Question Answering: We designed a straightforward RAG scheme based on a specific data processing workflow, dual-route sparse retrieval for coarse ranking, an LLM reranker for reranking, and LLM answer generation and optimization. This approach achieved first place in the GLM4 track in the preliminary round and second place in the GLM4 track in the semifinals. (2) Deployment: Our method primarily consists of BM25 retrieval and BGE-reranker reranking; it requires no fine-tuning of any models, occupies minimal VRAM, is easy to deploy, and is highly scalable. We provide a flexible code library with various search and generation strategies, facilitating custom process implementation. (3) Inference: We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a good level of accuracy; each acceleration scheme can be plugged into any component of the RAG process, consistently enhancing the efficiency of the RAG system. Our code and data are released at this https URL.
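
To make the coarse-ranking stage concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring followed by a top-k cut. It only illustrates the sparse-retrieval step the abstract mentions; the whitespace tokenizer, the `k1` and `b` defaults, and the function names are assumptions, and the reranking and generation stages are not shown. It is not the EasyRAG code.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: list, k: int = 2) -> list:
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In a pipeline of this shape, the indices returned by `top_k` would be handed to a reranker, which is why the coarse stage can afford to be cheap and model-free.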

[AI-73] Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Link: https://arxiv.org/abs/2410.10291
Authors: Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu
Keywords: Accurate interpretation, interpretation and visualization, semantic variations, Accurate, variations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Our benchmark and code are available at this https URL

Abstract: Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. The focus on frequent word combinations often obscures poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community’s exploration of human instruction understanding.

[AI-74] ABBA-VSM: Time Series Classification using Symbolic Representation on the Edge

Link: https://arxiv.org/abs/2410.10285
Authors: Meerzhan Kanatbekova, Shashikant Ilager, Ivona Brandic
Keywords: smart city management, recent years, city management, Edge, Internet of Things
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages with references, 5 figures

Abstract: In recent years, Edge AI has become more prevalent with applications across various industries, from environmental monitoring to smart city management. Edge AI facilitates the processing of Internet of Things (IoT) data and provides privacy-enabled and latency-sensitive services to application users using Machine Learning (ML) algorithms, e.g., Time Series Classification (TSC). However, existing TSC algorithms require access to full raw data and demand substantial computing resources to train and use them effectively at runtime. This makes them impractical for deployment in resource-constrained Edge environments. To address this, we propose an Adaptive Brownian Bridge-based Symbolic Aggregation Vector Space Model (ABBA-VSM), a new TSC model designed for classification services on the Edge. We first adaptively compress the raw time series into symbolic representations, thus capturing the changing trends of the data. Subsequently, we train the classification model directly on these symbols. ABBA-VSM reduces communication data between IoT and Edge devices, as well as computation cycles, in the development of resource-efficient TSC services on the Edge. We evaluate our solution with extensive experiments using datasets from the UCR time series classification archive. The results demonstrate that ABBA-VSM achieves up to 80% compression ratio and 90-100% accuracy for binary classification. For non-binary classification, it achieves an average compression ratio of 60% and accuracy ranging from 60% to 80%.

[AI-75] Trust or Bust: Ensuring Trustworthiness in Autonomous Weapon Systems

Link: https://arxiv.org/abs/2410.10284
Authors: Kasper Cools, Clara Maathuis
Keywords: Autonomous Weapon Systems, Autonomous Weapon, military operations presents, integration of Autonomous, Weapon Systems
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted as a workshop paper at MILCOM 2024, 8 pages

Abstract:The integration of Autonomous Weapon Systems (AWS) into military operations presents both significant opportunities and challenges. This paper explores the multifaceted nature of trust in AWS, emphasising the necessity of establishing reliable and transparent systems to mitigate risks associated with bias, operational failures, and accountability. Despite advancements in Artificial Intelligence (AI), the trustworthiness of these systems, especially in high-stakes military applications, remains a critical issue. Through a systematic review of existing literature, this research identifies gaps in the understanding of trust dynamics during the development and deployment phases of AWS. It advocates for a collaborative approach that includes technologists, ethicists, and military strategists to address these ongoing challenges. The findings underscore the importance of Human-Machine teaming and enhancing system intelligibility to ensure accountability and adherence to International Humanitarian Law. Ultimately, this paper aims to contribute to the ongoing discourse on the ethical implications of AWS and the imperative for trustworthy AI in defense contexts.

[AI-76] QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis

Link: https://arxiv.org/abs/2410.10270
Authors: Abhijit Manatkar, Ashlesha Akella, Parthivi Gupta, Krishnasuri Narayanam
Keywords: Exploratory Data Analysis, Discovering meaningful insights, Large Language Models, Discovering meaningful, Exploratory Data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 6 pages

Abstract:Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.

[AI-77] LoLCATs: On Low-Rank Linearizing of Large Language Models

Link: https://arxiv.org/abs/2410.10254
Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré
Keywords: popular Transformer-based LLMs, Recent works show, expensive pretraining costs, linearize large language, popular Transformer-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 47 pages, 20 figures, 18 tables, preprint

Abstract:Recent works show we can linearize large language models (LLMs) – swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention – avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM’s softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss (“attention transfer”). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods’ model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
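
The "attention transfer" objective described above can be pictured as an output-matching loss between a softmax attention head and a linear attention head. The NumPy sketch below computes that MSE for a single causal head. The feature map `elu(x @ w) + 1`, the weight names `w_q` and `w_k`, and the quadratic-form evaluation of linear attention are assumptions made for readability, not the feature maps or kernels used by LoLCATs.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Causal softmax attention for one head; q, k, v have shape (seq, dim)."""
    seq, dim = q.shape
    logits = q @ k.T / np.sqrt(dim)
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    logits = np.where(mask, logits, -np.inf)  # hide future positions
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def feature_map(x, w):
    """Hypothetical positive feature map: elu(x @ w) + 1."""
    z = x @ w
    return np.where(z > 0, z + 1.0, np.exp(z))


def linear_attention(q, k, v, w_q, w_k):
    """Causal linear attention, written in quadratic form for clarity."""
    phi_q, phi_k = feature_map(q, w_q), feature_map(k, w_k)
    sim = np.tril(phi_q @ phi_k.T)  # zero out future positions
    weights = sim / sim.sum(axis=-1, keepdims=True)
    return weights @ v


def attention_transfer_loss(q, k, v, w_q, w_k):
    """MSE between linear-attention and softmax-attention outputs."""
    target = softmax_attention(q, k, v)
    approx = linear_attention(q, k, v, w_q, w_k)
    return float(np.mean((approx - target) ** 2))
```

In a training setup one would minimize this loss with respect to `w_q` and `w_k` while keeping the original model frozen; the sketch only evaluates the loss and does not train anything.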

[AI-78] Feedback Favors the Generalization of Neural ODEs

Link: https://arxiv.org/abs/2410.10253
Authors: Jindou Jia, Zihan Yang, Meng Wang, Kexin Guo, Jianfei Yang, Xiang Yu, Lei Guo
Keywords: well-known generalization problem, generalization problem hinders, varying latent dynamics, problem hinders, hinders the application
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 17 figures

Abstract:The well-known generalization problem hinders the application of artificial neural networks in continuous-time prediction tasks with varying latent dynamics. In sharp contrast, biological systems can neatly adapt to evolving environments benefiting from real-time feedback mechanisms. Inspired by the feedback philosophy, we present feedback neural networks, showing that a feedback loop can flexibly correct the learned latent dynamics of neural ordinary differential equations (neural ODEs), leading to a prominent generalization improvement. The feedback neural network is a novel two-DOF neural network, which possesses robust performance in unseen scenarios with no loss of accuracy performance on previous tasks. A linear feedback form is presented to correct the learned latent dynamics firstly, with a convergence guarantee. Then, domain randomization is utilized to learn a nonlinear neural feedback form. Finally, extensive tests including trajectory prediction of a real irregular object and model predictive control of a quadrotor with various uncertainties, are implemented, indicating significant improvements over state-of-the-art model-based and learning-based methods.
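
A toy example may help show why a feedback loop can correct mis-learned dynamics. The sketch below assumes an observer-style linear correction, d(x_hat)/dt = f(x_hat) + L (x - x_hat), where a deliberately wrong scalar model stands in for the learned neural ODE. The system, the gain value, and the Euler integrator are all assumptions for illustration and are not taken from the paper.

```python
def simulate(gain: float, steps: int = 2000, dt: float = 0.01) -> float:
    """Return the final prediction error for a given feedback gain.

    True system:    dx/dt     = -a_true * x + 1
    Learned model:  dx_hat/dt = -a_learned * x_hat + 1 + gain * (x - x_hat)
    """
    a_true, a_learned = 1.0, 2.0  # the learned decay rate is deliberately wrong
    x, x_hat = 0.0, 0.0
    for _ in range(steps):
        x_next = x + dt * (-a_true * x + 1.0)
        # Feedback term pulls the prediction toward the measured state x.
        x_hat = x_hat + dt * (-a_learned * x_hat + 1.0 + gain * (x - x_hat))
        x = x_next
    return abs(x - x_hat)


error_open_loop = simulate(gain=0.0)     # no feedback
error_with_feedback = simulate(gain=10.0)
```

With no feedback the mis-specified model settles at its own equilibrium (0.5 instead of 1.0), while the feedback term shrinks the steady-state gap without retraining the model.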

[AI-79] LOBG: Less Overfitting for Better Generalization in Vision-Language Model

Link: https://arxiv.org/abs/2410.10247
Authors: Chenhao Ding, Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Alex Kot, Yihong Gong
Keywords: Existing prompt learning, Vision-Language Models, Existing prompt, downstream tasks, VLM to downstream
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Existing prompt learning methods in Vision-Language Models (VLMs) have effectively enhanced the transfer capability of VLMs to downstream tasks, but they suffer from a significant decline in generalization due to severe overfitting. To address this issue, we propose a framework named LOBG for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts. To further mitigate overfitting, we developed a structural topology preservation (STP) loss at the feature level, which endows the feature space with overall plasticity, allowing effective reshaping of the feature space during optimization. Additionally, we employed hierarchical logit distillation (HLD) at the output level to constrain outputs, complementing STP at the output end. Extensive experimental results demonstrate that our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.

[AI-80] Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective

Link: https://arxiv.org/abs/2410.10241
Authors: Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Xinzhou Jin, Guibin Zhang, Zulun Zhu, Zibin Zheng, Liang Chen
Keywords: low-dimensional latent space, GAEs, self-supervised learning models, latent space, graph-structured data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Preprint, under review

Abstract:Graph autoencoders (GAEs) are self-supervised learning models that can learn meaningful representations of graph-structured data by reconstructing the input graph from a low-dimensional latent space. Over the past few years, GAEs have gained significant attention in academia and industry. In particular, the recent advent of GAEs with masked autoencoding schemes marks a significant advancement in graph self-supervised learning research. While numerous GAEs have been proposed, the underlying mechanisms of GAEs are not well understood, and a comprehensive benchmark for GAEs is still lacking. In this work, we bridge the gap between GAEs and contrastive learning by establishing conceptual and methodological connections. We revisit the GAEs studied in previous works and demonstrate how contrastive learning principles can be applied to GAEs. Motivated by these insights, we introduce lrGAE (left-right GAE), a general and powerful GAE framework that leverages contrastive learning principles to learn meaningful representations. Our proposed lrGAE not only facilitates a deeper understanding of GAEs but also sets a new benchmark for GAEs across diverse graph-based learning tasks. The source code for lrGAE, including the baselines and all the code for reproducing the results, is publicly available at this https URL.

[AI-81] ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization

Link: https://arxiv.org/abs/2410.10238
Authors: Jiawei Li, Fanrui Zhang, Jiaying Zhu, Esther Sun, Qiang Zhang, Zheng-Jun Zha
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, shown strong capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures

Abstract: Multimodal Large Language Models (MLLMs), such as GPT-4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to learning low-level, semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which extracts precise forgery mask information from input images and facilitates pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing it to effectively capture multi-scale, fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.

[AI-82] BanglaQuAD: A Bengali Open-domain Question Answering Dataset COLING2024 LREC

Link: https://arxiv.org/abs/2410.10229
Authors: Md Rashad Al Hasan Rony, Sudipto Kumar Shaha, Rakib Al Hasan, Sumon Kanti Dey, Amzad Hossain Rafi, Amzad Hossain Rafi, Ashraf Hasan Sirajee, Jens Lehmann
Keywords: natural language processing, seventh most spoken, considered a low-resource, field of natural, Bengali
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted into LREC-COLING 2024, Turin, Italy

Abstract: Bengali is the seventh most spoken language on earth, yet it is considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task as it requires understanding both question and passage. Very few researchers have attempted to perform question answering over Bengali (natively pronounced as Bangla) text. Typically, existing approaches construct their datasets by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, they lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset construction on a local machine. A qualitative analysis demonstrates the quality of our proposed dataset.

[AI-83] QE-EBM: Using Quality Estimators as Energy Loss for Machine Translation

Link: https://arxiv.org/abs/2410.10228
Authors: Gahyun Yoo, Jay Yoon Lee
Keywords: shown great promise, text generation tasks, including machine translation, including machine, Reinforcement learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning has shown great promise in aligning language models with human preferences in a variety of text generation tasks, including machine translation. For translation tasks, rewards can easily be obtained from quality estimation (QE) models which can generate rewards for unlabeled data. Despite its usefulness, reinforcement learning cannot exploit the gradients with respect to the QE score. We propose QE-EBM, a method of employing quality estimators as trainable loss networks that can directly backpropagate to the NMT model. We examine our method on several low and high resource target languages with English as the source language. QE-EBM outperforms strong baselines such as REINFORCE and proximal policy optimization (PPO) as well as supervised fine-tuning for all target languages, especially low-resource target languages. Most notably, for English-to-Mongolian translation, our method achieves improvements of 2.5 BLEU, 7.1 COMET-KIWI, 5.3 COMET, and 6.4 XCOMET relative to the supervised baseline.

[AI-84] Large Language Model-Enhanced Reinforcement Learning for Generic Bus Holding Control Strategies

Link: https://arxiv.org/abs/2410.10212
Authors: Jiajie Yu, Yuhong Wang, Wei Ma
Keywords: Bus holding control, Bus holding, bus holding strategies, widely-adopted strategy, strategy for maintaining
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 15 figures

Abstract:Bus holding control is a widely-adopted strategy for maintaining stability and improving the operational efficiency of bus systems. Traditional model-based methods often face challenges with the low accuracy of bus state prediction and passenger demand estimation. In contrast, Reinforcement Learning (RL), as a data-driven approach, has demonstrated great potential in formulating bus holding strategies. RL determines the optimal control strategies in order to maximize the cumulative reward, which reflects the overall control goals. However, translating sparse and delayed control goals in real-world tasks into dense and real-time rewards for RL is challenging, normally requiring extensive manual trial-and-error. In view of this, this study introduces an automatic reward generation paradigm by leveraging the in-context learning and reasoning capabilities of Large Language Models (LLMs). This new paradigm, termed the LLM-enhanced RL, comprises several LLM-based modules: reward initializer, reward modifier, performance analyzer, and reward refiner. These modules cooperate to initialize and iteratively improve the reward function according to the feedback from training and test results for the specified RL-based task. Ineffective reward functions generated by the LLM are filtered out to ensure the stable evolution of the RL agents’ performance over iterations. To evaluate the feasibility of the proposed LLM-enhanced RL paradigm, it is applied to various bus holding control scenarios, including a synthetic single-line system and a real-world multi-line system. The results demonstrate the superiority and robustness of the proposed paradigm compared to vanilla RL strategies, the LLM-based controller, and conventional space headway-based feedback control. This study sheds light on the great potential of utilizing LLMs in various smart mobility applications.

[AI-85] Predicting from Strings: Language Model Embeddings for Bayesian Optimization

Link: https://arxiv.org/abs/2410.10190
Authors: Tung Nguyen, Qiuyi Zhang, Bangding Yang, Chansoo Lee, Jorg Bornschein, Yingjie Miao, Sagi Perel, Yutian Chen, Xingyou Song
Keywords: improving search efficiency, fixed search spaces, tabular input features, search efficiency, Bayesian Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding capabilities of pretrained language models. By expressing all inputs as strings, we are able to perform general-purpose regression for Bayesian Optimization over various domains including synthetic, combinatorial, and hyperparameter optimization, obtaining comparable results to state-of-the-art Gaussian Process-based algorithms. Code can be found at this http URL.

[AI-86] Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention

Link: https://arxiv.org/abs/2410.10184
Authors: Ying Liu, Ge Bai, Chenji Lu, Shilong Li, Zhang Zhang, Ruifang Liu, Wenbin Guo
Keywords: Visual Question Answering, Question Answering, Visual Question, advancements in Visual, information remains unresolved
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Despite the remarkable advancements in Visual Question Answering (VQA), the challenge of mitigating the language bias introduced by textual information remains unresolved. Previous approaches capture language bias from a coarse-grained perspective. However, the finer-grained information within a sentence, such as context and keywords, can result in different biases. Because they ignore this fine-grained information, most existing methods fail to sufficiently capture language bias. In this paper, we propose a novel causal intervention training scheme named CIBi to eliminate language bias from a finer-grained perspective. Specifically, we divide the language bias into context bias and keyword bias. We employ causal intervention and contrastive learning to eliminate context bias and improve the multi-modal representation. Additionally, we design a new question-only branch based on counterfactual generation to distill and eliminate keyword bias. Experimental results illustrate that CIBi is applicable to various VQA models, yielding competitive performance.

[AI-87] Scalable Multi-Domain Adaptation of Language Models using Modular Experts

Link: https://arxiv.org/abs/2410.10181
Authors: Peter Schafhalter, Shun Liao, Yanqi Zhou, Chih-Kuan Yeh, Arun Kandoor, James Laudon
Keywords: pre-trained language models, multiple targeted tasks, language models, targeted tasks, resource-constrained use cases
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, 3 tables

Abstract: Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially in resource-constrained use cases, such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLM with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers, which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves target performance comparable to full parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE’s architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.

[AI-88] Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.10166
Authors: Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, Kimin Lee
Keywords: human feedback datasets, human feedback, aligning model behavior, feedback datasets, feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The preference margin, which is calculated using a proxy reward model, is used to identify samples with high informational value and thereby address the noisy nature of the feedback dataset. Additionally, we incorporate text quality, assessed by large language models to prevent harmful content, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, approximating the solution by assigning an importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets.

[AI-89] HSR-Enhanced Sparse Attention Acceleration

Link: https://arxiv.org/abs/2410.10165
Authors: Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. This paper introduces a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to rapidly identify non-zero or “massively activated” entries in the attention matrix. We present theoretical analyses for two key scenarios: attention generation and full attention computation with long input context. Our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive approach $O(mn)$, for attention generation, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time of full attention computation from $O(mn)$ to $O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5})$. Importantly, our method introduces no error for ReLU attention and only provably negligible error for Softmax attention, where the latter is supported by our empirical validation. This work represents a significant step towards enabling efficient long-context processing in LLMs, potentially broadening their applicability across various domains.
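
The sparsity argument for ReLU attention can be illustrated with a small sketch. It assumes a single query, a scalar threshold `bias`, and normalization by the sum of weights; all function names are hypothetical, and a plain linear scan stands in for the paper's HSR data structure. Keys whose inner product with the query does not exceed the threshold get weight exactly zero, so summing only over the activated keys reproduces the dense result.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relu_attention_dense(q, keys, values, bias, alpha=1):
    """ReLU^alpha attention for one query, touching every key."""
    weights = [max(0.0, dot(q, k) - bias) ** alpha for k in keys]
    total = sum(weights)
    dim = len(values[0])
    if total == 0:
        return [0.0] * dim
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def activated_keys(q, keys, bias):
    """Stand-in for a half-space query: indices i with <q, k_i> > bias.
    This scan is linear in the number of keys; it only marks where a
    dedicated reporting structure would be used instead."""
    return [i for i, k in enumerate(keys) if dot(q, k) > bias]

def relu_attention_sparse(q, keys, values, bias, alpha=1):
    """Same output as the dense version, summing over activated keys only."""
    idx = activated_keys(q, keys, bias)
    weights = [(dot(q, keys[i]) - bias) ** alpha for i in idx]
    total = sum(weights)
    dim = len(values[0])
    if total == 0:
        return [0.0] * dim
    return [sum(w * values[i][j] for w, i in zip(weights, idx)) / total
            for j in range(dim)]
```

The sketch shows only why skipping inactive keys is lossless for ReLU attention; the claimed running-time gains depend on answering the half-space query faster than a scan, which this code does not attempt.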

[AI-90] Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting

Link: https://arxiv.org/abs/2410.10150
Authors: Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
Keywords: instruction fine-tuned large, fine-tuned large language, large language models, instruction fine-tuned, fine-tuned large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model’s safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, with sizes ranging from 2B to 72B. Furthermore, our study provides insights into vulnerabilities in the safety of instruction-tuned LLMs and deepens the understanding of the internal mechanisms of LLMs.

[AI-91] alpha-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Link: https://arxiv.org/abs/2410.10148
Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Rong Jin, Xiangnan He
Keywords: Aligning large language, Aligning large, large language models, Direct Preference Optimization, Simple Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at this https URL
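
For orientation, the two baseline objectives named in the abstract can be sketched for a single preference pair. This is the commonly stated form of the DPO and SimPO losses, not the adaptive margin of the proposed method, whose exact formula the abstract does not give. Inputs are summed sequence log-probabilities; the function names and default hyperparameter values are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: the implicit reward is the log-ratio to a reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: length-normalized log-probabilities, no reference model,
    and a fixed target reward margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)
```

Seen side by side, the two limitations the abstract points to are visible in the code: `dpo_loss` needs `ref_logp_w` and `ref_logp_l` from a reference model, while `simpo_loss` applies the same constant `gamma` to every pair.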

[AI-92] Unified Representation of Genomic and Biomedical Concepts through Multi-Task Multi-Source Contrastive Learning

Link: https://arxiv.org/abs/2410.10144
Authors: Hongyi Yuan, Suqi Liu, Kelly Cho, Katherine Liao, Alexandre Pereira, Tianxi Cai
Keywords: introduce GENomic Encoding, GENomic Encoding REpresentation, GENomic Encoding, Encoding REpresentation, biomedical knowledge bases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
Comments: 15 pages, 2 figures, 5 tables

Abstract:We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL’s ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined identification of concepts. This pioneering approach in constructing a unified embedding system for both SNPs and biomedical concepts enhances the potential for data integration and discovery in biomedical research.

[AI-93] Beyond-RAG: Question Identification and Answer Generation in Real-Time Conversations

Link: https://arxiv.org/abs/2410.10136
Authors: Garima Agrawal, Sashank Gummuluri, Cosimo Spera
Keywords: relevant knowledge base, long average handling, average handling times, customer contact centers, retrieve relevant knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: In customer contact centers, human agents often struggle with long average handling times (AHT) due to the need to manually interpret queries and retrieve relevant knowledge base (KB) articles. While retrieval augmented generation (RAG) systems using large language models (LLMs) have been widely adopted in industry to assist with such tasks, RAG faces challenges in real-time conversations, such as inaccurate query formulation and redundant retrieval of frequently asked questions (FAQs). To address these limitations, we propose a decision support system that can look beyond RAG by first identifying customer questions in real time. If the query matches an FAQ, the system retrieves the answer directly from the FAQ database; otherwise, it generates answers via RAG. Our approach reduces reliance on manual queries, providing responses to agents within 2 seconds. Deployed in the AI-powered human-agent assist solution at Minerva CQ, this system improves efficiency, reduces AHT, and lowers operational costs. We also introduce an automated LLM-agentic workflow to identify FAQs from historical transcripts when no predefined FAQs exist.
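
The FAQ-first routing described in the abstract can be sketched as follows. The matching rule (token-level Jaccard similarity with a fixed threshold), the function names, and the `rag_generate` callback are all assumptions made for illustration; the abstract does not say how the deployed system detects an FAQ match.

```python
def tokenize(text):
    """Lowercase bag of words; question marks are ignored."""
    return set(text.lower().replace("?", " ").split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def answer(query, faq, rag_generate, threshold=0.6):
    """Return (source, answer).

    faq maps known questions to stored answers. If the best-matching
    question is similar enough, its stored answer is returned directly;
    otherwise the query falls through to the RAG generator.
    """
    q_tokens = tokenize(query)
    best_question, best_score = None, 0.0
    for question in faq:
        score = jaccard(q_tokens, tokenize(question))
        if score > best_score:
            best_question, best_score = question, score
    if best_question is not None and best_score >= threshold:
        return "faq", faq[best_question]
    return "rag", rag_generate(query)
```

Checking the FAQ store before invoking generation is what avoids the redundant retrieval the abstract mentions: a matched question costs one lookup and no LLM call.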

[AI-94] FormalAlign: Automated Alignment Evaluation for Autoformalization

Link: https://arxiv.org/abs/2410.10135
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
Keywords: convert informal mathematical, informal mathematical proofs, machine-verifiable formats, bridging the gap, aims to convert
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 23 pages, 13 tables, 3 figures

Abstract: Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce FormalAlign, the first automated framework designed for evaluating the alignment between natural and formal languages in autoformalization. FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, FormalAlign demonstrates superior performance. In our experiments, FormalAlign outperforms GPT-4, achieving an Alignment-Selection Score 11.58% higher on \forml-Basic (99.21% vs. 88.91%) and 3.19% higher on MiniF2F-Valid (66.39% vs. 64.34%). This effective alignment evaluation significantly reduces the need for manual verification. Both the dataset and code can be accessed via this https URL.

[AI-95] Learning Linear Attention in Polynomial Time

Link: https://arxiv.org/abs/2410.10101
Authors: Morris Yau, Ekin Akyurek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Keywords: simulating Boolean circuits, Previous research, simulating Boolean, Boolean circuits, Universal Turing Machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)

Abstract: Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key–value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
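
The claim that linear attention is a linear predictor in an expanded feature space can be checked numerically on a simplified case. The sketch below assumes a single head of the form y = Σ_i (qᵀ W x_i) · (V x_i), with W standing for the merged query-key matrix; this parameterization and all function names are illustrative assumptions rather than the paper's exact setup. Expanding the product shows that y is linear in the degree-3 features φ[b][c][e] = Σ_i x_i[b] · q[c] · x_i[e], with coefficients V[a][b] · W[c][e].

```python
def linear_attention(X, q, W, V):
    """y = sum_i (q^T W x_i) * (V x_i) for context tokens X and query q."""
    d = len(q)
    y = [0.0] * len(V)
    for x in X:
        score = sum(q[c] * W[c][e] * x[e] for c in range(d) for e in range(d))
        for a in range(len(V)):
            y[a] += score * sum(V[a][b] * x[b] for b in range(d))
    return y

def cubic_features(X, q):
    """phi[b][c][e] = sum_i x_i[b] * q[c] * x_i[e]; independent of W and V."""
    d = len(q)
    return [[[sum(x[b] * q[c] * x[e] for x in X) for e in range(d)]
             for c in range(d)] for b in range(d)]

def linear_predictor(phi, W, V):
    """The same output as a linear map of phi with weights V[a][b] * W[c][e]."""
    d = len(W)
    return [sum(V[a][b] * W[c][e] * phi[b][c][e]
                for b in range(d) for c in range(d) for e in range(d))
            for a in range(len(V))]
```

Because `cubic_features` depends only on the data, fitting the coefficient tensor is an ordinary linear regression problem in d³ features per output coordinate, which is the kind of reduction the abstract describes.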

[AI-96] PromptGCN: Bridging Subgraph Gaps in Lightweight GCNs

Link: https://arxiv.org/abs/2410.10089
Authors: Shengwei Ji, Yujie Tian, Fei Liu, Xinlu Li, Le Wu
Keywords: Graph Convolutional Networks, Convolutional Networks, social networks, Graph Convolutional, Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Graph Convolutional Networks (GCNs) are widely used in graph-based applications, such as social networks and recommendation systems. Nevertheless, large-scale graphs or deep aggregation layers in full-batch GCNs consume significant GPU memory, causing out-of-memory (OOM) errors on mainstream GPUs (e.g., 29GB memory consumption on the ogbn-products graph with 5 layers). Subgraph sampling methods reduce memory consumption to achieve lightweight GCNs by partitioning the graph into multiple subgraphs and sequentially training GCNs on each subgraph. However, these methods yield gaps among subgraphs, i.e., GCNs can only be trained based on subgraphs instead of global graph information, which reduces the accuracy of GCNs. In this paper, we propose PromptGCN, a novel prompt-based lightweight GCN model to bridge the gaps among subgraphs. First, learnable prompt embeddings are designed to obtain global information. Then, the prompts are attached to each subgraph to transfer the global information among subgraphs. Extensive experimental results on seven large-scale graphs demonstrate that PromptGCN exhibits superior performance compared to baselines. Notably, PromptGCN improves the accuracy of subgraph sampling methods by up to 5.48% on the Flickr dataset. Overall, PromptGCN can be easily combined with any subgraph sampling method to obtain a lightweight GCN model with higher accuracy.

[AI-97] The Ingredients for Robotic Diffusion Transformers

Link: https://arxiv.org/abs/2410.10088
Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine
Keywords: recent years roboticists, achieved remarkable progress, leveraging high capacity, high capacity Transformer, capacity Transformer network
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: In recent years, roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high-capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study, and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language-annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that combine the efficiency of generative diffusion modeling with the scalability of large-scale transformer architectures. Code, robot dataset, and videos are available at: this https URL

[AI-98] Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

Link: https://arxiv.org/abs/2410.10083
Authors: Yifan Feng, Chengwu Yang, Xingliang Hou, Shaoyi Du, Shihui Ying, Zongze Wu, Yue Gao
Keywords: high-order correlations found, Existing benchmarks, NLGraph and GraphQA, graphs by focusing, correlations found
Categories: Artificial Intelligence (cs.AI)

Abstract:Existing benchmarks like NLGraph and GraphQA evaluate LLMs on graphs by focusing mainly on pairwise relationships, overlooking the high-order correlations found in real-world data. Hypergraphs, which can model complex beyond-pairwise relationships, offer a more robust framework but are still underexplored in the context of LLMs. To address this gap, we introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark’s effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. This work establishes a foundational testbed for integrating hypergraph computational capabilities into LLMs, advancing their comprehension.

[AI-99] VideoAgent: Self-Improving Video Generation

Link: https://arxiv.org/abs/2410.10076
Authors: Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang
Keywords: generated video plans, generated video, Video generation, Video, generate visual plans
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world.

[AI-100] Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

Link: https://arxiv.org/abs/2410.10074
Authors: Chengsong Huang, Langlin Huang, Jiaxin Huang
Keywords: Large Language Models, updating model parameters, Large Language, Language Models, In-Context Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leveraging task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregates the information by reweighting logits of each group via a non-gradient optimization approach. We further introduce Binary LARA (B-LARA), a variant that constrains weights to binary values to simplify the search space and reduces memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios of varying numbers of examples from limited to many-shot demonstrations.
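
The logit reweighting idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain weighted sum of per-group logit vectors is one reading of "reweighting logits", the logits and weights below are made up, and the non-gradient search that would choose the weights is not shown. The paper's exact formulation may differ.

```python
import math

def combine_logits(group_logits, weights):
    """Weighted sum of per-group logit vectors (one vector per demonstration group)."""
    vocab_size = len(group_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, group_logits))
        for i in range(vocab_size)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits over a 3-token vocabulary, one row per group.
group_logits = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [0.1, 3.0, 0.3],
]
real_weights = [0.5, 0.4, 0.1]   # real-valued weights (assumed values)
binary_weights = [1, 1, 0]       # binary variant: the third group is filtered out

probs = softmax(combine_logits(group_logits, real_weights))
probs_binary = softmax(combine_logits(group_logits, binary_weights))
print(probs, probs_binary)
```

With binary weights, a group whose weight is 0 never needs to be evaluated at inference time, which is consistent with the memory saving the abstract attributes to B-LARA.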

[AI-101] Ukrainian-to-English folktale corpus: Parallel corpus creation and augmentation for machine translation in low-resource languages

Link: https://arxiv.org/abs/2410.10063
Authors: Olena Burda-Lassen
Keywords: source language, linguistically very rich, rich and culturally, culturally significant, significant in understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Folktales are linguistically very rich and culturally significant in understanding the source language. Historically, only human translation has been used for translating folklore. Therefore, the number of translated texts is very sparse, which limits access to knowledge about cultural traditions and customs. We have created a new Ukrainian-To-English parallel corpus of familiar Ukrainian folktales based on available English translations and suggested several new ones. We offer a combined domain-specific approach to building and augmenting this corpus, considering the nature of the domain and differences in the purpose of human versus machine translation. Our corpus is word and sentence-aligned, allowing for the best curation of meaning, specifically tailored for use as training data for machine translation models.

[AI-102] Dreaming to Assist: Learning to Align with Human Objectives for Shared Control in High-Speed Racing

Link: https://arxiv.org/abs/2410.10062
Authors: Jonathan DeCastro, Andrew Silva, Deepak Gopinath, Emily Sumner, Thomas M. Balch, Laporsha Dees, Guy Rosman
Keywords: involving fast dynamics, Tight coordination, domains involving fast, coordination is required, required for effective
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to CoRL 2024, Munich, Germany

Abstract:Tight coordination is required for effective human-robot teams in domains involving fast dynamics and tactical decisions, such as multi-car racing. In such settings, robot teammates must react to cues of a human teammate’s tactical objective to assist in a way that is consistent with the objective (e.g., navigating left or right around an obstacle). To address this challenge, we present Dream2Assist, a framework that combines a rich world model able to infer human objectives and value functions, and an assistive agent that provides appropriate expert assistance to a given human teammate. Our approach builds on a recurrent state space model to explicitly infer human intents, enabling the assistive agent to select actions that align with the human and enabling a fluid teaming interaction. We demonstrate our approach in a high-speed racing domain with a population of synthetic human drivers pursuing mutually exclusive objectives, such as “stay-behind” and “overtake”. We show that the combined human-robot team, when blending its actions with those of the human, outperforms the synthetic humans alone as well as several baseline assistance strategies, and that intent-conditioning enables adherence to human preferences during task execution, leading to improved performance while satisfying the human’s objective.

[AI-103] The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers

Link: https://arxiv.org/abs/2410.10056
Authors: Qi Liu, Wanjing Ma
Keywords: Epochal Sawtooth Effect, recurring training loss, Epochal Sawtooth, training loss pattern, Sawtooth Effect
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 21 figures

Abstract:In this paper, we identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE), commonly observed during training with adaptive gradient-based optimizers, particularly the Adam optimizer. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, although less severely, with other optimizers such as RMSProp. We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect. The influences of factors like \(\beta\), batch size, and data shuffling on this pattern have been studied. We quantify the influence of \(\beta_2\) on the shape of the loss curve, showing that higher values of \(\beta_2\) result in a nearly linear increase in loss, while lower values create a concave upward trend. Our analysis reveals that this behavior stems from the adaptive learning rate controlled by the second moment estimate, with \(\beta_1\) playing a minimal role when \(\beta_2\) is large. To support our analysis, we replicate this phenomenon through a controlled quadratic minimization task. By incrementally solving a series of quadratic optimization problems using Adam, we demonstrate that the Epochal Sawtooth Effect can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. This paper provides both theoretical insights and quantitative analysis, offering a comprehensive understanding of this ubiquitous phenomenon in modern optimization techniques.
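
For readers who want to see how the second moment estimate scales Adam's step, here is a minimal scalar sketch of the standard Adam update with bias correction. It does not reproduce the sawtooth pattern or the paper's quadratic experiment; the gradient sequence, the helper names `adam_step` and `run`, and the chosen \(\beta_2\) values are assumptions made for illustration.

```python
import math

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on a scalar; returns (new_param, step_taken)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return param - step, step

def run(beta2, grads):
    """Apply Adam to a fixed gradient sequence and record the step sizes."""
    state = {"t": 0, "m": 0.0, "v": 0.0}
    x, steps = 0.0, []
    for g in grads:
        x, step = adam_step(x, g, state, beta2=beta2)
        steps.append(step)
    return steps

# Hypothetical gradient sequence: one large gradient followed by small ones.
grads = [10.0] + [0.1] * 50
slow = run(0.999, grads)  # long memory of the large squared gradient
fast = run(0.9, grads)    # short memory: the step size recovers sooner
print(slow[-1], fast[-1])
```

In this toy run the first step is close to the learning rate for both settings, while the later steps stay smaller for the larger \(\beta_2\), because the second moment estimate forgets the large early gradient more slowly.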

[AI-104] XAI-based Feature Selection for Improved Network Intrusion Detection Systems

Link: https://arxiv.org/abs/2410.10050
Authors: Osvaldo Arreche, Tanish Guntur, Mustafa Abdallah
Keywords: Explainability and evaluation, network security field, feature selection, intrusion detection systems, modern intrusion detection
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 24 pages, 4 figures

Abstract:Explainability and evaluation of AI models are crucial parts of the security of modern intrusion detection systems (IDS) in the network security field, yet they are lacking. Accordingly, feature selection is essential for such parts in IDS because it identifies the most paramount features, enhancing attack detection and its description. In this work, we tackle the feature selection problem for IDS by suggesting new ways of applying eXplainable AI (XAI) methods for this problem. We identify the crucial attributes originated by distinct AI methods in tandem with the novel five attribute selection methods. We then compare many state-of-the-art feature selection strategies with our XAI-based feature selection methods, showing that most AI models perform better when using the XAI-based approach proposed in this work. By providing novel feature selection techniques and establishing the foundation for several XAI-based strategies, this research aids security analysts in the AI decision-making reasoning of IDS by providing them with a better grasp of critical intrusion traits. Furthermore, we make the source codes available so that the community may develop additional models on top of our foundational XAI-based feature selection framework.

[AI-105] VQ-CNMP: Neuro-Symbolic Skill Learning for Bi-Level Planning

Link: https://arxiv.org/abs/2410.10045
Authors: Hakan Aktas, Emre Ugur
Keywords: unlabeled demonstration data, neural network model, network model capable, demonstration data, neural network
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, Submitted to Conference on Robot Learning LEAP Workshop 2024

Abstract:This paper proposes a novel neural network model capable of discovering high-level skill representations from unlabeled demonstration data. We also propose a bi-level planning pipeline that utilizes our model using a gradient-based planning approach. While extracting high-level representations, our model also preserves the low-level information, which can be used for low-level action planning. In the experiments, we tested the skill discovery performance of our model under different conditions, tested whether Multi-Modal LLMs can be utilized to label the learned high-level skill representations, and finally tested the high-level and low-level planning performance of our pipeline.

[AI-106] Are KAN Effective for Identifying and Tracking Concept Drift in Time Series?

Link: https://arxiv.org/abs/2410.10041
Authors: Kunpeng Xu, Lifei Chen, Shengrui Wang
Keywords: online activity logs, understanding complex systems, Dynamic concepts, financial markets, activity logs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Dynamic concepts in time series are crucial for understanding complex systems such as financial markets, healthcare, and online activity logs. These concepts help reveal structures and behaviors in sequential data for better decision-making and forecasting. Existing models struggle with detecting and tracking concept drift due to limitations in interpretability and adaptability. This paper introduces Kolmogorov-Arnold Networks (KAN) into time series and proposes WormKAN, a KAN-based auto-encoder to address concept drift in co-evolving time series. WormKAN integrates the KAN-SR module, in which the encoder, decoder, and self-representation layer are built on KAN, along with a temporal constraint to capture concept transitions. These transitions, akin to passing through a “wormhole”, are identified by abrupt changes in the latent space. Experiments show that KAN and KAN-based models (WormKAN) effectively segment time series into meaningful concepts, enhancing the identification and tracking of concept drifts.

[AI-107] A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

Link: https://arxiv.org/abs/2410.10030
Authors: Yun Joon Soh, Jishen Zhao
Keywords: models and Question-Answering, datasets emphasizes, explosion of open-sourced, open-sourced models, emphasizes the importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The explosion of open-sourced models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We studied the statistics of the existing evaluation metrics for a better understanding of their limitations. By measuring the correlation coefficients of each evaluation metric concerning human-like evaluation score, we observed the following: (1) existing metrics have a high correlation among them concerning the question type (e.g., single word, single phrase, etc.), (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could potentially improve the auto QA evaluator quality.
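
The kind of analysis the abstract describes, correlating each automatic metric with a human-like score, can be sketched as below. The Pearson coefficient is used here as an assumption (the abstract only says "correlation coefficients"), and all scores and metric names in the example are made up for illustration, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five QA answers (made up for illustration).
human_like = [1.0, 0.0, 1.0, 0.5, 0.0]
metric_scores = {
    "exact_match": [1.0, 0.0, 0.0, 0.0, 0.0],
    "token_f1": [1.0, 0.2, 0.6, 0.5, 0.1],
}

# Correlation of each automatic metric with the human-like score.
correlations = {
    name: pearson(scores, human_like) for name, scores in metric_scores.items()
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Repeating the same computation per question type (single word, single phrase, and so on) would give the per-type view the abstract mentions.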

[AI-108] Online Multi-modal Root Cause Analysis

Link: https://arxiv.org/abs/2410.10021
Authors: Lecheng Zheng, Zhengzhang Chen, Haifeng Chen, Jingrui He
Keywords: Root Cause Analysis, RCA methods, Traditional data-driven RCA, essential for pinpointing, failures in microservice
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.

[AI-109] Adaptive Reasoning and Acting in Medical Language Agents

Link: https://arxiv.org/abs/2410.10020
Authors: Abhishek Dutta, Yen-Che Hsiao
Keywords: enhancing diagnostic accuracy, simulated clinical environments, AgentClinic benchmark, innovative large language, paper presents
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper presents an innovative large language model (LLM) agent framework for enhancing diagnostic accuracy in simulated clinical environments using the AgentClinic benchmark. The proposed automatic correction enables doctor agents to iteratively refine their reasoning and actions following incorrect diagnoses, fostering improved decision-making over time. Experiments show that the implementation of the adaptive LLM-based doctor agents achieves correct diagnoses through dynamic interactions with simulated patients. The evaluations highlight the capacity of autonomous agents to adapt and improve in complex medical scenarios. Future enhancements will focus on refining the algorithm and expanding its applicability across a wider range of tasks and different large language models.

[AI-110] Improving accuracy and convergence of federated learning edge computing methods for generalized DER forecasting applications in power grid NEURIPS2022

Link: https://arxiv.org/abs/2410.10018
Authors: Vineet Jagadeesan Nair, Lucas Pereira
Keywords: distributed energy resources, accurate federated learning, lower communication requirements, faster convergence properties, low-carbon power grids
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments: Presented at the NeurIPS 2022 Tackling Climate Change with Machine Learning workshop

Abstract:This proposal aims to develop more accurate federated learning (FL) methods with faster convergence properties and lower communication requirements, specifically for forecasting distributed energy resources (DER) such as renewables, energy storage, and loads in modern, low-carbon power grids. This will be achieved by (i) leveraging recently developed extensions of FL such as hierarchical and iterative clustering to improve performance with non-IID data, (ii) experimenting with different types of FL global models well-suited to time-series data, and (iii) incorporating domain-specific knowledge from power systems to build more general FL frameworks and architectures that can be applied to diverse types of DERs beyond just load forecasting, and with heterogeneous clients.

[AI-111] Safety-Aware Fine-Tuning of Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.10014
Authors: Hyeong Kyu Choi, Xuefeng Du, Yixuan Li
Keywords: Large Language Models, Fine-tuning Large Language, Large Language, Language Models, tailoring models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop on Safe Generative AI

Abstract:Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.

[AI-112] Learning Interpretable Classifiers for PDDL Planning

Link: https://arxiv.org/abs/2410.10011
Authors: Arnaud Lequen
Keywords: synthesizing interpretable models, similar planning tasks, expressed in PDDL, planning tasks expressed, synthesizing interpretable
Categories: Artificial Intelligence (cs.AI)

Abstract:We consider the problem of synthesizing interpretable models that recognize the behaviour of an agent compared to other agents, on a whole set of similar planning tasks expressed in PDDL. Our approach consists in learning logical formulas, from a small set of examples that show how an agent solved small planning instances. These formulas are expressed in a version of First-Order Temporal Logic (FTL) tailored to our planning formalism. Such formulas are human-readable, serve as (partial) descriptions of an agent’s policy, and generalize to unseen instances. We show that learning such formulas is computationally intractable, as it is an NP-hard problem. As such, we propose to learn these behaviour classifiers through a topology-guided compilation to MaxSAT, which allows us to generate a wide range of different formulas. Experiments show that interesting and accurate formulas can be learned in reasonable time.

[AI-113] Leveraging Customer Feedback for Multi-modal Insight Extraction NAACL2024

Link: https://arxiv.org/abs/2410.09999
Authors: Sandeep Sricharan Mukku, Abinesh Kanagarajan, Pushpendu Ghosh, Chetan Aggarwal
Keywords: Businesses can benefit, customer feedback, products and services, enhance their products, multi-modal customer feedback
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: NAACL 2024

Abstract:Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by 14 points in F1 score.

[AI-114] SlimSeiz: Efficient Channel-Adaptive Seizure Prediction Using a Mamba-Enhanced Network

Link: https://arxiv.org/abs/2410.09998
Authors: Guorui Lu, Jing Peng, Bingyuan Huang, Chang Gao, Todor Stefanov, Yong Hao, Qinyu Chen
Keywords: abnormal brain activity, Epileptic seizures, brain activity, lead to accidents, abnormal brain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Abstract:Epileptic seizures cause abnormal brain activity, and their unpredictability can lead to accidents, underscoring the need for long-term seizure prediction. Although seizures can be predicted by analyzing electroencephalogram (EEG) signals, existing methods often require too many electrode channels or larger models, limiting mobile usability. This paper introduces a SlimSeiz framework that utilizes adaptive channel selection with a lightweight neural network model. SlimSeiz operates in two stages: the first stage selects the optimal channel set for seizure prediction using machine learning algorithms, and the second stage employs a lightweight neural network based on convolution and Mamba for prediction. On the Children’s Hospital Boston-MIT (CHB-MIT) EEG dataset, SlimSeiz can reduce channels from 22 to 8 while achieving a satisfactory result of 94.8% accuracy, 95.5% sensitivity, and 94.0% specificity with only 21.2K model parameters, matching or outperforming larger models’ performance. We also validate SlimSeiz on a new EEG dataset, SRH-LEI, collected from Shanghai Renji Hospital, demonstrating its effectiveness across different patients. The code and SRH-LEI dataset are available at this https URL.

[AI-115] Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

Link: https://arxiv.org/abs/2410.09997
Authors: Nan Jiang, Qi Li, Lin Tan, Tianyi Zhang
Keywords: large language models, face the critical, generating plausible, code, hallucinations
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Despite their success, large language models (LLMs) face the critical challenge of hallucinations, generating plausible but incorrect content. While much research has focused on hallucinations in multiple modalities including images and natural language text, less attention has been given to hallucinations in source code, which leads to incorrect and vulnerable code that causes significant financial loss. To pave the way for research in LLMs’ hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features such as the per-step log probabilities of LLMs’ output, token types, and the execution feedback of LLMs’ generated code for in-depth analysis. In addition, we conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieves 22.03 – 33.15% accuracy. Our experiments draw insightful findings of code hallucination patterns, reveal the challenge of accurately localizing LLMs’ hallucinations, and highlight the need for more sophisticated techniques.

[AI-116] MARS: Multilingual Aspect-centric Review Summarisation EMNLP2024

Link: https://arxiv.org/abs/2410.09991
Authors: Sandeep Sricharan Mukku, Abinesh Kanagarajan, Chetan Aggarwal, Promod Yenigalla
Keywords: Summarizing customer feedback, provide actionable insights, Summarizing customer, insights for products, services at scale
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024

Abstract:Summarizing customer feedback to provide actionable insights for products/services at scale is an important problem for businesses across industries. Lately, review volumes have been increasing across regions and languages, so the challenge of aggregating and understanding customer sentiment across multiple languages becomes increasingly vital. In this paper, we propose a novel framework involving a two-step paradigm, Extract-then-Summarise, namely MARS, to revolutionise traditions and address the domain agnostic aspect-level multilingual review summarisation. Extensive automatic and human evaluation shows that our approach brings substantial improvements over abstractive baselines and efficiency to real-time systems.

[AI-117] HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Link: https://arxiv.org/abs/2410.09988
Authors: Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner
Keywords: Large Language Model, existing Large Language, Large Language, Language Model, applied mathematics problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code and the HARDMath dataset are available at this https URL

Abstract:Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insights into the failure cases of LLMs. These results demonstrate limitations of current LLM performance on advanced graduate-level applied math problems and underscore the importance of datasets like HARDMath to advance mathematical abilities of LLMs.

[AI-118] Facial Width-to-Height Ratio Does Not Predict Self-Reported Behavioral Tendencies

Link: https://arxiv.org/abs/2410.09979
Authors: Michal Kosinski
Keywords: violent behavioral tendencies, growing number, antisocial or violent, behavioral tendencies, violent behavioral
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Psychological Science (2017)

Abstract:A growing number of studies have linked facial width-to-height ratio (fWHR) with various antisocial or violent behavioral tendencies. However, those studies have predominantly been laboratory based and low powered. This work reexamined the links between fWHR and behavioral tendencies in a large sample of 137,163 participants. Behavioral tendencies were measured using 55 well-established psychometric scales, including self-report scales measuring intelligence, domains and facets of the five-factor model of personality, impulsiveness, sense of fairness, sensational interests, self-monitoring, impression management, and satisfaction with life. The findings revealed that fWHR is not substantially linked with any of these self-reported measures of behavioral tendencies, calling into question whether the links between fWHR and behavior generalize beyond the small samples and specific experimental settings that have been used in past fWHR research.

[AI-119] Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions

Link: https://arxiv.org/abs/2410.09972
Authors: Kyungmin Kim, JB Lanier, Pierre Baldi, Charless Fowlkes, Roy Fox
Keywords: Model-Based Reinforcement Learning, Recent advancements, Model-Based Reinforcement, Reinforcement Learning, advancements in Model-Based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Recent advancements in Model-Based Reinforcement Learning (MBRL) have made it a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to train MBRL agents with generalizable perception. Training in the presence of visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on DREAMER, a popular MBRL method, we propose a simple yet effective auxiliary task to facilitate representation learning in distracting environments. Under the assumption that task-relevant components of image observations are straightforward to identify with prior knowledge in a given task, we use a segmentation mask on image observations to only reconstruct task-relevant components. In doing so, we greatly reduce the complexity of representation learning by removing the need to encode task-irrelevant objects in the latent representation. Our method, Segmentation Dreamer (SD), can be used either with ground-truth masks easily accessible in simulation or by leveraging potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the reconstruction loss to avoid providing misleading learning signals due to mask prediction errors. In modified DeepMind Control suite (DMC) and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work. We find that SD is especially helpful in sparse reward tasks otherwise unsolvable by prior work, enabling the training of visually robust agents without the need for extensive reward engineering.
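
The core idea of reconstructing only task-relevant pixels can be illustrated with a small sketch. This is one simple reading of the abstract, not the paper's implementation: the segmentation mask zeroes out distractor pixels to form the reconstruction target, and a plain mean squared error is computed against that target. The helper names, the tiny 2x2 "image", and the choice of zero as the fill value are all assumptions.

```python
def apply_mask(image, mask):
    """Keep pixels where mask is 1 and zero out the rest (task-irrelevant pixels)."""
    return [
        [pixel if keep else 0.0 for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    diffs = [
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)

# Hypothetical 2x2 observation: left column is the robot, right column is a distractor.
observation = [[1.0, 0.9], [1.0, 0.3]]
mask = [[1, 0], [1, 0]]                   # 1 marks task-relevant pixels
prediction = [[0.5, 0.0], [1.0, 0.0]]     # decoder output that ignores the distractor

target = apply_mask(observation, mask)
loss_masked = mse(prediction, target)     # penalizes only errors on the robot pixels
loss_full = mse(prediction, observation)  # would also penalize ignoring the distractor
print(loss_masked, loss_full)
```

Against the masked target, a decoder that ignores the distractor is not penalized for doing so, which is the reduction in representational burden the abstract describes.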

[AI-120] Improving 3D Few-Shot Segmentation with Inference-Time Pseudo-Labeling

Link: https://arxiv.org/abs/2410.09967
Authors: Mohammad Mozafari, Hosein Hasani, Reza Vahidimajd, Mohamadreza Fereydooni, Mahdieh Soleymani Baghshah
Keywords-EN: offering remarkable adaptability, limited annotated data, medical imaging analysis, recent years, models have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, few-shot segmentation (FSS) models have emerged as a promising approach in medical imaging analysis, offering remarkable adaptability to segment novel classes with limited annotated data. Existing approaches to few-shot segmentation have often overlooked the potential of the query itself, failing to fully utilize the valuable information it contains. However, treating the query as unlabeled data provides an opportunity to enhance prediction accuracy. Specifically in the domain of medical imaging, the volumetric structure of queries offers a considerable source of valuable information that can be used to improve the target slice segmentation. In this work, we present a novel strategy to efficiently leverage the intrinsic information of the query sample for final segmentation during inference. First, we use the support slices from a reference volume to generate an initial segmentation score for the query slices through a prototypical approach. Subsequently, we apply a confidence-aware pseudo-labeling procedure to transfer the most informative parts of query slices to the support set. The final prediction is performed based on the new expanded support set, enabling the prediction of a more accurate segmentation mask for the query volume. Extensive experiments show that the proposed method can effectively boost performance across diverse settings and datasets.
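
The three steps in the abstract (prototype-based scoring, confidence-aware pseudo-labeling, prediction from the expanded support set) can be sketched roughly as below. The feature vectors, the cosine-difference score, and the `margin` threshold are assumptions chosen for illustration; the paper's actual scoring and selection rules may differ.

```python
import math


def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def foreground_score(feat, fg_proto, bg_proto):
    # Positive when the feature is closer to the foreground prototype.
    return cosine(feat, fg_proto) - cosine(feat, bg_proto)


def segment_with_pseudo_labels(support_feats, support_labels, query_feats, margin=0.5):
    fg = [f for f, l in zip(support_feats, support_labels) if l == 1]
    bg = [f for f, l in zip(support_feats, support_labels) if l == 0]
    # Step 1: initial scores from support-only prototypes.
    fg_proto, bg_proto = mean_vec(fg), mean_vec(bg)
    scores = [foreground_score(f, fg_proto, bg_proto) for f in query_feats]
    # Step 2: move only confidently scored query features into the support set.
    for feat, score in zip(query_feats, scores):
        if score >= margin:
            fg.append(feat)
        elif score <= -margin:
            bg.append(feat)
    # Step 3: final prediction from the expanded support set.
    fg_proto, bg_proto = mean_vec(fg), mean_vec(bg)
    return [1 if foreground_score(f, fg_proto, bg_proto) > 0 else 0 for f in query_feats]


support = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, 0]
query = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]
prediction = segment_with_pseudo_labels(support, labels, query)
```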

[AI-121] Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing

Link: https://arxiv.org/abs/2410.09964
Authors: Muhammad Umar, Muhammad Asif, Arif Mahmood
Keywords-EN: Single-cell RNA sequencing, single cell level, Single-cell RNA, RNA sequencing, enables the study
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments:

Abstract:Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine learning, and deep learning-based methods have been proposed for cell-type classification. Most of these methods use unsupervised lower-dimensional projections obtained from a large reference dataset. In this work, we propose a reference-based method for cell-type classification called EnProCell. EnProCell first computes lower-dimensional projections that capture both high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed existing state-of-the-art methods when tested on four different datasets produced by different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require additional computational resources or time. EnProCell is available at this https URL.

[AI-122] EITNet: An IoT-Enhanced Framework for Real-Time Basketball Action Recognition

Link: https://arxiv.org/abs/2410.09954
Authors: Jingyu Liu, Xinyu Liu, Mingzhe Qu, Tianyi Lyu
Keywords-EN: Integrating IoT technology, basketball action recognition, Integrating IoT, providing crucial insights, basketball action
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: pages

Abstract:Integrating IoT technology into basketball action recognition enhances sports analytics, providing crucial insights into player performance and game strategy. However, existing methods often fall short in terms of accuracy and efficiency, particularly in complex, real-time environments where player movements are frequently occluded or involve intricate interactions. To overcome these challenges, we propose the EITNet model, a deep learning framework that combines EfficientDet for object detection, I3D for spatiotemporal feature extraction, and TimeSformer for temporal analysis, all integrated with IoT technology for seamless real-time data collection and processing. Our contributions include developing a robust architecture that improves recognition accuracy to 92%, surpassing the baseline EfficientDet model’s 87%, and reducing loss to below 5.0 compared to EfficientDet’s 9.0 over 50 epochs. Furthermore, the integration of IoT technology enhances real-time data processing, providing adaptive insights into player performance and strategy. The paper details the design and implementation of EITNet, experimental validation, and a comprehensive evaluation against existing models. The results demonstrate EITNet’s potential to significantly advance automated sports analysis and optimize data utilization for player performance and strategy improvement.

[AI-123] State of NLP in Kenya: A Survey

Link: https://arxiv.org/abs/2410.09948
Authors: Cynthia Jayne Amol, Everlyn Asiko Chimoto, Rose Delilah Gesicho, Antony M. Gitau, Naome A. Etori, Caringtone Kinyanjui, Steven Ndung’u, Lawrence Moruye, Samson Otieno Ooko, Kavengi Kitonga, Brian Muhia, Catherine Gitau, Antony Ndolo, Lilian D. A. Wanzare, Albert Njoroge Kahira, Ronald Tombe
Keywords-EN: Natural Language Processing, advancing Natural Language, faces unique challenges, advancing Natural, underrepresented indigenous languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local dialects such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. This paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models, most notably the need for large-scale language models and the insufficient digital representation of indigenous languages. We also analyze key NLP applications (machine translation, information retrieval, and sentiment analysis), examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya’s diverse linguistic demands.

[AI-124] Generalized Group Data Attribution

Link: https://arxiv.org/abs/2410.09940
Authors: Dan Ley, Shichang Zhang, Suraj Srinivas, Gili Rusak, Himabindu Lakkaraju
Keywords-EN: Generalized Group Data, Group Data Attribution, data selection, Data Attribution, individual training data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in up to 10x-50x speedups over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, we demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.
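
To make "attributing to groups of training points instead of individual ones" concrete, here is a small sketch using a gradient dot-product score, a simplified stand-in in the spirit of TracIn rather than the paper's procedure. The gradient vectors, group names, and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def group_attributions(train_grads, groups, test_grad):
    """Score each group by the dot product of its summed gradient with the test gradient.

    train_grads: per-example gradient vectors (assumed precomputed).
    groups: mapping from group id to the indices of its members.
    """
    scores = {}
    dim = len(test_grad)
    for gid, members in groups.items():
        # One aggregated gradient per group means one score per group
        # instead of one per training point.
        group_grad = [sum(train_grads[i][d] for i in members) for d in range(dim)]
        scores[gid] = dot(group_grad, test_grad)
    return scores


train_grads = [[1, 0], [0, 2], [1, 1], [-1, 0]]
groups = {"a": [0, 1], "b": [2, 3]}
test_grad = [1, 1]
scores = group_attributions(train_grads, groups, test_grad)
```

For this particular linear score, a group's attribution equals the sum of its members' individual scores. The efficiency gain in practice would come from computing the group gradient in one batched pass, which this sketch does not model.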

[AI-125] M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models

Link: https://arxiv.org/abs/2410.09928
Authors: Megha Sharma, Muhammad Taimoor Haseeb, Gus Xia, Yoshimasa Tsuruoka
Keywords-EN: multi modal framework, tailored to Japanese, Japanese manga, paper introduces, multi modal
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This paper introduces M2M-Gen, a multimodal framework for generating background music tailored to Japanese manga. The key challenge in this task is the lack of an available dataset or baseline. To address this, we propose an automated music generation pipeline that produces background music for an input manga book. First, we use the dialogues in a manga to detect scene boundaries and perform emotion classification using the characters' faces within a scene. Then, we use GPT-4o to translate this low-level scene information into a high-level music directive. Conditioned on the scene information and the music directive, another instance of GPT-4o generates page-level music captions to guide a text-to-music model. This produces music that is aligned with the manga's evolving narrative. The effectiveness of M2M-Gen is confirmed through extensive subjective evaluations, showcasing its capability to generate higher-quality, more relevant, and more consistent music that complements specific scenes when compared to our baselines.

[AI-126] Analysis and Design of a Personalized Recommendation System Based on a Dynamic User Interest Model

Link: https://arxiv.org/abs/2410.09923
Authors: Chunyan Mao, Shuaishuai Huang, Mingxiu Sui, Haowei Yang, Xueshe Wang
Keywords-EN: important research topic, explosion of information, user interest model, rapid development, accurate personalized recommendations
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of the internet and the explosion of information, providing users with accurate personalized recommendations has become an important research topic. This paper designs and analyzes a personalized recommendation system based on a dynamic user interest model. The system captures user behavior data, constructs a dynamic user interest model, and combines multiple recommendation algorithms to provide personalized content to users. The research results show that this system significantly improves recommendation accuracy and user satisfaction. This paper discusses the system’s architecture design, algorithm implementation, and experimental results in detail and explores future research directions.

[AI-127] Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

Link: https://arxiv.org/abs/2410.09918
Authors: DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng
Keywords-EN: human cognition theory, intuitive System, deliberative System, System, human cognition
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 processes into Transformers, including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present Dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. Dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (fast mode) or both the reasoning chain and the final solution (slow mode), or automatically decide which mode to engage (auto mode). In all cases, Dualformer outperforms the corresponding baseline models in both performance and computational efficiency: (1) in slow mode, Dualformer optimally solves unseen 30 x 30 maze navigation tasks 97.6% of the time, surpassing the Searchformer (trained on data with complete reasoning traces) baseline performance of 93.3%, while using 45.5% fewer reasoning steps; (2) in fast mode, Dualformer completes those tasks with an 80% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, showing their generalization beyond task-specific models.
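
Training on "randomized reasoning traces, where different parts of the traces are dropped" could look roughly like the sketch below. The trace format (`(kind, value)` tuples), the four strategies, and the drop probability are hypothetical choices for illustration; the abstract does not specify the actual dropping strategies.

```python
import random


def drop_trace(trace, rng, p_drop=0.3):
    """Return a randomly thinned copy of a reasoning trace.

    trace: list of (kind, value) steps; the kinds used here are made up.
    rng: a random.Random instance, passed in for reproducibility.
    """
    strategy = rng.choice(["keep_all", "drop_kind", "drop_random", "drop_all"])
    if strategy == "keep_all":
        return list(trace)  # full trace, as in slow-mode training data
    if strategy == "drop_all":
        return []  # solution only, as in fast-mode training data
    if strategy == "drop_kind":
        # Structured dropping: remove every step of one kind.
        return [step for step in trace if step[0] != "close"]
    # Unstructured dropping: remove each step independently.
    return [step for step in trace if rng.random() >= p_drop]


def make_example(prompt, trace, solution, rng):
    # The solution is never dropped; only the reasoning trace is thinned.
    return {"prompt": prompt, "trace": drop_trace(trace, rng), "solution": solution}


trace = [("create", 1), ("close", 1), ("create", 2), ("close", 2)]
example = make_example("maze", trace, ["right", "down"], random.Random(0))
```

Because the same model sees full, partial, and empty traces for the same kind of problem, it can be prompted at inference time to emit a trace or skip it.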

[AI-128] Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning

Link: https://arxiv.org/abs/2410.09908
Authors: Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Keywords-EN: Foundation models, techniques like Low-Rank, Foundation, RPE, Retrieval-based Parameter Ensemble
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
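
One way to picture a "vectorized database of LoRAs" with retrieval and ensembling is the sketch below. It assumes each stored adapter has a key vector and a flattened parameter vector, and it weights the top-k matches by a softmax over cosine similarity. The key representation, weighting rule, and temperature are assumptions for illustration, not details taken from the paper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_and_ensemble(query, database, k=2, temperature=0.1):
    """Blend the k adapters whose keys are most similar to the query.

    database: list of {"key": [...], "delta": [...]} entries (hypothetical schema).
    Returns the merged parameter vector and the weights used.
    """
    ranked = sorted(database, key=lambda e: cosine(query, e["key"]), reverse=True)[:k]
    sims = [cosine(query, e["key"]) for e in ranked]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(ranked[0]["delta"])
    merged = [sum(w * e["delta"][d] for w, e in zip(weights, ranked)) for d in range(dim)]
    return merged, weights


database = [
    {"key": [1.0, 0.0], "delta": [1.0, 1.0]},
    {"key": [0.0, 1.0], "delta": [3.0, 3.0]},
    {"key": [-1.0, 0.0], "delta": [100.0, 100.0]},
]
merged, weights = retrieve_and_ensemble([1.0, 1.0], database)
```

No labeled data or gradient step is involved: the new task only needs a query vector, which is what makes this style of adaptation a fit for zero-shot settings.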

[AI-129] Equitable Access to Justice: Logical LLMs Show Promise

Link: https://arxiv.org/abs/2410.09904
Authors: Manuj Kant, Manav Kant, Marzieh Nabi, Preston Carlson, Megan Ma
Keywords-EN: American judicial system, judicial system limit, American judicial, system limit access, costs and complexity
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments:

Abstract:The costs and complexity of the American judicial system limit access to legal solutions for many Americans. Large language models (LLMs) hold great potential to improve access to justice. However, a major challenge in applying AI and LLMs in legal contexts, where consistency and reliability are crucial, is the need for System 2 reasoning. In this paper, we explore the integration of LLMs with logic programming to enhance their ability to reason, bringing their strategic capabilities closer to that of a skilled lawyer. Our objective is to translate laws and contracts into logic programs that can be applied to specific legal cases, with a focus on insurance contracts. We demonstrate that while GPT-4o fails to encode a simple health insurance contract into logical code, the recently released OpenAI o1-preview model succeeds, exemplifying how LLMs with advanced System 2 reasoning capabilities can expand access to justice.
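
As a rough illustration of what "translating contracts into logic programs that can be applied to specific cases" means, the sketch below encodes an invented three-clause policy as explicit rules. The clauses, field names, and dates are entirely hypothetical, and Python is used here only for consistency with the other examples; the paper itself targets logic programming.

```python
def claim_is_covered(claim, policy):
    """Evaluate a claim against a toy policy expressed as conjunctive rules.

    Each rule mirrors one (invented) contract clause, so a denial can be
    traced back to the specific clause that failed.
    """
    rules = {
        # ISO-format date strings compare correctly as plain strings.
        "within_policy_period": policy["start"] <= claim["date"] <= policy["end"],
        "procedure_not_excluded": claim["procedure"] not in policy["exclusions"],
        "premium_paid": policy["premium_paid"],
    }
    failed = [name for name, ok in rules.items() if not ok]
    return len(failed) == 0, failed


policy = {
    "start": "2024-01-01",
    "end": "2024-12-31",
    "exclusions": {"cosmetic surgery"},
    "premium_paid": True,
}
ok, reasons = claim_is_covered({"date": "2024-06-15", "procedure": "appendectomy"}, policy)
denied, why = claim_is_covered({"date": "2025-02-01", "procedure": "cosmetic surgery"}, policy)
```

The appeal of this style for legal reasoning is consistency: the same facts always yield the same answer, along with the list of clauses that drove it.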

[AI-130] Large-Scale 3D Medical Image Pre-training with Geometric Context Priors CVPR2024

Link: https://arxiv.org/abs/2410.09890
Authors: Linshan Wu, Jiaxin Zhuang, Hao Chen
Keywords-EN: medical image analysis, poses a significant, medical, image analysis, medical images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2024 Extension

Abstract:The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Codes at this https URL.
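
The position-prediction step ("predict the contextual position of a random crop by contrasting its similarity to the base crops") can be sketched as below. The embeddings are assumed to come from some encoder not shown here, and using overlap fractions as regression targets is an assumption made for this illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def position_scores(random_emb, base_embs):
    # One non-negative similarity per base crop; higher means the random
    # crop is predicted to lie closer to that region.
    return [max(0.0, cosine(random_emb, b)) for b in base_embs]


def position_loss(scores, overlaps):
    # Hypothetical target: the fraction of the random crop that overlaps
    # each base crop's region.
    return sum(abs(s - o) for s, o in zip(scores, overlaps)) / len(scores)


base_embs = [[1.0, 0.0], [0.0, 1.0]]
scores = position_scores([1.0, 0.0], base_embs)
loss = position_loss(scores, [1.0, 0.0])
predicted_region = max(range(len(scores)), key=lambda i: scores[i])
```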

[AI-131] ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

Link: https://arxiv.org/abs/2410.09870
Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Donghyeon Lee, Minbyul Jeong, Jaewoo Kang
Keywords-EN: Large language models, Large language, knowledge, significantly impacted, chronological knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have significantly impacted many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single time stamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs’ non-parametric chronological knowledge. Our evaluation shows: (1) The ability to elicit temporal knowledge varies depending on the data format the model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting strategy that elicits chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), demonstrating its effectiveness in refining temporal knowledge. This non-parametric approach also enables knowledge updates not only in open-source models but also in proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.

[AI-132] Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset INTERSPEECH2024

Link: https://arxiv.org/abs/2410.09869
Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa
Keywords-EN: audio deepfake detection, deepfake detection, study test-time domain, test-time domain adaptation, study test-time
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2024. Hideyuki Oiso and Yuto Matsunaga contributed equally

Abstract:We study test-time domain adaptation for audio deepfake detection (ADD), addressing three challenges: (i) source-target domain gaps, (ii) limited target dataset size, and (iii) high computational costs. We propose an ADD method using prompt tuning in a plug-in style. It bridges domain gaps by integrating it seamlessly with state-of-the-art transformer models and/or with other fine-tuning methods, boosting their performance on target data (challenge (i)). In addition, our method can fit small target datasets because it does not require a large number of extra parameters (challenge (ii)). This feature also contributes to computational efficiency, countering the high computational costs typically associated with large-scale pre-trained models in ADD (challenge (iii)). We conclude that prompt tuning for ADD under domain gaps presents a promising avenue for enhancing accuracy with minimal target data and negligible extra computational burden.

[AI-133] Uncovering Explaining and Mitigating the Superficial Safety of Backdoor Defense NEURIPS2024

Link: https://arxiv.org/abs/2410.09838
Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng
Keywords-EN: Deep Neural Networks, Neural Networks, Deep Neural, threat to Deep, Attack Success Rates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: NeurIPS 2024 Spotlight paper. The first two authors contributed equally

Abstract:Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA), which can effectively reactivate the backdoor by merely querying purified models. We find that the failure to achieve satisfactory post-tuning robustness stems from the insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve the post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining good clean accuracy and low ASR. Our work provides a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing the model’s safety.

[AI-134] LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond ACCV2024

Link: https://arxiv.org/abs/2410.09831
Authors: Md Tanvir Islam, Inzamamul Alam, Simon S. Woo, Saeed Anwar, IK Hyun Lee, Khan Muhammad
Keywords-EN: computer vision tasks, numerous computer vision, LLIE, essential for numerous, numerous computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted by the Asian Conference on Computer Vision (ACCV 2024)

Abstract:Low-light image enhancement (LLIE) is essential for numerous computer vision tasks, including object detection, tracking, segmentation, and scene understanding. Despite substantial research on improving low-quality images captured in underexposed conditions, clear vision remains critical for autonomous vehicles, which often struggle with low-light scenarios, signifying the need for continuous research. However, paired datasets for LLIE are scarce, particularly for street scenes, limiting the development of robust LLIE methods. Despite using advanced transformers and/or diffusion-based models, current LLIE methods struggle in real-world low-light conditions and lack training on street-scene datasets, limiting their effectiveness for autonomous vehicles. To bridge these gaps, we introduce a new dataset, LoLI-Street (Low-Light Images of Streets), with 33k paired low-light and well-exposed images from street scenes in developed cities, covering 19k object classes for object detection. The LoLI-Street dataset also features 1,000 real low-light test images for testing LLIE models under real-life conditions. Furthermore, we propose a transformer and diffusion-based LLIE model named “TriFuse”. Leveraging the LoLI-Street dataset, we train and evaluate our TriFuse and SOTA models to benchmark on our dataset. Comparing various models, our dataset’s generalization feasibility is evident in testing across different mainstream datasets by significantly enhancing images and object detection for practical applications in autonomous driving and surveillance systems. The complete code and dataset are available at this https URL.

[AI-135] Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Link: https://arxiv.org/abs/2410.09807
Authors: Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho
Keywords-EN: Aspect-based sentiment analysis, Aspect-based sentiment, sentiment analysis, extracting sentiment, aspect and opinion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall’s Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs’ capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.
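
The effect of allowing several valid surface forms per gold term can be shown with a small scoring sketch. The normalization rule, the F1 formulation, and the example terms are assumptions for illustration; the paper's pipeline for generating the alternatives is not reproduced here.

```python
def normalize(term):
    # Lowercase and collapse whitespace so trivial differences do not count.
    return " ".join(term.lower().split())


def f1_with_alternatives(predictions, gold_sets):
    """F1 where each gold item is a set of acceptable surface forms.

    A prediction is correct if it matches any form of a gold item
    that has not already been matched.
    """
    matched = set()
    true_positives = 0
    for pred in predictions:
        p = normalize(pred)
        for i, alternatives in enumerate(gold_sets):
            if i not in matched and p in {normalize(a) for a in alternatives}:
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(gold_sets) if gold_sets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predictions = ["battery", "the screen"]
single_truth = [{"battery life"}, {"screen"}]
augmented = [{"battery life", "battery"}, {"screen", "the screen"}]
strict_f1 = f1_with_alternatives(predictions, single_truth)
lenient_f1 = f1_with_alternatives(predictions, augmented)
```

With a single ground truth, both predictions count as errors even though a human would accept them. With the augmented sets, both are credited.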

[AI-136] BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Link: https://arxiv.org/abs/2410.09804
Authors: Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang
Keywords-EN: encounter potential security, potential security risks, bypass security measures, large language models, exhibit remarkable capabilities
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.
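
Pareto dominance, the selection criterion at the core of NSGA-II, can be illustrated generically as below. This sketch covers only the textbook dominance check over abstract score tuples (all objectives maximized); it is not the paper's framework, and the numbers are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Indices of points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Each tuple holds two hypothetical objective scores for one candidate.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(candidates)
```

The third candidate is dropped because the second is better on both objectives. The remaining three represent different trade-offs, which is why a multi-objective method returns a set rather than a single winner.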

[AI-137] EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models ECCV2024

Link: https://arxiv.org/abs/2410.09802
Authors: Eungbean Lee, Somi Jeong, Kwanghoon Sohn
Keywords: Exemplar-guided image translation, attracting attention due, enhance user control, synthesizing photo-realistic images, Exemplar-guided image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: ECCV 2024

Abstract:Exemplar-guided image translation, synthesizing photo-realistic images that conform to both structural control and style exemplars, is attracting attention due to its ability to enhance user control over style manipulation. Previous methodologies have predominantly depended on establishing dense correspondences across cross-domain inputs. Despite these efforts, they incur quadratic memory and computational costs for establishing dense correspondence, resulting in limited versatility and performance degradation. In this paper, we propose a novel approach termed Exemplar-guided Image Translation with Brownian-Bridge Diffusion Models (EBDM). Our method formulates the task as a stochastic Brownian bridge process, a diffusion process with a fixed initial point as structure control and translates into the corresponding photo-realistic image while being conditioned solely on the given exemplar image. To efficiently guide the diffusion process toward the style of exemplar, we delineate three pivotal components: the Global Encoder, the Exemplar Network, and the Exemplar Attention Module to incorporate global and detailed texture information from exemplar images. Leveraging Bridge diffusion, the network can translate images from structure control while exclusively conditioned on the exemplar style, leading to more robust training and inference processes. We illustrate the superiority of our method over competing approaches through comprehensive benchmark evaluations and visual results.
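
The Brownian bridge mentioned in the abstract can be illustrated with a minimal sketch. It samples the marginal of a textbook Brownian bridge pinned at two endpoints. The function name, the argument names, and the variance schedule `t * (T - t) / T` are assumptions made for illustration; the paper's actual noise schedule and conditioning on the exemplar are not described here and may differ.

```python
import math
import random


def brownian_bridge_sample(x_start, x_end, t, total_time=1.0, rng=None):
    """Sample the state at time t of a textbook Brownian bridge.

    The bridge is pinned at x_start (t = 0) and x_end (t = total_time).
    Hypothetical helper: the variance schedule is the classical one,
    not necessarily the one used by EBDM.
    """
    if not 0.0 <= t <= total_time:
        raise ValueError("t must lie in [0, total_time]")
    if rng is None:
        rng = random.Random()
    m = t / total_time                                   # interpolation weight
    std = math.sqrt(t * (total_time - t) / total_time)   # zero at both ends
    return [(1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x_start, x_end)]
```

Because the noise term vanishes at both endpoints, the process starts exactly at one fixed point and ends exactly at the other, which is the property the abstract relies on when it treats the structure control as a fixed endpoint.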

[AI-138] Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning

Link: https://arxiv.org/abs/2410.09780
Authors: Gisang Lee, Sangwoo Park, Junyoung Park, Andrew Chung, Sieun Park, Yoonah Park, Byungju Kim, Min-gyu Cho
Keywords: Large Language Models, Large Language, Language Models, exhibited remarkable capabilities, complex tasks including
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 6 pages, 4 figures

Abstract:Large Language Models (LLMs) have exhibited remarkable capabilities in many complex tasks including mathematical reasoning. However, traditional approaches heavily rely on ensuring self-consistency within a single prompting method, which limits the exploration of diverse problem-solving strategies. This study addresses these limitations by performing an experimental analysis of distinct prompting methods within the domain of mathematical reasoning. Our findings demonstrate that each method explores a distinct search space, and this differentiation becomes more evident with increasing problem complexity. To leverage this phenomenon, we applied an efficient sampling process that uniformly combines samples from these diverse methods, which not only expands the maximum search space but also achieves higher performance with fewer runs compared to single methods. In particular, on MATH-hard, the subset of difficult questions in the MATH dataset, the maximum search space was achieved while utilizing approximately 43% fewer runs than single methods on average. These findings highlight the importance of integrating diverse problem-solving strategies to enhance the reasoning abilities of LLMs.
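
One plausible reading of "uniformly combines samples" is a round-robin schedule that spreads a fixed run budget evenly across prompting methods. The sketch below assumes that reading; the function names and the `samplers` mapping (method name to a callable that returns one answer) are hypothetical and are not taken from the paper.

```python
from itertools import cycle, islice


def uniform_schedule(method_names, budget):
    """Return the order of method calls: round-robin until the budget is spent."""
    if not method_names:
        raise ValueError("need at least one prompting method")
    return list(islice(cycle(method_names), budget))


def collect_answers(samplers, question, budget):
    """Draw `budget` samples, spread evenly across the prompting methods.

    `samplers` maps a (hypothetical) method name to a callable that takes
    the question and returns one sampled answer.
    """
    answers = []
    for name in uniform_schedule(list(samplers), budget):
        answers.append((name, samplers[name](question)))
    return answers
```

With three methods and a budget of five runs, the first two methods are called twice and the third once, so no single method's search space dominates the pool of answers.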

[AI-139] EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

Link: https://arxiv.org/abs/2410.09775
Authors: Yijie Li, Yuan Sun
Keywords: growing trend, judge the quality, employing large language, Recently, model
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Recently, there has been a growing trend of employing large language models (LLMs) to judge the quality of other LLMs. Many studies have adopted closed-source models, mainly using GPT-4 as the evaluator. However, due to the closed-source nature of the GPT-4 model, employing it as an evaluator has resulted in issues including transparency, controllability, and cost-effectiveness. Some researchers have turned to using fine-tuned open-source LLMs as evaluators. However, existing open-source evaluation LLMs generally lack a user-friendly visualization tool, and they have not been optimized for accelerated model inference, which causes inconvenience for researchers with limited resources and those working across different fields. This paper presents EasyJudge, a model developed to evaluate significant language model responses. It is lightweight, precise, efficient, and user-friendly, featuring an intuitive visualization interface for ease of deployment and use. EasyJudge uses detailed datasets and refined prompts for model optimization, achieving strong consistency with human and proprietary model evaluations. The model optimized with quantitative methods enables EasyJudge to run efficiently on consumer-grade GPUs or even CPUs. We also provide detailed analysis and case studies to further reveal the potential of our method.

[AI-140] HypomimiaCoach: An AU-based Digital Therapy System for Hypomimia Detection and Rehabilitation with Parkinson’s Disease

Link: https://arxiv.org/abs/2410.09772
Authors: Yingjing Xu, Xueyan Cai, Zihong Zhou, Mengru Xue, Bo Wang, Haotian Wang, Zhengke Li, Chentian Weng, Wei Luo, Cheng Yao, Bo Lin, Jianwei Yin
Keywords: delayed facial movements, Parkinson disease, movements and expressions, articulation and emotion, manifests as delayed
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes:

Abstract:Hypomimia is a non-motor symptom of Parkinson’s disease that manifests as delayed facial movements and expressions, along with challenges in articulation and emotion. Currently, subjective evaluation by neurologists is the primary method for hypomimia detection, and conventional rehabilitation approaches heavily rely on verbal prompts from rehabilitation physicians. There remains a deficiency in accessible, user-friendly and scientifically rigorous assistive tools for hypomimia treatments. To address this, we developed HypomimiaCoach, an Action Unit (AU)-based digital therapy system for hypomimia detection and rehabilitation in Parkinson’s disease. The HypomimiaCoach system was designed to facilitate engagement through the incorporation of both relaxed and controlled rehabilitation exercises, while also stimulating initiative through the integration of digital therapies that incorporated traditional face training methods. We extract Action Unit (AU) features and their relationships for hypomimia detection. To facilitate rehabilitation, a series of training programmes has been devised based on the AUs, and patients are provided with real-time feedback through an additional AU recognition model, which guides them through their training routines. A pilot study was conducted with seven participants in China, all of whom exhibited symptoms of Parkinson’s disease hypomimia. The results of the pilot study demonstrated a positive impact on participants’ self-efficacy, with favourable feedback received. Furthermore, physician evaluations validated the system’s applicability in a therapeutic setting for patients with Parkinson’s disease, as well as its potential value in clinical applications.

[AI-141] Quis custodiet ipsos custodes? Who will watch the watchmen? On Detecting AI-generated peer-reviews EMNLP

Link: https://arxiv.org/abs/2410.09770
Authors: Sandeep Kumar, Mohit Sahu, Vardhan Gacche, Tirthankar Ghosal, Asif Ekbal
Keywords: maintaining scientific rigor, process is vital, vital for maintaining, rigor and trust, academic community
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Notes: EMNLP Main, 17 pages, 5 figures, 9 tables

Abstract:The integrity of the peer-review process is vital for maintaining scientific rigor and trust within the academic community. With the steady increase in the usage of large language models (LLMs) like ChatGPT in academic writing, there is a growing concern that AI-generated texts could compromise scientific publishing, including peer-reviews. Previous works have focused on generic AI-generated text detection or have presented an approach for estimating the fraction of peer-reviews that can be AI-generated. Our focus here is to solve a real-world problem by assisting the editor or chair in determining whether a review is written by ChatGPT or not. To address this, we introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting. We stress test these detectors against token attack and paraphrasing. Finally, we propose an effective defensive strategy to reduce the effect of paraphrasing on our models. Our findings suggest both our proposed methods perform better than the other AI text detectors. Our RR model is more robust, although our TF model performs better than the RR model without any attacks. We make our code, dataset, and model public.

[AI-142] LibEER: A Comprehensive Benchmark and Algorithm Library for EEG-based Emotion Recognition

Link: https://arxiv.org/abs/2410.09767
Authors: Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Dalin Zhang
Keywords: garnering increasing attention, increasing attention due, analyzing human emotions, garnering increasing, increasing attention
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes:

Abstract:EEG-based emotion recognition (EER) is garnering increasing attention due to its potential in understanding and analyzing human emotions. Recently, significant advancements have been achieved using various deep learning-based techniques to address the EER problem. However, the absence of a convincing benchmark and open-source codebase complicates fair comparisons between different models and poses reproducibility challenges for practitioners. These issues considerably impede progress in this field. In response to these challenges, we propose LibEER, a comprehensive benchmark and algorithm library for fair comparisons in EER, by ensuring consistency in the implementation details of various methods and utilizing a single codebase in PyTorch. LibEER establishes a unified evaluation framework with standardized experimental settings, enabling unbiased evaluations of over ten representative deep learning-based EER models across the four most commonly used datasets. Additionally, we conduct an exhaustive and reproducible comparison of the performance and efficiency of popular models, providing valuable insights for researchers in selecting and designing EER models. We aspire for our work to not only lower the barriers for beginners entering the field of EEG-based emotion recognition but also promote the standardization of research in this domain, thereby fostering steady development. The source code is available at this https URL.

[AI-143] EEG-based AI-BCI Wheelchair Advancement: A Brain-Computer Interfacing Wheelchair System Using Machine Learning Mechanism with Right and Left Voluntary Hand Movement

Link: https://arxiv.org/abs/2410.09763
Authors: Biplov Paneru, Bishwash Paneru, Khem Narayan Poudyal
Keywords: Artificial Intelligence, Left Hand Movement, presents an Artificial, Left Hand, voluntary Right Left
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Notes:

Abstract:This paper presents a novel Artificial Intelligence (AI)-integrated approach to Brain-Computer Interface (BCI)-based wheelchair development, utilizing voluntary right- and left-hand movements for control. The system is designed to simulate wheelchair navigation based on voluntary right- and left-hand movements using electroencephalogram (EEG) data. A pre-filtered dataset, obtained from an open-source EEG repository, was segmented into arrays of 19x200 to capture the onset of hand movements. The data was acquired at a sampling frequency of 200 Hz in the laboratory experiment. The system integrates a Tkinter-based interface for simulating wheelchair movements, offering users a functional and intuitive control system. Various machine learning models, including Support Vector Machines (SVM), XGBoost, random forest, and a Bi-directional Long Short-Term Memory (Bi-LSTM) attention-based model, were developed. The random forest model obtained 79% accuracy. The Logistic Regression model performed best, with 92% accuracy, followed by the Multi-Layer Perceptron (MLP) model with 91% accuracy. The Bi-LSTM attention-based model achieved a mean accuracy of 86% through cross-validation, showcasing the potential of attention mechanisms in BCI applications.
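
The 19x200 segmentation can be pictured as cutting a 19-channel recording into windows of 200 samples, which is one second of signal at the stated 200 Hz sampling rate. The sketch below uses fixed, non-overlapping windows; that choice is an assumption, since the abstract says the segments capture movement onsets and the actual cut points are therefore likely event-locked. The function name and parameters are hypothetical.

```python
def segment_windows(recording, window=200, step=200):
    """Split a channels-by-samples recording into channels-by-window segments.

    `recording` is a list of channels, each a list of samples.
    Non-overlapping windows (step == window) are an assumption.
    """
    n_samples = len(recording[0])
    if any(len(channel) != n_samples for channel in recording):
        raise ValueError("all channels must have the same length")
    segments = []
    for start in range(0, n_samples - window + 1, step):
        # Each segment keeps every channel and `window` consecutive samples.
        segments.append([channel[start:start + window] for channel in recording])
    return segments
```

A recording of 450 samples per channel yields two full 19x200 segments; the trailing 50 samples are dropped because they do not fill a window.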

[AI-144] ChartKG: A Knowledge-Graph-Based Representation for Chart Images

Link: https://arxiv.org/abs/2410.09761
Authors: Zhiguang Zhou, Haoxuan Wang, Zhengqing Zhao, Fengling Zheng, Yongheng Wang, Wei Chen, Yong Wang
Keywords: explosively produced due, Chart images, Chart, explosively produced, produced due
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

Abstract:Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images into raw data and often ignore their visual encodings and semantic meanings, which can result in information loss for many downstream tasks. In this paper, we propose ChartKG, a novel knowledge graph (KG) based representation for chart images, which can model the visual elements in a chart image and semantic relations among them including visual encodings and visual insights in a unified manner. Further, we develop a general framework to convert chart images to the proposed KG-based representation. It integrates a series of image processing techniques to identify visual elements and relations, e.g., CNNs to classify charts, yolov5 and optical character recognition to parse charts, and rule-based methods to construct graphs. We present four cases to illustrate how our knowledge-graph-based representation can model the detailed visual elements and semantic relations in charts, and further demonstrate how our approach can benefit downstream applications such as semantic-aware chart retrieval and chart question answering. We also conduct quantitative evaluations to assess the two fundamental building blocks of our chart-to-KG framework, i.e., object recognition and optical character recognition. The results provide support for the usefulness and effectiveness of ChartKG.

[AI-145] SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2410.09754
Authors: Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, Takuma Seno
Keywords: traditional theories suggesting, Recent advances, largely driven, traditional theories, theories suggesting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: preprint

Abstract:Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block to provide a linear pathway from the input to output, and (iii) a layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms-including off-policy, on-policy, and unsupervised methods-is consistently improved. Moreover, solely by integrating SimBa architecture into SAC, it matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa’s broad applicability and effectiveness across diverse RL algorithms and environments.
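
The three components listed in the abstract can be sketched in plain Python. This is an illustrative sketch only: the class and function names are hypothetical, the running statistics use Welford's update, and applying layer normalization at the start of the residual branch is an assumption about placement that the abstract does not specify.

```python
import math


class RunningNorm:
    """Standardize observations with running mean and variance (Welford's update)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def __call__(self, obs):
        if self.count == 0:
            return list(obs)    # no statistics yet, pass through unchanged
        return [(x - self.mean[i]) / math.sqrt(self.m2[i] / self.count + self.eps)
                for i, x in enumerate(obs)]


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, feedforward):
    """Identity path plus a feedforward branch applied to the normalized input."""
    branch = feedforward(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]
```

If the feedforward branch outputs zeros, `residual_block` returns its input unchanged, which is the linear pathway from input to output that the abstract describes.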

[AI-146] Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09750
Authors: Juseong Jin, Chang Wook Jeong
Keywords: Conversation agents powered, Conversation agents, agents powered, Conversation, scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: NeurIPS 2024 AIM-FM Workshop

Abstract:Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation of visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.

[AI-147] t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving

Link: https://arxiv.org/abs/2410.09747
Authors: Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun Luo
Keywords: autonomous vehicles, wide adoption, analytics to fuse, fuse their outputs, multimodal sensors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: 15 pages, 16 figures

Abstract:Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for robust perception becomes imperative. However, existing fusion methods often make two assumptions that rarely hold in practice: i) similar data distributions for all inputs and ii) constant availability of all sensors. Because, for example, lidars have various resolutions and radars may fail, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. Extensive experiments demonstrate that, compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x, at the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.

[AI-148] Gradient-Free Neural Network Training on the Edge

Link: https://arxiv.org/abs/2410.09734
Authors: Dotan Di Castro, Omkar Joglekar, Shir Kozlovsky, Vladimir Tchuiev, Michal Moshkovitz
Keywords: heavy and energy-intensive, computationally heavy, Training neural networks, neural networks, Training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Training neural networks is computationally heavy and energy-intensive. Many methodologies were developed to save computational requirements and energy by reducing the precision of network weights at inference time and introducing techniques such as rounding, stochastic rounding, and quantization. However, most of these techniques still require full gradient precision at training time, which makes training such models prohibitive on edge devices. This work presents a novel technique for training neural networks without needing gradients. This enables a training process where all the weights are one or two bits, without any hidden full precision computations. We show that it is possible to train models without gradient-based optimization techniques by identifying erroneous contributions of each neuron towards the expected classification and flipping the relevant bits using logical operations. We tested our method on several standard datasets and achieved performance comparable to corresponding gradient-based baselines with a fraction of the compute power.

[AI-149] MIRAGE: Multimodal Identification and Recognition of Annotations in Indian General Prescriptions

Link: https://arxiv.org/abs/2410.09729
Authors: Tavish Mankash, V.S. Chaithanya Kota, Anish De, Praveen Prakash, Kshitij Jadhav
Keywords: Hospitals generate thousands, Electronic Medical Records, Hospitals generate, availability of Electronic, Electronic Medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 8 pages, 11 figures, 2 tables

Abstract:Hospitals generate thousands of handwritten prescriptions, a practice that remains prevalent despite the availability of Electronic Medical Records (EMR). This method of record-keeping hinders the examination of long-term medication effects, impedes statistical analysis, and makes the retrieval of records challenging. Handwritten prescriptions pose a unique challenge, requiring specialized data for training models to recognize medications and their patterns of recommendation. While current handwriting recognition approaches typically employ 2-D LSTMs, recent studies have explored the use of Large Language Models (LLMs) for Optical Character Recognition (OCR). Building on this approach, we focus on extracting medication names from medical records. Our methodology MIRAGE (Multimodal Identification and Recognition of Annotations in indian GEneral prescriptions) involves fine-tuning the LLaVA 1.6 and Idefics2 models. Our research utilizes a dataset provided by Medyug Technology, consisting of 743,118 fully annotated high-resolution simulated medical records from 1,133 doctors across India. We demonstrate that our methodology exhibits 82% accuracy in medication name and dosage extraction. We provide a detailed account of our research methodology and results, notes about HWR with Multimodal LLMs, and release a small dataset of 100 medical records with labels.

[AI-150] A Tidal Current Speed Forecasting Model based on Multiple Periodicity Learning

Link: https://arxiv.org/abs/2410.09718
Authors: Tengfei Cheng, Yunxuan Dong, Yangdi Huang
Keywords: tidal current speed, tidal current, Tidal energy, Tidal, key components
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Tidal energy is one of the key components in increasing the penetration rate of renewable energy. The penetration of tidal energy in the electrical grid depends on the accuracy of tidal current speed forecasting. Modeling inaccuracies hinder forecast accuracy. Previous research has primarily used physical models to forecast tidal current speed. However, tidal current variations influenced by the orbital periods of celestial bodies make accurate physical modeling challenging. Researching the multiple periodicity of tides is crucial for accurately forecasting tidal current speed. In this article, we propose the Wavelet-Enhanced Convolutional Network (WCN) to learn multiple periodicity. The framework embeds intra-period and inter-period variations of one-dimensional tidal current data into the rows and columns of a two-dimensional tensor. Then, the two-dimensional variations of the sequence can be processed by convolutional kernels. We integrate a time-frequency analysis method into the framework to further address local periodic features. Additionally, to enhance the framework’s stability, we optimize the framework’s hyperparameters with the Tree-structured Parzen Estimator algorithm. The proposed framework avoids the lack of learning multiple periodicity. Compared with benchmarks, the proposed framework reduces the mean absolute error and mean square error in 10-step forecasting by, at most, 90.36% and 97.56%, respectively.
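
The embedding of a one-dimensional series into a two-dimensional tensor can be illustrated by folding the series at a chosen period. In this sketch each row holds one period, so intra-period variation runs along a row and inter-period variation runs down a column; that orientation, the function name, and the dropping of an incomplete final period are assumptions, and how the period itself is chosen is not covered here.

```python
def fold_by_period(series, period):
    """Reshape a 1-D series into rows of length `period`.

    Row r holds the r-th full period; a ragged tail that does not
    fill a whole period is dropped (an assumption of this sketch).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    n_rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(n_rows)]
```

Once the series is laid out this way, a two-dimensional convolutional kernel sees neighbouring time steps horizontally and the same phase of neighbouring periods vertically.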

[AI-151] Agentic Information Retrieval

Link: https://arxiv.org/abs/2410.09713
Authors: Weinan Zhang, Junwei Liao, Ning Li, Kounianhua Du
Keywords: information retrieval, information, Agentic Information Retrieval, Agentic, relevant information
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes: 11 pages, position paper

Abstract:What will information entry look like in the next generation of digital products? Since the 1970s, user access to relevant information has relied on domain-specific architectures of information retrieval (IR). Over the past two decades, the advent of modern IR systems, including web search engines and personalized recommender systems, has greatly improved the efficiency of retrieving relevant information from vast data corpora. However, the core paradigm of these IR systems remains largely unchanged, relying on filtering a predefined set of candidate items. Since 2022, breakthroughs in large language models (LLMs) have begun transforming how information is accessed, establishing a new technical paradigm. In this position paper, we introduce Agentic Information Retrieval (Agentic IR), a novel IR paradigm shaped by the capabilities of LLM agents. Agentic IR expands the scope of accessible tasks and leverages a suite of new techniques to redefine information retrieval. We discuss three types of cutting-edge applications of agentic IR and the challenges faced. We propose that agentic IR holds promise for generating innovative applications, potentially becoming a central information entry point in future digital ecosystems.

[AI-152] Honest AI: Fine-Tuning “Small” Language Models to Say “I Don’t Know” and Reducing Hallucination in RAG

Link: https://arxiv.org/abs/2410.09699
Authors: Xinxi Chen, Li Wang, Wei Wu, Qi Tang, Yiyao Liu
Keywords: Large Language Models, Large Language, applications of Large, enterprise applications, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Hallucination is a key roadblock for applications of Large Language Models (LLMs), particularly for enterprise applications that are sensitive to information accuracy. To address this issue, two general approaches have been explored: Retrieval-Augmented Generation (RAG) to supply LLMs with updated information as context, and fine-tuning the LLMs with new information and desired output styles. In this paper, we propose Honest AI: a novel strategy to fine-tune “small” language models to say “I don’t know” to reduce hallucination, along with several alternative RAG approaches. The solution ranked 1st in Task 2 for the false premise question. The alternative approaches include using RAG with search engine and knowledge graph results, fine-tuning base LLMs with new information and combinations of both approaches. Although all approaches improve the performance of the LLMs, RAG alone does not significantly improve the performance and fine-tuning is needed for better results. Finally, the hybrid approach achieved the highest score in the CRAG benchmark. In addition, our approach emphasizes the use of relatively small models with fewer than 10 billion parameters, promoting resource efficiency.

[AI-153] Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Link: https://arxiv.org/abs/2410.09695
Authors: Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying
Keywords: ICL, learn OOD, OOD, learn OOD task, learn OOD mathematical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint, under review

Abstract:In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL’s well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL’s performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the low-test-error preference of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.
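
The low-test-error preference can be pictured as a selection rule: among the functions seen during pretraining, the one with the smallest error on the in-context examples wins. The sketch below is a toy illustration of that rule only, not the paper's analysis or a model of what a Transformer computes; the function name, the `candidates` mapping, and the use of mean squared error are all assumptions.

```python
def pick_low_error_function(candidates, context):
    """Return the name of the candidate with the lowest MSE on the context.

    `candidates` maps a (hypothetical) pretraining-function name to a callable;
    `context` is a list of (x, y) in-context examples.
    """
    if not context:
        raise ValueError("need at least one in-context example")

    def mse(fn):
        return sum((fn(x) - y) ** 2 for x, y in context) / len(context)

    return min(candidates, key=lambda name: mse(candidates[name]))
```

Under this reading, an OOD target that lies outside the candidate set is never recovered; the rule simply returns the closest function it already has.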

[AI-154] ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

Link: https://arxiv.org/abs/2410.09692
Authors: Hai Huang, Randall Balestriero
Keywords: Large Language Model, Large Language, Language Model, butter of Large, LoRA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning, a setting that employs a limited amount of data and training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes but fails to converge to a reliable regularizer for short training episodes. Second, LoRA’s initialization of $B$ at 0 creates a slow training dynamic between $A$ and $B$. That dynamic is also exacerbated by Dropout, which further slows the escape from 0 for $B$ and is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates “short-sighted” interactions between the LoRA modules of different layers. Motivated by principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free LoRA with Adaptive Learning rate, coined ALLoRA. By scaling the per-sample and per-parameter gradients with a coefficient inversely proportional to the parameters’ $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Empirical results show that ALLoRA admits better accuracy than LoRA on various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show our solution is optimal in a family of weight-dependent / output-dependent approaches on various LLMs including the latest Llama3.
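
The additive update W+AB and the effect of initializing B at zero can be checked with a small pure-Python sketch. It covers only plain LoRA as described in the abstract, not ALLoRA's adaptive learning rate; the helper names are hypothetical, and the scaling factor and Dropout are deliberately left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, a, b):
    """Return W + AB, where A is (d x r) and B is (r x k) with small rank r."""
    ab = matmul(a, b)
    return [[w[i][j] + ab[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

With B set to all zeros, `lora_weight` returns W unchanged, so finetuning starts from the pretrained model; the abstract's second limitation concerns how slowly B then moves away from that zero initialization.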

[AI-155] Robust 3D Point Clouds Classification based on Declarative Defenders

Link: https://arxiv.org/abs/2410.09691
Authors: Kaidong Li, Tianxiao Zhang, Chuncong Zhong, Ziming Zhang, Guanghui Wang
Keywords: respective input data, cloud classification requires, divergent characteristics, respective input, point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D point cloud classification requires distinct models from 2D image classification due to the divergent characteristics of the respective input data. While 3D point clouds are unstructured and sparse, 2D images are structured and dense. Bridging the domain gap between these two data types is a non-trivial challenge to enable model interchangeability. Recent research using Lattice Point Classifier (LPC) highlights the feasibility of cross-domain applicability. However, the lattice projection operation in LPC generates 2D images with disconnected projected pixels. In this paper, we explore three distinct algorithms for mapping 3D point clouds into 2D images. Through extensive experiments, we thoroughly examine and analyze their performance and defense mechanisms. Leveraging current large foundation models, we scrutinize the feature disparities between regular 2D images and projected 2D images. The proposed approaches demonstrate superior accuracy and robustness against adversarial attacks. The generative model-based mapping algorithms yield regular 2D images, further minimizing the domain gap from regular 2D classification tasks. The source code is available at this https URL.

[AI-156] MoIN: Mixture of Introvert Experts to Upcycle an LLM

Link: https://arxiv.org/abs/2410.09687
Authors: Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash
Keywords: existing large language, large language model, existing large, large language, prohibitive requirements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full-model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on the top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them “introvert” experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, the inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
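
The routing step, where a query is sent to a single most relevant expert, might look roughly like the sketch below. The abstract does not say how relevance is computed, so the cosine similarity against per-group centroid vectors, the expert names, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def route(query_vec, centroids):
    """Return the name of the single expert whose centroid is closest to the query."""
    return max(centroids, key=lambda name: cosine(query_vec, centroids[name]))


# Hypothetical centroids, one per semantic group of the pre-training data.
centroids = {
    "code": [1.0, 0.0, 0.1],
    "medicine": [0.0, 1.0, 0.2],
}
expert = route([0.9, 0.1, 0.0], centroids)
```

Because exactly one expert is chosen per query, only that adapter would need to be loaded for the forward pass, which is consistent with the "introvert" property described above.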

[AI-157] Generalization of Compositional Tasks with Logical Specification via Implicit Planning

Link: https://arxiv.org/abs/2410.09686
Authors: Duo Xu, Faramarz Fekri
Keywords: learning generalizable policies, compositional tasks, logic specification, tasks, learning generalizable
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:In this work, we study the problem of learning generalizable policies for compositional tasks given by a logic specification. These tasks are composed of temporally extended subgoals. Due to dependencies among subgoals and long task horizons, previous reinforcement learning (RL) algorithms, e.g., task-conditioned and goal-conditioned policies, still suffer from slow convergence and sub-optimality when solving the generalization problem of compositional tasks. To tackle these issues, this paper proposes a new hierarchical RL framework for the efficient and optimal generalization of compositional tasks. At the high level, we propose a new implicit planner designed specifically for generalizing compositional tasks. Specifically, the planner selects the next sub-task and estimates the multi-step return of completing the rest of the task from the current state. It learns a latent transition model and conducts planning in the latent space based on a graph neural network (GNN). Then, the next sub-task selected by the high level guides the low-level agent to solve long-horizon tasks efficiently, and the multi-step return makes the low-level policy consider dependencies of future sub-tasks. We conduct comprehensive experiments to show the advantage of the proposed framework over previous methods in terms of optimality and efficiency.

[AI-158] LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Link: https://arxiv.org/abs/2410.09681
Authors: Christopher Diehl, Peter Karkus, Shushant Veer, Marco Pavone, Torsten Bertram
Keywords: Distribution shifts, self-driving vehicles, shifts between operational, operational domains, domains can severely
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach improves forgetting by up to 23.33% and the closed-loop OOD driving score by 8.83% in comparison to standard fine-tuning.

[AI-159] OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models

Link: https://arxiv.org/abs/2410.09671
Authors: Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, Weinan Zhang
Keywords: integrate key components, large language models, reinforcement learning, technical report, key components
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this technical report, we introduce OpenR, an open-source framework designed to integrate key components for enhancing the reasoning capabilities of large language models (LLMs). OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding into a cohesive software platform. Our goal is to establish an open-source platform and community to accelerate the development of LLM reasoning. Inspired by the success of OpenAI’s o1 model, which demonstrated improved reasoning abilities through step-by-step reasoning and reinforcement learning, OpenR integrates test-time compute, reinforcement learning, and process supervision to improve reasoning in LLMs. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI’s o1 model with reinforcement learning, achieving advanced reasoning capabilities beyond traditional autoregressive methods. We demonstrate the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly available data and search methods. Our initial experiments confirm substantial gains, with relative improvements in reasoning and performance driven by test-time computation and reinforcement learning through process reward models. The OpenR framework, including code, models, and datasets, is accessible at this https URL.

[AI-160] LSTM-Based Proactive Congestion Management for Internet of Vehicle Networks

Link: https://arxiv.org/abs/2410.09656
Authors: Aly Sabri Abdalla, Ahmad Al-Kabbany, Ehab F. Badran, Vuk Marojevic
Keywords: variety of safety, commercial applications, support a variety, Vehicles, networks support
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: This article has been accepted for publication in the IEEE VTC Fall 2024

Abstract:Vehicle-to-everything (V2X) networks support a variety of safety, entertainment, and commercial applications. This is realized by applying the principles of the Internet of Vehicles (IoV) to facilitate connectivity among vehicles and between vehicles and roadside units (RSUs). Network congestion management is essential for IoVs and it represents a significant concern due to its impact on improving the efficiency of transportation systems and providing reliable communication among vehicles for the timely delivery of safety-critical packets. This paper introduces a framework for proactive congestion management for IoV networks. We generate congestion scenarios and a data set to predict the congestion using LSTM. We present the framework and the packet congestion dataset. Simulation results using SUMO with NS3 demonstrate the effectiveness of the framework for forecasting IoV network congestion and clustering/prioritizing packets employing recurrent neural networks.

[AI-161] Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP2024

Link: https://arxiv.org/abs/2410.09652
Authors: Ankita Sinha, Wendi Cui, Kamalika Das, Jiaxin Zhang
Keywords: Large language models, demonstrated remarkable capabilities, Large language, prioritized performance metrics, historically prioritized performance
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: EMNLP 2024 Industry Track

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities; however, the optimization of their prompts has historically prioritized performance metrics at the expense of crucial safety and security considerations. To overcome this shortcoming, we introduce “Survival of the Safest” (SoS), an innovative multi-objective prompt optimization framework that enhances both performance and security in LLMs simultaneously. SoS utilizes an interleaved multi-objective evolution strategy, integrating semantic, feedback, and crossover mutations to effectively traverse the prompt landscape. Differing from the computationally demanding Pareto front methods, SoS provides a scalable solution that expedites optimization in complex, high-dimensional discrete search spaces while keeping computational demands low. Our approach accommodates flexible weighting of objectives and generates a pool of optimized candidates, empowering users to select prompts that optimally meet their specific performance and security needs. Experimental evaluations across diverse benchmark datasets affirm SoS’s efficacy in delivering high performance and notably enhancing safety and security compared to single-objective methods. This advancement marks a significant stride towards the deployment of LLM systems that are both high-performing and secure across varied industrial applications.

[AI-162] Multimodal Physical Activity Forecasting in Free-Living Clinical Settings: Hunting Opportunities for Just-in-Time Interventions

Link: https://arxiv.org/abs/2410.09643
Authors: Abdullah Mamun, Krista S. Leonard, Megan E. Petrov, Matthew P. Buman, Hassan Ghasemzadeh
Keywords: lifestyle intervention system, Multimodal LSTM, patient activity behavior, LSTM, called MoveSense
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 9 pages, 5 figures

Abstract:Objective: This research aims to develop a lifestyle intervention system, called MoveSense, that forecasts a patient’s activity behavior to allow for early and personalized interventions in real-world clinical environments. Methods: We conducted two clinical studies involving 58 prediabetic veterans and 60 patients with obstructive sleep apnea to gather multimodal behavioral data using wearable devices. We develop multimodal long short-term memory (LSTM) network models, which are capable of forecasting the number of step counts of a patient up to 24 hours in advance by examining data from activity and engagement modalities. Furthermore, we design goal-based forecasting models to predict whether a person’s next-day steps will be over a certain threshold. Results: Multimodal LSTM with early fusion achieves 33% and 37% lower mean absolute errors than linear regression and ARIMA respectively on the prediabetes dataset. LSTM also outperforms linear regression and ARIMA with a margin of 13% and 32% on the sleep dataset. Multimodal forecasting models also perform with 72% and 79% accuracy on the prediabetes dataset and sleep dataset respectively on goal-based forecasting. Conclusion: Our experiments conclude that multimodal LSTM models with early fusion are better than multimodal LSTM with late fusion and unimodal LSTM models and also than ARIMA and linear regression models. Significance: We address an important and challenging task of time-series forecasting in uncontrolled environments. Effective forecasting of a person’s physical activity can aid in designing adaptive behavioral interventions to keep the user engaged and adherent to a prescribed routine.

[AI-163] ReLUs Revival: On the Entropic Overload in Normalization-Free Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09637
Authors: Nandan Kumar Jha, Brandon Reagen
Keywords: ensuring smooth optimization, modern large language, large language models, smooth optimization, critical component
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop on Attributing Model Behavior at Scale

Abstract:LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU’s geometrical properties (specialization in input space and intra-class selectivity) lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.
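
The abstract does not define "entropic overload" precisely. One plausible reading is that attention heads in early layers sit close to maximum entropy, meaning their attention is spread almost uniformly over tokens. Under that assumption, the quantity of interest is the Shannon entropy of each attention row, sketched here with hypothetical toy scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Uniform attention over 4 tokens reaches the maximum entropy, log(4).
uniform = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
# Attention concentrated on one token has entropy close to 0.
peaked = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
```

A head whose rows all sit near `log(n)` for context length `n` attends to everything about equally, which matches the under-utilization the abstract describes, though the paper's exact metric may differ.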

[AI-164] Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health

Link: https://arxiv.org/abs/2410.09635
Authors: Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
Keywords: Early detection, risk enables interventions, adverse labor outcomes, mitigate adverse labor, Explaining Neonatal Health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures

Abstract:Early detection of intrapartum risk enables interventions to potentially prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, there is no accurate automated system to predict such events to assist with clinical decision-making. To fill this gap, we propose “Artificial Intelligence (AI) for Modeling and Explaining Neonatal Health” (AIMEN), a deep learning framework that not only predicts adverse labor outcomes from maternal, fetal, obstetrical, and intrapartum risk factors but also provides the model’s reasoning behind the predictions made. The latter can provide insights into what modifications in the input variables of the model could have changed the predicted outcome. We address the challenges of imbalance and small datasets by synthesizing additional training data using Adaptive Synthetic Sampling (ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN uses an ensemble of fully-connected neural networks as the backbone for its classification with the data augmentation supported by either ADASYN or CTGAN. AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in classification. AIMEN can predict a high risk for adverse labor outcomes with an average F1 score of 0.784. It also provides counterfactual explanations that can be achieved by changing 2 to 3 attributes on average. Resources available: this https URL.

[AI-165] Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09629
Authors: Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar
Keywords: Large language models, Large language, Retrieval Augmented Generation, proficient in capturing, knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: EMNLP 2024 main conference long paper

Abstract:Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

[AI-166] SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Link: https://arxiv.org/abs/2410.09615
Authors: Mohammad Mozaffari, Maryam Mehri Dehnavi
Keywords: revolutionized natural language, natural language understanding, high memory consumption, slow inference times, inference times due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks but suffer from high memory consumption and slow inference times due to their large parameter sizes. Traditional model compression techniques, such as quantization and pruning, mitigate these issues but often require retraining to maintain accuracy, which is computationally expensive. This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation. SLiM eliminates the need for costly retraining by combining a symmetric quantization method (SLiM-Quant) with a saliency-based low-rank approximation. Our method reduces quantization error while leveraging sparse representations compatible with accelerated hardware architectures. Additionally, we propose a parameter-efficient fine-tuning recipe that significantly reduces overhead compared to conventional quantization-aware training. SLiM achieves up to a 5.4% improvement in model accuracy for sparsity patterns like 2:4, and the fine-tuning step further enhances accuracy by up to 5.8%, demonstrating state-of-the-art performance. This work provides a pathway for efficiently deploying large models in memory-constrained environments without compromising accuracy.
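
Two ingredients named in the abstract, 2:4 sparsity and symmetric quantization, can be sketched in a few lines. This is a generic stand-in rather than SLiM-Quant or the paper's saliency-based method: magnitude-based pruning, uniform rounding with a single scale, and the function names are assumptions for illustration.

```python
def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        order = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)
        keep = set(order[:2])
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


def quantize_symmetric(weights, bits=4):
    """Symmetric uniform quantization: snap each weight to a multiple of `scale`."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights], scale


w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(w)
dequant, scale = quantize_symmetric(sparse, bits=4)
# Residual that a low-rank correction term would try to approximate.
residual = [a - b for a, b in zip(w, dequant)]
```

The `residual` list is the compression error. In a sparse-plus-low-rank scheme, a low-rank matrix would be fitted to this error, a step omitted here because it needs a linear-algebra library.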

[AI-167] Transformer-based Language Models for Reasoning in the Description Logic ALCQ KR

Link: https://arxiv.org/abs/2410.09613
Authors: Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis
Keywords: Recent advancements, advancements in transformer-based, sparked research, logical reasoning capabilities, transformer-based language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

Abstract:Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities. Most of the benchmarks used to evaluate these models are simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. DELTA$_D$ comprises 384K examples and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) with few-shot prompting. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task. Moreover, the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.

[AI-168] Traversing Emotional Landscapes and Linguistic Patterns in Bernard-Marie Koltès' Plays: An NLP Perspective

Link: https://arxiv.org/abs/2410.09609
Authors: Arezou Zahiri Pourzarandi, Farshad Jafari
Keywords: Natural Language Processing, contemporary French theatre, study employs Natural, employs Natural Language, French theatre
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study employs Natural Language Processing (NLP) to analyze the intricate linguistic and emotional dimensions within the plays of Bernard-Marie Koltès, a central figure in contemporary French theatre. By integrating advanced computational techniques, we dissect Koltès’ narrative style, revealing the subtle interplay between language and emotion across his dramatic oeuvre. Our findings highlight how Koltès crafts his narratives, enriching our understanding of his thematic explorations and contributing to the broader field of digital humanities in literary analysis.

[AI-169] EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment

Link: https://arxiv.org/abs/2410.09604
Authors: Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Jun Zhang, Zhiheng Zheng, Fanhang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, Xinlei Chen, Yong Li
Keywords: generating human-like behaviors, human-like behaviors, emphasizes the role, body in generating, generating human-like
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: All of the software, Python library, codes, datasets, tutorials, and real-time online service are available on this website: this https URL

Abstract:Embodied artificial intelligence emphasizes the role of an agent’s body in generating human-like behaviors. The recent efforts on EmbodiedAI pay a lot of attention to building up machine learning models to possess perceiving, planning, and acting abilities, thereby enabling real-time interaction with the world. However, most works focus on bounded indoor environments, such as navigation in a room or manipulating a device, with limited exploration of embodying the agents in open-world scenarios. That is, embodied intelligence in the open and outdoor environment is less explored, for which one potential reason is the lack of high-quality simulators, benchmarks, and datasets. To address it, in this paper, we construct a benchmark platform for embodied intelligence evaluation in real-world city environments. Specifically, we first construct a highly realistic 3D simulation environment based on the real buildings, roads, and other elements in a real city. In this environment, we combine historically collected data and simulation algorithms to conduct simulations of pedestrian and vehicle flows with high fidelity. Further, we designed a set of evaluation tasks covering different EmbodiedAI abilities. Moreover, we provide a complete set of input and output interfaces for access, enabling embodied agents to easily take task requirements and current environmental observations as input and then make decisions and obtain performance evaluations. On the one hand, it expands the capability of existing embodied intelligence to higher levels. On the other hand, it has a higher practical value in the real world and can support more potential applications for artificial general intelligence. Based on this platform, we evaluate some popular large language models for embodied intelligence capabilities of different dimensions and difficulties.

[AI-170] A Complete Characterization of Learnability for Stochastic Noisy Bandits

Link: https://arxiv.org/abs/2410.09597
Authors: Steve Hanneke, Kun Wang
Keywords: unknown reward function, model, study the stochastic, mathcal, model class
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We study the stochastic noisy bandit problem with an unknown reward function $f^*$ in a known function class $\mathcal{F}$. Formally, a model $M$ maps arms $\pi$ to a probability distribution $M(\pi)$ of reward. A model class $\mathcal{M}$ is a collection of models. For each model $M$, define its mean reward function $f^M(\pi)=\mathbb{E}_{r \sim M(\pi)}[r]$. In the bandit learning problem, we proceed in rounds, pulling one arm $\pi$ each round and observing a reward sampled from $M(\pi)$. With knowledge of $\mathcal{M}$, supposing that the true model $M\in \mathcal{M}$, the objective is to identify an arm $\hat{\pi}$ of near-maximal mean reward $f^M(\hat{\pi})$ with high probability in a bounded number of rounds. If this is possible, then the model class is said to be learnable. Importantly, a result of \cite{hanneke2023bandit} shows there exist model classes for which learnability is undecidable. However, the model class they consider features deterministic rewards, and they raise the question of whether learnability is decidable for classes containing sufficiently noisy models. For the first time, we answer this question in the positive by giving a complete characterization of learnability for model classes with arbitrary noise. In addition to that, we also describe the full spectrum of possible optimal query complexities. Further, we prove adaptivity is sometimes necessary to achieve the optimal query complexity. Last, we revisit an important complexity measure for interactive decision making, the Decision-Estimation-Coefficient \citep{foster2021statistical,foster2023tight}, and propose a new variant of the DEC which also characterizes learnability in this setting.
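
The learnability objective stated in words above can be written out as a single condition. The accuracy and confidence parameters $\varepsilon$ and $\delta$ and the round budget $T$ are notation introduced here for illustration; the abstract itself only says "near-maximal" and "with high probability in a bounded number of rounds".

```latex
% Assumed notation: \varepsilon (accuracy), \delta (failure probability), T (number of rounds).
\forall \varepsilon, \delta \in (0,1)\;\; \exists\, T < \infty \;\; \text{such that} \;\;
\forall M \in \mathcal{M}: \quad
\Pr\!\left[\, f^{M}(\hat{\pi}) \;\ge\; \sup_{\pi} f^{M}(\pi) - \varepsilon \,\right] \;\ge\; 1 - \delta ,
```

where $\hat{\pi}$ is the arm returned by the learner after at most $T$ rounds of pulling arms and observing rewards drawn from $M(\pi)$.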

[AI-171] ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

Link: https://arxiv.org/abs/2410.09592
Authors: Hongbin Xu, Weitao Chen, Zhipeng Zhou, Feng Xiao, Baigui Sun, Mike Zheng Shou, Wenxiong Kang
Keywords: challenging issue, recent advancements, remains a challenging, triplane decoder, generation methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Draft version. This paper is still in submission. For access to our project page and code, please visit: this https URL

Abstract:Despite recent advancements in 3D generation methods, achieving controllability remains a challenging issue. Current approaches utilizing score-distillation sampling are hindered by laborious procedures that consume a significant amount of time. Furthermore, the process of first generating 2D representations and then mapping them to 3D lacks internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate for a joint training framework. In the condition training branch, we lock the triplane decoder and reuse the deep and robust encoding layers pretrained with millions of 3D data in LRM. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picking manual generation. The comprehensive experiments conducted on quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.

[AI-172] Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2410.09584
Authors: Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, Ji-Rong Wen
Keywords: Retrieval-Augmented Generation, Large Language Models, RAG systems, RAG, effective application
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at this https URL.
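
The idea of verifying instruction-following with executable code can be illustrated with a toy checker. The two constraint types, the function names, and the example response below are hypothetical; the abstract does not list the actual atomic instructions or the generated verification code.

```python
def verify_max_words(response, limit):
    """Check a length constraint such as 'answer in at most N words'."""
    return len(response.split()) <= limit


def verify_contains_keyword(response, keyword):
    """Check a content constraint such as 'mention the word X' (case-insensitive)."""
    return keyword.lower() in response.lower()


def verify_all(response, checks):
    """Return True only if every (function, argument) check passes."""
    return all(fn(response, arg) for fn, arg in checks)


# A composed instruction: at most 10 words AND must mention "Paris".
checks = [(verify_max_words, 10), (verify_contains_keyword, "Paris")]
ok = verify_all("The capital of France is Paris.", checks)
too_long = verify_all("Paris " * 20, checks)
```

Because each check is a plain function, atomic constraints can be combined freely, and a response that fails any one of them can be filtered out automatically.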

[AI-173] Improving 3D Finger Traits Recognition via Generalizable Neural Rendering

Link: https://arxiv.org/abs/2410.09582
Authors: Hongbin Xu, Junduan Huang, Yuer Ma, Zifeng Li, Wenxiong Kang
Keywords: demonstrated a powerful, powerful ability, finger, finger traits, Trait Guided Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is accepted in IJCV. For further information and access to the code, please visit our project page: this https URL

Abstract:3D biometric techniques on finger traits have become a new trend and have demonstrated a powerful ability for recognition and anti-counterfeiting. Existing methods follow an explicit 3D pipeline that reconstructs the models first and then extracts features from 3D models. However, these explicit 3D methods suffer from the following problems: 1) Inevitable information dropping during 3D reconstruction; 2) Tight coupling between specific hardware and algorithm for 3D reconstruction. It leads us to a question: Is it indispensable to reconstruct 3D information explicitly in recognition tasks? Hence, we consider this problem in an implicit manner, leaving the nerve-wracking 3D reconstruction problem for learnable neural networks with the help of neural radiance fields (NeRFs). We propose FingerNeRF, a novel generalizable NeRF for 3D finger biometrics. To handle the shape-radiance ambiguity problem that may result in incorrect 3D geometry, we aim to involve extra geometric priors based on the correspondence of binary finger traits like fingerprints or finger veins. First, we propose a novel Trait Guided Transformer (TGT) module to enhance the feature correspondence with the guidance of finger traits. Second, we involve extra geometric constraints on the volume rendering loss with the proposed Depth Distillation Loss and Trait Guided Rendering Loss. To evaluate the performance of the proposed method on different modalities, we collect two new datasets: SCUT-Finger-3D with finger images and SCUT-FingerVein-3D with finger vein images. Moreover, we also utilize the UNSW-3D dataset with fingerprint images for evaluation. In experiments, our FingerNeRF can achieve 4.37% EER on SCUT-Finger-3D dataset, 8.12% EER on SCUT-FingerVein-3D dataset, and 2.90% EER on UNSW-3D dataset, showing the superiority of the proposed implicit method in 3D finger biometrics.
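
The abstract above reports results as equal error rate (EER), the operating point where the false accept rate equals the false reject rate. As a rough illustration of the metric only (not the authors' evaluation code), the sketch below scans thresholds over similarity scores; the function name, the score arrays, and the "higher score means more likely a match" convention are all assumptions made for this example.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER for similarity scores (assumed: higher = more likely a match).

    Hypothetical helper for illustration; returns (eer, threshold).
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # Candidate thresholds: every distinct observed score.
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, eer, best_t = np.inf, 1.0, thresholds[0]
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostor pairs wrongly accepted
        frr = np.mean(genuine < t)    # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer, best_t = gap, (far + frr) / 2, t
    return eer, best_t

if __name__ == "__main__":
    # Toy scores, invented for the example.
    eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With finite score sets the two error curves rarely cross exactly, so this sketch averages them at the closest threshold; interpolating along the ROC curve is a common alternative.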

[AI-174] Structure of Artificial Neural Networks – Empirical Investigations

Link: https://arxiv.org/abs/2410.09579
Authors: Julian Stier
Keywords: neural architecture, neural architecture search, Deep Learning overtook, neural, deep architectures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: PhD thesis

Abstract:Within one decade, Deep Learning overtook the dominating solution methods of countless problems of artificial intelligence. "Deep" refers to the deep architectures with operations in manifolds of which there are no immediate observations. For these deep architectures some kind of structure is pre-defined -- but what is this structure? With a formal definition for structures of neural networks, neural architecture search problems and solution methods can be formulated under a common framework. Both practical and theoretical questions arise from closing the gap between applied neural architecture search and learning theory. Does structure make a difference or can it be chosen arbitrarily? This work is concerned with deep structures of artificial neural networks and examines automatic construction methods under empirical principles to shed light on the so-called "black-box models". Our contributions include a formulation of graph-induced neural networks that is used to pose optimisation problems for neural architecture. We analyse structural properties for different neural network objectives such as correctness, robustness or energy consumption and discuss how structure affects them. Selected automation methods for neural architecture optimisation problems are discussed and empirically analysed. With the insights gained from formalising graph-induced neural networks, analysing structural properties and comparing the applicability of neural architecture search methods qualitatively and quantitatively we advance these methods in two ways. First, new predictive models are presented for replacing computationally expensive evaluation schemes, and second, new generative models for informed sampling during neural architecture search are analysed and discussed.
Report number: urn:nbn:de:bvb:739-opus4-14968. Cite as: arXiv:2410.09579 [cs.LG] (or arXiv:2410.09579v1 [cs.LG] for this version). DOI: https://doi.org/10.48550/arXiv.2410.09579

[AI-175] The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models

Link: https://arxiv.org/abs/2410.09576
Authors: Subhankar Maity, Aniket Deroy
Keywords: large language models, natural language processing, revolutionized natural language, offering unprecedented capabilities, recent years
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Book Chapter (Under Review)

Abstract:In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.

[AI-176] Reconstructive Visual Instruction Tuning

Link: https://arxiv.org/abs/2410.09575
Authors: Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang
Keywords: Large Multimodal Models, Large Multimodal, paper introduces reconstructive, family of Large, visual instruction tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.

[AI-177] Are You Human? An Adversarial Benchmark to Expose LLMs

Link: https://arxiv.org/abs/2410.09569
Authors: Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky
Keywords: Large Language Models, Large Language, Language Models, raising concerns, scams and deception
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing with an LLM. We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time. To this end we compile and release an open-source benchmark dataset that includes ‘implicit challenges’ that exploit an LLM’s instruction-following mechanism to cause role deviation, and ‘explicit challenges’ that test an LLM’s ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of our methods, with humans outperforming LLMs on explicit challenges (78% vs 22% success rate). Our framework unexpectedly revealed that many study participants were using LLMs to complete tasks, demonstrating its effectiveness in detecting both AI impostors and human misuse of AI tools. This work addresses the critical need for reliable, real-time LLM detection methods in high-stakes conversations.

[AI-178] Extended Japanese Commonsense Morality Dataset with Masked Token and Label Enhancement

Link: https://arxiv.org/abs/2410.09564
Authors: Takumi Ohashi, Tsubasa Nakagawa, Hitoshi Iyatomi
Keywords: Rapid advancements, artificial intelligence, advancements in artificial, made it crucial, crucial to integrate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rapid advancements in artificial intelligence (AI) have made it crucial to integrate moral reasoning into AI systems. However, existing models and datasets often overlook regional and cultural differences. To address this shortcoming, we have expanded the JCommonsenseMorality (JCM) dataset, the only publicly available dataset focused on Japanese morality. The Extended JCM (eJCM) has grown from the original 13,975 sentences to 31,184 sentences using our proposed sentence expansion method called Masked Token and Label Enhancement (MTLE). MTLE selectively masks important parts of sentences related to moral judgment and replaces them with alternative expressions generated by a large language model (LLM), while re-assigning appropriate labels. The model trained using our eJCM achieved an F1 score of 0.857, higher than the scores for the original JCM (0.837), ChatGPT one-shot classification (0.841), and data augmented using AugGPT, a state-of-the-art augmentation method (0.850). Specifically, in complex moral reasoning tasks unique to Japanese culture, the model trained with eJCM showed a significant improvement in performance (increasing from 0.681 to 0.756) and achieved a performance close to that of GPT-4 Turbo (0.787). These results demonstrate the validity of the eJCM dataset and the importance of developing models and datasets that consider the cultural context.

[AI-179] Boltzmann-Aligned Inverse Folding Model as a Predictor of Mutational Effects on Protein-Protein Interactions

Link: https://arxiv.org/abs/2410.09543
Authors: Xiaoran Jiao, Weian Mao, Wengong Jin, Peiyuan Yang, Hao Chen, Chunhua Shen
Keywords: Delta, Predicting the change, modulating protein-protein interactions, drug design, crucial for understanding
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments:

Abstract:Predicting the change in binding free energy ($\Delta\Delta G$) is crucial for understanding and modulating protein-protein interactions, which are critical in drug design. Due to the scarcity of experimental $\Delta\Delta G$ data, existing methods focus on pre-training, while neglecting the importance of alignment. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre-trained inverse folding models to $\Delta\Delta G$ prediction. We begin by analyzing the thermodynamic definition of $\Delta\Delta G$ and introducing the Boltzmann distribution to connect energy with protein conformational distribution. However, the protein conformational distribution is intractable; therefore, we employ Bayes’ theorem to circumvent direct estimation and instead utilize the log-likelihood provided by protein inverse folding models for $\Delta\Delta G$ estimation. Compared to previous inverse folding-based methods, our method explicitly accounts for the unbound state of protein complex in the $\Delta\Delta G$ thermodynamic cycle, introducing a physical inductive bias and achieving both supervised and unsupervised state-of-the-art (SoTA) performance. Experimental results on SKEMPI v2 indicate that our method achieves Spearman coefficients of 0.3201 (unsupervised) and 0.5134 (supervised), significantly surpassing the previously reported SoTA values of 0.2632 and 0.4324, respectively. Furthermore, we demonstrate the capability of our method on binding energy prediction, protein-protein docking and antibody optimization tasks.
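
The chain of reasoning in this abstract (thermodynamic cycle, Boltzmann distribution, Bayes' theorem, inverse-folding log-likelihoods) can be made concrete with a small sketch. This is one reading of the abstract, not the authors' implementation. It assumes that a binding free energy is written as ΔG = -kT · log(p_bound / p_unbound), that Bayes' theorem swaps each conformational probability for a sequence log-likelihood given the structure, and that the structure-prior terms cancel between wild type and mutant. The function name and all four inputs are hypothetical.

```python
def ddg_from_log_likelihoods(ll_mut_bound, ll_mut_unbound,
                             ll_wt_bound, ll_wt_unbound, kT=1.0):
    """Sketch of a Boltzmann-style ddG estimate (hypothetical helper).

    Each ll_* argument stands for log p(sequence | structure) as it might be
    scored by an inverse folding model for the bound or unbound state.
    Assumption: structure priors cancel between mutant and wild type.
    """
    # dG = -kT * [log p(seq | bound) - log p(seq | unbound)]
    dg_mut = -kT * (ll_mut_bound - ll_mut_unbound)
    dg_wt = -kT * (ll_wt_bound - ll_wt_unbound)
    # ddG = dG(mutant) - dG(wild type); the unbound terms stay in explicitly.
    return dg_mut - dg_wt

if __name__ == "__main__":
    # Made-up log-likelihoods, purely to show the arithmetic.
    print(ddg_from_log_likelihoods(-10.0, -12.0, -9.0, -12.0))
```

Under this sign convention a positive value means the mutant binds less favorably than the wild type. Keeping the unbound-state terms in the difference is the part the abstract highlights as missing from earlier inverse-folding approaches.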

[AI-180] MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models

Link: https://arxiv.org/abs/2410.09542
Authors: Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Keywords: achieve higher intelligence, large language models, higher intelligence, Inductive reasoning, essential capability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures, under review

Abstract:Inductive reasoning is an essential capability for large language models (LLMs) to achieve higher intelligence, which requires the model to generalize rules from observed facts and then apply them to unseen examples. We present MIRAGE, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs’ capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs’ inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that the LLM is a poor rule-based reasoner. In many cases, when conducting inductive reasoning, they do not rely on a correct rule to answer the unseen case. From the perspectives of different prompting methods, observation numbers, and task forms, models tend to consistently conduct correct deduction without correct inductive rules. Besides, we find that LLMs are good neighbor-based reasoners. In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space. By leveraging these similar examples, the model maintains strong inductive capabilities within a localized region, significantly improving its deductive performance.

[AI-181] LINKED: Eliciting Filtering and Integrating Knowledge in Large Language Model for Commonsense Reasoning EMNLP2024

Link: https://arxiv.org/abs/2410.09541
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Jun Zhao
Keywords: demonstrate poor performance, demonstrate poor, poor performance, performance on knowledge-intensive, Large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Findings

Abstract:Large language models (LLMs) sometimes demonstrate poor performance on knowledge-intensive tasks, commonsense reasoning being one of them. Researchers typically address these issues by retrieving related knowledge from knowledge graphs or employing self-enhancement methods to elicit knowledge in LLMs. However, noisy knowledge and invalid reasoning issues hamper their ability to answer questions accurately. To this end, we propose a novel method named eliciting, filtering and integrating knowledge in large language model (LINKED). In it, we design a reward model to filter out the noisy knowledge and adopt a marginal consistent reasoning module to reduce invalid reasoning. In comprehensive experiments on two complex commonsense reasoning benchmarks, our method outperforms SOTA baselines (up to 9.0% improvement in accuracy). Besides, to measure the positive and negative impact of the injected knowledge, we propose a new metric called effectiveness-preservation score for knowledge enhancement works. Finally, through extensive experiments, we conduct an in-depth analysis and draw many meaningful conclusions about LLMs in commonsense reasoning tasks.

[AI-182] PrivQuant: Communication-Efficient Private Inference with Quantized Network/Protocol Co-Optimization

Link: https://arxiv.org/abs/2410.09531
Authors: Tianshi Xu, Shuzhang Zhong, Wenxuan Zeng, Runsheng Wang, Meng Li
Keywords: Private deep neural, secure two-party computation, secure privacy protection, deep neural network, times
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: ICCAD 2024

Abstract:Private deep neural network (DNN) inference based on secure two-party computation (2PC) enables secure privacy protection for both the server and the client. However, existing secure 2PC frameworks suffer from a high inference latency due to enormous communication. As the communication of both linear and non-linear DNN layers reduces with the bit widths of weight and activation, in this paper, we propose PrivQuant, a framework that jointly optimizes the 2PC-based quantized inference protocols and the network quantization algorithm, enabling communication-efficient private inference. PrivQuant proposes DNN architecture-aware optimizations for the 2PC protocols for communication-intensive quantized operators and conducts graph-level operator fusion for communication reduction. Moreover, PrivQuant also develops a communication-aware mixed precision quantization algorithm to improve inference efficiency while maintaining high accuracy. The network/protocol co-optimization enables PrivQuant to outperform prior-art 2PC frameworks. With extensive experiments, we demonstrate PrivQuant reduces communication by $11\times$, $2.5\times$, and $2.8\times$, which results in $8.7\times$, $1.8\times$, and $2.4\times$ latency reduction compared with SiRNN, COINN, and CoPriv, respectively.

[AI-183] Preserving Old Memories in Vivid Detail: Human-Interactive Photo Restoration Framework

Link: https://arxiv.org/abs/2410.09529
Authors: Seung-Yeon Back, Geonho Son, Dahye Jeong, Eunil Park, Simon S. Woo
Keywords: technology enables preserving, enables preserving visual, preserving visual memories, restoration technology enables, technology enables
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Photo restoration technology enables preserving visual memories in photographs. However, physical prints are vulnerable to various forms of deterioration, ranging from physical damage to loss of image quality, etc. While restoration by human experts can improve the quality of outcomes, it often comes at a high price in terms of cost and time for restoration. In this work, we present the AI-based photo restoration framework composed of multiple stages, where each stage is tailored to enhance and restore specific types of photo damage, accelerating and automating the photo restoration process. By integrating these techniques into a unified architecture, our framework aims to offer a one-stop solution for restoring old and deteriorated photographs. Furthermore, we present a novel old photo restoration dataset because we lack a publicly available dataset for our evaluation.

[AI-184] Boosting Deductive Reasoning with Step Signals In RLHF

Link: https://arxiv.org/abs/2410.09528
Authors: Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan
Keywords: Large Language Models, Large Language, tackle complex problems, Language Models, complex problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in-domain and out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.

[AI-185] Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence ACML2024

Link: https://arxiv.org/abs/2410.09519
Authors: Vencia Herzog, Stefan Suwelack
Keywords: achieved remarkable success, success in NLP, Self-supervised pre-training, achieved remarkable, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ACML 2024

Abstract:Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack in complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.

[AI-186] Eco-Aware Graph Neural Networks for Sustainable Recommendations

Link: https://arxiv.org/abs/2410.09514
Authors: Antonio Purificato, Fabrizio Silvestri
Keywords: alleviating information overload, Graph Neural Networks, Recommender systems play, providing personalized recommendations, personalized recommendations tailored
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 tables, 3 figures, RecSoGood Workshop

Abstract:Recommender systems play a crucial role in alleviating information overload by providing personalized recommendations tailored to users’ preferences and interests. Recently, Graph Neural Networks (GNNs) have emerged as a promising approach for recommender systems, leveraging their ability to effectively capture complex relationships and dependencies between users and items by representing them as nodes in a graph structure. In this study, we investigate the environmental impact of GNN-based recommender systems, an aspect that has been largely overlooked in the literature. Specifically, we conduct a comprehensive analysis of the carbon emissions associated with training and deploying GNN models for recommendation tasks. We evaluate the energy consumption and carbon footprint of different GNN architectures and configurations, considering factors such as model complexity, training duration, hardware specifications and embedding size. By addressing the environmental impact of resource-intensive algorithms in recommender systems, this study contributes to the ongoing efforts towards sustainable and responsible artificial intelligence, promoting the development of eco-friendly recommendation technologies that balance performance and environmental considerations. Code is available at: this https URL.

[AI-187] Dying Clusters Is All You Need – Deep Clustering With an Unknown Number of Clusters ICDM

Link: https://arxiv.org/abs/2410.09491
Authors: Collin Leiber, Niklas Strauß, Matthias Schubert, Thomas Seidl
Keywords: Finding meaningful groups, Finding meaningful, meaningful groups, number of clusters, important challenge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Sixth ICDM Workshop on Deep Learning and Clustering

Abstract:Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separated from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components. The code is available at: this https URL

[AI-188] Distilling Invariant Representations with Dual Augmentation

Link: https://arxiv.org/abs/2410.09474
Authors: Nikolaos Giakoumoglou, Tania Stathaki
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure, 3 tables. This paper presents preliminary results from a project that we have since discontinued, as our research focus has shifted to new directions

[AI-189] DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning

Link: https://arxiv.org/abs/2410.09472
Authors: Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen
Keywords:
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

[AI-190] -Fold Cross-Validation for energy-aware Machine Learning Evaluations

Link: https://arxiv.org/abs/2410.09463
Authors: Christopher Mahlich, Tobias Vente, Joeran Beel
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-191] VERITAS-NLI : Validation and Extraction of Reliable Information Through Automated Scraping and Natural Language Inference

Link: https://arxiv.org/abs/2410.09455
Authors: Arjun Shah, Hetansh Shah, Vedica Bafna, Charmi Khandor, Sindhu Nair
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, 15 pages, 7 figures

[AI-192] MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

Link: https://arxiv.org/abs/2410.09453
Authors: Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng
Keywords:
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and data are available at this https URL

[AI-193] MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning

Link: https://arxiv.org/abs/2410.09437
Authors: Yaming Yang, Dilixat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, Yunhai Tong
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 Pages, 4 Figures

[AI-194] Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets KR

Link: https://arxiv.org/abs/2410.09428
Authors: Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

[AI-195] Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset NEURIPS2024

Link: https://arxiv.org/abs/2410.09416
Authors: Haoming Lu, Feifei Zhong
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024 Workshop (EvalEval 2024)

[AI-196] FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs Responsiveness to Human Feedback

Link: https://arxiv.org/abs/2410.09412
Authors: Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, Wentao Zhang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[AI-197] CAMPHOR: Collaborative Agents for Multi-input Planning and High-Order Reasoning On Device

Link: https://arxiv.org/abs/2410.09407
Authors: Yicheng Fu, Raviteja Anantha, Jianpeng Cheng
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[AI-198] Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation

Link: https://arxiv.org/abs/2410.09403
Authors: Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhe Li, Zhenfei Yin, Wanli Ouyang, Nanqing Dong
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

[AI-199] A Novel Approach to Malicious Code Detection Using CNN-BiLSTM and Feature Fusion

Link: https://arxiv.org/abs/2410.09401
Authors: Lixia Zhang, Tianxu Liu, Kaihui Shen, Cheng Chen
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

[AI-200] Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes

Link: https://arxiv.org/abs/2410.09397
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments:

[AI-201] Mamba4Cast: Efficient Zero-Shot Time Series Forecasting with State Space Models

Link: https://arxiv.org/abs/2410.09385
Authors: Sathya Kamesh Bhethanabhotla, Omar Swelam, Julien Siems, David Salinas, Frank Hutter
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-202] Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Link: https://arxiv.org/abs/2410.09380
Authors: Ting Yu, Kunhao Fu, Shuhui Wang, Qingming Huang, Jun Yu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE Transactions on Circuits and Systems for Video Technology

[AI-203] Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Link: https://arxiv.org/abs/2410.09379
Authors: Ting Yu, Kunhao Fu, Jian Zhang, Qingming Huang, Jun Yu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Transactions on Image Processing

[AI-204] Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

Link: https://arxiv.org/abs/2410.09375
Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Notes:

[AI-205] Towards a Domain-Specific Modelling Environment for Reinforcement Learning

Link: https://arxiv.org/abs/2410.09368
Authors: Natalie Sinani, Sahil Salma, Paul Boutot, Sadaf Mustafiz
Keywords:
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: 24 pages

[AI-206] SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Link: https://arxiv.org/abs/2410.09362
Authors: Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-207] Green Recommender Systems: Optimizing Dataset Size for Energy-Efficient Algorithm Performance

Link: https://arxiv.org/abs/2410.09359
Authors: Ardalan Arabzadeh, Tobias Vente, Joeran Beel
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

[AI-208] Generative Subgraph Retrieval for Knowledge Graph-Grounded Dialog Generation EMNLP

Link: https://arxiv.org/abs/2410.09350
Authors: Jinyoung Park, Minseok Joo, Joo-Kyung Kim, Hyunwoo J. Kim
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: EMNLP (main)

[AI-209] Inference and Verbalization Functions During In-Context Learning EMNLP2024

Link: https://arxiv.org/abs/2410.09349
Authors: Junyi Tao, Xiaoyin Chen, Nelson F. Liu
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: EMNLP 2024 Findings

[AI-210] Contrastive Learning for Implicit Social Factors in Social Media Popularity Prediction

Link: https://arxiv.org/abs/2410.09345
Authors: Zhizhen Zhang, Ruihong Qiu, Xiaohui Xie
Keywords:
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Notes:

[AI-211] DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models

Link: https://arxiv.org/abs/2410.09344
Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

[AI-212] Advanced Gesture Recognition in Autism: Integrating YOLOv7 Video Augmentation and VideoMAE for Video Analysis

Link: https://arxiv.org/abs/2410.09339
Authors: Amit Kumar Singh, Trapti Shrivastava, Vrijendra Singh
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-213] Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Link: https://arxiv.org/abs/2410.09335
Authors: Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

[AI-214] Zero-shot Commonsense Reasoning over Machine Imagination EMNLP2024

Link: https://arxiv.org/abs/2410.09329
Authors: Hyuntae Park, Yeachan Kim, Jun-Hyung Park, SangKeun Lee (Korea University)
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 21 pages, 9 figures, EMNLP 2024 (Findings)

[AI-215] Token Pruning using a Lightweight Background Aware Vision Transformer NEURIPS2024

Link: https://arxiv.org/abs/2410.09324
Authors: Sudhakar Sah, Ravish Kumar, Honnesh Rohmetra, Ehsan Saboori
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 7 pages, 2 tables, 4 figures, FITML workshop @ NeurIPS 2024

[AI-216] Hey AI Can You Grade My Essay?: Automatic Essay Grading

Link: https://arxiv.org/abs/2410.09319
Authors: Maisha Maliha, Vishal Pramanik
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted in ICAAAIML (4th International Conference on Advances and Applications of Artificial Intelligence and Machine Learning) 2023

[AI-217] llinstruct: An Instruction-tuned model for English Language Proficiency Assessments

Link: https://arxiv.org/abs/2410.09314
Authors: Debanjan Ghosh, Sophia Chan
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

[AI-218] Towards Multi-Modal Animal Pose Estimation: An In-Depth Analysis

Link: https://arxiv.org/abs/2410.09312
Authors: Qianyi Deng, Oishi Deb, Amir Patel, Christian Rupprecht, Philip Torr, Niki Trigoni, Andrew Markham
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 35 pages, 5 figures, 8 tables

[AI-219] Graph Neural Alchemist: An innovative fully modular architecture for time series-to-graph classification

Link: https://arxiv.org/abs/2410.09307
Authors: Paulo Coelho, Raul Araju, Luís Ramos, Samir Saliba, Renato Vimieiro
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-220] Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Link: https://arxiv.org/abs/2410.09302
Authors: Guanlin Liu, Kaixuan Ji, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, Lin Yan
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

[AI-221] Nudging: Inference-time Alignment via Model Collaboration

Link: https://arxiv.org/abs/2410.09300
Authors: Yu Fei, Yasaman Razeghi, Sameer Singh
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-222] Refinements on the Complementary PDB Construction Mechanism

Link: https://arxiv.org/abs/2410.09297
Authors: Yufeng Zou
Keywords:
Categories: Artificial Intelligence (cs.AI)
Notes:

[AI-223] Natural Language Counterfactual Explanations for Graphs Using Large Language Models

Link: https://arxiv.org/abs/2410.09295
Authors: Flavio Giorgi, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

[AI-224] Ranking over Regression for Bayesian Optimization and Molecule Selection

Link: https://arxiv.org/abs/2410.09290
Authors: Gary Tom, Stanley Lo, Samantha Corapi, Alan Aspuru-Guzik, Benjamin Sanchez-Lengeling
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: 14 + 4 pages, 5 + 3 figures

[AI-225] AuD-Former: A Hierarchical Transformer Network for Multimodal Audio-Based Disease Prediction

Link: https://arxiv.org/abs/2410.09289
Authors: Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, Byung-Cheol Min
Keywords:
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes:

[AI-226] Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Link: https://arxiv.org/abs/2410.09286
Authors: Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
Keywords:
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

[AI-227] Articulated Animal AI: An Environment for Animal-like Cognition in a Limbed Agent NEURIPS2024

Link: https://arxiv.org/abs/2410.09275
Authors: Jeremy Lucas, Isabeau Prémont-Schwarz
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 8 pages, accepted to Workshop on Open-World Agents (OWA-2024) at NeurIPS 2024 in Vancouver, Canada

[AI-228] One Step at a Time: Combining LLMs and Static Analysis to Generate Next-Step Hints for Programming Tasks

Link: https://arxiv.org/abs/2410.09268
Authors: Anastasiia Birillo, Elizaveta Artser, Anna Potriasaeva, Ilya Vlasov, Katsiaryna Dzialets, Yaroslav Golubev, Igor Gerasimov, Hieke Keuning, Timofey Bryksin
Keywords:
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Notes: 12 pages, 5 figures

[AI-229] ReasonPlanner: Enhancing Autonomous Planning in Dynamic Environments with Temporal Knowledge Graphs and LLMs

Link: https://arxiv.org/abs/2410.09252
Authors: Minh Pham Dinh, Munira Syed, Michael G Yankoski, Trenton W. Ford
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes:

[AI-230] Quantum-Trained Convolutional Neural Network for Deepfake Audio Detection

Link: https://arxiv.org/abs/2410.09250
Authors: Chu-Hsuan Abraham Lin, Chen-Yu Liu, Samuel Yen-Chi Chen, Kuan-Cheng Chen
Keywords:
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Quantum Physics (quant-ph)
Notes:

[AI-231] Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Link: https://arxiv.org/abs/2410.09247
Authors: Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

[AI-232] DFM: Interpolant-free Dual Flow Matching NEURIPS2024

Link: https://arxiv.org/abs/2410.09246
Authors: Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: Extended Abstract Track at the Unifying Representations in Neural Models Workshop (NeurIPS 2024)

[AI-233] Using off-the-shelf LLMs to query enterprise data by progressively revealing ontologies

Link: https://arxiv.org/abs/2410.09244
Authors: C. Civili, E. Sherkhonov, R.E.K. Stirewalt
Keywords:
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Notes: 5 pages

[AI-234] Improving semantic understanding in speech language models via brain-tuning ICLR2025

Link: https://arxiv.org/abs/2410.09230
Authors: Omer Moussa, Dietrich Klakow, Mariya Toneva
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Under Review at ICLR 2025

[AI-235] The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling

Link: https://arxiv.org/abs/2410.09223
Authors: Ruochen Zhang, Qinan Yu, Matianyu Zang, Carsten Eickhoff, Ellie Pavlick
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

[AI-236] Continual Learning with Neuromorphic Computing: Theories, Methods, and Applications

Link: https://arxiv.org/abs/2410.09218
Authors: Mishal Fatima Minhas, Rachmad Vidya Wicaksana Putra, Falah Awwad, Osman Hasan, Muhammad Shafique
Keywords:
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 71 pages, 31 figures, 6 tables

[AI-237] P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Link: https://arxiv.org/abs/2410.09207
Authors: Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

[AI-238] pyhgf: A neural network library for predictive coding

Link: https://arxiv.org/abs/2410.09206
Authors: Nicolas Legrand, Lilian Weber, Peter Thestrup Waade, Anna Hedvig Møller Daugaard, Mojtaba Khodadadi, Nace Mikuš, Chris Mathys
Keywords:
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Notes:

[AI-239] Encoding Agent Trajectories as Representations with Sequence Transformers

Link: https://arxiv.org/abs/2410.09204
Authors: Athanasios Tsiligkaridis, Nicholas Kalinowski, Zhongheng Li, Elizabeth Hou
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 12 pages, to be presented at GeoAI workshop at ACM SigSpatial 2024

[AI-240] AI security and cyber risk in IoT systems

Link: https://arxiv.org/abs/2410.09194
Authors: Petar Radanliev, David De Roure, Carsten Maple, Jason R.C. Nurse, Razvan Nicolescu, Uchenna Ani
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes:

[AI-241] Synthetic Students: A Comparative Study of Bug Distribution Between Large Language Models and Computing Students

Link: https://arxiv.org/abs/2410.09193
Authors: Stephen MacNeil, Magdalena Rogalska, Juho Leinonen, Paul Denny, Arto Hellas, Xandria Crosland
Keywords:
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

[AI-242] Automated Rewards via LLM-Generated Progress Functions

Link: https://arxiv.org/abs/2410.09187
Authors: Vishnu Sarukkai, Brennan Shacklett, Zander Majercik, Kush Bhatia, Christopher Ré, Kayvon Fatahalian
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 26 pages, 5 figures

[AI-243] Learning Algorithms Made Simple

Link: https://arxiv.org/abs/2410.09186
Authors: Noorbakhsh Amiri Golilarz, Elias Hossain, Abdoljalil Addeh, Keyan Alexander Rahimi
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-244] Can a large language model be a gaslighter?

Link: https://arxiv.org/abs/2410.09181
Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 10/26 (Main Body/Total), 8 figures

[AI-245] Resource-Constrained Heuristic for Max-SAT

Link: https://arxiv.org/abs/2410.09173
Authors: Brian Matejek, Daniel Elenius, Cale Gentry, David Stoker, Adam Cobb
Keywords:
Categories: Artificial Intelligence (cs.AI)
Notes:

[AI-246] ACER: Automatic Language Model Context Extension via Retrieval

Link: https://arxiv.org/abs/2410.09141
Authors: Luyu Gao, Yunyi Zhang, Jamie Callan
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes:

[AI-247] Multi-Agent Actor-Critics in Autonomous Cyber Defense

Link: https://arxiv.org/abs/2410.09134
Authors: Mingjun Wang, Remington Dechene
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: 6 pages, 2 figures

[AI-248] When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning

Link: https://arxiv.org/abs/2410.09132
Authors: Hao Yan, Chaozhuo Li, Zhigang Yu, Jun Yin, Ruochen Liu, Peiyan Zhang, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang, Senzhang Wang
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

[AI-249] nextlocllm: next location prediction using LLMs

Link: https://arxiv.org/abs/2410.09129
Authors: Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 19 pages

[AI-250] TIGER: Temporally Improved Graph Entity Linker

Link: https://arxiv.org/abs/2410.09128
Authors: Pengyu Zhang, Congfeng Cao, Paul Groth
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

[AI-251] CYCLE: Cross-Year Contrastive Learning in Entity-Linking

Link: https://arxiv.org/abs/2410.09127
Authors: Pengyu Zhang, Congfeng Cao, Klim Zaporojets, Paul Groth
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-252] Convolutional Neural Network Design and Evaluation for Real-Time Multivariate Time Series Fault Detection in Spacecraft Attitude Sensors

Link: https://arxiv.org/abs/2410.09126
Authors: Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill
Keywords:
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Notes: submitted to Advances in Space Research

[AI-253] Training on Fake Labels: Mitigating Label Leakage in Split Learning via Secure Dimension Transformation

Link: https://arxiv.org/abs/2410.09125
Authors: Yukun Jiang, Peiran Wang, Chengguo Lin, Ziyue Huang, Yong Cheng
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes:

[AI-254] SoK: Verifiable Cross-Silo FL

Link: https://arxiv.org/abs/2410.09124
Authors: Aleksei Korneev (CRIStAL, MAGNET), Jan Ramon (CRIStAL, MAGNET)
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes:

[AI-255] Context-Aware Adapter Tuning for Few-Shot Relation Learning in Knowledge Graphs

Link: https://arxiv.org/abs/2410.09123
Authors: Ran Liu, Zhongzhou Liu, Xiaoli Li, Yuan Fang
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 pages

[AI-256] FSW-GNN: A Bi-Lipschitz WL-Equivalent Graph Neural Network

Link: https://arxiv.org/abs/2410.09118
Authors: Yonatan Sverdlov, Yair Davidson, Nadav Dym, Tal Amir
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-257] REDO: Execution-Free Runtime Error Detection for COding Agents

Link: https://arxiv.org/abs/2410.09117
Authors: Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler
Keywords:
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: 27 pages, 13 figures, 6 tables

[AI-258] Optimizing Hard-to-Place Kidney Allocation: A Machine Learning Approach to Center Ranking

Link: https://arxiv.org/abs/2410.09116
Authors: Sean Berry, Berk Gorgulu, Sait Tunc, Mucahit Cevik, Matthew J Ellis
Keywords:
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-259] Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Link: https://arxiv.org/abs/2410.09114
Authors: Andrey Anurin, Jonathan Ng, Kibo Schaffer, Ziyue Wang, Jason Schreiber, Esben Kran
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Notes: this https URL

[AI-260] M2-ViT: Accelerating Hybrid Vision Transformers with Two-Level Mixed Quantization

Link: https://arxiv.org/abs/2410.09113
Authors: Yanbiao Liang, Huihong Shi, Zhongfeng Wang
Keywords:
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Notes:

[AI-261] HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction NEURIPS2024

Link: https://arxiv.org/abs/2410.09112
Authors: Qianyue Hao, Jingyang Fan, Fengli Xu, Jian Yuan, Yong Li
Keywords:
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: NeurIPS 2024 paper

[AI-262] Compressing high-resolution data through latent representation encoding for downscaling large-scale AI weather forecast model

Link: https://arxiv.org/abs/2410.09109
Authors: Qian Liu, Bing Gong, Xiaoran Zhuang, Xiaohui Zhong, Zhiming Kang, Hao Li
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Atmospheric and Oceanic Physics (physics.ao-ph)
Notes: 19 pages

[AI-263] Federated Learning for Data Market: Shapley-UCB for Seller Selection and Incentives

Link: https://arxiv.org/abs/2410.09107
Authors: Kongyang Chen, Zeming Xu
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Notes:

[AI-264] Parameter-Efficient Fine-Tuning via Selective Discrete Cosine Transform

Link: https://arxiv.org/abs/2410.09103
Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Anuj Pathania
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

[AI-265] Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

Link: https://arxiv.org/abs/2410.09102
Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Preprint

[AI-266] Adaptive Active Inference Agents for Heterogeneous and Lifelong Federated Learning

Link: https://arxiv.org/abs/2410.09099
Authors: Anastasiya Danilenka, Alireza Furutanpey, Victor Casamayor Pujol, Boris Sedlak, Anna Lackinger, Maria Ganzha, Marcin Paprzycki, Schahram Dustdar
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: 11 pages, double column, 15 figures, 2 tables

[AI-267] Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations

Link: https://arxiv.org/abs/2410.09097
Authors: Tarun Raheja, Nilay Pochhi
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 16 pages, 2 figures

[AI-268] Reflections on Disentanglement and the Latent Space

Link: https://arxiv.org/abs/2410.09094
Authors: Ludovica Schaerf
Keywords:
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to xCoAx 2024 as part of the School of X

[AI-269] Automating Bibliometric Analysis with Sentence Transformers and Retrieval-Augmented Generation (RAG): A Pilot Study in Semantic and Contextual Search for Customized Literature Characterization for High-Impact Urban Research

Link: https://arxiv.org/abs/2410.09090
Authors: Haowen Xu, Xueping Li, Jose Tupayachi, Jianming (Jamie) Lian, Femi Omitaomu
Keywords:
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

[AI-270] Different Cybercrimes and their Solution for Common People

Link: https://arxiv.org/abs/2410.09089
Authors: S. Tamang, G. S. Chandana, B. K. Roy
Keywords:
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes:

[AI-271] The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024

Link: https://arxiv.org/abs/2410.09088
Authors: Yinan Han, Qingyuan Jiang, Hongming Mei, Yang Yang, Jinhui Tang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

[AI-272] Mechanistic? EMNLP2024

Link: https://arxiv.org/abs/2410.09087
Authors: Naomi Saphra, Sarah Wiegreffe
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Equal contribution. Position paper. Accepted for presentation at the BlackBoxNLP workshop at EMNLP 2024

[AI-273] AI in Archival Science – A Systematic Review

Link: https://arxiv.org/abs/2410.09086
Authors: Gaurav Shinde, Tiana Kirstein, Souvick Ghosh, Patricia C. Franks
Keywords:
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Notes:

[AI-274] Diagnosing Robotics Systems Issues with Large Language Models

Link: https://arxiv.org/abs/2410.09084
Authors: Jordis Emilia Herrmann, Aswath Mandakath Gopinath, Mikael Norrlof, Mark Niklas Müller
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Notes:

[AI-275] Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Link: https://arxiv.org/abs/2410.09083
Authors: Lu Chen, Yuxuan Huang, Yixing Li, Yaohui Jin, Shuai Zhao, Zilong Zheng, Quanshi Zhang
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

[AI-276] Semantic Environment Atlas for Object-Goal Navigation

Link: https://arxiv.org/abs/2410.09081
Authors: Nuri Kim, Jeongho Park, Mineui Hong, Songhwai Oh
Keywords:
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 30 pages

[AI-277] Leveraging Social Determinants of Health in Alzheimer's Research Using LLM-Augmented Literature Mining and Knowledge Graphs

Link: https://arxiv.org/abs/2410.09080
Authors: Tianqi Shang, Shu Yang, Weiqing He, Tianhua Zhai, Dawei Li, Bojian Hou, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Li Shen
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes:

[AI-278] BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09079
Authors: Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to EMNLP 2024 (Findings)

[AI-279] Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs

Link: https://arxiv.org/abs/2410.09078
Authors: Tomas Bueno Momcilovic, Dian Balta, Beat Buesser, Giulio Zizzo, Mark Purcell
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Notes: Accepted in the VECOMP 2024 workshop

[AI-280] Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning ACL2024

Link: https://arxiv.org/abs/2402.18344
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted as a long paper to ACL 2024 Main, 25 pages, 22 figures

[AI-281] Arrhythmia Classification Using Graph Neural Networks Based on Correlation Matrix

Link: https://arxiv.org/abs/2410.10758
Authors: Seungwoo Han
Keywords:
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Notes:

[AI-282] REHRSeg: Unleashing the Power of Self-Supervised Super-Resolution for Resource-Efficient 3D MRI Segmentation

Link: https://arxiv.org/abs/2410.10097
Authors: Zhiyun Song, Yinjie Zhao, Xiaomin Li, Manman Fei, Xiangyu Zhao, Mengjun Liu, Cunjian Chen, Chung-Hsing Yeh, Qian Wang, Guoyan Zheng, Songtao Ai, Lichi Zhang
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

[AI-283] Enhancing Peer Review in Astronomy: A Machine Learning and Optimization Approach to Reviewer Assignments for ALMA

Link: https://arxiv.org/abs/2410.10009
Authors: John M. Carpenter, Andrea Corvillón, Nihar B. Shah
Keywords:
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Notes: 19 pages, 5 figures, submitted to PASP

[AI-284] Predicting Molecular Ground-State Conformation via Conformation Optimization

Link: https://arxiv.org/abs/2410.09795
Authors: Fanmeng Wang, Minjie Cheng, Hongteng Xu
Keywords:
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Notes:

[AI-285] Universal scaling laws in quantum-probabilistic machine learning by tensor network towards interpreting representation and generalization powers

Link: https://arxiv.org/abs/2410.09703
Authors: Sheng-Chen Bai, Shi-Ju Ran
Keywords:
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Notes: 5 pages (main text) + 3 pages (appendices), 5 figures (main text) + 4 figures (appendices)

[AI-286] Neural Solver Selection for Combinatorial Optimization

Link: https://arxiv.org/abs/2410.09693
Authors: Chengrui Gao, Haopu Shang, Ke Xue, Chao Qian
Keywords:
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-287] On Goodhart's law with an application to value alignment

Link: https://arxiv.org/abs/2410.09638
Authors: El-Mahdi El-Mhamdi, Lê-Nguyên Hoang
Keywords:
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Notes: 47 pages, 11 figures

[AI-288] Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?

Link: https://arxiv.org/abs/2410.09636
Authors: Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 5 pages, 3 figures, accepted for APSIPA 2024 ASC

[AI-289] Distribution-Aware Mean Estimation under User-level Local Differential Privacy

Link: https://arxiv.org/abs/2410.09506
Authors: Corentin Pla, Hugo Richard, Maxime Vono
Keywords:
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 25 pages, 1 figure

[AI-290] 3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information

Link: https://arxiv.org/abs/2410.09388
Authors: Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel
Keywords:
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

[AI-291] IceDiff: High Resolution and High-Quality Sea Ice Forecasting with Generative Diffusion Prior

Link: https://arxiv.org/abs/2410.09111
Authors: Jingyi Xu, Siwei Tu, Weidong Yang, Shuhao Li, Keyi Liu, Yeqi Luo, Lipeng Ma, Ben Fei, Lei Bai
Keywords:
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 9 pages, 4 figures

[AI-292] Artificial intelligence techniques in inherited retinal diseases: A review

Link: https://arxiv.org/abs/2410.09105
Authors: Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

[AI-293] Volatility Forecasting in Global Financial Markets Using TimeMixer

Link: https://arxiv.org/abs/2410.09062
Authors: Alex Li
Keywords:
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 20 pages and 2 figures

[AI-294] Stellar Karaoke: deep blind separation of terrestrial atmospheric effects out of stellar spectra by velocity whitening

Link: https://arxiv.org/abs/2301.00313
Authors: Nima Sedaghat, Brianna M. Smart, J. Bryce Kalmbach, Erin L. Howard, Hamidreza Amindavar
Keywords:
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Notes:

[AI-295] Machines Learn to Infer Stellar Parameters Just by Looking at a Large Number of Spectra

Link: https://arxiv.org/abs/2009.12872
Authors: Nima Sedaghat, Martino Romaniello, Jonathan E. Carrick, François-Xavier Pineau
Keywords:
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Notes:

Computer Vision

[CV-0] Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models

Link: https://arxiv.org/abs/2410.10821
Authors: Jingzhi Bao, Xueting Li, Ming-Hsuan Yang
Keywords: playing a crucial, role in movies, computer vision, vision and graphics, efficiency in animation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract: 3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use, playing a crucial role in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for mesh sequences remains labor-intensive for professional artists. On the other hand, while video diffusion models excel at text-driven video generation, they often lack 3D geometry awareness and struggle with achieving multi-view consistent texturing for 3D meshes. In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views through latent aggregation in the UV space. To ensure temporal consistency, we leverage prior knowledge from a conditional video generation model for texture synthesis. However, straightforwardly combining the video diffusion model and the UV texture aggregation leads to blurry results. We analyze the underlying causes and propose a simple yet effective modification to the DDIM sampling process to address this issue. Additionally, we introduce a reference latent texture to strengthen the correlation between frames during the denoising process. To the best of our knowledge, Tex4D is the first method specifically designed for 4D scene texturing. Extensive experiments demonstrate its superiority in producing multi-view and multi-frame consistent videos based on untextured mesh sequences.

[CV-1] TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Link: https://arxiv.org/abs/2410.10818
Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
Keywords: temporal understanding, temporal, fine-grained temporal, Understanding, video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models’ temporal reasoning capabilities. Both dataset and evaluation code will be made available.
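The abstract names Multiple Binary Accuracy (MBA) but does not define it. The sketch below is one plausible reading, stated as an assumption: each multi-choice question is split into several binary sub-questions, and the question only counts as solved if every sub-question is answered correctly. The function name and the input layout are hypothetical and are not taken from the TemporalBench code.

```python
from typing import Dict, List


def multiple_binary_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose binary sub-questions are ALL correct.

    `results` maps a question id to one bool per binary sub-question.
    This scoring rule is an assumed reading of MBA, not the official one.
    """
    if not results:
        return 0.0
    # A question with no sub-answers is treated as unsolved.
    solved = sum(bool(answers) and all(answers) for answers in results.values())
    return solved / len(results)


# Hypothetical per-question outcomes for three questions.
example = {"q1": [True, True], "q2": [True, False], "q3": [True]}
mba = multiple_binary_accuracy(example)

# Plain per-sub-question accuracy, for comparison.
flat = [ok for answers in example.values() for ok in answers]
binary_accuracy = sum(flat) / len(flat)
```

Under this reading a model cannot score well by spotting one easy cue: here it gets 4 of 5 sub-questions right but solves only 2 of 3 questions.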

[CV-2] When Does Perceptual Alignment Benefit Vision Representations?

Link: https://arxiv.org/abs/2410.10817
Authors: Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
Keywords: including scene layout, subject location, scene layout, camera pose, diverse visual attributes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: S.S. and S.F. contributed equally. Website: this http URL

Abstract:Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability across diverse computer vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard vision benchmarks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is widely preserved on other tasks, including specialized out-of-distribution domains such as in medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.

[CV-3] LVD-2M: A Long-take Video Dataset with Temporally Dense Captions NEURIPS2024

Link: https://arxiv.org/abs/2410.10816
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
Keywords: long video generation, video generation models, video generation, generation models, generation models heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024 Dataset and Benchmark Track. Project page: this https URL. Code: this https URL

Abstract:The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.

[CV-4] Depth Any Video with Scalable Synthetic Data

Link: https://arxiv.org/abs/2410.10815
Authors: Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He
Keywords: ground truth data, scalable ground truth, leading to inconsistent, unreliable results, Video depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

[CV-5] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Link: https://arxiv.org/abs/2410.10812
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
Keywords: Hybrid Autoregressive Transformer, Autoregressive Transformer, introduce Hybrid Autoregressive, rivaling diffusion models, visual generation model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Demo: this https URL. The first two authors contributed equally to this work.

Abstract:We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at this https URL.
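As a quick sanity check on the figures quoted in the abstract, the relative FID improvements can be recomputed from the before/after values. FID is a lower-is-better metric, so the improvement is the fractional reduction. The helper name below is made up for illustration; only the four FID values come from the abstract.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (before - after) / before


# Generation FID on MJHQ-30K: 7.85 -> 5.38 (values from the abstract).
generation_gain = relative_improvement(7.85, 5.38)

# Reconstruction FID: 2.11 -> 0.30 (values from the abstract).
reconstruction_gain = relative_improvement(2.11, 0.30)

print(f"generation FID improvement:     {generation_gain:.1%}")
print(f"reconstruction FID improvement: {reconstruction_gain:.1%}")
```

The generation figure comes out at roughly 31%, matching the abstract; the reconstruction FID drops by roughly 86%, a figure the abstract does not state explicitly.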

[CV-6] Deep Linear Probe Generators for Weight Space Learning

Link: https://arxiv.org/abs/2410.10811
Authors: Jonathan Kahana, Eliahu Horwitz, Imri Shuval, Yedid Hoshen
Keywords: space learning aims, Weight space learning, neural network, generalization error, aims to extract
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Weight space learning aims to extract information about a neural network, such as its training dataset or generalization error. Recent approaches learn directly from model weights, but this presents many challenges as weights are high-dimensional and include permutation symmetries between neurons. An alternative approach, Probing, represents a model by passing a set of learned inputs (probes) through the model, and training a predictor on top of the corresponding outputs. Although probing is typically not used as a standalone approach, our preliminary experiment found that a vanilla probing baseline worked surprisingly well. However, we discover that current probe learning strategies are ineffective. We therefore propose Deep Linear Probe Generators (ProbeGen), a simple and effective modification to probing approaches. ProbeGen adds a shared generator module with a deep linear architecture, providing an inductive bias towards structured probes, thus reducing overfitting. While simple, ProbeGen performs significantly better than the state-of-the-art and is very efficient, requiring between 30 and 1000 times fewer FLOPs than other top approaches.
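To make "deep linear architecture" concrete, here is a minimal NumPy sketch of a generator built from stacked linear layers with no activations between them. All dimensions, the layer count, and the function name are assumptions chosen for illustration; this is not the ProbeGen implementation. It shows one known property of such stacks: they are functionally equivalent to a single matrix, so any benefit has to come from how the factorized form behaves during training rather than from extra expressiveness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 latent codes of width 8 mapped to probes of width 32.
latent_dim, hidden_dim, probe_dim = 8, 16, 32
layers = [
    rng.standard_normal((latent_dim, hidden_dim)),
    rng.standard_normal((hidden_dim, hidden_dim)),
    rng.standard_normal((hidden_dim, probe_dim)),
]
latents = rng.standard_normal((4, latent_dim))


def deep_linear_generator(z: np.ndarray, weights: list) -> np.ndarray:
    """Apply linear layers back to back, with no nonlinearity in between."""
    out = z
    for w in weights:
        out = out @ w
    return out


probes = deep_linear_generator(latents, layers)

# The whole stack collapses to one (latent_dim x probe_dim) matrix.
collapsed = layers[0] @ layers[1] @ layers[2]
same_function = np.allclose(probes, latents @ collapsed)
```

In a probing setup, `probes` would be fed through the model under study and a predictor trained on the outputs; that part is omitted here.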

[CV-7] TrajDiffuse: A Conditional Diffusion Model for Environment-Aware Trajectory Prediction ICPR

Link: https://arxiv.org/abs/2410.10804
Authors: Qingze (Tony) Liu, Danrui Li, Samuel S. Sohn, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic
Keywords: Accurate prediction, human or vehicle, vehicle trajectories, trajectories with good, captures their stochastic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication in the proceedings of the 2024 International Conference on Pattern Recognition (ICPR)

Abstract:Accurate prediction of human or vehicle trajectories with good diversity that captures their stochastic nature is an essential task for many applications. However, many trajectory prediction models produce unreasonable trajectory samples that focus on improving diversity or accuracy while neglecting other key requirements, such as collision avoidance with the surrounding environment. In this work, we propose TrajDiffuse, a planning-based trajectory prediction method using a novel guided conditional diffusion model. We formulate the trajectory prediction problem as a denoising inpainting task and design a map-based guidance term for the diffusion process. TrajDiffuse is able to generate trajectory predictions that match or exceed the accuracy and diversity of the SOTA, while adhering almost perfectly to environmental constraints. We demonstrate the utility of our model through experiments on the nuScenes and PFSD datasets and provide an extensive benchmark analysis against the SOTA methods.

[CV-8] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Link: https://arxiv.org/abs/2410.10803
Authors: Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, Jiajun Wu
Keywords: goal for roboticists, Humanoid robots capable, Diffusion Policy, autonomous operation, Humanoid robots
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project website: this https URL

Abstract:Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point-cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab. Videos are available at: this https URL

[CV-9] Boosting Camera Motion Control for Video Diffusion Transformers

Link: https://arxiv.org/abs/2410.10802
Authors: Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, Chun-Hao Paul Huang
Keywords: Recent advancements, camera, camera control, enhanced the quality, Recent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT), the preferred architecture for large-scale video generation, suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning method rather than on the camera pose representation, as is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.
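The abstract says Camera Motion Guidance builds on classifier-free guidance but gives no formula. The sketch below shows only the standard classifier-free guidance combination, in which a conditional noise prediction is extrapolated away from an unconditional one. Treating the camera pose as the dropped condition is an assumption made here for illustration, and the variable names and the scale value are hypothetical; the exact CMG formulation may differ.

```python
import numpy as np


def classifier_free_guidance(
    eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float
) -> np.ndarray:
    """Standard CFG: move from the unconditional prediction toward the
    conditional one, overshooting when scale > 1."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy noise predictions standing in for a denoiser's outputs
# (assumption: "cond" = with camera pose, "uncond" = camera pose dropped).
eps_without_camera = np.zeros(3)
eps_with_camera = np.ones(3)

guided = classifier_free_guidance(eps_without_camera, eps_with_camera, scale=4.0)
```

A scale of 0 reproduces the unconditional prediction and a scale of 1 reproduces the conditional one; larger values amplify whatever the condition contributes.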

[CV-10] Towards Foundation Models for 3D Vision: How Close Are We?

Link: https://arxiv.org/abs/2410.10799
Authors: Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, Thomas L. Griffiths
Keywords: remains unsolved, Visual Question Answering, complex challenge, challenge that remains, Building
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision.

[CV-11] MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Link: https://arxiv.org/abs/2410.10798
Authors: Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei Zhai, Yang Cao, Zheng-Jun Zha
Keywords: multi-modal large language, large language models, Recent advancements, large language, propelled the development
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model’s hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates far superior performance to other joint multi-modal models, matching the method that employs a pretrained CLIP vision encoder, while also being able to generate high-quality images. We also show that our method scales with larger data and model size.

[CV-12] Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Link: https://arxiv.org/abs/2410.10792
Authors: Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
Keywords: transform random noise, models transform random, Generative models transform, transform images back, transform random
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Preprint

Abstract:Generative models transform random noise into images; their inversion aims to transform images back to structured noise for recovery and editing. This paper addresses two key tasks: (i) inversion and (ii) editing of a real image using stochastic equivalents of rectified flow models (such as Flux). Although Diffusion Models (DMs) have recently dominated the field of generative modeling for images, their inversion presents faithfulness and editability challenges due to nonlinearities in drift and diffusion. Existing state-of-the-art DM inversion approaches rely on training of additional parameters or test-time optimization of latent variables; both are expensive in practice. Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator. We prove that the resulting vector field is equivalent to a rectified stochastic differential equation. Additionally, we extend our framework to design a stochastic sampler for Flux. Our inversion method allows for state-of-the-art performance in zero-shot inversion and editing, outperforming prior works in stroke-to-image synthesis and semantic image editing, with large-scale human evaluations confirming user preference.

[CV-13] Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

Link: https://arxiv.org/abs/2410.10791
Authors: Tim Broedermann, Christos Sakaridis, Yuqian Fu, Luc Van Gool
Keywords: Leveraging multiple sensors, Leveraging multiple, robust semantic perception, strengths and weaknesses, type has complementary
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single, shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. We set the new state of the art with CAFuser on the MUSES dataset with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, ranking first on the public benchmarks.

[CV-14] Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes

Link: https://arxiv.org/abs/2410.10790
Authors: Jianqi Chen, Panwen Hu, Xiaojun Chang, Zhenwei Shi, Michael Christian Kampffmeyer, Xiaodan Liang
Keywords: Recent advancements, unified system capable, human motion synthesis, motion, Signed Distance Function
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code Page: this https URL

Abstract:Recent advancements in human motion synthesis have focused on specific types of motions, such as human-scene interaction, locomotion, or human-human interaction; however, there is a lack of a unified system capable of generating a diverse combination of motion types. In response, we introduce Sitcom-Crafter, a comprehensive and extendable system for human motion generation in 3D space, which can be guided by extensive plot contexts to enhance workflow efficiency for anime and game designers. The system is comprised of eight modules, three of which are dedicated to motion generation, while the remaining five are augmentation modules that ensure consistent fusion of motion sequences and system functionality. Central to the generation modules is our novel 3D scene-aware human-human interaction module, which addresses collision issues by synthesizing implicit 3D Signed Distance Function (SDF) points around motion spaces, thereby minimizing human-scene collisions without additional data collection costs. Complementing this, our locomotion and human-scene interaction modules leverage existing methods to enrich the system’s motion generation capabilities. Augmentation modules encompass plot comprehension for command generation, motion synchronization for seamless integration of different motion types, hand pose retrieval to enhance motion realism, motion collision revision to prevent human collisions, and 3D retargeting to ensure visual fidelity. Experimental evaluations validate the system’s ability to generate high-quality, diverse, and physically realistic motions, underscoring its potential for advancing creative workflows.

[CV-15] LiveXiv – A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Link: https://arxiv.org/abs/2410.10783
Authors: Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Chosen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
Keywords: shown outstanding utility, required world knowledge, multiple downstream tasks, downstream tasks, models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models’ true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (2.5%). Our dataset is available online on HuggingFace, and our code will be made available.

[CV-16] 3DArticCyclists: Generating Simulated Dynamic 3D Cyclists for Human-Object Interaction (HOI) and Autonomous Driving Applications

Link: https://arxiv.org/abs/2410.10782
Authors: Eduardo R. Corral-Soto, Yang Liu, Tongtong Cao, Yuan Ren, Liu Bingbing
Keywords: Embodied Artificial Intelligence, Artificial Intelligence, Embodied Artificial, human-centric scene understanding, scene understanding applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Human-object interaction (HOI) and human-scene interaction (HSI) are crucial for human-centric scene understanding applications in Embodied Artificial Intelligence (EAI), robotics, and augmented reality (AR). A common limitation faced in these research areas is the data scarcity problem: insufficient labeled human-scene object pairs on the input images, and limited interaction complexity and granularity between them. Recent HOI and HSI methods have addressed this issue by generating dynamic interactions with rigid objects. But more complex dynamic interactions, such as a human rider pedaling an articulated bicycle, have been unexplored. To address this limitation, and to enable research on complex dynamic human-articulated object interactions, in this paper we propose a method to generate simulated 3D dynamic cyclist assets and interactions. We designed a methodology for creating a new part-based multi-view articulated synthetic 3D bicycle dataset that we call 3DArticBikes that can be used to train NeRF and 3DGS-based 3D reconstruction methods. We then propose a 3DGS-based parametric bicycle composition model to assemble 8-DoF pose-controllable 3D bicycles. Finally, using dynamic information from cyclist videos, we build a complete synthetic dynamic 3D cyclist (rider pedaling a bicycle) by re-posing a selectable synthetic 3D person while automatically placing the rider onto one of our new articulated 3D bicycles using a proposed 3D Keypoint optimization-based Inverse Kinematics pose refinement. We present both qualitative and quantitative results, in which we compare our generated cyclists against those from a recent stable diffusion-based method.

[CV-17] ControlMM: Controllable Masked Motion Generation

Link: https://arxiv.org/abs/2410.10780
Authors: Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov
Keywords: Recent advances, enabled spatially controllable, control, motion, enabled spatially
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, despite achieving acceptable control precision, these models suffer from generation speed and fidelity limitations. To address these challenges, we propose ControlMM, a novel approach incorporating spatial control signals into the generative masked motion model. ControlMM achieves real-time, high-fidelity, and high-precision controllable motion generation simultaneously. Our approach introduces two key innovations. First, we propose masked consistency modeling, which ensures high-fidelity motion generation via random masking and reconstruction, while minimizing the inconsistency between the input control signals and the extracted control signals from the generated motion. To further enhance control precision, we introduce inference-time logit editing, which manipulates the predicted conditional motion distribution so that the generated motion, sampled from the adjusted distribution, closely adheres to the input control signals. During inference, ControlMM enables parallel and iterative decoding of multiple motion tokens, allowing for high-speed motion generation. Extensive experiments show that, compared to the state of the art, ControlMM delivers superior results in motion quality, with better FID scores (0.061 vs 0.271), and higher control precision (average error 0.0091 vs 0.0108). ControlMM generates motions 20 times faster than diffusion-based methods. Additionally, ControlMM unlocks diverse applications such as any joint any frame control, body part timeline control, and obstacle avoidance. Video visualization can be found at this https URL

[CV-18] UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation

Link: https://arxiv.org/abs/2410.10777
Authors: Lihe Yang, Zhen Zhao, Hengshuang Zhao
Keywords: Semi-supervised semantic segmentation, semantic segmentation capability, enhance semantic segmentation, learning rich visual, rich visual knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 18 tables, 10 figures

Abstract:Semi-supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch improves tremendously on its predecessors by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite the achieved progress, strangely, even in this flourishing era of numerous powerful vision models, almost all SSS works are still sticking to 1) using outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluation on simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) that are pre-trained on massive data. A simple update on the encoder (even using 2x fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak-to-strong consistency from V1, but requiring less training cost and providing consistently better results. Additionally, witnessing the gradually saturated performance on Pascal and Cityscapes, we call for a focus on more challenging benchmarks with complex taxonomy, such as ADE20K and COCO datasets. Code, models, and logs of all reported values are available at this https URL.
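For readers unfamiliar with weak-to-strong consistency regularization, the following NumPy sketch illustrates the general idea in its common FixMatch-style form: predictions on a weakly augmented view provide pseudo-labels, and the prediction on a strongly augmented view is trained to match them wherever the weak prediction is confident. This is a simplified illustration under stated assumptions, not the UniMatch V2 training code; the 0.95 threshold, the function names, and the toy logits are all hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def weak_to_strong_loss(
    weak_logits: np.ndarray, strong_logits: np.ndarray, threshold: float = 0.95
) -> float:
    """Cross-entropy of the strong view against pseudo-labels from the weak
    view, averaged over pixels whose weak confidence reaches the threshold."""
    weak_probs = np.exp(log_softmax(weak_logits))
    pseudo_labels = weak_probs.argmax(axis=-1)
    confident = weak_probs.max(axis=-1) >= threshold
    if not confident.any():
        return 0.0  # no pixel is trusted, so there is no unsupervised signal
    strong_log_probs = log_softmax(strong_logits)
    picked = np.take_along_axis(
        strong_log_probs, pseudo_labels[..., None], axis=-1
    )[..., 0]
    return float(-picked[confident].mean())


# Two "pixels", three classes. Only the first weak prediction is confident.
weak = np.array([[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
strong_agrees = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
strong_disagrees = np.array([[0.0, 10.0, 0.0], [0.0, 10.0, 0.0]])

loss_agree = weak_to_strong_loss(weak, strong_agrees)
loss_disagree = weak_to_strong_loss(weak, strong_disagrees)
```

The second pixel is ignored in both cases because its weak prediction is far below the threshold, so the loss depends only on whether the strong view agrees with the confident pseudo-label at the first pixel.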

[CV-19] Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Link: https://arxiv.org/abs/2410.10774
Authors: Dejia Xu,Yifan Jiang,Chen Huang,Liangchen Song,Thorsten Gernoth,Liangliang Cao,Zhangyang Wang,Hao Tang
Keywords: remarkable breakthroughs, recent years, videos, camera, Cavia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To our best knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: this https URL

[CV-20] Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.10773
Authors: Etai Littwin,Vimal Thilak,Anand Gopalakrishnan
Keywords: Image-based Joint-Embedding Predictive, Joint-Embedding Predictive Architecture, Image Modeling framework, Masked Image Modeling, Modeling framework
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024 Workshop on Self-Supervised Learning - Theory and Practice. Comments welcome!

Abstract:Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE) for representation learning using the Masked Image Modeling framework. IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. However, IJEPA relies on carefully designed context and target windows to avoid representational collapse. The encoder modules in IJEPA cannot adaptively modulate the type of predicted and/or target features based on the feasibility of the masked prediction task, as they are not given sufficient information about both context and targets. Based on the intuition that, in natural images, information has a strong spatial bias, with spatially local regions being highly predictive of one another compared to distant ones, we condition the target encoder and context encoder modules in IJEPA with the positions of the context and target windows, respectively. Our “conditional” encoders show performance gains on several image classification benchmark datasets, as well as improved robustness to context window size and improved sample efficiency during pretraining.

[CV-21] Adaptive Diffusion Terrain Generator for Autonomous Uneven Terrain Navigation

Link: https://arxiv.org/abs/2410.10766
Authors: Youwei Yu,Junhong Xu,Lantao Liu
Keywords: Model-free reinforcement learning, developing robust robot, robust robot control, robot control policies, control policies capable
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured terrains. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable terrains to facilitate continuous policy improvement. Existing methods of environment generation often rely on heuristics constrained by a set of parameters, limiting the diversity and realism. In this work, we introduce the Adaptive Diffusion Terrain Generator (ADTG), a novel method that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex terrains adaptive to the current policy. ADTG guides the diffusion model’s generation process through initial noise optimization, blending noise-corrupted terrains from existing training environments weighted by the policy’s performance in each corresponding environment. By manipulating the noise corruption level, ADTG seamlessly transitions between generating similar terrains for policy fine-tuning and novel ones to expand training diversity. Our experiments show that the policy trained by ADTG outperforms both procedurally generated and natural environments, along with popular navigation methods.
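
The initial-noise step described in the abstract (blending noise-corrupted terrains, weighted by policy performance) can be illustrated with a small sketch. It assumes the standard DDPM forward corruption `sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps` and a hypothetical weighting that favors terrains where the policy succeeds less often. The abstract does not state the weighting rule, so `blend_initial_noise`, the `1 - success_rate` weights, and the flat height-list terrain format are all assumptions.

```python
import math
import random


def corrupt(terrain, alpha_bar, rng):
    """DDPM-style forward corruption of a flat list of terrain heights."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * h + b * rng.gauss(0.0, 1.0) for h in terrain]


def blend_initial_noise(terrains, success_rates, alpha_bar, seed=0):
    """Blend noise-corrupted terrains into one starting point for denoising.

    Assumed weighting (not from the paper): terrains with a lower policy
    success rate receive a larger weight. Returns (blended, weights).
    """
    rng = random.Random(seed)
    raw = [1.0 - s for s in success_rates]
    total = sum(raw)
    if total > 0:
        weights = [r / total for r in raw]
    else:
        weights = [1.0 / len(raw)] * len(raw)  # fall back to uniform weights
    noisy = [corrupt(t, alpha_bar, rng) for t in terrains]
    n = len(terrains[0])
    blended = [sum(w * x[i] for w, x in zip(weights, noisy)) for i in range(n)]
    return blended, weights


# alpha_bar close to 1 keeps the result near existing terrains;
# alpha_bar close to 0 is mostly noise, which allows more novel terrains.
start, weights = blend_initial_noise([[0.0, 0.0], [1.0, 1.0]], [0.75, 0.25], 0.9)
```

The `alpha_bar` argument plays the role of the noise corruption level mentioned in the abstract: it trades off similarity to existing terrains against novelty.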

[CV-22] DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships ACM-MM2024

Link: https://arxiv.org/abs/2410.10751
Authors: Zhang Wan,Sheng Tang,Jiawei Wei,Ruize Zhang,Juan Cao
Keywords: receiving significant attention, achieved tremendous success, generation receiving significant, recent years, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM2024 Oral

Abstract:In recent years, diffusion models have achieved tremendous success in the field of video generation, with controllable video generation receiving significant attention. However, existing control methods still face two limitations: Firstly, control conditions (such as depth maps, 3D Mesh) are difficult for ordinary users to obtain directly. Secondly, it’s challenging to drive multiple objects through complex motions with multiple trajectories simultaneously. In this paper, we introduce DragEntity, a video generation model that utilizes entity representation for controlling the motion of multiple objects. Compared to previous methods, DragEntity offers two main advantages: 1) Our method is more user-friendly for interaction because it allows users to drag entities within the image rather than individual pixels. 2) We use entity representation to represent any object in the image, and multiple objects can maintain relative spatial relationships. Therefore, we allow multiple trajectories to control multiple objects in the image with different levels of complexity simultaneously. Our experiments validate the effectiveness of DragEntity, demonstrating its excellent performance in fine-grained control in video generation.

[CV-23] FlexGen: Flexible Multi-View Generation from Text and Image Inputs

Link: https://arxiv.org/abs/2410.10745
Authors: Xinli Xu,Wenhang Ge,Jiantao Lin,Jiawei Feng,Lie Xu,HanFeng Zhao,Shunsi Zhang,Ying-Cong Chen
Keywords: flexible framework designed, framework designed, text, text annotations, multi-view
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 figures

Abstract:In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, or a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate 3D-aware text annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information with spatial relationship. By integrating the control signal with proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts. Additionally, users can influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced multiple controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality. Project page: this https URL.

[CV-24] Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

Link: https://arxiv.org/abs/2410.10744
Authors: Hossein Mirzaei,Mackenzie W. Mathis
Keywords: critical real-world applications, OOD, compromising their reliability, real-world applications, significant advancements
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Code and pre-trained models are available at this https URL

Abstract:Despite significant advancements in out-of-distribution (OOD) detection, existing methods still struggle to maintain robustness against adversarial attacks, compromising their reliability in critical real-world applications. Previous studies have attempted to address this challenge by exposing detectors to auxiliary OOD datasets alongside adversarial training. However, the increased data complexity inherent in adversarial training, and the myriad of ways that OOD samples can arise during testing, often prevent these approaches from establishing robust decision boundaries. To address these limitations, we propose AROS, a novel approach leveraging neural ordinary differential equations (NODEs) with Lyapunov stability theorem in order to obtain robust embeddings for OOD detection. By incorporating a tailored loss function, we apply Lyapunov stability theory to ensure that both in-distribution (ID) and OOD data converge to stable equilibrium points within the dynamical system. This approach encourages any perturbed input to return to its stable equilibrium, thereby enhancing the model’s robustness against adversarial perturbations. To not use additional data, we generate fake OOD embeddings by sampling from low-likelihood regions of the ID data feature space, approximating the boundaries where OOD data are likely to reside. To then further enhance robustness, we propose the use of an orthogonal binary layer following the stable feature space, which maximizes the separation between the equilibrium points of ID and OOD samples. We validate our method through extensive experiments across several benchmarks, demonstrating superior performance, particularly under adversarial attacks. Notably, our approach improves robust detection performance from 37.8% to 80.1% on CIFAR-10 vs. CIFAR-100 and from 29.0% to 67.0% on CIFAR-100 vs. CIFAR-10.

[CV-25] DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model NEURIPS2024

Link: https://arxiv.org/abs/2410.10738
Authors: Yuqi Wang,Ke Cheng,Jiawei He,Qitai Wang,Hengchen Dai,Yuntao Chen,Fei Xia,Zhaoxiang Zhang
Keywords: gained increasing attention, increasing attention due, complex physical dynamics, gained increasing, increasing attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2024. Project page: this https URL

Abstract:Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.

[CV-26] Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

Link: https://arxiv.org/abs/2410.10733
Authors: Junyu Chen,Han Cai,Junsong Chen,Enze Xie,Shang Yang,Haotian Tang,Muyang Li,Yao Lu,Song Han
Keywords: present Deep Compression, Deep Compression Autoencoder, present Deep, Deep Compression, spatial compression ratio
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. First two authors contributed equally to this work

Abstract:We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder’s spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at this https URL.
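
The first technique, Residual Autoencoding, rests on a space-to-channel transform. The sketch below is a rough illustration: it rearranges an H x W x C feature map into (H/r) x (W/r) x (C*r*r) and adds it as a non-parametric shortcut to a learned branch, so the learned branch only has to model a residual. The nested-list layout and the helper names `space_to_channel` and `residual_downsample` are assumptions for this example. It also assumes the learned branch already outputs C*r*r channels, whereas a real model would usually need an extra step to match channel counts.

```python
def space_to_channel(x, r=2):
    """Rearrange an H x W x C nested list into (H/r) x (W/r) x (C*r*r)."""
    h, w = len(x), len(x[0])
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            cell = []
            for di in range(r):
                for dj in range(r):
                    cell.extend(x[i + di][j + dj])  # stack the r*r block's channels
            row.append(cell)
        out.append(row)
    return out


def residual_downsample(x, learned_branch, r=2):
    """Downsample as shortcut(x) + learned_branch(x).

    `learned_branch` is a hypothetical callable standing in for the trained
    layers; it must return a map with the same shape as the shortcut.
    """
    shortcut = space_to_channel(x, r)
    learned = learned_branch(x)
    return [
        [[s + l for s, l in zip(s_cell, l_cell)]
         for s_cell, l_cell in zip(s_row, l_row)]
        for s_row, l_row in zip(shortcut, learned)
    ]


# A 2x2 single-channel map becomes one spatial position with four channels.
x = [[[1.0], [2.0]], [[3.0], [4.0]]]
y = residual_downsample(x, lambda _: [[[0.0, 0.0, 0.0, 0.0]]])
```

With a zero learned branch the block reduces to a lossless rearrangement of the input, which is one way to see why a shortcut of this kind could ease optimization at high compression ratios.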

[CV-27] A Counterexample in Image Registration

Link: https://arxiv.org/abs/2410.10725
Authors: Serap A. Savari
Keywords: align discrete images, Image registration, image transformation, image similarity, discrete images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image registration is a widespread problem which applies models about image transformation or image similarity to align discrete images of the same scene. Nevertheless, the theoretical limits on its accuracy are not understood even in the case of one-dimensional data. Just as Nyquist’s sampling theorem states conditions for the perfect reconstruction of signals from samples, there are bounds to the quality of reproductions of quantized functions from sets of ideal, noiseless samples in the absence of additional assumptions. In this work we estimate spatially-limited piecewise constant signals from two or more sets of noiseless sampling patterns. We mainly focus on the energy of the error function and find that the uncertainties of the positions of the discontinuity points of the function depend on the discontinuity point selected as the reference point of the signal. As a consequence, the accuracy of the estimate of the signal depends on the reference point of that signal.

[CV-28] 4-LEGS: 4D Language Embedded Gaussian Splatting

Link: https://arxiv.org/abs/2410.10719
Authors: Gal Fiebelman,Tamir Cohen,Ayellet Morgenstern,Peter Hedman,Hadar Averbuch-Elor
Keywords: photorealistic images rendered, enabling the synthesis, emergence of neural, digitally viewing, viewing a wide
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project webpage: this https URL

Abstract:The emergence of neural representations has revolutionized our means for digitally viewing a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel views. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantics understanding embodied within the scene. These methods elevate the rich semantic understanding from 2D imagery to 3D representations, distilling high-dimensional spatial features onto 3D space. In our work, we are interested in connecting language with a dynamic modeling of the world. We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.

[CV-29] Benefiting from Quantum? A Comparative Study of Q-Seg Quantum-Inspired Techniques and U-Net for Crack Segmentation

Link: https://arxiv.org/abs/2410.10713
Authors: Akshaya Srinivasan,Alexander Geng,Antonio Macaluso,Maximilian Kiefer-Emmanouilidis,Ali Moghiseh
Keywords: Exploring the potential, ongoing challenge, hardware for enhancing, Exploring, enhancing classical
Categories: Computer Vision and Pattern Recognition (cs.CV); Disordered Systems and Neural Networks (cond-mat.dis-nn); Image and Video Processing (eess.IV)
Comments:

Abstract:Exploring the potential of quantum hardware for enhancing classical and real-world applications is an ongoing challenge. This study evaluates the performance of quantum and quantum-inspired methods compared to classical models for crack segmentation. Using annotated gray-scale image patches of concrete samples, we benchmark a classical mean Gaussian mixture technique, a quantum-inspired fermion-based method, Q-Seg (a quantum annealing-based method), and a U-Net deep learning architecture. Our results indicate that quantum-inspired and quantum methods offer a promising alternative for image segmentation, particularly for complex crack patterns, and could be applied in near-future applications.

[CV-30] Ensemble of ConvNeXt V2 and MaxViT for Long-Tailed CXR Classification with View-Based Aggregation MICCAI

Link: https://arxiv.org/abs/2410.10710
Authors: Yosuke Yamagishi,Shouhei Hanaoka
Keywords: place in Subtask, CXR-LT challenge, Subtask, present our solution, chest X-ray dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Solution paper for MICCAI CXR-LT 2024 challenge. 4th place in Subtask 2, 5th in Subtask 1

Abstract:In this work, we present our solution for the MICCAI 2024 CXR-LT challenge, achieving 4th place in Subtask 2 and 5th in Subtask 1. We leveraged an ensemble of ConvNeXt V2 and MaxViT models, pretrained on an external chest X-ray dataset, to address the long-tailed distribution of chest findings. The proposed method combines state-of-the-art image classification techniques, asymmetric loss for handling class imbalance, and view-based prediction aggregation to enhance classification performance. Through experiments, we demonstrate the advantages of our approach in improving both detection accuracy and the handling of the long-tailed distribution in CXR findings. The code is available at this https URL.
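
The abstract mentions view-based prediction aggregation without giving the rule. One plausible reading, sketched below purely as an assumption, is to average per-image class probabilities within each view (for example frontal or lateral) of a study and then average across views, so that a view with many images does not dominate. The function name `aggregate_by_view` and the tuple format are hypothetical.

```python
from collections import defaultdict


def mean_vectors(vectors):
    """Element-wise mean of equally sized probability vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def aggregate_by_view(predictions):
    """Aggregate image-level predictions to study level.

    predictions: iterable of (study_id, view, probs) tuples.
    Assumed rule (not from the paper): mean within each view, then mean
    across the views present in the study.
    """
    per_view = defaultdict(lambda: defaultdict(list))
    for study_id, view, probs in predictions:
        per_view[study_id][view].append(probs)
    return {
        study_id: mean_vectors([mean_vectors(v) for v in views.values()])
        for study_id, views in per_view.items()
    }


# Two frontal images and one lateral image of the same study.
study_probs = aggregate_by_view([
    ("s1", "frontal", [0.2, 0.8]),
    ("s1", "frontal", [0.4, 0.6]),
    ("s1", "lateral", [0.9, 0.1]),
])
```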

[CV-31] Early Diagnoses of Acute Lymphoblastic Leukemia Using YOLOv8 and YOLOv11 Deep Learning Models

Link: https://arxiv.org/abs/2410.10701
Authors: Alaa Awad,Mohamed Hegazy,Salah A. Aly
Keywords: individuals succumb annually, Acute Lymphoblastic Leukemia, Thousands of individuals, detecting Acute Lymphoblastic, individuals succumb
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 6 figures, 3 tables

Abstract:Thousands of individuals succumb annually to leukemia alone. This study explores the application of image processing and deep learning techniques for detecting Acute Lymphoblastic Leukemia (ALL), a severe form of blood cancer responsible for numerous annual fatalities. As artificial intelligence technologies advance, the research investigates the reliability of these methods in real-world scenarios. The study focuses on recent developments in ALL detection, particularly using the latest YOLO series models, to distinguish between malignant and benign white blood cells and to identify different stages of ALL, including early stages. Additionally, the models are capable of detecting hematogones, which are often misclassified as ALL. By utilizing advanced deep learning models like YOLOv8 and YOLOv11, the study achieves high accuracy rates reaching 98.8%, demonstrating the effectiveness of these algorithms across multiple datasets and various real-world situations.

[CV-32] TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model SIGGRAPH

Link: https://arxiv.org/abs/2410.10696
Authors: Jiazhi Guan,Quanwei Yang,Kaisiyuan Wang,Hang Zhou,Shengyi He,Zhiliang Xu,Haocheng Feng,Errui Ding,Jingdong Wang,Hongtao Xie,Youjian Zhao,Ziwei Liu
Keywords: facial animation techniques, everyday scenarios due, animation techniques, increasingly participated, participated in everyday
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH Asia 2024 (conference track). Project page: this https URL

Abstract:Recently, 2D speaking avatars have increasingly participated in everyday scenarios due to the fast development of facial animation techniques. However, most existing works neglect the explicit control of human bodies. In this paper, we propose to drive not only the faces but also the torso and gesture movements of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only short footage of monocular video. Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network for control information injection, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to enhance the bond between driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to help with the difficulties in hand-shape preserving. After pre-training, our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework. Resources can be found at this https URL.

[CV-33] Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Link: https://arxiv.org/abs/2410.10676
Authors: Peiwen Sun,Sitong Cheng,Xiangtai Li,Zhen Ye,Huadai Liu,Honggang Zhang,Wei Xue,Yike Guo
Keywords: achieved great success, achieved great, great success, success in mono-channel, mono-channel audio generation
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.

[CV-34] Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Link: https://arxiv.org/abs/2410.10663
Authors: Zhengwei Yang,Yuke Li,Qiang Sun,Basura Fernando,Heng Huang,Zheng Wang
Keywords: few-shot learning focus, few-shot learning, Cross-modal Few-Shot Learning, existing studies, learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.

[CV-35] Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning

Link: https://arxiv.org/abs/2410.10660
Authors: William A. Stigall
Keywords: Convolutional Neural Networks, utilizing Convolutional Neural, Deep Q-Networks utilizing, Q-Networks utilizing Convolutional, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: KSU C-Day Spring 2024

Abstract:In this study, we investigate the performance of Deep Q-Networks utilizing Convolutional Neural Networks (CNNs) and Transformer architectures across three different Atari games. The advent of DQNs has significantly advanced Reinforcement Learning, enabling agents to directly learn optimal policies from high-dimensional sensory inputs from pixel or RAM data. While CNN-based DQNs have been extensively studied and deployed in various domains, Transformer-based DQNs are relatively unexplored. Our research aims to fill this gap by benchmarking the performance of both DCQNs and DTQNs across the Atari games Asteroids, Space Invaders, and Centipede. We find that in the 35-40 million parameter range, the DCQN outperforms the DTQN in speed across both ViT and Projection Architectures. We also find the DCQN outperforms the DTQN in all games except for Centipede.

[CV-36] PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion ECCV2024

Link: https://arxiv.org/abs/2410.10659
Authors: Runsong Zhu,Shi Qiu,Qianyi Wu,Ka-Hei Hui,Pheng-Ann Heng,Chi-Wing Fu
Keywords: panoptic segmentation task, Probabilistic Contrastive Fusion, task by unprojecting, effective technique, technique to address
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024. The code is publicly available at this https URL

Abstract:Panoptic lifting is an effective technique to address the 3D panoptic segmentation task by unprojecting 2D panoptic segmentations from multiple views to the 3D scene. However, the quality of its results largely depends on the 2D segmentations, which could be noisy and error-prone, so its performance often drops significantly for complex scenes. In this work, we design a new pipeline coined PCF-Lift based on our Probabilistic Contrastive Fusion (PCF) to learn and embed probabilistic features throughout our pipeline to actively consider inaccurate segmentations and inconsistent instance IDs. Technically, we first model the probabilistic feature embeddings through multivariate Gaussian distributions. To fuse the probabilistic features, we incorporate the probability product kernel into the contrastive loss formulation and design a cross-view constraint to enhance the feature consistency across different views. For the inference, we introduce a new probabilistic clustering method to effectively associate prototype features with the underlying 3D object instances for the generation of consistent panoptic segmentation results. Further, we provide a theoretical analysis to justify the superiority of the proposed probabilistic solution. By conducting extensive experiments, our PCF-Lift not only significantly outperforms the state-of-the-art methods on widely used benchmarks including the ScanNet dataset and the challenging Messy Room dataset (4.4% improvement of scene-level PQ), but also demonstrates strong robustness when incorporating various 2D segmentation models or different levels of hand-crafted noise.
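
For readers unfamiliar with the probability product kernel, the sketch below computes its best-known special case, the expected likelihood kernel (exponent 1), between two diagonal-covariance Gaussians. In that case the integral of the product of the two densities has the closed form N(mu1 - mu2; 0, Sigma1 + Sigma2). Whether the paper uses this exponent or diagonal covariances is not stated in the abstract, so both are assumptions here, as is the function name.

```python
import math


def expected_likelihood_kernel(mu1, var1, mu2, var2):
    """Integral of N(x; mu1, diag(var1)) * N(x; mu2, diag(var2)) over x.

    Computed in log space per dimension as the density of a zero-mean
    Gaussian with variance var1 + var2, evaluated at mu1 - mu2.
    """
    log_k = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        log_k += -0.5 * math.log(2.0 * math.pi * s) - (m1 - m2) ** 2 / (2.0 * s)
    return math.exp(log_k)


# Similarity falls as the means move apart or as the variances grow.
near = expected_likelihood_kernel([0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [1.0, 1.0])
far = expected_likelihood_kernel([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0])
```

Unlike a dot product of point embeddings, this similarity accounts for each feature's uncertainty, which is the property a probabilistic contrastive loss can exploit when 2D segmentations are noisy.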

[CV-37] SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Link: https://arxiv.org/abs/2410.10629
Authors: Enze Xie,Junsong Chen,Junyu Chen,Han Cai,Yujun Lin,Zhekai Zhang,Muyang Li,Yao Lu,Song Han
Keywords: times, laptop GPU, efficiently generate images, images, GPU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
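
Design points (1) and (2) can be made concrete with a small sketch. The first helper counts latent tokens for a given compression factor, assuming one token per latent position (patch size 1, an assumption). The second implements generic linear attention, which aggregates keys and values once and reuses the result for every query, so cost grows linearly with the number of tokens rather than quadratically. The ReLU feature map and the `eps` stabilizer are assumptions for this example; the abstract does not specify them.

```python
def latent_tokens(resolution, compression, patch=1):
    """Token count for a square image after spatial compression."""
    side = resolution // (compression * patch)
    return side * side


def relu(v):
    return [max(0.0, x) for x in v]


def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with an assumed ReLU feature map phi.

    out_i = phi(q_i) @ (sum_j phi(k_j)^T v_j) / (phi(q_i) @ sum_j phi(k_j))
    Q, K: lists of d-dimensional vectors; V: list of dv-dimensional vectors.
    """
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # d x dv summary, built once
    k_sum = [0.0] * d
    for k, v in zip(K, V):
        pk = relu(k)
        for a in range(d):
            k_sum[a] += pk[a]
            for b in range(dv):
                kv[a][b] += pk[a] * v[b]
    out = []
    for q in Q:
        pq = relu(q)
        denom = sum(pq[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(pq[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


# At 1024 x 1024, moving from 8x to 32x compression cuts tokens by 16x.
tokens_8x = latent_tokens(1024, 8)
tokens_32x = latent_tokens(1024, 32)
out = linear_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]])
```

The `kv` summary has a fixed d x dv size regardless of sequence length, which is why this formulation stays cheap at high resolutions.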

[CV-38] BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI

Link: https://arxiv.org/abs/2410.10604
Authors: Shaohao Rui, Lingzhi Chen, Zhenyu Tang, Lilong Wang, Mianxin Liu, Shaoting Zhang, Xiaosong Wang
Keywords: Accurate diagnosis, complementary multi-parametric MRI, multi-parametric MRI imaging, MRI imaging data, abnormalities is greatly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essence representation of each MR image modality for both the pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks.

[CV-39] VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Link: https://arxiv.org/abs/2410.10594
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords: enables large language, external knowledge sources, utilize external knowledge, large language models, RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25–39% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at this https URL .

[CV-40] MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer NEURIPS2024

Link: https://arxiv.org/abs/2410.10589
Authors: Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen
Keywords: Transferring visual-language knowledge, Transferring visual-language, large-scale foundation models, large-scale foundation, Transferring
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024 Camera Ready

Abstract:Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture the temporal information. However, zero-shot generalization diminishes with the increase in the number of specialized parameters, making existing works a trade-off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that enables generalization and specialization to be balanced in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space. We additionally use temporal feature modulation to regularize the contribution of temporal features at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 & 600, UCF, and HMDB. Code is available at this https URL.

[CV-41] TopoFR: A Closer Look at Topology Alignment on Face Recognition NEURIPS2024

Link: https://arxiv.org/abs/2410.10587
Authors: Jun Dan, Yang Liu, Jiankang Deng, Haoyu Xie, Siyuan Li, Baigui Sun, Shan Luo
Keywords: undergone significant advancements, structure information, structure, latent space, latent space structure
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024

Abstract:The field of face recognition (FR) has undergone significant advancements with the rise of deep learning. Recently, the success of unsupervised learning and graph neural networks has demonstrated the effectiveness of data structure information. Considering that the FR task can leverage large-scale training data, which intrinsically contains significant structure information, we aim to investigate how to encode such critical structure information into the latent space. As revealed from our observations, directly aligning the structure information between the input and latent spaces inevitably suffers from an overfitting problem, leading to a structure collapse phenomenon in the latent space. To address this problem, we propose TopoFR, a novel FR model that leverages a topological structure alignment strategy called PTSA and a hard sample mining strategy named SDE. Concretely, PTSA uses persistent homology to align the topological structures of the input and latent spaces, effectively preserving the structure information and improving the generalization performance of FR model. To mitigate the impact of hard samples on the latent space structure, SDE accurately identifies hard samples by automatically computing structure damage score (SDS) for each sample, and directs the model to prioritize optimizing these samples. Experimental results on popular face benchmarks demonstrate the superiority of our TopoFR over the state-of-the-art methods. Code and models are available at: this https URL.

[CV-42] Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification

Link: https://arxiv.org/abs/2410.10573
Authors: Jiaxiang Gou, Luping Ji, Pei Liu, Mao Ye
Keywords: Slide Image, Multiple Instance Learning, clinical pathology, tumor identification, cancer diagnosis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 tables, 11 figures

Abstract:Whole Slide Image (WSI) classification has very significant applications in clinical pathology, e.g., tumor identification and cancer diagnosis. Currently, most research attention is focused on Multiple Instance Learning (MIL) using static datasets. One of the most obvious weaknesses of these methods is that they cannot efficiently preserve and utilize previously learned knowledge. With any new data arriving, classification models are required to be re-trained on both previous and current new data. To overcome this shortcoming and break through the traditional vision modality, this paper proposes the first Vision-Language-based framework with Queryable Prototype Multiple Instance Learning (QPMIL-VL) specially designed for incremental WSI classification. This framework mainly consists of two information processing branches. One generates the bag-level feature by prototype-guided aggregation of the instance features. The other enhances the class feature through class ensemble, tunable vector and class similarity loss. The experiments on four TCGA datasets demonstrate that our QPMIL-VL framework is effective for incremental WSI classification and often significantly outperforms other compared methods, achieving state-of-the-art (SOTA) performance.

[CV-43] MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Link: https://arxiv.org/abs/2410.10563
Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
Keywords: highly heterogeneous daily, scales multimodal evaluation, suite that scales, heterogeneous daily, daily use cases
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. Project page: this https URL

Abstract:We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.

[CV-44] ROSAR: An Adversarial Re-Training Framework for Robust Side-Scan Sonar Object Detection

Link: https://arxiv.org/abs/2410.10554
Authors: Martin Aubard, László Antal, Ana Madureira, Luis F. Teixeira, Erika Ábrahám
Keywords: deep learning object, autonomous underwater vehicles, learning object detection, generated by autonomous, paper introduces ROSAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:This paper introduces ROSAR, a novel framework enhancing the robustness of deep learning object detection models tailored for side-scan sonar (SSS) images, generated by autonomous underwater vehicles using sonar sensors. By extending our prior work on knowledge distillation (KD), this framework integrates KD with adversarial retraining to address the dual challenges of model efficiency and robustness against SSS noises. We introduce three novel, publicly available SSS datasets, capturing different sonar setups and noise conditions. We propose and formalize two SSS safety properties and utilize them to generate adversarial datasets for retraining. Through a comparative analysis of projected gradient descent (PGD) and patch-based adversarial attacks, ROSAR demonstrates significant improvements in model robustness and detection accuracy under SSS-specific conditions, enhancing the model’s robustness by up to 1.85%. ROSAR is available at this https URL.
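
The abstract above compares projected gradient descent (PGD) with patch-based attacks. As a hedged illustration of the PGD step itself, the sketch below attacks a toy logistic-regression classifier under an L-infinity budget. The model, the weights, and the values of `eps`, `alpha`, and `steps` are all hypothetical; the paper applies the attack to sonar object detectors, which this sketch does not attempt to reproduce.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on a single example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x0, y, eps=0.1, alpha=0.02, steps=20):
    """L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x = list(x0)
    for _ in range(steps):
        grad = input_gradient(w, b, x, y)
        x = [xi + alpha * sign(g) for xi, g in zip(x, grad)]             # ascent step
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x0)]  # project to ball
        x = [min(max(xi, 0.0), 1.0) for xi in x]                         # keep valid pixel range
    return x


# Hypothetical model and input (three "pixels" in [0, 1]) with true label 1.
w, b = [2.0, -1.0, 0.5], -0.2
x0, y = [0.6, 0.3, 0.5], 1
x_adv = pgd_attack(w, b, x0, y)
print(bce_loss(w, b, x0, y), bce_loss(w, b, x_adv, y))
```

Adversarial retraining then adds such perturbed inputs, with their original labels, back into the training set.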

[CV-45] RICASSO: Reinforced Imbalance Learning with Class-Aware Self-Supervised Outliers Exposure

Link: https://arxiv.org/abs/2410.10548
Authors: Xuan Zhang, Sin Chee Chin, Tingxuan Gao, Wenming Yang
Keywords: real OOD data, OOD data, deep learning models, OOD, real OOD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 2 figures

Abstract:In real-world scenarios, deep learning models often face challenges from both imbalanced (long-tailed) and out-of-distribution (OOD) data. However, existing joint methods rely on real OOD data, which leads to unnecessary trade-offs. In contrast, our research shows that data mixing, a potent augmentation technique for long-tailed recognition, can generate pseudo-OOD data that exhibit the features of both in-distribution (ID) data and OOD data. Therefore, by using mixed data instead of real OOD data, we can address long-tailed recognition and OOD detection holistically. We propose a unified framework called Reinforced Imbalance Learning with Class-Aware Self-Supervised Outliers Exposure (RICASSO), where “self-supervised” denotes that we only use ID data for outlier exposure. RICASSO includes three main strategies: (1) Norm-Odd-Duality-Based Outlier Exposure, which uses mixed data as pseudo-OOD data, enabling simultaneous ID data rebalancing and outlier exposure through a single loss function; (2) Ambiguity-Aware Logits Adjustment, which utilizes the ambiguity of ID data to adaptively recalibrate logits; and (3) Contrastive Boundary-Center Learning, which combines Virtual Boundary Learning and Dual-Entropy Center Learning to use mixed data for better feature separation and clustering, with Representation Consistency Learning for robustness. Extensive experiments demonstrate that RICASSO achieves state-of-the-art performance in long-tailed recognition and significantly improves OOD detection compared to our baseline (27% improvement in AUROC and 61% reduction in FPR on the iNaturalist2018 dataset). On iNaturalist2018, we even outperform methods using real OOD data. The code will be made public soon.

[CV-46] Hybrid Transformer for Early Alzheimer's Detection: Integration of Handwriting-Based 2D Images and 1D Signal Features

Link: https://arxiv.org/abs/2410.10547
Authors: Changqing Gong, Huafeng Qin, Mounîm A. El-Yacoubi
Keywords: Alzheimer Disease, prevalent neurodegenerative condition, prevalent neurodegenerative, neurodegenerative condition, Alzheimer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Alzheimer’s Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting-based AD detection, mostly using online handwriting, has predominantly relied on manually extracted features, fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, either 1D-CNN or 2D-CNN architectures, with performance comparing favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains basically unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that simultaneously integrates 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32% and accuracy of 90.91% in Task 8 (‘L’ writing), surpassing the previous best by 4.61% and 6.06% respectively.

[CV-47] Motion-guided small MAV detection in complex and non-planar scenes

Link: https://arxiv.org/abs/2410.10527
Authors: Hanqing Guo, Canlun Zheng, Shiyu Zhao
Keywords: micro aerial vehicles, recent years, aerial vehicles, numerous applications, growing interest
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:In recent years, there has been a growing interest in the visual detection of micro aerial vehicles (MAVs) due to its importance in numerous applications. However, the existing methods based on either appearance or motion features encounter difficulties when the background is complex or the MAV is too small. In this paper, we propose a novel motion-guided MAV detector that can accurately identify small MAVs in complex and non-planar scenes. This detector first exploits a motion feature enhancement module to capture the motion features of small MAVs. Then it uses multi-object tracking and trajectory filtering to eliminate false positives caused by motion parallax. Finally, an appearance-based classifier and an appearance-based detector that operates on the cropped regions are used to achieve precise detection results. Our proposed method can effectively and efficiently detect extremely small MAVs from dynamic and complex backgrounds because it aggregates pixel-level motion features and eliminates false positives based on the motion and appearance features of MAVs. Experiments on the ARD-MAV dataset demonstrate that the proposed method could achieve high performance in small MAV detection under challenging conditions and outperform other state-of-the-art methods across various metrics.

[CV-48] Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

Link: https://arxiv.org/abs/2410.10511
Authors: Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue
Keywords: termed Set AutoRegressive, Set AutoRegressive Modeling, Fully Masked Transformer, Set AutoRegressive, SAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 17 figures, 8 tables, github repo: this https URL

Abstract:We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transformer. We reveal that existing AR variants correspond to specific design choices of sequence order and output intervals within the SAR framework, with AR and Masked AR (MAR) as two extreme instances. Notably, SAR facilitates a seamless transition from AR to MAR, where intermediate states allow for training a causal model that benefits from both few-step inference and KV cache acceleration, thus leveraging the advantages of both AR and MAR. On the ImageNet benchmark, we carefully explore the properties of SAR by analyzing the impact of sequence order and output intervals on performance, as well as the generalization ability regarding inference order and steps. We further validate the potential of SAR by training a 900M text-to-image model capable of synthesizing photo-realistic images with any resolution. We hope our work may inspire more exploration and application of AR-based modeling across diverse modalities.
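
The "next-set" idea in the abstract above can be pictured as two design choices: a sequence order and a list of output intervals (set sizes). The sketch below only illustrates that bookkeeping; the helper name `split_into_sets` and the example orders are hypothetical, and it says nothing about the Fully Masked Transformer itself or about how MAR is configured in the paper.

```python
def split_into_sets(order, set_sizes):
    """Partition a token order into consecutive sets; set k is predicted
    jointly, conditioned on all tokens in sets 0..k-1."""
    if sum(set_sizes) != len(order):
        raise ValueError("set sizes must cover the whole sequence")
    sets, start = [], 0
    for size in set_sizes:
        sets.append(order[start:start + size])
        start += size
    return sets


raster_order = list(range(8))  # token positions of a tiny 8-token image

# Conventional AR: raster order, one token per step (8 steps).
ar_sets = split_into_sets(raster_order, [1] * 8)

# An intermediate next-set schedule: same order, growing sets (3 steps).
next_sets = split_into_sets(raster_order, [2, 2, 4])

# A non-raster order with multi-token sets (2 steps); the permutation is arbitrary.
permuted_order = [5, 0, 7, 2, 1, 6, 3, 4]
permuted_sets = split_into_sets(permuted_order, [4, 4])

print(ar_sets, next_sets, permuted_sets)
```

Fewer, larger sets mean fewer inference steps, while a fixed causal order between sets is what keeps KV caching applicable.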

[CV-49] Exploiting Local Features and Range Images for Small Data Real-Time Point Cloud Semantic Segmentation IROS

Link: https://arxiv.org/abs/2410.10510
Authors: Daniel Fusaro, Simone Mosco, Emanuele Menegatti, Alberto Pretto
Keywords: Semantic segmentation, driving and robotics, essential task, task for understanding, understanding the environment
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This paper has been accepted for publication at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Abstract:Semantic segmentation of point clouds is an essential task for understanding the environment in autonomous driving and robotics. Recent range-based works achieve real-time efficiency, while point- and voxel-based methods produce better results but are affected by high computational complexity. Moreover, highly complex deep learning models are often not suited to efficiently learn from small datasets. Their generalization capabilities can easily be driven by the abundance of data rather than the architecture design. In this paper, we harness the information from the three-dimensional representation to proficiently capture local features, while introducing the range image representation to incorporate additional information and facilitate fast computation. A GPU-based KDTree allows for rapid building, querying, and enhancing projection with straightforward operations. Extensive experiments on SemanticKITTI and nuScenes datasets demonstrate the benefits of our modification in a “small data” setup, in which only one sequence of the dataset is used to train the models, but also in the conventional setup, where all sequences except one are used for training. We show that a reduced version of our model not only demonstrates strong competitiveness against full-scale state-of-the-art models but also operates in real-time, making it a viable choice for real-world case applications. The code of our method is available at this https URL.
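
The range image representation mentioned in the abstract above is commonly obtained by a spherical projection of the LiDAR points. The sketch below shows that generic projection in pure Python; the image size, the vertical field of view, and the nearest-return rule are assumptions chosen for a tiny example and are not taken from the paper, whose KDTree-based pipeline is not reproduced here.

```python
import math


def project_to_range_image(points, width=16, height=4,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Spherical projection: yaw selects the column, pitch selects the row,
    and each pixel stores the range of its nearest point (None if empty)."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no defined direction
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        u = 0.5 * (1.0 - yaw / math.pi) * width      # horizontal pixel coordinate
        v = (1.0 - (pitch - fov_down) / fov) * height  # vertical pixel coordinate
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r  # keep the closest return per pixel
    return image


# Hypothetical points: two along +x (same pixel) and one along +y.
points = [(10.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
image = project_to_range_image(points)
filled = sum(1 for row in image for value in row if value is not None)
print(image[0][8], image[0][4], filled)
```

Because the result is a dense 2D grid, ordinary image convolutions can run on it, which is where range-based methods get their speed.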

[CV-50] Artificial Intelligence-Based Triaging of Cutaneous Melanocytic Lesions

Link: https://arxiv.org/abs/2410.10509
Authors: Ruben T. Lucassen, Nikolas Stathonikos, Gerben E. Breimer, Mitko Veta, Willeke A. M. Blokx
Keywords: increasing workload due, comprehensive diagnoses, facing an increasing, growing volume, increasing workload
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:Pathologists are facing an increasing workload due to a growing volume of cases and the need for more comprehensive diagnoses. Aiming to facilitate workload reduction and faster turnaround times, we developed an artificial intelligence (AI) model for triaging cutaneous melanocytic lesions based on whole slide images. The AI model was developed and validated using a retrospective cohort from the UMC Utrecht. The dataset consisted of 52,202 whole slide images from 27,167 unique specimens, acquired from 20,707 patients. Specimens with only common nevi were assigned to the low complexity category (86.6%). In contrast, specimens with any other melanocytic lesion subtype, including non-common nevi, melanocytomas, and melanomas, were assigned to the high complexity category (13.4%). The dataset was split on patient level into a development set (80%) and test sets (20%) for independent evaluation. Predictive performance was primarily measured using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation experiment was performed to study the effect of implementing AI-based triaging in the clinic. The AI model reached an AUROC of 0.966 (95% CI, 0.960-0.972) and an AUPRC of 0.857 (95% CI, 0.836-0.877) on the in-distribution test set, and an AUROC of 0.899 (95% CI, 0.860-0.934) and an AUPRC of 0.498 (95% CI, 0.360-0.639) on the out-of-distribution test set. In the simulation experiment, using random case assignment as baseline, AI-based triaging prevented an average of 43.9 (95% CI, 36-55) initial examinations of high complexity cases by general pathologists for every 500 cases. In conclusion, the AI model achieved a strong predictive performance in differentiating between cutaneous melanocytic lesions of high and low complexity. The improvement in workflow efficiency due to AI-based triaging could be substantial.
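
The abstract above states that the data were split at patient level, so that all slides from one patient end up on the same side of the split. A minimal sketch of such a split follows; the record format, the function name, and the single 80/20 split are assumptions for illustration, and the study's separate in-distribution and out-of-distribution test sets are not modeled.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, item) records so that no patient appears in both sets."""
    by_patient = defaultdict(list)
    for patient_id, item in records:
        by_patient[patient_id].append(item)
    patients = sorted(by_patient)            # deterministic starting order
    random.Random(seed).shuffle(patients)    # reproducible shuffle of patients, not slides
    n_test = max(1, round(len(patients) * test_fraction))
    test = [(p, x) for p in patients[:n_test] for x in by_patient[p]]
    dev = [(p, x) for p in patients[n_test:] for x in by_patient[p]]
    return dev, test


# Hypothetical records: 10 patients with 1 to 3 slides each (19 slides in total).
records = [(f"P{i}", f"slide_{i}_{k}") for i in range(10) for k in range(1 + i % 3)]
dev, test = split_by_patient(records)
dev_patients = {p for p, _ in dev}
test_patients = {p for p, _ in test}
print(len(dev), len(test), sorted(test_patients))
```

Splitting slides at random instead would let images from the same patient leak into both sets and inflate the reported AUROC.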

[CV-51] Continual Learning Improves Zero-Shot Action Recognition ACCV2024

Link: https://arxiv.org/abs/2410.10497
Authors: Shreyank N Gowda, Davide Moltisanti, Laura Sevilla-Lara
Keywords: Zero-shot action recognition, continual learning, action recognition requires, requires a strong, strong ability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ACCV 2024

Abstract:Zero-shot action recognition requires a strong ability to generalize from pre-training and seen classes to novel unseen classes. Similarly, continual learning aims to develop models that can generalize effectively and learn new tasks without forgetting the ones previously learned. The generalization goals of zero-shot and continual learning are closely aligned; however, techniques from continual learning have not been applied to zero-shot action recognition. In this paper, we propose a novel method based on continual learning to address zero-shot action recognition. This model, which we call Generative Iterative Learning (GIL), uses a memory of synthesized features of past classes, and combines these synthetic features with real ones from novel classes. The memory is used to train a classification model, ensuring a balanced exposure to both old and new classes. Experiments demonstrate that GIL improves generalization in unseen classes, achieving a new state-of-the-art in zero-shot recognition across multiple benchmarks. Importantly, GIL also boosts performance in the more challenging generalized zero-shot setting, where models need to retain knowledge about classes seen before fine-tuning.

[CV-52] Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing

Link: https://arxiv.org/abs/2410.10496
Authors: Kejie Wang, Xuemeng Song, Meng Liu, Weili Guan, Liqiang Nie
Keywords: demonstrated remarkable progress, synthesizing high-quality images, diffusion models, models have demonstrated, demonstrated remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image diffusion models have demonstrated remarkable progress in synthesizing high-quality images from text prompts, which has boosted research on prompt-based image editing that edits a source image according to a target prompt. Despite their advances, existing methods still encounter three key issues: 1) limited capacity of the text prompt in guiding target image generation, 2) insufficient mining of word-to-patch and patch-to-patch relationships for grounding editing areas, and 3) unified editing strength for all regions during each denoising step. To address these issues, we present a Vision-guided and Mask-enhanced Adaptive Editing (ViMAEdit) method with three key novel designs. First, we propose to leverage image embeddings as explicit guidance to enhance the conventional textual prompt-based denoising process, where a CLIP-based target image embedding estimation strategy is introduced. Second, we devise a self-attention-guided iterative editing area grounding strategy, which iteratively exploits patch-to-patch relationships conveyed by self-attention maps to refine those word-to-patch relationships contained in cross-attention maps. Last, we present a spatially adaptive variance-guided sampling, which highlights sampling variances for critical image regions to promote the editing capability. Experimental results demonstrate the superior editing capacity of ViMAEdit over all existing methods.

[CV-53] Learning to Ground VLMs without Forgetting

Link: https://arxiv.org/abs/2410.10491
Authors: Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma, Martin R. Oswald, Yuki M. Asano, Cees G. M. Snoek
Keywords: enable embodied multimodal, Spatial awareness, awareness is key, key to enable, enable embodied
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Visual Language Models (VLMs) struggle at this task. In this paper, we introduce LynX, a framework that equips pretrained VLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose a Dual Mixture of Experts module that modifies only the decoder layer of the language model, using one frozen Mixture of Experts (MoE) pre-trained on image and language understanding and another learnable MoE for new grounding capabilities. This allows the VLM to retain previously learned knowledge and skills, while acquiring what is missing. To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning while maintaining its original image and language understanding capabilities on seven standard benchmark datasets.

[CV-54] Advancing Newborn Care: Precise Birth Time Detection Using AI-Driven Thermal Imaging with Adaptive Normalization

Link: https://arxiv.org/abs/2410.10483
Authors: Jorge García-Torres, Øyvind Meinich-Bache, Anders Johannessen, Siren Rettedal, Vilde Kolstad, Kjersti Engan
Keywords: newborn resuscitation, start breathing, assistance to start, newborn, real newborn resuscitation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper submitted to Computer in Biology and Medicine, ELSEVIER

Abstract:Around 5-10% of newborns need assistance to start breathing. Currently, there is a lack of evidence-based research, objective data collection, and opportunities for learning from real newborn resuscitation emergency events. Generating and evaluating automated newborn resuscitation algorithm activity timelines relative to the Time of Birth (ToB) offers a promising opportunity to enhance newborn care practices. Given the importance of prompt resuscitation interventions within the “golden minute” after birth, having an accurate ToB with second precision is essential for effective subsequent analysis of newborn resuscitation episodes. Instead, ToB is generally registered manually, often with minute precision, making the process inefficient and susceptible to error and imprecision. In this work, we explore the fusion of Artificial Intelligence (AI) and thermal imaging to develop the first AI-driven ToB detector. The use of temperature information offers a promising alternative to detect the newborn while respecting the privacy of healthcare providers and mothers. However, the frequent inconsistencies in thermal measurements, especially in a multi-camera setup, make normalization strategies critical. Our methodology involves a three-step process: first, we propose an adaptive normalization method based on Gaussian mixture models (GMM) to mitigate issues related to temperature variations; second, we implement and deploy an AI model to detect the presence of the newborn within the thermal video frames; and third, we evaluate and post-process the model’s predictions to estimate the ToB. A precision of 88.1% and a recall of 89.3% are reported in the detection of the newborn within thermal frames during performance evaluation. Our approach achieves an absolute median deviation of 2.7 seconds in estimating the ToB relative to the manual annotations.

[CV-55] ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Link: https://arxiv.org/abs/2410.10471
Authors: Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima
Keywords: visually-rich document understanding, Recent approaches, manually annotated semantic, annotated semantic groups, document understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such grouping, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow for using manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping through arranging words and bringing the representations of words that potentially belong to the same semantic group closer together. Our experimental results demonstrate that the performance of existing methods deteriorates on the ReVrDU task, while ReLayout shows superior performance.

[CV-56] Improve Meta-learning for Few-Shot Text Classification with All You Can Acquire from the Tasks EMNLP2024

Link: https://arxiv.org/abs/2410.10454
Authors: Xinyue Liu, Yunlong Gao, Linlin Zong, Bo Xu
Keywords: achieved promising performance, Meta-learning has emerged, few-shot text classification, promising performance, prominent technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP 2024 Findings

Abstract:Meta-learning has emerged as a prominent technology for few-shot text classification and has achieved promising performance. However, existing methods often encounter difficulties in drawing accurate class prototypes from support set samples, primarily due to probable large intra-class differences and small inter-class differences within the task. Recent approaches attempt to incorporate external knowledge or pre-trained language models to augment data, but this requires additional resources and thus does not suit many few-shot scenarios. In this paper, we propose a novel solution to address this issue by adequately leveraging the information within the task itself. Specifically, we utilize label information to construct a task-adaptive metric space, thereby adaptively reducing the intra-class differences and magnifying the inter-class differences. We further employ the optimal transport technique to estimate class prototypes with query set samples together, mitigating the problem of inaccurate and ambiguous support set samples caused by large intra-class differences. We conduct extensive experiments on eight benchmark datasets, and our approach shows obvious advantages over state-of-the-art models across all the tasks on all the datasets. For reproducibility, all the datasets and codes are available at this https URL.

[CV-57] Self-Assessed Generation: Trustworthy Label Generation for Optical Flow and Stereo Matching in Real-world

Link: https://arxiv.org/abs/2410.10453
Authors: Han Ling, Yinghui Sun, Quansen Sun, Ivor Tsang, Yuhui Zheng
Keywords: significant challenge facing, real world, optical flow, difficulty in generalizing, facing current optical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A significant challenge facing current optical flow and stereo methods is the difficulty in generalizing them well to the real world. This is mainly due to the high costs required to produce datasets, and the limitations of existing self-supervised methods on fuzzy results and complex model training problems. To address the above challenges, we propose a unified self-supervised generalization framework for optical flow and stereo tasks: Self-Assessed Generation (SAG). Unlike previous self-supervised methods, SAG is data-driven, using advanced reconstruction techniques to construct a reconstruction field from RGB images and generate datasets based on it. Afterward, we quantify the confidence level of the generated results from multiple perspectives, such as reconstruction field distribution, geometric consistency, and structural similarity, to eliminate inevitable defects in the generation process. We also design a 3D flight foreground automatic rendering pipeline in SAG to encourage the network to learn occlusion and motion foreground. Experimentally, because SAG does not involve changes to methods or loss functions, it can directly train state-of-the-art deep networks in a self-supervised manner, greatly improving the generalization performance of self-supervised methods on current mainstream optical flow and stereo-matching datasets. Compared to previous training modes, SAG is more generalized, cost-effective, and accurate.

[CV-58] Domain-Conditioned Transformer for Fully Test-time Adaptation

Link: https://arxiv.org/abs/2410.10442
Authors: Yushun Tang, Shuoshuo Chen, Jiyuan Jia, Yi Zhang, Zhihai He
Keywords: model online based, inference stage, aims to adapt, based on sequential, sequential analysis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.
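To make the idea of conditioning the query, key, and value components concrete, here is a minimal single-head sketch. The abstract does not say how the conditioners are combined with Q, K, and V, so simple addition is assumed here, and the conditioners are passed in directly rather than generated from the class token. The function name `conditioned_attention` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditioned_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    cond_q: Vector,
    cond_k: Vector,
    cond_v: Vector,
) -> List[Vector]:
    """Scaled dot-product attention with additive domain conditioners.

    Assumption: each conditioner is added to every token's query, key, or
    value vector before attention is computed.
    """
    q = [add(row, cond_q) for row in queries]
    k = [add(row, cond_k) for row in keys]
    v = [add(row, cond_v) for row in values]
    scale = math.sqrt(len(cond_q))
    outputs = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v_row[d] for w, v_row in zip(weights, v)) for d in range(len(cond_v))]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
print(conditioned_attention(tokens, tokens, tokens, zero, zero, [1.0, 1.0]))
```

With zero conditioners the function reduces to ordinary attention, which is a convenient sanity check for a sketch of this kind.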

[CV-59] Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Link: https://arxiv.org/abs/2410.10441
Authors: Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang
Keywords: achieved remarkable success, understanding remains challenging, Vision-language large models, remains challenging due, Vision-language large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech report

Abstract:Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video-LLMs deliver high performance, they often require substantial resources for training and inference. Conversely, training-free approaches offer a more efficient alternative by adapting pre-trained image-LLM models for video tasks without additional training, but they face inference efficiency bottlenecks due to the large number of visual tokens generated from video frames. In this work, we present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. The proposed framework decouples the spatial-temporal dimensions and performs temporal frame sampling and spatial RoI cropping, respectively, based on task-specific prompts. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. Extensive experiments demonstrate that our approach achieves competitive results with significantly fewer tokens, offering an optimal trade-off between accuracy and computational efficiency compared to state-of-the-art video LLMs. The code will be available at this https URL.
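Prompt-guided temporal frame sampling can be sketched as a similarity-based selection. The scoring rule (cosine similarity between a prompt embedding and per-frame embeddings) and the names `select_frames` and `cosine` are assumptions for illustration. The abstract does not specify how frames are scored.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(frame_embeddings: List[Vector], prompt_embedding: Vector, k: int) -> List[int]:
    """Return the indices of the k frames most similar to the prompt.

    The indices are returned in temporal order so that the downstream model
    still sees the selected frames as a coherent sequence.
    """
    scores = [cosine(frame, prompt_embedding) for frame in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(select_frames(frames, prompt_embedding=[1.0, 0.0], k=2))
```

Keeping only k of the original frames reduces the number of visual tokens in proportion, which is the efficiency lever the abstract describes.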

[CV-60] Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.10437
Authors: Boheng Li, Yanhao Wei, Yankai Fu, Zhenting Wang, Yiming Li, Jie Zhang, Run Wang, Tianwei Zhang
Keywords: pushing the boundaries, models, diffusion models, data, personalized models
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in the IEEE Symposium on Security and Privacy, May 2025

Abstract:Text-to-image diffusion models are pushing the boundaries of what generative AI can achieve in our lives. Beyond their ability to generate general images, new personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles. Such a lightweight solution, enabling AI practitioners and developers to easily build their own personalized models, also poses a new concern regarding whether the personalized models are trained from unauthorized data. A promising solution is to proactively enable data traceability in generative models, where data owners embed external coatings (e.g., image watermarks or backdoor triggers) onto the datasets before releasing. Later the models trained over such datasets will also learn the coatings and unconsciously reproduce them in the generated mimicries, which can be extracted and used as the data usage evidence. However, we identify the existing coatings cannot be effectively learned in personalization tasks, making the corresponding verification less reliable. In this paper, we introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models. Our approach optimizes the coating in a delicate way to be recognized by the model as a feature relevant to the personalization task, thus significantly improving its learnability. We also utilize a human perceptual-aware constraint, a hypersphere classification technique, and a hypothesis-testing-guided verification method to enhance the stealthiness and detection accuracy of the coating. The effectiveness of SIREN is verified through extensive experiments on a diverse set of benchmark datasets, models, and learning algorithms. SIREN is also effective in various real-world scenarios and evaluated against potential countermeasures. Our code is publicly available. 

[CV-61] LKASeg: Remote-Sensing Image Semantic Segmentation with Large Kernel Attention and Full-Scale Skip Connections ICASSP2025

Link: https://arxiv.org/abs/2410.10433
Authors: Xuezhi Xiang, Yibo Ning, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
Keywords: Large Kernel Attention, Convolutional Neural Networks, Full-Scale Skip Connections, Kernel Attention, semantic segmentation network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: The paper is under consideration at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:Semantic segmentation of remote sensing images is a fundamental task in geospatial research. However, widely used Convolutional Neural Networks (CNNs) and Transformers have notable drawbacks: CNNs may be limited by insufficient remote sensing modeling capability, while Transformers face challenges due to computational complexity. In this paper, we propose a remote-sensing image semantic segmentation network named LKASeg, which combines Large Kernel Attention (LKA) and Full-Scale Skip Connections (FSC). Specifically, we propose a decoder based on LKA, which extracts global features while avoiding the computational overhead of self-attention and providing channel adaptability. To achieve full-scale feature learning and fusion, we apply FSC between the encoder and decoder. We conducted experiments by combining the LKA-based decoder with FSC. On the ISPRS Vaihingen dataset, the mF1 and mIoU scores reached 90.33% and 82.77%, respectively.
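For readers unfamiliar with the reported metrics, the following sketch shows how mean F1 (mF1) and mean IoU (mIoU) are commonly derived from a confusion matrix. Averaging over all classes is the usual convention and is assumed here. The paper's exact evaluation protocol, such as which classes are excluded, is not stated in the abstract.

```python
from typing import List, Tuple


def per_class_iou_f1(confusion: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Compute per-class IoU and F1 from a square confusion matrix.

    confusion[t][p] counts pixels whose true class is t and predicted class is p.
    """
    n = len(confusion)
    ious, f1s = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labelled as c
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return ious, f1s


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


confusion = [[8, 2], [1, 9]]
ious, f1s = per_class_iou_f1(confusion)
print(f"mIoU = {mean(ious):.4f}, mF1 = {mean(f1s):.4f}")
```

Because F1 = 2·IoU / (1 + IoU) for each class, mF1 is always at least as large as mIoU, which is consistent with the two figures reported in the abstract.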

[CV-62] DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model

Link: https://arxiv.org/abs/2410.10429
Authors: Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, Xiaoxiao Long
Keywords: past occupancy observations, diffusion-based world model, world model, world, occupancy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Please visit our project page at this https URL

Abstract:We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video-based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features: (1) High-Fidelity and Long-Duration Generation. We adopt a spatial-temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial-temporal information, enabling high-fidelity details and the ability to generate predictions over long durations. (2) Fine-grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model’s ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state-of-the-art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.

[CV-63] 4DStyleGaussian: Zero-shot 4D Style Transfer with Gaussian Splatting

Link: https://arxiv.org/abs/2410.10412
Authors: Wanlin Liang, Hongbin Xu, Weitao Chen, Feng Xiao, Wenxiong Kang
Keywords: gained significant attention, provide user-friendly stylization, style transfer, gained significant, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D neural style transfer has gained significant attention for its potential to provide user-friendly stylization with spatial consistency. However, existing 3D style transfer methods often fall short in inference efficiency and generalization ability, and struggle to handle dynamic scenes with temporal consistency. In this paper, we introduce 4DStyleGaussian, a novel 4D style transfer framework designed to achieve real-time stylization of arbitrary style references while maintaining reasonable content affinity, multi-view consistency, and temporal coherence. Our approach leverages an embedded 4D Gaussian Splatting technique, which is trained using a reversible neural network for reducing content loss in the feature distillation process. Utilizing the 4D embedded Gaussians, we predict a 4D style transformation matrix that facilitates spatially and temporally consistent style transfer with Gaussian Splatting. Experiments demonstrate that our method can achieve high-quality and zero-shot stylization for 4D scenarios with enhanced efficiency and spatial-temporal consistency.

[CV-64] Parameterize Structure with Differentiable Template for 3D Shape Generation

Link: https://arxiv.org/abs/2410.10399
Authors: Changfeng Ma, Pengxiao Guo, Shuangyu Yang, Yinuo Chen, Jie Guo, Chongjun Wang, Yanwen Guo, Wenping Wang
Keywords: Structural representation, generating editable, representation is crucial, crucial for reconstructing, reconstructing and generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Structural representation is crucial for reconstructing and generating editable 3D shapes with part semantics. Recent 3D shape generation works employ complicated networks and structure definitions relying on hierarchical annotations and pay less attention to the details inside parts. In this paper, we propose a method that parameterizes the shared structure in the same category using a differentiable template and corresponding fixed-length parameters. Specific parameters are fed into the template to calculate cuboids that indicate a concrete shape. We utilize the boundaries of three-view drawings of each cuboid to further describe the inside details. Shapes are represented with the parameters and three-view details inside cuboids, from which the SDF can be calculated to recover the object. Benefiting from our fixed-length parameters and three-view details, our networks for reconstruction and generation are simple and effective to learn the latent space. Our method can reconstruct or generate diverse shapes with complicated details, and interpolate them smoothly. Extensive evaluations demonstrate the superiority of our method on reconstruction from point cloud, generation, and interpolation.

[CV-65] PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation NEURIPS2024

Link: https://arxiv.org/abs/2410.10394
Authors: Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang
Keywords: follow abstract user, Language-guided robotic manipulation, abstract user instructions, Language-guided robotic, waypOinT-aware world model
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work trivially fits the data without revealing the relation between instructions and low-level executable actions; these models are prone to memorizing superficial patterns in the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby helping the model reduce computational redundancy and improve model execution efficiency. Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased by 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
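The idea of running modules at different execution frequencies can be illustrated with a small control loop. This is only a sketch of the general scheduling pattern, not the paper's AHE: the two callables, the fixed period, and the name `run_episode` are hypothetical.

```python
from typing import Any, Callable, List, Tuple


def run_episode(
    num_steps: int,
    world_model_period: int,
    predict_waypoint: Callable[[int], Any],
    decode_action: Callable[[int, Any], Any],
) -> Tuple[List[Any], int]:
    """Run the expensive waypoint predictor every `world_model_period` steps
    and the lightweight action decoder at every step.

    Between world-model calls, the decoder reuses the most recent waypoint.
    """
    waypoint = None
    actions = []
    world_model_calls = 0
    for step in range(num_steps):
        if step % world_model_period == 0:
            waypoint = predict_waypoint(step)  # slow module, low frequency
            world_model_calls += 1
        actions.append(decode_action(step, waypoint))  # fast module, every step
    return actions, world_model_calls


actions, calls = run_episode(
    num_steps=10,
    world_model_period=4,
    predict_waypoint=lambda step: step,
    decode_action=lambda step, waypoint: (step, waypoint),
)
print(calls, actions[5])
```

In this toy run the slow module is invoked 3 times instead of 10, which shows where the computational savings of such a scheme would come from.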

[CV-66] Reverse Refinement Network for Narrow Rural Road Detection in High-Resolution Satellite Imagery

Link: https://arxiv.org/abs/2410.10389
Authors: Ningjing Wang, Xinyu Wang, Yang Pan, Wanqiang Yao, Yanfei Zhong
Keywords: large-scale rural road, road extraction, transportation planning, socio-economic progress, rural roads
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The automated extraction of rural roads is pivotal for rural development and transportation planning, serving as a cornerstone for socio-economic progress. Current research primarily focuses on road extraction in urban areas. However, rural roads present unique challenges due to their narrow and irregular nature, posing significant difficulties for road extraction. In this article, a reverse refinement network (R2-Net) is proposed to extract narrow rural roads, enhancing their connectivity and distinctiveness from the background. Specifically, to preserve the fine details of roads within high-resolution feature maps, R2-Net utilizes an axis context aware module (ACAM) to capture the long-distance spatial context information in various layers. Subsequently, the multi-level features are aggregated through a global aggregation module (GAM). Moreover, in the decoder stage, R2-Net employs a reverse-aware module (RAM) to direct the attention of the network to the complex background, thus amplifying its separability. In experiments, we compare R2-Net with several state-of-the-art methods using the DeepGlobe road extraction dataset and the WHU-RuR+ global large-scale rural road dataset. R2-Net achieved superior performance and especially excelled in accurately detecting narrow roads. Furthermore, we explored the applicability of R2-Net for large-scale rural road mapping. The results show that the proposed R2-Net has significant performance advantages for large-scale rural road mapping applications.

[CV-67] V2M: Visual 2-Dimensional Mamba for Image Representation Learning

Link: https://arxiv.org/abs/2410.10382
Authors: Chengkun Wang, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, Jiwen Lu
Keywords: garnered widespread attention, widespread attention due, efficient hardware performance, garnered widespread, widespread attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2-Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2-dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on the 2-dimensional SSM formulation and incorporate Mamba to achieve hardware-efficient parallel processing. The proposed V2M effectively incorporates the 2D locality prior yet inherits the efficiency and input-dependent scalability of Mamba. Extensive experimental results on ImageNet classification and downstream visual tasks including object detection and instance segmentation on COCO and semantic segmentation on ADE20K demonstrate the effectiveness of our V2M compared with other visual backbones.
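The 2-dimensional state update described above, where each state depends on the two adjacent states along rows and columns, can be written as a simple recurrence. The sketch below uses a scalar state and fixed coefficients (`a_row`, `a_col`, `b`, `c`), all of which are assumptions. Mamba-style models use input-dependent parameters and vector states, and the paper's exact formulation is not given in the abstract.

```python
from typing import List

Grid = List[List[float]]


def scan_2d(x: Grid, a_row: float, a_col: float, b: float, c: float) -> Grid:
    """Sequential 2D state-space scan with a scalar hidden state.

    h[i][j] = a_row * h[i-1][j] + a_col * h[i][j-1] + b * x[i][j]
    y[i][j] = c * h[i][j]
    States outside the grid are treated as zero.
    """
    rows, cols = len(x), len(x[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = a_row * up + a_col * left + b * x[i][j]
    return [[c * value for value in row] for row in h]


impulse = [[1.0, 0.0], [0.0, 0.0]]
print(scan_2d(impulse, a_row=0.5, a_col=0.5, b=1.0, c=1.0))
```

Feeding in a single impulse shows how information spreads along both axes at once, which a flattened 1D scan can only approximate through its token ordering.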

[CV-68] Class Balancing Diversity Multimodal Ensemble for Alzheimer's Disease Diagnosis and Early Detection

Link: https://arxiv.org/abs/2410.10374
Authors: Arianna Francesconi, Lazzaro di Biase, Donato Cappetta, Fabio Rebecchi, Paolo Soda, Rosa Sicilia, Valerio Guarrasi
Keywords: poses significant global, significant global health, global health challenges, health challenges due, Mild Cognitive Impairment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Alzheimer’s disease (AD) poses significant global health challenges due to its increasing prevalence and associated societal costs. Early detection and diagnosis of AD are critical for delaying progression and improving patient outcomes. Traditional diagnostic methods and single-modality data often fall short in identifying early-stage AD and distinguishing it from Mild Cognitive Impairment (MCI). This study addresses these challenges by introducing a novel approach: multImodal enseMble via class BALancing diversity for iMbalancEd Data (IMBALMED). IMBALMED integrates multimodal data from the Alzheimer’s Disease Neuroimaging Initiative database, including clinical assessments, neuroimaging phenotypes, biospecimen, and subject characteristics data. It employs an ensemble of model classifiers, each trained with different class balancing techniques, to overcome class imbalance and enhance model accuracy. We evaluate IMBALMED on two diagnostic tasks (binary and ternary classification) and four binary early detection tasks (at 12, 24, 36, and 48 months), comparing its performance with state-of-the-art algorithms and an unbalanced dataset method. IMBALMED demonstrates superior diagnostic accuracy and predictive performance in both binary and ternary classification tasks, significantly improving early detection of MCI at the 48-month time point. The method shows improved classification performance and robustness, offering a promising solution for early detection and management of AD.

[CV-69] Affinity-Graph-Guided Contractive Learning for Pretext-Free Medical Image Segmentation with Minimal Annotation

Link: https://arxiv.org/abs/2410.10366
Authors: Zehua Cheng, Di Yuan, Thomas Lukasiewicz
Keywords: medical image segmentation, semi-supervised contrastive learning, contrastive learning framework, image segmentation, contrastive learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: BIBM 2024

Abstract:The combination of semi-supervised learning (SemiSL) and contrastive learning (CL) has been successful in medical image segmentation with limited annotations. However, these works often rely on pretext tasks that lack the specificity required for pixel-level segmentation, and still face overfitting issues due to insufficient supervision signals resulting from too few annotations. Therefore, this paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) by establishing additional affinity-graph-based supervision signals between the student and teacher network, to achieve medical image segmentation with minimal annotations without pretext. The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space without relying on pretext tasks. Furthermore, the framework designs an affinity-graph-guided loss function, which can improve the quality of the learned representation and the model generalization ability by exploiting the inherent structure of the data, thus mitigating overfitting. Our experiments indicate that with merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%. Under the stringent conditions where only 5% of the annotations are employed, our model exhibits a significant enhancement in performance surpassing the second best baseline by 23.09% on the dice metric and achieving an improvement of 26.57% on the notably arduous CRAG and ACDC datasets.

[CV-70] FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification NEURIPS2024

Link: https://arxiv.org/abs/2410.10356
Authors: Jingfeng Yao, Wang Cheng, Wenyu Liu, Xinggang Wang
Keywords: Diffusion Transformers, attracted significant attention, attention in research, attracted significant, significant attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024 (poster)

Abstract:Diffusion Transformers (DiT) have attracted significant attention in research. However, they suffer from a slow convergence rate. In this paper, we aim to accelerate DiT training without any architectural modification. We identify the following issues in the training process: first, certain training strategies do not consistently perform well across different data; second, the effectiveness of supervision at specific timesteps is limited. In response, we propose the following contributions: (1) We introduce a new perspective for interpreting the failure of the strategies. Specifically, we slightly extend the definition of Signal-to-Noise Ratio (SNR) and suggest observing the Probability Density Function (PDF) of SNR to understand the essence of the data robustness of the strategy. (2) We conduct numerous experiments and report over one hundred experimental results to empirically summarize a unified accelerating strategy from the perspective of PDF. (3) We develop a new supervision method that further accelerates the training process of DiT. Building on these, we propose FasterDiT, an exceedingly simple and practicable design strategy. With a few lines of code modifications, it achieves 2.30 FID on ImageNet 256 resolution at 1000k iterations, which is comparable to DiT (2.27 FID) but 7 times faster in training.

[CV-71] On Representation of 3D Rotation in the Context of Deep Learning ICCVG2024

Link: https://arxiv.org/abs/2410.10350
Authors: Viktória Pravdová, Lukáš Gajdošech, Hassan Ali, Viktor Kocur
Keywords: deep neural networks, methods of representing, paper investigates, investigates various methods, learning process
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at International Conference on Computer Vision and Graphics ICCVG 2024. The proceedings of the conference will be published in Lecture Notes in Networks and Systems (LNNS), Springer

Abstract:This paper investigates various methods of representing 3D rotations and their impact on the learning process of deep neural networks. We evaluated the performance of ResNet18 networks for 3D rotation estimation using several rotation representations and loss functions on both synthetic and real data. The real datasets contained 3D scans of industrial bins, while the synthetic datasets included views of a simple asymmetric object rendered under different rotations. On synthetic data, we also assessed the effects of different rotation distributions within the training and test sets, as well as the impact of the object’s texture. In line with previous research, we found that networks using the continuous 5D and 6D representations performed better than the discontinuous ones.
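As background on the continuous 6D representation mentioned above, the sketch below maps a 6D vector to a rotation matrix using Gram-Schmidt orthogonalization, the construction commonly used for this representation in the literature. Whether the paper uses exactly this mapping is an assumption, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]  # raises ZeroDivisionError for a zero vector


def cross(a: Vector, b: Vector) -> Vector:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(rep: Vector) -> List[Vector]:
    """Convert a 6D representation (two 3D vectors) into a 3x3 rotation matrix.

    The first vector is normalized, the second is made orthogonal to it and
    normalized, and the third column is their cross product. The input is
    assumed to be non-degenerate (non-zero, non-parallel vectors).
    """
    a1, a2 = rep[:3], rep[3:]
    b1 = normalize(a1)
    projection = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - projection * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    # b1, b2, b3 form the columns of the rotation matrix.
    return [[b1[r], b2[r], b3[r]] for r in range(3)]


print(rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0]))
```

Any non-degenerate network output maps to a valid rotation this way, so no constraint has to be enforced during training. This property is what makes the representation continuous.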

[CV-72] Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation

Link: https://arxiv.org/abs/2410.10319
Authors: Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang
Keywords: multi-modal language models, visual tokens, visual, plays a crucial, crucial role
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 3 figures

Abstract:The projector plays a crucial role in multi-modal language models (MLLMs). The number of visual tokens it outputs affects the efficiency of the MLLM, while the quality of the visual tokens influences the visual understanding capabilities of the MLLM. Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency, often overlooking the inherent spatial discrepancy between the serialized 2-dimensional visual token sequences and natural language token sequences. A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue. In detail, our SAEP method employs a modified separable depthwise convolution module on multi-layer visual features to enhance the spatial information of visual tokens. As a result, our SAEP method can not only reduce the number of visual tokens by 75%, but also significantly improve the multimodal spatial understanding capability of MLLMs. Moreover, compared to existing projectors, our SAEP achieves the best performance on a wide range of multimodal evaluation benchmarks, which demonstrates its effectiveness in bridging the modality gap.
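The 75% token reduction is consistent with halving the token grid along both spatial axes. The sketch below shows one way this could arise, a depthwise 2x2 convolution with stride 2 over a grid of visual tokens. The kernel size, stride, and function name are assumptions for illustration. The abstract does not describe the module's actual configuration or its multi-layer feature aggregation.

```python
from typing import List

Token = List[float]


def depthwise_downsample(tokens: List[List[Token]], kernels: List[List[List[float]]]) -> List[List[Token]]:
    """Apply a depthwise 2x2 convolution with stride 2 to a token grid.

    tokens[r][c] is the channel vector of the token at grid position (r, c).
    kernels[ch] is the 2x2 weight matrix applied to channel ch only, so
    channels are never mixed (the "depthwise" part).
    """
    rows, cols, channels = len(tokens), len(tokens[0]), len(kernels)
    output = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            out_row.append([
                sum(
                    kernels[ch][dr][dc] * tokens[r + dr][c + dc][ch]
                    for dr in range(2)
                    for dc in range(2)
                )
                for ch in range(channels)
            ])
        output.append(out_row)
    return output


grid = [[[float(r * 4 + c), 1.0] for c in range(4)] for r in range(4)]
average = [[[0.25, 0.25], [0.25, 0.25]] for _ in range(2)]
reduced = depthwise_downsample(grid, average)
before = len(grid) * len(grid[0])
after = len(reduced) * len(reduced[0])
print(f"{before} tokens -> {after} tokens ({1 - after / before:.0%} fewer)")
```

Each output token summarizes a 2x2 spatial neighbourhood, so the reduction keeps local 2D structure rather than discarding tokens arbitrarily.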

[CV-73] QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models NEURIPS2024

Link: https://arxiv.org/abs/2410.10318
Authors: Zhumazhan Balapanov,Edward Magongo,Vanessa Matvei,Olivia Holmberg,Jonathan Pei,Kevin Zhu
Keywords: Convolutional neural networks, computer vision tasks, limit real-world applicability, made significant advances, Convolutional neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024 workshop on Neural Compression

Abstract:Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency often limit real-world applicability. While model compression techniques have gained popularity as solutions, they often overlook the critical balance between low latency and uncompromised accuracy. By harnessing quantum-inspired pruning, tensor decomposition, and annealing-based matrix factorization - three quantum-inspired concepts - we introduce QIANets: a novel approach to redesigning the traditional GoogLeNet, DenseNet, and ResNet-18 model architectures to process more parameters and computations whilst maintaining low inference times. Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times, along with effective preservation of accuracy.

[CV-74] GlobalMamba: Global Image Serialization for Vision Mamba

Link: https://arxiv.org/abs/2410.10316
Authors: Chengkun Wang,Wenzhao Zheng,Jie Zhou,Jiwen Lu
Keywords: demonstrated strong performance, demonstrated strong, strong performance, performance with linear, linear complexity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision mambas have demonstrated strong performance with linear complexity in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
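
The serialization steps described in the abstract (DCT, grouping by frequency band, inverse transform per band) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it handles a small square grayscale image, uses an orthonormal DCT-II built from explicit matrices, and defines a band by the index sum `u + v`, which is a choice made here for simplicity.

```python
import math


def dct_matrix(n):
    # Orthonormal DCT-II basis: row k holds the k-th cosine basis vector.
    return [
        [
            (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
            * math.cos(math.pi * (2 * i + 1) * k / (2.0 * n))
            for i in range(n)
        ]
        for k in range(n)
    ]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(img):
    c = dct_matrix(len(img))
    return matmul(matmul(c, img), transpose(c))


def idct2(coef):
    c = dct_matrix(len(coef))
    return matmul(matmul(transpose(c), coef), c)


def frequency_band_images(img, num_bands):
    """Split a square image into per-band images, ordered low to high frequency."""
    n = len(img)
    coef = dct2(img)
    total = 2 * (n - 1) + 1  # number of distinct values of u + v
    images = []
    for b in range(num_bands):
        lo = b * total / num_bands
        hi = (b + 1) * total / num_bands
        # Keep only the coefficients of this band, zero the rest.
        band = [
            [coef[u][v] if lo <= u + v < hi else 0.0 for v in range(n)]
            for u in range(n)
        ]
        images.append(idct2(band))
    return images
```

Because the transform is linear and the bands are disjoint, the band images sum back to the original image; each one is a full-resolution view, so every token derived from it carries global rather than patch-local information.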

[CV-75] LG-CAV: Train Any Concept Activation Vector with Language Guidance

Link: https://arxiv.org/abs/2410.10308
Authors: Qihan Huang,Jie Song,Mengqi Xue,Haofei Zhang,Bingde Hu,Huiqiong Wang,Hao Jiang,Xingen Wang,Mingli Song
Keywords: attracted broad research, broad research interest, elegantly attributing model, attributing model predictions, Concept activation vector
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with the vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose the activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at this https URL.

[CV-76] Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Link: https://arxiv.org/abs/2410.10306
Authors: Shuai Tan,Biao Gong,Xiang Wang,Shiwei Zhang,Dandan Zheng,Ruobing Zheng,Kecheng Zheng,Jingdong Chen,Ming Yang
Keywords: generates high-quality videos, recent years, generates high-quality, significant progress, progress in recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 15 figures, conference

Abstract:Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Our in-depth analysis attributes this limitation to their insufficient modeling of motion, which is unable to comprehend the movement pattern of the driving video, thus imposing a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X, a universal animation framework based on LDM for various character types (collectively named X), including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of LDM by simulating possible inputs in advance that may arise during inference. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A^2Bench) to evaluate the performance of Animate-X on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X compared to state-of-the-art methods.

[CV-77] ROA-BEV: 2D Region-Oriented Attention for BEV-based 3D Object

Link: https://arxiv.org/abs/2410.10298
Authors: Jiwei Chen,Laiyan Ding,Chi Zhang,Feifei Li,Rui Huang
Keywords: Vision-based BEV, Object Detection Network, autonomous driving, recently become popular, popular in autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based BEV (Bird-Eye-View) 3D object detection has recently become popular in autonomous driving. However, objects with a high similarity to the background from a camera perspective cannot be detected well by existing methods. In this paper, we propose 2D Region-oriented Attention for a BEV-based 3D Object Detection Network (ROA-BEV), which can make the backbone focus more on feature learning in areas where objects may exist. Moreover, our method increases the information content of ROA through a multi-scale structure. In addition, every block of ROA utilizes a large kernel to ensure that the receptive field is large enough to capture information about large objects. Experiments on nuScenes show that ROA-BEV improves performance when built on BEVDet and BEVDepth. The code will be released soon.

[CV-78] A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration NEURIPS2024

Link: https://arxiv.org/abs/2410.10295
Authors: Renlang Huang,Yufan Tang,Jiming Chen,Liang Li
Keywords: Deep learning-based feature, shown great superiority, Deep learning-based, pose priors, shown great
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024 as poster

Abstract:Deep learning-based feature matching has shown great superiority for point cloud registration in the absence of pose priors. Although coarse-to-fine matching approaches are prevalent, the coarse matching of existing methods is typically sparse and loose without consideration of geometric consistency, which makes the subsequent fine matching rely on ineffective optimal transport and hypothesis-and-selection methods for consistency. Therefore, these methods are neither efficient nor scalable for real-time applications such as odometry in robotics. To address these issues, we design a consistency-aware spot-guided Transformer (CAST), which incorporates a spot-guided cross-attention module to avoid interfering with irrelevant areas, and a consistency-aware self-attention module to enhance matching capabilities with geometrically consistent correspondences. Furthermore, a lightweight fine matching module for both sparse keypoints and dense features can estimate the transformation accurately. Extensive experiments on both outdoor LiDAR point cloud datasets and indoor RGBD point cloud datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness.

[CV-79] Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection

Link: https://arxiv.org/abs/2410.10289
Authors: Jiawen Zhu,Yew-Soon Ong,Chunhua Shen,Guansong Pang
Keywords: Current zero-shot anomaly, zero-shot anomaly detection, show remarkable success, large pre-trained vision-language, pre-trained vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 19 figures

Abstract:Current zero-shot anomaly detection (ZSAD) methods show remarkable success in prompting large pre-trained vision-language models to detect anomalies in a target dataset without using any dataset-specific training or demonstration. However, these methods are often focused on crafting/learning prompts that capture only coarse-grained semantics of abnormality, e.g., high-level semantics like “damaged”, “imperfect”, or “defective” on carpet. They therefore have limited capability in recognizing diverse abnormality details with distinctive visual appearance, e.g., specific defect types like color stains, cuts, holes, and threads on carpet. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for more accurate ZSAD. To this end, we introduce a novel compound abnormality prompting module in FAPrompt to learn a set of complementary, decomposed abnormality prompts, where each abnormality prompt is formed by a compound of shared normal tokens and a few learnable abnormal tokens. On the other hand, the fine-grained abnormality patterns can be very different from one dataset to another. To enhance their cross-dataset generalization, we further introduce a data-dependent abnormality prior module that learns to derive abnormality features from each query/test image as a sample-wise abnormality prior to ground the abnormality prompts in a given target dataset. Comprehensive experiments conducted across 19 real-world datasets, covering both industrial defects and medical anomalies, demonstrate that FAPrompt substantially outperforms state-of-the-art methods by at least 3%-5% AUC/AP in both image- and pixel-level ZSAD tasks. Code is available at this https URL.

[CV-80] Manifold-Aware Local Feature Modeling for Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2410.10287
Authors: Sicheng Shen,Jinming Cao,Yifang Yin,Roger Zimmermann
Keywords: Achieving precise medical, Achieving precise, effective treatment planning, treatment planning, Local Feature Modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Achieving precise medical image segmentation is vital for effective treatment planning and accurate disease diagnosis. Traditional fully-supervised deep learning methods, though highly precise, are heavily reliant on large volumes of labeled data, which are often difficult to obtain due to the expertise required for medical annotations. This has led to the rise of semi-supervised learning approaches that utilize both labeled and unlabeled data to mitigate the label scarcity issue. In this paper, we introduce the Manifold-Aware Local Feature Modeling Network (MANet), which enhances the U-Net architecture by incorporating manifold supervision signals. This approach focuses on improving boundary accuracy, which is crucial for reliable medical diagnosis. To further extend the versatility of our method, we propose two variants: MA-Sobel and MA-Canny. The MA-Sobel variant employs the Sobel operator, which is effective for both 2D and 3D data, while the MA-Canny variant utilizes the Canny operator, specifically designed for 2D images, to refine boundary detection. These variants allow our method to adapt to various medical image modalities and dimensionalities, ensuring broader applicability. Our extensive experiments on datasets such as ACDC, LA, and Pancreas-NIH demonstrate that MANet consistently surpasses state-of-the-art methods in performance metrics like Dice and Jaccard scores. The proposed method also shows improved generalization across various semi-supervised segmentation networks, highlighting its robustness and effectiveness. Visual analysis of segmentation results confirms that MANet offers clearer and more accurate class boundaries, underscoring the value of manifold information in medical image segmentation.
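
The abstract names the Sobel operator but not how MANet feeds it into training. As a sketch of the operator alone, assuming a 2D mask stored as a list of rows (the function name is illustrative), the gradient magnitude highlights exactly the pixels on a class boundary:

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the Sobel gradient magnitude; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(
                SOBEL_X[a][b] * img[i - 1 + a][j - 1 + b]
                for a in range(3)
                for b in range(3)
            )
            gy = sum(
                SOBEL_Y[a][b] * img[i - 1 + a][j - 1 + b]
                for a in range(3)
                for b in range(3)
            )
            out[i][j] = math.hypot(gx, gy)
    return out
```

On a binary segmentation mask the response is zero inside uniform regions and non-zero only next to a label change, which is why an edge map of this kind can serve as a boundary-focused signal.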

[CV-81] Exploring Semi-Supervised Learning for Online Mapping

Link: https://arxiv.org/abs/2410.10279
Authors: Adam Lilja,Erik Wallin,Junsheng Fu,Lars Hammarstrand
Keywords: scaling autonomous driving, well-defined areas, important for scaling, scaling autonomous, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online mapping is important for scaling autonomous driving beyond well-defined areas. Training a model to produce a local map, including lane markers, road edges, and pedestrian crossings using only onboard sensory information, traditionally requires extensive labelled data, which is difficult and costly to obtain. This paper draws inspiration from semi-supervised learning techniques in other domains, demonstrating their applicability to online mapping. Additionally, we propose a simple yet effective method to exploit inherent attributes of online mapping to further enhance performance by fusing the teacher’s pseudo-labels from multiple samples. The performance gap to using all labels is reduced from 29.6 to 3.4 mIoU on Argoverse, and from 12 to 3.4 mIoU on nuScenes, utilising only 10% of the labelled data. We also demonstrate strong performance in extrapolating to new cities outside those in the training data. Specifically, for the challenging nuScenes dataset, adapting from Boston to Singapore, performance increases by 6.6 mIoU when unlabelled data from Singapore is included in training.

[CV-82] big.LITTLE Vision Transformer for Efficient Visual Recognition

Link: https://arxiv.org/abs/2410.10267
Authors: He Guo,Yulong Wang,Zixuan Ye,Jifeng Dai,Yuwen Xiong
Keywords: big.LITTLE Vision Transformer, innovative architecture aimed, Vision Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and allocates them accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient little model. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, as it ensures that detailed analysis is reserved for the most important information. To validate the effectiveness of our big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and the segment anything task. Our results demonstrate that the big.LITTLE architecture not only maintains high accuracy but also achieves substantial computational savings. Specifically, our approach enables the efficient handling of large-scale visual recognition tasks by dynamically balancing the trade-offs between performance and efficiency. The success of our method underscores the potential of hybrid models in optimizing both computation and performance in visual recognition tasks, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.
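
The dynamic inference idea, sending important tokens to the large block and the rest to the small one, can be sketched as a top-k routing step. This is a simplified illustration: the importance scores are taken as given, `big_fn` and `little_fn` are hypothetical stand-ins for the two blocks, and the fixed budget `num_big` is an assumption rather than the paper's mechanism.

```python
def route_tokens(importance, num_big):
    """Split token indices into (big, little) by descending importance."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(order[:num_big]), sorted(order[num_big:])


def big_little_forward(tokens, importance, num_big, big_fn, little_fn):
    """Apply the expensive block to the top tokens and the cheap block to the rest."""
    big_idx, little_idx = route_tokens(importance, num_big)
    out = [None] * len(tokens)
    for i in big_idx:
        out[i] = big_fn(tokens[i])
    for i in little_idx:
        out[i] = little_fn(tokens[i])
    # Results are written back in the original token order.
    return out
```

With `num_big` set to a fraction of the sequence length, the cost of the expensive block scales with that fraction rather than with the full token count.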

[CV-83] Slide-based Graph Collaborative Training for Histopathology Whole Slide Image Analysis

Link: https://arxiv.org/abs/2410.10260
Authors: Jun Shi,Tong Shu,Zhiguo Jiang,Wei Wang,Haibo Wu,Yushan Zheng
Keywords: computational pathology lies, computational pathology, pathology lies, consensus that pathological, pathological characteristics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The development of computational pathology lies in the consensus that pathological characteristics of tumors are significant guidance for cancer diagnostics. Most existing research focuses on the inner-contextual information within each WSI yet ignores the possible inter-correlations between slides. As the development of tumors is a continuous process involving a series of histological, morphological, and genetic changes that accumulate over time, the similarities and differences between WSIs across various stages, grades, locations and patients should contribute to the representation of WSIs and deserve to be taken into account in WSI modeling. To verify the benefit of introducing slide inter-correlations into the representation learning of WSIs, we propose a generic WSI analysis pipeline, SlideGCD, that can be adapted to any existing Multiple Instance Learning (MIL) framework and improve its performance. With the new paradigm, the prior knowledge of cancer development can participate in the end-to-end workflow, which concurrently initializes and refines the slide representation, as a guide for message passing in the slide-based graph. Extensive comparisons and experiments are conducted to validate the effectiveness and robustness of the proposed pipeline across 4 different tasks, including cancer subtyping, cancer staging, survival prediction, and gene mutation prediction, with 7 representative SOTA WSI analysis frameworks as backbones.

[CV-84] Saliency Guided Optimization of Diffusion Latents

Link: https://arxiv.org/abs/2410.10257
Authors: Xiwen Wang,Jizhe Zhou,Xuekang Zhu,Cheng Li,Mao Li
Keywords: generating decent images, generating decent, longer challenging, rapid advances, text prompts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advances in diffusion models, generating decent images from text prompts is no longer challenging. The key to text-to-image generation is how to optimize the results of a text-to-image generation model so that they can be better aligned with human intentions or prompts. Existing optimization methods commonly treat the entire image uniformly and conduct global optimization. These methods overlook the fact that when viewing an image, the human visual system naturally prioritizes attention toward salient areas, often neglecting less or non-salient regions. That is, humans are likely to neglect optimizations in non-salient areas. Consequently, although model retraining is conducted under the guidance of additional large and multimodality models, existing methods, which perform uniform optimizations, yield sub-optimal results. To address this alignment challenge effectively and efficiently, we propose Saliency Guided Optimization Of Diffusion Latents (SGOOL). We first employ a saliency detector to mimic the human visual attention system and mark out the salient regions. To avoid retraining an additional model, our method directly optimizes the diffusion latents. Besides, SGOOL utilizes an invertible diffusion process and endows it with the merits of constant memory implementation. Hence, our method becomes a parameter-efficient and plug-and-play fine-tuning method. Extensive experiments have been done with several metrics and human evaluation. Experimental results demonstrate the superiority of SGOOL in image quality and prompt alignment.

[CV-85] Automated extraction of 4D aircraft trajectories from video recordings

Link: https://arxiv.org/abs/2410.10249
Authors: Jean-François Villeforceix(BEA, IGN, ENSG)
Keywords: ground cameras involving, analyze accident videos, Bureau d’Enquêtes et d’Analyses, l’Aviation Civile
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: in French language, CFPT-RFIAP 2018, SFPT (Société Française de Photogrammétrie et de Télédétection); RFIAP (Reconnaissance des Formes, Image, Apprentissage et Perception), Jun 2018, Champs sur Marne - Marne la Vallée, France

Abstract:The Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA) has to analyze accident videos from on-board or ground cameras involving all types of aircraft. Until now, this analysis has been manual and time-consuming. The aim of this study is to identify the applications of photogrammetry and to automate the extraction of 4D trajectories from these videos. Taking into account all potential flight configurations, photogrammetric algorithms are being developed on the basis of IGN’s MicMac software and tested in the field. The results of these automated processes are intended to replace flight data from recorders such as FDRs or CVRs, which are sometimes missing. The information of interest to the BEA includes: three-dimensional position with the associated time component, the orientations of the aircraft’s three axes (pitch, roll and yaw navigation angles) and average speeds (including rate of climb).
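
Once a 4D trajectory (time plus three-dimensional position) has been extracted, the average speeds mentioned above follow from simple differences. A minimal sketch, assuming samples of the form `(t, x, y, z)` in seconds and metres with `z` as altitude; the layout and function name are illustrative and not taken from the study:

```python
import math


def average_speeds(trajectory):
    """Return (average ground speed, average rate of climb) in m/s.

    `trajectory` is a time-ordered list of (t, x, y, z) samples.
    """
    duration = trajectory[-1][0] - trajectory[0][0]
    # Horizontal distance is accumulated segment by segment along the path.
    ground = sum(
        math.hypot(b[1] - a[1], b[2] - a[2])
        for a, b in zip(trajectory, trajectory[1:])
    )
    # Rate of climb only depends on the net altitude change.
    climb = trajectory[-1][3] - trajectory[0][3]
    return ground / duration, climb / duration
```

For instance, three samples one second apart that each advance 50 m horizontally and 5 m vertically give 50 m/s ground speed and a 5 m/s rate of climb.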

[CV-86] LOBG: Less Overfitting for Better Generalization in Vision-Language Model

Link: https://arxiv.org/abs/2410.10247
Authors: Chenhao Ding,Xinyuan Gao,Songlin Dong,Yuhang He,Qiang Wang,Alex Kot,Yihong Gong
Keywords: Existing prompt learning, Vision-Language Models, Existing prompt, downstream tasks, VLM to downstream
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing prompt learning methods in Vision-Language Models (VLM) have effectively enhanced the transfer capability of VLM to downstream tasks, but they suffer from a significant decline in generalization due to severe overfitting. To address this issue, we propose a framework named LOBG for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts. To further mitigate overfitting, we developed a structural topology preservation (STP) loss at the feature level, which endows the feature space with overall plasticity, allowing effective reshaping of the feature space during optimization. Additionally, we employed hierarchical logit distillation (HLD) at the output level to constrain outputs, complementing STP at the output end. Extensive experimental results demonstrate that our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.

[CV-87] Capture Artifacts via Progressive Disentangling and Purifying Blended Identities for Deepfake Detection

Link: https://arxiv.org/abs/2410.10244
Authors: Weijie Zhou,Xiaoqing Luo,Zhancheng Zhang,Jiachen He,Xiaojun Wu
Keywords: Deepfake detection technology, Deepfake detection, Deepfake, raised serious concerns, concerns regarding privacy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Deepfake technology has raised serious concerns regarding privacy breaches and trust issues. To tackle these challenges, Deepfake detection technology has emerged. Current methods over-rely on the global feature space, which contains redundant information independent of the artifacts. As a result, existing Deepfake detection techniques suffer performance degradation when encountering unknown datasets. To reduce information redundancy, the current methods use disentanglement techniques to roughly separate the fake faces into artifacts and content information. However, these methods lack a solid disentanglement foundation and cannot guarantee the reliability of their disentangling process. To address these issues, a Deepfake detection method based on progressive disentangling and purifying blended identities is innovatively proposed in this paper. Based on the artifact generation mechanism, the coarse- and fine-grained strategies are combined to ensure the reliability of the disentanglement method. Our method aims to more accurately capture and separate artifact features in fake faces. Specifically, we first perform the coarse-grained disentangling on fake faces to obtain a pair of blended identities that require no additional annotation to distinguish between source face and target face. Then, the artifact features from each identity are separated to achieve fine-grained disentanglement. To obtain pure identity information and artifacts, an Identity-Artifact Correlation Compression module (IACC) is designed based on the information bottleneck theory, effectively reducing the potential correlation between identity information and artifacts. Additionally, an Identity-Artifact Separation Contrast Loss is designed to enhance the independence of artifact features post-disentangling. Finally, the classifier only focuses on pure artifact features to achieve a generalized Deepfake detector.

[CV-88] ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization

Link: https://arxiv.org/abs/2410.10238
Authors: Jiawei Li,Fanrui Zhang,Jiaying Zhu,Esther Sun,Qiang Zhang,Zheng-Jun Zha
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, shown strong capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures

Abstract:Multimodal Large Language Models (MLLMs), such as GPT-4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to the learning of low-level semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which extracts precise forgery mask information from input images and facilitates pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing it to effectively capture multi-scale fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.

[CV-89] LADMIM: Logical Anomaly Detection with Masked Image Modeling in Discrete Latent Space

Link: https://arxiv.org/abs/2410.10234
Authors: Shunsuke Sakai,Tatushito Hasegawa,Makoto Koshino
Keywords: industrial anomaly detection, making detecting anomalies, Detecting anomalies, anomaly detection, incorrect combinations
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Detecting anomalies such as incorrect combinations of objects or deviations in their positions is a challenging problem in industrial anomaly detection. Traditional methods mainly focus on local features of normal images, such as scratches and dirt, which makes it difficult to detect anomalies in the relationships between features. Masked image modeling (MIM) is a self-supervised learning technique that predicts the feature representation of masked regions in an image. To reconstruct the masked regions, it is necessary to understand how the image is composed, allowing the learning of relationships between features within the image. We propose a novel approach that leverages the characteristics of MIM to detect logical anomalies effectively. To address blurriness in the reconstructed image, we replace pixel prediction with predicting the probability distribution of discrete latent variables of the masked regions using a tokenizer. We evaluated the proposed method on the MVTec LOCO dataset, achieving an average AUC of 0.867, surpassing traditional reconstruction-based and distillation-based methods.

[CV-90] KNN Transformer with Pyramid Prompts for Few-Shot Learning

Link: https://arxiv.org/abs/2410.10227
Authors: Wenhao Li,Qiangchang Wang,Peng Zhao,Yilong Yin
Keywords: Few-Shot Learning, visual features, limited labeled data, visual, aims to recognize
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, accepted by ACM Multimedia 2024

Abstract:Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL due to the confusion of numerous irrelevant tokens during interaction. To address these issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA only selects the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as the context prompt to provide the global context in three cascaded stages. As a result, irrelevant tokens can be progressively suppressed. Second, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, augmented visual features and class-aware prompts interact via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representations in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.
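
The core of the K-NN selection step, where each token attends only to its K most relevant tokens, can be sketched as top-K masked self-attention. This is a simplified illustration under stated assumptions: queries, keys and values are the raw tokens (no learned projections), a single head is used, and the mean-token context prompt and cascaded stages from the abstract are left out.

```python
import math


def knn_self_attention(tokens, k):
    """Self-attention where each query keeps only its k highest-scoring keys."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        # Indices of the k most relevant tokens for this query.
        keep = sorted(range(len(tokens)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax restricted to the kept tokens (max-subtracted for stability).
        top = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - top) for j in keep}
        norm = sum(weights.values())
        outputs.append(
            [sum(weights[j] / norm * tokens[j][d] for j in keep) for d in range(dim)]
        )
    return outputs
```

Setting `k` to the sequence length recovers ordinary softmax attention, while a small `k` removes low-scoring tokens from the weighted sum entirely instead of merely down-weighting them.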

[CV-91] Detecting Unforeseen Data Properties with Diffusion Autoencoder Embeddings using Spine MRI data MICCAI2024

Link: https://arxiv.org/abs/2410.10220
Authors: Robert Graf, Florian Hunecke, Soeren Pohl, Matan Atad, Hendrik Moeller, Sophie Starck, Thomas Kroencke, Stefanie Bette, Fabian Bamberg, Tobias Pischon, Thoralf Niendorf, Carsten Schmidt, Johannes C. Paetzold, Daniel Rueckert, Jan S Kirschke
Keywords: made significant strides, diagnostics and prognostics, German National Cohort, made significant, significant strides
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was accepted in the “Workshop on Interpretability of Machine Intelligence in Medical Image Computing” (iMIMIC) at MICCAI 2024

Abstract:Deep learning has made significant strides in medical imaging, leveraging the use of large datasets to improve diagnostics and prognostics. However, large datasets often come with inherent errors through subject selection and acquisition. In this paper, we investigate the use of Diffusion Autoencoder (DAE) embeddings for uncovering and understanding data characteristics and biases, including biases for protected variables like sex and data abnormalities indicative of unwanted protocol variations. We use sagittal T2-weighted magnetic resonance (MR) images of the neck, chest, and lumbar region from 11186 German National Cohort (NAKO) participants. We compare DAE embeddings with existing generative models like StyleGAN and Variational Autoencoder. Evaluations on a large-scale dataset consisting of sagittal T2-weighted MR images of three spine regions show that DAE embeddings effectively separate protected variables such as sex and age. Furthermore, we used t-SNE visualization to identify unwanted variations in imaging protocols, revealing differences in head positioning. Our embedding can identify samples where a sex predictor will have issues learning the correct sex. Our findings highlight the potential of using advanced embedding techniques like DAEs to detect data quality issues and biases in medical imaging datasets. Identifying such hidden relations can enhance the reliability and fairness of deep learning models in healthcare applications, ultimately improving patient care and outcomes.

[CV-92] MagicEraser: Erasing Any Objects via Semantics-Aware Control ECCV2024

Link: https://arxiv.org/abs/2410.10207
Authors: Fan Li, Zixiao Zhang, Yi Huang, Jianzhuang Liu, Renjing Pei, Bin Shao, Songcen Xu
Keywords: restore corrupted regions, referencing surrounding background, traditional image inpainting, object erasure task, traditional image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:The traditional image inpainting task aims to restore corrupted regions by referencing surrounding background and foreground. However, the object erasure task, which is in increasing demand, aims to erase objects and generate harmonious background. Previous GAN-based inpainting methods struggle with intricate texture generation. Emerging diffusion model-based algorithms, such as Stable Diffusion Inpainting, exhibit the capability to generate novel content, but they often produce incongruent results at the locations of the erased objects and require high-quality text prompt inputs. To address these challenges, we introduce MagicEraser, a diffusion model-based framework tailored for the object erasure task. It consists of two phases: content initialization and controllable generation. In the latter phase, we develop two plug-and-play modules called prompt tuning and semantics-aware attention refocus. Additionally, we propose a data construction strategy that generates training data specially suitable for this task. MagicEraser achieves fine and effective control of content generation while mitigating undesired artifacts. Experimental results highlight a valuable advancement of our approach in the object erasure task.

[CV-93] Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention

Link: https://arxiv.org/abs/2410.10184
Authors: Ying Liu, Ge Bai, Chenji Lu, Shilong Li, Zhang Zhang, Ruifang Liu, Wenbin Guo
Keywords: Visual Question Answering, Question Answering, Visual Question, advancements in Visual, information remains unresolved
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the remarkable advancements in Visual Question Answering (VQA), the challenge of mitigating the language bias introduced by textual information remains unresolved. Previous approaches capture language bias from a coarse-grained perspective. However, the finer-grained information within a sentence, such as context and keywords, can result in different biases. Due to the ignorance of fine-grained information, most existing methods fail to sufficiently capture language bias. In this paper, we propose a novel causal intervention training scheme named CIBi to eliminate language bias from a finer-grained perspective. Specifically, we divide the language bias into context bias and keyword bias. We employ causal intervention and contrastive learning to eliminate context bias and improve the multi-modal representation. Additionally, we design a new question-only branch based on counterfactual generation to distill and eliminate keyword bias. Experimental results illustrate that CIBi is applicable to various VQA models, yielding competitive performance.

[CV-94] Identity-Focused Inference and Extraction Attacks on Diffusion Models

Link: https://arxiv.org/abs/2410.10177
Authors: Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, Prasant Mohapatra
Keywords: generating synthetic images, increasing reliance, generating synthetic, amplified concerns, inference
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 5 figures, 3 tables, 12 pages main body content

Abstract:The increasing reliance on diffusion models for generating synthetic images has amplified concerns about the unauthorized use of personal data, particularly facial images, in model training. In this paper, we introduce a novel identity inference framework to hold model owners accountable for including individuals’ identities in their training data. Our approach moves beyond traditional membership inference attacks by focusing on identity-level inference, providing a new perspective on data privacy violations. Through comprehensive evaluations on two facial image datasets, Labeled Faces in the Wild (LFW) and CelebA, our experiments demonstrate that the proposed membership inference attack surpasses baseline methods, achieving an attack success rate of up to 89% and an AUC-ROC of 0.91, while the identity inference attack attains 92% on LDM models trained on LFW, and the data extraction attack achieves 91.6% accuracy on DDPMs, validating the effectiveness of our approach across diffusion models.

[CV-95] First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending ECAI2024

Link: https://arxiv.org/abs/2410.10168
Authors: Zhenhang Li, Yan Shu, Weichao Zeng, Dongbao Yang, Yu Zhou
Keywords: visual text generation, visual text blending, visual text, image generation abilities, existing visual text
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECAI 2024

Abstract:Diffusion models, known for their impressive image generation abilities, have played a pivotal role in the rise of visual text generation. Nevertheless, existing visual text generation methods often focus on generating entire images with text prompts, leading to imprecise control and limited practicality. A more promising direction is visual text blending, which focuses on seamlessly merging texts onto text-free backgrounds. However, existing visual text blending methods often struggle to generate high-fidelity and diverse images due to a shortage of backgrounds for synthesis and limited generalization capabilities. To overcome these challenges, we propose a new visual text blending paradigm including both creating backgrounds and rendering texts. Specifically, a background generator is developed to produce high-fidelity and text-free natural images. Moreover, a text renderer named GlyphOnly is designed for achieving visually plausible text-background integration. GlyphOnly, built on a Stable Diffusion framework, utilizes glyphs and backgrounds as conditions for accurate rendering and consistency control, and is equipped with an adaptive text block exploration strategy for small-scale text rendering. We also explore several downstream applications based on our method, including scene text dataset synthesis for boosting scene text detectors, as well as text image customization and editing. Code and model will be available at this https URL.

[CV-96] X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

Link: https://arxiv.org/abs/2410.10167
Authors: Xinyan Chen, Jianfei Yang
Keywords: significantly impacted fields, human body information, interpret human body, advanced deep learning, body information
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel “X-fusion” mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.

[CV-97] Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?

Link: https://arxiv.org/abs/2410.10160
Authors: Zeliang Zhang, Xin Liang, Mingqian Feng, Susan Liang, Chenliang Xu
Keywords: addressing data scarcity, continuous model improvement, enabling continuous model, researchers have increasingly, demand for high-quality
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures

Abstract:As the demand for high-quality training data escalates, researchers have increasingly turned to generative models to create synthetic data, addressing data scarcity and enabling continuous model improvement. However, reliance on self-generated data introduces a critical question: Will this practice amplify bias in future models? While most research has focused on overall performance, the impact on model bias, particularly subgroup bias, remains underexplored. In this work, we investigate the effects of the generated data on image classification tasks, with a specific focus on bias. We develop a practical simulation environment that integrates a self-consuming loop, where the generative model and classification model are trained synergistically. Hundreds of experiments are conducted on Colorized MNIST, CIFAR-20/100, and Hard ImageNet datasets to reveal changes in fairness metrics across generations. In addition, we provide a conjecture to explain the bias dynamics when training models on continuously augmented datasets across generations. Our findings contribute to the ongoing debate on the implications of synthetic data for fairness in real-world applications.

[CV-98] Fast and Accurate Neural Rendering Using Semi-Gradients

Link: https://arxiv.org/abs/2410.10149
Authors: In-Young Cho, Jaewoong Cho
Keywords: effective neural network-based, neural network-based framework, global illumination rendering, propose a simple, simple yet effective
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:We propose a simple yet effective neural network-based framework for global illumination rendering. Recently, rendering techniques that learn neural radiance caches by minimizing the difference (i.e., residual) between the left and right sides of the rendering equation have been suggested. Due to their ease of implementation and the advantage of excluding path integral calculations, these techniques have been applied to various fields, such as free-viewpoint rendering, differentiable rendering, and real-time rendering. However, issues of slow training and occasionally darkened renders have been noted. We identify the cause of these issues as the bias and high variance present in the gradient estimates of the existing residual-based objective function. To address this, we introduce a new objective function that maintains the same global optimum as before but allows for unbiased and low-variance gradient estimates, enabling faster and more accurate training of neural networks. In conclusion, this method is simply implemented by ignoring the partial derivatives of the right-hand side, and theoretical and experimental analyses demonstrate the effectiveness of the proposed loss.
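
The abstract's closing remark, that the method amounts to ignoring the partial derivatives of the right-hand side, can be illustrated on a toy problem. The sketch below is not the paper's renderer or loss: it assumes a scalar fixed-point equation `theta = emission + albedo * theta` as a stand-in for a rendering-equation residual, and `solve_fixed_point`, `emission`, and `albedo` are hypothetical names.

```python
def solve_fixed_point(emission, albedo, steps=200, lr=0.5, semi=True):
    """Drive the residual of theta = emission + albedo * theta to zero.

    Toy scalar illustration only. With semi=True the right-hand side is
    treated as a constant when differentiating (a semi-gradient step);
    with semi=False the full gradient of 0.5 * residual**2 is used.
    """
    theta = 0.0
    for _ in range(steps):
        residual = theta - (emission + albedo * theta)  # left side minus right side
        if semi:
            grad = residual  # d(left)/d(theta) = 1; right side held fixed
        else:
            grad = residual * (1.0 - albedo)  # chain rule through both sides
        theta -= lr * grad
    return theta
```

For `emission=1.0` and `albedo=0.5` both variants approach the same solution, `theta = 2.0`, which mirrors the claim that the optimum is unchanged. The variance and bias benefits described in the abstract only appear with stochastic estimates of the right-hand side, which this deterministic toy does not model.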

[CV-99] Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution

Link: https://arxiv.org/abs/2410.10140
Authors: Junbo Qiao, Jincheng Liao, Wei Li, Yulun Zhang, Yong Guo, Yi Wen, Zhangxizi Qiu, Jiao Xie, Jie Hu, Shaohui Lin
Keywords: State Space Models, State Space, Space Models, achieving successful applications, low-level vision tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State Space Models (SSM), such as Mamba, have shown strong representation ability in modeling long-range dependency with linear complexity, achieving successful applications from high-level to low-level vision tasks. However, SSM’s sequential nature necessitates multiple scans in different directions to compensate for the loss of spatial dependency when unfolding the image into a 1D sequence. This multi-direction scanning strategy significantly increases the computation overhead and is unbearable for high-resolution image processing. To address this problem, we propose a novel Hierarchical Mamba network, namely, Hi-Mamba, for image super-resolution (SR). Hi-Mamba consists of two key designs: (1) The Hierarchical Mamba Block (HMB) assembled by a Local SSM (L-SSM) and a Region SSM (R-SSM) both with the single-direction scanning, aggregates multi-scale representations to enhance the context modeling ability. (2) The Direction Alternation Hierarchical Mamba Group (DA-HMG) allocates the isomeric single-direction scanning into cascading HMBs to enrich the spatial relationship modeling. Extensive experiments demonstrate the superiority of Hi-Mamba across five benchmark datasets for efficient SR. For example, Hi-Mamba achieves a significant PSNR improvement of 0.29 dB on Manga109 for $\times 3$ SR, compared to the strong lightweight MambaIR.

[CV-100] MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Link: https://arxiv.org/abs/2410.10139
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
Keywords: Interleaved multimodal comprehension, arbitrary sequences, produce and interpret, interpret both images, images and text
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code in this https URL.

[CV-101] TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

Link: https://arxiv.org/abs/2410.10133
Authors: Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou
Keywords: Scene Text Editing, Centred on content, image manipulation recently, text-driven image manipulation, Scene Text
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Centred on content modification and style preservation, Scene Text Editing (STE) remains a challenging task despite considerable progress in text-to-image synthesis and text-driven image manipulation recently. GAN-based STE methods generally encounter a common issue of model generalization, while Diffusion-based STE methods suffer from undesired style deviations. To address these problems, we propose TextCtrl, a diffusion-based method that edits text with prior guidance control. Our method consists of two key components: (i) By constructing fine-grained text style disentanglement and robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy. (ii) To further leverage the style prior, a Glyph-adaptive Mutual Self-attention mechanism is proposed which deconstructs the implicit fine-grained features of the source image to enhance style consistency and vision quality during inference. Furthermore, to fill the vacancy of the real-world STE evaluation benchmark, we create the first real-world image-pair dataset termed ScenePair for fair comparisons. Experiments demonstrate the effectiveness of TextCtrl compared with previous methods concerning both style fidelity and text accuracy.

[CV-102] MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Link: https://arxiv.org/abs/2410.10122
Authors: Yue Zhang, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou
Keywords: presents significant challenges, accurate lip-speech synchronization, dubbing presents significant, live video streaming, Achieving high-resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Achieving high-resolution, identity consistency, and accurate lip-speech synchronization in face visual dubbing presents significant challenges, particularly for real-time applications like live video streaming. We propose MuseTalk, which generates lip-sync targets in a latent space encoded by a Variational Autoencoder, enabling high-fidelity talking face video generation with efficient inference. Specifically, we project the occluded lower half of the face image and itself as a reference into a low-dimensional latent space and use a multi-scale U-Net to fuse audio and visual features at various levels. We further propose a novel sampling strategy during training, which selects reference images with head poses closely matching the target, allowing the model to focus on precise lip movement by filtering out redundant information. Additionally, we analyze the mechanism of lip-sync loss and reveal its relationship with input information volume. Extensive experiments show that MuseTalk consistently outperforms recent state-of-the-art methods in visual fidelity and achieves comparable lip-sync accuracy. As MuseTalk supports the online generation of face at 256x256 at more than 30 FPS with negligible starting latency, it paves the way for real-time applications.

[CV-103] Interaction-Guided Two-Branch Image Dehazing Network ACCV2024

Link: https://arxiv.org/abs/2410.10121
Authors: Huichun Liu, Xiaosong Li, Tianshu Tan
Keywords: restore clean images, Image dehazing aims, Image dehazing, aims to restore, restore clean
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACCV 2024

Abstract:Image dehazing aims to restore clean images from hazy ones. Convolutional Neural Networks (CNNs) and Transformers have demonstrated exceptional performance in local and global feature extraction, respectively, and currently represent the two mainstream frameworks in image dehazing. In this paper, we propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively. We reconsider the complementary characteristics of CNNs and Transformers by leveraging the differential relationships between global and local features for interactive guidance. This approach enables the capture of local feature positions through global attention maps, allowing the CNN to focus solely on feature information at effective positions. The single-branch Transformer design ensures the network’s global information recovery capability. Extensive experiments demonstrate that our proposed method yields competitive qualitative and quantitative evaluation performance on both synthetic and real public datasets. Codes are available at this https URL

[CV-104] StegaINR4MIH: steganography by implicit neural representation for multi-image hiding

Link: https://arxiv.org/abs/2410.10117
Authors: Weina Dong, Jia Liu, Lifeng Chen, Wenquan Sun, Xiaozhong Pan, Yan Ke
Keywords: secret images, images, multiple secret images, cover image, Multi-image hiding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 46 pages, 14 figures

Abstract:Multi-image hiding, which embeds multiple secret images into a cover image and is able to recover these images with high quality, has gradually become a research hotspot in the field of image steganography. However, due to the need to embed a large amount of data in a limited cover image space, issues such as contour shadowing or color distortion often arise, posing significant challenges for multi-image hiding. In this paper, we propose StegaINR4MIH, a novel implicit neural representation steganography framework that enables the hiding of multiple images within a single implicit representation function. In contrast to traditional methods that use multiple encoders to achieve multi-image embedding, our approach leverages the redundancy of implicit representation function parameters and employs magnitude-based weight selection and secret weight substitution on pre-trained cover image functions to effectively hide and independently extract multiple secret images. We conduct experiments on images with a resolution of from three different datasets: CelebA-HQ, COCO, and DIV2K. When hiding two secret images, the PSNR values of both the secret images and the stego images exceed 42. When hiding five secret images, the PSNR values of both the secret images and the stego images exceed 39. Extensive experiments demonstrate the superior performance of the proposed method in terms of visual quality and undetectability.
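
The results above are reported as PSNR values. For readers who want to relate those numbers to pixel error, here is a minimal sketch of the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. It assumes 8-bit intensities (peak value 255) and flat pixel sequences; the abstract does not state which peak value or color handling the authors used, and the function name `psnr` is illustrative.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixel intensities share the same range, with peak max_value.
    """
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under the 8-bit assumption, a PSNR of 42 dB corresponds to a root-mean-square error of roughly two intensity levels per pixel.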

[CV-105] Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models

Link: https://arxiv.org/abs/2410.10114
Authors: Jun Luo, Chen Chen, Shandong Wu
Keywords: diverse downstream tasks, demonstrated potent applicability, pre-trained Vision-Language Models, Prompt learning, downstream tasks
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures

Abstract:Prompt learning for pre-trained Vision-Language Models (VLMs) like CLIP has demonstrated potent applicability across diverse downstream tasks. This lightweight approach has quickly gained traction from federated learning (FL) researchers who seek to efficiently adapt VLMs to heterogeneous scenarios. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where the participating clients are generally only allowed to download a single globally aggregated model from the server. While justifiable for training full-sized models under federated settings, in this work, we argue that this paradigm is ill-suited for lightweight prompts. By facilitating the clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data on the client, benefiting from both local and downloaded non-local adaptive prompt experts. The non-local experts are sparsely selected from a server-maintained pool, fostering collaborative learning across clients. To evaluate the proposed algorithm, we conduct extensive experiments across 9 datasets under various heterogeneous federated settings. The results show that pFedMoAP consistently outperforms the state-of-the-art alternatives, underscoring its efficacy in personalizing prompt learning for CLIP within the federated learning paradigm.

[CV-106] Can We Predict Performance of Large Models across Vision-Language Tasks?

Link: https://arxiv.org/abs/2410.10112
Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
Keywords: Evaluating large vision-language, high computational costs, Evaluating large, large vision-language models, performance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Under Review. Project page: this https URL

Abstract:Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. In experiments, we systematically evaluate 108 LVLMs on 176 datasets from 36 benchmarks, constructing training and testing sets for validating our framework. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
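
The matrix-completion framing above can be sketched in a few lines. Note the swap: the paper fits probabilistic matrix factorization with MCMC, whereas this illustration fits a point estimate with plain stochastic gradient descent and L2 regularization, so it yields predictions but no uncertainty estimates. The names `factorize` and `predict`, and all hyperparameters, are assumptions made for the example.

```python
import random


def factorize(observed, n_models, n_tasks, rank=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Fit R[m][n] ~= dot(U[m], V[n]) using only the observed entries.

    observed: dict mapping (model_index, task_index) -> score.
    Point-estimate SGD stand-in for PMF; not the MCMC procedure in the paper.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_tasks)]
    entries = list(observed.items())
    for _ in range(steps):
        for (m, n), score in entries:
            err = score - sum(u * v for u, v in zip(U[m], V[n]))
            for r in range(rank):
                u, v = U[m][r], V[n][r]  # use pre-update values for both steps
                U[m][r] += lr * (err * v - reg * u)
                V[n][r] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, m, n):
    """Predicted score of model m on task n, observed or not."""
    return sum(u * v for u, v in zip(U[m], V[n]))
```

A typical use is to pass every measured (model, dataset) score as `observed` and call `predict` for the pairs that were never evaluated.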

[CV-107] High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity

Link: https://arxiv.org/abs/2410.10105
Authors: Qian Yu, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li, Lihe Zhang, Huchuan Lu
Keywords: balancing broad contextual, capturing intricate details, broad contextual awareness, high-resolution image segmentation, fine-grained image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract:In the realm of high-resolution (HR), fine-grained image segmentation, the primary challenge is balancing broad contextual awareness with the precision required for detailed object delineation, capturing intricate details and the finest edges of objects. Diffusion models, trained on vast datasets comprising billions of image-text pairs, such as SD V2.1, have revolutionized text-to-image synthesis by delivering exceptional quality, fine detail resolution, and strong contextual awareness, making them an attractive solution for high-resolution image segmentation. To this end, we propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models, specifically designed for high-resolution, fine-grained object segmentation. By leveraging the robust generalization capabilities and rich, versatile image representation prior of the SD models, coupled with a task-specific stable one-step denoising approach, we significantly reduce the inference time while preserving high-fidelity, detailed generation. Additionally, we introduce an auxiliary edge generation task to not only enhance the preservation of fine details of the object boundaries, but reconcile the probabilistic nature of diffusion with the deterministic demands of segmentation. With these refined strategies in place, DiffDIS serves as a rapid object mask generation model, specifically optimized for generating detailed binary maps at high resolutions, while demonstrating impressive accuracy and swift processing. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process. Our code will be made publicly available.

[CV-108] Innovative Deep Learning Techniques for Obstacle Recognition: A Comparative Study of Modern Detection Algorithms

Link: https://arxiv.org/abs/2410.10096
Authors: Santiago Pérez, Camila Gómez, Matías Rodríguez
Keywords: advanced YOLO models, advanced YOLO, YOLO models, study explores, explores a comprehensive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:This study explores a comprehensive approach to obstacle detection using advanced YOLO models, specifically YOLOv8, YOLOv7, YOLOv6, and YOLOv5. Leveraging deep learning techniques, the research focuses on the performance comparison of these models in real-time detection scenarios. The findings demonstrate that YOLOv8 achieves the highest accuracy with improved precision-recall metrics. Detailed training processes, algorithmic principles, and a range of experimental results are presented to validate the model’s effectiveness.

[CV-109] Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors ECCV2024

Link: https://arxiv.org/abs/2410.10091
Authors: Tao Lin,Lijia Yu,Gaojie Jin,Renjue Li,Peng Wu,Lijun Zhang
Keywords: deep neural networks, object detection systems, recent years, detection systems, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:In recent years, the study of adversarial robustness in object detection systems, particularly those based on deep neural networks (DNNs), has become a pivotal area of research. Traditional physical attacks targeting object detectors, such as adversarial patches and texture manipulations, directly manipulate the surface of the object. While these methods are effective, their overt manipulation of objects may draw attention in real-world applications. To address this, this paper introduces a more subtle approach: an inconspicuous adversarial trigger that operates outside the bounding boxes, rendering the object undetectable to the model. We further enhance this approach by proposing the Feature Guidance (FG) technique and the Universal Auto-PGD (UAPGD) optimization strategy for crafting high-quality triggers. The effectiveness of our method is validated through extensive empirical testing, demonstrating its high performance in both digital and physical environments. The code and video will be available at: this https URL.

[CV-110] The Ingredients for Robotic Diffusion Transformers

Link: https://arxiv.org/abs/2410.10088
Authors: Sudeep Dasari,Oier Mees,Sebastian Zhao,Mohan Kumar Srirama,Sergey Levine
Keywords: recent years roboticists, achieved remarkable progress, leveraging high capacity, high capacity Transformer, capacity Transformer network
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: this https URL

[CV-111] PointNet with KAN versus PointNet with MLP for 3D Classification and Segmentation of Point Sets

Link: https://arxiv.org/abs/2410.10084
Authors: Ali Kashefi
Keywords: traditional Multilayer Perceptrons, key components, Multilayer Perceptrons, neural network, segmentation tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce PointNet-KAN, a neural network for 3D point cloud classification and segmentation tasks, built upon two key components. First, it employs Kolmogorov-Arnold Networks (KANs) instead of traditional Multilayer Perceptrons (MLPs). Second, it retains the core principle of PointNet by using shared KAN layers and applying symmetric functions for global feature extraction, ensuring permutation invariance with respect to the input features. In traditional MLPs, the goal is to train the weights and biases with fixed activation functions; however, in KANs, the goal is to train the activation functions themselves. We use Jacobi polynomials to construct the KAN layers. We extensively evaluate PointNet-KAN across various polynomial degrees and special types such as the Lagrange, Chebyshev, and Gegenbauer polynomials. Our results show that PointNet-KAN achieves competitive performance compared to PointNet with MLPs on benchmark datasets for 3D object classification and segmentation, despite employing a shallower and simpler network architecture. We hope this work serves as a foundation and provides guidance for integrating KANs, as an alternative to MLPs, into more advanced point cloud processing architectures.
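The permutation-invariance argument in this abstract can be illustrated with a minimal sketch: a shared polynomial layer is applied to every point independently, and a symmetric max over points yields the global feature. This is not the authors' implementation. The Chebyshev basis (a special case of the Jacobi family named in the abstract), the `tanh` input squashing, the coefficient layout, and all function names are assumptions made for illustration.

```python
import math
import random


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via the three-term recurrence."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def shared_kan_layer(point, coeffs):
    """Apply one KAN-style layer to a single point.

    coeffs[j][i][n] is the trainable coefficient of T_n for input
    feature i and output feature j (hypothetical layout).
    """
    degree = len(coeffs[0][0]) - 1
    # tanh maps each input into [-1, 1], the natural domain of the basis.
    bases = [chebyshev_basis(math.tanh(x), degree) for x in point]
    return [
        sum(c * b for row, basis in zip(out_coeffs, bases) for c, b in zip(row, basis))
        for out_coeffs in coeffs
    ]


def global_feature(points, coeffs):
    """Run the shared layer on every point, then take a symmetric max over points."""
    per_point = [shared_kan_layer(p, coeffs) for p in points]
    return [max(column) for column in zip(*per_point)]


random.seed(0)
degree, in_dim, out_dim = 3, 3, 4
coeffs = [
    [[random.uniform(-1, 1) for _ in range(degree + 1)] for _ in range(in_dim)]
    for _ in range(out_dim)
]
cloud = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(8)]
shuffled = cloud[:]
random.shuffle(shuffled)

# Reordering the points leaves the global feature unchanged.
print(global_feature(cloud, coeffs) == global_feature(shuffled, coeffs))
```

Because the same coefficients are used for every point and `max` ignores ordering, any permutation of the input cloud produces the same global feature.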

[CV-112] Learning to Customize Text-to-Image Diffusion In Diverse Context

Link: https://arxiv.org/abs/2410.10058
Authors: Taewook Kim,Wei Chen,Qiang Qiu
Keywords: techniques fine-tune models, customization techniques fine-tune, techniques fine-tune, captured in minimal, fine-tune models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts solely within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.

[CV-113] DINTR: Tracking via Diffusion-based Interpolation NEURIPS2024

Link: https://arxiv.org/abs/2410.10053
Authors: Pha Nguyen,Ngan Le,Jackson Cothren,Alper Yilmaz,Khoa Luu
Keywords: computer vision, requiring the localization, Object tracking, object tracking task, tracking task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024

Abstract:Object tracking is a fundamental task in computer vision, requiring the localization of objects of interest across video frames. Diffusion models have shown remarkable capabilities in visual generation, making them well-suited for addressing several requirements of the tracking problem. This work proposes a novel diffusion-based methodology to formulate the tracking task. Firstly, their conditional process allows for injecting indications of the target object into the generation process. Secondly, diffusion mechanics can be developed to inherently model temporal correspondences, enabling the reconstruction of actual frames in video. However, existing diffusion models rely on extensive and unnecessary mapping to a Gaussian noise domain, which can be replaced by a more efficient and stable interpolation process. Our proposed interpolation mechanism draws inspiration from classic image-processing techniques, offering a more interpretable, stable, and faster approach tailored specifically for the object tracking task. By leveraging the strengths of diffusion models while circumventing their limitations, our Diffusion-based INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.

[CV-114] ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing

Link: https://arxiv.org/abs/2410.10047
Authors: Yuduo Wang,Weikang Yu,Michael Kopp,Pedram Ghamisi
Keywords: Change Detection, Change Captioning, Remote Sensing, Recent advancements, advancements in Remote
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in Remote Sensing (RS) for Change Detection (CD) and Change Captioning (CC) have seen substantial success by adopting deep learning techniques. Despite these advances, existing methods often handle CD and CC tasks independently, leading to inefficiencies from the absence of synergistic processing. In this paper, we present ChangeMinds, a novel unified multi-task framework that concurrently optimizes CD and CC processes within a single, end-to-end model. We propose the change-aware long short-term memory module (ChangeLSTM) to effectively capture complex spatiotemporal dynamics from extracted bi-temporal deep features, enabling the generation of universal change-aware representations that effectively serve both CC and CD tasks. Furthermore, we introduce a multi-task predictor with a cross-attention mechanism that enhances the interaction between image and text features, promoting efficient simultaneous learning and processing for both tasks. Extensive evaluations on the LEVIR-MCI dataset, alongside other standard benchmarks, show that ChangeMinds surpasses existing methods in multi-task learning settings and markedly improves performance in individual CD and CC tasks. Codes and pre-trained models will be available online.

[CV-115] GALA: Geometry-Aware Local Adaptive Grids for Detailed 3D Generation

Link: https://arxiv.org/abs/2410.10037
Authors: Dingdong Yang,Yizhi Wang,Konrad Schindler,Ali Mahdavi Amiri,Hao Zhang
Keywords: reproducing complex geometry, diffusion-based schemes, excels at capturing, computationally efficient, generative modelling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose GALA, a novel representation of 3D shapes that (i) excels at capturing and reproducing complex geometry and surface details, (ii) is computationally efficient, and (iii) lends itself to 3D generative modelling with modern, diffusion-based schemes. The key idea of GALA is to exploit both the global sparsity of surfaces within a 3D volume and their local surface properties. Sparsity is promoted by covering only the 3D object boundaries, not empty space, with an ensemble of tree root voxels. Each voxel contains an octree to further limit storage and compute to regions that contain surfaces. Adaptivity is achieved by fitting one local and geometry-aware coordinate frame in each non-empty leaf node. Adjusting the orientation of the local grid, as well as the anisotropic scales of its axes, to the local surface shape greatly increases the amount of detail that can be stored in a given amount of memory, which in turn allows for quantization without loss of quality. With our optimized C++/CUDA implementation, GALA can be fitted to an object in less than 10 seconds. Moreover, the representation can efficiently be flattened and manipulated with transformer networks. We provide a cascaded generation pipeline capable of generating 3D shapes with great geometric detail.

[CV-116] TULIP: Token-length Upgraded CLIP

Link: https://arxiv.org/abs/2410.10034
Authors: Ivona Najdenkoska,Mohammad Mahdi Derakhshani,Yuki M. Asano,Nanne van Noord,Marcel Worring,Cees G. M. Snoek
Keywords: representing long captions, address the challenge, challenge of representing, representing long
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on tasks requiring longer descriptions. Although recent work has attempted to overcome this limit, their proposed approaches struggle to model token relationships over longer distances and simply extend to a fixed new token length. Instead, we propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models. We do so by improving the architecture with relative position encodings, followed by a training procedure that (i) distills the original CLIP text encoder into an encoder with relative position encodings and (ii) enhances the model for aligning longer captions with images. By effectively encoding captions longer than the default 77 tokens, our model outperforms baselines on cross-modal tasks such as retrieval and text-to-image generation.

[CV-117] REPeat: A Real2Sim2Real Approach for Pre-acquisition of Soft Food Items in Robot-assisted Feeding

Link: https://arxiv.org/abs/2410.10017
Authors: Nayoung Ha,Ruolin Ye,Ziang Liu,Shubhangi Sinha,Tapomayukh Bhattacharjee
Keywords: paper presents REPeat, enhance bite acquisition, bite acquisition, presents REPeat, framework designed
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:The paper presents REPeat, a Real2Sim2Real framework designed to enhance bite acquisition in robot-assisted feeding for soft foods. It uses ‘pre-acquisition actions’ such as pushing, cutting, and flipping to improve the success rate of bite acquisition actions such as skewering, scooping, and twirling. If the data-driven model predicts low success for direct bite acquisition, the system initiates a Real2Sim phase, reconstructing the food’s geometry in a simulation. The robot explores various pre-acquisition actions in the simulation, then a Sim2Real step renders a photorealistic image to reassess success rates. If the success improves, the robot applies the action in reality. We evaluate the system on 15 diverse plates with 10 types of food items for a soft food diet, showing improvement in bite acquisition success rates by 27% on average across all plates. See our project website at this https URL.

[CV-118] NARAIM: Native Aspect Ratio Autoregressive Image Models NEURIPS

Link: https://arxiv.org/abs/2410.10012
Authors: Daniel Gallo Fernández,Robert van der Klis,Rǎzvan-Andrei Matişan,Janusz Partyka,Efstratios Gavves,Samuele Papa,Phillip Lippe
Keywords: solve a wide, wide variety, variety of computer, pre-training method, scaling laws
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS, see this https URL

Abstract:While vision transformers are able to solve a wide variety of computer vision tasks, no pre-training method has yet demonstrated the same scaling laws as observed in language models. Autoregressive models show promising results, but are commonly trained on images that are cropped or transformed into square images, which distorts or destroys information present in the input. To overcome this limitation, we propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio. By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model’s ability to interpret visual information. In our experiments, we show that maintaining the aspect ratio improves performance on a downstream classification task.

[CV-119] InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling

Link: https://arxiv.org/abs/2410.10010
Authors: Muhammad Gohar Javed,Chuan Guo,Li Cheng,Xingyu Li
Keywords: textual descriptions remains, challenging task, textual descriptions, descriptions remains, remains a challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL

Abstract:Generating realistic 3D human-human interactions from textual descriptions remains a challenging task. Existing approaches, typically based on diffusion models, often generate unnatural and unrealistic results. In this work, we introduce InterMask, a novel framework for generating human interactions using collaborative masked modeling in discrete space. InterMask first employs a VQ-VAE to transform each motion sequence into a 2D discrete motion token map. Unlike traditional 1D VQ token maps, it better preserves fine-grained spatio-temporal details and promotes spatial awareness within each token. Building on this representation, InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals. This is achieved by employing a transformer architecture specifically designed to capture complex spatio-temporal interdependencies. During training, it randomly masks the motion tokens of both individuals and learns to predict them. In inference, starting from fully masked sequences, it progressively fills in the tokens for both individuals. With its enhanced motion representation, dedicated architecture, and effective learning strategy, InterMask achieves state-of-the-art results, producing high-fidelity and diverse human interactions. It outperforms previous methods, achieving an FID of 5.154 (vs 5.535 for in2IN) on the InterHuman dataset and 0.399 (vs 5.207 for InterGen) on the InterX dataset. Additionally, InterMask seamlessly supports reaction generation without the need for model redesign or fine-tuning.
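The inference procedure described above (start fully masked, then progressively fill in tokens) can be sketched generically as follows. This is an illustrative assumption, not InterMask's actual decoder: the cosine unmasking schedule, the confidence-based commit rule, the single 1D token sequence (the paper uses 2D token maps for two individuals), and the `predict` callback are all hypothetical stand-ins.

```python
import math


def masked_counts(num_tokens, num_steps):
    """Number of tokens still masked after each step (assumed cosine schedule)."""
    return [
        math.floor(num_tokens * math.cos(math.pi / 2 * step / num_steps))
        for step in range(1, num_steps + 1)
    ]


def iterative_decode(num_tokens, num_steps, predict):
    """Start fully masked; at each step commit the most confident predictions.

    predict(tokens) is a hypothetical model call that returns, for every
    position, a (token_id, confidence) pair.
    """
    tokens = [None] * num_tokens  # None marks a masked position
    for keep_masked in masked_counts(num_tokens, num_steps):
        predictions = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        # Most confident masked positions come first.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = predictions[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in for a trained transformer: token id equals position index."""
    return [(i, 1.0 / (i + 1)) for i in range(len(tokens))]


decoded = iterative_decode(num_tokens=12, num_steps=4, predict=toy_predict)
print(decoded)
```

With 12 tokens and 4 steps, the assumed schedule leaves 11, 8, 4, and finally 0 positions masked, so the sequence is complete after the last step.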

[CV-120] Leveraging Customer Feedback for Multi-modal Insight Extraction NAACL2024

Link: https://arxiv.org/abs/2410.09999
Authors: Sandeep Sricharan Mukku,Abinesh Kanagarajan,Pushpendu Ghosh,Chetan Aggarwal
Keywords: Businesses can benefit, customer feedback, products and services, enhance their products, multi-modal customer feedback
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: NAACL 2024

Abstract:Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by 14 points in F1 score.

[CV-121] SlimSeiz: Efficient Channel-Adaptive Seizure Prediction Using a Mamba-Enhanced Network

Link: https://arxiv.org/abs/2410.09998
Authors: Guorui Lu,Jing Peng,Bingyuan Huang,Chang Gao,Todor Stefanov,Yong Hao,Qinyu Chen
Keywords: abnormal brain activity, Epileptic seizures, brain activity, lead to accidents, abnormal brain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Abstract:Epileptic seizures cause abnormal brain activity, and their unpredictability can lead to accidents, underscoring the need for long-term seizure prediction. Although seizures can be predicted by analyzing electroencephalogram (EEG) signals, existing methods often require too many electrode channels or larger models, limiting mobile usability. This paper introduces a SlimSeiz framework that utilizes adaptive channel selection with a lightweight neural network model. SlimSeiz operates in two stages: the first stage selects the optimal channel set for seizure prediction using machine learning algorithms, and the second stage employs a lightweight neural network based on convolution and Mamba for prediction. On the Children’s Hospital Boston-MIT (CHB-MIT) EEG dataset, SlimSeiz can reduce channels from 22 to 8 while achieving a satisfactory result of 94.8% accuracy, 95.5% sensitivity, and 94.0% specificity with only 21.2K model parameters, matching or outperforming larger models’ performance. We also validate SlimSeiz on a new EEG dataset, SRH-LEI, collected from Shanghai Renji Hospital, demonstrating its effectiveness across different patients. The code and SRH-LEI dataset are available at this https URL.
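For readers less familiar with the three metrics reported here, the sketch below shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix. The counts are hypothetical values chosen for illustration and are not taken from the paper; the function name is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # all correct predictions
        "sensitivity": tp / (tp + fn),     # recall on the positive (pre-seizure) class
        "specificity": tn / (tn + fp),     # recall on the negative class
    }


# Hypothetical counts, not results from the paper.
metrics = binary_metrics(tp=90, fp=15, tn=85, fn=10)
print(metrics)
```

Sensitivity measures how many true pre-seizure segments are caught, while specificity measures how many normal segments are correctly left alone, so a prediction system needs both to be high to avoid missed events and false alarms.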

[CV-122] Facial Width-to-Height Ratio Does Not Predict Self-Reported Behavioral Tendencies

Link: https://arxiv.org/abs/2410.09979
Authors: Michal Kosinski
Keywords: violent behavioral tendencies, growing number, antisocial or violent, behavioral tendencies, violent behavioral
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Psychological Science (2017)

Abstract:A growing number of studies have linked facial width-to-height ratio (fWHR) with various antisocial or violent behavioral tendencies. However, those studies have predominantly been laboratory based and low powered. This work reexamined the links between fWHR and behavioral tendencies in a large sample of 137,163 participants. Behavioral tendencies were measured using 55 well-established psychometric scales, including self-report scales measuring intelligence, domains and facets of the five-factor model of personality, impulsiveness, sense of fairness, sensational interests, self-monitoring, impression management, and satisfaction with life. The findings revealed that fWHR is not substantially linked with any of these self-reported measures of behavioral tendencies, calling into question whether the links between fWHR and behavior generalize beyond the small samples and specific experimental settings that have been used in past fWHR research.

[CV-123] Optimizing Waste Management with Advanced Object Detection for Garbage Classification

Link: https://arxiv.org/abs/2410.09975
Authors: Everest Z. Kuang,Kushal Raj Bhandari,Jianxi Gao
Keywords: significant environmental challenges, persistent global issues, pose significant environmental, Garbage production
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Garbage production and littering are persistent global issues that pose significant environmental challenges. Despite large-scale efforts to manage waste through collection and sorting, existing approaches remain inefficient, leading to inadequate recycling and disposal. Therefore, developing advanced AI-based systems is a less labor-intensive approach for addressing the growing waste problem more effectively. These models can be applied to sorting systems or possibly waste collection robots that may be produced in the future. AI models have improved significantly at identifying objects through object detection. This paper reviews the implementation of AI models for classifying trash through object detection, specifically focusing on the use of YOLO V5 for training and testing. The study demonstrates how YOLO V5 can effectively identify various types of waste, including plastic, paper, glass, metal, cardboard, and biodegradables.

[CV-124] Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions

Link: https://arxiv.org/abs/2410.09972
Authors: Kyungmin Kim,JB Lanier,Pierre Baldi,Charless Fowlkes,Roy Fox
Keywords: Model-Based Reinforcement Learning, Recent advancements, Model-Based Reinforcement, Reinforcement Learning, advancements in Model-Based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Recent advancements in Model-Based Reinforcement Learning (MBRL) have made it a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to train MBRL agents with generalizable perception. Training in the presence of visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on DREAMER, a popular MBRL method, we propose a simple yet effective auxiliary task to facilitate representation learning in distracting environments. Under the assumption that task-relevant components of image observations are straightforward to identify with prior knowledge in a given task, we use a segmentation mask on image observations to only reconstruct task-relevant components. In doing so, we greatly reduce the complexity of representation learning by removing the need to encode task-irrelevant objects in the latent representation. Our method, Segmentation Dreamer (SD), can be used either with ground-truth masks easily accessible in simulation or by leveraging potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the reconstruction loss to avoid providing misleading learning signals due to mask prediction errors. In modified DeepMind Control suite (DMC) and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work. We find that SD is especially helpful in sparse reward tasks otherwise unsolvable by prior work, enabling the training of visually robust agents without the need for extensive reward engineering.
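One simple reading of "only reconstruct task-relevant components" is a reconstruction loss restricted to pixels inside the segmentation mask. The sketch below illustrates that reading with nested lists standing in for grayscale images; it is an assumption for illustration, not the Segmentation Dreamer objective itself, and the function name and mean-squared-error choice are hypothetical.

```python
def masked_reconstruction_loss(predicted, target, mask):
    """Mean squared error computed only over pixels where mask is 1."""
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row, m_row in zip(predicted, target, mask)
        for p, t, m in zip(p_row, t_row, m_row)
        if m
    ]
    # An empty mask contributes no learning signal.
    return sum(squared_errors) / len(squared_errors) if squared_errors else 0.0


target = [[0.0, 1.0], [1.0, 0.0]]
predicted = [[0.5, 1.0], [0.0, 0.9]]
mask = [[0, 1], [1, 0]]  # 1 marks task-relevant pixels (hypothetical mask)

loss = masked_reconstruction_loss(predicted, target, mask)
print(loss)
```

Errors on the two unmasked pixels are ignored entirely, so the model in this toy setting is never penalized for failing to reproduce distractor content.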

[CV-125] Improving 3D Few-Shot Segmentation with Inference-Time Pseudo-Labeling

Link: https://arxiv.org/abs/2410.09967
Authors: Mohammad Mozafari,Hosein Hasani,Reza Vahidimajd,Mohamadreza Fereydooni,Mahdieh Soleymani Baghshah
Keywords: offering remarkable adaptability, limited annotated data, medical imaging analysis, recent years, models have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, few-shot segmentation (FSS) models have emerged as a promising approach in medical imaging analysis, offering remarkable adaptability to segment novel classes with limited annotated data. Existing approaches to few-shot segmentation have often overlooked the potential of the query itself, failing to fully utilize the valuable information it contains. However, treating the query as unlabeled data provides an opportunity to enhance prediction accuracy. Specifically in the domain of medical imaging, the volumetric structure of queries offers a considerable source of valuable information that can be used to improve the target slice segmentation. In this work, we present a novel strategy to efficiently leverage the intrinsic information of the query sample for final segmentation during inference. First, we use the support slices from a reference volume to generate an initial segmentation score for the query slices through a prototypical approach. Subsequently, we apply a confidence-aware pseudo-labeling procedure to transfer the most informative parts of query slices to the support set. The final prediction is performed based on the new expanded support set, enabling the prediction of a more accurate segmentation mask for the query volume. Extensive experiments show that the proposed method can effectively boost performance across diverse settings and datasets.

[CV-126] LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Link: https://arxiv.org/abs/2410.09962
Authors: Han Qiu,Jiaxing Huang,Peng Gao,Qin Qi,Xiaoqin Zhang,Ling Shao,Shijian Lu
Keywords: large language models, multimodal large language, LLM evaluators, generate textual responses, language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hallucination, a phenomenon where multimodal large language models (MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become one major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, by either raising discriminative questions about the existence of objects or introducing LLM evaluators to score the generated text from MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object/image descriptions and multi-round conversations with 14/130 words and 189 words, respectively, on average. It introduces two new tasks, hallucination discrimination and hallucination completion, unifying both discriminative and generative evaluations in a single multiple-choice-question form and leading to more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges when they are handling hallucinations with long and complex textual data. Dataset and evaluation code are available at this https URL.

[CV-127] EITNet: An IoT-Enhanced Framework for Real-Time Basketball Action Recognition

Link: https://arxiv.org/abs/2410.09954
Authors: Jingyu Liu,Xinyu Liu,Mingzhe Qu,Tianyi Lyu
Keywords: Integrating IoT technology, basketball action recognition, Integrating IoT, providing crucial insights, basketball action
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: pages

Abstract:Integrating IoT technology into basketball action recognition enhances sports analytics, providing crucial insights into player performance and game strategy. However, existing methods often fall short in terms of accuracy and efficiency, particularly in complex, real-time environments where player movements are frequently occluded or involve intricate interactions. To overcome these challenges, we propose the EITNet model, a deep learning framework that combines EfficientDet for object detection, I3D for spatiotemporal feature extraction, and TimeSformer for temporal analysis, all integrated with IoT technology for seamless real-time data collection and processing. Our contributions include developing a robust architecture that improves recognition accuracy to 92%, surpassing the baseline EfficientDet model’s 87%, and reducing loss to below 5.0 compared to EfficientDet’s 9.0 over 50 epochs. Furthermore, the integration of IoT technology enhances real-time data processing, providing adaptive insights into player performance and strategy. The paper details the design and implementation of EITNet, experimental validation, and a comprehensive evaluation against existing models. The results demonstrate EITNet’s potential to significantly advance automated sports analysis and optimize data utilization for player performance and strategy improvement.

[CV-128] The Roles of Contextual Semantic Relevance Metrics in Human Visual Processing

Link: https://arxiv.org/abs/2410.09921
Authors: Kun Sun,Rong Wang
Keywords: visual, processing, metrics, visual perception, Semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic relevance metrics can capture both the inherent semantics of individual objects and their relationships to other elements within a visual scene. Numerous previous studies have demonstrated that these metrics can influence human visual processing. However, these studies often did not fully account for contextual information or employ the recent deep learning models for more accurate computation. This study investigates human visual perception and processing by introducing the metrics of contextual semantic relevance. We evaluate semantic relationships between target objects and their surroundings from both vision-based and language-based perspectives. Testing a large eye-movement dataset from visual comprehension, we employ state-of-the-art deep learning techniques to compute these metrics and analyze their impacts on fixation measures on human visual processing through advanced statistical models. These metrics could also simulate top-down and bottom-up processing in visual perception. This study further integrates vision-based and language-based metrics into a novel combined metric, addressing a critical gap in previous research that often treated visual and semantic similarities separately. Results indicate that all metrics could precisely predict fixation measures in visual perception and processing, but with distinct roles in prediction. The combined metric outperforms other metrics, supporting theories that emphasize the interaction between semantic and visual information in shaping visual perception/processing. This finding aligns with growing recognition of the importance of multi-modal information processing in human cognition. These insights enhance our understanding of cognitive mechanisms underlying visual processing and have implications for developing more accurate computational models in fields such as cognitive science and human-computer interaction.

[CV-129] Stratified Domain Adaptation: A Progressive Self-Training Approach for Scene Text Recognition

Link: https://arxiv.org/abs/2410.09913
Authors: Kha Nhat Le, Hoang-Tuan Nguyen, Hung Tien Tran, Thanh Duc Ngo
Keywords-EN: Unsupervised domain adaptation, scene text recognition, testing data reside, Unsupervised domain, Stratified Domain Adaptation
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures, 5 tables, including supplementary materials

Click to view abstract

Abstract:Unsupervised domain adaptation (UDA) has become increasingly prevalent in scene text recognition (STR), especially where training and testing data reside in different domains. The efficacy of existing UDA approaches tends to degrade when there is a large gap between the source and target domains. To deal with this problem, gradually shifting or progressively learning to shift from domain to domain is the key issue. In this paper, we introduce the Stratified Domain Adaptation (StrDA) approach, which examines the gradual escalation of the domain gap for the learning process. The objective is to partition the training data into subsets so that the progressively self-trained model can adapt to gradual changes. We stratify the training data by evaluating the proximity of each data sample to both the source and target domains. We propose a novel method for employing domain discriminators to estimate the out-of-distribution and domain discriminative levels of data samples. Extensive experiments on benchmark scene-text datasets show that our approach significantly improves the performance of baseline (source-trained) STR models.
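
The stratification idea in this abstract can be pictured with a small sketch. It assumes every training sample already has a scalar score from a domain discriminator, with lower scores meaning closer to the source domain. That score convention, the name `stratify_by_domain_gap`, and the near-equal split are illustrative assumptions, not details taken from the paper.

```python
def stratify_by_domain_gap(samples, scores, num_subsets):
    """Split samples into num_subsets groups ordered from source-like to target-like.

    scores[i] is a hypothetical domain-gap score for samples[i]
    (assumed here: lower means closer to the source domain).
    """
    if num_subsets < 1:
        raise ValueError("num_subsets must be at least 1")
    # Sort sample indices by increasing domain-gap score.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    n = len(order)
    # Near-equal contiguous slices; this always yields exactly num_subsets groups.
    bounds = [i * n // num_subsets for i in range(num_subsets + 1)]
    return [
        [samples[i] for i in order[bounds[j]:bounds[j + 1]]]
        for j in range(num_subsets)
    ]


images = ["a", "b", "c", "d", "e"]
gap_scores = [0.9, 0.1, 0.5, 0.3, 0.7]
subsets = stratify_by_domain_gap(images, gap_scores, num_subsets=2)
# A progressive self-training loop would visit subsets[0], then subsets[1], and so on.
```

Ordering the subsets by score is what lets a self-trained model meet the domain gap gradually instead of all at once.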

[CV-130] Combining Generative and Geometry Priors for Wide-Angle Portrait Correction ECCV

Link: https://arxiv.org/abs/2410.09911
Authors: Lan Yao, Chaofeng Chen, Xiaoming Li, Zifei Yan, Wangmeng Zuo
Keywords-EN: Wide-angle lens distortion, aesthetically pleasing images, portrait photography presents, Wide-angle lens, pleasing images
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision (ECCV) 2024

Click to view abstract

Abstract:Wide-angle lens distortion in portrait photography presents a significant challenge for capturing photo-realistic and aesthetically pleasing images. Such distortions are especially noticeable in facial regions. In this work, we propose encapsulating the generative face prior as a guided natural manifold to facilitate the correction of facial regions. Moreover, a notable central symmetry relationship exists in the non-face background, yet it has not been explored in the correction process. This geometry prior motivates us to introduce a novel constraint to explicitly enforce symmetry throughout the correction process, thereby contributing to a more visually appealing and natural correction in the non-face region. Experiments demonstrate that our approach outperforms previous methods by a large margin, excelling not only in quantitative measures such as line straightness and shape consistency metrics but also in terms of perceptual visual quality. All the code and models are available at this https URL.

[CV-131] UnSeg: One Universal Unlearnable Example Generator is Enough against All Image Segmentation NEURIPS2024

Link: https://arxiv.org/abs/2410.09909
Authors: Ye Sun, Hao Zhang, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
Keywords-EN: semantically meaningful segments, crucial vision task, real-world scenes, Image segmentation, crucial vision
Category: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: NeurIPS 2024

Click to view abstract

Abstract:Image segmentation is a crucial vision task that groups pixels within an image into semantically meaningful segments, which is pivotal in obtaining a fine-grained understanding of real-world scenes. However, an increasing privacy concern exists regarding training large-scale image segmentation models on unauthorized private data. In this work, we exploit the concept of unlearnable examples to make images unusable to model training by generating and adding unlearnable noise into the original images. Particularly, we propose a novel Unlearnable Segmentation (UnSeg) framework to train a universal unlearnable noise generator that is capable of transforming any downstream images into their unlearnable version. The unlearnable noise generator is finetuned from the Segment Anything Model (SAM) via bilevel optimization on an interactive segmentation dataset towards minimizing the training error of a surrogate model that shares the same architecture with SAM but is trained from scratch. We empirically verify the effectiveness of UnSeg across 6 mainstream image segmentation tasks, 10 widely used datasets, and 7 different network architectures, and show that the unlearnable images can reduce the segmentation performance by a large margin. Our work provides useful insights into how to leverage foundation models in a data-efficient and computationally affordable manner to protect images against image segmentation models.

[CV-132] Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning

Link: https://arxiv.org/abs/2410.09908
Authors: Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Keywords-EN: Foundation models, techniques like Low-Rank, Foundation, RPE, Retrieval-based Parameter Ensemble
Category: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
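
A rough sketch of retrieving and combining stored adaptations follows. The abstract only says that LoRAs are kept in a vectorized database and retrieved for new tasks. The cosine similarity, the softmax weighting over the top-k matches, the flat `delta` lists standing in for LoRA parameters, and the name `retrieve_and_ensemble` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_and_ensemble(query, database, k=2):
    """Return (merged_delta, weights) built from the k entries most similar to query.

    Each database entry is a dict with a "key" vector (a hypothetical feature
    summary of the task it was tuned on) and a "delta" list (stand-in for
    LoRA parameters).
    """
    # Rank stored adapters by similarity between their key and the query.
    scored = sorted(
        ((cosine(query, entry["key"]), entry) for entry in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Softmax over the top-k similarities gives the mixing weights.
    top = max(score for score, _ in scored)
    exps = [math.exp(score - top) for score, _ in scored]
    total = sum(exps)
    weights = [value / total for value in exps]
    # Weighted sum of the retrieved parameter deltas.
    dim = len(scored[0][1]["delta"])
    merged = [
        sum(w * entry["delta"][i] for w, (_, entry) in zip(weights, scored))
        for i in range(dim)
    ]
    return merged, weights


lora_db = [
    {"key": [1.0, 0.0], "delta": [2.0, 0.0]},
    {"key": [0.0, 1.0], "delta": [0.0, 4.0]},
    {"key": [-1.0, -1.0], "delta": [9.0, 9.0]},
]
merged_delta, mix = retrieve_and_ensemble([1.0, 1.0], lora_db, k=2)
```

No gradient step or label appears anywhere in this sketch, which matches the zero-shot setting the abstract describes: the new task only contributes a query vector.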

[CV-133] Multi class activity classification in videos using Motion History Image generation

Link: https://arxiv.org/abs/2410.09902
Authors: Senthilkumar Gopal
Keywords-EN: Human action recognition, multiple fields ranging, Human action, topic of interest, interest across multiple
Category: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 9 images

Click to view abstract

Abstract:Human action recognition has been a topic of interest across multiple fields ranging from security to entertainment systems. Tracking motion and identifying the action being performed in real time is necessary for critical security systems. In entertainment, especially gaming, the need for immediate responses to actions and gestures is paramount for the success of the system. We show that the Motion History Image (MHI) is a well-established framework for capturing temporal and activity information in multi-dimensional detail, enabling various use cases including classification. We utilize MHI to produce sample data to train a classifier and demonstrate its effectiveness for action classification across six different activities in a single multi-action video. We analyze the classifier performance, identify use cases where MHI struggles to generate the appropriate activity image, and discuss mechanisms and future work to overcome those limitations.
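
For readers unfamiliar with Motion History Images, here is a minimal sketch of the usual update rule: pixels that moved in the latest frame are set to a maximum value `tau`, and all other pixels decay by one per frame. The frame-differencing threshold, the value of `tau`, and the function names are illustrative assumptions; the abstract does not state the paper's exact settings.

```python
def motion_mask(prev, curr, threshold):
    """Binary mask: 1 where the absolute frame difference exceeds threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def update_mhi(mhi, mask, tau):
    """Set moving pixels to tau; decay all other pixels by 1, floored at 0."""
    return [
        [tau if moved else max(0, value - 1) for value, moved in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(mhi, mask)
    ]


def build_mhi(frames, tau=3, threshold=10):
    """Fold a list of grayscale frames (2D lists) into one motion history image."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, motion_mask(prev, curr, threshold), tau)
    return mhi


# Toy clip: a bright pixel moves left to right across a 1x4 frame.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
    [[0, 0, 0, 255]],
]
mhi = build_mhi(frames)  # → [[1, 2, 3, 3]]
```

Older motion ends up darker than recent motion, so a single image encodes both where and when movement happened. That intensity ramp is what a downstream classifier can learn from.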

[CV-134] Large-Scale 3D Medical Image Pre-training with Geometric Context Priors CVPR2024

Link: https://arxiv.org/abs/2410.09890
Authors: Linshan Wu, Jiaxin Zhuang, Hao Chen
Keywords-EN: medical image analysis, poses a significant, medical, image analysis, medical images
Category: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2024 Extension

Click to view abstract

Abstract:The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Codes at this https URL.

[CV-135] Block-to-Scene Pre-training for Point Cloud Hybrid-Domain Masked Autoencoders

Link: https://arxiv.org/abs/2410.09886
Authors: Yaohua Zha, Tao Dai, Yanzi Wang, Hang Guo, Taolin Zhang, Zhihao Ouyang, Chunlin Fan, Bin Chen, Ke Chen, Shu-Tao Xia
Keywords-EN: Point clouds, domain point clouds, point clouds based, Point, object
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds based on the modeled content. Masked autoencoders (MAE) have become the mainstream paradigm in point clouds self-supervised learning. However, existing MAE-based methods are domain-specific, limiting the model’s generalization. In this paper, we propose to pre-train a general Point cloud Hybrid-Domain Masked AutoEncoder (PointHDMAE) via a block-to-scene pre-training strategy. We first propose a hybrid-domain masked autoencoder consisting of an encoder and decoder belonging to the scene domain and object domain, respectively. The object domain encoder specializes in handling object point clouds and multiple shared object encoders assist the scene domain encoder in analyzing the scene point clouds. Furthermore, we propose a block-to-scene strategy to pre-train our hybrid-domain model. Specifically, we first randomly select point blocks within a scene and apply a set of transformations to convert each point block coordinates from the scene space to the object space. Then, we employ an object-level mask and reconstruction pipeline to recover the masked points of each block, enabling the object encoder to learn a universal object representation. Finally, we introduce a scene-level block position regression pipeline, which utilizes the blocks’ features in the object space to regress these blocks’ initial positions within the scene space, facilitating the learning of scene representations. Extensive experiments across different datasets and tasks demonstrate the generalization and superiority of our hybrid-domain model.

[CV-136] Occluded Human Pose Estimation based on Limb Joint Augmentation

Link: https://arxiv.org/abs/2410.09885
Authors: Gangtao Han, Chunxiao Song, Song Wang, Hao Wang, Enqing Chen, Guanghui Wang
Keywords-EN: Human pose estimation, pose estimation aims, pose estimation model, pose estimation, occluded human pose
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NCAA

Click to view abstract

Abstract:Human pose estimation aims at locating the specific joints of humans from the images or videos. While existing deep learning-based methods have achieved high positioning accuracy, they often struggle with generalization in occlusion scenarios. In this paper, we propose an occluded human pose estimation framework based on limb joint augmentation to enhance the generalization ability of the pose estimation model on the occluded human bodies. Specifically, the occlusion blocks are at first employed to randomly cover the limb joints of the human bodies from the training images, imitating the scene where the objects or other people partially occlude the human body. Trained by the augmented samples, the pose estimation model is encouraged to accurately locate the occluded keypoints based on the visible ones. To further enhance the localization ability of the model, this paper constructs a dynamic structure loss function based on limb graphs to explore the distribution of occluded joints by evaluating the dependence between adjacent joints. Extensive experimental evaluations on two occluded datasets, OCHuman and CrowdPose, demonstrate significant performance improvements without additional computation cost during inference.
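
The augmentation step can be sketched as follows: each limb joint is covered by a square block with some probability, so the model has to infer occluded keypoints from the visible ones. The block size, the probability, the constant fill value, and the name `occlude_limb_joints` are illustrative assumptions, not the paper's settings.

```python
import random


def occlude_limb_joints(image, limb_joints, block_size=3, prob=0.5, fill=0, rng=None):
    """Return (augmented_image, covered_joints) without modifying the input image.

    image is a 2D list of pixel values; limb_joints is a list of (x, y) pixel
    coordinates. Each joint is covered independently with probability prob.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the original stays intact
    half = block_size // 2
    covered = []
    for x, y in limb_joints:
        if rng.random() >= prob:
            continue  # leave this joint visible
        # Paint a block centred on the joint, clipped to the image bounds.
        for r in range(max(0, y - half), min(height, y + half + 1)):
            for c in range(max(0, x - half), min(width, x + half + 1)):
                out[r][c] = fill
        covered.append((x, y))
    return out, covered


image = [[1] * 5 for _ in range(5)]
joints = [(0, 0), (4, 4)]  # hypothetical wrist and ankle positions
augmented, hidden = occlude_limb_joints(image, joints, prob=1.0, rng=random.Random(0))
```

Because the blocks are applied only to training images, this kind of augmentation adds nothing to inference cost, which is consistent with the claim in the abstract.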

[CV-137] Improving Colorectal Cancer Screening and Risk Assessment through Predictive Modeling on Medical Images and Records

Link: https://arxiv.org/abs/2410.09880
Authors: Shuai Jiang, Christina Robinson, Joseph Anderson, William Hisey, Lynn Butterly, Arief Suriawinata, Saeed Hassanpour
Keywords-EN: CRC risk, remove colon polyps, Multi-Society Task Force, future CRC risk, CRC risk prediction
Category: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Colonoscopy screening is an effective method to find and remove colon polyps before they can develop into colorectal cancer (CRC). Current follow-up recommendations, as outlined by the U.S. Multi-Society Task Force for individuals found to have polyps, primarily rely on histopathological characteristics, neglecting other significant CRC risk factors. Moreover, the considerable variability in colorectal polyp characterization among pathologists poses challenges in effective colonoscopy follow-up or surveillance. The evolution of digital pathology and recent advancements in deep learning provide a unique opportunity to investigate the added benefits of including the additional medical record information and automatic processing of pathology slides using computer vision techniques in the calculation of future CRC risk. Leveraging the New Hampshire Colonoscopy Registry’s extensive dataset, many with longitudinal colonoscopy follow-up information, we adapted our recently developed transformer-based model for histopathology image analysis in 5-year CRC risk prediction. Additionally, we investigated various multimodal fusion techniques, combining medical record information with deep learning-derived risk estimates. Our findings reveal that training a transformer model to predict intermediate clinical variables contributes to enhancing 5-year CRC risk prediction performance, with an AUC of 0.630 compared to direct prediction. Furthermore, the fusion of imaging and non-imaging features, while not requiring manual inspection of microscopy images, demonstrates improved predictive capabilities for 5-year CRC risk compared to variables extracted from colonoscopy procedure and microscopy findings. This study signifies the potential of integrating diverse data sources and advanced computational techniques in transforming the accuracy and effectiveness of future CRC risk assessments.

[CV-138] TextMaster: Universal Controllable Text Edit

Link: https://arxiv.org/abs/2410.09879
Authors: Aoqiang Wang, Jian Wang, Zhenyu Yan, Wenxiang Shang, Ran Lin, Zhao Zhang
Keywords-EN: material resource costs, significantly reduce human, text, resource costs, capabilities can significantly
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In image editing tasks, high-quality text editing capabilities can significantly reduce human and material resource costs. Current methods rely heavily on training data based on OCR text segment detection, where the text is tightly aligned with the mask area. This reliance creates a strong dependency on the mask area and lacks modules for adjusting text spacing and size in various scenarios. When the amount of text to be edited does not match the modification area or when the mask area is too large, significant issues may arise. Furthermore, no existing methods have explored controllable style transfer for text editing. To address these challenges, we propose TextMaster, a solution capable of accurately editing text with high realism and proper layout in any scenario and image area. Our approach employs adaptive standard letter spacing as guidance during training and uses adaptive mask boosting to prevent the leakage of text position and size information. We also utilize an attention mechanism to calculate the bounding box regression loss for each character, making text layout methods learnable across different scenarios. By injecting high-resolution standard font information and applying perceptual loss in the text editing area, we further enhance text rendering accuracy and fidelity. Additionally, we achieve style consistency between the modified and target text through a novel style injection method. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms all existing approaches.

[CV-139] ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification

Link: https://arxiv.org/abs/2410.09875
Authors: Chen Mao, Chong Tan, Jingqi Hu, Min Zheng
Keywords-EN: Person re-identification, personnel counting, field of security, plays a vital, safety inspections
Category: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Person re-identification (ReID), as a crucial technology in the field of security, plays a vital role in safety inspections, personnel counting, and more. Most current ReID approaches primarily extract features from images, which are easily affected by objective conditions such as clothing changes and occlusions. In addition to cameras, we leverage widely available routers as sensing devices by capturing gait information from pedestrians through the Channel State Information (CSI) in WiFi signals and contribute a multimodal dataset. We employ a two-stream network to separately process video understanding and signal analysis tasks, and conduct multi-modal fusion and contrastive learning on pedestrian video and WiFi data. Extensive experiments in real-world scenarios demonstrate that our method effectively uncovers the correlations between heterogeneous data, bridges the gap between visual and signal modalities, significantly expands the sensing range, and improves ReID accuracy across multiple sensors.

[CV-140] Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy NEURIPS2024

Link: https://arxiv.org/abs/2410.09873
Authors: Hancheng Ye, Jiakang Yuan, Renqiu Xia, Xiangchao Yan, Tao Chen, Junchi Yan, Botian Shi, Bo Zhang
Keywords-EN: recently achieved great, achieved great success, Diffusion models, noise prediction steps, video diffusion models
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024, Homepage: this https URL The code is available at this https URL

Click to view abstract

Abstract:Diffusion models have recently achieved great success in the synthesis of high-quality images and videos. However, the existing denoising techniques in diffusion models are commonly based on step-by-step noise predictions, which suffer from high computational cost, resulting in a prohibitive latency for interactive applications. In this paper, we propose AdaptiveDiffusion to relieve this bottleneck by adaptively reducing the noise prediction steps during the denoising process. Our method considers the potential of skipping as many noise prediction steps as possible while keeping the final denoised results identical to the original full-step ones. Specifically, the skipping strategy is guided by the third-order latent difference that indicates the stability between timesteps during the denoising process, which benefits the reuse of previous noise prediction results. Extensive experiments on image and video diffusion models demonstrate that our method can significantly speed up the denoising process while generating identical results to the original process, achieving up to an average 2~5x speedup without quality degradation.
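
To make the "third-order latent difference" concrete, the sketch below takes the last four latents, differences them three times, and compares the result with the most recent first-order change. The abstract does not give the actual criterion, so the relative ratio, the L1 norm, the threshold value, and the names `third_order_ratio` and `should_skip` are assumptions for illustration only.

```python
def diff(vectors):
    """Element-wise difference between each consecutive pair of vectors."""
    return [[b - a for a, b in zip(u, v)] for u, v in zip(vectors, vectors[1:])]


def l1(vector):
    """L1 norm of a flat vector."""
    return sum(abs(x) for x in vector)


def third_order_ratio(latents):
    """Ratio of the third-order difference to the latest first-order difference.

    latents holds exactly four consecutive latent vectors, oldest first.
    Returns 0.0 when the latest first-order difference is zero.
    """
    first = diff(latents)   # three vectors
    second = diff(first)    # two vectors
    third = diff(second)    # one vector
    denom = l1(first[-1])
    return l1(third[0]) / denom if denom else 0.0


def should_skip(latents, threshold=0.01):
    """Hypothetical rule: reuse the previous noise prediction when the trajectory is smooth."""
    if len(latents) < 4:
        return False  # not enough history to form a third-order difference
    return third_order_ratio(latents[-4:]) < threshold


smooth = [[0.0], [1.0], [2.0], [3.0]]  # constant step: third difference is zero
jagged = [[0.0], [1.0], [0.0], [1.0]]  # oscillating: third difference is large
```

A small ratio means the latents are changing in a predictable way, which is the situation where reusing an earlier noise prediction is least likely to alter the final result.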

[CV-141] Towards Reproducible Learning-based Compression

Link: https://arxiv.org/abs/2410.09872
Authors: Jiahao Pang, Muhammad Asad Lodhi, Junghyun Ahn, Yuning Huang, Dong Tian
Keywords-EN: software implementation details, system typically suffers, learning system typically, deep learning, deep learning system
Category: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted at MMSP 2024

Click to view abstract

Abstract:A deep learning system typically suffers from a lack of reproducibility that is partially rooted in hardware or software implementation details. The irreproducibility leads to skepticism in deep learning technologies and it can hinder them from being deployed in many applications. In this work, the irreproducibility issue is analyzed where deep learning is employed in compression systems while the encoding and decoding may be run on devices from different manufacturers. The decoding process can even crash due to a single bit difference, e.g., in a learning-based entropy coder. For a given deep learning-based module with limited resources for protection, we first suggest that reproducibility can only be assured when the mismatches are bounded. Then a safeguarding mechanism is proposed to tackle the challenges. The proposed method may be applied for different levels of protection either at the reconstruction level or at a selected decoding level. Furthermore, the overhead introduced for the protection can be scaled down accordingly when the error bound is being suppressed. Experiments demonstrate the effectiveness of the proposed approach for learning-based compression systems, e.g., in image compression and point cloud compression.

[CV-142] Two-Stage Human Verification using HandCAPTCHA and Anti-Spoofed Finger Biometrics with Feature Selection

Link: https://arxiv.org/abs/2410.09866
Authors: Asish Bera, Debotosh Bhattacharjee, Hubert P H Shum
Keywords-EN: human verification scheme, enhance security, paper presents, presents a human, overcome the vulnerabilities
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper presents a human verification scheme in two independent stages to overcome the vulnerabilities of attacks and to enhance security. At the first stage, a hand image-based CAPTCHA (HandCAPTCHA) is tested to avert automated bot-attacks on the subsequent biometric stage. In the next stage, finger biometric verification of a legitimate user is performed with presentation attack detection (PAD) using the real hand images of the person who has passed a random HandCAPTCHA challenge. The electronic screen-based PAD is tested using image quality metrics. After this spoofing detection, geometric features are extracted from the four fingers (excluding the thumb) of real users. A modified forward-backward (M-FoBa) algorithm is devised to select relevant features for biometric authentication. The experiments are performed on the Bogazici University (BU) and the IIT-Delhi (IITD) hand databases using the k-nearest neighbor and random forest classifiers. The average accuracy of the correct HandCAPTCHA solution is 98.5%, and the false accept rate of a bot is 1.23%. The PAD is tested on 255 subjects of BU, and the best average error is 0%. The finger biometric identification accuracy of 98% and an equal error rate (EER) of 6.5% have been achieved for 500 subjects of the BU. For 200 subjects of the IITD, 99.5% identification accuracy, and 5.18% EER are obtained.

[CV-143] SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Link: https://arxiv.org/abs/2410.09865
Authors: Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Siyang Song, Muhammad Haris Khan, Weicheng Xie, Linlin Shen, Zongyuan Ge
Keywords-EN: datasets remain limited, expression datasets remain, Facial expression datasets, Facial expression, synthetic data
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Experiment results validate the efficacy of the proposed approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the original size. Our code will be made publicly available.

[CV-144] AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior

Link: https://arxiv.org/abs/2410.09864
Authors: Guoqiang Liang, Qingnan Fan, Bingtao Fu, Jinwei Chen, Hong Gu, Lin Wang
Keywords-EN: Blind face restoration, Blind face, computer vision, fundamental and challenging, challenging problem
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments: Codes and datasets will be available at this https URL

Click to view abstract

Abstract:Blind face restoration (BFR) is a fundamental and challenging problem in computer vision. To faithfully restore high-quality (HQ) photos from poor-quality ones, recent research endeavors predominantly rely on facial image priors from the powerful pretrained text-to-image (T2I) diffusion models. However, such priors often lead to the incorrect generation of non-facial features and insufficient facial details, thus rendering them less practical for real-world applications. In this paper, we propose a novel framework, namely AuthFace that achieves highly authentic face restoration results by exploring a face-oriented generative diffusion prior. To learn such a prior, we first collect a dataset of 1.5K high-quality images, with resolutions exceeding 8K, captured by professional photographers. Based on the dataset, we then introduce a novel face-oriented restoration-tuning pipeline that fine-tunes a pretrained T2I model. Identifying key criteria of quality-first and photography-guided annotation, we involve the retouching and reviewing process under the guidance of photographers for high-quality images that show rich facial features. The photography-guided annotation system fully explores the potential of these high-quality photographic images. In this way, the potent natural image priors from pretrained T2I diffusion models can be subtly harnessed, specifically enhancing their capability in facial detail restoration. Moreover, to minimize artifacts in critical facial areas, such as eyes and mouth, we propose a time-aware latent facial feature loss to learn the authentic face restoration process. Extensive experiments on the synthetic and real-world BFR datasets demonstrate the superiority of our approach.

[CV-145] Point Cloud Novelty Detection Based on Latent Representations of a General Feature Extractor

Link: https://arxiv.org/abs/2410.09861
Authors: Shizuka Akahori, Satoshi Iizuka, Ken Mawatari, Kazuhiro Fukui
Keywords-EN: general feature extractor, point cloud, point cloud dataset, point cloud feature, general point cloud
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We propose an effective unsupervised 3D point cloud novelty detection approach, leveraging a general point cloud feature extractor and a one-class classifier. The general feature extractor consists of a graph-based autoencoder and is trained once on a point cloud dataset such as a mathematically generated fractal 3D point cloud dataset that is independent of normal/abnormal categories. The input point clouds are first converted into latent vectors by the general feature extractor, and then one-class classification is performed on the latent vectors. Compared to existing methods measuring the reconstruction error in 3D coordinate space, our approach utilizes latent representations where the shape information is condensed, which allows more direct and effective novelty detection. We confirm that our general feature extractor can extract shape features of unseen categories, eliminating the need for autoencoder re-training and reducing the computational burden. We validate the performance of our method through experiments on several subsets of the ShapeNet dataset and demonstrate that our latent-based approach outperforms the existing methods.

[CV-146] Human Identification using Selected Features from Finger Geometric Profiles

Link: https://arxiv.org/abs/2410.09856
Authors: Asish Bera, Debotosh Bhattacharjee
Keywords-EN: finger biometric system, hand contour image, hand contour, biometric system, unconstrained environment
Category: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:A finger biometric system in an unconstrained environment is presented in this paper. A technique for hand image normalization is implemented at the preprocessing stage that decomposes the main hand contour into finger-level shape representation. This normalization technique subtracts the transformed binary image from the binary hand contour image to generate the left side of finger profiles (LSFP). Then, XOR is applied to the LSFP image and the hand contour image to produce the right side of finger profiles (RSFP). During feature extraction, thirty geometric features are initially computed from every normalized finger. A rank-based forward-backward greedy algorithm is then used to select relevant features and to enhance classification accuracy. Two different subsets of features, containing nine and twelve discriminative features per finger, are selected for two separate experiments that use kNN and Random Forest (RF) for classification on the Bosphorus hand database. The experiments with the selected features of four fingers, excluding the thumb, obtained improved performance compared to features extracted from five fingers and to other existing methods evaluated on the Bosphorus database. The best identification accuracies of 96.56% and 95.92% using the RF classifier have been achieved for the right- and left-hand images of 638 subjects, respectively. An equal error rate of 0.078 is obtained for both types of hand images.

[CV-147] Text4Seg: Reimagining Image Segmentation as Text Generation

Link: https://arxiv.org/abs/2410.09855
Authors: Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
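
To make the row-wise run-length idea concrete, here is a minimal Python sketch. The abstract does not specify the textual format, so the `label*count` notation, the ` | ` row separator, and the function names are all assumptions made for illustration, not the authors' encoding.

```python
from itertools import groupby


def encode_rowwise(grid, sep=" | "):
    """Run-length encode each row of patch labels independently.

    Runs never continue across a row boundary, which is what makes the
    scheme "row-wise". The 'label*count' text format is an assumption.
    """
    rows = []
    for row in grid:
        runs = [(label, len(list(run))) for label, run in groupby(row)]
        rows.append(", ".join(f"{label}*{count}" for label, count in runs))
    return sep.join(rows)


def decode_rowwise(text, sep=" | "):
    """Invert encode_rowwise; labels must not contain ', ' or the separator."""
    grid = []
    for row in text.split(sep):
        labels = []
        for item in row.split(", "):
            label, count = item.rsplit("*", 1)
            labels.extend([label] * int(count))
        grid.append(labels)
    return grid


# A toy 4x4 grid of per-patch labels (the paper uses 16x16).
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "dog", "dog", "sky"],
    ["grass", "dog", "dog", "grass"],
    ["grass", "grass", "grass", "grass"],
]
encoded = encode_rowwise(grid)
print(encoded)
```

Rows dominated by a single label collapse to one token group, which is where the length reduction comes from; how much is saved depends on the image content.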

[CV-148] Understanding Robustness of Parameter-Efficient Tuning for Image Classification

Link: https://arxiv.org/abs/2410.09845
Authors: Jiacheng Ruan, Xian Gao, Suncheng Xiang, Mingye Xie, Ting Liu, Yuzhuo Fu
Keywords: Parameter-efficient tuning, PET techniques, PET, PET methods, model predictions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures. Work in Progress

Abstract:Parameter-efficient tuning (PET) techniques calibrate the model’s predictions on downstream tasks by freezing the pre-trained models and introducing a small number of learnable parameters. However, despite the numerous PET methods proposed, their robustness has not been thoroughly investigated. In this paper, we systematically explore the robustness of four classical PET techniques (i.e., VPT, Adapter, AdaptFormer, and LoRA) under both white-box attacks and information perturbations. For white-box attack scenarios, we first analyze the performance of PET techniques using FGSM and PGD attacks. Subsequently, we further explore the transferability of adversarial samples and the impact of learnable parameter quantities on the robustness of PET methods. Under information perturbation attacks, we introduce four distinct perturbation strategies, including Patch-wise Drop, Pixel-wise Drop, Patch Shuffle, and Gaussian Noise, to comprehensively assess the robustness of these PET techniques in the presence of information loss. Via these extensive studies, we enhance the understanding of the robustness of PET methods, providing valuable insights for improving their performance in computer vision applications. The code is available at this https URL.
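
For readers unfamiliar with FGSM, the white-box attack mentioned above, the following is a minimal sketch on a toy logistic-regression classifier rather than on a PET-tuned vision model. The weights, input, and `eps` value are made up for illustration; only the update rule, `x + eps * sign(grad_x loss)`, is the standard FGSM step.

```python
import math


def predict(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_logistic(x, y, w, b, eps):
    """One FGSM step: move x by eps in the sign of the loss gradient.

    For binary cross-entropy on a logistic model, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


# Toy model and input (illustrative values, not from the paper).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_adv = fgsm_logistic(x, y, w, b, eps=0.3)
print(predict(x, w, b), predict(x_adv, w, b))
```

PGD, the other attack named in the abstract, repeats a step of this kind several times with a smaller step size and projects back into the `eps`-ball after each step.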

[CV-149] Fusion Based Hand Geometry Recognition Using Dempster-Shafer Theory

Link: https://arxiv.org/abs/2410.09842
Authors: Asish Bera, Debotosh Bhattacharjee, Mita Nasipuri
Keywords: hand geometric features, person recognition based, pose restrictions, paper presents, technique for person
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper presents a new technique for person recognition based on the fusion of hand geometric features of both hands without any pose restrictions. All the features are extracted from normalized left and right hand images. Fusion is applied at feature level and also at decision level. Two probability-based algorithms are proposed for classification. The first algorithm computes the maximum probability for the nearest three neighbors. The second algorithm determines the maximum probability of the number of matched features with respect to a thresholding on distances. Based on these two highest probabilities, initial decisions are made. The final decision is made according to the highest probability as calculated by the Dempster-Shafer theory of evidence. Depending on the various combinations of the initial decisions, three schemes are evaluated on 201 subjects for identification and verification. The correct identification rate is found to be 99.5%, and a False Acceptance Rate (FAR) of 0.625% is found during verification.
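
The decision-level fusion step relies on Dempster's rule of combination, which can be sketched as follows. The two mass functions below are invented numbers standing in for the outputs of the two classifiers; how the paper actually derives its masses is not stated in the abstract.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to a mass, and the
    masses sum to 1. Products landing on an empty intersection count as
    conflict and are normalised away.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {h: v / (1.0 - conflict) for h, v in combined.items()}


# Hypothetical evidence about two enrolled subjects, A and B.
A, B = frozenset({"A"}), frozenset({"B"})
EITHER = A | B  # mass left on "either subject" expresses ignorance

m_neighbors = {A: 0.7, EITHER: 0.3}        # e.g. from the 3-nearest-neighbor vote
m_matches = {A: 0.6, B: 0.3, EITHER: 0.1}  # e.g. from the matched-feature count

fused = dempster_combine(m_neighbors, m_matches)
decision = max(fused, key=fused.get)
print(decision, fused[decision])
```

In this example the conflicting mass is 0.21, so the surviving masses are divided by 0.79 and subject A ends up with about 0.848.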

[CV-150] Toward Defining an Efficient and Expandable File Format for AI-Generated Contents

Link: https://arxiv.org/abs/2410.09834
Authors: Yixin Gao, Runsen Feng, Xin Li, Weiping Li, Zhibo Chen
Keywords: powerful creation capability, gained significant traction, significant traction due, AIGC images, AI-generated content
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Recently, AI-generated content (AIGC) has gained significant traction due to its powerful creation capability. However, the storage and transmission of large amounts of high-quality AIGC images inevitably pose new challenges for recent file formats. To overcome this, we define a new file format for AIGC images, named AIGIF, enabling ultra-low bitrate coding of AIGC images. Unlike compressing AIGC images intuitively with pixel-wise space as existing file formats, AIGIF instead compresses the generation syntax. This raises a crucial question: Which generation syntax elements, e.g., text prompt, device configuration, etc, are necessary for compression/transmission? To answer this question, we systematically investigate the effects of three essential factors: platform, generative model, and data configuration. We experimentally find that a well-designed composable bitstream structure incorporating the above three factors can achieve an impressive compression ratio of even up to 1/10,000 while still ensuring high fidelity. We also introduce an expandable syntax in AIGIF to support the extension of the most advanced generation models to be developed in the future.
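
A rough feel for why storing the generation syntax is so much smaller than storing pixels can be had from the sketch below. It is not the AIGIF bitstream: the field names, the model identifier, and the use of JSON plus zlib are all assumptions for illustration, and the comparison is against raw uncompressed RGB rather than against any existing codec.

```python
import json
import zlib


def pack_syntax(syntax):
    """Serialise generation settings compactly and deflate them."""
    compact = json.dumps(syntax, sort_keys=True, separators=(",", ":"))
    return zlib.compress(compact.encode("utf-8"), 9)


def unpack_syntax(payload):
    """Recover the generation settings from a packed payload."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Hypothetical generation record; every field here is illustrative.
syntax = {
    "model": "example-diffusion-v1",
    "prompt": "a red fox in snow, golden hour",
    "seed": 1234,
    "steps": 30,
    "guidance": 7.5,
    "size": [1024, 1024],
}

payload = pack_syntax(syntax)
raw_pixel_bytes = 1024 * 1024 * 3  # 8-bit RGB, uncompressed
ratio = len(payload) / raw_pixel_bytes
print(len(payload), ratio)
```

The catch, and the reason the abstract studies platform, generative model, and data configuration, is that the image can only be regenerated faithfully if the decoder reproduces the same model and numerical behaviour as the encoder.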

[CV-151] LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond ACCV2024

Link: https://arxiv.org/abs/2410.09831
Authors: Md Tanvir Islam, Inzamamul Alam, Simon S. Woo, Saeed Anwar, IK Hyun Lee, Khan Muhammad
Keywords: computer vision tasks, numerous computer vision, LLIE, essential for numerous, numerous computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted by the Asian Conference on Computer Vision (ACCV 2024)

Abstract:Low-light image enhancement (LLIE) is essential for numerous computer vision tasks, including object detection, tracking, segmentation, and scene understanding. Despite substantial research on improving low-quality images captured in underexposed conditions, clear vision remains critical for autonomous vehicles, which often struggle with low-light scenarios, signifying the need for continuous research. However, paired datasets for LLIE are scarce, particularly for street scenes, limiting the development of robust LLIE methods. Despite using advanced transformers and/or diffusion-based models, current LLIE methods struggle in real-world low-light conditions and lack training on street-scene datasets, limiting their effectiveness for autonomous vehicles. To bridge these gaps, we introduce a new dataset LoLI-Street (Low-Light Images of Streets) with 33k paired low-light and well-exposed images from street scenes in developed cities, covering 19k object classes for object detection. LoLI-Street dataset also features 1,000 real low-light test images for testing LLIE models under real-life conditions. Furthermore, we propose a transformer and diffusion-based LLIE model named “TriFuse”. Leveraging the LoLI-Street dataset, we train and evaluate our TriFuse and SOTA models to benchmark on our dataset. Comparing various models, our dataset’s generalization feasibility is evident in testing across different mainstream datasets by significantly enhancing images and object detection for practical applications in autonomous driving and surveillance systems. The complete code and dataset is available on this https URL.

[CV-152] DAS3D: Dual-modality Anomaly Synthesis for 3D Anomaly Detection

Link: https://arxiv.org/abs/2410.09821
Authors: Kecen Li, Bingquan Dai, Jingjing Fu, Xinwen Hou
Keywords: Synthesizing anomaly samples, Synthesizing anomaly, industrial anomaly detection, strategy for self-supervised, samples has proven
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Synthesizing anomaly samples has proven to be an effective strategy for self-supervised 2D industrial anomaly detection. However, this approach has been rarely explored in multi-modality anomaly detection, particularly involving 3D and RGB images. In this paper, we propose a novel dual-modality augmentation method for 3D anomaly synthesis, which is simple and capable of mimicking the characteristics of 3D defects. Incorporating with our anomaly synthesis method, we introduce a reconstruction-based discriminative anomaly detection network, in which a dual-modal discriminator is employed to fuse the original and reconstructed embedding of two modalities for anomaly detection. Additionally, we design an augmentation dropout mechanism to enhance the generalizability of the discriminator. Extensive experiments show that our method outperforms the state-of-the-art methods on detection precision and achieves competitive segmentation performance on both MVTec 3D-AD and Eyescandies datasets.

[CV-153] TopOC: Topological Deep Learning for Ovarian and Breast Cancer Diagnosis

Link: https://arxiv.org/abs/2410.09818
Authors: Saba Fatema, Brighton Nuwagira, Sayoni Chakraborty, Reyhan Gedik, Baris Coskunuzer
Keywords: classifying cancerous lesions, Microscopic examination, deep learning methods, cancerous lesions, experienced pathologists
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Topology (math.AT)

Abstract:Microscopic examination of slides prepared from tissue samples is the primary tool for detecting and classifying cancerous lesions, a process that is time-consuming and requires the expertise of experienced pathologists. Recent advances in deep learning methods hold significant potential to enhance medical diagnostics and treatment planning by improving accuracy, reproducibility, and speed, thereby reducing clinicians’ workloads and turnaround times. However, the necessity for vast amounts of labeled data to train these models remains a major obstacle to the development of effective clinical decision support systems. In this paper, we propose the integration of topological deep learning methods to enhance the accuracy and robustness of existing histopathological image analysis models. Topological data analysis (TDA) offers a unique approach by extracting essential information through the evaluation of topological patterns across different color channels. While deep learning methods capture local information from images, TDA features provide complementary global features. Our experiments on publicly available histopathological datasets demonstrate that the inclusion of topological features significantly improves the differentiation of tumor types in ovarian and breast cancers.
Journal reference: MICCAI TGI3 2024. Related DOI: https://doi.org/10.1007/978-3-031-73967-5_3

[CV-154] EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models ECCV2024

Link: https://arxiv.org/abs/2410.09802
Authors: Eungbean Lee, Somi Jeong, Kwanghoon Sohn
Keywords: Exemplar-guided image translation, attracting attention due, enhance user control, synthesizing photo-realistic images, Exemplar-guided image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV 2024

Abstract:Exemplar-guided image translation, synthesizing photo-realistic images that conform to both structural control and style exemplars, is attracting attention due to its ability to enhance user control over style manipulation. Previous methodologies have predominantly depended on establishing dense correspondences across cross-domain inputs. Despite these efforts, they incur quadratic memory and computational costs for establishing dense correspondence, resulting in limited versatility and performance degradation. In this paper, we propose a novel approach termed Exemplar-guided Image Translation with Brownian-Bridge Diffusion Models (EBDM). Our method formulates the task as a stochastic Brownian bridge process, a diffusion process with a fixed initial point as structure control and translates into the corresponding photo-realistic image while being conditioned solely on the given exemplar image. To efficiently guide the diffusion process toward the style of exemplar, we delineate three pivotal components: the Global Encoder, the Exemplar Network, and the Exemplar Attention Module to incorporate global and detailed texture information from exemplar images. Leveraging Bridge diffusion, the network can translate images from structure control while exclusively conditioned on the exemplar style, leading to more robust training and inference processes. We illustrate the superiority of our method over competing approaches through comprehensive benchmark evaluations and visual results.
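
As background for the "Brownian bridge process" the abstract refers to, the sketch below computes the marginal of a textbook Brownian bridge pinned at both ends. It uses scalars instead of images, and the paper's actual noise schedule is not given in the abstract, so this should be read as the standard definition only, not as EBDM's forward process.

```python
import math
import random


def bridge_marginal(x_start, x_end, t, T=1.0):
    """Mean and variance at time t of a standard Brownian bridge.

    The process is pinned to x_start at t=0 and x_end at t=T. The mean
    interpolates linearly and the variance t*(T-t)/T vanishes at both ends.
    """
    mean = x_start + (t / T) * (x_end - x_start)
    var = t * (T - t) / T
    return mean, var


def sample_bridge(x_start, x_end, t, T=1.0, rng=random):
    """Draw one sample of the bridge at time t."""
    mean, var = bridge_marginal(x_start, x_end, t, T)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


# Both endpoints are fixed, so noise is largest midway between them.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, bridge_marginal(0.0, 2.0, t))
```

Unlike a standard diffusion process, which ends in pure noise, a bridge has a deterministic value at both ends, which is what lets one endpoint carry the structure control.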

[CV-155] Task Adaptive Feature Distribution Based Network for Few-shot Fine-grained Target Classification

Link: https://arxiv.org/abs/2410.09797
Authors: Ping Li, Hongbo Wang, Lei Lu
Keywords: Metric-based few-shot fine-grained, few-shot fine-grained classification, shown promise due, Metric-based few-shot, simplicity and efficiency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, conference

Abstract:Metric-based few-shot fine-grained classification has shown promise due to its simplicity and efficiency. However, existing methods often overlook task-level special cases and struggle with accurate category description and irrelevant sample information. To tackle these, we propose TAFD-Net: a task adaptive feature distribution network. It features a task-adaptive component for embedding to capture task-level nuances, an asymmetric metric for calculating feature distribution similarities between query samples and support categories, and a contrastive measure strategy to boost performance. Extensive experiments have been conducted on three datasets and the experimental results show that our proposed algorithm outperforms recent incremental learning algorithms.

[CV-156] Intermediate Representations for Enhanced Text-To-Image Generation Using Diffusion Models

Link: https://arxiv.org/abs/2410.09792
Authors: Ran Galun, Sagie Benaim
Keywords: demonstrated an impressive, impressive ability, produce high-quality outputs, diffusion models, ability to produce
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image diffusion models have demonstrated an impressive ability to produce high-quality outputs. However, they often struggle to accurately follow fine-grained spatial information in an input text. To this end, we propose a compositional approach for text-to-image generation based on two stages. In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations (such as depth or segmentation maps) conditioned on text. In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model. Our findings indicate that such compositional approach can improve image generation, resulting in a notable improvement in FID score and a comparable CLIP score, when compared to the standard non-compositional baseline.

[CV-157] DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios ACCV2024

Link: https://arxiv.org/abs/2410.09788
Authors: Siyi Jiao, Wenzheng Zeng, Changxin Gao, Nong Sang
Keywords: portrait matting refers, soft portrait, Interactive portrait matting, refers to extracting, extracting the soft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACCV 2024

Abstract:Interactive portrait matting refers to extracting the soft portrait from a given image that best meets the user’s intent through their inputs. Existing methods often underperform in complex scenarios, mainly due to three factors. (1) Most works apply a tightly coupled network that directly predicts matting results, lacking interpretability and resulting in inadequate modeling. (2) Existing works are limited to a single type of user input, which is ineffective for intention understanding and also inefficient for user operation. (3) The multi-round characteristics have been under-explored, which is crucial for user interaction. To alleviate these limitations, we propose DFIMat, a decoupled framework that enables flexible interactive matting. Specifically, we first decouple the task into 2 sub-ones: localizing target instances by understanding scene semantics and the flexible user inputs, and conducting refinement for instance-level matting. We observe a clear performance gain from decoupling, as it makes sub-tasks easier to learn, and the flexible multi-type input further enhances both effectiveness and efficiency. DFIMat also considers the multi-round interaction property, where a contrastive reasoning module is designed to enhance cross-round refinement. Another limitation for multi-person matting task is the lack of training data. We address this by introducing a new synthetic data generation pipeline that can generate much more realistic samples than previous arts. A new large-scale dataset SMPMat is subsequently established. Experiments verify the significant superiority of DFIMat. With it, we also investigate the roles of different input types, providing valuable principles for users. Our code and dataset can be found at this https URL.

[CV-158] ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos EMNLP2024

Link: https://arxiv.org/abs/2410.09776
Authors: Arpan Phukan, Manish Gupta, Asif Ekbal
Keywords: Previous studies, focused on generating, common objects, objects and attributes, Previous
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted in EMNLP 2024, this https URL

Abstract:Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending “People Also Ask” questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, there does not exist a large-scale dataset for this task. Most video question generation datasets are on TV shows, movies, or human activities or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a combination of cross-entropy and contrastive loss function to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available at this https URL.

[CV-159] Magnituder Layers for Implicit Neural Representations in 3D

Link: https://arxiv.org/abs/2410.09771
Authors: Sang Min Kim (1), Byeongchan Kim (1), Arijit Sehanobish (2), Krzysztof Choromanski (3 and 4), Dongseok Shim (1), Avinava Dubey (5), Min-hwan Oh (1) ((1) Seoul National University, (2) Independent Researcher, (3) Google DeepMind, (4) Columbia University, (5) Google Research)
Keywords: Signed Distance Fields, Neural Radiance Fields, Radiance Fields, Distance Fields, Signed Distance
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Improving the efficiency and performance of implicit neural representations in 3D, particularly Neural Radiance Fields (NeRF) and Signed Distance Fields (SDF) is crucial for enabling their use in real-time applications. These models, while capable of generating photo-realistic novel views and detailed 3D reconstructions, often suffer from high computational costs and slow inference times. To address this, we introduce a novel neural network layer called the “magnituder”, designed to reduce the number of training parameters in these models without sacrificing their expressive power. By integrating magnituders into standard feed-forward layer stacks, we achieve improved inference speed and adaptability. Furthermore, our approach enables a zero-shot performance boost in trained implicit neural representation models through layer-wise knowledge transfer without backpropagation, leading to more efficient scene reconstruction in dynamic environments.

[CV-160] Compressing Scene Dynamics: A Generative Approach

Link: https://arxiv.org/abs/2410.09768
Authors: Shanzhi Yin, Zihan Zhang, Bolin Chen, Shiqi Wang, Yan Ye
Keywords: learn generative priors, motion priors, paper proposes, proposes to learn, generative video compression
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to DCC2025

Abstract:This paper proposes to learn generative priors from the motion patterns instead of video contents for generative video compression. The priors are derived from small motion dynamics in common scenes such as swinging trees in the wind and floating boat on the sea. Utilizing such compact motion priors, a novel generative scene dynamics compression framework is built to realize ultra-low bit-rate communication and high-quality reconstruction for diverse scene contents. At the encoder side, motion priors are characterized into compact representations in a dense-to-sparse manner. At the decoder side, the decoded motion priors serve as the trajectory hints for scene dynamics reconstruction via a diffusion-based flow-driven generator. The experimental results illustrate that the proposed method can achieve superior rate-distortion performance and outperform the state-of-the-art conventional video codec Versatile Video Coding (VVC) on scene dynamics sequences. The project page can be found at this https URL.

[CV-161] Data Adaptive Few-shot Multi Label Segmentation with Foundation Model

Link: https://arxiv.org/abs/2410.09759
Authors: Gurunath Reddy, Dattesh Shanbhag, Deepa Anand
Keywords: obtaining accurate annotations, shot algorithms attractive, algorithms attractive, high cost, cost of obtaining
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The high cost of obtaining accurate annotations for image segmentation and localization makes the use of one- and few-shot algorithms attractive. Several state-of-the-art methods for few-shot segmentation have emerged, including text-based prompting for the task, but they suffer from sub-optimal performance on medical images. Leveraging sub-pixel level features of existing Vision Transformer (ViT) based foundation models to identify a similar region of interest (RoI) based on a single template image has been shown to be very effective for one-shot segmentation and localization in medical images across modalities. However, such methods rely on the assumption that the template image and test image are well matched and that simple correlation is sufficient to obtain correspondences. In practice, however, such an approach can fail to generalize to clinical data due to patient pose changes and inter-protocol variations even within a single modality, or fail to extend to 3D data using a single template image. Moreover, for multi-label tasks, the RoI identification has to be performed sequentially. In this work, we propose foundation model (FM) based adapters for single-label and multi-label localization and segmentation to address these concerns. We demonstrate the efficacy of the proposed method for multiple segmentation and localization tasks for both 2D and 3D data as well as clinical data with different poses, and evaluate against the state-of-the-art few-shot segmentation methods.

[CV-162] Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09750
Authors: Juseong Jin, Chang Wook Jeong
Keywords: Conversation agents powered, Conversation agents, agents powered, Conversation, scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 AIM-FM Workshop

Abstract:Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish an LVLM, Surgical-LLaVA, fine-tuned on instruction-following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation on visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.

[CV-163] EMWaveNet: Physically Explainable Neural Network Based on Microwave Propagation for SAR Target Recognition

Link: https://arxiv.org/abs/2410.09749
Authors: Zhuoxuan Li, Xu Zhang, Shumeng Yu, Haipeng Wang
Keywords: achieved significant performance, Deep learning technologies, significant performance improvements, Deep learning, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning technologies have achieved significant performance improvements in the field of synthetic aperture radar (SAR) image target recognition over traditional methods. However, the inherent “black box” property of deep learning models leads to a lack of transparency in decision-making processes, making them difficult to be convincingly applied in practice. This is especially true in SAR applications, where the credibility and reliability of model predictions are crucial. The complexity and insufficient explainability of deep networks have become a bottleneck for their application. To tackle this issue, this study proposes a physically explainable framework for complex-valued SAR image recognition, designed based on the physical process of microwave propagation. This framework utilizes complex-valued SAR data to explore the amplitude and phase information and its intrinsic physical properties. The network architecture is fully parameterized, with all learnable parameters endowed with clear physical meanings, and the computational process is completed entirely in the frequency domain. Experiments on both the complex-valued MSTAR dataset and a self-built Qilu-1 complex-valued dataset were conducted to validate the effectiveness of framework. In conditions of target overlap, our model discerns categories others find challenging. Against 0dB forest background noise, it boasts a 20% accuracy improvement over traditional neural networks. When targets are 60% masked by noise, it still outperforms other models by 9%. An end-to-end complex-valued synthetic aperture radar automatic target recognition (SAR-ATR) system has also been constructed to perform recognition tasks in interference SAR scenarios. The results demonstrate that the proposed method possesses a strong physical decision logic, high physical explainability and robustness, as well as excellent dealiasing capabilities.

[CV-164] t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving

Link: https://arxiv.org/abs/2410.09747
Authors: Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun Luo
Keywords: autonomous vehicles, wide adoption, analytics to fuse, fuse their outputs, multimodal sensors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 15 pages, 16 figures

Abstract:Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for a robust perception become imperative. However, existing fusion methods often make two assumptions rarely holding in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have various resolutions and failures of radars may occur, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. The extensive experiments evidently demonstrate that compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x with the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.

[CV-165] MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Link: https://arxiv.org/abs/2410.09733
Authors: Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
Keywords: significantly advanced multimodal, visual question answering, advanced multimodal understanding, large Vision-Language Models, enabling more sophisticated
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 15 figures

Abstract:The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs’ superior capabilities, researchers lack a comprehensive understanding of their compositionality – the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs’ compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o’s compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: this https URL

[CV-166] LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Link: https://arxiv.org/abs/2410.09732
Authors: Junyan Ye,Baichuan Zhou,Zilong Huang,Junan Zhang,Tianyi Bai,Hengrui Kang,Jun He,Honglin Lin,Zihao Wang,Tong Wu,Zhizheng Wu,Yiping Chen,Dahua Lin,Conghui He,Weijia Li
Keywords: data increasingly challenging, credible multimodal data, multimodal data increasingly, synthetic data, making the discrimination
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 79 pages, 63 figures

Abstract:With the rapid development of AI-generated content, the future internet may be inundated with synthetic data, making the discrimination of authentic and credible multimodal data increasingly challenging. Synthetic data detection has thus garnered widespread attention, and the performance of large multimodal models (LMMs) in this task has attracted significant interest. LMMs can provide natural language explanations for their authenticity judgments, enhancing the explainability of synthetic content detection. Simultaneously, the task of distinguishing between real and synthetic data effectively tests the perception, knowledge, and reasoning capabilities of LMMs. In response, we introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities. LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks, allowing for a comprehensive analysis of LMMs. We evaluated 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities. More information about LOKI can be found at this https URL

[CV-167] Distributed Intelligent Video Surveillance for Early Armed Robbery Detection based on Deep Learning

Link: https://arxiv.org/abs/2410.09731
Authors: Sergio Fernandez-Testa,Edwin Salcedo
Keywords: Low employment rates, Latin America, Low employment, rates in Latin, America have contributed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the proceedings of the 37th Conference on Graphics, Patterns and Images (SIBGRAPI 2024)

Abstract:Low employment rates in Latin America have contributed to a substantial rise in crime, prompting the emergence of new criminal tactics. For instance, “express robbery” has become a common crime committed by armed thieves, in which they drive motorcycles and assault people in public in a matter of seconds. Recent research has approached the problem by embedding weapon detectors in surveillance cameras; however, these systems are prone to false positives if no counterpart confirms the event. In light of this, we present a distributed IoT system that integrates a computer vision pipeline and object detection capabilities into multiple end-devices, constantly monitoring for the presence of firearms and sharp weapons. Once a weapon is detected, the end-device sends a series of frames to a cloud server that implements a 3DCNN to classify the scene as either a robbery or a normal situation, thus minimizing false positives. The deep learning process to train and deploy weapon detection models uses a custom dataset with 16,799 images of firearms and sharp weapons. The best-performing model, YOLOv5s, optimized using TensorRT, achieved a final mAP of 0.87 running at 4.43 FPS. Additionally, the 3DCNN demonstrated 0.88 accuracy in detecting abnormal situations. Extensive experiments validate that the proposed system significantly reduces false positives while autonomously monitoring multiple locations in real-time.
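
The two-stage confirmation described in this abstract (an on-device weapon detector that triggers a clip-level classifier in the cloud) can be sketched roughly as follows. This is only an illustration of the control flow: `detect_weapon` and `classify_clip` are hypothetical stand-ins for the YOLOv5s detector and the 3DCNN, and the clip length is an assumed parameter, not a value from the paper.

```python
from collections import deque


def monitor(frames, detect_weapon, classify_clip, clip_len=4):
    """Return the indices of frames at which the second stage confirms a robbery.

    detect_weapon(frame) -> bool and classify_clip(list_of_frames) -> str are
    hypothetical callables standing in for the detector and the clip classifier.
    """
    buffer = deque(maxlen=clip_len)  # rolling window of the most recent frames
    alerts = []
    for index, frame in enumerate(frames):
        buffer.append(frame)
        # Stage 1: a cheap per-frame check runs on the end-device.
        if len(buffer) == clip_len and detect_weapon(frame):
            # Stage 2: only now is the whole clip sent for confirmation.
            if classify_clip(list(buffer)) == "robbery":
                alerts.append(index)
    return alerts
```

The point of the cascade is that a single-frame detection never raises an alert on its own; the clip classifier has to agree first, which is how false positives are filtered.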

[CV-168] MIRAGE: Multimodal Identification and Recognition of Annotations in Indian General Prescriptions

Link: https://arxiv.org/abs/2410.09729
Authors: Tavish Mankash,V.S. Chaithanya Kota,Anish De,Praveen Prakash,Kshitij Jadhav
Keywords: Hospitals generate thousands, Electronic Medical Records, Hospitals generate, availability of Electronic, Electronic Medical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 figures, 2 tables

Abstract:Hospitals generate thousands of handwritten prescriptions, a practice that remains prevalent despite the availability of Electronic Medical Records (EMR). This method of record-keeping hinders the examination of long-term medication effects, impedes statistical analysis, and makes the retrieval of records challenging. Handwritten prescriptions pose a unique challenge, requiring specialized data for training models to recognize medications and their patterns of recommendation. While current handwriting recognition approaches typically employ 2-D LSTMs, recent studies have explored the use of Large Language Models (LLMs) for Optical Character Recognition (OCR). Building on this approach, we focus on extracting medication names from medical records. Our methodology MIRAGE (Multimodal Identification and Recognition of Annotations in indian GEneral prescriptions) involves fine-tuning the LLaVA 1.6 and Idefics2 models. Our research utilizes a dataset provided by Medyug Technology, consisting of 743,118 fully annotated high-resolution simulated medical records from 1,133 doctors across India. We demonstrate that our methodology exhibits 82% accuracy in medication name and dosage extraction. We provide a detailed account of our research methodology and results, notes about HWR with Multimodal LLMs, and release a small dataset of 100 medical records with labels.

[CV-169] AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Link: https://arxiv.org/abs/2410.09714
Authors: Yuchen Li,Li Zhang,Youwei Liang,Pengtao Xie
Keywords: Segment Anything Model, gained significant recognition, mask decoder feature, mask decoder, gained significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Segment Anything Model (SAM) has gained significant recognition in the field of semantic segmentation due to its versatile capabilities and impressive performance. Despite its success, SAM faces two primary limitations: (1) it relies heavily on meticulous human-provided prompts such as key points, bounding boxes, or text messages, which are labor-intensive to produce; (2) the mask decoder’s feature representation is sometimes inaccurate, as it solely employs dot product operations at the end of the mask decoder, which inadequately capture the correlations necessary for precise segmentation. Current solutions to these problems, such as fine-tuning SAM, often require retraining a large number of parameters, which demands a large amount of time and computing resources. To address these limitations, we propose an automated prompting and mask calibration method called AM-SAM, based on a bi-level optimization framework. Our approach automatically generates prompts for an input image, eliminating the need for human involvement while performing well in early training epochs and converging faster. Additionally, we freeze the main part of SAM and modify the mask decoder with Low-Rank Adaptation (LoRA), enhancing the mask decoder’s feature representation by incorporating advanced techniques that go beyond simple dot product operations to more accurately capture and utilize feature correlations. Our experimental results demonstrate that AM-SAM achieves accurate segmentation, matching or exceeding the effectiveness of human-generated and default prompts. Notably, on the body segmentation dataset, our method yields a 5% higher dice score with a 4-example few-shot training set compared to the SOTA method, underscoring its superiority in semantic segmentation tasks.
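
For readers unfamiliar with the dice score quoted above, a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on flat binary masks is given below. The function name and the convention of returning 1.0 for two empty masks are assumptions made for this illustration; the paper's evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two equally sized binary masks (flat 0/1 lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixel.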

[CV-170] EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

Link: https://arxiv.org/abs/2410.09704
Authors: Milos Vukadinovic,Xiu Tang,Neal Yuan,Paul Cheng,Debiao Li,Susan Cheng,Bryan He,David Ouyang
Keywords: cardiac imaging modality, capturing ultrasound video, ultrasound video data, imaging modality, capturing ultrasound
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 3 tables, 3 figures

Abstract:Echocardiography is the most widely used cardiac imaging modality, capturing ultrasound video data to assess cardiac structure and function. Artificial intelligence (AI) in echocardiography has the potential to streamline manual tasks and improve reproducibility and precision. However, most echocardiography AI models are single-view, single-task systems that do not synthesize complementary information from multiple views captured during a full exam, which limits their performance and scope of application. To address this problem, we introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs. EchoPrime uses contrastive learning to train a unified embedding model for all standard views in a comprehensive echocardiogram study with representation of both rare and common diseases and diagnoses. EchoPrime then utilizes view-classification and a view-informed anatomic attention model to weight video-specific interpretations, accurately mapping the relationship between echocardiographic views and anatomical structures. With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study and performs holistic clinical echocardiography interpretation. In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function, surpassing the performance of both task-specific approaches and prior foundation models. Following rigorous clinical evaluation, EchoPrime can assist physicians in the automated preliminary assessment of comprehensive echocardiography.

[CV-171] Robust 3D Point Clouds Classification based on Declarative Defenders

Link: https://arxiv.org/abs/2410.09691
Authors: Kaidong Li,Tianxiao Zhang,Chuncong Zhong,Ziming Zhang,Guanghui Wang
Keywords: respective input data, cloud classification requires, divergent characteristics, respective input, point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D point cloud classification requires distinct models from 2D image classification due to the divergent characteristics of the respective input data. While 3D point clouds are unstructured and sparse, 2D images are structured and dense. Bridging the domain gap between these two data types is a non-trivial challenge to enable model interchangeability. Recent research using Lattice Point Classifier (LPC) highlights the feasibility of cross-domain applicability. However, the lattice projection operation in LPC generates 2D images with disconnected projected pixels. In this paper, we explore three distinct algorithms for mapping 3D point clouds into 2D images. Through extensive experiments, we thoroughly examine and analyze their performance and defense mechanisms. Leveraging current large foundation models, we scrutinize the feature disparities between regular 2D images and projected 2D images. The proposed approaches demonstrate superior accuracy and robustness against adversarial attacks. The generative model-based mapping algorithms yield regular 2D images, further minimizing the domain gap from regular 2D classification tasks. The source code is available at this https URL.

[CV-172] FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis

Link: https://arxiv.org/abs/2410.09690
Authors: Vishnu Mani Hema,Shubhra Aich,Christian Haene,Jean-Charles Bazin,Fernando de la Torre
Keywords: deep implicit modeling, digitizing human figures, advancement in deep, deep implicit, implicit modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advancement in deep implicit modeling and articulated models has significantly enhanced the process of digitizing human figures in 3D from just a single image. While state-of-the-art methods have greatly improved geometric precision, the challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images. This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets, whereas their 2D counterparts are abundant and easily accessible. To address this issue, our paper proposes leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization. We incorporate 2D priors from the fashion dataset to learn the occluded back view, refined with our proposed domain alignment strategy. We then fuse this information with the input image to obtain a fully textured mesh of the given person. Through extensive experimentation on standard 3D human benchmarks, we demonstrate the superior performance of our approach in terms of both texture and geometry. Code and dataset are available at this https URL.

[CV-173] Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings

Link: https://arxiv.org/abs/2410.09649
Authors: Mojtaba Yousefi,Jack Collins
Keywords: Pattern Recognition, Rich Sutton, proposed by Rich, Computer Vision, bitter lesson
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NLP4Science Workshop, EMNLP 2024

Abstract:This study examines the alignment of Conference on Computer Vision and Pattern Recognition (CVPR) research with the principles of the “bitter lesson” proposed by Rich Sutton. We analyze two decades of CVPR abstracts and titles using large language models (LLMs) to assess the field’s embrace of these principles. Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development. This work contributes to the ongoing dialogue about the most effective strategies for advancing machine learning and computer vision, offering insights that may guide future research priorities and methodologies in the field.

[CV-174] DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

Link: https://arxiv.org/abs/2410.09633
Authors: Daniel Gallo Fernández,Rǎzvan-Andrei Matişan,Alejandro Monroy Muñoz,Ana-Maria Vasilcoiu,Janusz Partyka,Tin Hadži Veljković,Metod Jazbec
Keywords: achieved unprecedented performance, slow inference due, iterative sampling process, initial sampling steps, achieved unprecedented
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS, see this https URL

Abstract:Diffusion models have achieved unprecedented performance in image generation, yet they suffer from slow inference due to their iterative sampling process. To address this, early-exiting has recently been proposed, where the depth of the denoising network is made adaptive based on the (estimated) difficulty of each sampling step. Here, we discover an interesting “phase transition” in the sampling process of current adaptive diffusion models: the denoising network consistently exits early during the initial sampling steps, until it suddenly switches to utilizing the full network. Based on this, we propose accelerating generation by employing a shallower denoising network in the initial sampling steps and a deeper network in the later steps. We demonstrate empirically that our dual-backbone approach, DuoDiff, outperforms existing early-exit diffusion methods in both inference speed and generation quality. Importantly, DuoDiff is easy to implement and complementary to existing approaches for accelerating diffusion.
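
The dual-backbone idea in this abstract (a shallower denoiser for the initial sampling steps, a deeper one afterwards) amounts to a simple switch inside the sampling loop. The sketch below shows only that switch. The denoisers are hypothetical callables, `switch_step` is an assumed hyperparameter, and the update rule is a placeholder rather than an actual diffusion sampler.

```python
def dual_backbone_sample(x, num_steps, switch_step, shallow, deep):
    """Run num_steps updates, using `shallow` before switch_step and `deep` from it on.

    shallow(x, step) and deep(x, step) are hypothetical stand-ins for the two
    denoising networks; each returns the updated sample.
    """
    used = []  # records which backbone handled each step, for inspection
    for step in range(num_steps):
        if step < switch_step:
            x = shallow(x, step)
            used.append("shallow")
        else:
            x = deep(x, step)
            used.append("deep")
    return x, used
```

Because the choice depends only on the step index, no per-step exit decision is needed at inference time, which is what distinguishes this schedule from adaptive early-exit methods.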

[CV-175] RailYolact – A Yolact Focused on edge for Real-Time Rail Segmentation

Link: https://arxiv.org/abs/2410.09612
Authors: Qihao Qian
Keywords: Ensuring obstacle avoidance, autonomous driving trains, Ensuring obstacle, obstacle avoidance, surface is crucial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ensuring obstacle avoidance on the rail surface is crucial for the safety of autonomous trains, and its first step is to segment the rail regions. We chose to build upon Yolact for our work. To address the rough edges in the rail masks predicted by the model, we incorporated edge information extracted by an edge operator into the original Yolact loss function to strengthen the model’s focus on rail edges. Additionally, we applied a box filter to smooth the jagged ground-truth mask edges caused by linear interpolation. Since the integration of edge information and the smoothing process occur only during training, the inference speed of the model remains unaffected. The experimental results on our custom rail dataset demonstrated an improvement in prediction accuracy. Moreover, the results on Cityscapes showed improvements of 4.1 and 4.6 in AP and AP_50, respectively, compared to Yolact.
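
As an illustration of the box-filter smoothing mentioned in this abstract, the sketch below averages each pixel of a binary mask over a k x k window. The kernel size and the border handling (averaging only over in-bounds pixels) are assumptions made for this example; the abstract does not specify either.

```python
def box_filter(mask, k=3):
    """Smooth a 2D mask (list of rows) with a k x k mean filter.

    Near the border the window is clipped to the image, so the mean is taken
    over in-bounds pixels only (an assumed convention).
    """
    height, width = len(mask), len(mask[0])
    radius = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            total = sum(mask[j][i] for j in rows for i in cols)
            out[y][x] = total / (len(rows) * len(cols))
    return out
```

Applied to a hard 0/1 mask, this turns step edges into short ramps, which is the smoothing effect the abstract attributes to the box filter.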

[CV-176] FiRework: Field Refinement Framework for Efficient Enhancement of Deformable Registration

Link: https://arxiv.org/abs/2410.09595
Authors: Haiqiao Wang,Dong Ni,Yi Wang
Keywords: problems involving complex, deformations remains challenging, complex deformations remains, involving complex deformations, image registration remains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deformable image registration remains a fundamental task in clinical practice, yet solving registration problems involving complex deformations remains challenging. Current deep learning-based registration methods employ continuous deformation to model large deformations, which often suffer from accumulated registration errors and interpolation inaccuracies. Moreover, achieving satisfactory results with these frameworks typically requires a large number of cascade stages, demanding substantial computational resources. Therefore, we propose a novel approach, the field refinement framework (FiRework), tailored for unsupervised deformable registration, aiming to address these challenges. In FiRework, we redesign the continuous deformation framework to mitigate the aforementioned errors. Notably, our FiRework requires only one level of recursion during training and supports continuous inference, offering improved efficacy compared to continuous deformation frameworks. We conducted experiments on two brain MRI datasets, enhancing two existing deformable registration networks with FiRework. The experimental results demonstrate the superior performance of our proposed framework in deformable registration. The code is publicly available at this https URL.

[CV-177] ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

Link: https://arxiv.org/abs/2410.09592
Authors: Hongbin Xu,Weitao Chen,Zhipeng Zhou,Feng Xiao,Baigui Sun,Mike Zheng Shou,Wenxiong Kang
Keywords: challenging issue, recent advancements, remains a challenging, triplane decoder, generation methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Draft version. This paper is still in submission. For access to our project page and code, please visit: this https URL

Abstract:Despite recent advancements in 3D generation methods, achieving controllability remains a challenging issue. Current approaches utilizing score-distillation sampling are hindered by laborious procedures that consume a significant amount of time. Furthermore, the process of first generating 2D representations and then mapping them to 3D lacks internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate for a joint training framework. In the condition training branch, we lock the triplane decoder and reuse the deep and robust encoding layers pretrained with millions of 3D data in LRM. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picked manual generation. The comprehensive experiments conducted on quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.

[CV-178] POPoS: Improving Efficient and Robust Facial Landmark Detection with Parallel Optimal Position Search

Link: https://arxiv.org/abs/2410.09583
Authors: Chong-Yang Xiang,Jun-Yan He,Zhi-Qi Cheng,Xiao Wu,Xian-Sheng Hua
Keywords: facial landmark detection, Achieving a balance, Optimal Position Search, critical challenge, challenge in facial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces the Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the fundamental limitations of traditional FLD methods. POPoS employs three key innovations: (1) Pseudo-range multilateration is utilized to correct heatmap errors, enhancing the precision of landmark localization. By integrating multiple anchor points, this approach minimizes the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To improve the pseudo-range accuracy of selected anchor points, a new loss function, named multilateration anchor loss, is proposed. This loss function effectively enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, significantly enhancing computational efficiency and reducing processing time. Comprehensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution scenarios with minimal computational overhead. These features establish POPoS as a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios. The code is available at this https URL
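
To give a feel for how multiple anchor points can jointly fix a position, here is a minimal sketch of plain range-based multilateration in 2D, solved by linearized least squares. It is not the paper's pseudo-range formulation or its parallel search algorithm; the function name and the choice of the first anchor as the reference are assumptions for this illustration.

```python
def multilaterate(anchors, dists):
    """Estimate (x, y) from anchor positions and measured distances.

    Subtracting the circle equation of anchor 0 from that of anchor i gives a
    linear equation a*x + b*y = c; the 2x2 normal equations are then solved.
    """
    if len(anchors) < 3 or len(anchors) != len(dists):
        raise ValueError("need at least three anchors, one distance per anchor")
    (x0, y0), d0 = anchors[0], dists[0]
    s_aa = s_ab = s_bb = s_ac = s_bc = 0.0
    for (xi, yi), di in zip(anchors[1:], dists[1:]):
        a = 2.0 * (xi - x0)
        b = 2.0 * (yi - y0)
        c = xi * xi + yi * yi - x0 * x0 - y0 * y0 - di * di + d0 * d0
        s_aa += a * a
        s_ab += a * b
        s_bb += b * b
        s_ac += a * c
        s_bc += b * c
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not determined")
    x = (s_ac * s_bb - s_ab * s_bc) / det
    y = (s_aa * s_bc - s_ab * s_ac) / det
    return x, y
```

With more than three anchors the system is overdetermined, so an error in any single distance is averaged out by the least-squares fit, which matches the robustness argument made in the abstract.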

[CV-179] Improving 3D Finger Traits Recognition via Generalizable Neural Rendering

Link: https://arxiv.org/abs/2410.09582
Authors: Hongbin Xu,Junduan Huang,Yuer Ma,Zifeng Li,Wenxiong Kang
Keywords: demonstrated a powerful, powerful ability, finger, finger traits, Trait Guided Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is accepted in IJCV. For further information and access to the code, please visit our project page: this https URL

Abstract:3D biometric techniques on finger traits have become a new trend and have demonstrated a powerful ability for recognition and anti-counterfeiting. Existing methods follow an explicit 3D pipeline that reconstructs the models first and then extracts features from 3D models. However, these explicit 3D methods suffer from the following problems: 1) Inevitable information dropping during 3D reconstruction; 2) Tight coupling between specific hardware and algorithm for 3D reconstruction. It leads us to a question: Is it indispensable to reconstruct 3D information explicitly in recognition tasks? Hence, we consider this problem in an implicit manner, leaving the nerve-wracking 3D reconstruction problem for learnable neural networks with the help of neural radiance fields (NeRFs). We propose FingerNeRF, a novel generalizable NeRF for 3D finger biometrics. To handle the shape-radiance ambiguity problem that may result in incorrect 3D geometry, we aim to involve extra geometric priors based on the correspondence of binary finger traits like fingerprints or finger veins. First, we propose a novel Trait Guided Transformer (TGT) module to enhance the feature correspondence with the guidance of finger traits. Second, we involve extra geometric constraints on the volume rendering loss with the proposed Depth Distillation Loss and Trait Guided Rendering Loss. To evaluate the performance of the proposed method on different modalities, we collect two new datasets: SCUT-Finger-3D with finger images and SCUT-FingerVein-3D with finger vein images. Moreover, we also utilize the UNSW-3D dataset with fingerprint images for evaluation. In experiments, our FingerNeRF can achieve 4.37% EER on SCUT-Finger-3D dataset, 8.12% EER on SCUT-FingerVein-3D dataset, and 2.90% EER on UNSW-3D dataset, showing the superiority of the proposed implicit method in 3D finger biometrics.
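
The equal error rate (EER) reported above is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of one common way to estimate it from similarity scores follows. The threshold sweep over observed scores and the averaging of the two rates at the closest point are assumptions of this example, not details taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from genuine-pair and impostor-pair similarity scores.

    A pair is accepted when its score is >= the threshold. Every observed
    score is tried as a threshold, and the one where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest is used.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

Lower is better: an EER of 2.90% means that, at the balanced threshold, roughly 2.9% of impostor pairs are accepted and roughly 2.9% of genuine pairs are rejected.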

[CV-180] Reconstructive Visual Instruction Tuning

Link: https://arxiv.org/abs/2410.09575
Authors: Haochen Wang,Anlin Zheng,Yucheng Zhao,Tiancai Wang,Zheng Ge,Xiangyu Zhang,Zhaoxiang Zhang
Keywords: Large Multimodal Models, Large Multimodal, paper introduces reconstructive, family of Large, visual instruction tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.

[CV-181] Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Link: https://arxiv.org/abs/2410.09566
Authors: Zhi-Song Liu,Li-Wen Wang,Jun Xiao,Vicky Kalogeiton
Keywords: attracted widespread attention, style transfer, past few years, style, Image style transfer
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 8 figures

Abstract:Image style transfer has attracted widespread attention in the past few years. Despite its remarkable results, it requires additional style images as references, making it less flexible and less convenient. Using text is the most natural way to describe the style. More importantly, text can describe implicit abstract styles, like styles of specific artists or art movements. In this paper, we propose Contrastive Learning for Artistic Style Transfer (CLAST), which leverages advanced image-text encoders to control arbitrary style transfer. We introduce a supervised contrastive training strategy to effectively extract style descriptions from the image-text model (i.e., CLIP), which aligns stylization with the text description. To this end, we also propose novel and efficient adaLN-based state space models that explore style-content fusion. Finally, we achieve text-driven image style transfer. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods in artistic style transfer. More importantly, it does not require online fine-tuning and can render a 512x512 image in 0.03s.

[CV-182] Robust Optical Flow Computation: A Higher-Order Differential Approach

Link: https://arxiv.org/abs/2410.09563
Authors: Chanuka Algama,Kasun Amarasinghe
Keywords: dynamic visual scenes, unraveling dynamic visual, optical flow, computer vision, visual scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:In the domain of computer vision, optical flow stands as a cornerstone for unraveling dynamic visual scenes. However, the challenge of accurately estimating optical flow under conditions of large nonlinear motion patterns remains an open question. The image flow constraint is vulnerable to substantial displacements and rapid spatial transformations. Inaccurate approximations inherent in numerical differentiation techniques can further amplify such intricacies. In response, this research proposes an innovative algorithm for optical flow computation, utilizing the higher precision of second-order Taylor series approximation within the differential estimation framework. By embracing this mathematical underpinning, the research seeks to extract more information about the behavior of the function under complex real-world scenarios and estimate the motion of areas with a lack of texture. An impressive showcase of the algorithm’s capabilities emerges through its performance on renowned optical flow benchmarks such as KITTI (2015) and Middlebury. The average endpoint error (AEE), which computes the Euclidean distance between the calculated flow field and the ground truth flow field, stands notably diminished, validating the effectiveness of the algorithm in handling complex motion patterns.
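
The average endpoint error (AEE) named in this abstract is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch is shown below; representing the flow field as a flat list of (u, v) tuples is an assumption made to keep the example dependency-free.

```python
import math


def average_endpoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    pred and gt are equally long lists of (u, v) tuples, one per pixel.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("flow fields must be non-empty and the same size")
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred, gt)
    )
    return total / len(pred)
```

For example, a field of two pixels where one prediction is off by the vector (3, 4) and the other is exact has an AEE of 2.5.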

[CV-183] DiffuTraj: A Stochastic Vessel Trajectory Prediction Approach via Guided Diffusion Process

Link: https://arxiv.org/abs/2410.09550
Authors: Changlin Li,Yanglei Gan,Tian Lan,Yuxiang Cai,Xueyi Liu,Run Lin,Qiao Liu
Keywords: prediction system capable, Maritime vessel maneuvers, requires vessel trajectory, vessel trajectory prediction, trajectory prediction system
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures, and 3 tables; submitted to IEEE Transactions on Intelligent Transportation Systems on 17 June 2024

Abstract:Maritime vessel maneuvers, characterized by their inherent complexity and indeterminacy, require a vessel trajectory prediction system capable of modeling the multi-modal nature of future motion states. Conventional stochastic trajectory prediction methods utilize latent variables to represent the multi-modality of vessel motion, but tend to overlook the complexity and dynamics inherent in maritime behavior. In contrast, we explicitly simulate the transition of vessel motion from uncertainty towards a state of certainty, effectively handling future indeterminacy in dynamic scenes. In this paper, we present a novel framework (DiffuTraj) to conceptualize the trajectory prediction task as a guided reverse process of motion pattern uncertainty diffusion, in which we progressively remove uncertainty from maritime regions to delineate the intended trajectory. Specifically, we encode the previous states of the target vessel, vessel-vessel interactions, and the environment context as guiding factors for trajectory generation. Subsequently, we devise a transformer-based conditional denoiser to capture spatio-temporal dependencies, enabling the generation of trajectories better aligned with the particular maritime environment. Comprehensive experiments on vessel trajectory prediction benchmarks demonstrate the superiority of our method.

[CV-184] Bi-temporal Gaussian Feature Dependency Guided Change Detection in Remote Sensing Images

链接: https://arxiv.org/abs/2410.09539
作者: Yi Xiao,Bin Luo,Jun Liu,Xin Su,Wei Wang
关键词-EN: domain information, domain information differences, enables the identification, information, identification of alterations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

Abstract:Change Detection (CD) enables the identification of alterations between images of the same area captured at different times. However, existing CD methods still struggle to address pseudo changes resulting from domain information differences in multi-temporal images and instances of detail errors caused by the loss and contamination of detail features during the upsampling process in the network. To address this, we propose a bi-temporal Gaussian distribution feature-dependent network (BGFD). Specifically, we first introduce the Gaussian noise domain disturbance (GNDD) module, which approximates distribution using image statistical features to characterize domain information, samples noise to perturb the network for learning redundant domain information, addressing domain information differences from a more fundamental perspective. Additionally, within the feature dependency facilitation (FDF) module, we integrate a novel mutual information difference loss (L_MI) and more sophisticated attention mechanisms to enhance the capabilities of the network, ensuring the acquisition of essential domain information. Subsequently, we have designed a novel detail feature compensation (DFC) module, which compensates for detail feature loss and contamination introduced during the upsampling process from the perspectives of enhancing local features and refining global features. The BGFD has effectively reduced pseudo changes and enhanced the detection capability of detail information. It has also achieved state-of-the-art performance on four publicly available datasets - DSIFN-CD, SYSU-CD, LEVIR-CD, and S2Looking, surpassing baseline models by +8.58%, +1.28%, +0.31%, and +3.76% respectively, in terms of the F1-Score metric.

[CV-185] Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence ACCV2024

链接: https://arxiv.org/abs/2410.09533
作者: Felipe Cadar,Guilherme Potje,Renato Martins,Cédric Demonceaux,Erickson R. Nascimento
关键词-EN: Visual correspondence, computer vision tasks, structure from motion, key computer vision, including camera localization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ACCV 2024

Abstract:Visual correspondence is a crucial step in key computer vision tasks, including camera localization, image registration, and structure from motion. The most effective techniques for matching keypoints currently involve using learned sparse or dense matchers, which need pairs of images. These neural networks have a good general understanding of features from both images, but they often struggle to match points from different semantic areas. This paper presents a new method that uses semantic cues from foundation vision model features (like DINOv2) to enhance local feature matching by incorporating semantic reasoning into existing descriptors. Therefore, the learned descriptors do not require image pairs at inference time, allowing feature caching and fast matching using similarity search, unlike learned matchers. We present adapted versions of six existing descriptors, with an average increase in performance of 29% in camera localization, with comparable accuracy to existing matchers such as LightGlue and LoFTR in two existing benchmarks. Both code and trained models are available at this https URL

[CV-186] Preserving Old Memories in Vivid Detail: Human-Interactive Photo Restoration Framework

链接: https://arxiv.org/abs/2410.09529
作者: Seung-Yeon Back,Geonho Son,Dahye Jeong,Eunil Park,Simon S. Woo
关键词-EN: technology enables preserving, enables preserving visual, preserving visual memories, restoration technology enables, technology enables
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

Abstract:Photo restoration technology enables preserving visual memories in photographs. However, physical prints are vulnerable to various forms of deterioration, ranging from physical damage to loss of image quality, etc. While restoration by human experts can improve the quality of outcomes, it often comes at a high price in terms of cost and time for restoration. In this work, we present the AI-based photo restoration framework composed of multiple stages, where each stage is tailored to enhance and restore specific types of photo damage, accelerating and automating the photo restoration process. By integrating these techniques into a unified architecture, our framework aims to offer a one-stop solution for restoring old and deteriorated photographs. Furthermore, we present a novel old photo restoration dataset because we lack a publicly available dataset for our evaluation.

[CV-187] Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence ACML2024

链接: https://arxiv.org/abs/2410.09519
作者: Vencia Herzog,Stefan Suwelack
关键词-EN: achieved remarkable success, success in NLP, Self-supervised pre-training, achieved remarkable, remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ACML 2024

Abstract:Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.

[CV-188] Fine-grained subjective visual quality assessment for high-fidelity compressed images

链接: https://arxiv.org/abs/2410.09501
作者: Michela Testolina,Mohsen Jenadeleh,Shima Mohammadi,Shaolin Su,Joao Ascenso,Touradj Ebrahimi,Jon Sneyers,Dietmar Saupe
关键词-EN: videos widely accessible, widely accessible, display technologies, technologies have made, videos widely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Michela Testolina, Mohsen Jenadeleh contributed equally to this work, submitted to the Data Compression Conference (DCC) 2025

Abstract:Advances in image compression, storage, and display technologies have made high-quality images and videos widely accessible. At this level of quality, distinguishing between compressed and original content becomes difficult, highlighting the need for assessment methodologies that are sensitive to even the smallest visual quality differences. Conventional subjective visual quality assessments often use absolute category rating scales, ranging from "excellent" to "bad". While suitable for evaluating more pronounced distortions, these scales are inadequate for detecting subtle visual differences. The JPEG standardization project AIC is currently developing a subjective image quality assessment methodology for high-fidelity images. This paper presents the proposed assessment methods, a dataset of high-quality compressed images, and their corresponding crowdsourced visual quality ratings. It also outlines a data analysis approach that reconstructs quality scale values in just noticeable difference (JND) units. The assessment method uses boosting techniques on visual stimuli to help observers detect compression artifacts more clearly. This is followed by a rescaling process that adjusts the boosted quality values back to the original perceptual scale. This reconstruction yields a fine-grained, high-precision quality scale in JND units, providing more informative results for practical applications. The dataset and code to reproduce the results will be available at this https URL.

[CV-189] A Simple yet Effective Subway Self-positioning Method based on Aerial-view Sleeper Detection

链接: https://arxiv.org/abs/2410.09492
作者: Jiajie Song,Ningfang Song,Xiong Pan,Xiaoxin Liu,Can Chen,Jingchun Cheng
关键词-EN: collision avoidance systems, urban underground rail, underground rail vehicles, subway, avoidance systems, hot-spot these years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 8 figures, under review for IEEE Sensors Journal publication

Abstract:With the rapid development of urban underground rail vehicles, subway positioning, which plays a fundamental role in traffic navigation and collision avoidance systems, has become a research hotspot in recent years. Most current subway positioning methods rely on localization beacons densely pre-installed alongside the railway tracks, requiring massive costs for infrastructure and maintenance, while commonly lacking flexibility and anti-interference ability. In this paper, we propose a low-cost and real-time visual-assisted self-localization framework to address the robust and convenient positioning problem for subways. First, we perform aerial-view rail sleeper detection based on the fast and efficient YOLOv8n network. The detection results are then used to achieve real-time correction of mileage values combined with geometric positioning information, obtaining precise subway locations. Front-camera videos of subway driving scenes along a 6.9 km route are collected and annotated from the simulator for validation of the proposed method. Experimental results show that our aerial-view sleeper detection algorithm can efficiently detect sleeper positions with an F1-score of 0.929 at 1111 fps, and that the proposed positioning framework achieves a mean percentage error of 0.1%, demonstrating its continuous and high-precision self-localization capability.

[CV-190] Distilling Invariant Representations with Dual Augmentation

链接: https://arxiv.org/abs/2410.09474
作者: Nikolaos Giakoumoglou,Tania Stathaki
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 1 figure, 3 tables. This paper presents preliminary results from a project that we have since discontinued, as our research focus has shifted to new directions

[CV-191] Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors

链接: https://arxiv.org/abs/2410.09467
作者: Hritam Basak,Hadi Tabatabaee,Shreekant Gayaka,Ming-Feng Li,Xin Yang,Cheng-Hao Kuo,Arnie Sen,Min Sun,Zhaozheng Yin
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-192] Skipping Computations in Multimodal LLMs NEURIPS2024

链接: https://arxiv.org/abs/2410.09454
作者: Mustafa Shukor,Matthieu Cord
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024 Workshop RBFM. Code: this https URL

[CV-193] MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

链接: https://arxiv.org/abs/2410.09453
作者: Xi Jiang,Jian Li,Hanqiu Deng,Yong Liu,Bin-Bin Gao,Yifeng Zhou,Jialin Li,Chengjie Wang,Feng Zheng
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The code and data are available at this https URL

[CV-194] An Expeditious Spatial Mean Radiant Temperature Mapping Framework using Visual SLAM and Semantic Segmentation

链接: https://arxiv.org/abs/2410.09443
作者: Wei Liang,Yiting Zhang,Ji Zhang,Erica Cochran Hameen
关键词-EN:
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop

[CV-195] Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models NEURIPS2024

链接: https://arxiv.org/abs/2410.09432
作者: Raghav Singhal,Kaustubh Ponkshe,Praneeth Vepakomma
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: RS and KP contributed equally to this work: 18 Pages, 9 Figures, and 8 Tables. Another version of the paper accepted at NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability

[CV-196] VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment EMNLP2024

链接: https://arxiv.org/abs/2410.09421
作者: Lei Li,Zhihui Xie,Mukai Li,Shunian Chen,Peiyi Wang,Liang Chen,Yazheng Yang,Benyou Wang,Lingpeng Kong,Qi Liu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: EMNLP 2024 Main Conference camera-ready version. This article supersedes arXiv:2312.10665

[CV-197] Neurally Integrated Finite Elements for Differentiable Elasticity on Evolving Domains

链接: https://arxiv.org/abs/2410.09417
作者: Gilles Daviet,Tianchang Shen,Nicholas Sharp,David I. W. Levin
关键词-EN:
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 21 figures

[CV-198] Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset NEURIPS2024

链接: https://arxiv.org/abs/2410.09416
作者: Haoming Lu,Feifei Zhong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024 Workshop (EvalEval 2024)

[CV-199] Distribution-aware Noisy-label Crack Segmentation

链接: https://arxiv.org/abs/2410.09409
作者: Xiaoyan Jiang,Xinlong Wan,Kaiying Zhu,Xihe Qiu,Zhijun Fang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-200] Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation

链接: https://arxiv.org/abs/2410.09403
作者: Haoyang Su,Renqi Chen,Shixiang Tang,Xinzhe Zheng,Jingzhe Li,Zhenfei Yin,Wanli Ouyang,Nanqing Dong
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

[CV-201] CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

链接: https://arxiv.org/abs/2410.09400
作者: Yifeng Xu,Zhenliang He,Shiguang Shan,Xilin Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-202] MITA: Bridging the Gap between Model and Data for Test-time Adaptation

链接: https://arxiv.org/abs/2410.09398
作者: Yige Yuan,Bingbing Xu,Teng Xiao,Liang Hou,Fei Sun,Huawei Shen,Xueqi Cheng
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-203] ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance ICME2024

链接: https://arxiv.org/abs/2410.09396
作者: Yongkang Cheng,Mingjiang Liang,Shaoli Huang,Jifeng Ning,Wei Liu
关键词-EN:
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注: Accepted by ICME 2024

[CV-204] CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification

链接: https://arxiv.org/abs/2410.09382
作者: Qianru Han,Xinwei He,Zhi Liu,Sannyuya Liu,Ying Zhang,Jinhai Xiang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-205] Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

链接: https://arxiv.org/abs/2410.09380
作者: Ting Yu,Kunhao Fu,Shuhui Wang,Qingming Huang,Jun Yu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IEEE Transactions on Circuits and Systems for Video Technology

[CV-206] Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

链接: https://arxiv.org/abs/2410.09379
作者: Ting Yu,Kunhao Fu,Jian Zhang,Qingming Huang,Jun Yu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Transactions on Image Processing

[CV-207] GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

链接: https://arxiv.org/abs/2410.09377
作者: Eileen Wang,Caren Han,Josiah Poon
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-208] ESVO2: Direct Visual-Inertial Odometry with Stereo Event Cameras

链接: https://arxiv.org/abs/2410.09374
作者: Junkai Niu,Sheng Zhong,Xiuyuan Lu,Shaojie Shen,Guillermo Gallego,Yi Zhou
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

[CV-209] Debiasing Vison-Language Models with Text-Only Training

链接: https://arxiv.org/abs/2410.09365
作者: Yunfan Yang,Chaoquan Jiang,Zhiyu Lin,Jinlin Xiao,Jiaming Zhang,Jitao Sang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

[CV-210] Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

链接: https://arxiv.org/abs/2410.09347
作者: Huayu Chen,Hang Su,Peize Sun,Jun Zhu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

[CV-211] Advanced Gesture Recognition in Autism: Integrating YOLOv7 Video Augmentation and VideoMAE for Video Analysis

链接: https://arxiv.org/abs/2410.09339
作者: Amit Kumar Singh,Trapti Shrivastava,Vrijendra Singh
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

[CV-212] Token Pruning using a Lightweight Background Aware Vision Transformer NEURIPS2024

链接: https://arxiv.org/abs/2410.09324
作者: Sudhakar Sah,Ravish Kumar,Honnesh Rohmetra,Ehsan Saboori
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 tables, 4 figures, FITML workshop@NeuRIPS 2024

[CV-213] Towards Multi-Modal Animal Pose Estimation: An In-Depth Analysis

链接: https://arxiv.org/abs/2410.09312
作者: Qianyi Deng,Oishi Deb,Amir Patel,Christian Rupprecht,Philip Torr,Niki Trigoni,Andrew Markham
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 35 pages, 5 figures, 8 tables

[CV-214] TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning

链接: https://arxiv.org/abs/2410.09306
作者: Tsiry Mayet,Pourya Shamsolmoali,Simon Bernard,Eric Granger,Romain Hérault,Clement Chatelain
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-215] Hierarchical uncertainty estimation for learning-based registration in neuroimaging

链接: https://arxiv.org/abs/2410.09299
作者: Xiaoling Hu,Karthik Gopinath,Peirong Liu,Malte Hoffmann,Koen Van Leemput,Oula Puonti,Juan Eugenio Iglesias
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 15 pages, 6 figures

[CV-216] SurgicalGS: Dynamic 3D Gaussian Splatting for Accurate Robotic-Assisted Surgical Scene Reconstruction

链接: https://arxiv.org/abs/2410.09292
作者: Jialei Chen,Xin Zhang,Mobarakol Islam,Francisco Vasconcelos,Danail Stoyanov,Daniel S. Elson,Baoru Huang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

[CV-217] Few Exemplar-Based General Medical Image Segmentation via Domain-Aware Selective Adaptation ACCV2024

链接: https://arxiv.org/abs/2410.09254
作者: Chen Xu,Qiming Huang,Yuqi Hou,Jiangxing Wu,Fan Zhang,Hyung Jin Chang,Jianbo Jiao
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ACCV 2024

[CV-218] Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking ACCV2024

链接: https://arxiv.org/abs/2410.09243
作者: Duy Le Dinh Anh,Kim Hoang Tran,Quang-Thuc Nguyen,Ngan Hoang Le
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024, main track, oral presentation

[CV-219] Foundation Model-Powered 3D Few-Shot Class Incremental Learning via Training-free Adaptor ACCV2024

链接: https://arxiv.org/abs/2410.09237
作者: Sahar Ahmadi,Ali Cheraghian,Morteza Saberi,Md.Towsif Abir,Hamidreza Dastmalchi,Farookh Hussain,Shafin Rahman
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024

[CV-220] Cross-Domain Distribution Alignment for Segmentation of Private Unannotated 3D Medical Images

链接: https://arxiv.org/abs/2410.09210
作者: Ruitong Sun,Mohammad Rostami
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-221] Cross-Domain Evaluation of Few-Shot Classification Models: Natural Images vs. Histopathological Images

链接: https://arxiv.org/abs/2410.09176
作者: Ardhendu Sekhar,Aditya Bhattacharya,Vinayak Goyal,Vrinda Goel,Aditya Bhangale,Ravi Kant Gupta,Amit Sethi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-222] Facial Chick Sexing: An Automated Chick Sexing System From Chick Facial Image

链接: https://arxiv.org/abs/2410.09155
作者: Marta Veganzones Rodriguez,Thinh Phan,Arthur F. A. Fernandes,Vivian Breen,Jesus Arango,Michael T. Kidd,Ngan Le
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-223] RealEra: Semantic-level Concept Erasure via Neighbor-Concept Mining

链接: https://arxiv.org/abs/2410.09140
作者: Yufan Liu,Jinyang An,Wanqian Zhang,Ming Li,Dayan Wu,Jingzi Gu,Zheng Lin,Weiping Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-224] Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset

链接: https://arxiv.org/abs/2410.09135
作者: Victor Radermecker,Andrea Zanon,Nancy Thomas,Annita Vapsi,Saba Rahimi,Rama Ramakrishnan,Daniel Borrajo
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

[CV-225] When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning

链接: https://arxiv.org/abs/2410.09132
作者: Hao Yan,Chaozhuo Li,Zhigang Yu,Jun Yin,Ruochen Liu,Peiyan Zhang,Weihao Han,Mingzheng Li,Zhengxin Zeng,Hao Sun,Weiwei Deng,Feng Sun,Qi Zhang,Senzhang Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-226] The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024

链接: https://arxiv.org/abs/2410.09088
作者: Yinan Han,Qingyuan Jiang,Hongming Mei,Yang Yang,Jinhui Tang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

[CV-227] Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

链接: https://arxiv.org/abs/2410.09083
作者: Lu Chen,Yuxuan Huang,Yixing Li,Yaohui Jin,Shuai Zhao,Zilong Zheng,Quanshi Zhang
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

[CV-228] Preserving Cardiac Integrity: A Topology-Infused Approach to Whole Heart Segmentation

链接: https://arxiv.org/abs/2410.10551
作者: Chenyu Zhang,Wenxue Guan,Xiaodan Xing,Guan Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-229] A Novel No-Reference Image Quality Metric For Assessing Sharpness In Satellite Imagery

链接: https://arxiv.org/abs/2410.10488
作者: Lucas Gonzalo Antonel
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures

[CV-230] Pubic Symphysis-Fetal Head Segmentation Network Using BiFormer Attention Mechanism and Multipath Dilated Convolution

链接: https://arxiv.org/abs/2410.10352
作者: Pengzhou Cai,Lu Jiang,Yanxin Li,Xiaojuan Liu,Libin Lan
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: MMM2025; Camera-ready Version; The code is available at this https URL

[CV-231] Anatomical feature-prioritized loss for enhanced MR to CT translation

链接: https://arxiv.org/abs/2410.10328
作者: Arthur Longuefosse,Baudouin Denis de Senneville,Gael Dournes,Ilyes Benlala,Pascal Desbarats,Fabien Baldacci
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-232] Two-Stage Approach for Brain MR Image Synthesis: 2D Image Synthesis and 3D Refinement MICCAI2024

链接: https://arxiv.org/abs/2410.10269
作者: Jihoon Cho,Seunghyuck Park,Jinah Park
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI 2024 BraSyn Challenge 1st place

[CV-233] Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization

链接: https://arxiv.org/abs/2410.10171
作者: Shanzhi Yin,Bolin Chen,Shiqi Wang,Yan Ye
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to TCSVT

[CV-234] Performance Evaluation of Deep Learning and Transformer Models Using Multimodal Data for Breast Cancer Classification MICCAI2024

链接: https://arxiv.org/abs/2410.10146
作者: Sadam Hussain,Mansoor Ali,Usman Naseem,Beatriz Alejandra Bosques Palomo,Mario Alexis Monsivais Molina,Jorge Alberto Garza Abdala,Daly Betzabeth Avendano Avalos,Servando Cardona-Huerta,T. Aaron Gulliver,Jose Gerardo Tamez Pena
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper was accepted and presented in 3rd Workshop on Cancer Prevention, detection, and intervenTion (CaPTion @ MICCAI 2024)

[CV-235] REHRSeg: Unleashing the Power of Self-Supervised Super-Resolution for Resource-Efficient 3D MRI Segmentation

链接: https://arxiv.org/abs/2410.10097
作者: Zhiyun Song,Yinjie Zhao,Xiaomin Li,Manman Fei,Xiangyu Zhao,Mengjun Liu,Cunjian Chen,Chung-Hsing Yeh,Qian Wang,Guoyan Zheng,Songtao Ai,Lichi Zhang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-236] HASN: Hybrid Attention Separable Network for Efficient Image Super-resolution

链接: https://arxiv.org/abs/2410.09844
作者: Weifeng Cao,Xiaoyan Lei,Jun Shi,Wanyong Liang,Jie Liu,Zongfei Bai
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by Visual Computer

[CV-237] EG-SpikeFormer: Eye-Gaze Guided Transformer on Spiking Neural Networks for Medical Image Analysis

链接: https://arxiv.org/abs/2410.09674
作者: Yi Pan,Hanqi Jiang,Junhao Chen,Yiwei Li,Huaqin Zhao,Yifan Zhou,Peng Shu,Zihao Wu,Zhengliang Liu,Dajiang Zhu,Xiang Li,Yohannes Abate,Tianming Liu
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

[CV-238] Unique MS Lesion Identification from MRI

链接: https://arxiv.org/abs/2410.09639
作者: Carlos A. Rivas,Jinwei Zhang,Shuwen Wei,Samuel W. Remedios,Aaron Carass,Jerry L. Prince
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures, submitted to SPIE medical imaging conference

[CV-239] Exploring Behavior-Relevant and Disentangled Neural Dynamics with Generative Diffusion Models

链接: https://arxiv.org/abs/2410.09614
作者: Yule Wang,Chengrui Li,Weihan Li,Anqi Wu
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

[CV-240] Diabetic retinopathy image classification method based on GreenBen data augmentation

链接: https://arxiv.org/abs/2410.09444
作者: Yutong Liu,Jie Gao,Haijiang Zhu
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

[CV-241] MOZART: Ensembling Approach for COVID-19 Detection using Chest X-Ray Imagery

链接: https://arxiv.org/abs/2410.09255
作者: Mohammed Shabo,Nazar Siddig
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This paper was originally intended to be published as part of my graduation project in Electrical and Electronics Engineering at the University of Khartoum in 2021. However, due to political and economic instability, and most recently, the outbreak of conflict in Sudan in April 2023, the publication process was significantly delayed. But yeah, better late than never

[CV-242] Fast Data-independent KLT Approximations Based on Integer Functions

链接: https://arxiv.org/abs/2410.09227
作者: A. P. Radünz,D. F. G. Coelho,F. M. Bayer,R. J. Cintra,A. Madanayake
关键词-EN:
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA); Methodology (stat.ME)
*备注: 19 pages, 10 figures, 7 tables

[CV-243] Artificial intelligence techniques in inherited retinal diseases: A review

链接: https://arxiv.org/abs/2410.09105
作者: Han Trinh,Jordan Vice,Jason Charng,Zahra Tajbakhsh,Khyber Alam,Fred K. Chen,Ajmal Mian
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

机器学习

[LG-0] TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

链接: https://arxiv.org/abs/2410.10818
作者: Mu Cai,Reuben Tan,Jianrui Zhang,Bocheng Zou,Kai Zhang,Feng Yao,Fangrui Zhu,Jing Gu,Yiwu Zhong,Yuzhang Shang,Yao Dou,Jaden Park,Jianfeng Gao,Yong Jae Lee,Jianwei Yang
关键词-EN: temporal understanding, temporal, fine-grained temporal, Understanding, video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Project Page: this https URL

Abstract:Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models’ temporal reasoning capabilities. Both dataset and evaluation code will be made available.

[LG-1] When Does Perceptual Alignment Benefit Vision Representations?

Link: https://arxiv.org/abs/2410.10817
Authors: Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
Keywords: including scene layout, subject location, scene layout, camera pose, diverse visual attributes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: S.S. and S.F. contributed equally. Website: this http URL

Abstract:Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability across diverse computer vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard vision benchmarks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is widely preserved on other tasks, including specialized out-of-distribution domains such as in medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.

[LG-2] LVD-2M: A Long-take Video Dataset with Temporally Dense Captions NEURIPS2024

Link: https://arxiv.org/abs/2410.10816
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
Keywords: long video generation, video generation models, video generation, generation models, generation models heavily
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: NeurIPS 2024 Dataset and Benchmark Track. Project page: this https URL . Code: this https URL

Abstract:The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.

[LG-3] Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

Link: https://arxiv.org/abs/2410.10814
Authors: Ziyue Li, Tianyi Zhou
Keywords: large language models, excel on generation, large language, decoder-only architecture, architecture often limits
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures

Abstract:While large language models (LLMs) excel on generation tasks, their decoder-only architecture often limits their potential as embedding models if no further representation finetuning is applied. Does this contradict their claim of generalists? To answer the question, we take a closer look at Mixture-of-Experts (MoE) LLMs. Our study shows that the expert routers in MoE LLMs can serve as an off-the-shelf embedding model with promising performance on a diverse class of embedding-focused tasks, without requiring any finetuning. Moreover, our extensive analysis shows that the MoE routing weights (RW) are complementary to the hidden state (HS) of LLMs, a widely-used embedding. Compared to HS, we find that RW is more robust to the choice of prompts and focuses on high-level semantics. Motivated by the analysis, we propose MoEE combining RW and HS, which achieves better performance than using either separately. Our exploration of their combination and prompting strategy sheds several novel insights, e.g., a weighted sum of RW and HS similarities outperforms the similarity on their concatenation. Our experiments are conducted on 6 embedding tasks with 20 datasets from the Massive Text Embedding Benchmark (MTEB). The results demonstrate the significant improvement brought by MoEE to LLM-based embedding without further finetuning.
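
The abstract's contrast between a weighted sum of RW and HS similarities and a similarity over their concatenation can be made concrete with a toy sketch. The function names, the weight `alpha`, and the example vectors below are illustrative assumptions, not the authors' code or settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=0.5):
    """Compare HS and RW separately, then combine the two scores.

    alpha is a hypothetical mixing weight chosen for illustration.
    """
    return cosine(hs_a, hs_b) + alpha * cosine(rw_a, rw_b)

def concat_similarity(hs_a, rw_a, hs_b, rw_b):
    """Concatenate HS and RW first, then take one cosine similarity."""
    return cosine(hs_a + rw_a, hs_b + rw_b)

# Toy inputs: identical hidden states, orthogonal routing weights.
hs_a, hs_b = [1.0, 0.0], [1.0, 0.0]
rw_a, rw_b = [0.0, 1.0], [1.0, 0.0]
print(weighted_sum_similarity(hs_a, rw_a, hs_b, rw_b))
print(concat_similarity(hs_a, rw_a, hs_b, rw_b))
```

In the concatenated form the two embeddings share one normalization, so whichever has the larger norm dominates the score; the weighted sum keeps each similarity on its own scale.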

[LG-4] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Link: https://arxiv.org/abs/2410.10812
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
Keywords: Hybrid Autoregressive Transformer, Autoregressive Transformer, introduce Hybrid Autoregressive, rivaling diffusion models, visual generation model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Demo: this https URL . The first two authors contributed equally to this work

Abstract:We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at this https URL.

[LG-5] Deep Linear Probe Generators for Weight Space Learning

Link: https://arxiv.org/abs/2410.10811
Authors: Jonathan Kahana, Eliahu Horwitz, Imri Shuval, Yedid Hoshen
Keywords: space learning aims, Weight space learning, neural network, generalization error, aims to extract
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Weight space learning aims to extract information about a neural network, such as its training dataset or generalization error. Recent approaches learn directly from model weights, but this presents many challenges as weights are high-dimensional and include permutation symmetries between neurons. An alternative approach, Probing, represents a model by passing a set of learned inputs (probes) through the model, and training a predictor on top of the corresponding outputs. Although probing is typically not used as a standalone approach, our preliminary experiment found that a vanilla probing baseline worked surprisingly well. However, we discover that current probe learning strategies are ineffective. We therefore propose Deep Linear Probe Generators (ProbeGen), a simple and effective modification to probing approaches. ProbeGen adds a shared generator module with a deep linear architecture, providing an inductive bias towards structured probes thus reducing overfitting. While simple, ProbeGen performs significantly better than the state-of-the-art and is very efficient, requiring between 30 and 1000 times fewer FLOPs than other top approaches.
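
To illustrate what a "deep linear" generator means, here is a minimal NumPy sketch: probes are produced by a stack of linear maps with no activations in between. The layer sizes, latent codes, and function name are hypothetical; the paper's actual architecture and training procedure are not reproduced here.

```python
import numpy as np

def deep_linear_generator(latents, weights):
    """Map latent codes to probes through stacked linear layers.

    No nonlinearity is applied between layers, which is what makes
    the generator "deep linear". All shapes are illustrative.
    """
    out = latents
    for W in weights:
        out = out @ W
    return out

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8))      # 4 probes, hypothetical latent size 8
weights = [
    rng.normal(size=(8, 16)),
    rng.normal(size=(16, 16)),
    rng.normal(size=(16, 32)),         # hypothetical model input size 32
]
probes = deep_linear_generator(latents, weights)
print(probes.shape)
```

Functionally the stack equals a single matrix product, so the depth adds no expressive power; any benefit would come from how the factorized parameterization shapes optimization, consistent with the abstract's claim of an inductive bias.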

[LG-6] Hard-Constrained Neural Networks with Universal Approximation Guarantees

Link: https://arxiv.org/abs/2410.10807
Authors: Youngjae Min, Anoopkumar Sonar, Navid Azizan
Keywords: Incorporating prior knowledge, gained significant attention, Incorporating prior, significant attention, prior knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Comments:

Abstract:Incorporating prior knowledge or specifications of input-output relationships into machine learning models has gained significant attention, as it enhances generalization from limited data and leads to conforming outputs. However, most existing approaches use soft constraints by penalizing violations through regularization, which offers no guarantee of constraint satisfaction – an essential requirement in safety-critical applications. On the other hand, imposing hard constraints on neural networks may hinder their representational power, adversely affecting performance. To address this, we propose HardNet, a practical framework for constructing neural networks that inherently satisfy hard constraints without sacrificing model capacity. Specifically, we encode affine and convex hard constraints, dependent on both inputs and outputs, by appending a differentiable projection layer to the network’s output. This architecture allows unconstrained optimization of the network parameters using standard algorithms while ensuring constraint satisfaction by construction. Furthermore, we show that HardNet retains the universal approximation capabilities of neural networks. We demonstrate the versatility and effectiveness of HardNet across various applications: fitting functions under constraints, learning optimization solvers, optimizing control policies in safety-critical systems, and learning safe decision logic for aircraft systems.
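
As a rough illustration of enforcing a hard constraint by projection, the sketch below applies the standard Euclidean projection onto a single half-space `a·y <= b` to a raw network output. It is a simplified stand-in under stated assumptions, not HardNet's actual layer, which the abstract says handles affine and convex constraints that may depend on the input.

```python
def project_onto_halfspace(y, a, b):
    """Return the closest point to y that satisfies a . y <= b.

    y: raw (unconstrained) output vector
    a, b: constraint coefficients; in an input-dependent setting they
          would be computed from the input x (hypothetical usage).
    """
    violation = sum(ai * yi for ai, yi in zip(a, y)) - b
    if violation <= 0:
        return list(y)  # already feasible, leave unchanged
    scale = violation / sum(ai * ai for ai in a)
    return [yi - scale * ai for yi, ai in zip(y, a)]

raw_output = [3.0, 4.0]
print(project_onto_halfspace(raw_output, a=[1.0, 1.0], b=5.0))
```

Because the map is piecewise linear in `y`, gradients can flow through it, so the network parameters can still be trained with ordinary unconstrained optimizers while every output satisfies the constraint.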

[LG-7] TL-PCA: Transfer Learning of Principal Component Analysis

Link: https://arxiv.org/abs/2410.10805
Authors: Sharon Hendy, Yehuda Dar
Keywords: Principal component analysis, target data, component analysis, PCA, data
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Principal component analysis (PCA) can be significantly limited when there are too few examples of the target data of interest. We propose a transfer learning approach to PCA (TL-PCA) where knowledge from a related source task is used in addition to the scarce data of a target task. Our TL-PCA has two versions, one that uses a pretrained PCA solution of the source task, and another that uses the source data. Our proposed approach extends the PCA optimization objective with a penalty on the proximity of the target subspace and the source subspace as given by the pretrained source model or the source data. This optimization is solved by eigendecomposition for which the number of data-dependent eigenvectors (i.e., principal directions of TL-PCA) is not limited to the number of target data examples, which is a root cause that limits the standard PCA performance. Accordingly, our results for image datasets show that the representation of test data is improved by TL-PCA for dimensionality reduction where the learned subspace dimension is lower or higher than the number of target data examples.
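
One plausible reading of the "pretrained source model" variant is sketched below: take the target covariance, add a term proportional to the projector onto the source subspace, and keep the top eigenvectors of the sum. This objective form, the weight `alpha`, and the function name are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def tl_pca_sketch(X_target, U_source, k, alpha=1.0):
    """Top-k eigenvectors of (target covariance + alpha * source projector).

    X_target: (n, d) target examples
    U_source: (d, m) orthonormal basis of a pretrained source subspace
    alpha:    hypothetical strength of the pull toward the source subspace
    """
    Xc = X_target - X_target.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / Xc.shape[0]
    M = cov + alpha * (U_source @ U_source.T)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]         # columns for the k largest

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))                     # only 3 target examples
Q, _ = np.linalg.qr(rng.normal(size=(10, 5)))    # stand-in source basis
U = tl_pca_sketch(X, Q, k=5)
print(U.shape)
```

With three centered examples the target covariance has rank at most two, so plain PCA offers at most two data-dependent directions; the added source term lets the eigendecomposition return more, which matches the limitation the abstract describes.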

[LG-8] TrajDiffuse: A Conditional Diffusion Model for Environment-Aware Trajectory Prediction ICPR

Link: https://arxiv.org/abs/2410.10804
Authors: Qingze (Tony) Liu, Danrui Li, Samuel S. Sohn, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic
Keywords: Accurate prediction, human or vehicle, vehicle trajectories, trajectories with good, captures their stochastic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted for publication in the proceedings of the 2024 International Conference on Pattern Recognition (ICPR)

Abstract:Accurate prediction of human or vehicle trajectories with good diversity that captures their stochastic nature is an essential task for many applications. However, many trajectory prediction models produce unreasonable trajectory samples that focus on improving diversity or accuracy while neglecting other key requirements, such as collision avoidance with the surrounding environment. In this work, we propose TrajDiffuse, a planning-based trajectory prediction method using a novel guided conditional diffusion model. We formulate the trajectory prediction problem as a denoising inpainting task and design a map-based guidance term for the diffusion process. TrajDiffuse is able to generate trajectory predictions that match or exceed the accuracy and diversity of the SOTA, while adhering almost perfectly to environmental constraints. We demonstrate the utility of our model through experiments on the nuScenes and PFSD datasets and provide an extensive benchmark analysis against the SOTA methods.

[LG-9] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Link: https://arxiv.org/abs/2410.10803
Authors: Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, Jiajun Wu
Keywords: goal for roboticists, Humanoid robots capable, Diffusion Policy, autonomous operation, Humanoid robots
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Project website: this https URL

Abstract:Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point-cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab. Videos are available at: this https URL

[LG-10] Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

Link: https://arxiv.org/abs/2410.10801
Authors: Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Keywords: Large Language Models, Large Language, variety of applications, adopted and deployed, deployed worldwide
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Abstract:Large Language Models (LLMs) have been adopted and deployed worldwide for a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms prevalent in Western-centric datasets, and safety protocols frequently fail to extend to multilingual settings. In this work, we explore model merging in a diverse multi-task setting, combining safety and general-purpose tasks within a multilingual context. Each language introduces unique and varied learning challenges across tasks. We find that objective-based merging is more effective than mixing data, with improvements of up to 8% and 10% in general performance and safety respectively. We also find that language-based merging is highly effective – by merging monolingually fine-tuned models, we achieve a 4% increase in general performance and 7% reduction in harm across all languages on top of the data mixtures method using the same available data. Overall, our comprehensive study of merging approaches provides a useful framework for building strong and safe multilingual models.

[LG-11] Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance

Link: https://arxiv.org/abs/2410.10796
Authors: Sachin Goyal, Christina Baek, J. Zico Kolter, Aditi Raghunathan
Keywords: Large language models, Large language, follow user instructions, input context, instruction-finetuned to enhance
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments: Under Review

Abstract:Large language models are instruction-finetuned to enhance their ability to follow user instructions and process the input context. However, even state-of-the-art models often struggle to follow the instruction, especially when the input context is not aligned with the model’s parametric knowledge. This manifests as various failures, such as hallucinations where the responses are outdated, biased or contain unverified facts. In this work, we try to understand the underlying reason for this poor context reliance, especially after instruction tuning. We observe an intriguing phenomenon: during instruction tuning, the context reliance initially increases as expected, but then gradually decreases as instruction finetuning progresses. We call this phenomenon context-parametric inversion and observe it across multiple general purpose instruction tuning datasets like TULU, Alpaca and Ultrachat, as well as model families such as Llama, Mistral and Pythia. In a simple theoretical setup, we isolate why context-parametric inversion occurs along the gradient descent trajectory of instruction finetuning. We tie this phenomenon to examples in the instruction finetuning data mixture where the input context provides information that is already present in the model’s parametric knowledge. Our analysis suggests natural mitigation strategies that provide some limited gains, while also validating our theoretical insights. We hope that our work serves as a starting point in addressing this failure mode in a staple part of LLM training.

[LG-12] Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Link: https://arxiv.org/abs/2410.10792
Authors: Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
Keywords: transform random noise, models transform random, Generative models transform, transform images back, transform random
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*Comments: Preprint

Abstract:Generative models transform random noise into images; their inversion aims to transform images back to structured noise for recovery and editing. This paper addresses two key tasks: (i) inversion and (ii) editing of a real image using stochastic equivalents of rectified flow models (such as Flux). Although Diffusion Models (DMs) have recently dominated the field of generative modeling for images, their inversion presents faithfulness and editability challenges due to nonlinearities in drift and diffusion. Existing state-of-the-art DM inversion approaches rely on training of additional parameters or test-time optimization of latent variables; both are expensive in practice. Rectified Flows (RFs) offer a promising alternative to diffusion models, yet their inversion has been underexplored. We propose RF inversion using dynamic optimal control derived via a linear quadratic regulator. We prove that the resulting vector field is equivalent to a rectified stochastic differential equation. Additionally, we extend our framework to design a stochastic sampler for Flux. Our inversion method allows for state-of-the-art performance in zero-shot inversion and editing, outperforming prior works in stroke-to-image synthesis and semantic image editing, with large-scale human evaluations confirming user preference.

[LG-13] On Information-Theoretic Measures of Predictive Uncertainty

Link: https://arxiv.org/abs/2410.10786
Authors: Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, Sepp Hochreiter
Keywords: machine learning applications, predictive uncertainty, predictive uncertainty measures, Reliable estimation, learning applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Comments:

Abstract:Reliable estimation of predictive uncertainty is crucial for machine learning applications, particularly in high-stakes scenarios where hedging against risks is essential. Despite its significance, a consensus on the correct measurement of predictive uncertainty remains elusive. In this work, we return to first principles to develop a fundamental framework of information-theoretic predictive uncertainty measures. Our proposed framework categorizes predictive uncertainty measures according to two factors: (I) The predicting model (II) The approximation of the true predictive distribution. Examining all possible combinations of these two factors, we derive a set of predictive uncertainty measures that includes both known and newly introduced ones. We empirically evaluate these measures in typical uncertainty estimation settings, such as misclassification detection, selective prediction, and out-of-distribution detection. The results show that no single measure is universal, but the effectiveness depends on the specific setting. Thus, our work provides clarity about the suitability of predictive uncertainty measures by clarifying their implicit assumptions and relationships.
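
For orientation, the sketch below computes one widely used information-theoretic decomposition for an ensemble: entropy of the averaged prediction (total), average entropy of the members (expected), and their difference (mutual information). It is shown as an example of a known measure of the kind the abstract mentions, not as the paper's framework or recommendation; the ensemble values are made up.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decompose_uncertainty(member_probs):
    """Return (total, expected, mutual_information) for an ensemble.

    member_probs: list of per-member class-probability lists.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean_p)
    expected = sum(entropy(m) for m in member_probs) / n
    return total, expected, total - expected

# Two confident members that disagree: all uncertainty is disagreement.
print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
# Two members that agree on a coin flip: no disagreement at all.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```

The two toy cases have the same total entropy but attribute it differently, which is the kind of distinction that makes the choice of measure depend on the task.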

[LG-14] When Attention Sink Emerges in Language Models: An Empirical View

Link: https://arxiv.org/abs/2410.10781
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Keywords: Language Models, assign significant attention, attention, attention sink, assign significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens’ inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at this https URL.
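
The role of softmax normalization can be seen in a toy comparison: softmax weights always sum to one, so each token's weight depends on every other score, while element-wise sigmoid weights are independent. This is a simplified illustration with made-up scores, not the paper's experimental setup.

```python
import math

def softmax_weights(scores):
    """Normalized attention weights; they always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid_weights(scores):
    """Unnormalized attention weights; each depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# A query for which no key is relevant: all scores strongly negative.
scores = [-10.0, -10.0, -10.0]
print(sum(softmax_weights(scores)))   # still a full unit of attention
print(sum(sigmoid_weights(scores)))   # close to zero
```

Under softmax the query must place a full unit of attention somewhere even when nothing is relevant, which is one intuition for why a designated sink position could absorb it; without normalization that pressure is absent.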

[LG-15] Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.10773
Authors: Etai Littwin, Vimal Thilak, Anand Gopalakrishnan
Keywords: Image-based Joint-Embedding Predictive, Joint-Embedding Predictive Architecture, Image Modeling framework, Masked Image Modeling, Modeling framework
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: NeurIPS 2024 Workshop on Self-Supervised Learning - Theory and Practice. Comments welcome!

Abstract:Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE) for representation learning using the Masked Image Modeling framework. IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. However, IJEPA relies on carefully designed context and target windows to avoid representational collapse. The encoder modules in IJEPA cannot adaptively modulate the type of predicted and/or target features based on the feasibility of the masked prediction task as they are not given sufficient information of both context and targets. Based on the intuition that in natural images, information has a strong spatial bias with spatially local regions being highly predictive of one another compared to distant ones, we condition the target encoder and context encoder modules in IJEPA with positions of context and target windows respectively. Our “conditional” encoders show performance gains on several image classification benchmark datasets, improved robustness to context window size and sample-efficiency during pretraining.

[LG-16] AFlow: Automating Agentic Workflow Generation

Link: https://arxiv.org/abs/2410.10762
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Keywords: Large language models, demonstrated remarkable potential, follow detailed instructions, Large language, employing agentic workflows
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at this https URL.

[LG-17] SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Link: https://arxiv.org/abs/2410.10759
Authors: Akrit Mudvari, Yuang Jiang, Leandros Tassiulas
Keywords: Large language models, generate human-like text, daily lives due, Large language, LLM inference
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments:

Abstract:Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language understanding, information retrieval and search, translation, chatbots, virtual assistance, and many more. However, it is well known that LLMs are massive in terms of the number of parameters. Additionally, the self-attention mechanism in the underlying architecture of LLMs, Transformers, has quadratic complexity in terms of both computation and memory with respect to the input sequence length. For these reasons, LLM inference is resource-intensive, and thus, the throughput of LLM inference is limited, especially for the longer sequences. In this report, we design a collaborative inference architecture between a server and its clients to alleviate the throughput limit. In this design, we consider the available resources on both sides, i.e., the computation and communication costs. We develop a dynamic programming-based algorithm to optimally allocate computation between the server and the client device to increase the server throughput, while not violating the service level agreement (SLA). We show in the experiments that we are able to efficiently distribute the workload allowing for roughly 1/3 reduction in the server workload, while achieving 19 percent improvement over a greedy method. As a result, we are able to demonstrate that, in an environment with different types of LLM inference requests, the throughput of the server is improved.

[LG-18] Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

Link: https://arxiv.org/abs/2410.10744
Authors: Hossein Mirzaei, Mackenzie W. Mathis
Keywords: critical real-world applications, OOD, compromising their reliability, real-world applications, significant advancements
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*Comments: Code and pre-trained models are available at this https URL

Abstract:Despite significant advancements in out-of-distribution (OOD) detection, existing methods still struggle to maintain robustness against adversarial attacks, compromising their reliability in critical real-world applications. Previous studies have attempted to address this challenge by exposing detectors to auxiliary OOD datasets alongside adversarial training. However, the increased data complexity inherent in adversarial training, and the myriad of ways that OOD samples can arise during testing, often prevent these approaches from establishing robust decision boundaries. To address these limitations, we propose AROS, a novel approach leveraging neural ordinary differential equations (NODEs) with Lyapunov stability theorem in order to obtain robust embeddings for OOD detection. By incorporating a tailored loss function, we apply Lyapunov stability theory to ensure that both in-distribution (ID) and OOD data converge to stable equilibrium points within the dynamical system. This approach encourages any perturbed input to return to its stable equilibrium, thereby enhancing the model’s robustness against adversarial perturbations. To not use additional data, we generate fake OOD embeddings by sampling from low-likelihood regions of the ID data feature space, approximating the boundaries where OOD data are likely to reside. To then further enhance robustness, we propose the use of an orthogonal binary layer following the stable feature space, which maximizes the separation between the equilibrium points of ID and OOD samples. We validate our method through extensive experiments across several benchmarks, demonstrating superior performance, particularly under adversarial attacks. Notably, our approach improves robust detection performance from 37.8% to 80.1% on CIFAR-10 vs. CIFAR-100 and from 29.0% to 67.0% on CIFAR-100 vs. CIFAR-10.

[LG-19] SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing

Link: https://arxiv.org/abs/2410.10741
Authors: Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava
Keywords: Large Language Models, Effective processing, critical component, component of cyber-physical, Effective
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Abstract:Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.

[LG-20] Online Statistical Inference for Time-varying Sample-averaged Q-learning

Link: https://arxiv.org/abs/2410.10737
Authors: Saunak Kumar Panda, Ruiqi Liu, Yisha Xiang
Keywords: Reinforcement learning, key approach, approach for training, training agents, agents in complex
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:Reinforcement learning (RL) has emerged as a key approach for training agents in complex and uncertain environments. Incorporating statistical inference in RL algorithms is essential for understanding and managing uncertainty in model performance. This paper introduces a time-varying batch-averaged Q-learning algorithm, termed sample-averaged Q-learning, which improves upon traditional single-sample Q-learning by aggregating samples of rewards and next states to better account for data variability and uncertainty. We leverage the functional central limit theorem (FCLT) to establish a novel framework that provides insights into the asymptotic normality of the sample-averaged algorithm under mild conditions. Additionally, we develop a random scaling method for interval estimation, enabling the construction of confidence intervals without requiring extra hyperparameters. Numerical experiments conducted on classic OpenAI Gym environments show that the time-varying sample-averaged Q-learning method consistently outperforms both single-sample and constant-batch Q-learning methods, achieving superior accuracy while maintaining comparable learning speeds.
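
The batch-averaged update described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tabular `q` layout, the `sampler` callback, and the square-root schedule in `batch_size_at` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def batch_size_at(step):
    """Hypothetical time-varying batch schedule (an assumption): grows like sqrt(step)."""
    return max(1, math.ceil(math.sqrt(step)))

def sample_averaged_update(q, state, action, sampler, batch_size, step_size, gamma=0.9):
    """One Q update whose target is averaged over `batch_size` sampled transitions.

    q       : dict mapping state -> list of action values (hypothetical layout)
    sampler : callable (state, action) -> (reward, next_state), a stand-in
              for a simulator or generative model
    """
    targets = []
    for _ in range(batch_size):
        reward, next_state = sampler(state, action)
        targets.append(reward + gamma * max(q[next_state]))
    avg_target = sum(targets) / batch_size
    # Single-sample Q-learning is the special case batch_size == 1.
    q[state][action] += step_size * (avg_target - q[state][action])
    return q[state][action]
```

Averaging several sampled targets before the update reduces the variance of each step, which is the intuition the abstract gives for aggregating rewards and next states.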

[LG-21] Towards Calibrated Losses for Adversarial Robust Reject Option Classification ACML

Link: https://arxiv.org/abs/2410.10736
Authors: Vrund Shah, Tejas Chaudhari, Naresh Manwani
Keywords: medical diagnosis, Robust Reject Option, autonomous driving, Adversarial Robust Reject, vital property
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at Asian Conference on Machine Learning (ACML), 2024

Abstract:Robustness towards adversarial attacks is a vital property for classifiers in several applications such as autonomous driving, medical diagnosis, etc. Also, in such scenarios, where the cost of misclassification is very high, knowing when to abstain from prediction becomes crucial. A natural question is which surrogates can be used to ensure learning in scenarios where the input points are adversarially perturbed and the classifier can abstain from prediction. This paper aims to characterize and design surrogates calibrated in the “Adversarial Robust Reject Option” setting. First, we propose an adversarial robust reject option loss $\ell_d^\gamma$ and analyze it for the hypothesis set of linear classifiers ($\mathcal{H}_{\textrm{lin}}$). Next, we provide a complete characterization result for any surrogate to be $(\ell_d^\gamma, \mathcal{H}_{\textrm{lin}})$-calibrated. To demonstrate the difficulty in designing surrogates for $\ell_d^\gamma$, we show negative calibration results for convex surrogates and quasi-concave conditional risk cases (these gave positive calibration in the adversarial setting without reject option). We also empirically argue that Shifted Double Ramp Loss (DRL) and Shifted Double Sigmoid Loss (DSL) satisfy the calibration conditions. Finally, we demonstrate the robustness of shifted DRL and shifted DSL against adversarial perturbations on a synthetically generated dataset.

[LG-22] Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection

Link: https://arxiv.org/abs/2410.10728
Authors: Giorgos Iacovides, Wuyang Zhou, Danilo Mandic
Keywords: leverages large language, higher-order data analysis, rank selection, guide the rank, large language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel framework that leverages large language models (LLMs) to guide the rank selection in tensor network models for higher-order data analysis. By utilising the intrinsic reasoning capabilities and domain knowledge of LLMs, our approach offers enhanced interpretability of the rank choices and can effectively optimise the objective function. This framework enables users without specialised domain expertise to utilise tensor network decompositions and understand the underlying rationale within the rank selection process. Experimental results validate our method on financial higher-order datasets, demonstrating interpretable reasoning, strong generalisation to unseen test data, and its potential for self-enhancement over successive iterations. This work is placed at the intersection of large language models and higher-order data analysis.

[LG-23] SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Link: https://arxiv.org/abs/2410.10714
Authors: Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi
Keywords: Large Language Models, natural language processing, transformed natural language, high runtime cost, Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art compression methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, which is particularly challenging to compress, show that SeedLM achieves significantly better zero-shot accuracy retention at 4- and 3-bit than state-of-the-art techniques, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases to 70B, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
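
The seed-based reconstruction described in this abstract can be sketched in a heavily simplified form. The assumptions here are not taken from the paper: a 16-bit Fibonacci LFSR with the textbook taps (16, 14, 13, 11), a single pseudo-random basis vector per block instead of a matrix, one unquantized scalar coefficient, and a brute-force search over a caller-supplied list of candidate seeds.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr_vector(seed, length):
    """Expand a nonzero 16-bit seed into `length` pseudo-random values in (-1, 1)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be nonzero")
    values = []
    for _ in range(length):
        state = lfsr16_step(state)
        values.append(state / 32768.0 - 1.0)
    return values

def fit_block(weights, candidate_seeds):
    """Return (seed, coeff, error) minimising the squared reconstruction error."""
    best = None
    for seed in candidate_seeds:
        u = lfsr_vector(seed, len(weights))
        uu = sum(x * x for x in u)
        if uu == 0.0:
            continue  # degenerate basis vector, cannot fit a coefficient
        coeff = sum(x * w for x, w in zip(u, weights)) / uu  # 1-D least squares
        err = sum((w - coeff * x) ** 2 for x, w in zip(u, weights))
        if best is None or err < best[2]:
            best = (seed, coeff, err)
    return best

def reconstruct(seed, coeff, length):
    """Regenerate the block from the stored seed and coefficient only."""
    return [coeff * x for x in lfsr_vector(seed, length)]
```

Only the seed and the coefficient need to be stored for each block; the pseudo-random values are regenerated at inference time, which is the compute-for-memory trade the abstract describes.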

[LG-24] Early Diagnoses of Acute Lymphoblastic Leukemia Using YOLOv8 and YOLOv11 Deep Learning Models

Link: https://arxiv.org/abs/2410.10701
Authors: Alaa Awad, Mohamed Hegazy, Salah A. Aly
Keywords: individuals succumb annually, Acute Lymphoblastic Leukemia, Thousands of individuals, detecting Acute Lymphoblastic, individuals succumb
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 6 figures, 3 tables

Abstract:Thousands of individuals succumb annually to leukemia alone. This study explores the application of image processing and deep learning techniques for detecting Acute Lymphoblastic Leukemia (ALL), a severe form of blood cancer responsible for numerous annual fatalities. As artificial intelligence technologies advance, the research investigates the reliability of these methods in real-world scenarios. The study focuses on recent developments in ALL detection, particularly using the latest YOLO series models, to distinguish between malignant and benign white blood cells and to identify different stages of ALL, including early stages. Additionally, the models are capable of detecting hematogones, which are often misclassified as ALL. By utilizing advanced deep learning models like YOLOv8 and YOLOv11, the study achieves high accuracy rates reaching 98.8%, demonstrating the effectiveness of these algorithms across multiple datasets and various real-world situations.

[LG-25] Dynamical loss functions shape landscape topography and improve learning in artificial neural networks

Link: https://arxiv.org/abs/2410.10690
Authors: Eduardo Lavin, Miguel Ruiz-Garcia
Keywords: supervised classification tasks, class periodically increases, Dynamical loss functions, standard loss functions, loss functions
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dynamical loss functions are derived from standard loss functions used in supervised classification tasks, but they are modified such that the contribution from each class periodically increases and decreases. These oscillations globally alter the loss landscape without affecting the global minima. In this paper, we demonstrate how to transform cross-entropy and mean squared error into dynamical loss functions. We begin by discussing the impact of increasing the size of the neural network or the learning rate on the learning process. Building on this intuition, we propose several versions of dynamical loss functions and show how they significantly improve validation accuracy for networks of varying sizes. Finally, we explore how the landscape of these dynamical loss functions evolves during training, highlighting the emergence of instabilities that may be linked to edge-of-instability minimization.
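
To make the idea of a periodically re-weighted loss concrete, here is a minimal sketch of a class-weighted cross-entropy whose weights oscillate with the training step. The specific oscillation (a clipped sine with one phase offset per class) and the `period` and `amplitude` parameters are assumptions for illustration; the abstract does not give the authors' exact functional form.

```python
import math

def class_weights(step, num_classes, period=100, amplitude=4.0):
    """Hypothetical oscillating weights: each class peaks at a different phase."""
    weights = []
    for c in range(num_classes):
        phase = 2.0 * math.pi * (step / period - c / num_classes)
        # Weights stay >= 1, so no class ever drops out of the loss.
        weights.append(1.0 + amplitude * max(0.0, math.sin(phase)))
    return weights

def dynamical_cross_entropy(probs, label, step, period=100, amplitude=4.0):
    """Cross-entropy for one sample, scaled by the current weight of its class."""
    w = class_weights(step, len(probs), period, amplitude)
    return -w[label] * math.log(probs[label])
```

Because every weight is strictly positive, a prediction that assigns probability 1 to the correct class still has zero loss at every step, which matches the abstract's remark that the oscillations do not move the global minima.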

[LG-26] SAMPa: Sharpness-aware Minimization Parallelized NEURIPS

Link: https://arxiv.org/abs/2410.10683
Authors: Wanyun Xie, Thomas Pethick, Volkan Cevher
Keywords: Sharpness-aware minimization, neural networks, SAM, shown to improve, improve the generalization
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Advances in Neural Information Processing Systems (NeurIPS), 2024

Abstract:Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks. However, each SAM update requires sequentially computing two gradients, effectively doubling the per-iteration cost compared to base optimizers like SGD. We propose a simple modification of SAM, termed SAMPa, which allows us to fully parallelize the two gradient computations. SAMPa achieves a twofold speedup of SAM under the assumption that communication costs between devices are negligible. Empirical results show that SAMPa ranks among the most efficient variants of SAM in terms of computational time. Additionally, our method consistently outperforms SAM across both vision and language tasks. Notably, SAMPa theoretically maintains convergence guarantees even for fixed perturbation sizes, which is established through a novel Lyapunov function. We in fact arrive at SAMPa by treating this convergence guarantee as a hard requirement – an approach we believe is promising for developing SAM-based methods in general. Our code is available at this https URL.
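
The sequential dependency mentioned in this abstract is easiest to see in a sketch of the baseline SAM step. This is standard SAM only, not SAMPa, whose parallel update rule is not given in the abstract. The names `grad_fn`, `lr`, and `rho` are illustrative.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One baseline SAM update on a list of floats `w`.

    grad_fn : callable returning the loss gradient at a given point
    rho     : radius of the perturbation (illustrative default)
    """
    g = grad_fn(w)                                   # first gradient, at w
    norm = math.sqrt(sum(x * x for x in g)) or 1.0   # guard against a zero gradient
    # Perturb the weights along the normalised gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)                           # second gradient, needs w_adv first
    # Descend from the original weights using the perturbed-point gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

The second call to `grad_fn` cannot start until the first has finished, because `w_adv` depends on `g`. That dependency is what doubles the per-iteration cost relative to SGD.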

[LG-27] Combinatorial Multi-armed Bandits: Arm Selection via Group Testing

Link: https://arxiv.org/abs/2410.10679
Authors: Arpan Mukherjee, Shashanka Ubaru, Keerthiram Murugesan, Karthikeyan Shanmugam, Ali Tajer
Keywords: combinatorial multi-armed bandits, combinatorial multi-armed, multi-armed bandits, bandits with semi-bandit, semi-bandit feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 26 pages

Abstract:This paper considers the problem of combinatorial multi-armed bandits with semi-bandit feedback and a cardinality constraint on the super-arm size. Existing algorithms for solving this problem typically involve two key sub-routines: (1) a parameter estimation routine that sequentially estimates a set of base-arm parameters, and (2) a super-arm selection policy for selecting a subset of base arms deemed optimal based on these parameters. State-of-the-art algorithms assume access to an exact oracle for super-arm selection with unbounded computational power. At each instance, this oracle evaluates a list of score functions whose number grows at least linearly and at most exponentially with the number of arms. This can be prohibitive in the regime of a large number of arms. This paper introduces a novel realistic alternative to the perfect oracle. This algorithm uses a combination of group-testing for selecting the super arms and quantized Thompson sampling for parameter estimation. Under a general separability assumption on the reward function, the proposed algorithm reduces the complexity of the super-arm-selection oracle to be logarithmic in the number of base arms while achieving the same regret order as the state-of-the-art algorithms that use exact oracles. This translates to at least an exponential reduction in complexity compared to the oracle-based approaches.

[LG-28] Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach

Link: https://arxiv.org/abs/2410.10674
Authors: Rory Young, Nicolas Pugeault
Keywords: learning agents achieve, Deep reinforcement learning, simulated control tasks, agents achieve, reinforcement learning agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep reinforcement learning agents achieve state-of-the-art performance in a wide range of simulated control tasks. However, successful applications to real-world problems remain limited. One reason for this dichotomy is that the learned policies are not robust to observation noise or adversarial attacks. In this paper, we investigate the robustness of deep RL policies to a single small state perturbation in deterministic continuous control tasks. We demonstrate that RL policies can be deterministically chaotic as small perturbations to the system state have a large impact on subsequent state and reward trajectories. This unstable non-linear behaviour has two consequences: First, inaccuracies in sensor readings, or adversarial attacks, can cause significant performance degradation; Second, even policies that show robust performance in terms of rewards may have unpredictable behaviour in practice. These two facets of chaos in RL policies drastically restrict the application of deep RL to real-world problems. To address this issue, we propose an improvement on the successful Dreamer V3 architecture, implementing a Maximal Lyapunov Exponent regularisation. This new approach reduces the chaotic state dynamics, rendering the learnt policies more resilient to sensor noise or adversarial attacks and thereby improving the suitability of Deep Reinforcement Learning for real-world applications.
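
For readers unfamiliar with the quantity being regularised, the following sketch estimates the maximal Lyapunov exponent of a one-dimensional system. It uses the logistic map as a stand-in, an assumption made purely for illustration; it is unrelated to the Dreamer V3 dynamics or to the regulariser proposed in the paper. A positive exponent means nearby trajectories separate exponentially (chaos); a negative one means they converge.

```python
import math

def logistic_map(x, r=4.0):
    """Toy dynamical system x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)

def max_lyapunov_exponent(x0, steps=10000, burn_in=100, r=4.0):
    """Average log of the local stretching factor |f'(x)| along one trajectory."""
    x = x0
    for _ in range(burn_in):          # discard the initial transient
        x = logistic_map(x, r)
    total, count = 0.0, 0
    for _ in range(steps):
        slope = abs(r * (1.0 - 2.0 * x))   # derivative of the logistic map at x
        if slope > 0.0:                    # skip the measure-zero point x = 0.5
            total += math.log(slope)
            count += 1
        x = logistic_map(x, r)
    return total / count
```

For `r=4.0` the map is chaotic and the estimate should be close to the known value ln 2, while for `r=2.5` the trajectory settles on a stable fixed point and the estimate is negative.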

[LG-29] Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

Link: https://arxiv.org/abs/2410.10665
Authors: Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Keywords: Artificial Intelligence, World Development Indicators, holds the potential, information gaps, developing nations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: Project GitHub repository at this https URL

Abstract:Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI’s GPT models via APIs because of how the system processes the input – tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a “double jeopardy” of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.

[LG-30] Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Link: https://arxiv.org/abs/2410.10663
Authors: Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang
Keywords: few-shot learning focus, few-shot learning, Cross-modal Few-Shot Learning, existing studies, learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.

[LG-31] Transforming Game Play: A Comparative Study of DCQN and DTQN Architectures in Reinforcement Learning

Link: https://arxiv.org/abs/2410.10660
Authors: William A. Stigall
Keywords: Convolutional Neural Networks, utilizing Convolutional Neural, Deep Q-Networks utilizing, Q-Networks utilizing Convolutional, Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: KSU C-Day Spring 2024

Abstract:In this study, we investigate the performance of Deep Q-Networks utilizing Convolutional Neural Networks (CNNs) and Transformer architectures across three different Atari games. The advent of DQNs has significantly advanced Reinforcement Learning, enabling agents to directly learn optimal policies from high-dimensional sensory inputs from pixel or RAM data. While CNN-based DQNs have been extensively studied and deployed in various domains, Transformer-based DQNs are relatively unexplored. Our research aims to fill this gap by benchmarking the performance of both DCQNs and DTQNs across the Atari games Asteroids, Space Invaders, and Centipede. We find that in the 35-40 million parameter range, the DCQN outperforms the DTQN in speed across both ViT and Projection Architectures. We also find the DCQN outperforms the DTQN in all games except for Centipede.

[LG-32] Navigation under uncertainty: Trajectory prediction and occlusion reasoning with switching dynamical systems

Link: https://arxiv.org/abs/2410.10653
Authors: Ran Wei, Joseph Lee, Shohei Wakayama, Alexander Tschantz, Conor Heins, Christopher Buckley, John Carenbauer, Hari Thiruvengada, Mahault Albarracin, Miguel de Prado, Petter Horling, Peter Winzell, Renjith Rajagopal
Keywords: Predicting future trajectories, safe robot navigation, Predicting future, robot navigation, crucial task
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Predicting future trajectories of nearby objects, especially under occlusion, is a crucial task in autonomous driving and safe robot navigation. Prior works typically neglect to maintain uncertainty about occluded objects and only predict trajectories of observed objects using high-capacity models such as Transformers trained on large datasets. While these approaches are effective in standard scenarios, they can struggle to generalize to the long-tail, safety-critical scenarios. In this work, we explore a conceptual framework unifying trajectory prediction and occlusion reasoning under the same class of structured probabilistic generative model, namely, switching dynamical systems. We then present some initial experiments illustrating its capabilities using the Waymo open dataset.

[LG-33] A Simple Baseline for Predicting Events with Auto-Regressive Tabular Transformers

Link: https://arxiv.org/abs/2410.10648
Authors: Alex Stein, Samuel Sharpe, Doron Bergman, Senthil Kumar, Bayan Bruss, John Dickerson, Tom Goldstein, Micah Goldblum
Keywords: credit card transaction, tabular data involve, retail platform, real-world applications, applications of tabular
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
Comments: 10 pages, 6 pages of references+appendix

Abstract:Many real-world applications of tabular data involve using historic events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. Existing approaches to event prediction include costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance. Moreover, these approaches often assume specific use-cases, for example that we know the labels of all historic events or that we only predict a pre-specified label and not the data’s features themselves. In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective. Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases. We demonstrate that the same model can predict labels, impute missing values, or model event sequences.

[LG-34] DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Link: https://arxiv.org/abs/2410.10646
Authors: James R. Han, Hugues Thomas, Jian Zhang, Nicholas Rhinehart, Timothy D. Barfoot
Keywords: people exhibiting complex, complex motion patterns, exhibiting complex motion, people exhibiting, exhibiting complex
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures, under review for IEEE Robotics and Automation Letters (RA-L)

Abstract:How can a robot safely navigate around people exhibiting complex motion patterns? Reinforcement Learning (RL) or Deep RL (DRL) in simulation holds some promise, although much prior work relies on simulators that fail to precisely capture the nuances of real human motion. To address this gap, we propose Deep Residual Model Predictive Control (DR-MPC), a method to enable robots to quickly and safely perform DRL from real-world crowd navigation data. By blending MPC with model-free DRL, DR-MPC overcomes the traditional DRL challenges of large data requirements and unsafe initial behavior. DR-MPC is initialized with MPC-based path tracking, and gradually learns to interact more effectively with humans. To further accelerate learning, a safety component estimates when the robot encounters out-of-distribution states and guides it away from likely collisions. In simulation, we show that DR-MPC substantially outperforms prior work, including traditional DRL and residual DRL models. Real-world experiments show our approach successfully enables a robot to navigate a variety of crowded situations with few errors using less than 4 hours of training data.

[LG-35] Echo State Networks for Spatio-Temporal Area-Level Data

Link: https://arxiv.org/abs/2410.10641
Authors: Zhenhua Wang, Scott H. Holan, Christopher K. Wikle
Keywords: providing valuable insights, Spatio-temporal area-level datasets, official statistics, providing valuable, Spatio-temporal area-level
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 23 pages, 4 figures

Abstract:Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can be extremely useful for policymakers to develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model’s computational efficiency during training. We demonstrate the effectiveness of our approach using Eurostat’s tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts.

[LG-36] Adapt-∞: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

Link: https://arxiv.org/abs/2410.10636
Authors: Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal
Keywords: Visual instruction datasets, Visual instruction, redundant text-image pairs, Lifelong Instruction Tuning, text-image pairs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: First two authors contributed equally. Code: this https URL

Abstract:Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of lifelong adaptable multimodal large language models, hindering their ability to refine existing skills and acquire new competencies over time. To address this, we reframe the problem of Lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. Based on empirical analyses that show that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose Adapt-∞, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during LiIT. We construct pseudo-skill clusters by grouping gradient-based sample vectors. Next, we select the best-performing data selector for each skill cluster from a pool of selector experts, including our newly proposed scoring function, Image Grounding score. This data selector samples a subset of the most important samples from each skill cluster for training. To prevent the continuous increase in the size of the dataset pool during LiIT, which would result in excessive computation, we further introduce a cluster-wise permanent data pruning strategy to remove the most semantically redundant samples from each cluster, keeping computational requirements manageable. Training with samples selected by Adapt-∞ alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original datasets.

[LG-37] Lambda-Skip Connections: the architectural component that prevents Rank Collapse

Link: https://arxiv.org/abs/2410.10609
Authors: Federico Arangath Joseph, Jerome Sieber, Melanie N. Zeilinger, Carmen Amo Alonso
Keywords: deep learning literature, models rapidly converge, Rank collapse, sequence models rapidly, learning literature
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Rank collapse, a phenomenon where embedding vectors in sequence models rapidly converge to a uniform token or equilibrium state, has recently gained attention in the deep learning literature. This phenomenon leads to reduced expressivity and potential training instabilities due to vanishing gradients. Empirical evidence suggests that architectural components like skip connections, LayerNorm, and MultiLayer Perceptrons (MLPs) play critical roles in mitigating rank collapse. While this issue is well-documented for transformers, alternative sequence models, such as State Space Models (SSMs), which have recently gained prominence, have not been thoroughly examined for similar vulnerabilities. This paper extends the theory of rank collapse from transformers to SSMs using a unifying framework that captures both architectures. We study how a parametrized version of the classic skip connection component, which we call lambda-skip connections, provides guarantees for rank collapse prevention. Through analytical results, we present a sufficient condition to guarantee prevention of rank collapse across all the aforementioned architectures. We also study the necessity of this condition via ablation studies and analytical examples. To our knowledge, this is the first study that provides a general guarantee to prevent rank collapse, and that investigates rank collapse in the context of SSMs, offering valuable understanding for both theoreticians and practitioners. Finally, we validate our findings with experiments demonstrating the crucial role of architectural components such as skip connections and gating mechanisms in preventing rank collapse.
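
A toy example can show why a skip connection matters for rank collapse. The sketch below assumes the parametrized skip takes the form `lam * x + f(x)`, which is a guess at the construction since the abstract does not state it, and it replaces attention with the most extreme mixing possible: every token becomes the mean of all tokens. LayerNorm, gating, and the paper's actual sufficient condition on lambda are not modelled.

```python
def uniform_mixing(tokens):
    """Toy stand-in for attention: every token becomes the mean of all tokens."""
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [list(mean) for _ in range(n)]

def lambda_skip_layer(tokens, lam):
    """Assumed form of a lambda-skip layer: lam * input + mixing(input)."""
    mixed = uniform_mixing(tokens)
    return [[lam * x + m for x, m in zip(tok, mix)]
            for tok, mix in zip(tokens, mixed)]

def token_spread(tokens):
    """Largest per-coordinate gap between any two tokens; 0 means total collapse."""
    dim = len(tokens[0])
    return max(max(t[d] for t in tokens) - min(t[d] for t in tokens)
               for d in range(dim))
```

With `lam=0.0` a single layer makes all tokens identical, so the spread drops to zero, while with `lam=1.0` the differences between tokens survive the layer.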

[LG-38] BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI

Link: https://arxiv.org/abs/2410.10604
Authors: Shaohao Rui, Lingzhi Chen, Zhenyu Tang, Lilong Wang, Mianxin Liu, Shaoting Zhang, Xiaosong Wang
Keywords: Accurate diagnosis, complementary multi-parametric MRI, multi-parametric MRI imaging, MRI imaging data, abnormalities is greatly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essence representation of each MR image modality for both the pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks.
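
For readers unfamiliar with the Dice score quoted in the abstract above, here is a minimal sketch of the metric on binary segmentation masks. The function name `dice_score`, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; this is not BrainMVP's evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy masks (hypothetical data): 2 overlapping foreground voxels, 3 foreground voxels each.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 4))
```

With these toy masks the overlap is 2 voxels and each mask has 3 foreground voxels, so the score is 2·2/(3+3) = 2/3.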

[LG-39] Neural networks that overcome classic challenges through practice

Link: https://arxiv.org/abs/2410.10596
Authors: Kazuki Irie, Brenden M. Lake
Keywords: neural network models, human cognitive abilities, mind and brain, critics have pointed, cognitive abilities
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Since the earliest proposals for neural network models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that has used metalearning to help overcome some of these challenges. We characterize their successes as addressing an important developmental problem: they provide machines with an incentive to improve X (where X represents the desired capability) and opportunities to practice it, through explicit optimization for X; unlike conventional approaches that hope for achieving X through generalization from related but different objectives. We review applications of this principle to four classic challenges: systematicity, catastrophic forgetting, few-shot learning and multi-step reasoning; we also discuss related aspects of human development in natural environments.

[LG-40] TRESTLE: A Model of Concept Formation in Structured Domains

Link: https://arxiv.org/abs/2410.10588
Authors: Christopher J. MacLellan, Erik Harpstead, Vincent Aleven, Kenneth R. Koedinger
Keywords: concept formation, learning concepts incrementally, concepts incrementally, concept formation focus, TRESTLE
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 6 figures, 1 table

Abstract:The literature on concept formation has demonstrated that humans are capable of learning concepts incrementally, with a variety of attribute types, and in both supervised and unsupervised settings. Many models of concept formation focus on a subset of these characteristics, but none account for all of them. In this paper, we present TRESTLE, an incremental account of probabilistic concept formation in structured domains that unifies prior concept learning models. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It updates its knowledge by partially matching novel structures and sorting them into its categorization tree. Finally, the system supports mixed-data representations, including nominal, numeric, relational, and component attributes. We evaluate TRESTLE’s performance on a supervised learning task and an unsupervised clustering task. For both tasks, we compare it to a nonincremental model and to human participants. We find that this new categorization model is competitive with the nonincremental approach and more closely approximates human behavior on both tasks. These results serve as an initial demonstration of TRESTLE’s capabilities and show that, by taking key characteristics of human learning into account, it can better model behavior than approaches that ignore them.

[LG-41] TopoFR: A Closer Look at Topology Alignment on Face Recognition NEURIPS2024

Link: https://arxiv.org/abs/2410.10587
Authors: Jun Dan, Yang Liu, Jiankang Deng, Haoyu Xie, Siyuan Li, Baigui Sun, Shan Luo
Keywords: undergone significant advancements, structure information, structure, latent space, latent space structure
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024

Abstract:The field of face recognition (FR) has undergone significant advancements with the rise of deep learning. Recently, the success of unsupervised learning and graph neural networks has demonstrated the effectiveness of data structure information. Considering that the FR task can leverage large-scale training data, which intrinsically contains significant structure information, we aim to investigate how to encode such critical structure information into the latent space. As revealed from our observations, directly aligning the structure information between the input and latent spaces inevitably suffers from an overfitting problem, leading to a structure collapse phenomenon in the latent space. To address this problem, we propose TopoFR, a novel FR model that leverages a topological structure alignment strategy called PTSA and a hard sample mining strategy named SDE. Concretely, PTSA uses persistent homology to align the topological structures of the input and latent spaces, effectively preserving the structure information and improving the generalization performance of FR model. To mitigate the impact of hard samples on the latent space structure, SDE accurately identifies hard samples by automatically computing structure damage score (SDS) for each sample, and directs the model to prioritize optimizing these samples. Experimental results on popular face benchmarks demonstrate the superiority of our TopoFR over the state-of-the-art methods. Code and models are available at: this https URL.

[LG-42] STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack

Link: https://arxiv.org/abs/2410.10584
Authors: Naman Gupta, Shashank Kirtania, Priyanshu Gupta, Krishna Kariya, Sumit Gulwani, Arun Iyer, Suresh Parthasarathy, Arjun Radhakrishna, Sriram K. Rajamani, Gustavo Soares
Keywords: Large Language Models, Large Language, Language Models, outdated information, private data
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. Each document is assigned to an actor, modeled as a ReACT agent, which performs structured edits based on document-specific targeted instructions from a centralized critic. Experimental results show that STACKFEED significantly improves KB quality and RAG system performance, enhancing accuracy by up to 8% over baselines.

[LG-43] Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes

Link: https://arxiv.org/abs/2410.10578
Authors: Juan Sebastian Rojas, Chi-Guhn Lee
Keywords: Markov decision processes, Average-reward Markov decision, Markov decision, Average-reward Markov, decision processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2006.16318, arXiv:2110.13855 by other authors

Abstract:Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.
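
As background for the CVaR risk measure named in the abstract above, the following sketch computes an empirical CVaR from a batch of reward samples: the average of the worst `alpha` fraction of outcomes. The function name `empirical_cvar`, the lower-tail convention for rewards, and the sample data are assumptions for illustration. This is a plain batch estimate, not the fully online RED algorithm the paper proposes.

```python
import math


def empirical_cvar(rewards, alpha):
    """Mean of the worst ceil(alpha * n) rewards (lower-tail CVaR at level alpha)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Number of samples in the tail; always keep at least one.
    k = max(1, math.ceil(alpha * len(rewards)))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


# Hypothetical per-step rewards with two bad outcomes.
rewards = [4.0, -6.0, 5.0, 3.0, 6.0, -2.0, 7.0, 5.0]
mean_reward = sum(rewards) / len(rewards)
tail_risk = empirical_cvar(rewards, alpha=0.25)
print(mean_reward, tail_risk)
```

Here the mean reward is 2.75, while the CVaR at level 0.25 averages the two worst samples (-6 and -2) and gives -4.0. A risk-aware objective optimizes the second quantity rather than the first.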

[LG-44] Regularized Robustly Reliable Learners and Instance Targeted Attacks

Link: https://arxiv.org/abs/2410.10572
Authors: Avrim Blum, Donya Saless
Keywords: Instance-targeted data poisoning, data poisoning attacks, raised significant concerns, Instance-targeted data, poisoning attacks
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:Instance-targeted data poisoning attacks, where an adversary corrupts a training set to induce errors on specific test points, have raised significant concerns. Balcan et al (2022) proposed an approach to addressing this challenge by defining a notion of robustly-reliable learners that provide per-instance guarantees of correctness under well-defined assumptions, even in the presence of data poisoning attacks. They then give a generic optimal (but computationally inefficient) robustly reliable learner as well as a computationally efficient algorithm for the case of linear separators over log-concave distributions. In this work, we address two challenges left open by Balcan et al (2022). The first is that the definition of robustly-reliable learners in Balcan et al (2022) becomes vacuous for highly-flexible hypothesis classes: if there are two classifiers h_0, h_1 \in H both with zero error on the training set such that h_0(x) \neq h_1(x), then a robustly-reliable learner must abstain on x. We address this problem by defining a modified notion of regularized robustly-reliable learners that allows for nontrivial statements in this case. The second is that the generic algorithm of Balcan et al (2022) requires re-running an ERM oracle (essentially, retraining the classifier) on each test point x, which is generally impractical even if ERM can be implemented efficiently. To tackle this problem, we show that at least in certain interesting cases we can design algorithms that can produce their outputs in time sublinear in training time, by using techniques from dynamic algorithm design.

[LG-45] ROSAR: An Adversarial Re-Training Framework for Robust Side-Scan Sonar Object Detection

Link: https://arxiv.org/abs/2410.10554
Authors: Martin Aubard, László Antal, Ana Madureira, Luis F. Teixeira, Erika Ábrahám
Keywords: deep learning object, autonomous underwater vehicles, learning object detection, generated by autonomous, paper introduces ROSAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:This paper introduces ROSAR, a novel framework enhancing the robustness of deep learning object detection models tailored for side-scan sonar (SSS) images, generated by autonomous underwater vehicles using sonar sensors. By extending our prior work on knowledge distillation (KD), this framework integrates KD with adversarial retraining to address the dual challenges of model efficiency and robustness against SSS noises. We introduce three novel, publicly available SSS datasets, capturing different sonar setups and noise conditions. We propose and formalize two SSS safety properties and utilize them to generate adversarial datasets for retraining. Through a comparative analysis of projected gradient descent (PGD) and patch-based adversarial attacks, ROSAR demonstrates significant improvements in model robustness and detection accuracy under SSS-specific conditions, enhancing the model’s robustness by up to 1.85%. ROSAR is available at this https URL.

[LG-46] SLaNC: Static LayerNorm Calibration NEURIPS2024

Link: https://arxiv.org/abs/2410.10553
Authors: Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik
Keywords: Large Language Models, generated enormous pressure, rapidly expanding fields, Large Language, sizes of Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, NeurIPS 2024 MLNCP Workshop

Abstract:The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication and storage requirements. Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes calculation of the LayerNorm simply because accumulation of the variance requires a much wider dynamic range than the hardware enables. In this article, we address this matter and propose a computationally-efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow could happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.
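
The abstract above relies on the fact that rescaling the input of LayerNorm leaves its output unchanged while shrinking the variance accumulator. The sketch below illustrates only that property. The helper names, the input values, and the fixed `scale` are assumptions for illustration; in particular, the paper derives its scaling factors from the weights of the preceding linear layer, which is not reproduced here. The computation runs in Python floats and compares magnitudes against the FP16 limit rather than emulating half precision.

```python
import math

FP16_MAX = 65504.0  # largest finite value in IEEE 754 half precision


def centered_sum_of_squares(x):
    """The accumulator LayerNorm needs for the variance: sum((x_i - mean)^2)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)


def layer_norm(x, eps=0.0):
    """LayerNorm without learned gain or bias.

    With eps = 0 the output is mathematically invariant to positive rescaling
    of x; with eps > 0 the invariance holds only approximately.
    """
    n = len(x)
    mean = sum(x) / n
    var = centered_sum_of_squares(x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical activations whose squared sum would overflow FP16.
x = [3.0e4, -1.0e4, 2.0e4, -4.0e4]
# Hypothetical static scale (NOT the paper's formula): brings values into [-1, 1].
scale = 1.0 / 4.0e4
x_scaled = [v * scale for v in x]

overflows_unscaled = centered_sum_of_squares(x) > FP16_MAX
fits_scaled = centered_sum_of_squares(x_scaled) < FP16_MAX
max_diff = max(abs(a - b) for a, b in zip(layer_norm(x), layer_norm(x_scaled)))
print(overflows_unscaled, fits_scaled, max_diff < 1e-9)
```

Because the scale cancels between the numerator and the standard deviation, a factor computed offline adds no work at inference time, which is the property the abstract emphasizes.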

[LG-47] Graph Classification Gaussian Processes via Hodgelet Spectral Features NEURIPS2024

Link: https://arxiv.org/abs/2410.10546
Authors: Mathieu Alain, So Takao, Xiaowen Dong, Bastian Rieck, Emmanuel Noutahi
Keywords: machine learning, problem of classifying, ubiquitous in machine, features, classifying graphs
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024 Workshop on Bayesian Decision-Making and Uncertainty (oral presentation)

Abstract:The problem of classifying graphs is ubiquitous in machine learning. While it is standard to apply graph neural networks for such tasks, Gaussian processes can also be used, by transforming graph features into the spectral domain, and using the resulting spectral features as input points. However, this approach only takes into account features on vertices, whereas some graph data also support features on edges. In this work, we present a Gaussian process-based classification algorithm that can utilise vertex and/or edges features to help classify graphs. Furthermore, we take advantage of the Hodge decomposition of vertex and edge features to increase the flexibility of the model, which can be beneficial on some tasks.

[LG-48] Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.10542
Authors: Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
Keywords: context of Indian, study investigates judgment, Indian judgments, including InLegalBERT, utilizing a range
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted on NLLP at EMNLP 2024

Abstract:This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

[LG-49] Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature

Link: https://arxiv.org/abs/2410.10537
Authors: Jan Vrba, Jakub Steinbach, Tomáš Jirsa, Laura Verde, Roberta De Fazio, Noriyasu Homma, Yuwen Zeng, Key Ichiji, Lukáš Hájek, Zuzana Sedláková, Jan Mareš
Keywords: propose a robust, research of contemporary, contemporary practices, feature set, robust set
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 33 pages, 8 figures, code repository: this https URL

Abstract:In this study, we propose a robust set of features derived from a thorough research of contemporary practices in voice pathology detection. The feature set is based on the combination of acoustic handcrafted features. Additionally, we introduce pitch difference as a novel feature. We combine this feature set, containing data from the publicly available Saarbrücken Voice Database (SVD), with preprocessing using the K-Means Synthetic Minority Over-Sampling Technique algorithm to address class imbalance. Moreover, we applied multiple ML models as binary classifiers. We utilized support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest and AdaBoost classifiers. To determine the best classification approach, we performed grid search on feasible hyperparameters of respective classifiers and subsections of features. Our approach has achieved the state-of-the-art performance, measured by unweighted average recall in voice pathology detection on SVD database. We intentionally omit accuracy as it is highly biased metric in case of unbalanced data compared to aforementioned metrics. The results are further enhanced by eliminating the potential overestimation of the results with repeated stratified cross-validation. This advancement demonstrates significant potential for the clinical deployment of ML methods, offering a valuable tool for an objective examination of voice pathologies. To support our claims, we provide a publicly available GitHub repository with DOI https://doi.org/10.5281/zenodo.13771573. Finally, we provide REFORMS checklist.
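
The abstract above reports unweighted average recall (UAR) and deliberately omits accuracy because of class imbalance. The small sketch below shows why the two can disagree. The function names, the label encoding (1 = pathological, 0 = healthy), and the toy data are assumptions for illustration and are not taken from the paper's repository.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced test set: 9 pathological recordings, 1 healthy one.
y_true = [1] * 9 + [0]
# A degenerate classifier that always predicts "pathological".
y_pred = [1] * 10

acc = accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(acc, uar)
```

The always-positive classifier reaches 0.9 accuracy but only 0.5 UAR, since its recall is 1.0 on the majority class and 0.0 on the minority class.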

[LG-50] Transparent Networks for Multivariate Time Series

Link: https://arxiv.org/abs/2410.10535
Authors: Minkyu Kim, Suan Lee, Jinho Kim
Keywords: inherently interpretable predictions, produce inherently interpretable, receiving significant attention, time series, machine learning models
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:

Abstract:Transparent models, which are machine learning models that produce inherently interpretable predictions, are receiving significant attention in high-stakes domains. However, despite much real-world data being collected as time series, there is a lack of studies on transparent time series models. To address this gap, we propose a novel transparent neural network model for time series called Generalized Additive Time Series Model (GATSM). GATSM consists of two parts: 1) independent feature networks to learn feature representations, and 2) a transparent temporal module to learn temporal patterns across different time steps using the feature representations. This structure allows GATSM to effectively capture temporal patterns and handle dynamic-length time series while preserving transparency. Empirical experiments show that GATSM significantly outperforms existing generalized additive models and achieves comparable performance to black-box time series models, such as recurrent neural networks and Transformer. In addition, we demonstrate that GATSM finds interesting patterns in time series. The source code is available at this https URL.

[LG-51] Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

Link: https://arxiv.org/abs/2410.10533
Authors: Sonja Hannibal, Arnulf Jentzen, Do Minh Thang
Keywords: deep neural networks, Deep learning methods, SGD methods, stochastic gradient descent, nowadays key tools
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Comments: 91 pages

Abstract:Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.

[LG-52] Adaptive Probabilistic ODE Solvers Without Adaptive Memory Requirements

Link: https://arxiv.org/abs/2410.10530
Authors: Nicholas Krämer
Keywords: memory-demanding differential equations, solve memory-demanding differential, adaptive step sizes, differential equations, step sizes
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Despite substantial progress in recent years, probabilistic solvers with adaptive step sizes can still not solve memory-demanding differential equations – unless we care only about a single point in time (which is far too restrictive; we want the whole time series). Counterintuitively, the culprit is the adaptivity itself: Its unpredictable memory demands easily exceed our machine’s capabilities, making our simulations fail unexpectedly and without warning. Still, dropping adaptivity would abandon years of progress, which can’t be the answer. In this work, we solve this conundrum. We develop an adaptive probabilistic solver with fixed memory demands building on recent developments in robust state estimation. Switching to our method (i) eliminates memory issues for long time series, (ii) accelerates simulations by orders of magnitude through unlocking just-in-time compilation, and (iii) makes adaptive probabilistic solvers compatible with scientific computing in JAX.

[LG-53] Get Rid of Task Isolation: A Continuous Multi-task Spatio-Temporal Learning Framework NEURIPS2024

Link: https://arxiv.org/abs/2410.10524
Authors: Zhongchao Yi, Zhengyang Zhou, Qihe Huang, Yanjiang Chen, Liheng Yu, Xu Wang, Yang Wang
Keywords: enable urban intelligence, pivotal technique, technique to enable, Spatiotemporal learning, urban
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024

Abstract:Spatiotemporal learning has become a pivotal technique to enable urban intelligence. Traditional spatiotemporal models mostly focus on a specific task by assuming the same distribution between training and testing sets. However, given that urban systems are usually dynamic, multi-sourced with imbalanced data distributions, current task-specific models fail to generalize to new urban conditions and adapt to new domains without explicitly modeling interdependencies across various dimensions and types of urban data. To this end, we argue that it is essential to propose a Continuous Multi-task Spatio-Temporal learning framework (CMuST) to empower collective urban intelligence, which reforms the urban spatiotemporal learning from single-domain to cooperatively multi-dimensional and multi-task learning. Specifically, CMuST proposes a new multi-dimensional spatiotemporal interaction network (MSTI) to allow cross-interactions between context and main observations as well as self-interactions within spatial and temporal aspects to be exposed, which is also the core for capturing task-level commonality and personalization. To ensure continuous task learning, a novel Rolling Adaptation training scheme (RoAda) is devised, which not only preserves task uniqueness by constructing data summarization-driven task prompts, but also harnesses correlated patterns among tasks by iterative model behavior modeling. We further establish a benchmark of three cities for multi-task spatiotemporal learning, and empirically demonstrate the superiority of CMuST via extensive evaluations on these datasets. CMuST achieves impressive improvements on both few-shot streaming data and new domain tasks against existing SOTA methods. Code is available at this https URL.

[LG-54] Continual Deep Reinforcement Learning to Prevent Catastrophic Forgetting in Jamming Mitigation

Link: https://arxiv.org/abs/2410.10521
Authors: Kemal Davaslioglu, Sastry Kompella, Tugba Erpek, Yalin E. Sagduyu
Keywords: Deep Reinforcement Learning, Deep Reinforcement, reliable wireless communications, facilitate reliable wireless, DRL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: IEEE MILCOM 2024

Abstract:Deep Reinforcement Learning (DRL) has been highly effective in learning from and adapting to RF environments and thus detecting and mitigating jamming effects to facilitate reliable wireless communications. However, traditional DRL methods are susceptible to catastrophic forgetting (namely forgetting old tasks when learning new ones), especially in dynamic wireless environments where jammer patterns change over time. This paper considers an anti-jamming system and addresses the challenge of catastrophic forgetting in DRL applied to jammer detection and mitigation. First, we demonstrate the impact of catastrophic forgetting in DRL when applied to jammer detection and mitigation tasks, where the network forgets previously learned jammer patterns while adapting to new ones. This catastrophic interference undermines the effectiveness of the system, particularly in scenarios where the environment is non-stationary. We present a method that enables the network to retain knowledge of old jammer patterns while learning to handle new ones. Our approach substantially reduces catastrophic forgetting, allowing the anti-jamming system to learn new tasks without compromising its ability to perform previously learned tasks effectively. Furthermore, we introduce a systematic methodology for sequentially learning tasks in the anti-jamming framework. By leveraging continual DRL techniques based on PackNet, we achieve superior anti-jamming performance compared to standard DRL methods. Our proposed approach not only addresses catastrophic forgetting but also enhances the adaptability and robustness of the system in dynamic jamming environments. We demonstrate the efficacy of our method in preserving knowledge of past jammer patterns, learning new tasks efficiently, and achieving superior anti-jamming performance compared to traditional DRL approaches.
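
The abstract above builds on PackNet, which avoids catastrophic forgetting by giving each task its own subset of weights and freezing that subset afterwards. The sketch below is a heavily simplified, PackNet-style illustration on a flat weight list. The function names, the `owner` bookkeeping, the keep fraction, and the toy gradients are all assumptions for illustration; it does not reproduce the paper's anti-jamming agent or PackNet's retraining schedule.

```python
def assign_by_magnitude(weights, owner, task_id, keep_fraction):
    """Give the largest-magnitude free weights to task_id; zero the rest so they stay free."""
    free = [i for i, o in enumerate(owner) if o is None]
    k = round(keep_fraction * len(free))
    ranked = sorted(free, key=lambda i: abs(weights[i]), reverse=True)
    for i in ranked[:k]:
        owner[i] = task_id   # frozen from now on
    for i in ranked[k:]:
        weights[i] = 0.0     # pruned, available to later tasks


def masked_update(weights, owner, grads, lr):
    """Gradient step that only touches weights no earlier task owns."""
    for i, g in enumerate(grads):
        if owner[i] is None:
            weights[i] -= lr * g


# Hypothetical weights after training on a first jammer pattern (task 0).
weights = [0.9, -0.1, 0.5, 0.05, -0.7, 0.2]
owner = [None] * len(weights)
assign_by_magnitude(weights, owner, task_id=0, keep_fraction=0.5)

# One training step on a second jammer pattern with toy gradients.
masked_update(weights, owner, grads=[1.0] * len(weights), lr=0.1)
print(weights)
print(owner)
```

After the second-task update, the three weights owned by task 0 (0.9, 0.5, -0.7) are untouched, so behavior on the first task is preserved while the freed weights adapt to the new one.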

[LG-55] AI-based particle track identification in scintillating fibres read out with imaging sensors

Link: https://arxiv.org/abs/2410.10519
Authors: Noemi Bührer, Saúl Alonso-Monsalve, Matthew Franks, Till Dieminger, Davide Sgalaberna
Keywords: scintillating fibres read, particle track identification, paper presents, presents the development, development and application
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det)
Comments: 19 pages, 9 figures

Abstract:This paper presents the development and application of an AI-based method for particle track identification using scintillating fibres read out with imaging sensors. We propose a variational autoencoder (VAE) to efficiently filter and identify frames containing signal from the substantial data generated by SPAD array sensors. Our VAE model, trained on purely background frames, demonstrated a high capability to distinguish frames containing particle tracks from background noise. The performance of the VAE-based anomaly detection was validated with experimental data, demonstrating the method’s ability to efficiently identify relevant events with rapid processing time, suggesting a solid prospect for deployment as a fast inference tool on hardware for real-time anomaly detection. This work highlights the potential of combining advanced sensor technology with machine learning techniques to enhance particle detection and tracking.

[LG-56] UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

Link: https://arxiv.org/abs/2410.10516
Authors: Shikun Feng, Yuyan Ni, Yan Lu, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan
Keywords (EN): property prediction, Molecular generation, drug discovery, developed independently, molecular property prediction
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 11 pages, 5 figures

Abstract:Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies, which demonstrate that diffusion models, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.

[LG-57] Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers KDD ECML

Link: https://arxiv.org/abs/2410.10515
Authors: Gabriel Souza, Flavio Figueiredo, Alexei Machado, Deborah Guimarães
Keywords (EN): achieved formidable results, recent years, deep learning, creative computing, learning has achieved
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Presented at the Music for Machine Learning Workshop with ECML/PKDD. To be published by Springer

Abstract:In recent years, deep learning has achieved formidable results in creative computing. When it comes to music, Transformer-based models are one viable approach to generation. However, while Transformer models are popular for music generation, they often rely on annotated structural information. In this work, we ask whether off-the-shelf Music Transformer models perform just as well on structural similarity metrics using only unannotated MIDI information. We show that a slight tweak to the most common representation yields small but significant improvements. We also argue that searching for better unannotated musical representations is more cost-effective than producing large amounts of curated and annotated data.

[LG-58] Artificial Intelligence-Based Triaging of Cutaneous Melanocytic Lesions

Link: https://arxiv.org/abs/2410.10509
Authors: Ruben T. Lucassen, Nikolas Stathonikos, Gerben E. Breimer, Mitko Veta, Willeke A. M. Blokx
Keywords (EN): increasing workload due, comprehensive diagnoses, facing an increasing, growing volume, increasing workload
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:Pathologists are facing an increasing workload due to a growing volume of cases and the need for more comprehensive diagnoses. Aiming to facilitate workload reduction and faster turnaround times, we developed an artificial intelligence (AI) model for triaging cutaneous melanocytic lesions based on whole slide images. The AI model was developed and validated using a retrospective cohort from the UMC Utrecht. The dataset consisted of 52,202 whole slide images from 27,167 unique specimens, acquired from 20,707 patients. Specimens with only common nevi were assigned to the low complexity category (86.6%). In contrast, specimens with any other melanocytic lesion subtype, including non-common nevi, melanocytomas, and melanomas, were assigned to the high complexity category (13.4%). The dataset was split on patient level into a development set (80%) and test sets (20%) for independent evaluation. Predictive performance was primarily measured using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation experiment was performed to study the effect of implementing AI-based triaging in the clinic. The AI model reached an AUROC of 0.966 (95% CI, 0.960-0.972) and an AUPRC of 0.857 (95% CI, 0.836-0.877) on the in-distribution test set, and an AUROC of 0.899 (95% CI, 0.860-0.934) and an AUPRC of 0.498 (95% CI, 0.360-0.639) on the out-of-distribution test set. In the simulation experiment, using random case assignment as baseline, AI-based triaging prevented an average of 43.9 (95% CI, 36-55) initial examinations of high complexity cases by general pathologists for every 500 cases. In conclusion, the AI model achieved a strong predictive performance in differentiating between cutaneous melanocytic lesions of high and low complexity. The improvement in workflow efficiency due to AI-based triaging could be substantial.
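The abstract above reports AUROC as its primary metric. As a reminder of what that number measures, here is a minimal sketch that computes AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as half. The labels and scores are hypothetical toy values, not data from the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = high complexity, 0 = low complexity.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75 (3 of the 4 positive/negative pairs are ordered correctly)
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; a rank-based formulation would be the usual choice for datasets the size of the one described.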

[LG-59] Comparison of deep learning and conventional methods for disease onset prediction

Link: https://arxiv.org/abs/2410.10505
Authors: Luis H. John, Chungsoo Kim, Jan A. Kors, Junhyuk Chang, Hannah Morgan-Cooper, Priya Desai, Chao Pang, Peter R. Rijnbeek, Jenna M. Reps, Egill A. Fridgeirsson
Keywords (EN): Deep learning, Deep learning methods, deep learning models, learning, reliability and interpretability
Categories: Machine Learning (cs.LG)

Abstract:Background: Conventional prediction methods such as logistic regression and gradient boosting have been widely utilized for disease onset prediction for their reliability and interpretability. Deep learning methods promise enhanced prediction performance by extracting complex patterns from clinical data, but face challenges like data sparsity and high dimensionality. Methods: This study compares conventional and deep learning approaches to predict lung cancer, dementia, and bipolar disorder using observational data from eleven databases from North America, Europe, and Asia. Models were developed using logistic regression, gradient boosting, ResNet, and Transformer, and validated both internally and externally across the data sources. Discrimination performance was assessed using AUROC, and calibration was evaluated using Eavg. Findings: Across 11 datasets, conventional methods generally outperformed deep learning methods in terms of discrimination performance, particularly during external validation, highlighting their better transportability. Learning curves suggest that deep learning models require substantially larger datasets to reach the same performance levels as conventional methods. Calibration performance was also better for conventional methods, with ResNet showing the poorest calibration. Interpretation: Despite the potential of deep learning models to capture complex patterns in structured observational healthcare data, conventional models remain highly competitive for disease onset prediction, especially in scenarios involving smaller datasets and if lengthy training times need to be avoided. The study underscores the need for future research focused on optimizing deep learning models to handle the sparsity, high dimensionality, and heterogeneity inherent in healthcare datasets, and find new strategies to exploit the full capabilities of deep learning methods. 

[LG-60] A Kernelizable Primal-Dual Formulation of the Multilinear Singular Value Decomposition

Link: https://arxiv.org/abs/2410.10504
Authors: Frederiek Wesel, Kim Batselier
Keywords (EN): Support Vector Machine, machine learning methods, Vector Machine, Least-Squares Support Vector, Support Vector
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:The ability to express a learning task in terms of a primal and a dual optimization problem lies at the core of a plethora of machine learning methods. For example, Support Vector Machine (SVM), Least-Squares Support Vector Machine (LS-SVM), Ridge Regression (RR), Lasso Regression (LR), Principal Component Analysis (PCA), and more recently Singular Value Decomposition (SVD) have all been defined either in terms of primal weights or in terms of dual Lagrange multipliers. The primal formulation is computationally advantageous in the case of large sample size while the dual is preferred for high-dimensional data. Crucially, said learning problems can be made nonlinear through the introduction of a feature map in the primal problem, which corresponds to applying the kernel trick in the dual. In this paper we derive a primal-dual formulation of the Multilinear Singular Value Decomposition (MLSVD), which recovers as special cases both PCA and SVD. Besides enabling computational gains through the derived primal formulation, we propose a nonlinear extension of the MLSVD using feature maps, which results in a dual problem where a kernel tensor arises. We discuss potential applications in the context of signal analysis and deep learning.

[LG-61] A Practical Approach to Causal Inference over Time

Link: https://arxiv.org/abs/2410.10502
Authors: Martina Cinquini, Isacco Beretta, Salvatore Ruggieri, Isabel Valera
Keywords (EN): focus on estimating, causal, causal VAR framework, time, causal VAR
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. To that end, we formally define causal interventions and their effects over time on discrete-time stochastic processes (DSPs). Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework.

[LG-62] Model-Based Differentially Private Knowledge Transfer for Large Language Models

Link: https://arxiv.org/abs/2410.10481
Authors: Zhaomin Wu, Jizhou Guo, Junyi Hou, Bingsheng He, Lixin Fan, Qiang Yang
Keywords (EN): effectively leveraging domain-specific, large language models, web services, effectively leveraging, leveraging domain-specific knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose Llamdex, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.

[LG-63] The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

Link: https://arxiv.org/abs/2410.10473
Authors: Yonatan Slutzky, Yotam Alexander, Noam Razin, Nadav Cohen
Keywords (EN): implicit bias, Neural networks, tendency of gradient, gradient descent, descent to fit
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.

[LG-64] Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Link: https://arxiv.org/abs/2410.10469
Authors: Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo
Keywords (EN): demonstrated impressive performance, Time series, Time, series, demonstrated impressive
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Time series foundation models have demonstrated impressive performance as zero-shot forecasters. However, achieving effectively unified training on time series remains an open challenge. Existing approaches introduce some level of model specialization to account for the highly heterogeneous nature of time series data. For instance, Moirai pursues unified training by employing multiple input/output projection layers, each tailored to handle time series at a specific frequency. Similarly, TimesFM maintains a frequency embedding dictionary for this purpose. We identify two major drawbacks to this human-imposed frequency-level model specialization: (1) Frequency is not a reliable indicator of the underlying patterns in time series. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. (2) Non-stationarity is an inherent property of real-world time series, leading to varied distributions even within a short context window of a single time series. Frequency-level specialization is too coarse-grained to capture this level of diversity. To address these limitations, this paper introduces Moirai-MoE, using a single input/output projection layer while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With these designs, Moirai-MoE reduces reliance on human-defined heuristics and enables automatic token-level specialization. Extensive experiments on 39 datasets demonstrate the superiority of Moirai-MoE over existing foundation models in both in-distribution and zero-shot scenarios. Furthermore, this study conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides valuable insights for future research.
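The token-level specialization described above rests on sparse mixture-of-experts routing. The sketch below shows generic top-k gating, in which each token is sent only to the k experts with the highest gate scores and their outputs are mixed with renormalized weights. It is an illustration of the general technique under that assumption, not Moirai-MoE's actual gating function; the expert functions and logits are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts only
    return {i: probs[i] / norm for i in chosen}


def moe_output(token, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never evaluated."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](token) for i, w in weights.items())


# Hypothetical scalar "experts" standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
print(moe_output(3.0, [1.0, 3.0, 0.0, 3.0], experts, k=2))
```

Because only k experts run per token, compute stays roughly constant as the number of experts grows, which is the usual motivation for the sparse variant.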

[LG-65] Information propagation dynamics in Deep Graph Networks

Link: https://arxiv.org/abs/2410.10464
Authors: Alessio Gravina
Keywords (EN): highly expressive abstraction, Deep Graph Networks, social networks, traffic networks, molecular structures
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: PhD thesis

Abstract:Graphs are a highly expressive abstraction for modeling entities and their relations, such as molecular structures, social networks, and traffic networks. Deep Graph Networks (DGNs) have emerged as a family of deep learning models that can effectively process and learn such structured information. However, learning effective information propagation patterns within DGNs remains a critical challenge that heavily influences the model capabilities, both in the static domain and in the temporal domain (where features and/or topology evolve). Given this challenge, this thesis investigates the dynamics of information propagation within DGNs for static and dynamic graphs, focusing on their design as dynamical systems. Throughout this work, we provide theoretical and empirical evidence to demonstrate the effectiveness of our proposed architectures in propagating and preserving long-term dependencies between nodes, and in learning complex spatio-temporal patterns from irregular and sparsely sampled dynamic graphs. In summary, this thesis provides a comprehensive exploration of the intersection between graphs, deep learning, and dynamical systems, offering insights and advancements for the field of graph representation learning and paving the way for more effective and versatile graph-based learning models.

[LG-66] TABCF: Counterfactual Explanations for Tabular Data Using a Transformer-Based VAE

Link: https://arxiv.org/abs/2410.10463
Authors: Emmanouil Panagiotou, Manuel Heurich, Tim Landgraf, Eirini Ntoutsi
Keywords (EN): field of Explainable, alter a prediction, interpret a black-box, XAI, specific feature types
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Paper accepted at ICAIF '24: 5th ACM International Conference on AI in Finance, Brooklyn, NY, USA, November 2024

Abstract:In the field of Explainable AI (XAI), counterfactual (CF) explanations are one prominent method to interpret a black-box model by suggesting changes to the input that would alter a prediction. In real-world applications, the input is predominantly in tabular form and comprised of mixed data types and complex feature interdependencies. These unique data characteristics are difficult to model, and we empirically show that they lead to bias towards specific feature types when generating CFs. To overcome this issue, we introduce TABCF, a CF explanation method that leverages a transformer-based Variational Autoencoder (VAE) tailored for modeling tabular data. Our approach uses transformers to learn a continuous latent space and a novel Gumbel-Softmax detokenizer that enables precise categorical reconstruction while preserving end-to-end differentiability. Extensive quantitative evaluation on five financial datasets demonstrates that TABCF does not exhibit bias toward specific feature types, and outperforms existing methods in producing effective CFs that align with common CF desiderata.

[LG-67] Compositional Shielding and Reinforcement Learning for Multi-Agent Systems

Link: https://arxiv.org/abs/2410.10460
Authors: Asger Horn Brorholt, Kim Guldstrand Larsen, Christian Schilling
Keywords (EN): obtaining high-performance policies, Deep reinforcement learning, Deep reinforcement, powerful tool, tool for obtaining
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deep reinforcement learning has emerged as a powerful tool for obtaining high-performance policies. However, the safety of these policies has been a long-standing issue. One promising paradigm to guarantee safety is a shield, which shields a policy from making unsafe actions. However, computing a shield scales exponentially in the number of state variables. This is a particular concern in multi-agent systems with many agents. In this work, we propose a novel approach for multi-agent shielding. We address scalability by computing individual shields for each agent. The challenge is that typical safety specifications are global properties, but the shields of individual agents only ensure local properties. Our key to overcome this challenge is to apply assume-guarantee reasoning. Specifically, we present a sound proof rule that decomposes a (global, complex) safety specification into (local, simple) obligations for the shields of the individual agents. Moreover, we show that applying the shields during reinforcement learning significantly improves the quality of the policies obtained for a given training budget. We demonstrate the effectiveness and scalability of our multi-agent shielding framework in two case studies, reducing the computation time from hours to seconds and achieving fast learning convergence.

[LG-68] Advancing Academic Knowledge Retrieval via LLM-enhanced Representation Similarity Fusion KDD

Link: https://arxiv.org/abs/2410.10455
Authors: Wei Dai, Peng Fu, Chunjing Gan
Keywords (EN): swift information renewal, robust technological growth, avant-garde academic insights, academic insights spanning, information renewal
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 2nd Place of KDD Cup 2024 OAG-Challenge AQA

Abstract:In an era marked by robust technological growth and swift information renewal, furnishing researchers and the populace with top-tier, avant-garde academic insights spanning various domains has become an urgent necessity. The KDD Cup 2024 AQA Challenge is geared towards advancing retrieval models to identify pertinent academic terminologies from suitable papers for scientific inquiries. This paper introduces the LLM-KnowSimFuser proposed by Robo Space, which won 2nd place in the competition. Drawing inspiration from the superior performance of LLMs on multiple tasks, and after careful analysis of the provided datasets, we first perform fine-tuning and inference using LLM-enhanced pre-trained retrieval models to bring the language understanding and open-domain knowledge of LLMs into this task, followed by a weighted fusion based on the similarity matrix derived from the inference results. Finally, experiments conducted on the competition datasets show the superiority of our proposal, which achieved a score of 0.20726 on the final leaderboard.

[LG-69] Principled Bayesian Optimisation in Collaboration with Human Experts NEURIPS2024

Link: https://arxiv.org/abs/2410.10452
Authors: Wenjie Xu, Masaki Adachi, Colin N. Jones, Michael A. Osborne
Keywords (EN): Bayesian optimisation, optimisation process, performed interactively, integrating their domain, domain knowledge
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted to NeurIPS 2024 as a spotlight

Abstract:Bayesian optimisation for real-world problems is often performed interactively with human experts, and integrating their domain knowledge is key to accelerate the optimisation process. We consider a setup where experts provide advice on the next query point through binary accept/reject recommendations (labels). Experts’ labels are often costly, requiring efficient use of their efforts, and can at the same time be unreliable, requiring careful adjustment of the degree to which any expert is trusted. We introduce the first principled approach that provides two key guarantees. (1) Handover guarantee: similar to a no-regret property, we establish a sublinear bound on the cumulative number of experts’ binary labels. Initially, multiple labels per query are needed, but the number of expert labels required asymptotically converges to zero, saving both expert effort and computation time. (2) No-harm guarantee with data-driven trust level adjustment: our adaptive trust level ensures that the convergence rate will not be worse than the one without using advice, even if the advice from experts is adversarial. Unlike existing methods that employ a user-defined function that hand-tunes the trust level adjustment, our approach enables data-driven adjustments. Real-world applications empirically demonstrate that our method not only outperforms existing baselines, but also maintains robustness despite varying labelling accuracy, in tasks of battery design with human experts.

[LG-70] Mobility-Aware Federated Learning: Multi-Armed Bandit Based Selection in Vehicular Network

Link: https://arxiv.org/abs/2410.10451
Authors: Haoyu Tu, Lin Chen, Zuguang Li, Xiaopei Chen, Wen Wu
Keywords (EN): vehicular federated learning, federated learning, vehicle selection problem, paper, we study, mobility-aware vehicular federated
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by 2024 IEEE Globecom Workshops (GC Wkshps)

Abstract:In this paper, we study a vehicle selection problem for federated learning (FL) over vehicular networks. Specifically, we design a mobility-aware vehicular federated learning (MAVFL) scheme in which vehicles drive through a road segment to perform FL. Some vehicles may drive out of the segment, which leads to unsuccessful training. In the proposed scheme, the real-time successful training participation ratio is utilized to implement vehicle selection. We conduct the convergence analysis to indicate the influence of vehicle mobility on training loss. Furthermore, we propose a multi-armed bandit-based vehicle selection algorithm to minimize the utility function considering training loss and delay. The simulation results show that compared with baselines, the proposed algorithm can achieve better training performance with approximately 28% faster convergence.
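To make the bandit framing concrete, the following is a minimal sketch of UCB1, a standard multi-armed bandit rule, applied to a toy selection problem. It assumes each arm is a vehicle and the reward is 1 when that vehicle finishes its training round before leaving the segment; the success probabilities are hypothetical, and the paper's own algorithm and utility function (which also account for training loss and delay) are not reproduced here.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Pick the arm with the highest upper confidence bound; try unseen arms first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def run_bandit(success_probs, rounds, seed=0):
    """Simulate UCB1 against Bernoulli arms; return pull counts and estimated success ratios."""
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    means = [0.0] * len(success_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


# Hypothetical per-vehicle probabilities of completing a round inside the segment.
counts, means = run_bandit([0.9, 0.5, 0.2], rounds=2000)
print(counts, [round(m, 2) for m in means])
```

The exploration bonus shrinks as an arm is pulled more often, so the selector gradually concentrates on the vehicle with the best observed participation ratio while still occasionally re-checking the others.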

[LG-71] Diversity-Aware Reinforcement Learning for de novo Drug Design

Link: https://arxiv.org/abs/2410.10431
Authors: Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, Morteza Haghir Chehreghani
Keywords (EN): pre-trained generative model, demonstrated good performance, generating promising drug, reward function, pre-trained generative
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Fine-tuning a pre-trained generative model has demonstrated good performance in generating promising drug molecules. The fine-tuning task is often formulated as a reinforcement learning problem, where previous methods efficiently learn to optimize a reward function to generate potential drug molecules. Nevertheless, in the absence of an adaptive update mechanism for the reward function, the optimization process can become stuck in local optima. The efficacy of the optimal molecule in a local optimization may not translate to usefulness in the subsequent drug optimization process or as a potential standalone clinical candidate. Therefore, it is important to generate a diverse set of promising molecules. Prior work has modified the reward function by penalizing structurally similar molecules, primarily focusing on finding molecules with higher rewards. To date, no study has comprehensively examined how different adaptive update mechanisms for the reward function influence the diversity of generated molecules. In this work, we investigate a wide range of intrinsic motivation methods and strategies to penalize the extrinsic reward, and how they affect the diversity of the set of generated molecules. Our experiments reveal that combining structure- and prediction-based methods generally yields better results in terms of molecular diversity.

[LG-72] A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning

Link: https://arxiv.org/abs/2410.10417
Authors: Minyoung Kim, Timothy M. Hospedales
Keywords (EN): modern deep learning, general differentiable meta, tackle the general, general differentiable, ubiquitous in modern
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning, including hyperparameter optimization, loss function learning, few-shot learning, invariance learning and more. These problems are often formalized as Bi-Level optimizations (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution. To solve this stochastic optimization, we adopt Stochastic Gradient Langevin Dynamics (SGLD) MCMC to sample inner distribution, and propose a recurrent algorithm to compute the MC-estimated hypergradient. Our derivation is similar to forward-mode differentiation, but we introduce a new first-order approximation that makes it feasible for large models without needing to store huge Jacobian matrices. The main benefits are two-fold: i) Our stochastic formulation takes into account uncertainty, which makes the method robust to suboptimal inner optimization or non-unique multiple inner minima due to overparametrization; ii) Compared to existing methods that often exhibit unstable behavior and hyperparameter sensitivity in practice, our method leads to considerably more reliable solutions. We demonstrate that the new approach achieves promising results on diverse meta learning problems and easily scales to learning 87M hyperparameters in the case of Vision Transformers.
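The sampling step mentioned above can be illustrated with a one-dimensional Langevin update. This is a sketch under two stated assumptions: the inner distribution takes the Gibbs form proportional to exp(-L(θ)), and the exact gradient is used (SGLD proper replaces it with a minibatch estimate). The toy loss, step size, and function names are hypothetical, and the paper's hypergradient recursion is not shown.

```python
import math
import random


def langevin_samples(grad_loss, theta0, step, n_steps, seed=0):
    """Draw approximate samples from p(theta) ∝ exp(-L(theta)) via Langevin updates."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(step))  # injected noise with variance = step
        theta = theta - 0.5 * step * grad_loss(theta) + noise
        samples.append(theta)
    return samples


# Toy inner loss L(theta) = (theta - 2)^2 / 2, so exp(-L) is a unit-variance Gaussian centred at 2.
samples = langevin_samples(lambda th: th - 2.0, theta0=0.0, step=0.1, n_steps=20000)
kept = samples[2000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(round(mean, 2), round(var, 2))
```

With a finite step size the chain's stationary variance is slightly inflated relative to the target, which is the usual discretization bias of unadjusted Langevin dynamics.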

[LG-73] On Calibration of LLM-based Guard Models for Reliable Content Moderation

Link: https://arxiv.org/abs/2410.10414
Authors: Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang
Keywords (EN): Large language models, LLM-based guard models, Large language, guard models, generating harmful content
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures

Abstract:Large language models (LLMs) pose significant risks due to the potential for generating harmful content or users attempting to evade guardrails. Existing studies have developed LLM-based guard models designed to moderate the input and output of threat LLMs, ensuring adherence to safety policies by blocking content that violates these protocols upon deployment. However, limited attention has been given to the reliability and calibration of such guard models. In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets. Our analysis and experiments underscore the limitations of current LLM-based guard models and provide valuable insights for the future development of well-calibrated guard models toward more reliable content moderation. We also advocate for incorporating reliability evaluation of confidence calibration when releasing future LLM-based guard models.
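Temperature scaling, one of the post-hoc methods named above, divides the logits by a single scalar T fitted on held-out data. The sketch below fits T by grid search over the validation negative log-likelihood; the two-class logits, labels, and grid are hypothetical and only illustrate the mechanism, not the paper's experimental setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid (default 0.5 to 5.0) that minimizes validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical (safe, unsafe) logits: three of four predictions are right, yet each is ~98% confident.
val_logits = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 1, 0, 1]
t = fit_temperature(val_logits, val_labels)
print(t, softmax([4.0, 0.0], t)[0])
```

A fitted T above 1 flattens the predicted distribution, pulling the reported confidence down toward the observed accuracy without changing which class is predicted.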

[LG-74] Deterministic Apple Tasting

Link: https://arxiv.org/abs/2410.10404
Authors: Zachary Chase, Idan Mehalel
Keywords: mistake bound, apple tasting, bound
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In binary (0/1) online classification with apple tasting feedback, the learner receives feedback only when predicting 1. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of [Raman, Subedi, Raman, Tewari-24]. Quantitatively, we show that every class $\mathcal{H}$ is learnable with mistake bound $O\left(\sqrt{\mathtt{L}(\mathcal{H})\, T \log T}\right)$ (where $\mathtt{L}(\mathcal{H})$ is the Littlestone dimension of $\mathcal{H}$), and that this is tight for some classes. We further study the agnostic case, in which the best hypothesis makes at most $k$ many mistakes, and prove a trichotomy stating that every class $\mathcal{H}$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{\mathcal{H}}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{\mathcal{H}}\left(k + \sqrt{T}\right)$, and deterministic mistake bound $\tilde{\Theta}_{\mathcal{H}}\left(\sqrt{k \cdot T}\right)$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta\left(\sqrt{T (k + \log n)}\right)$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts.

[LG-75] Tighter Risk Bounds for Mixtures of Experts

Link: https://arxiv.org/abs/2410.10397
Authors: Wissam Akretche, Frédéric LeBlanc, Mario Marchand
Keywords: local differential privacy, provide upper bounds, imposing local differential, gating mechanism, mixtures of experts
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Abstract:In this work, we provide upper bounds on the risk of mixtures of experts by imposing local differential privacy (LDP) on their gating mechanism. These theoretical guarantees are tailored to mixtures of experts that utilize the one-out-of- n gating mechanism, as opposed to the conventional n -out-of- n mechanism. The bounds exhibit logarithmic dependence on the number of experts, and encapsulate the dependence on the gating mechanism in the LDP parameter, making them significantly tighter than existing bounds, under reasonable conditions. Experimental results support our theory, demonstrating that our approach enhances the generalization ability of mixtures of experts and validating the feasibility of imposing LDP on the gating mechanism.

[LG-76] Improved Depth Estimation of Bayesian Neural Networks NEURIPS2024

Link: https://arxiv.org/abs/2410.10395
Authors: Bart van Erp, Bert de Vries
Keywords: Bayesian neural networks, Nazareth and Blei, paper proposes improvements, work by Nazareth, Bayesian neural
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty

Abstract:This paper proposes improvements over earlier work by Nazareth and Blei (2022) for estimating the depth of Bayesian neural networks. Here, we propose a discrete truncated normal distribution over the network depth to independently learn its mean and variance. Posterior distributions are inferred by minimizing the variational free energy, which balances the model complexity and accuracy. Our method improves test accuracy in the spiral data set and reduces the variance in posterior depth estimates.
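To illustrate what a distribution over depth with separately controlled mean and variance can look like, here is a minimal sketch. It assumes that "discrete truncated normal" means a normal density evaluated at integer depths 1..max_depth and renormalised; the paper's exact parameterisation may differ, and the function names are hypothetical.

```python
import math

def discrete_truncated_normal(mean, std, max_depth):
    """Probability mass over depths 1..max_depth, proportional to a normal density."""
    weights = [math.exp(-0.5 * ((d - mean) / std) ** 2) for d in range(1, max_depth + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def moments(pmf):
    """Mean and variance of a pmf whose support is 1..len(pmf)."""
    depths = range(1, len(pmf) + 1)
    m = sum(d * p for d, p in zip(depths, pmf))
    v = sum((d - m) ** 2 * p for d, p in zip(depths, pmf))
    return m, v

# Same location parameter, different spread: the two are set independently.
narrow = discrete_truncated_normal(mean=4.0, std=0.5, max_depth=10)
wide = discrete_truncated_normal(mean=4.0, std=2.0, max_depth=10)
```

Both distributions peak at depth 4, but `wide` spreads its mass over many more depths, which is the kind of uncertainty a variational posterior over depth would need to express.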

[LG-77] PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation NEURIPS2024

Link: https://arxiv.org/abs/2410.10394
Authors: Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang
Keywords: follow abstract user, Language-guided robotic manipulation, abstract user instructions, Language-guided robotic, waypOinT-aware world model
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work often fits the data trivially without revealing the relation between instructions and low-level executable actions; such models are prone to memorizing superficial patterns in the data instead of acquiring transferable knowledge, and are thus fragile under dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby reducing computational redundancy and improving execution efficiency. Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.

[LG-78] GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation

Link: https://arxiv.org/abs/2410.10393
Authors: Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo
Keywords: Time series, Time Series Forecasting, Time series foundation, handling diverse tasks, General Time Series
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. GIFT-Eval encompasses 28 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models. The codebase, datasets, and a leaderboard showing all the results in detail will be available soon.

[LG-79] Stein Variational Evolution Strategies

Link: https://arxiv.org/abs/2410.10390
Authors: Cornelius V. Braun, Robert T. Lange, Marc Toussaint
Keywords: Variational Gradient Descent, Gradient Descent, highly efficient method, Stein Variational Gradient, Stein Variational
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Stein Variational Gradient Descent (SVGD) is a highly efficient method to sample from an unnormalized probability distribution. However, the SVGD update relies on gradients of the log-density, which may not always be available. Existing gradient-free versions of SVGD make use of simple Monte Carlo approximations or gradients from surrogate distributions, both with limitations. To improve gradient-free Stein variational inference, we combine SVGD steps with evolution strategy (ES) updates. Our results demonstrate that the resulting algorithm generates high-quality samples from unnormalized target densities without requiring gradient information. Compared to prior gradient-free SVGD methods, we find that the integration of the ES update in SVGD significantly improves the performance on multiple challenging benchmark problems.
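For context, the standard gradient-based SVGD update that the abstract starts from can be sketched in a few lines. This is the textbook version with an RBF kernel on scalar particles, not the gradient-free ES variant the paper proposes; the target (a unit-variance normal centred at 2), step size, and bandwidth are arbitrary choices for illustration.

```python
import math

def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One gradient-based SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            # Kernel-weighted score pulls particles toward high density.
            attraction = k * grad_log_p(xj)
            # Derivative of the kernel with respect to xj pushes particles apart.
            repulsion = -(xj - xi) / bandwidth ** 2 * k
            phi += attraction + repulsion
        updated.append(xi + step_size * phi / n)
    return updated

def grad_log_normal(x, mean=2.0):
    """Score of a unit-variance normal; normalising constants drop out."""
    return -(x - mean)

particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(500):
    particles = svgd_step(particles, grad_log_normal)
sample_mean = sum(particles) / len(particles)
```

The update only ever calls `grad_log_p`, which is exactly the dependency the abstract points out: when the score of the target is unavailable, this term has to be approximated or replaced.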

[LG-80] Learning Sub-Second Routing Optimization in Computer Networks requires Packet-Level Dynamics

Link: https://arxiv.org/abs/2410.10377
Authors: Andreas Boltres, Niklas Freymuth, Patrick Jahnke, Holger Karl, Gerhard Neumann
Keywords: computer networking, Finding efficient routes, data packets, essential task, task in computer
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted at Transactions of Machine Learning Research (TMLR) 2024

Abstract:Finding efficient routes for data packets is an essential task in computer networking. The optimal routes depend greatly on the current network topology, state and traffic demand, and they can change within milliseconds. Reinforcement Learning can help to learn network representations that provide routing decisions for possibly novel situations. So far, this has commonly been done using fluid network models. We investigate their suitability for millisecond-scale adaptations with a range of traffic mixes and find that packet-level network models are necessary to capture true dynamics, in particular in the presence of TCP traffic. To this end, we present PackeRL, the first packet-level Reinforcement Learning environment for routing in generic network topologies. Our experiments confirm that learning-based strategies that have been trained in fluid environments do not generalize well to this more realistic, but more challenging setup. Hence, we also introduce two new algorithms for learning sub-second Routing Optimization. We present M-Slim, a dynamic shortest-path algorithm that excels at high traffic volumes but is computationally hard to scale to large network topologies, and FieldLines, a novel next-hop policy design that re-optimizes routing for any network topology within milliseconds without requiring any re-training. Both algorithms outperform current learning-based approaches as well as commonly used static baseline protocols in scenarios with high-traffic volumes. All findings are backed by extensive experiments in realistic network conditions in our fast and versatile training and evaluation framework.

[LG-81] Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training

Link: https://arxiv.org/abs/2410.10373
Authors: Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, Junchi Yan
Keywords: Sharpness-Aware Minimization, Stochastic Gradient Descent, SAM, substantially improved, neural networks
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 9 figures

Abstract:Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite the success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM, shedding light on its implicit bias towards flatter minima over Stochastic Gradient Descent (SGD). Specifically, we find that SAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. Subsequently, we delve deeper into the underlying mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after applying SAM late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) then rapidly converges to a flatter minimum within the same valley. Furthermore, we empirically investigate the role of SAM during the early training phase. We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution’s properties. Based on this viewpoint, we extend our findings from SAM to Adversarial Training.
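The "SAM only at the end of training" schedule described above can be sketched as follows. This uses the standard SAM step (perturb the weights by `rho` along the normalised gradient, then descend using the gradient taken at the perturbed point) on a toy quadratic with full-batch gradients. The loss, step counts, and hyperparameters are illustrative assumptions, not the paper's setup, and a convex toy problem cannot reproduce the flat-minimum selection effect itself.

```python
import math

def loss(w):
    """Toy quadratic with one sharp direction (w[0]) and one flat direction (w[1])."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return [10.0 * w[0], w[1]]

def sgd_step(w, lr):
    return [wi - lr * gi for wi, gi in zip(w, grad(w))]

def sam_step(w, lr, rho):
    """Ascend by rho along the normalised gradient, then descend with the gradient found there."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return [wi - lr * gi for wi, gi in zip(w, grad(w_adv))]

def train(w, total_steps=60, sam_steps=10, lr=0.05, rho=0.05):
    """Plain gradient descent first, SAM only for the final `sam_steps` updates."""
    for step in range(total_steps):
        if step >= total_steps - sam_steps:
            w = sam_step(w, lr, rho)
        else:
            w = sgd_step(w, lr)
    return w

w_final = train([1.0, 1.0])
```

Each SAM step costs two gradient evaluations instead of one, so restricting it to the last few updates is also the cheaper schedule.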

[LG-82] BookWorm: A Dataset for Character Description and Analysis EMNLP2024

Link: https://arxiv.org/abs/2410.10372
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Keywords: driving the plot, engaging readers, plot and engaging, numerous interacting characters, Characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 30 pages, 2 figures, EMNLP 2024 Findings

Abstract:Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.

[LG-83] Optimal Time Complexity Algorithms for Computing General Random Walk Graph Kernels on Sparse Graphs

Link: https://arxiv.org/abs/2410.10368
Authors: Krzysztof Choromanski, Isaac Reid, Arijit Sehanobish, Avinava Dubey
Keywords: complexity randomized algorithms, linear time complexity, time complexity randomized, randomized algorithms, algorithms for unbiased
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We present the first linear time complexity randomized algorithms for unbiased approximation of the celebrated family of general random walk kernels (RWKs) for sparse graphs. This includes both labelled and unlabelled instances. The previous fastest methods for general RWKs were of cubic time complexity and not applicable to labelled graphs. Our method samples dependent random walks to compute novel graph embeddings in $\mathbb{R}^d$ whose dot product is equal to the true RWK in expectation. It does so without instantiating the direct product graph in memory, meaning we can scale to massive datasets that cannot be stored on a single machine. We derive exponential concentration bounds to prove that our estimator is sharp, and show that the ability to approximate general RWKs (rather than just special cases) unlocks efficient implicit graph kernel learning. Our method is up to $\mathbf{27\times}$ faster than its counterparts for efficient computation on large graphs and scales to graphs $\mathbf{128\times}$ bigger than the largest examples amenable to brute-force computation.
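To see what "instantiating the direct product graph" involves, here is a brute-force sketch for the simplest case: unlabelled graphs, uniform start and stop weights, and a truncated geometric series with coefficient `lam` per step. These choices are assumptions made for illustration; the paper targets general RWKs and uses sampled random walks rather than this computation.

```python
def kron(a, b):
    """Adjacency matrix of the direct product graph (Kronecker product of a and b)."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def walk_counts(adj, max_len):
    """Total number of walks of each length 0..max_len, over all start and end nodes."""
    vec = [1.0] * len(adj)
    counts = [sum(vec)]
    for _ in range(max_len):
        vec = [sum(row[j] * vec[j] for j in range(len(adj))) for row in adj]
        counts.append(sum(vec))
    return counts

def rwk_brute_force(a1, a2, lam, max_len):
    """Weighted count of common walks, computed on the explicit product graph."""
    product = kron(a1, a2)
    return sum(lam ** k * c for k, c in enumerate(walk_counts(product, max_len)))

def rwk_factorised(a1, a2, lam, max_len):
    """Same value without the product graph; valid only for this unlabelled, uniform-weight case."""
    c1, c2 = walk_counts(a1, max_len), walk_counts(a2, max_len)
    return sum(lam ** k * x * y for k, (x, y) in enumerate(zip(c1, c2)))

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k_brute = rwk_brute_force(triangle, path3, lam=0.1, max_len=2)
k_fact = rwk_factorised(triangle, path3, lam=0.1, max_len=2)
```

The product graph has one node per pair of nodes, so its dense adjacency matrix grows with the square of the product of the two graph sizes, which is the memory cost that avoiding it removes.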

[LG-84] SpeGCL: Self-supervised Graph Spectrum Contrastive Learning without Positive Samples

Link: https://arxiv.org/abs/2410.10365
Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng
Keywords: GCL, existing GCL, existing GCL methods, GCL methods, Graph Contrastive Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures

Abstract:Graph Contrastive Learning (GCL) excels at managing noise and fluctuations in input data, making it popular in various fields (e.g., social networks and knowledge graphs). Our study finds that the difference in high-frequency information between augmented graphs is greater than that in low-frequency information. However, most existing GCL methods focus mainly on the time domain (low-frequency information) for node feature representations and cannot make good use of high-frequency information to speed up model convergence. Furthermore, existing GCL paradigms optimize graph embedding representations by pulling positive sample pairs closer together and pushing positive and negative sample pairs farther apart, but our theoretical analysis shows that graph contrastive learning benefits from pushing negative pairs farther away rather than pulling positive pairs closer. To solve the above-mentioned problems, we propose a novel spectral GCL framework without positive samples, named SpeGCL. Specifically, to solve the problem that existing GCL methods cannot utilize high-frequency information, SpeGCL uses a Fourier transform to extract high-frequency and low-frequency information of node features, and constructs a contrastive learning mechanism in a Fourier space to obtain better node feature representation. Furthermore, SpeGCL relies entirely on negative samples to refine the graph embedding. We also provide a theoretical justification for the efficacy of using only negative samples in SpeGCL. Extensive experiments on unsupervised learning, transfer learning, and semi-supervised learning have validated the superiority of our SpeGCL framework over the state-of-the-art GCL methods.

[LG-85] Replay-and-Forget-Free Graph Class-Incremental Learning: A Task Profiling and Prompting Approach NEURIPS2024

Link: https://arxiv.org/abs/2410.10341
Authors: Chaoxi Niu, Guansong Pang, Ling Chen, Bing Liu
Keywords: task, Graph, Class-incremental learning, CIL, aims to continually
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted by NeurIPS 2024

Abstract:Class-incremental learning (CIL) aims to continually learn a sequence of tasks, with each task consisting of a set of unique classes. Graph CIL (GCIL) follows the same setting but needs to deal with graph tasks (e.g., node classification in a graph). The key characteristic of CIL lies in the absence of task identifiers (IDs) during inference, which causes a significant challenge in separating classes from different tasks (i.e., inter-task class separation). Being able to accurately predict the task IDs can help address this issue, but it is a challenging problem. In this paper, we show theoretically that accurate task ID prediction on graph data can be achieved by a Laplacian smoothing-based graph task profiling approach, in which each graph task is modeled by a task prototype based on Laplacian smoothing over the graph. It guarantees that the task prototypes of the same graph task are nearly the same with a large smoothing step, while those of different tasks are distinct due to differences in graph structure and node attributes. Further, to avoid the catastrophic forgetting of the knowledge learned in previous graph tasks, we propose a novel graph prompting approach for GCIL which learns a small discriminative graph prompt for each task, essentially resulting in a separate classification model for each task. The prompt learning requires the training of a single graph neural network (GNN) only once on the first task, and no data replay is required thereafter, thereby obtaining a GCIL model being both replay-free and forget-free. Extensive experiments on four GCIL benchmarks show that i) our task prototype-based method can achieve 100% task ID prediction accuracy on all four datasets, ii) our GCIL model significantly outperforms state-of-the-art competing methods by at least 18% in average CIL accuracy, and iii) our model is fully free of forgetting on the four datasets.
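The task-profiling idea can be illustrated with a rough sketch: smooth the node features over the graph several times, average them into one prototype per task, and assign a test graph to the task with the nearest prototype. The smoothing operator used here (symmetrically normalised adjacency with self-loops), the mean pooling, the Euclidean distance, and the toy graphs are all assumptions for illustration; the paper's exact construction may differ.

```python
import math

def smooth(adj, feats, steps):
    """Apply Laplacian smoothing `steps` times using D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    dim = len(feats[0])
    for _ in range(steps):
        feats = [[sum(norm[i][k] * feats[k][d] for k in range(n)) for d in range(dim)]
                 for i in range(n)]
    return feats

def task_prototype(adj, feats, steps=10):
    """Mean of the smoothed node features, used as a profile of the whole graph task."""
    smoothed = smooth(adj, feats, steps)
    dim = len(feats[0])
    return [sum(row[d] for row in smoothed) / len(smoothed) for d in range(dim)]

def predict_task(prototypes, query):
    """Index of the stored prototype closest to the query prototype."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return min(range(len(prototypes)), key=lambda t: sq_dist(prototypes[t]))

# Two toy tasks with different structure and attributes (made-up data).
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
prototypes = [
    task_prototype(path3, [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1]]),
    task_prototype(triangle, [[0.0, 1.0], [0.0, 1.0], [0.1, 0.9]]),
]

# Test-time graphs from each task, with slightly different node attributes.
query_a = task_prototype(path3, [[0.95, 0.05], [1.0, 0.0], [1.0, 0.0]])
query_b = task_prototype(triangle, [[0.05, 0.95], [0.0, 1.0], [0.0, 1.0]])
```

Calling `predict_task(prototypes, query_a)` returns the index of the first task, since repeated smoothing washes out small node-level differences while the task-level differences in structure and attributes remain.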

[LG-86] CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

Link: https://arxiv.org/abs/2410.10336
Authors: Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen
Keywords: Mathematically Annotated Thought, large language models, remains a significant, significant challenge, challenge for large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 8 pages, 12 figures

Abstract:Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.

[LG-87] GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs

Link: https://arxiv.org/abs/2410.10329
Authors: Yun Zhu, Haizhou Shi, Xiaotang Wang, Yongchao Liu, Yaoke Wang, Boci Peng, Chuntao Hong, Siliang Tang
Keywords: Large Language Models, Large Language, bolster TAG methodologies, gained significant attention, significant attention due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Recently, research on Text-Attributed Graphs (TAGs) has gained significant attention due to the prevalence of free-text node features in real-world applications and the advancements in Large Language Models (LLMs) that bolster TAG methodologies. However, current TAG approaches face two primary challenges: (i) Heavy reliance on label information and (ii) Limited cross-domain zero/few-shot transferability. These issues constrain the scaling of both data and model size, owing to high labor costs and scaling laws, complicating the development of graph foundation models with strong transferability. In this work, we propose the GraphCLIP framework to address these challenges by learning graph foundation models with strong cross-domain zero/few-shot transferability through a self-supervised contrastive graph-summary pretraining method. Specifically, we generate and curate large-scale graph-summary pair data with the assistance of LLMs, and introduce a novel graph-summary pretraining method, combined with invariant learning, to enhance graph foundation models with strong cross-domain zero-shot transferability. For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective to mitigate catastrophic forgetting and minimize learning costs. Extensive experiments show the superiority of GraphCLIP in both zero-shot and few-shot settings, while evaluations across various downstream tasks confirm the versatility of GraphCLIP. Our code is available at: this https URL

[LG-88] Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks

Link: https://arxiv.org/abs/2410.10322
Authors: Binghui Li, Zhixuan Pan, Kaifeng Lyu, Jian Li
Keywords: principal factors contributing, gradient descent training, gradient descent, deep neural networks, Feature Averaging
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 78 pages, 10 figures

Abstract:In this work, we investigate a particular implicit bias in the gradient descent training process, which we term “Feature Averaging”, and argue that it is one of the principal factors contributing to non-robustness of deep neural networks. Despite the existence of multiple discriminative features capable of classifying data, neural networks trained by gradient descent exhibit a tendency to learn the average (or certain combination) of these features, rather than distinguishing and leveraging each feature individually. In particular, we provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task, where the data distribution consists of multiple clusters with orthogonal cluster center vectors. We rigorously prove that gradient descent converges to the regime of feature averaging, wherein the weights associated with each hidden-layer neuron represent an average of the cluster centers (each center corresponding to a distinct feature). This makes the network classifier non-robust to an attack that aligns with the negative direction of the averaged features. Furthermore, we prove that, with the provision of more granular supervised information, a two-layer multi-class neural network is capable of learning individual features, from which one can derive a binary classifier with the optimal robustness under our setting. We also conduct extensive experiments using synthetic datasets, MNIST and CIFAR-10 to substantiate the phenomenon of feature averaging and its role in adversarial robustness of neural networks. We hope the theoretical and empirical insights can provide a deeper understanding of the impact of gradient descent training on the feature learning process, which in turn influences the robustness of the network, and how more detailed supervision may enhance model robustness.

[LG-89] DiRW: Path-Aware Digraph Learning for Heterophily

Link: https://arxiv.org/abs/2410.10320
Authors: Daohan Su, Xunkai Li, Zhenjun Li, Yinping Liao, Rong-Hua Li, Guoren Wang
Keywords: powerful representation learning, representation learning tool, graph-structured data, powerful representation, tool for graph-structured
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Recently, graph neural networks (GNNs) have emerged as a powerful representation learning tool for graph-structured data. However, most approaches are tailored for undirected graphs, neglecting the abundant information embedded in the edges of directed graphs (digraphs). In fact, digraphs are widely applied in the real world (e.g., social networks and recommendations) and are also confirmed to offer a new perspective for addressing topological heterophily challenges (i.e., connected nodes have complex patterns of feature distribution or labels). Despite recent significant advancements in DiGNNs, existing spatial- and spectral-based methods have inherent limitations due to the complex learning mechanisms and reliance on high-quality topology, leading to low efficiency and unstable performance. To address these issues, we propose Directed Random Walk (DiRW), which can be viewed as a plug-and-play strategy or an innovative neural architecture that provides guidance or a new learning paradigm for most spatial-based methods or digraphs. Specifically, DiRW incorporates a direction-aware path sampler optimized from the perspectives of walk probability, length, and number in a weight-free manner by considering node profiles and topological structure. Building upon this, DiRW utilizes a node-wise learnable path aggregator for generalized messages obtained by our proposed adaptive walkers to represent the current node. Extensive experiments on 9 datasets demonstrate that DiRW: (1) enhances most spatial-based methods as a plug-and-play strategy; (2) achieves SOTA performance as a new digraph learning paradigm.

[LG-90] QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models NEURIPS2024

Link: https://arxiv.org/abs/2410.10318
Authors: Zhumazhan Balapanov, Edward Magongo, Vanessa Matvei, Olivia Holmberg, Jonathan Pei, Kevin Zhu
Keywords: Convolutional neural networks, computer vision tasks, limit real-world applicability, made significant advances, Convolutional neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024 workshop on Neural Compression

Abstract:Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency often limit real-world applicability. While model compression techniques have gained popularity as solutions, they often overlook the critical balance between low latency and uncompromised accuracy. By harnessing quantum-inspired pruning, tensor decomposition, and annealing-based matrix factorization - three quantum-inspired concepts - we introduce QIANets: a novel approach of redesigning the traditional GoogLeNet, DenseNet, and ResNet-18 model architectures to process more parameters and computations whilst maintaining low inference times. Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times, along with effective accuracy preservations.

[LG-91] A Multi-Task Text Classification Pipeline with Natural Language Explanations: A User-Centric Evaluation in Sentiment Analysis and Offensive Language Identification in Greek Tweets

Link: https://arxiv.org/abs/2410.10290
Authors: Nikolaos Mylonas, Nikolaos Stylianou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris
Keywords: past few years, existing interpretability techniques, interpretability techniques produce, existing interpretability, interpretability techniques
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work In Progress

Abstract:Interpretability is a topic that has been in the spotlight for the past few years. Most existing interpretability techniques produce interpretations in the form of rules or feature importance. These interpretations, while informative, may be harder for non-expert users to understand and therefore cannot always be considered adequate explanations. To that end, explanations in natural language are often preferred, as they are easier to comprehend and also more presentable to end-users. This work introduces an early concept for a novel pipeline that can be used in text classification tasks, offering predictions and explanations in natural language. It comprises two models: a classifier for labelling the text and an explanation generator which provides the explanation. The proposed pipeline can be adopted by any text classification task, given that ground truth rationales are available to train the explanation generator. Our experiments are centred around the tasks of sentiment analysis and offensive language identification in Greek tweets, using a Greek Large Language Model (LLM) to obtain the necessary explanations that can act as rationales. The experimental evaluation was performed through a user study based on three different metrics and achieved promising results for both datasets.

[LG-92] ABBA-VSM: Time Series Classification using Symbolic Representation on the Edge

Link: https://arxiv.org/abs/2410.10285
Authors: Meerzhan Kanatbekova, Shashikant Ilager, Ivona Brandic
Keywords: smart city management, recent years, city management, Edge, Internet of Things
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages with references, 5 figures

点击查看摘要

Abstract: In recent years, Edge AI has become more prevalent with applications across various industries, from environmental monitoring to smart city management. Edge AI facilitates the processing of Internet of Things (IoT) data and provides privacy-enabled and latency-sensitive services to application users using Machine Learning (ML) algorithms, e.g., Time Series Classification (TSC). However, existing TSC algorithms require access to full raw data and demand substantial computing resources to train and use them effectively at runtime. This makes them impractical for deployment in resource-constrained Edge environments. To address this, in this paper, we propose an Adaptive Brownian Bridge-based Symbolic Aggregation Vector Space Model (ABBA-VSM). It is a new TSC model designed for classification services on Edge. Here, we first adaptively compress the raw time series into symbolic representations, thus capturing the changing trends of data. Subsequently, we train the classification model directly on these symbols. ABBA-VSM reduces communication data between IoT and Edge devices, as well as computation cycles, in the development of resource-efficient TSC services on Edge. We evaluate our solution with extensive experiments using datasets from the UCR time series classification archive. The results demonstrate that ABBA-VSM achieves up to 80% compression ratio and 90-100% accuracy for binary classification. For non-binary classification, it achieves an average compression ratio of 60% and accuracy ranging from 60% to 80%.

[LG-93] QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis

Link: https://arxiv.org/abs/2410.10270
Authors: Abhijit Manatkar, Ashlesha Akella, Parthivi Gupta, Krishnasuri Narayanam
Keywords: Exploratory Data Analysis, Discovering meaningful insights, Large Language Models, Discovering meaningful, Exploratory Data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 6 pages

Abstract:Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.

[LG-94] Matrix Sketching in Bandits: Current Pitfalls and New Framework

Link: https://arxiv.org/abs/2410.10258
Authors: Dongxie Wen, Hanyan Yin, Xiao Zhang, Zhewei Wei
Keywords: Dyadic Block Sketching, covariance matrix, matrix sketching, techniques has progressively, progressively emerged
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: The utilization of sketching techniques has progressively emerged as a pivotal method for enhancing the efficiency of online learning. In linear bandit settings, current sketch-based approaches leverage matrix sketching to reduce the per-round time complexity from $\Omega(d^2)$ to $O(d)$, where $d$ is the input dimension. Despite this improved efficiency, these approaches encounter critical pitfalls: if the spectral tail of the covariance matrix does not decrease rapidly, it can lead to linear regret. In this paper, we revisit the regret analysis and algorithm design concerning approximating the covariance matrix using matrix sketching in linear bandits. We illustrate how inappropriate sketch sizes can result in unbounded spectral loss, thereby causing linear regret. To prevent this issue, we propose Dyadic Block Sketching, an innovative streaming matrix sketching approach that adaptively manages sketch size to constrain global spectral loss. This approach effectively tracks the best rank-$k$ approximation in an online manner, ensuring efficiency when the geometry of the covariance matrix is favorable. Then, we apply the proposed Dyadic Block Sketching to linear bandits and demonstrate that the resulting bandit algorithm can achieve sublinear regret without prior knowledge of the covariance matrix, even under the worst case. Our method is a general framework for efficient sketch-based linear bandits, applicable to all existing sketch-based approaches, and offers improved regret bounds accordingly. Additionally, we conduct comprehensive empirical studies using both synthetic and real-world data to validate the accuracy of our theoretical findings and to highlight the effectiveness of our algorithm.

[LG-95] LoLCATs: On Low-Rank Linearizing of Large Language Models

Link: https://arxiv.org/abs/2410.10254
Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré
Keywords: popular Transformer-based LLMs, Recent works show, expensive pretraining costs, linearize large language, popular Transformer-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 47 pages, 20 figures, 18 tables, preprint

Abstract: Recent works show we can linearize large language models (LLMs) – swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention – avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitude less memory and compute. We base these steps on two findings. First, we can replace an LLM’s softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss (“attention transfer”). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods’ model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
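
The "attention transfer" objective in the abstract above can be illustrated with a minimal sketch: compute a softmax attention output, compute a linear attention output for the same inputs, and take the mean squared error between them. The feature map used here (`elu(x) + 1`) is an assumption chosen for simplicity, not the learned feature map of the paper, and all function names are hypothetical. The sketch only evaluates the loss for one query; it does not train anything.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_average(weights: List[float], values: List[Vector]) -> Vector:
    """Average the value vectors using normalized non-negative weights."""
    total = sum(weights)
    dim = len(values[0])
    return [dot(weights, [v[j] for v in values]) / total for j in range(dim)]


def softmax_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    scores = [dot(query, k) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    return weighted_average([math.exp(s - top) for s in scores], values)


def feature_map(x: Vector) -> Vector:
    # Assumption: elu(x) + 1, a simple positive feature map (not the paper's learned map).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Weights are phi(q) . phi(k). Real linear attention reorders this product
    # to avoid the quadratic cost; the weights are explicit here for clarity.
    phi_q = feature_map(query)
    return weighted_average([dot(phi_q, feature_map(k)) for k in keys], values)


def attention_transfer_loss(query: Vector, keys: List[Vector], values: List[Vector]) -> float:
    """Output MSE between the linear attention and its softmax counterpart."""
    target = softmax_attention(query, keys, values)
    approx = linear_attention(query, keys, values)
    return sum((a - t) ** 2 for a, t in zip(approx, target)) / len(target)
```

In a training setup, this loss would be minimized with respect to the feature map parameters while the original model weights stay frozen; the fixed feature map here has nothing to learn and only shows what is being compared.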

[LG-96] Feedback Favors the Generalization of Neural ODEs

Link: https://arxiv.org/abs/2410.10253
Authors: Jindou Jia, Zihan Yang, Meng Wang, Kexin Guo, Jianfei Yang, Xiang Yu, Lei Guo
Keywords: well-known generalization problem, generalization problem hinders, varying latent dynamics, problem hinders, hinders the application
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 17 figures

Abstract:The well-known generalization problem hinders the application of artificial neural networks in continuous-time prediction tasks with varying latent dynamics. In sharp contrast, biological systems can neatly adapt to evolving environments benefiting from real-time feedback mechanisms. Inspired by the feedback philosophy, we present feedback neural networks, showing that a feedback loop can flexibly correct the learned latent dynamics of neural ordinary differential equations (neural ODEs), leading to a prominent generalization improvement. The feedback neural network is a novel two-DOF neural network, which possesses robust performance in unseen scenarios with no loss of accuracy performance on previous tasks. A linear feedback form is presented to correct the learned latent dynamics firstly, with a convergence guarantee. Then, domain randomization is utilized to learn a nonlinear neural feedback form. Finally, extensive tests including trajectory prediction of a real irregular object and model predictive control of a quadrotor with various uncertainties, are implemented, indicating significant improvements over state-of-the-art model-based and learning-based methods.

[LG-97] Measurability in the Fundamental Theorem of Statistical Learning

Link: https://arxiv.org/abs/2410.10243
Authors: Lothar Sebastian Krapp, Laura Wirth
Keywords: Statistical Learning states, Statistical Learning, Fundamental Theorem, dimension is finite, PAC learnable
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Logic (math.LO); Probability (math.PR); Machine Learning (stat.ML)
Comments: 41 pages plus appendix

Abstract:The Fundamental Theorem of Statistical Learning states that a hypothesis space is PAC learnable if and only if its VC dimension is finite. For the agnostic model of PAC learning, the literature so far presents proofs of this theorem that often tacitly impose several measurability assumptions on the involved sets and functions. We scrutinize these proofs from a measure-theoretic perspective in order to extract the assumptions needed for a rigorous argument. This leads to a sound statement as well as a detailed and self-contained proof of the Fundamental Theorem of Statistical Learning in the agnostic setting, showcasing the minimal measurability requirements needed. We then discuss applications in Model Theory, considering NIP and o-minimal structures. Our main theorem presents sufficient conditions for the PAC learnability of hypothesis spaces defined over o-minimal expansions of the reals.

[LG-98] Revisiting and Benchmarking Graph Autoencoders: A Contrastive Learning Perspective

Link: https://arxiv.org/abs/2410.10241
Authors: Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Xinzhou Jin, Guibin Zhang, Zulun Zhu, Zibin Zheng, Liang Chen
Keywords: low-dimensional latent space, GAEs, self-supervised learning models, latent space, graph-structured data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Preprint, under review

Abstract:Graph autoencoders (GAEs) are self-supervised learning models that can learn meaningful representations of graph-structured data by reconstructing the input graph from a low-dimensional latent space. Over the past few years, GAEs have gained significant attention in academia and industry. In particular, the recent advent of GAEs with masked autoencoding schemes marks a significant advancement in graph self-supervised learning research. While numerous GAEs have been proposed, the underlying mechanisms of GAEs are not well understood, and a comprehensive benchmark for GAEs is still lacking. In this work, we bridge the gap between GAEs and contrastive learning by establishing conceptual and methodological connections. We revisit the GAEs studied in previous works and demonstrate how contrastive learning principles can be applied to GAEs. Motivated by these insights, we introduce lrGAE (left-right GAE), a general and powerful GAE framework that leverages contrastive learning principles to learn meaningful representations. Our proposed lrGAE not only facilitates a deeper understanding of GAEs but also sets a new benchmark for GAEs across diverse graph-based learning tasks. The source code for lrGAE, including the baselines and all the code for reproducing the results, is publicly available at this https URL.

[LG-99] SkillAggregation: Reference-free LLM-Dependent Aggregation

Link: https://arxiv.org/abs/2410.10215
Authors: Guangzhi Sun, Anmol Kagrecha, Potsawee Manakul, Phil Woodland, Mark Gales
Keywords: Large Language Models, Large Language, Language Models, generate human-like judgments, assess NLP tasks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: Large Language Models (LLMs) are increasingly used to assess NLP tasks due to their ability to generate human-like judgments. Single LLMs were used initially; however, recent work suggests that using multiple LLMs as judges yields improved performance. An important step in exploiting multiple judgements is the combination stage, aggregation. Existing methods in NLP either assign equal weight to all LLM judgments or are designed for specific tasks such as hallucination detection. This work focuses on aggregating predictions from multiple systems where no reference labels are available. A new method called SkillAggregation is proposed, which learns to combine estimates from LLM judges without needing additional data or ground truth. It extends the Crowdlayer aggregation method, developed for image classification, to exploit the judge estimates during inference. The approach is compared to a range of standard aggregation methods on HaluEval-Dialogue, TruthfulQA and Chatbot Arena tasks. SkillAggregation outperforms Crowdlayer on all tasks, and yields the best performance over all approaches on the majority of tasks.

[LG-100] Large Language Model-Enhanced Reinforcement Learning for Generic Bus Holding Control Strategies

Link: https://arxiv.org/abs/2410.10212
Authors: Jiajie Yu, Yuhong Wang, Wei Ma
Keywords: Bus holding control, Bus holding, bus holding strategies, widely-adopted strategy, strategy for maintaining
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 15 figures

Abstract:Bus holding control is a widely-adopted strategy for maintaining stability and improving the operational efficiency of bus systems. Traditional model-based methods often face challenges with the low accuracy of bus state prediction and passenger demand estimation. In contrast, Reinforcement Learning (RL), as a data-driven approach, has demonstrated great potential in formulating bus holding strategies. RL determines the optimal control strategies in order to maximize the cumulative reward, which reflects the overall control goals. However, translating sparse and delayed control goals in real-world tasks into dense and real-time rewards for RL is challenging, normally requiring extensive manual trial-and-error. In view of this, this study introduces an automatic reward generation paradigm by leveraging the in-context learning and reasoning capabilities of Large Language Models (LLMs). This new paradigm, termed the LLM-enhanced RL, comprises several LLM-based modules: reward initializer, reward modifier, performance analyzer, and reward refiner. These modules cooperate to initialize and iteratively improve the reward function according to the feedback from training and test results for the specified RL-based task. Ineffective reward functions generated by the LLM are filtered out to ensure the stable evolution of the RL agents’ performance over iterations. To evaluate the feasibility of the proposed LLM-enhanced RL paradigm, it is applied to various bus holding control scenarios, including a synthetic single-line system and a real-world multi-line system. The results demonstrate the superiority and robustness of the proposed paradigm compared to vanilla RL strategies, the LLM-based controller, and conventional space headway-based feedback control. This study sheds light on the great potential of utilizing LLMs in various smart mobility applications.

[LG-101] Fed-piLot: Optimizing LoRA Assignment for Efficient Federated Foundation Model Fine-Tuning

Link: https://arxiv.org/abs/2410.10200
Authors: Zikai Zhang, Jiahao Xu, Ping Liu, Rui Hu
Keywords: shown remarkable advancements, Foundation models, intelligent applications, shown remarkable, remarkable advancements
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Foundation models (FMs) have shown remarkable advancements in enhancing the performance of intelligent applications. To address the need for data privacy in FM fine-tuning, federated learning has emerged as the de facto framework. Specifically, Federated FMs (FedFMs) fine-tuning using low-rank adaptation (LoRA) modules instead of the full model over multiple clients can achieve both parameter efficiency and data privacy. However, recent studies rarely address the challenges posed by clients with heterogeneous resources, particularly in GPU memory capacity. In this paper, we introduce Fed-piLot, an efficient FedFM fine-tuning framework with optimized local LoRA assignments for heterogeneous clients. By emphasizing the different memory consumption for training different LoRA layers, as well as the varying contributions of different layers to model performance, we formulate the LoRA assignment as a Knapsack Optimization Problem. We design a Local-Global Information Gain Score (IG-Score) based value function to optimize LoRA assignment under clients’ memory constraints. To further mitigate the impact of heterogeneity in model updates, we propose a novel Spatial-Temporal model aggregation (STAgg) rule using the Dynamic Weight Adjustment (DWA) strategy. Experimental results on three datasets under both IID and non-IID conditions demonstrate the effectiveness and efficiency of Fed-piLot. The code will be publicly available.
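
The abstract above frames LoRA assignment as a knapsack problem: each layer has a memory cost and a value, and a client picks the subset that fits its memory budget. A minimal sketch of the classic 0/1 knapsack dynamic program is shown here for illustration. The function name, the integer memory units, and the per-layer values are all assumptions; the paper's IG-Score value function and its actual solver are not reproduced.

```python
from typing import List, Tuple


def assign_lora_layers(
    memory_costs: List[int], values: List[float], budget: int
) -> Tuple[float, List[int]]:
    """Pick the layer subset with maximum total value within the memory budget.

    memory_costs: hypothetical per-layer memory cost in integer units.
    values: hypothetical per-layer value scores (stand-in for an IG-Score).
    Returns (best total value, sorted indices of the chosen layers).
    """
    n = len(memory_costs)
    # best[i][b] = best value using only the first i layers with memory b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, value = memory_costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # skip layer i-1
            if cost <= b:
                candidate = best[i - 1][b - cost] + value  # take layer i-1
                if candidate > best[i][b]:
                    best[i][b] = candidate

    # Walk the table backwards to recover which layers were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= memory_costs[i - 1]
    chosen.reverse()
    return best[n][budget], chosen
```

For example, `assign_lora_layers([3, 4, 2], [4.0, 5.0, 3.0], 6)` selects layers 1 and 2, which together use the full budget of 6 units. A client with a smaller budget would simply call the same function with a smaller `budget`.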

[LG-102] Predicting from Strings: Language Model Embeddings for Bayesian Optimization

Link: https://arxiv.org/abs/2410.10190
Authors: Tung Nguyen, Qiuyi Zhang, Bangding Yang, Chansoo Lee, Jorg Bornschein, Yingjie Miao, Sagi Perel, Yutian Chen, Xingyou Song
Keywords: improving search efficiency, fixed search spaces, tabular input features, search efficiency, Bayesian Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding capabilities of pretrained language models. By expressing all inputs as strings, we are able to perform general-purpose regression for Bayesian Optimization over various domains including synthetic, combinatorial, and hyperparameter optimization, obtaining comparable results to state-of-the-art Gaussian Process-based algorithms. Code can be found at this http URL.

[LG-103] Hamiltonian Neural Networks for Robust Out-of-Time Credit Scoring

Link: https://arxiv.org/abs/2410.10182
Authors: Javier Marín
Keywords: financial risk management, Hamiltonian-inspired neural network, neural network approach, designed to address, prediction in financial
Categories: Machine Learning (cs.LG)

Abstract:This paper introduces a novel Hamiltonian-inspired neural network approach to credit scoring, designed to address the challenges of class imbalance and out-of-time (OOT) prediction in financial risk management. Drawing from concepts in Hamiltonian mechanics, we develop a symplectic optimizer and a new loss function to capture the complex dynamics of credit risk evolution. Using the Freddie Mac Single-Family Loan-Level Dataset, we evaluate our model’s performance against other machine learning approaches. Our method shows superior discriminative power in OOT scenarios, as measured by the Area Under the Curve (AUC), indicating better ranking ability and robustness to class imbalance. The Hamiltonian-inspired approach shows particular strength in maintaining consistent performance between in-sample and OOT test sets, suggesting improved generalization to future, unseen data. These findings suggest that physics-inspired techniques offer a promising direction for developing more robust and reliable credit scoring models, particularly in uncertain economic situations.

[LG-104] Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior

Link: https://arxiv.org/abs/2410.10180
Authors: Mingyuan Yan, Jiawei Wu, Rushi Shah, Dianbo Liu
Keywords: Quantized Variational Autoencoder, machine learning, Vector Quantized Variational, widely used method, method to map
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Vector quantization is a widely used method to map continuous representations to a discrete space and has important applications in tokenization for generative models, bottlenecking information, and many other tasks in machine learning. Vector Quantized Variational Autoencoder (VQ-VAE) is a type of variational autoencoder using discrete embeddings as latents. We generalize the technique further, enriching the probabilistic framework with a Gaussian mixture as the underlying generative model. This framework leverages a codebook of latent means and adaptive variances to capture complex data distributions. This principled framework avoids various heuristics and strong assumptions that are needed with the VQ-VAE to address training instability and to improve codebook utilization. This approach integrates the benefits of both discrete and continuous representations within a variational Bayesian framework. Furthermore, by introducing the Aggregated Categorical Posterior Evidence Lower Bound (ALBO), we offer a principled alternative optimization objective that aligns variational distributions with the generative model. Our experiments demonstrate that GM-VQ improves codebook utilization and reduces information loss without relying on handcrafted heuristics.
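
As background for the abstract above, the basic vector quantization step (mapping a continuous vector to a discrete codebook index) can be sketched as a nearest-neighbour lookup. This is plain hard-assignment quantization as used in a standard VQ-VAE, not the Gaussian-mixture formulation the paper proposes; the function name and the toy codebook are assumptions for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def quantize(vector: Vector, codebook: List[Vector]) -> Tuple[int, Vector]:
    """Return the index and entry of the codebook vector closest to `vector`.

    Closeness is measured by squared Euclidean distance; ties go to the
    lowest index.
    """
    distances = [
        sum((a - b) ** 2 for a, b in zip(vector, code)) for code in codebook
    ]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]


# Toy codebook with three 2-dimensional entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, code = quantize([0.9, 1.2], codebook)
```

The discrete `index` is what a downstream model would consume as a token. The hard `min` is the step that a probabilistic treatment such as a Gaussian mixture replaces with a posterior over codebook entries.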

[LG-105] Is Parameter Collision Hindering Continual Learning in LLMs?

Link: https://arxiv.org/abs/2410.10179
Authors: Shuo Yang, Kun-Peng Ning, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Yi-Bing Song, Li Yuan
Keywords: Large Language Models, Large Language, Language Models, making continual learning, dynamic deployment
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract: Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on constructing orthogonality tasks to decouple parameter interdependence from various domains. In this paper, we reveal that building non-collision parameters is a more critical factor in addressing CL challenges. Our theoretical and experimental analyses demonstrate that non-collision parameters can provide better task orthogonality, which is a sufficient but unnecessary condition. Furthermore, knowledge from multiple domains will be preserved in non-collision parameter subspaces, making it more difficult to forget previously seen data. Leveraging this insight, we propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach leveraging low collision rates to enhance CL in LLMs. Experimental results on multiple CL benchmarks indicate that N-LoRA achieves superior performance (+2.9), higher task orthogonality (4.1 times), and lower parameter collision (58.1 times) than SOTA methods.

[LG-106] GUISE: Graph GaUssIan Shading watErmark

Link: https://arxiv.org/abs/2410.10178
Authors: Renyi Yang
Keywords: generative artificial intelligence, integrating robust watermarking, robust watermarking technologies, protect intellectual property, maintain content authenticity
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)

Abstract: In the expanding field of generative artificial intelligence, integrating robust watermarking technologies is essential to protect intellectual property and maintain content authenticity. Traditionally, watermarking techniques have been developed primarily for rich information media such as images and audio. However, these methods have not been adequately adapted for graph-based data, particularly molecular graphs. Latent 3D graph diffusion (LDM-3DG) is an ascendant approach in the molecular graph generation field. This model effectively manages the complexities of molecular structures, preserving essential symmetries and topological features. We adapt Gaussian Shading, a proven performance-lossless watermarking technique, to the latent graph diffusion domain to protect this sophisticated new technology. Our adaptation simplifies the watermark diffusion process through duplication and padding, making it adaptable and suitable for various message types. We conduct several experiments using the LDM-3DG model on the publicly available datasets QM9 and Drugs to assess the robustness and effectiveness of our technique. Our results demonstrate that the watermarked molecules maintain statistical parity in 9 out of 10 performance metrics compared to the original. Moreover, they exhibit a 100% detection rate and a 99% extraction rate in a 2D decoded pipeline, while also showing robustness against post-editing attacks.

[LG-107] Identity-Focused Inference and Extraction Attacks on Diffusion Models

Link: https://arxiv.org/abs/2410.10177
Authors: Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, Prasant Mohapatra
Keywords: generating synthetic images, increasing reliance, generating synthetic, amplified concerns, inference
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 5 figures, 3 tables, 12 pages main body content

Abstract:The increasing reliance on diffusion models for generating synthetic images has amplified concerns about the unauthorized use of personal data, particularly facial images, in model training. In this paper, we introduce a novel identity inference framework to hold model owners accountable for including individuals’ identities in their training data. Our approach moves beyond traditional membership inference attacks by focusing on identity-level inference, providing a new perspective on data privacy violations. Through comprehensive evaluations on two facial image datasets, Labeled Faces in the Wild (LFW) and CelebA, our experiments demonstrate that the proposed membership inference attack surpasses baseline methods, achieving an attack success rate of up to 89% and an AUC-ROC of 0.91, while the identity inference attack attains 92% on LDM models trained on LFW, and the data extraction attack achieves 91.6% accuracy on DDPMs, validating the effectiveness of our approach across diffusion models.

[LG-108] Balanced Neural ODEs: nonlinear model order reduction and Koopman operator approximations

Link: https://arxiv.org/abs/2410.10174
Authors: Julius Aka, Johannes Brunnemann, Jörg Eiden, Arne Speerforck, Lars Mikelsons
Keywords: compact latent representations, learning compact latent, Variational Autoencoders, transient system dynamics, learning transient system
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Conference paper under review

Abstract:Variational Autoencoders (VAEs) are a powerful framework for learning compact latent representations, while NeuralODEs excel in learning transient system dynamics. This work combines the strengths of both to create fast surrogate models with adjustable complexity. By leveraging the VAE’s dimensionality reduction using a non-hierarchical prior, our method adaptively assigns stochastic noise, naturally complementing known NeuralODE training enhancements and enabling probabilistic time series modeling. We show that standard Latent ODEs struggle with dimensionality reduction in systems with time-varying inputs. Our approach mitigates this by continuously propagating variational parameters through time, establishing fixed information channels in latent space. This results in a flexible and robust method that can learn different system complexities, e.g. deep neural networks or linear matrices. Hereby, it enables efficient approximation of the Koopman operator without the need for predefining its dimensionality. As our method balances dimensionality reduction and reconstruction accuracy, we call it Balanced Neural ODE (B-NODE). We demonstrate the effectiveness of this method on academic test cases and apply it to a real-world example of a thermal power plant.

[LG-109] Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.10166
Authors: Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, Kimin Lee
Keywords: human feedback datasets, human feedback, aligning model behavior, feedback datasets, feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The preference margin, calculated using a proxy reward model, is used to identify samples with high informational value and thereby address the noisy nature of the feedback dataset. Additionally, we incorporate text quality, assessed by large language models to prevent harmful content, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, approximating the solution by assigning an importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets.

[LG-110] HSR-Enhanced Sparse Attention Acceleration

Link: https://arxiv.org/abs/2410.10165
Authors: Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. This paper introduces a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to rapidly identify non-zero or “massively activated” entries in the attention matrix. We present theoretical analyses for two key scenarios: attention generation and full attention computation with long input context. Our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive approach $O(mn)$ for attention generation, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time of full attention computation from $O(mn)$ to $O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5})$. Importantly, our method introduces no error for ReLU attention and only provably negligible error for Softmax attention, where the latter is supported by our empirical validation. This work represents a significant step towards enabling efficient long-context processing in LLMs, potentially broadening their applicability across various domains.

[LG-111] Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Link: https://arxiv.org/abs/2410.10158
Authors: Kihyun Yu, Duksang Lee, William Overman, Dabeen Lee
Keywords: tabular constrained Markov, constrained Markov decision, Markov decision process, reinforcement learning problem, learning problem formulated
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1} H^{2.5} S\sqrt{AK})$, where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, it nearly matches the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$. We deduce our cost and reward function estimators via a Bellman-type law of total variance to obtain tight bounds on the expected sum of the variances of value function estimates. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
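A short worked substitution, using only the two bounds quoted in the abstract, shows what "nearly matches" amounts to. When the budget gap satisfies $\bar C - \bar C_b = \Omega(H)$, its inverse is $O(1/H)$, so:

```latex
\[
\widetilde{\mathcal{O}}\!\left((\bar C - \bar C_b)^{-1} H^{2.5} S\sqrt{AK}\right)
= \widetilde{\mathcal{O}}\!\left(H^{-1}\cdot H^{2.5} S\sqrt{AK}\right)
= \widetilde{\mathcal{O}}\!\left(H^{1.5} S\sqrt{AK}\right),
\qquad
\frac{H^{1.5} S\sqrt{AK}}{H^{1.5}\sqrt{SAK}} = \sqrt{S}.
\]
```

Under this reading, the upper and lower bounds agree in their dependence on $H$, $A$, and $K$, and differ by a factor of $\sqrt{S}$ up to logarithmic terms.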

[LG-112] Tracing Human Stress from Physiological Signals using UWB Radar

Link: https://arxiv.org/abs/2410.10155
Authors: Jia Xu, Teng Xiao, Pin Lv, Zhe Chen, Chao Cai, Yang Zhang, Zehui Xiong
Keywords: important research domain, closest related works, Stress, supports many applications, DST
Categories: Human-Computer Interaction (cs.HC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 19 pages, 11 figures

Abstract:Stress tracing is an important research domain that supports many applications, such as health care and stress management; its closest related works are derived from stress detection. However, these existing works do not adequately address two important challenges facing stress detection. First, most of these studies involve asking users to wear physiological sensors to detect their stress states, which has a negative impact on the user experience. Second, these studies have failed to effectively utilize multimodal physiological signals, which results in less satisfactory detection results. This paper formally defines the stress tracing problem, which emphasizes the continuous detection of human stress states. A novel deep stress tracing method, named DST, is presented. Note that DST proposes tracing human stress based on physiological signals collected by a noncontact ultrawideband radar, which is more friendly to users when collecting their physiological signals. In DST, a signal extraction module is carefully designed at first to robustly extract multimodal physiological signals from the raw RF data of the radar, even in the presence of body movement. Afterward, a multimodal fusion module is proposed in DST to ensure that the extracted multimodal physiological signals can be effectively fused and utilized. Extensive experiments are conducted on three real-world datasets, including one self-collected dataset and two public datasets. Experimental results show that the proposed DST method significantly outperforms all the baselines in terms of tracing human stress states. On average, DST provides a 6.31% increase in detection accuracy on all datasets, compared with the best baselines.

[LG-113] Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree and Why?

Link: https://arxiv.org/abs/2410.10153
Authors: Xilin Yang
Keywords: hate speech, cosine similarity ratio, Measuring Hate Speech, hate speech classification, diagnose hate speech
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We begin by computing the cosine similarity ratio on the “Measuring Hate Speech” dataset, which contains 135,556 annotated comments on social media. This way, we show a basic use of cosine similarity as a description of hate speech content. We then diagnose hate speech classification starting from understanding the inconsistency of human annotation in the dataset. Using embedding regression as a basic diagnostic, we found that female annotators are more sensitive to racial slurs that target the black population. We then perform a more complicated diagnostic by training a hate speech classifier that uses a SoTA pre-trained large language model, NV-Embed-v2, to convert texts to embeddings and runs a logistic regression. This classifier achieves a testing accuracy of 94%. In diagnosing where machines disagree with human annotators, we found that machines make fewer mistakes than humans despite the fact that human annotations are treated as ground truth in the training set. Machines perform better in correctly labeling long statements of facts, but perform worse in labeling short instances of swear words. We hypothesize that this is due to model alignment - while curating models at their creation prevents the models from producing obvious hate speech, it also reduces the model’s ability to detect such content.

[LG-114] Fast and Accurate Neural Rendering Using Semi-Gradients

Link: https://arxiv.org/abs/2410.10149
Authors: In-Young Cho, Jaewoong Cho
Keywords: effective neural network-based, neural network-based framework, global illumination rendering, propose a simple, simple yet effective
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:We propose a simple yet effective neural network-based framework for global illumination rendering. Recently, rendering techniques that learn neural radiance caches by minimizing the difference (i.e., residual) between the left and right sides of the rendering equation have been suggested. Due to their ease of implementation and the advantage of excluding path integral calculations, these techniques have been applied to various fields, such as free-viewpoint rendering, differentiable rendering, and real-time rendering. However, issues of slow training and occasionally darkened renders have been noted. We identify the cause of these issues as the bias and high variance present in the gradient estimates of the existing residual-based objective function. To address this, we introduce a new objective function that maintains the same global optimum as before but allows for unbiased and low-variance gradient estimates, enabling faster and more accurate training of neural networks. In conclusion, this method is simply implemented by ignoring the partial derivatives of the right-hand side, and theoretical and experimental analyses demonstrate the effectiveness of the proposed loss.
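The idea of "ignoring the partial derivatives of the right-hand side" can be sketched on a toy fixed-point problem. The snippet below is an illustration under strong assumptions, not the paper's method: the rendering equation is replaced by a small deterministic linear system `L = E + T L`, the radiance cache is a plain table `theta`, and `T`, `E`, the step size, and the iteration count are all made up. It omits the Monte Carlo estimation where the bias and variance issues in the abstract arise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
P = rng.random((n, n))
# Hypothetical transport operator: each row sums to 0.5, so the map
# L -> E + T @ L is a contraction with a unique fixed point.
T = 0.5 * P / P.sum(axis=1, keepdims=True)
E = rng.random(n)  # hypothetical emission term

def semi_gradient(theta):
    # Gradient of 0.5 * ||theta - target||^2 where target = E + T @ theta
    # is treated as a constant (its derivative w.r.t. theta is ignored).
    return theta - (E + T @ theta)

theta = np.zeros(n)
for _ in range(200):
    theta -= 0.5 * semi_gradient(theta)

exact = np.linalg.solve(np.eye(n) - T, E)
print(np.max(np.abs(theta - exact)))
```

In this toy setting the semi-gradient update reaches the same fixed point as solving the linear system directly, which matches the abstract's statement that the new objective keeps the same global optimum.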

[LG-115] alpha-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Link: https://arxiv.org/abs/2410.10148
Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Rong Jin, Xiangnan He
Keywords: Aligning large language, Aligning large, large language models, Direct Preference Optimization, Simple Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at this https URL
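To make the contrast in the abstract concrete, here is a minimal sketch of the two baseline losses it names, for a single preference pair. It covers only standard DPO (margin measured against a reference model) and SimPO (length-normalized log-probabilities with a fixed target margin `gamma`); the adaptive margin of the proposed method is not reproduced, and the default values of `beta` and `gamma` are placeholders.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: reward margin is the policy/reference log-ratio gap."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: no reference model; fixed target margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l)
    return -log_sigmoid(margin - gamma)

# Sequence log-probabilities of the chosen (w) and rejected (l) responses.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))    # policy equals reference
print(simpo_loss(-10.0, -12.0, 5, 5, gamma=0.0))
print(simpo_loss(-10.0, -12.0, 5, 5, gamma=1.0))
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2 regardless of the data, which is one way to see the dependence on the reference model; in SimPO, the same `gamma` is applied to every pair.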

[LG-116] Unified Representation of Genomic and Biomedical Concepts through Multi-Task Multi-Source Contrastive Learning

Link: https://arxiv.org/abs/2410.10144
Authors: Hongyi Yuan, Suqi Liu, Kelly Cho, Katherine Liao, Alexandre Pereira, Tianxi Cai
Keywords: introduce GENomic Encoding, GENomic Encoding REpresentation, GENomic Encoding, Encoding REpresentation, biomedical knowledge bases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP)
Comments: 15 pages, 2 figures, 5 tables

Abstract:We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL’s ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined identification of concepts. This pioneering approach in constructing a unified embedding system for both SNPs and biomedical concepts enhances the potential for data integration and discovery in biomedical research.

[LG-117] MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Link: https://arxiv.org/abs/2410.10139
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
Keywords: Interleaved multimodal comprehension, arbitrary sequences, produce and interpret, interpret both images, images and text
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code in this https URL.

[LG-118] Variational autoencoders with latent high-dimensional steady geometric flows for dynamics

Link: https://arxiv.org/abs/2410.10137
Authors: Andrew Gracyk
Keywords: PDE-type ambient data, variational autoencoders, geometric flow, dynamical latent manifolds, flow
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 35 pages; 21 figures

Abstract:We develop Riemannian approaches to variational autoencoders (VAEs) for PDE-type ambient data with regularizing geometric latent dynamics, which we refer to as VAE-DLM, or VAEs with dynamical latent manifolds. We redevelop the VAE framework such that manifold geometries, subject to a geometric flow, embedded in Euclidean space are learned in the intermediary latent space developed by encoders and decoders. We reformulate the traditional evidence lower bound (ELBO) loss with a considerate choice of prior. We develop a linear geometric flow with a steady-state regularizing term. This geometric flow requires only automatic differentiation of one time derivative, and can be solved in moderately high dimensions in a physics-informed approach, allowing more expressive latent representations. We discuss how this flow can be formulated as a gradient flow, and maintains entropy away from metric singularity. This, along with an eigenvalue penalization condition, helps ensure the manifold is sufficiently large in measure, nondegenerate, and a canonical geometry, which contribute to a robust representation. Our methods focus on the modified multi-layer perceptron architecture with tanh activations for the manifold encoder-decoder. We demonstrate, on our datasets of interest, our methods perform at least as well as the traditional VAE, and oftentimes better. Our methods can outperform a standard VAE and a VAE endowed with our proposed architecture by up to 25% reduction in out-of-distribution (OOD) error and potentially greater. We highlight our method on ambient PDEs whose solutions maintain minimal variation in late times over its solution. Our approaches are particularly favorable with severe OOD effect. We provide empirical justification towards how latent Riemannian manifolds improve robust learning for external dynamics with VAEs.

[LG-119] FormalAlign: Automated Alignment Evaluation for Autoformalization

Link: https://arxiv.org/abs/2410.10135
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
Keywords: convert informal mathematical, informal mathematical proofs, machine-verifiable formats, bridging the gap, aims to convert
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 23 pages, 13 tables, 3 figures

Abstract:Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce FormalAlign, the first automated framework designed for evaluating the alignment between natural and formal languages in autoformalization. FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, FormalAlign demonstrates superior performance. In our experiments, FormalAlign outperforms GPT-4, achieving an Alignment-Selection Score 11.58% higher on FormL4-Basic (99.21% vs. 88.91%) and 3.19% higher on MiniF2F-Valid (66.39% vs. 64.34%). This effective alignment evaluation significantly reduces the need for manual verification. Both the dataset and code can be accessed via this https URL.

[LG-120] Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

Link: https://arxiv.org/abs/2410.10132
Authors: Hung Le, Kien Do, Dung Nguyen, Sunil Gupta, Svetha Venkatesh
Keywords: Effective decision-making, environments demands robust, robust memory management, observable environments demands, demands robust memory
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint 18 pages

Abstract:Effective decision-making in partially observable environments demands robust memory management. Despite their success in supervised learning, current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term. They fail to efficiently capture relevant past information, adapt flexibly to changing observations, and maintain stable updates over long episodes. We theoretically analyze the limitations of existing memory models within a unified framework and introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents. Our model dynamically adjusts memory by erasing no longer needed experiences and reinforcing crucial ones computationally efficiently. To this end, we leverage the Hadamard product for calibrating and updating memory, specifically designed to enhance memory capacity while mitigating numerical and learning challenges. Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks, such as meta-reinforcement learning, long-horizon credit assignment, and POPGym, demonstrating superior performance in handling long-term and evolving contexts.

[LG-121] Edge Unlearning is Not “on Edge”! An Adaptive Exact Unlearning System on Resource-Constrained Devices

Link: https://arxiv.org/abs/2410.10128
Authors: Xiaoyu Xia, Ziqi Wang, Ruoxi Sun, Bowen Liu, Ibrahim Khalil, Minhui Xue
Keywords: machine learning models, learning models enable, machine learning, exact unlearning, data owner data
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 17 pages, 18 figures

Abstract:The right to be forgotten mandates that machine learning models enable the erasure of a data owner’s data and information from a trained model. Removing data from the dataset alone is inadequate, as machine learning models can memorize information from the training data, increasing the potential privacy risk to users. To address this, multiple machine unlearning techniques have been developed and deployed. Among them, approximate unlearning is a popular solution, but recent studies report that its unlearning effectiveness is not fully guaranteed. Another approach, exact unlearning, tackles this issue by discarding the data and retraining the model from scratch, but at the cost of considerable computational and memory resources. However, not all devices have the capability to perform such retraining. In numerous machine learning applications, such as edge devices, Internet-of-Things (IoT), mobile devices, and satellites, resources are constrained, posing challenges for deploying existing exact unlearning methods. In this study, we propose a Constraint-aware Adaptive Exact Unlearning System at the network Edge (CAUSE), an approach to enabling exact unlearning on resource-constrained devices. Aiming to minimize the retrain overhead by storing sub-models on the resource-constrained device, CAUSE innovatively applies a Fibonacci-based replacement strategy and updates the number of shards adaptively in the user-based data partition process. To further improve the effectiveness of memory usage, CAUSE leverages the advantage of model pruning to save memory via compression with minimal accuracy sacrifice. The experimental results demonstrate that CAUSE significantly outperforms other representative systems in realizing exact unlearning on the resource-constrained device by 9.23%-80.86%, 66.21%-83.46%, and 5.26%-194.13% in terms of unlearning speed, energy consumption, and accuracy.
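The abstract's notion of exact unlearning with shards and stored sub-models can be sketched generically. The code below is a toy illustration of sharded retraining only: the "sub-model" is just the mean of a shard, the round-robin partition and the class name `ShardedModel` are hypothetical, and none of CAUSE's Fibonacci-based replacement, adaptive shard count, or pruning is implemented.

```python
from statistics import fmean

def train(shard):
    """Stand-in sub-model: the mean of the shard's values."""
    return fmean(shard)

class ShardedModel:
    def __init__(self, data, num_shards):
        # Hypothetical round-robin partition of the training data.
        self.shards = [list(data[i::num_shards]) for i in range(num_shards)]
        self.sub_models = [train(s) for s in self.shards]
        self.retrained = 0  # counts sub-model retrainings after setup

    def predict(self):
        # Aggregate the sub-models (here: a simple average).
        return fmean(self.sub_models)

    def unlearn(self, value):
        # Exact unlearning: drop the point and retrain only its shard.
        for i, shard in enumerate(self.shards):
            if value in shard:
                shard.remove(value)
                self.sub_models[i] = train(shard)
                self.retrained += 1
                return
        raise ValueError("value not found in any shard")

model = ShardedModel([float(x) for x in range(12)], num_shards=3)
model.unlearn(7.0)
print(model.sub_models, model.retrained)
```

Only the shard that held the removed point is retrained, and its sub-model is identical to one trained from scratch on that shard without the point; the other sub-models are reused as stored.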

[LG-122] Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.10118
Authors: Yuxuan Ren, Dihan Zheng, Chang Liu, Peiran Jin, Yu Shi, Lin Huang, Jiyan He, Shengjie Luo, Tao Qin, Tie-Yan Liu
Keywords: handling molecular science, demonstrated impressive capability, machine learning, recent years, machine learning models
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: Published as a conference paper at NeurIPS 2024

Abstract:In recent years, machine learning has demonstrated impressive capability in handling molecular science tasks. To support various molecular properties at scale, machine learning models are trained in the multi-task learning paradigm. Nevertheless, data of different molecular properties are often not aligned: some quantities, e.g. equilibrium structure, demand more cost to compute than others, e.g. energy, so their data are often generated by cheaper computational methods at the cost of lower accuracy, which cannot be directly overcome through multi-task learning. Moreover, it is not straightforward to leverage abundant data of other tasks to benefit a particular task. To handle such data heterogeneity challenges, we exploit the specialty of molecular tasks that there are physical laws connecting them, and design consistency training approaches that allow different tasks to exchange information directly so as to improve one another. Particularly, we demonstrate that the more accurate energy data can improve the accuracy of structure prediction. We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction, demonstrating a broad capability for integrating heterogeneous data.

[LG-123] Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models

Link: https://arxiv.org/abs/2410.10114
Authors: Jun Luo, Chen Chen, Shandong Wu
Keywords: diverse downstream tasks, demonstrated potent applicability, pre-trained Vision-Language Models, Prompt learning, downstream tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures

Abstract:Prompt learning for pre-trained Vision-Language Models (VLMs) like CLIP has demonstrated potent applicability across diverse downstream tasks. This lightweight approach has quickly gained traction from federated learning (FL) researchers who seek to efficiently adapt VLMs to heterogeneous scenarios. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where the participating clients are generally only allowed to download a single globally aggregated model from the server. While justifiable for training full-sized models under federated settings, in this work, we argue that this paradigm is ill-suited for lightweight prompts. By facilitating the clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data on the client, benefiting from both local and downloaded non-local adaptive prompt experts. The non-local experts are sparsely selected from a server-maintained pool, fostering collaborative learning across clients. To evaluate the proposed algorithm, we conduct extensive experiments across 9 datasets under various heterogeneous federated settings. The results show that pFedMoAP consistently outperforms the state-of-the-art alternatives, underscoring its efficacy in personalizing prompt learning for CLIP within the federated learning paradigm.

[LG-124] Learning Linear Attention in Polynomial Time

Link: https://arxiv.org/abs/2410.10101
Authors: Morris Yau, Ekin Akyurek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Keywords: simulating Boolean circuits, Previous research, simulating Boolean, Boolean circuits, Universal Turing Machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key–value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
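The statement that linear attention is "a linear predictor in an expanded feature space" can be checked numerically on a single head. This is a sketch under assumptions: the parameterization (a value matrix `W_v` and a merged key-query matrix `A`), the cubic feature map, and all sizes are chosen for illustration and may differ from the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                          # hypothetical token dimension and context length
Z = rng.standard_normal((n, d))      # context tokens, one per row
q = rng.standard_normal(d)           # query token
W_v = rng.standard_normal((d, d))    # value projection
A = rng.standard_normal((d, d))      # merged key-query matrix

# Direct single-head linear attention: sum_i (W_v z_i) * (z_i^T A q)
direct = (Z @ W_v.T).T @ (Z @ A @ q)

# The same map written as a linear predictor on cubic features of the input.
features = np.einsum("ib,ic,d->bcd", Z, Z, q)      # depends on data only
weights = np.einsum("ab,cd->abcd", W_v, A)         # depends on parameters only
linear = np.einsum("abcd,bcd->a", weights, features)

print(np.max(np.abs(direct - linear)))
```

Because the output is linear in `weights` once the features are fixed, fitting the attention head reduces to an ordinary linear regression problem in the expanded feature space, which is the conversion the abstract describes.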

[LG-125] How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective EMNLP2024

Link: https://arxiv.org/abs/2410.10093
Authors: Teng Xiao, Mingxiao Li, Yige Yuan, Huaisheng Zhu, Chao Cui, Vasant G Honavar
Keywords: GSIL, generalized self-imitation learning, efficiently aligns large, large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Main

Abstract:This paper introduces a novel generalized self-imitation learning (GSIL) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop GSIL by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. GSIL eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, GSIL encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view for alignment with demonstration data. Extensive experiments show that GSIL consistently and significantly outperforms baselines in many challenging benchmarks, such as coding (HumanEval), mathematical reasoning (GSM8K) and instruction-following benchmark (MT-Bench).

[LG-126] PromptGCN: Bridging Subgraph Gaps in Lightweight GCNs

Link: https://arxiv.org/abs/2410.10089
Authors: Shengwei Ji, Yujie Tian, Fei Liu, Xinlu Li, Le Wu
Keywords: Graph Convolutional Networks, Convolutional Networks, social networks, Graph Convolutional, Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Convolutional Networks (GCNs) are widely used in graph-based applications, such as social networks and recommendation systems. Nevertheless, large-scale graphs or deep aggregation layers in full-batch GCNs consume significant GPU memory, causing out-of-memory (OOM) errors on mainstream GPUs (e.g., 29GB memory consumption on the ogbn-products graph with 5 layers). The subgraph sampling methods reduce memory consumption to achieve lightweight GCNs by partitioning the graph into multiple subgraphs and sequentially training GCNs on each subgraph. However, these methods yield gaps among subgraphs, i.e., GCNs can only be trained based on subgraphs instead of global graph information, which reduces the accuracy of GCNs. In this paper, we propose PromptGCN, a novel prompt-based lightweight GCN model to bridge the gaps among subgraphs. First, the learnable prompt embeddings are designed to obtain global information. Then, the prompts are attached to each subgraph to transfer the global information among subgraphs. Extensive experimental results on seven large-scale graphs demonstrate that PromptGCN exhibits superior performance compared to baselines. Notably, PromptGCN improves the accuracy of subgraph sampling methods by up to 5.48% on the Flickr dataset. Overall, PromptGCN can be easily combined with any subgraph sampling method to obtain a lightweight GCN model with higher accuracy.

[LG-127] The Ingredients for Robotic Diffusion Transformers

Link: https://arxiv.org/abs/2410.10088
Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine
Keywords: recent years roboticists, achieved remarkable progress, leveraging high capacity, high capacity Transformer, capacity Transformer network
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: this https URL

[LG-128] NeRF-enabled Analysis-Through-Synthesis for ISAR Imaging of Small Everyday Objects with Sparse and Noisy UWB Radar Data

Link: https://arxiv.org/abs/2410.10085
Authors: Md Farhan Tasnim Oshim, Albert Reed, Suren Jayasuriya, Tauhidur Rahman
Keywords: Inverse Synthetic Aperture, Synthetic Aperture Radar, Inverse Synthetic, Synthetic Aperture, inherent resolution constraints
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Inverse Synthetic Aperture Radar (ISAR) imaging presents a formidable challenge when it comes to small everyday objects due to their limited Radar Cross-Section (RCS) and the inherent resolution constraints of radar systems. Existing ISAR reconstruction methods including backprojection (BP) often require complex setups and controlled environments, rendering them impractical for many real-world noisy scenarios. In this paper, we propose a novel Analysis-through-Synthesis (ATS) framework enabled by Neural Radiance Fields (NeRF) for high-resolution coherent ISAR imaging of small objects using sparse and noisy Ultra-Wideband (UWB) radar data with an inexpensive and portable setup. Our end-to-end framework integrates ultra-wideband radar wave propagation, reflection characteristics, and scene priors, enabling efficient 2D scene reconstruction without the need for costly anechoic chambers or complex measurement test beds. With qualitative and quantitative comparisons, we demonstrate that the proposed method outperforms traditional techniques and generates ISAR images of complex scenes with multiple targets and complex structures in Non-Line-of-Sight (NLOS) and noisy scenarios, particularly with limited number of views and sparse UWB radar scans. This work represents a significant step towards practical, cost-effective ISAR imaging of small everyday objects, with broad implications for robotics and mobile sensing applications.

[LG-129] PointNet with KAN versus PointNet with MLP for 3D Classification and Segmentation of Point Sets

Link: https://arxiv.org/abs/2410.10084
Authors: Ali Kashefi
Keywords: traditional Multilayer Perceptrons, key components, Multilayer Perceptrons, neural network, segmentation tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce PointNet-KAN, a neural network for 3D point cloud classification and segmentation tasks, built upon two key components. First, it employs Kolmogorov-Arnold Networks (KANs) instead of traditional Multilayer Perceptrons (MLPs). Second, it retains the core principle of PointNet by using shared KAN layers and applying symmetric functions for global feature extraction, ensuring permutation invariance with respect to the input features. In traditional MLPs, the goal is to train the weights and biases with fixed activation functions; however, in KANs, the goal is to train the activation functions themselves. We use Jacobi polynomials to construct the KAN layers. We extensively evaluate PointNet-KAN across various polynomial degrees and special types such as the Lagrange, Chebyshev, and Gegenbauer polynomials. Our results show that PointNet-KAN achieves competitive performance compared to PointNet with MLPs on benchmark datasets for 3D object classification and segmentation, despite employing a shallower and simpler network architecture. We hope this work serves as a foundation and provides guidance for integrating KANs, as an alternative to MLPs, into more advanced point cloud processing architectures.
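The two ingredients named in this abstract (a shared per-point layer with learnable polynomial activations, followed by a symmetric pooling function) can be illustrated with a small sketch. This is not the authors' implementation: it uses Chebyshev polynomials, which the abstract lists as one special case, and the names `chebyshev_features`, `shared_poly_layer`, and `global_feature`, the `tanh` squashing, and the random coefficients are all assumptions made for illustration.

```python
import math
import random


def chebyshev_features(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via the three-term recurrence."""
    feats = [1.0, x]
    for _ in range(2, degree + 1):
        feats.append(2.0 * x * feats[-1] - feats[-2])
    return feats[: degree + 1]


def shared_poly_layer(point, coeffs):
    """Apply the same learnable polynomial map to one point.

    coeffs[o][i][k] weights T_k of input coordinate i for output channel o.
    """
    squashed = [math.tanh(c) for c in point]  # keep inputs inside [-1, 1]
    degree = len(coeffs[0][0]) - 1
    basis = [chebyshev_features(s, degree) for s in squashed]
    return [
        sum(w * b for row, feats in zip(out, basis) for w, b in zip(row, feats))
        for out in coeffs
    ]


def global_feature(points, coeffs):
    """Max-pool per-point features over the set (a symmetric function)."""
    per_point = [shared_poly_layer(p, coeffs) for p in points]
    return [max(channel) for channel in zip(*per_point)]


# Hypothetical data: 5 output channels, 3 input coordinates, degree-3 polynomials.
rng = random.Random(0)
coeffs = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)] for _ in range(5)]
cloud = [[rng.uniform(-2, 2) for _ in range(3)] for _ in range(16)]
shuffled = cloud[:]
rng.shuffle(shuffled)
```

Because every point passes through the same layer and the pooling is a maximum, `global_feature(cloud, coeffs)` and `global_feature(shuffled, coeffs)` return the same vector, which is the permutation invariance the abstract refers to.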

[LG-130] VideoAgent: Self-Improving Video Generation

Link: https://arxiv.org/abs/2410.10076
Authors: Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang
Keywords: generated video plans, generated video, Video generation, Video, generate visual plans
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world.

[LG-131] Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

Link: https://arxiv.org/abs/2410.10074
Authors: Chengsong Huang, Langlin Huang, Jiaxin Huang
Keywords: Large Language Models, updating model parameters, Large Language, Language Models, In-Context Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leveraging task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregates the information by reweighting the logits of each group via a non-gradient optimization approach. We further introduce Binary LARA (B-LARA), a variant that constrains weights to binary values, which simplifies the search space and reduces memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios with varying numbers of examples, from limited to many-shot demonstrations.
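The logit-level ensembling described in this abstract can be sketched as a weighted sum of per-group logit vectors. The snippet below is a minimal illustration, not the paper's method: the logits are made-up numbers, the weights are fixed by hand rather than found by the non-gradient search the abstract mentions, and `combine_logits` is a hypothetical helper name.

```python
import math


def combine_logits(group_logits, weights):
    """Weighted sum of per-group logit vectors over a shared set of candidates."""
    n = len(group_logits[0])
    return [sum(w * g[i] for w, g in zip(weights, group_logits)) for i in range(n)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over 3 candidate answers, one row per demonstration group.
group_logits = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [0.1, 0.2, 3.0],  # a group assumed to be less informative
]

# Real-valued weights (LARA-style) versus binary weights (B-LARA-style).
uniform = combine_logits(group_logits, [1 / 3, 1 / 3, 1 / 3])
binary = combine_logits(group_logits, [1, 1, 0])  # weight 0 filters out group 3

probs_uniform = softmax(uniform)
probs_binary = softmax(binary)
```

With binary weights, a group whose weight is 0 never needs to be evaluated at inference time, which is one plausible reading of how the binary variant saves memory.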

[LG-132] Self-Organizing Recurrent Stochastic Configuration Networks for Nonstationary Data Modelling

Link: https://arxiv.org/abs/2410.10072
Authors: Gang Dang, Dianhui Wang
Keywords: randomized learner models, Recurrent stochastic configuration, class of randomized, randomized learner, shown promise
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Recurrent stochastic configuration networks (RSCNs) are a class of randomized learner models that have shown promise in modelling nonlinear dynamics. In many fields, however, the data generated by industry systems often exhibits nonstationary characteristics, leading to the built model performing well on the training data but struggling with the newly arriving data. This paper aims at developing a self-organizing version of RSCNs, termed as SORSCNs, to enhance the continuous learning ability of the network for modelling nonstationary data. SORSCNs can autonomously adjust the network parameters and reservoir structure according to the data streams acquired in real-time. The output weights are updated online using the projection algorithm, while the network structure is dynamically adjusted in the light of the recurrent stochastic configuration algorithm and an improved sensitivity analysis. Comprehensive comparisons among the echo state network (ESN), online self-learning stochastic configuration network (OSL-SCN), self-organizing modular ESN (SOMESN), RSCN, and SORSCN are carried out. Experimental results clearly demonstrate that the proposed SORSCNs outperform other models with sound generalization, indicating great potential in modelling nonlinear systems with nonstationary dynamics.

[LG-133] The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers

Link: https://arxiv.org/abs/2410.10056
Authors: Qi Liu, Wanjing Ma
Keywords: Epochal Sawtooth Effect, recurring training loss, Epochal Sawtooth, training loss pattern, Sawtooth Effect
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 21 figures

Abstract:In this paper, we identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE), commonly observed during training with adaptive gradient-based optimizers, particularly Adam. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, although less severely, with other optimizers such as RMSProp. We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect. The influence of factors such as $\beta$, batch size, and data shuffling on this pattern is studied. We quantify the influence of $\beta_2$ on the shape of the loss curve, showing that higher values of $\beta_2$ result in a nearly linear increase in loss, while lower values create a concave upward trend. Our analysis reveals that this behavior stems from the adaptive learning rate controlled by the second moment estimate, with $\beta_1$ playing a minimal role when $\beta_2$ is large. To support our analysis, we replicate this phenomenon through a controlled quadratic minimization task. By incrementally solving a series of quadratic optimization problems using Adam, we demonstrate that the Epochal Sawtooth Effect can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. This paper provides both theoretical insights and quantitative analysis, offering a comprehensive understanding of this ubiquitous phenomenon in modern optimization techniques.
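The controlled experiment mentioned in this abstract (Adam applied to a series of quadratic problems) can be set up in a few lines. The sketch below implements the standard Adam update for a scalar parameter and cycles through quadratics of the form 0.5 * (x - c)^2 in a fixed order each epoch. The targets, learning rate, and fixed visiting order are assumptions chosen for illustration, and the sketch makes no claim about reproducing the paper's curves.

```python
import math


def adam_step(x, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, (m, v, t)


def run(targets, epochs, **adam_kwargs):
    """Cycle through quadratics 0.5 * (x - c)^2 in the same order every epoch."""
    x, state, losses = 0.0, (0.0, 0.0, 0), []
    for _ in range(epochs):
        for c in targets:
            losses.append(0.5 * (x - c) ** 2)   # loss measured before the update
            x, state = adam_step(x, x - c, state, **adam_kwargs)
    return x, losses


# Hypothetical targets standing in for mini-batches with different minimizers.
final_x, losses = run(targets=[-1.0, 0.5, 2.0, 3.5], epochs=50, beta2=0.999)
```

Plotting `losses` against the step index for several `beta2` values is the natural next step for inspecting the within-epoch shape that the abstract discusses.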

[LG-134] Towards Bridging Generalization and Expressivity of Graph Neural Networks

Link: https://arxiv.org/abs/2410.10051
Authors: Shouheng Li, Floris Geerts, Dongwoo Kim, Qing Wang
Keywords: graph neural networks, neural networks, critical aspects, generalization, Expressivity
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 17 pages, 2 figures, 2 tables

Abstract:Expressivity and generalization are two critical aspects of graph neural networks (GNNs). While significant progress has been made in studying the expressivity of GNNs, much less is known about their generalization capabilities, particularly when dealing with the inherent complexity of graph-structured data. In this work, we address the intricate relationship between expressivity and generalization in GNNs. Theoretical studies conjecture a trade-off between the two: highly expressive models risk overfitting, while those focused on generalization may sacrifice expressivity. However, empirical evidence often contradicts this assumption, with expressive GNNs frequently demonstrating strong generalization. We explore this contradiction by introducing a novel framework that connects GNN generalization to the variance in graph structures they can capture. This leads us to propose a k-variance margin-based generalization bound that characterizes the structural properties of graph embeddings in terms of their upper-bounded expressive power. Our analysis does not rely on specific GNN architectures, making it broadly applicable across GNN models. We further uncover a trade-off between intra-class concentration and inter-class separation, both of which are crucial for effective generalization. Through case studies and experiments on real-world datasets, we demonstrate that our theoretical findings align with empirical results, offering a deeper understanding of how expressivity can enhance GNN generalization.

[LG-135] StatioCL: Contrastive Learning for Time Series via Non-Stationary and Temporal Contrast CIKM24

Link: https://arxiv.org/abs/2410.10048
Authors: Yu Wu, Ting Dang, Dimitris Spathis, Hong Jia, Cecilia Mascolo
Keywords: false negative pairs, embedding similar pairs, similar pairs closely, Contrastive learning, negative pairs
Categories: Machine Learning (cs.LG)
Comments: Accepted in CIKM24

Abstract:Contrastive learning (CL) has emerged as a promising approach for representation learning in time series data by embedding similar pairs closely while distancing dissimilar ones. However, existing CL methods often introduce false negative pairs (FNPs) by neglecting inherent characteristics and then randomly selecting distinct segments as dissimilar pairs, leading to erroneous representation learning, reduced model performance, and overall inefficiency. To address these issues, we systematically define and categorize FNPs in time series into semantic false negative pairs and temporal false negative pairs for the first time: the former arising from overlooking similarities in label categories, which correlates with similarities in non-stationarity and the latter from neglecting temporal proximity. Moreover, we introduce StatioCL, a novel CL framework that captures non-stationarity and temporal dependency to mitigate both FNPs and rectify the inaccuracies in learned representations. By interpreting and differentiating non-stationary states, which reflect the correlation between trends or temporal dynamics with underlying data patterns, StatioCL effectively captures the semantic characteristics and eliminates semantic FNPs. Simultaneously, StatioCL establishes fine-grained similarity levels based on temporal dependencies to capture varying temporal proximity between segments and to mitigate temporal FNPs. Evaluated on real-world benchmark time series classification datasets, StatioCL demonstrates a substantial improvement over state-of-the-art CL methods, achieving a 2.9% increase in Recall and a 19.2% reduction in FNPs. Most importantly, StatioCL also shows enhanced data efficiency and robustness against label scarcity.

[LG-136] VQ-CNMP: Neuro-Symbolic Skill Learning for Bi-Level Planning

Link: https://arxiv.org/abs/2410.10045
Authors: Hakan Aktas, Emre Ugur
Keywords: unlabeled demonstration data, neural network model, network model capable, demonstration data, neural network
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, Submitted to Conference on Robot Learning LEAP Workshop 2024

Abstract:This paper proposes a novel neural network model capable of discovering high-level skill representations from unlabeled demonstration data. We also propose a bi-level planning pipeline that utilizes our model using a gradient-based planning approach. While extracting high-level representations, our model also preserves the low-level information, which can be used for low-level action planning. In the experiments, we tested the skill discovery performance of our model under different conditions, tested whether Multi-Modal LLMs can be utilized to label the learned high-level skill representations, and finally tested the high-level and low-level planning performance of our pipeline.

[LG-137] Are KAN Effective for Identifying and Tracking Concept Drift in Time Series?

Link: https://arxiv.org/abs/2410.10041
Authors: Kunpeng Xu, Lifei Chen, Shengrui Wang
Keywords: online activity logs, understanding complex systems, Dynamic concepts, financial markets, activity logs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic concepts in time series are crucial for understanding complex systems such as financial markets, healthcare, and online activity logs. These concepts help reveal structures and behaviors in sequential data for better decision-making and forecasting. Existing models struggle with detecting and tracking concept drift due to limitations in interpretability and adaptability. This paper introduces Kolmogorov-Arnold Networks (KAN) into time series and proposes WormKAN, a KAN-based auto-encoder to address concept drift in co-evolving time series. WormKAN integrates the KAN-SR module, in which the encoder, decoder, and self-representation layer are built on KAN, along with a temporal constraint to capture concept transitions. These transitions, akin to passing through a “wormhole”, are identified by abrupt changes in the latent space. Experiments show that KAN and KAN-based models (WormKAN) effectively segment time series into meaningful concepts, enhancing the identification and tracking of concept drifts.

[LG-138] Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

Link: https://arxiv.org/abs/2410.10024
Authors: Hossein Taheri, Christos Thrampoulidis, Arya Mazumdar
Keywords: study the data-dependent, data-dependent convergence, convergence and generalization, generalization behavior, neural networks
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:In this paper, we study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation. Our first result is a novel bound on the excess risk of deep networks trained by the logistic loss, via an algorithmic stability analysis. Compared to previous works, our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds. Importantly, the bounds we derive in this paper are tighter, hold even for neural networks of small width, do not scale unfavorably with width, are algorithm-dependent, and consequently capture the role of initialization on the sample complexity of gradient descent for deep nets. Specialized to noiseless data separable with margin $\gamma$ by neural tangent kernel (NTK) features of a network of width $\Omega(\mathrm{poly}(\log(n)))$, we show the test-error rate to be $e^{O(L)}/(\gamma^2 n)$, where $n$ is the training set size and $L$ denotes the number of hidden layers. This is an improvement in the test loss bound compared to previous works while maintaining the poly-logarithmic width conditions. We further investigate excess risk bounds for deep nets trained with noisy data, establishing that under a polynomial condition on the network width, gradient descent can achieve the optimal excess risk. Finally, we show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution. In particular, we show for a one-hidden-layer neural network of constant width $m$ with quadratic activation and standard Gaussian initialization that mini-batch SGD with linear sample complexity and with a large step-size $\eta = m$ reaches perfect test accuracy after only $\lceil \log(d) \rceil$ iterations, where $d$ is the data dimension.

[LG-139] Online Multi-modal Root Cause Analysis

Link: https://arxiv.org/abs/2410.10021
Authors: Lecheng Zheng, Zhengzhang Chen, Haifeng Chen, Jingrui He
Keywords: Root Cause Analysis, RCA methods, Traditional data-driven RCA, essential for pinpointing, failures in microservice
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.

[LG-140] Improving accuracy and convergence of federated learning edge computing methods for generalized DER forecasting applications in power grid NEURIPS2022

Link: https://arxiv.org/abs/2410.10018
Authors: Vineet Jagadeesan Nair, Lucas Pereira
Keywords: distributed energy resources, accurate federated learning, lower communication requirements, faster convergence properties, low-carbon power grids
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments: Presented at the NeurIPS 2022 Tackling Climate Change with Machine Learning workshop

Abstract:This proposal aims to develop more accurate federated learning (FL) methods with faster convergence properties and lower communication requirements, specifically for forecasting distributed energy resources (DER) such as renewables, energy storage, and loads in modern, low-carbon power grids. This will be achieved by (i) leveraging recently developed extensions of FL such as hierarchical and iterative clustering to improve performance with non-IID data, (ii) experimenting with different types of FL global models well-suited to time-series data, and (iii) incorporating domain-specific knowledge from power systems to build more general FL frameworks and architectures that can be applied to diverse types of DERs beyond just load forecasting, and with heterogeneous clients.

[LG-141] TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

Link: https://arxiv.org/abs/2410.10006
Authors: Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie
Keywords: Large-scale general domain, Large-scale general, general domain pretraining, downstream-specific finetuning, predominant paradigm
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large-scale general domain pretraining followed by downstream-specific finetuning has become a predominant paradigm in machine learning. However, discrepancies between the pretraining and target domains can still lead to performance degradation in certain cases, underscoring the need for task-adaptive continued pretraining (TAP). TAP methods typically involve continued pretraining on task-specific unlabeled datasets or introducing additional unsupervised learning objectives to enhance model capabilities. While many TAP methods perform continued pretraining with multiple pretraining objectives, they often determine the tradeoff parameters between objectives manually, resulting in suboptimal outcomes and higher computational costs. In this paper, we propose TapWeight, a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective based on downstream feedback. TapWeight reweights each pretraining objective by solving a multi-level optimization problem. We applied TapWeight to both molecular property prediction and natural language understanding tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight.

[LG-142] A Holistic Weakly Supervised Approach for Liver Tumor Segmentation with Clinical Knowledge-Informed Label Smoothing

Link: https://arxiv.org/abs/2410.10005
Authors: Hairong Wang, Lingchao Mao, Zihan Zhang, Jing Li
Keywords: accurate CT-based tumor, liver tumor segmentation, CT-based tumor segmentation, mortality worldwide, accurate CT-based
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Liver cancer is a leading cause of mortality worldwide, and accurate CT-based tumor segmentation is essential for diagnosis and treatment. Manual delineation is time-intensive, prone to variability, and highlights the need for reliable automation. While deep learning has shown promise for automated liver segmentation, precise liver tumor segmentation remains challenging due to the heterogeneous nature of tumors, imprecise tumor margins, and limited labeled data. We present a novel holistic weakly supervised framework that integrates clinical knowledge to address these challenges with (1) A knowledge-informed label smoothing technique that leverages clinical data to generate smooth labels, which regularizes model training reducing the risk of overfitting and enhancing model performance; (2) A global and local-view segmentation framework, breaking down the task into two simpler sub-tasks, allowing optimized preprocessing and training for each; and (3) Pre- and post-processing pipelines customized to the challenges of each subtask, which enhances tumor visibility and refines tumor boundaries. We evaluated the proposed method on the HCC-TACE-Seg dataset and showed that these three key components complementarily contribute to the improved performance. Lastly, we prototyped a tool for automated liver tumor segmentation and diagnosis summary generation called MedAssistLiver. The app and code are published at this https URL.
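For readers unfamiliar with label smoothing, the baseline idea that this abstract builds on can be written down directly. The sketch below shows standard uniform label smoothing only; the paper's knowledge-informed variant, which derives the soft labels from clinical data, is not described in enough detail here to reproduce. The three-class setup and the function names are assumptions.

```python
import math


def smooth_labels(one_hot, eps=0.1):
    """Uniform label smoothing: (1 - eps) * y + eps / K for K classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a (possibly soft) target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical per-voxel classes: background, liver, tumor.
hard = [0.0, 0.0, 1.0]
soft = smooth_labels(hard, eps=0.1)

overconfident = [0.001, 0.001, 0.998]
loss_hard = cross_entropy(hard, overconfident)
loss_soft = cross_entropy(soft, overconfident)
```

With the soft target, an overconfident prediction is penalized more than with the hard target (`loss_soft > loss_hard`), which is the regularizing effect the abstract alludes to.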

[LG-143] HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Link: https://arxiv.org/abs/2410.09988
Authors: Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner
Keywords: Large Language Model, existing Large Language, Large Language, Language Model, applied mathematics problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code and the HARDMath dataset are available at this https URL

Abstract:Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insights into the failure cases of LLMs. These results demonstrate limitations of current LLM performance on advanced graduate-level applied math problems and underscore the importance of datasets like HARDMath to advance mathematical abilities of LLMs.

[LG-144] Self-Data Distillation for Recovering Quality in Pruned Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09982
Authors: Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie
Keywords: Large language models, natural language processing, deployment requires substantial, requires substantial compute, Large language
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at the NeurIPS 2024 Machine Learning and Compression Workshop

Abstract:Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model's learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model's quality. In this work, we propose self-data distilled fine-tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model's knowledge. Empirically, we demonstrate that self-data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning 6 decoder blocks on Llama3.1-8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model's accuracy compared to 81.7% with SFT, while reducing real-world FLOPs by 16.30%. Furthermore, our approach scales effectively across datasets, with the quality improving as the dataset size increases.
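The parameter figures quoted in this abstract can be sanity-checked with a short calculation. The sketch assumes the publicly documented Llama-3.1-8B shape (hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128, two RMSNorm vectors per block, no bias terms); these dimensions come from the public model configuration, not from the abstract itself.

```python
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 8 * 128       # 8 key/value heads of dimension 128 (grouped-query attention)

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q and o, plus k and v projections
mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, and down projections
norms = 2 * HIDDEN                                     # two RMSNorm weight vectors
per_block = attention + mlp + norms                    # parameters in one decoder block

removed = 6 * per_block                    # six pruned decoder blocks
remaining_billions = 8.03 - removed / 1e9  # starting from the quoted 8.03B total

print(f"per block: {per_block / 1e6:.1f}M, remaining: {remaining_billions:.2f}B")
```

Under these assumptions each decoder block holds about 218M parameters, so removing six blocks accounts for the quoted drop from 8.03B to 6.72B.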

[LG-145] Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions

Link: https://arxiv.org/abs/2410.09972
Authors: Kyungmin Kim, JB Lanier, Pierre Baldi, Charless Fowlkes, Roy Fox
Keywords: Model-Based Reinforcement Learning, Recent advancements, Model-Based Reinforcement, Reinforcement Learning, advancements in Model-Based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Recent advancements in Model-Based Reinforcement Learning (MBRL) have made it a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to train MBRL agents with generalizable perception. Training in the presence of visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on DREAMER, a popular MBRL method, we propose a simple yet effective auxiliary task to facilitate representation learning in distracting environments. Under the assumption that task-relevant components of image observations are straightforward to identify with prior knowledge in a given task, we use a segmentation mask on image observations to only reconstruct task-relevant components. In doing so, we greatly reduce the complexity of representation learning by removing the need to encode task-irrelevant objects in the latent representation. Our method, Segmentation Dreamer (SD), can be used either with ground-truth masks easily accessible in simulation or by leveraging potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the reconstruction loss to avoid providing misleading learning signals due to mask prediction errors. In modified DeepMind Control suite (DMC) and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work. We find that SD is especially helpful in sparse reward tasks otherwise unsolvable by prior work, enabling the training of visually robust agents without the need for extensive reward engineering.

[LG-146] Deep-Ace: LSTM-based Prokaryotic Lysine Acetylation Site Predictor

Link: https://arxiv.org/abs/2410.09968
Authors: Maham Ilyas, Abida Yasmeen, Yaser Daanial Khan, Arif Mahmood
Keywords: Acetylation of lysine, post-translation modification occurring, lysine residues, prokaryotes and eukaryotes, K-Ace sites
Categories: Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments:

Abstract:Acetylation of lysine residues (K-Ace) is a post-translational modification occurring in both prokaryotes and eukaryotes. It plays a crucial role in disease pathology and cell biology, so it is important to identify these K-Ace sites. In the past, many machine learning-based models using hand-crafted features and encodings have been used to find and analyze the characteristics of K-Ace sites; however, these methods ignore long-term relationships within sequences and therefore suffer performance degradation. In the current work, we propose Deep-Ace, a deep learning-based framework using a Long Short-Term Memory (LSTM) network, which can understand and encode long-term relationships within a sequence. Such relations are vital for learning discriminative and effective sequence representations. We explore the use of LSTM to extract deep features, followed by fully connected layers for prediction of K-Ace sites, for eight different prokaryotic species (including B. subtilis, C. glutamicum, E. coli, G. kaustophilus, S. eriocheiris, B. velezensis, S. typhimurium, and M. tuberculosis). Our proposed method outperforms existing state-of-the-art models, achieving accuracies of 0.80, 0.79, 0.71, 0.75, 0.80, 0.83, 0.756, and 0.82, respectively, for the eight bacterial species mentioned above. With minor modifications, the method can be used for eukaryotic systems and can serve as a tool for the prognosis and diagnosis of various diseases in humans.

[LG-147] Improving 3D Few-Shot Segmentation with Inference-Time Pseudo-Labeling

Link: https://arxiv.org/abs/2410.09967
Authors: Mohammad Mozafari, Hosein Hasani, Reza Vahidimajd, Mohamadreza Fereydooni, Mahdieh Soleymani Baghshah
Keywords: offering remarkable adaptability, limited annotated data, medical imaging analysis, recent years, models have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, few-shot segmentation (FSS) models have emerged as a promising approach in medical imaging analysis, offering remarkable adaptability to segment novel classes with limited annotated data. Existing approaches to few-shot segmentation have often overlooked the potential of the query itself, failing to fully utilize the valuable information it contains. However, treating the query as unlabeled data provides an opportunity to enhance prediction accuracy. Specifically in the domain of medical imaging, the volumetric structure of queries offers a considerable source of valuable information that can be used to improve the target slice segmentation. In this work, we present a novel strategy to efficiently leverage the intrinsic information of the query sample for final segmentation during inference. First, we use the support slices from a reference volume to generate an initial segmentation score for the query slices through a prototypical approach. Subsequently, we apply a confidence-aware pseudo-labeling procedure to transfer the most informative parts of query slices to the support set. The final prediction is performed based on the new expanded support set, enabling the prediction of a more accurate segmentation mask for the query volume. Extensive experiments show that the proposed method can effectively boost performance across diverse settings and datasets.
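
As a rough illustration of the confidence-aware pseudo-labeling step, the sketch below keeps only query pixels whose initial foreground score is clearly high or clearly low and leaves ambiguous pixels unlabeled. The thresholds `low` and `high`, the flat score list, and the function name are assumptions for illustration; the abstract does not specify how confidence is measured.

```python
def confident_pseudo_labels(scores, low=0.1, high=0.9):
    """Map pixel index -> pseudo-label for confidently scored pixels.

    scores: foreground probabilities in [0, 1] from an initial
    prototype-based prediction (hypothetical input format).
    """
    labels = {}
    for idx, score in enumerate(scores):
        if score >= high:
            labels[idx] = 1  # confident foreground
        elif score <= low:
            labels[idx] = 0  # confident background
        # scores strictly between the thresholds stay unlabeled
    return labels


pseudo = confident_pseudo_labels([0.95, 0.5, 0.05])
```

Only the confidently labeled pixels would then be added to the support set before the final prediction.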

[LG-148] Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing

Link: https://arxiv.org/abs/2410.09964
Authors: Muhammad Umar, Muhammad Asif, Arif Mahmood
Keywords: Single-cell RNA sequencing, single cell level, Single-cell RNA, RNA sequencing, enables the study
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments:

Abstract:Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine learning, and deep learning-based methods have been proposed for cell-type classification. Most of these methods utilize unsupervised lower-dimensional projections obtained from large reference data. In this work, we propose a reference-based method for cell-type classification, called EnProCell. EnProCell first computes lower-dimensional projections that capture both high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed existing state-of-the-art methods when tested on four different datasets produced by different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. EnProCell is available at this https URL.

[LG-149] EITNet: An IoT-Enhanced Framework for Real-Time Basketball Action Recognition

Link: https://arxiv.org/abs/2410.09954
Authors: Jingyu Liu, Xinyu Liu, Mingzhe Qu, Tianyi Lyu
Keywords: Integrating IoT technology, basketball action recognition, Integrating IoT, providing crucial insights, basketball action
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: pages

Abstract:Integrating IoT technology into basketball action recognition enhances sports analytics, providing crucial insights into player performance and game strategy. However, existing methods often fall short in terms of accuracy and efficiency, particularly in complex, real-time environments where player movements are frequently occluded or involve intricate interactions. To overcome these challenges, we propose the EITNet model, a deep learning framework that combines EfficientDet for object detection, I3D for spatiotemporal feature extraction, and TimeSformer for temporal analysis, all integrated with IoT technology for seamless real-time data collection and processing. Our contributions include developing a robust architecture that improves recognition accuracy to 92%, surpassing the baseline EfficientDet model’s 87%, and reducing loss to below 5.0 compared to EfficientDet’s 9.0 over 50 epochs. Furthermore, the integration of IoT technology enhances real-time data processing, providing adaptive insights into player performance and strategy. The paper details the design and implementation of EITNet, experimental validation, and a comprehensive evaluation against existing models. The results demonstrate EITNet’s potential to significantly advance automated sports analysis and optimize data utilization for player performance and strategy improvement.

[LG-150] Efficient Federated Unlearning under Plausible Deniability ACML2024

Link: https://arxiv.org/abs/2410.09947
Authors: Ayush K. Varshney, Vicenç Torra
Keywords: GDPR in Europe, data point, specific data point, server, data
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: This paper has been accepted for publication in the journal track (Springer Machine Learning) of ACML 2024. Published version will be available after the conference. The source code is available at this https URL

Abstract:Privacy regulations like the GDPR in Europe and the CCPA in the US give users the right to remove their data from ML applications. Machine unlearning addresses this by modifying the ML parameters in order to forget the influence of a specific data point on its weights. Recent literature has highlighted that the contribution from data point(s) can be forged with some other data points in the dataset with probability close to one. This allows a server to falsely claim unlearning without actually modifying the model’s parameters. However, in distributed paradigms such as FL, where the server lacks access to the dataset and the number of clients is limited, claiming unlearning becomes a challenge. This paper introduces an efficient way to achieve federated unlearning by employing a privacy model which allows the FL server to plausibly deny the client’s participation in the training up to a certain extent. We demonstrate that the server can generate a Proof-of-Deniability, where each aggregated update can be associated with at least x client updates. This enables the server to plausibly deny a client’s participation. However, in the event of frequent unlearning requests, the server is required to adopt an unlearning strategy and, accordingly, update its model parameters. We also perturb the client updates in a cluster in order to avoid inference from an honest but curious server. We show that the global model satisfies differential privacy after T communication rounds. The proposed methodology has been evaluated on multiple datasets in different privacy settings. The experimental results show that our framework achieves comparable utility while providing a significant reduction in terms of memory (30 times), as well as retraining time (1.6-500769 times). The source code for the paper is available.

[LG-151] Dynamic Estimation of Learning Rates Using a Non-Linear Autoregressive Model

Link: https://arxiv.org/abs/2410.09943
Authors: Ramin Okhrati
Keywords: adaptive non-linear autoregressive, non-linear autoregressive, models incorporating, iterations increases, class of adaptive
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
Comments:

Abstract:We introduce a new class of adaptive non-linear autoregressive (Nlar) models incorporating the concept of momentum, which dynamically estimate both the learning rates and momentum as the number of iterations increases. In our method, the growth of the gradients is controlled using a scaling (clipping) function, leading to stable convergence. Within this framework, we propose three distinct estimators for learning rates and provide theoretical proof of their convergence. We further demonstrate how these estimators underpin the development of effective Nlar optimizers. The performance of the proposed estimators and optimizers is rigorously evaluated through extensive experiments across several datasets and a reinforcement learning environment. The results highlight two key features of the Nlar optimizers: robust convergence despite variations in underlying parameters, including large initial learning rates, and strong adaptability with rapid convergence during the initial epochs.

[LG-152] Generalized Group Data Attribution

Link: https://arxiv.org/abs/2410.09940
Authors: Dan Ley, Shichang Zhang, Suraj Srinivas, Gili Rusak, Himabindu Lakkaraju
Keywords: Generalized Group Data, Group Data Attribution, data selection, Data Attribution, individual training data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK yields speedups of up to 10x-50x over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, we demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.

[LG-153] Robust identifiability for symbolic recovery of differential equations

Link: https://arxiv.org/abs/2410.09938
Authors: Hillary Hauger, Philipp Scholl, Gitta Kutyniok
Keywords: Recent advancements, moving from manual, advancements in machine, machine learning, learning have transformed
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Recent advancements in machine learning have transformed the discovery of physical laws, moving from manual derivation to data-driven methods that simultaneously learn both the structure and parameters of governing equations. This shift introduces new challenges regarding the validity of the discovered equations, particularly concerning their uniqueness and, hence, identifiability. While the issue of non-uniqueness has been well-studied in the context of parameter estimation, it remains underexplored for algorithms that recover both structure and parameters simultaneously. Early studies have primarily focused on idealized scenarios with perfect, noise-free data. In contrast, this paper investigates how noise influences the uniqueness and identifiability of physical laws governed by partial differential equations (PDEs). We develop a comprehensive mathematical framework to analyze the uniqueness of PDEs in the presence of noise and introduce new algorithms that account for noise, providing thresholds to assess uniqueness and identifying situations where excessive noise hinders reliable conclusions. Numerical experiments demonstrate the effectiveness of these algorithms in detecting uniqueness despite the presence of noise.

[LG-154] How to unlearn a learned Machine Learning model ?

Link: https://arxiv.org/abs/2410.09935
Authors: Seifeddine Achour
Keywords: contemporary times, numerous domains, human expectations, sparked a remarkable, remarkable revolution
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:In contemporary times, machine learning (ML) has sparked a remarkable revolution across numerous domains, surpassing even the loftiest of human expectations. However, despite the astounding progress made by ML, the need to regulate its outputs and capabilities has become imperative. A viable approach to address this concern is by exerting control over the data used for its training, more precisely, by unlearning the model from undesired data. In this article, I will present an elegant algorithm for unlearning a machine learning model and visualize its abilities. Additionally, I will elucidate the underlying mathematical theory and establish specific metrics to evaluate both the unlearned model’s performance on desired data and its level of ignorance regarding unwanted data.

[LG-155] FedECADO: A Dynamical System Model of Federated Learning

Link: https://arxiv.org/abs/2410.09933
Authors: Aayushya Agarwal, Gauri Joshi, Larry Pileggi
Keywords: unified machine learning, Federated learning harnesses, harnesses the power, power of distributed, distributed optimization
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Federated learning harnesses the power of distributed optimization to train a unified machine learning model across separate clients. However, heterogeneous data distributions and computational workloads can lead to inconsistent updates and limit model performance. This work tackles these challenges by proposing FedECADO, a new algorithm inspired by a dynamical system representation of the federated learning process. FedECADO addresses non-IID data distribution through an aggregate sensitivity model that reflects the amount of data processed by each client. To tackle heterogeneous computing, we design a multi-rate integration method with adaptive step-size selections that synchronizes active client updates in continuous time. Compared to prominent techniques, including FedProx and FedNova, FedECADO achieves higher classification accuracies in numerous heterogeneous scenarios.

[LG-156] A resource-efficient model for deep kernel learning

Link: https://arxiv.org/abs/2410.09926
Authors: Luisa D’Amore
Keywords: major challenges encountered, Hughes phenomenon, scale of complexity, curse of dimensionality, accelerate learning computations
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:According to the Hughes phenomenon, the major challenges encountered in computations with learning models come from the scale of complexity, e.g. the so-called curse of dimensionality. There are various approaches for accelerating learning computations with minimal loss of accuracy. These approaches range from model-level to implementation-level approaches. To the best of our knowledge, the first one is rarely used in its basic form. Perhaps this is because the theoretical understanding of model decomposition approaches, and thus the ability to develop mathematical improvements, has lagged behind. We describe a model-level decomposition approach that combines both the decomposition of the operators and the decomposition of the network. We perform a feasibility analysis on the resulting algorithm, both in terms of its accuracy and scalability.

[LG-157] Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

Link: https://arxiv.org/abs/2410.09918
Authors: DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng
Keywords: human cognition theory, intuitive System, deliberative System, System, human cognition
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 processes into Transformers, including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present Dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. Dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (fast mode) or both the reasoning chain and the final solution (slow mode), or automatically decide which mode to engage (auto mode). In all cases, Dualformer outperforms the corresponding baseline models in both performance and computational efficiency: (1) in slow mode, Dualformer optimally solves unseen 30 x 30 maze navigation tasks 97.6% of the time, surpassing the Searchformer (trained on data with complete reasoning traces) baseline performance of 93.3%, while using 45.5% fewer reasoning steps; (2) in fast mode, Dualformer completes those tasks with an 80% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, showing that they generalize beyond task-specific models.
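
The trace-randomization idea can be illustrated with a small sketch, assuming two simple dropping strategies: dropping the whole reasoning trace with some probability, or otherwise dropping individual steps independently. The function name, both probabilities, and the token-list format are hypothetical; the abstract only says that the strategies are tailored to the trace structure.

```python
import random


def randomize_trace(trace, solution, drop_step_prob, drop_all_prob, rng):
    """Build one training target from a reasoning trace and a solution.

    With probability drop_all_prob the whole trace is dropped (a
    solution-only example); otherwise each step is dropped independently
    with probability drop_step_prob. Both knobs are illustrative.
    """
    if rng.random() < drop_all_prob:
        return list(solution)
    kept = [step for step in trace if rng.random() >= drop_step_prob]
    return kept + list(solution)


rng = random.Random(0)  # seeded for reproducibility
example = randomize_trace(["expand A", "expand B", "expand C"], ["path A-C"], 0.3, 0.2, rng)
```

Training on a mix of full, partial, and solution-only targets is what would let a single model emit either a short answer or a full reasoning chain at inference time.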

[LG-158] UnSeg: One Universal Unlearnable Example Generator is Enough against All Image Segmentation NEURIPS2024

Link: https://arxiv.org/abs/2410.09909
Authors: Ye Sun, Hao Zhang, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
Keywords: semantically meaningful segments, crucial vision task, real-world scenes, Image segmentation, crucial vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: NeurIPS 2024

Abstract:Image segmentation is a crucial vision task that groups pixels within an image into semantically meaningful segments, which is pivotal in obtaining a fine-grained understanding of real-world scenes. However, an increasing privacy concern exists regarding training large-scale image segmentation models on unauthorized private data. In this work, we exploit the concept of unlearnable examples to make images unusable to model training by generating and adding unlearnable noise into the original images. Particularly, we propose a novel Unlearnable Segmentation (UnSeg) framework to train a universal unlearnable noise generator that is capable of transforming any downstream images into their unlearnable version. The unlearnable noise generator is finetuned from the Segment Anything Model (SAM) via bilevel optimization on an interactive segmentation dataset towards minimizing the training error of a surrogate model that shares the same architecture with SAM but is trained from scratch. We empirically verify the effectiveness of UnSeg across 6 mainstream image segmentation tasks, 10 widely used datasets, and 7 different network architectures, and show that the unlearnable images can reduce the segmentation performance by a large margin. Our work provides useful insights into how to leverage foundation models in a data-efficient and computationally affordable manner to protect images against image segmentation models.

[LG-159] Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning

Link: https://arxiv.org/abs/2410.09908
Authors: Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Keywords: Foundation models, techniques like Low-Rank, Foundation, RPE, Retrieval-based Parameter Ensemble
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
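
One way to picture retrieval over a database of LoRAs is the sketch below: each stored adapter is keyed by a task embedding, the closest entries to a new task's embedding are retrieved by cosine similarity, and their parameter deltas are blended with softmax weights. The database layout, the softmax weighting, and all names here are assumptions for illustration; the abstract does not state how RPE computes its ensemble weights.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_and_ensemble(query, database, k=2):
    """Blend the parameter deltas of the k most similar stored adapters.

    database: non-empty list of (task_embedding, lora_delta) pairs, where
    lora_delta is a flat list of floats (hypothetical storage format).
    """
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    sims = [cosine(query, emb) for emb, _ in ranked]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    size = len(ranked[0][1])
    return [sum(w * delta[i] for w, (_, delta) in zip(weights, ranked)) for i in range(size)]


database = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 4.0])]
blended = retrieve_and_ensemble([1.0, 1.0], database, k=2)
```

No gradient step or label is involved in this sketch, which matches the zero-shot framing: only embeddings of the new task are needed to pick and mix existing adapters.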

[LG-160] Multi class activity classification in videos using Motion History Image generation

Link: https://arxiv.org/abs/2410.09902
Authors: Senthilkumar Gopal
Keywords: Human action recognition, multiple fields ranging, Human action, topic of interest, interest across multiple
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 9 images

Abstract:Human action recognition has been a topic of interest across multiple fields ranging from security to entertainment systems. Tracking the motion and identifying the action being performed in real time is necessary for critical security systems. In entertainment, especially gaming, immediate responses to actions and gestures are paramount for the success of the system. We show that the Motion History Image (MHI) is a well-established framework for capturing temporal and activity information in multi-dimensional detail, enabling various use cases including classification. We utilize MHI to produce sample data to train a classifier and demonstrate its effectiveness for action classification across six different activities in a single multi-action video. We analyze the classifier performance, identify use cases where MHI struggles to generate the appropriate activity image, and discuss mechanisms and future work to overcome those limitations.
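
For readers unfamiliar with Motion History Images, here is a minimal sketch of the commonly used update rule: a pixel is set to a maximum value `tau` when motion is detected there, and otherwise decays by one per frame. The frame-differencing motion test, the `threshold` value, and the nested-list grayscale format are assumptions for illustration, not details taken from the paper.

```python
def motion_history_image(frames, tau=3, threshold=10):
    """Compute an MHI from a list of equally sized grayscale frames.

    Motion at a pixel is assumed to mean that the absolute difference
    between consecutive frames exceeds `threshold`.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                if abs(curr[y][x] - prev[y][x]) > threshold:
                    mhi[y][x] = tau  # fresh motion gets the maximum value
                else:
                    mhi[y][x] = max(0, mhi[y][x] - 1)  # older motion fades
    return mhi


frames = [[[0, 0, 0]], [[100, 0, 0]], [[100, 100, 0]]]
mhi = motion_history_image(frames)
```

In the resulting image, brighter pixels moved more recently, so a single 2D array summarizes where and roughly when motion occurred; that array is what a classifier would consume.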

[LG-161] Inductive Conformal Prediction under Data Scarcity: Exploring the Impacts of Nonconformity Measures

Link: https://arxiv.org/abs/2410.09894
Authors: Yuko Kato, David M.J. Tax, Marco Loog
Keywords: Conformal prediction, conformal prediction interval, conformal prediction quantifies, nonconformity measure, makes no distributional
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Conformal prediction, which makes no distributional assumptions about the data, has emerged as a powerful and reliable approach to uncertainty quantification in practical applications. The nonconformity measure used in conformal prediction quantifies how a test sample differs from the training data and the effectiveness of a conformal prediction interval may depend heavily on the precise measure employed. The impact of this choice has, however, not been widely explored, especially when dealing with limited amounts of data. The primary objective of this study is to evaluate the performance of various nonconformity measures (absolute error-based, normalized absolute error-based, and quantile-based measures) in terms of validity and efficiency when used in inductive conformal prediction. The focus is on small datasets, which is still a common setting in many real-world applications. Using synthetic and real-world data, we assess how different characteristics – such as dataset size, noise, and dimensionality – can affect the efficiency of conformal prediction intervals. Our results show that although there are differences, no single nonconformity measure consistently outperforms the others, as the effectiveness of each nonconformity measure is heavily influenced by the specific nature of the data. Additionally, we found that increasing dataset size does not always improve efficiency, suggesting the importance of fine-tuning models and, again, the need to carefully select the nonconformity measure for different applications.
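
To make the setting concrete, the following sketch shows inductive (split) conformal prediction for regression with the absolute-error nonconformity measure, one of the three measures the abstract lists. The function name and the list-based inputs are illustrative; the normalized and quantile-based measures studied in the paper are not shown.

```python
import math


def icp_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Prediction interval for one test point from calibration residuals.

    Uses the standard finite-sample rank ceil((n + 1) * (1 - alpha)) of
    the sorted absolute errors on a held-out calibration set.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # interval carries the coverage guarantee.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)


interval = icp_interval([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.0, 4.25], 10.0, alpha=0.5)
```

The `k > n` branch shows why data scarcity matters: with only four calibration points, any `alpha` below 0.2 already forces an infinite interval.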

[LG-162] Improving Colorectal Cancer Screening and Risk Assessment through Predictive Modeling on Medical Images and Records

Link: https://arxiv.org/abs/2410.09880
Authors: Shuai Jiang, Christina Robinson, Joseph Anderson, William Hisey, Lynn Butterly, Arief Suriawinata, Saeed Hassanpour
Keywords: CRC risk, remove colon polyps, Multi-Society Task Force, future CRC risk, CRC risk prediction
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Colonoscopy screening is an effective method to find and remove colon polyps before they can develop into colorectal cancer (CRC). Current follow-up recommendations, as outlined by the U.S. Multi-Society Task Force for individuals found to have polyps, primarily rely on histopathological characteristics, neglecting other significant CRC risk factors. Moreover, the considerable variability in colorectal polyp characterization among pathologists poses challenges in effective colonoscopy follow-up or surveillance. The evolution of digital pathology and recent advancements in deep learning provide a unique opportunity to investigate the added benefits of including the additional medical record information and automatic processing of pathology slides using computer vision techniques in the calculation of future CRC risk. Leveraging the New Hampshire Colonoscopy Registry’s extensive dataset, many with longitudinal colonoscopy follow-up information, we adapted our recently developed transformer-based model for histopathology image analysis in 5-year CRC risk prediction. Additionally, we investigated various multimodal fusion techniques, combining medical record information with deep learning derived risk estimates. Our findings reveal that training a transformer model to predict intermediate clinical variables contributes to enhancing 5-year CRC risk prediction performance, with an AUC of 0.630 compared to direct prediction. Furthermore, the fusion of imaging and non-imaging features, while not requiring manual inspection of microscopy images, demonstrates improved predictive capabilities for 5-year CRC risk compared to variables extracted from colonoscopy procedure and microscopy findings. This study signifies the potential of integrating diverse data sources and advanced computational techniques in transforming the accuracy and effectiveness of future CRC risk assessments.

[LG-163] Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning

Link: https://arxiv.org/abs/2410.09878
Authors: Yan Scholten, Stephan Günnemann
Keywords: Conformal prediction, prediction sets, prediction, user-specified probability, model-agnostic and distribution-free
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Conformal prediction provides model-agnostic and distribution-free uncertainty quantification through prediction sets that are guaranteed to include the ground truth with any user-specified probability. Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data, which can significantly alter prediction sets in practice. As a solution, we propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning. To ensure reliability under training poisoning, we introduce smoothed score functions that reliably aggregate predictions of classifiers trained on distinct partitions of the training data. To ensure reliability under calibration poisoning, we construct multiple prediction sets, each calibrated on distinct subsets of the calibration data. We then aggregate them into a majority prediction set, which includes a class only if it appears in a majority of the individual sets. Both proposed aggregations mitigate the influence of datapoints in the training and calibration data on the final prediction set. We experimentally validate our approach on image classification tasks, achieving strong reliability while maintaining utility and preserving coverage on clean data. Overall, our approach represents an important step towards more trustworthy uncertainty quantification in the presence of data poisoning.
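
The majority aggregation step can be sketched in a few lines, assuming "majority" means a class appears in strictly more than half of the individual prediction sets. The function name is illustrative, and the sketch covers only the aggregation, not the calibration of each set or the paper's reliability guarantees.

```python
from collections import Counter


def majority_prediction_set(prediction_sets):
    """Keep classes that occur in more than half of the given sets."""
    counts = Counter(label for s in prediction_sets for label in set(s))
    threshold = len(prediction_sets) / 2
    return {label for label, count in counts.items() if count > threshold}


sets = [{"cat", "dog"}, {"cat"}, {"dog", "bird"}]
majority = majority_prediction_set(sets)
```

Because each individual set is calibrated on a distinct subset of the calibration data, a poisoned point can affect only one of the votes, which limits its influence on the aggregated set.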

[LG-164] Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset INTERSPEECH2024

Link: https://arxiv.org/abs/2410.09869
Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa
Keywords: audio deepfake detection, deepfake detection, study test-time domain, test-time domain adaptation, study test-time
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2024. Hideyuki Oiso and Yuto Matsunaga contributed equally

Abstract:We study test-time domain adaptation for audio deepfake detection (ADD), addressing three challenges: (i) source-target domain gaps, (ii) limited target dataset size, and (iii) high computational costs. We propose an ADD method using prompt tuning in a plug-in style. It bridges domain gaps by integrating it seamlessly with state-of-the-art transformer models and/or with other fine-tuning methods, boosting their performance on target data (challenge (i)). In addition, our method can fit small target datasets because it does not require a large number of extra parameters (challenge (ii)). This feature also contributes to computational efficiency, countering the high computational costs typically associated with large-scale pre-trained models in ADD (challenge (iii)). We conclude that prompt tuning for ADD under domain gaps presents a promising avenue for enhancing accuracy with minimal target data and negligible extra computational burden.

[LG-165] Towards characterizing the value of edge embeddings in Graph Neural Networks

Link: https://arxiv.org/abs/2410.09867
Authors: Dhruv Rohatgi,Tanya Marwah,Zachary Chase Lipton,Jianfeng Lu,Ankur Moitra,Andrej Risteski
Keywords: Graph neural networks, solving machine learning, machine learning problems, learning problems defined, Graph neural
Categories: Machine Learning (cs.LG)
*Comments: 25 pages, 2 figures

Abstract:Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts – frequently significantly so in topologies that have "hub" nodes.

[LG-166] Symmetry Discovery for Different Data Types

Link: https://arxiv.org/abs/2410.09841
Authors: Lexiang Hu,Yikang Li,Zhouchen Lin
Keywords: Equivariant neural networks, achieving higher generalization, neural networks incorporate, constructing equivariant neural, Equivariant neural
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance. However, constructing equivariant neural networks typically requires prior knowledge of data types and symmetries, which is difficult to achieve in most tasks. In this paper, we propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks. It characterizes equivariance and invariance (a special case of equivariance) of continuous groups using Lie algebra and directly solves the Lie algebra space through the inputs, outputs, and gradients of the trained neural network. Then, we extend the method to make it applicable to multi-channel data and tensor data, respectively. We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging. Compared with the baseline, LieSD can accurately determine the number of Lie algebra bases without the need for expensive group sampling. Furthermore, LieSD can perform well on non-uniform datasets, whereas methods based on GANs fail.

[LG-167] Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense NEURIPS2024

Link: https://arxiv.org/abs/2410.09838
Authors: Rui Min,Zeyu Qin,Nevin L. Zhang,Li Shen,Minhao Cheng
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, threat to Deep, Attack Success Rates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments: NeurIPS 2024 Spotlight paper. The first two authors contributed equally

Abstract:Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA) which could effectively reactivate the backdoor by merely querying purified models. We find the failure to achieve satisfactory post-tuning robustness stems from the insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve the post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining a good clean accuracy and low ASR. Our work provides a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing the model’s safety.

[LG-168] Learning Pattern-Specific Experts for Time Series Forecasting Under Patch-level Distribution Shift

Link: https://arxiv.org/abs/2410.09836
Authors: Yanru Sun,Zongxia Xie,Emadeldeen Eldele,Dongyue Chen,Qinghua Hu,Min Wu
Keywords: garnered significant attention, significant attention due, Time series forecasting, Time series, range of applications
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Time series forecasting, which aims to predict future values based on historical data, has garnered significant attention due to its broad range of applications. However, real-world time series often exhibit complex non-uniform distribution with varying patterns across segments, such as season, operating condition, or semantic meaning, making accurate forecasting challenging. Existing approaches, which typically train a single model to capture all these diverse patterns, often struggle with the pattern drifts between patches and may lead to poor generalization. To address these challenges, we propose TFPS, a novel architecture that leverages pattern-specific experts for more accurate and adaptable time series forecasting. TFPS employs a dual-domain encoder to capture both time-domain and frequency-domain features, enabling a more comprehensive understanding of temporal dynamics. It then uses subspace clustering to dynamically identify distinct patterns across data patches. Finally, pattern-specific experts model these unique patterns, delivering tailored predictions for each patch. By explicitly learning and adapting to evolving patterns, TFPS achieves significantly improved forecasting accuracy. Extensive experiments on real-world datasets demonstrate that TFPS outperforms state-of-the-art methods, particularly in long-term forecasting, through its dynamic and pattern-aware learning approach. The data and codes are available: this https URL.

[LG-169] Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

Link: https://arxiv.org/abs/2410.09823
Authors: Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding
Keywords: adapting large language, huge memory usages, large language models, powerful for adapting, adapting large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Abstract:Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages. A promising approach to mitigate this is using Zeroth-Order (ZO) optimization, which estimates gradients to replace First-Order (FO) gradient calculations, albeit with longer training time due to its stochastic nature. By revisiting the Memory-efficient ZO (MeZO) optimizer, we discover that the full-parameter perturbation and updating processes consume over 50% of its overall fine-tuning time cost. Based on these observations, we introduce a novel layer-wise sparse computation and memory efficient ZO optimizer, named LeZO. LeZO treats layers as fundamental units for sparsification and dynamically perturbs different parameter subsets in each step to achieve full-parameter fine-tuning. LeZO incorporates layer-wise parameter sparsity in the process of simultaneous perturbation stochastic approximation (SPSA) and ZO stochastic gradient descent (ZO-SGD). It achieves accelerated computation during perturbation and updating processes without additional memory overhead. We conduct extensive experiments with the OPT model family on the SuperGLUE benchmark and two generative tasks. The experiments show that LeZO accelerates training without compromising the performance of ZO optimization. Specifically, it achieves over 3x speedup compared to MeZO on the SST-2, BoolQ, and Copa tasks.

[LG-170] TopOC: Topological Deep Learning for Ovarian and Breast Cancer Diagnosis

Link: https://arxiv.org/abs/2410.09818
Authors: Saba Fatema,Brighton Nuwagira,Sayoni Chakraborty,Reyhan Gedik,Baris Coskunuzer
Keywords: classifying cancerous lesions, Microscopic examination, deep learning methods, cancerous lesions, experienced pathologists
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*Comments:

Abstract:Microscopic examination of slides prepared from tissue samples is the primary tool for detecting and classifying cancerous lesions, a process that is time-consuming and requires the expertise of experienced pathologists. Recent advances in deep learning methods hold significant potential to enhance medical diagnostics and treatment planning by improving accuracy, reproducibility, and speed, thereby reducing clinicians’ workloads and turnaround times. However, the necessity for vast amounts of labeled data to train these models remains a major obstacle to the development of effective clinical decision support systems. In this paper, we propose the integration of topological deep learning methods to enhance the accuracy and robustness of existing histopathological image analysis models. Topological data analysis (TDA) offers a unique approach by extracting essential information through the evaluation of topological patterns across different color channels. While deep learning methods capture local information from images, TDA features provide complementary global features. Our experiments on publicly available histopathological datasets demonstrate that the inclusion of topological features significantly improves the differentiation of tumor types in ovarian and breast cancers.
Journal reference: MICCAI TGI3 2024. Related DOI: https://doi.org/10.1007/978-3-031-73967-5_3

[LG-171] BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Link: https://arxiv.org/abs/2410.09804
Authors: Xinyuan Wang,Victor Shea-Jay Huang,Renmiao Chen,Hao Wang,Chengwei Pan,Lei Sha,Minlie Huang
Keywords: encounter potential security, potential security risks, bypass security measures, large language models, exhibit remarkable capabilities
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments:

Abstract:While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.

[LG-172] ContextWIN: Whittle Index Based Mixture-of-Experts Neural Model For Restless Bandits Via Deep RL

Link: https://arxiv.org/abs/2410.09781
Authors: Zhanqiu Guo,Wayne Wang
Keywords: Restless Multi-Armed Bandit, address Restless Multi-Armed, Neural Whittle Index, Whittle Index Network, Multi-Armed Bandit
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*Comments:

Abstract:This study introduces ContextWIN, a novel architecture that extends the Neural Whittle Index Network (NeurWIN) model to address Restless Multi-Armed Bandit (RMAB) problems with a context-aware approach. By integrating a mixture of experts within a reinforcement learning framework, ContextWIN adeptly utilizes contextual information to inform decision-making in dynamic environments, particularly in recommendation systems. A key innovation is the model’s ability to assign context-specific weights to a subset of NeurWIN networks, thus enhancing the efficiency and accuracy of the Whittle index computation for each arm. The paper presents a thorough exploration of ContextWIN, from its conceptual foundation to its implementation and potential applications. We delve into the complexities of RMABs and the significance of incorporating context, highlighting how ContextWIN effectively harnesses these elements. The convergence of both the NeurWIN and ContextWIN models is rigorously proven, ensuring theoretical robustness. This work lays the groundwork for future advancements in applying contextual information to complex decision-making scenarios, recognizing the need for comprehensive dataset exploration and environment development for full potential realization.

[LG-173] Quis custodiet ipsos custodes? Who will watch the watchmen? On Detecting AI-generated peer-reviews EMNLP

Link: https://arxiv.org/abs/2410.09770
Authors: Sandeep Kumar,Mohit Sahu,Vardhan Gacche,Tirthankar Ghosal,Asif Ekbal
Keywords: maintaining scientific rigor, process is vital, vital for maintaining, rigor and trust, academic community
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*Comments: EMNLP Main, 17 pages, 5 figures, 9 tables

Abstract:The integrity of the peer-review process is vital for maintaining scientific rigor and trust within the academic community. With the steady increase in the usage of large language models (LLMs) like ChatGPT in academic writing, there is a growing concern that AI-generated texts could compromise scientific publishing, including peer-reviews. Previous works have focused on generic AI-generated text detection or have presented an approach for estimating the fraction of peer-reviews that can be AI-generated. Our focus here is to solve a real-world problem by assisting the editor or chair in determining whether a review is written by ChatGPT or not. To address this, we introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting. We stress test these detectors against token attack and paraphrasing. Finally, we propose an effective defensive strategy to reduce the effect of paraphrasing on our models. Our findings suggest both our proposed methods perform better than the other AI text detectors. Our RR model is more robust, although our TF model performs better than the RR model without any attacks. We make our code, dataset, and model public.

[LG-174] Stability and Sharper Risk Bounds with Convergence Rate O(1/n^2)

Link: https://arxiv.org/abs/2410.09766
Authors: Bowei Zhu,Shaojie Li,Yong Liu
Keywords: probability excess risk, excess risk bounds, high probability excess, excess risk, probability excess
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:The sharpest known high probability excess risk bounds are up to O(1/n) for empirical risk minimization and projected gradient descent via algorithmic stability (Klochkov & Zhivotovskiy, 2021). In this paper, we show that high probability excess risk bounds of order up to O(1/n^2) are possible. We discuss how high probability excess risk bounds reach O(1/n^2) under strong convexity, smoothness and Lipschitz continuity assumptions for empirical risk minimization, projected gradient descent and stochastic gradient descent. Besides, to the best of our knowledge, our high probability results on the generalization gap measured by gradients for nonconvex problems are also the sharpest.

[LG-175] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation

Link: https://arxiv.org/abs/2410.09760
Authors: Guozhi Liu,Weiwei Lin,Tiansheng Huang,Ruichao Mo,Qi Mu,Li Shen
Keywords: online fine-tuning service, fine-tuning attack poses, attack poses, uniform perturbation, layers
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to all layers of embedding to make the model robust to the simulated embedding drift. However, applying layer-wise uniform perturbation may lead to excess perturbations for some particular safety-irrelevant layers, resulting in defense performance degradation and unnecessary memory consumption. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps: First, it uses gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine only applies perturbation to the safety-critical layers while keeping other layers frozen during training. Results show that T-Vaccine outperforms Vaccine in terms of both defense effectiveness and resource efficiency. Comparisons with other defense baselines, e.g., RepNoise and TAR, also demonstrate the superiority of T-Vaccine. Notably, T-Vaccine is the first defense that can address harmful fine-tuning issues for 7B pre-trained models trained on consumer GPUs with limited memory (e.g., RTX 4090). Our code is available at this https URL.

[LG-176] BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation

Link: https://arxiv.org/abs/2410.09758
Authors: Peijia Qin,Ruiyi Zhang,Pengtao Xie
Keywords: gained considerable attention, Parameter-efficient fine-tuning, large language models, adapting LLMs, gained considerable
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Abstract:Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has gained considerable attention as a flexible and efficient way of adapting LLMs to downstream tasks. Among these methods, weighted decomposed low-rank adaptation (DoRA) has emerged as a promising approach. DoRA bridges the gap between low-rank adaptation (LoRA) and full fine-tuning (FT) by decomposing the weight matrices into magnitude and direction components, thereby maintaining learning behavior similar to FT. Although DoRA shows encouraging performance, it introduces additional parameters compared to LoRA, which potentially increases the risk of overfitting. Moreover, optimizing magnitude and direction simultaneously leads to a coupled gradient updating pattern for both components, limiting its learning capacity. To overcome these limitations, we propose BiDoRA, a bi-level optimization-based PEFT method. In BiDoRA, the direction and magnitude components are optimized on two distinct datasets at different optimization levels, mitigating the risk of overfitting. Additionally, the asynchronous optimization of the two components promotes their decoupling, allowing for more flexible gradient updates suitable for various downstream tasks. Evaluation of BiDoRA on fourteen datasets spanning natural language understanding, natural language generation, and token classification reveals that it significantly outperforms DoRA and other PEFT methods. The superior performance of BiDoRA underscores its effectiveness. The code for BiDoRA is available at this https URL.
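
The magnitude/direction decomposition mentioned in this abstract can be illustrated with a small sketch. It assumes a column-wise norm, so that each column of the weight matrix is written as a scalar magnitude times a unit-norm direction; the function names are hypothetical, and neither the low-rank update nor the bi-level optimization of BiDoRA is shown.

```python
import math

def decompose(weight):
    """Split a matrix (list of rows) into per-column magnitudes and unit directions.

    Assumption: the norm is taken column-wise.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    magnitude = [
        math.sqrt(sum(weight[r][c] ** 2 for r in range(n_rows)))
        for c in range(n_cols)
    ]
    direction = [
        [weight[r][c] / magnitude[c] for c in range(n_cols)]
        for r in range(n_rows)
    ]
    return magnitude, direction

def recompose(magnitude, direction):
    """Rebuild the matrix by scaling each unit-norm column by its magnitude."""
    return [[m * d for m, d in zip(magnitude, row)] for row in direction]

W = [[3.0, 0.0], [4.0, 2.0]]
m, V = decompose(W)
W_rebuilt = recompose(m, V)
print(m)
```

Once the two components are stored separately, they can be treated as separate groups of trainable parameters, which is what makes it possible to optimize them on different data or at different optimization levels.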

[LG-177] Comparison of Machine Learning Approaches for Classifying Spinodal Events

Link: https://arxiv.org/abs/2410.09756
Authors: Ashwini Malviya,Sparsh Mittal
Keywords: deep learning models, compare the performance, performance of deep, deep learning, classifying the spinodal
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*Comments:

Abstract:In this work, we compare the performance of deep learning models for classifying the spinodal dataset. We evaluate state-of-the-art models (MobileViT, NAT, EfficientNet, CNN), alongside several ensemble models (majority voting, AdaBoost). Additionally, we explore the dataset in a transformed color space. Our findings show that NAT and MobileViT outperform other models, achieving the highest metrics (accuracy, AUC, and F1 score) on both training and testing data (NAT: 94.65, 0.98, 0.94; MobileViT: 94.20, 0.98, 0.94), surpassing the earlier CNN model (88.44, 0.95, 0.88). We also discuss failure cases for the top performing models.

[LG-178] SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2410.09754
Authors: Hojoon Lee,Dongyoon Hwang,Donghu Kim,Hyunseung Kim,Jun Jet Tai,Kaushik Subramanian,Peter R. Wurman,Jaegul Choo,Peter Stone,Takuma Seno
Keywords: traditional theories suggesting, Recent advances, largely driven, traditional theories, theories suggesting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: preprint

Abstract:Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block to provide a linear pathway from the input to output, and (iii) a layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms-including off-policy, on-policy, and unsupervised methods-is consistently improved. Moreover, solely by integrating SimBa architecture into SAC, it matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa’s broad applicability and effectiveness across diverse RL algorithms and environments.
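
The three components listed in this abstract can be sketched in plain Python. This is a minimal illustration, not the SimBa implementation: the class and function names are hypothetical, the running statistics use Welford's update, and placing the layer normalization before the feedforward function inside the residual block is an assumption.

```python
import math

class RunningNorm:
    """Standardizes observations with running mean and variance (Welford's update)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def __call__(self, obs):
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count else 1.0
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out

def layer_norm(x, eps=1e-5):
    """Normalizes one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, feedforward):
    """Adds the feedforward output to x, keeping an identity path from input to output."""
    return [a + b for a, b in zip(x, feedforward(layer_norm(x)))]

norm = RunningNorm(dim=1)
norm.update([0.0])
norm.update([2.0])
standardized = norm([2.0])
passthrough = residual_block([1.0, 2.0, 3.0], lambda h: [0.0] * len(h))
print(standardized, passthrough)
```

With a feedforward function that outputs zeros, the block returns its input unchanged, which shows the linear pathway that the residual connection provides.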

[LG-179] t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving

Link: https://arxiv.org/abs/2410.09747
Authors: Pengfei Hu,Yuhang Qian,Tianyue Zheng,Ang Li,Zhe Chen,Yue Gao,Xiuzhen Cheng,Jun Luo
Keywords: autonomous vehicles, wide adoption, analytics to fuse, fuse their outputs, multimodal sensors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: 15 pages, 16 figures

Abstract:Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for a robust perception become imperative. However, existing fusion methods often make two assumptions rarely holding in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have various resolutions and failures of radars may occur, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. The extensive experiments evidently demonstrate that compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x with the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.

[LG-180] Real-time Fuel Leakage Detection via Online Change Point Detection

Link: https://arxiv.org/abs/2410.09741
Authors: Ruimin Chu,Li Chik,Yiliao Song,Jeffrey Chan,Xiaodong Li
Keywords: prevent catastrophic hazards, underground petroleum storage, petroleum storage systems, fuel leakage, catastrophic hazards
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Early detection of fuel leakage at service stations with underground petroleum storage systems is a crucial task to prevent catastrophic hazards. Current data-driven fuel leakage detection methods employ offline statistical inventory reconciliation, leading to significant detection delays. Consequently, this can result in substantial financial loss and environmental impact on the surrounding community. In this paper, we propose a novel framework called Memory-based Online Change Point Detection (MOCPD) which operates in near real-time, enabling early detection of fuel leakage. MOCPD maintains a collection of representative historical data within a size-constrained memory, along with an adaptively computed threshold. Leaks are detected when the dissimilarity between the latest data and historical memory exceeds the current threshold. An update phase is incorporated in MOCPD to ensure diversity among historical samples in the memory. With this design, MOCPD is more robust and achieves a better recall rate while maintaining a reasonable precision score. We have conducted a variety of experiments comparing MOCPD to commonly used online change point detection (CPD) baselines on real-world fuel variance data with induced leakages, actual fuel leakage data and benchmark CPD datasets. Overall, MOCPD consistently outperforms the baseline methods in terms of detection accuracy, demonstrating its applicability to fuel leakage detection and CPD problems.
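
The detection rule described in this abstract (a size-constrained memory, an adaptively computed threshold, and a dissimilarity test) can be sketched as follows. Everything specific here is an assumption made for illustration and not the MOCPD algorithm itself: the nearest-neighbour distance as the dissimilarity, the mean plus `k` standard deviations as the threshold, the first-in-first-out memory (the paper instead uses an update phase that keeps the memory diverse), and the toy readings.

```python
from collections import deque
import statistics

class MemoryChangeDetector:
    """Flags a change when a new value is unusually far from a bounded memory."""

    def __init__(self, memory_size=20, warmup=5, k=3.0):
        self.memory = deque(maxlen=memory_size)  # size-constrained history
        self.scores = []                         # dissimilarities seen so far
        self.warmup = warmup
        self.k = k

    def step(self, x):
        # Fill the memory before attempting any detection.
        if len(self.memory) < self.warmup:
            self.memory.append(x)
            return False
        # Dissimilarity: distance to the closest value held in memory.
        score = min(abs(x - m) for m in self.memory)
        # Adaptive threshold derived from previously observed dissimilarities.
        if len(self.scores) >= 2:
            threshold = statistics.mean(self.scores) + self.k * statistics.pstdev(self.scores)
        else:
            threshold = float("inf")
        changed = score > threshold
        if not changed:
            # Only normal-looking points update the memory and the threshold.
            self.scores.append(score)
            self.memory.append(x)
        return changed

detector = MemoryChangeDetector()
readings = [10.0, 10.1, 9.9, 10.05, 9.95, 10.02, 9.98, 10.1, 9.9, 10.0, 15.0]
flags = [detector.step(x) for x in readings]
print(flags)
```

Each reading is processed as it arrives, so a jump such as the last value is flagged immediately instead of waiting for an offline reconciliation pass.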

[LG-181] Towards Stable, Globally Expressive Graph Representations with Laplacian Eigenvectors

Link: https://arxiv.org/abs/2410.09737
Authors: Junru Zhou,Cai Zhou,Xiyuan Wang,Pan Li,Muhan Zhang
Keywords: achieved remarkable success, machine learning tasks, achieved remarkable, remarkable success, variety of machine
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Graph neural networks (GNNs) have achieved remarkable success in a variety of machine learning tasks over graph data. Existing GNNs usually rely on message passing, i.e., computing node representations by gathering information from the neighborhood, to build their underlying computational graphs. They are known to be fairly limited in expressive power, and often fail to capture global characteristics of graphs. To overcome the issue, a popular solution is to use Laplacian eigenvectors as additional node features, as they contain global positional information of nodes, and can serve as extra node identifiers aiding GNNs to separate structurally similar nodes. For such an approach, properly handling the orthogonal group symmetry among eigenvectors with equal eigenvalue is crucial for its stability and generalizability. However, using a naive orthogonal group invariant encoder for each separate eigenspace may not keep the full expressivity in the Laplacian eigenvectors. Moreover, computing such invariants inevitably entails a hard split of Laplacian eigenvalues according to their numerical identity, which suffers from great instability when the graph structure is perturbed. In this paper, we propose a novel method exploiting Laplacian eigenvectors to generate stable and globally expressive graph representations. The main difference from previous works is that (i) our method utilizes learnable orthogonal group invariant representations for each Laplacian eigenspace, based upon powerful orthogonal group equivariant neural network layers already well studied in the literature, and that (ii) our method deals with numerically close eigenvalues in a smooth fashion, ensuring its better robustness against perturbations. Experiments on various graph learning benchmarks demonstrate the competitive performance of our method, especially its great potential to learn global properties of graphs.

[LG-182] Gradient-Free Neural Network Training on the Edge

Link: https://arxiv.org/abs/2410.09734
Authors: Dotan Di Castro,Omkar Joglekar,Shir Kozlovsky,Vladimir Tchuiev,Michal Moshkovitz
Keywords: heavy and energy-intensive, computationally heavy, Training neural networks, neural networks, Training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Training neural networks is computationally heavy and energy-intensive. Many methodologies were developed to save computational requirements and energy by reducing the precision of network weights at inference time and introducing techniques such as rounding, stochastic rounding, and quantization. However, most of these techniques still require full gradient precision at training time, which makes training such models prohibitive on edge devices. This work presents a novel technique for training neural networks without needing gradients. This enables a training process where all the weights are one or two bits, without any hidden full precision computations. We show that it is possible to train models without gradient-based optimization techniques by identifying erroneous contributions of each neuron towards the expected classification and flipping the relevant bits using logical operations. We tested our method on several standard datasets and achieved performance comparable to corresponding gradient-based baselines with a fraction of the compute power.

[LG-183] Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-task Optimum Comparator

Link: https://arxiv.org/abs/2410.09728
Authors: Siyuan Xu, Minghui Zhu
Keywords: enhance reinforcement learning, attracted attention due, Meta-reinforcement learning, reinforcement learning, attracted attention
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Meta-reinforcement learning (Meta-RL) has attracted attention due to its capability to enhance reinforcement learning (RL) algorithms, in terms of data efficiency and generalizability. In this paper, we develop a bilevel optimization framework for meta-RL (BO-MRL) to learn the meta-prior for task-specific policy adaptation, which implements multiple-step policy optimization on one-time data collection. Beyond existing meta-RL analyses, we provide upper bounds of the expected optimality gap over the task distribution. This metric measures the distance of the policy adaptation from the learned meta-prior to the task-specific optimum, and quantifies the model’s generalizability to the task distribution. We empirically validate the correctness of the derived upper bounds and demonstrate the superior effectiveness of the proposed algorithm over benchmarks.

[LG-184] Flying Quadrotors in Tight Formations using Learning-based Model Predictive Control

Link: https://arxiv.org/abs/2410.09727
Authors: Kong Yao Chee, Pei-An Hsieh, George J. Pappas, M. Ani Hsieh
Keywords: challenging problem, physical experiments, framework, physical, aerodynamic effects
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 5 figures

Abstract:Flying quadrotors in tight formations is a challenging problem. It is known that in the near-field airflow of a quadrotor, the aerodynamic effects induced by the propellers are complex and difficult to characterize. Although machine learning tools can potentially be used to derive models that capture these effects, these data-driven approaches can be sample inefficient and the resulting models often do not generalize as well as their first-principles counterparts. In this work, we propose a framework that combines the benefits of first-principles modeling and data-driven approaches to construct an accurate and sample efficient representation of the complex aerodynamic effects resulting from quadrotors flying in formation. The data-driven component within our model is lightweight, making it amenable for optimization-based control design. Through simulations and physical experiments, we show that incorporating the model into a novel learning-based nonlinear model predictive control (MPC) framework results in substantial performance improvements in terms of trajectory tracking and disturbance rejection. In particular, our framework significantly outperforms nominal MPC in physical experiments, achieving a 40.1% improvement in the average trajectory tracking errors and a 57.5% reduction in the maximum vertical separation errors. Our framework also achieves exceptional sample efficiency, using only a total of 46 seconds of flight data for training across both simulations and physical experiments. Furthermore, with our proposed framework, the quadrotors achieve an exceptionally tight formation, flying with an average separation of less than 1.5 body lengths throughout the flight. A video illustrating our framework and physical experiments is given here: this https URL

[LG-185] A Tidal Current Speed Forecasting Model based on Multiple Periodicity Learning

Link: https://arxiv.org/abs/2410.09718
Authors: Tengfei Cheng, Yunxuan Dong, Yangdi Huang
Keywords: tidal current speed, tidal current, Tidal energy, Tidal, key components
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tidal energy is one of the key components in increasing the penetration rate of renewable energy. The penetration of tidal energy in the electrical grid depends on the accuracy of tidal current speed forecasting. Modeling inaccuracies hinder forecast accuracy. Previous research has primarily used physical models to forecast tidal current speed. However, tidal current variations influenced by the orbital periods of celestial bodies make accurate physical modeling challenging. Researching the multiple periodicity of tides is crucial for accurately forecasting tidal current speed. In this article, we propose the Wavelet-Enhanced Convolutional Network (WCN) to learn multiple periodicity. The framework embeds intra-period and inter-period variations of one-dimensional tidal current data into the rows and columns of a two-dimensional tensor. Then, the two-dimensional variations of the sequence can be processed by convolutional kernels. We integrate a time-frequency analysis method into the framework to further address local periodic features. Additionally, to enhance the framework’s stability, we optimize the framework’s hyperparameters with the Tree-structured Parzen Estimator algorithm. The proposed framework avoids the lack of learning multiple periodicity. Compared with benchmarks, the proposed framework reduces the mean absolute error and mean square error in 10-step forecasting by, at most, 90.36% and 97.56%, respectively.
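
The folding step described in this abstract (embedding intra-period and inter-period variations of a one-dimensional series into a two-dimensional tensor) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fold_by_period`, the zero-padding of the tail, the choice to put one period per row, and the sample values are all assumptions.

```python
def fold_by_period(series, period):
    """Fold a 1-D sequence into rows of length `period`.

    Assumed layout: each row holds one full period (intra-period variation
    runs along a row), and each column holds the same phase across
    successive periods (inter-period variation runs down a column).
    The tail is zero-padded so the last row is complete.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    padded = list(series)
    remainder = len(padded) % period
    if remainder:
        padded.extend([0.0] * (period - remainder))
    return [padded[i:i + period] for i in range(0, len(padded), period)]


# Hypothetical tidal current speeds with an assumed period of 4 samples.
speeds = [0.1, 0.5, 0.9, 0.5, 0.2, 0.6, 1.0, 0.6, 0.1, 0.4]
grid = fold_by_period(speeds, period=4)
```

Once the series is in this grid form, a two-dimensional convolution kernel sees neighbouring phases and neighbouring periods at the same time, which is the property the abstract relies on.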

[LG-186] AM-SAM: Automated Prompting and Mask Calibration for Segment Anything Model

Link: https://arxiv.org/abs/2410.09714
Authors: Yuchen Li, Li Zhang, Youwei Liang, Pengtao Xie
Keywords: Segment Anything Model, gained significant recognition, mask decoder feature, mask decoder, gained significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Segment Anything Model (SAM) has gained significant recognition in the field of semantic segmentation due to its versatile capabilities and impressive performance. Despite its success, SAM faces two primary limitations: (1) it relies heavily on meticulous human-provided prompts like key points, bounding boxes or text messages, which is labor-intensive; (2) the mask decoder’s feature representation is sometimes inaccurate, as it solely employs dot product operations at the end of the mask decoder, which inadequately captures the necessary correlations for precise segmentation. Current solutions to these problems, such as fine-tuning SAM, often require retraining a large number of parameters, which requires a huge amount of time and computing resources. To address these limitations, we propose an automated prompting and mask calibration method called AM-SAM based on a bi-level optimization framework. Our approach automatically generates prompts for an input image, eliminating the need for human involvement with good performance in early training epochs, achieving faster convergence. Additionally, we freeze the main part of SAM, and modify the mask decoder with Low-Rank Adaptation (LoRA), enhancing the mask decoder’s feature representation by incorporating advanced techniques that go beyond simple dot product operations to more accurately capture and utilize feature correlations. Our experimental results demonstrate that AM-SAM achieves significantly accurate segmentation, matching or exceeding the effectiveness of human-generated and default prompts. Notably, on the body segmentation dataset, our method yields a 5% higher dice score with a 4-example few-shot training set compared to the SOTA method, underscoring its superiority in semantic segmentation tasks.

[LG-187] Control the GNN: Utilizing Neural Controller with Lyapunov Stability for Test-Time Feature Reconstruction

Link: https://arxiv.org/abs/2410.09708
Authors: Jielong Yang, Rui Ding, Feng Ji, Hongbin Wang, Linbo Xie
Keywords: testing sample distributions, graph neural networks, sample distributions, susceptible to discrepancies, discrepancies between training
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The performance of graph neural networks (GNNs) is susceptible to discrepancies between training and testing sample distributions. Prior studies have attempted to enhance GNN performance by reconstructing node features during the testing phase without modifying the model parameters. However, these approaches lack theoretical analysis of the proximity between predictions and ground truth at test time. In this paper, we propose a novel node feature reconstruction method grounded in Lyapunov stability theory. Specifically, we model the GNN as a control system during the testing phase, considering node features as control variables. A neural controller that adheres to the Lyapunov stability criterion is then employed to reconstruct these node features, ensuring that the predictions progressively approach the ground truth at test time. We validate the effectiveness of our approach through extensive experiments across multiple datasets, demonstrating significant performance improvements.

[LG-188] EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation

Link: https://arxiv.org/abs/2410.09704
Authors: Milos Vukadinovic, Xiu Tang, Neal Yuan, Paul Cheng, Debiao Li, Susan Cheng, Bryan He, David Ouyang
Keywords: cardiac imaging modality, capturing ultrasound video, ultrasound video data, imaging modality, capturing ultrasound
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 3 tables, 3 figures

Abstract:Echocardiography is the most widely used cardiac imaging modality, capturing ultrasound video data to assess cardiac structure and function. Artificial intelligence (AI) in echocardiography has the potential to streamline manual tasks and improve reproducibility and precision. However, most echocardiography AI models are single-view, single-task systems that do not synthesize complementary information from multiple views captured during a full exam, and thus lead to limited performance and scope of applications. To address this problem, we introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs. EchoPrime uses contrastive learning to train a unified embedding model for all standard views in a comprehensive echocardiogram study with representation of both rare and common diseases and diagnoses. EchoPrime then utilizes view-classification and a view-informed anatomic attention model to weight video-specific interpretations that accurately map the relationship between echocardiographic views and anatomical structures. With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study and performs holistic comprehensive clinical echocardiography interpretation. In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function, surpassing the performance of both task-specific approaches and prior foundation models. Following rigorous clinical evaluation, EchoPrime can assist physicians in the automated preliminary assessment of comprehensive echocardiography.

[LG-189] Scalable Weibull Graph Attention Autoencoder for Modeling Document Networks

Link: https://arxiv.org/abs/2410.09696
Authors: Chaojie Wang, Xinyang Liu, Dongsheng Wang, Hao Zhang, Bo Chen, Mingyuan Zhou
Keywords: generating graph-structured data, variational graph autoencoders, skewed latent node, graph-structured data, discrete observations
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Submitted to T-PAMI

Abstract:Although existing variational graph autoencoders (VGAEs) have been widely used for modeling and generating graph-structured data, most of them are still not flexible enough to approximate the sparse and skewed latent node representations, especially those of document relational networks (DRNs) with discrete observations. To analyze a collection of interconnected documents, a typical branch of Bayesian models, specifically relational topic models (RTMs), has proven its efficacy in describing both link structures and document contents of DRNs, which motivates us to incorporate RTMs with existing VGAEs to alleviate their potential issues when modeling the generation of DRNs. In this paper, moving beyond the sophisticated approximate assumptions of traditional RTMs, we develop a graph Poisson factor analysis (GPFA), which provides analytic conditional posteriors to improve the inference accuracy, and extend GPFA to a multi-stochastic-layer version named graph Poisson gamma belief network (GPGBN) to capture the hierarchical document relationships at multiple semantic levels. Then, taking GPGBN as the decoder, we combine it with various Weibull-based graph inference networks, resulting in two variants of Weibull graph auto-encoder (WGAE), equipped with model inference algorithms. Experimental results demonstrate that our models can extract high-quality hierarchical latent document representations and achieve promising performance on various graph analytic tasks.

[LG-190] Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Link: https://arxiv.org/abs/2410.09695
Authors: Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying
Keywords: ICL, learn OOD, OOD, learn OOD task, learn OOD mathematical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint, under review

Abstract:In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL’s well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL’s performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the low-test-error preference of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.

[LG-191] ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

Link: https://arxiv.org/abs/2410.09692
Authors: Hai Huang, Randall Balestriero
Keywords: Large Language Model, Large Language, Language Model, butter of Large, LoRA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning, a setting that employs a limited amount of data and training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes but fails to converge to a reliable regularizer for short training episodes. Second, LoRA’s initialization of $B$ at $0$ creates a slow training dynamic between $A$ and $B$. That dynamic is also exacerbated by Dropout, which further slows the escape from $0$ for $B$ and is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates “short-sighted” interactions between the LoRA modules of different layers. Motivated by principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free LoRA with Adaptive Learning rate, coined ALLoRA. By scaling the per sample and per parameter gradients with a coefficient inversely proportional to the parameters’ $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Empirical results show that ALLoRA admits better accuracy than LoRA on various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show our solution is optimal within a family of weight-dependent / output-dependent approaches on various LLMs including the latest Llama3.
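
The gradient scaling mentioned at the end of this abstract can be illustrated with a small sketch. The abstract only says the coefficient is inversely proportional to the parameters' $\ell_2$ norm; everything else here is an assumption made for illustration: the per-row granularity, the additive `floor` term that keeps the coefficient finite when a row is all zeros, and the name `scale_gradient`. It is not the paper's exact rule.

```python
import math


def scale_gradient(grad_rows, param_rows, floor=1.0):
    """Scale each gradient row by 1 / (||param_row||_2 + floor).

    Assumptions (not taken from the paper): the norm is computed per row,
    and `floor` is a hypothetical positive constant that bounds the
    coefficient when the parameters are still at their zero initialization.
    """
    scaled = []
    for g_row, p_row in zip(grad_rows, param_rows):
        norm = math.sqrt(sum(p * p for p in p_row))
        coeff = 1.0 / (norm + floor)
        scaled.append([coeff * g for g in g_row])
    return scaled


# Hypothetical values: the second row is still at its zero initialization.
params = [[3.0, 4.0], [0.0, 0.0]]
grads = [[1.0, 1.0], [1.0, 1.0]]
scaled = scale_gradient(grads, params)
```

With these values the zero-norm row keeps its full gradient while the row with norm 5 is damped by a factor of 6, which matches the intuition in the abstract that small-norm parameters (such as $B$ at initialization) should move faster.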

[LG-192] MoIN: Mixture of Introvert Experts to Upcycle an LLM

Link: https://arxiv.org/abs/2410.09687
Authors: Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash
Keywords: existing large language, large language model, existing large, large language, prohibitive requirements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full-model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on the top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them “introvert” experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, the inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
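
The routing step in this abstract (send each query to its single most relevant expert) might look like the sketch below. The abstract does not say how relevance is measured, so cosine similarity against one centroid per expert is an assumption, as are the names `cosine` and `route`, the expert labels, and the two-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query_embedding, expert_centroids):
    """Return the name of the single expert whose centroid is most similar.

    Exactly one expert is chosen per query, mirroring the "introvert"
    property described in the abstract: experts never cooperate on a query.
    """
    return max(
        expert_centroids,
        key=lambda name: cosine(query_embedding, expert_centroids[name]),
    )


# Hypothetical centroids, one per data group used to train an adapter.
centroids = {"code": [1.0, 0.0], "medicine": [0.0, 1.0]}
chosen = route([0.9, 0.2], centroids)
```

Because each query touches only one adapter, the adapters can sit on different GPUs and the router only needs to forward the request to the device holding the chosen one.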

[LG-193] LoRD: Adapting Differentiable Driving Policies to Distribution Shifts

Link: https://arxiv.org/abs/2410.09681
Authors: Christopher Diehl, Peter Karkus, Shushant Veer, Marco Pavone, Torsten Bertram
Keywords: Distribution shifts, self-driving vehicles, shifts between operational, operational domains, domains can severely
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach improves forgetting by up to 23.33% and the closed-loop OOD driving score by 8.83% in comparison to standard fine-tuning.

[LG-194] Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

Link: https://arxiv.org/abs/2410.09678
Authors: Yunwei Ren, Jason D. Lee
Keywords: Gaussian single-index models, Ben Arous, stochastic gradient descent, lowest degree, online stochastic gradient
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:The information exponent (Ben Arous et al. [2021]), which is equivalent to the lowest degree in the Hermite expansion of the link function for Gaussian single-index models, has played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of form $f_*(\mathbf{x}) = \sum_{k=1}^P \phi(\mathbf{v}_k^* \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions $\{\mathbf{v}_k^*\}_{k=1}^P$ are orthonormal, and only the second and $2L$-th Hermite coefficients of the link function $\phi$ can be nonzero. Based on the theory of information exponent, when the lowest degree is $2L$, recovering the directions requires $d^{2L-1}\,\mathrm{poly}(P)$ samples, and when the lowest degree is $2$, only the relevant subspace (not the exact directions) can be recovered due to the rotational invariance of the second-order terms. In contrast, we show that by considering both second- and higher-order terms, we can first learn the relevant space via the second-order terms, and then the exact directions using the higher-order terms, and the overall sample and complexity of online SGD is $d\,\mathrm{poly}(P)$.

[LG-195] Uncovering Attacks and Defenses in Secure Aggregation for Federated Deep Learning

Link: https://arxiv.org/abs/2410.09676
Authors: Yiwei Zhang, Rouzbeh Behnia, Attila A. Yavuz, Reza Ebrahimi, Elisa Bertino
Keywords: Federated learning enables, preserving data locality, Federated learning, transfer user data, Secure aggregation protocols
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Federated learning enables the collaborative learning of a global model on diverse data, preserving data locality and eliminating the need to transfer user data to a central server. However, data privacy remains vulnerable, as attacks can target user training data by exploiting the updates sent by users during each learning iteration. Secure aggregation protocols are designed to mask/encrypt user updates and enable a central server to aggregate the masked information. MicroSecAgg (PoPETS 2024) proposes a single server secure aggregation protocol that aims to mitigate the high communication complexity of the existing approaches by enabling a one-time setup of the secret to be re-used in multiple training iterations. In this paper, we identify a security flaw in the MicroSecAgg that undermines its privacy guarantees. We detail the security flaw and our attack, demonstrating how an adversary can exploit predictable masking values to compromise user privacy. Our findings highlight the critical need for enhanced security measures in secure aggregation protocols, particularly the implementation of dynamic and unpredictable masking strategies. We propose potential countermeasures to mitigate these vulnerabilities and ensure robust privacy protection in the secure aggregation frameworks.

[LG-196] EquiJump: Protein Dynamics Simulation via SO(3)-Equivariant Stochastic Interpolants

Link: https://arxiv.org/abs/2410.09667
Authors: Allan dos Santos Costa, Ilan Mitnikov, Franco Pellegrini, Ameya Daigavane, Mario Geiger, Zhonglin Cao, Karsten Kreis, Tess Smidt, Emine Kucukbenli, Joseph Jacobson
Keywords: Mapping the conformational, functional mechanisms, crucial for elucidating, elucidating their functional, Mapping
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Comments:

Abstract:Mapping the conformational dynamics of proteins is crucial for elucidating their functional mechanisms. While Molecular Dynamics (MD) simulation enables detailed time evolution of protein motion, its computational toll hinders its use in practice. To address this challenge, multiple deep learning models for reproducing and accelerating MD have been proposed drawing on transport-based generative methods. However, existing work focuses on generation through transport of samples from prior distributions, that can often be distant from the data manifold. The recently proposed framework of stochastic interpolants, instead, enables transport between arbitrary distribution endpoints. Building upon this work, we introduce EquiJump, a transferable SO(3)-equivariant model that bridges all-atom protein dynamics simulation time steps directly. Our approach unifies diverse sampling methods and is benchmarked against existing models on trajectory data of fast folding proteins. EquiJump achieves state-of-the-art results on dynamics simulation with a transferable model on all of the fast folding proteins.

[LG-197] Interpolated-MLPs: Controllable Inductive Bias ICML

Link: https://arxiv.org/abs/2410.09655
Authors: Sean Wu, Jordan Hong, Keyu Bai, Gregor Bachmann
Keywords: inductive bias, Multi-Layer Perceptrons, low-compute levels compared, weak inductive bias, inductive
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 3 figures, ICML HiLD 2024 Workshop: 2nd Workshop on High-dimensional Learning Dynamics

Abstract:Due to their weak inductive bias, Multi-Layer Perceptrons (MLPs) have subpar performance at low-compute levels compared to standard architectures such as convolution-based networks (CNN). Recent work, however, has shown that the performance gap drastically reduces as the amount of compute is increased without changing the amount of inductive bias. In this work, we study the converse: in the low-compute regime, how does the incremental increase of inductive bias affect performance? To quantify inductive bias, we propose a “soft MLP” approach, which we coin Interpolated MLP (I-MLP). We control the amount of inductive bias in the standard MLP by introducing a novel algorithm based on interpolation between fixed weights from a prior model with high inductive bias. We showcase our method using various prior models, including CNNs and the MLP-Mixer architecture. This interpolation scheme allows fractional control of inductive bias, which may be attractive when full inductive bias is not desired (e.g. in the mid-compute regime). We find experimentally that for Vision Tasks in the low-compute regime, there is a continuous and two-sided logarithmic relationship between inductive bias and performance when using CNN and MLP-Mixer prior models.
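
The fractional control of inductive bias described in this abstract can be pictured as a convex combination of two weight matrices. The sketch below assumes plain linear interpolation with a coefficient `alpha` in [0, 1]; the function name `interpolate_weights`, the matrices, and the exact form of the combination are illustrative assumptions, since the abstract does not spell out the algorithm.

```python
def interpolate_weights(w_mlp, w_prior, alpha):
    """Blend unconstrained MLP weights with fixed weights from a prior model.

    alpha = 0.0 returns the plain MLP weights (no added inductive bias);
    alpha = 1.0 returns the prior model's weights (full inductive bias).
    Linear interpolation is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * m + alpha * p for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(w_mlp, w_prior)
    ]


# Hypothetical 1x2 weight matrices.
w_mlp = [[2.0, 4.0]]
w_prior = [[0.0, 8.0]]  # e.g. a CNN layer rewritten as a dense matrix
blended = interpolate_weights(w_mlp, w_prior, alpha=0.25)
```

Sweeping `alpha` from 0 to 1 then gives a one-dimensional axis along which performance can be measured against the amount of inductive bias.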

[LG-198] Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP2024

Link: https://arxiv.org/abs/2410.09652
Authors: Ankita Sinha, Wendi Cui, Kamalika Das, Jiaxin Zhang
Keywords: Large language models, demonstrated remarkable capabilities, Large language, prioritized performance metrics, historically prioritized performance
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: EMNLP 2024 Industry Track

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities; however, the optimization of their prompts has historically prioritized performance metrics at the expense of crucial safety and security considerations. To overcome this shortcoming, we introduce “Survival of the Safest” (SoS), an innovative multi-objective prompt optimization framework that enhances both performance and security in LLMs simultaneously. SoS utilizes an interleaved multi-objective evolution strategy, integrating semantic, feedback, and crossover mutations to effectively traverse the prompt landscape. Differing from the computationally demanding Pareto front methods, SoS provides a scalable solution that expedites optimization in complex, high-dimensional discrete search spaces while keeping computational demands low. Our approach accommodates flexible weighting of objectives and generates a pool of optimized candidates, empowering users to select prompts that optimally meet their specific performance and security needs. Experimental evaluations across diverse benchmark datasets affirm SoS’s efficacy in delivering high performance and notably enhancing safety and security compared to single-objective methods. This advancement marks a significant stride towards the deployment of LLM systems that are both high-performing and secure across varied industrial applications.

[LG-199] Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings EMNLP2024

Link: https://arxiv.org/abs/2410.09649
Authors: Mojtaba Yousefi, Jack Collins
Keywords: Pattern Recognition, Rich Sutton, proposed by Rich, Computer Vision, bitter lesson
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NLP4Science Workshop, EMNLP 2024

Abstract:This study examines the alignment of Conference on Computer Vision and Pattern Recognition (CVPR) research with the principles of the “bitter lesson” proposed by Rich Sutton. We analyze two decades of CVPR abstracts and titles using large language models (LLMs) to assess the field’s embracement of these principles. Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development. This work contributes to the ongoing dialogue about the most effective strategies for advancing machine learning and computer vision, offering insights that may guide future research priorities and methodologies in the field.

[LG-200] Multimodal Physical Activity Forecasting in Free-Living Clinical Settings: Hunting Opportunities for Just-in-Time Interventions

Link: https://arxiv.org/abs/2410.09643
Authors: Abdullah Mamun, Krista S. Leonard, Megan E. Petrov, Matthew P. Buman, Hassan Ghasemzadeh
Keywords: lifestyle intervention system, Multimodal LSTM, patient activity behavior, LSTM, called MoveSense
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 9 pages, 5 figures

Abstract:Objective: This research aims to develop a lifestyle intervention system, called MoveSense, that forecasts a patient’s activity behavior to allow for early and personalized interventions in real-world clinical environments. Methods: We conducted two clinical studies involving 58 prediabetic veterans and 60 patients with obstructive sleep apnea to gather multimodal behavioral data using wearable devices. We develop multimodal long short-term memory (LSTM) network models, which are capable of forecasting the number of step counts of a patient up to 24 hours in advance by examining data from activity and engagement modalities. Furthermore, we design goal-based forecasting models to predict whether a person’s next-day steps will be over a certain threshold. Results: Multimodal LSTM with early fusion achieves 33% and 37% lower mean absolute errors than linear regression and ARIMA respectively on the prediabetes dataset. LSTM also outperforms linear regression and ARIMA with a margin of 13% and 32% on the sleep dataset. Multimodal forecasting models also perform with 72% and 79% accuracy on the prediabetes dataset and sleep dataset respectively on goal-based forecasting. Conclusion: Our experiments conclude that multimodal LSTM models with early fusion are better than multimodal LSTM with late fusion and unimodal LSTM models and also than ARIMA and linear regression models. Significance: We address an important and challenging task of time-series forecasting in uncontrolled environments. Effective forecasting of a person’s physical activity can aid in designing adaptive behavioral interventions to keep the user engaged and adherent to a prescribed routine.

[LG-201] Provable Acceleration of Nesterov’s Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks NEURIPS2024

Link: https://arxiv.org/abs/2410.09640
Authors: Zhenghao Xu,Yuqing Wang,Tuo Zhao,Rachel Ward,Molei Tao
Keywords: nonconvex optimization problem, canonical nonconvex optimization, mathbf, rectangular matrix factorization, optimization problem
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Accepted by 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that gradient descent (GD) can find a pair of $\epsilon$-optimal solutions $\mathbf{X}_T\in\mathbb{R}^{m\times d}$ and $\mathbf{Y}_T\in\mathbb{R}^{n\times d}$, where $d\geq r$, satisfying $\lVert\mathbf{X}_T\mathbf{Y}_T^\top-\mathbf{A}\rVert_\mathrm{F}\leq\epsilon\lVert\mathbf{A}\rVert_\mathrm{F}$ in $T=O(\kappa^2\log\frac{1}{\epsilon})$ iterations with high probability, where $\kappa$ denotes the condition number of $\mathbf{A}$. Furthermore, we prove that Nesterov’s accelerated gradient (NAG) attains an iteration complexity of $O(\kappa\log\frac{1}{\epsilon})$, which is the best-known bound of first-order methods for rectangular matrix factorization. Different from small balanced random initialization in the existing literature, we adopt an unbalanced initialization, where $\mathbf{X}_0$ is large and $\mathbf{Y}_0$ is $0$. Moreover, our initialization and analysis can be further extended to linear neural networks, where we prove that NAG can also attain an accelerated linear convergence rate. In particular, we only require the width of the network to be greater than or equal to the rank of the output label matrix. In contrast, previous results achieving the same rate require excessive widths that additionally depend on the condition number and the rank of the input data matrix.
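
To make the GD versus NAG comparison in this abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It minimizes 0.5 * ||X Y^T - A||_F^2 from an unbalanced start (large X_0, Y_0 = 0). Several choices are assumptions made only for illustration: the problem sizes, the initialization X_0 = c * A @ Phi with Gaussian Phi, and the step size and momentum derived from the singular values of X_0. The abstract does not state any of these.

```python
import numpy as np

def make_problem(m=20, n=15, r=3, d=40, c=10.0, seed=0):
    """Build a rank-r target A (condition number 10) and an unbalanced start."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    s = np.geomspace(1.0, 0.1, r)              # singular values of A
    A = (U * s) @ V.T
    X0 = c * A @ rng.standard_normal((n, d))   # hypothetical "large" X_0
    Y0 = np.zeros((n, d))                      # Y_0 = 0, as in the abstract
    return A, X0, Y0, r

def rel_err(X, Y, A):
    return np.linalg.norm(X @ Y.T - A) / np.linalg.norm(A)

def run(A, X0, Y0, eta, beta, steps):
    """Nesterov-style iteration; beta = 0 reduces it to plain gradient descent."""
    X, Y = X0.copy(), Y0.copy()
    Xe, Ye = X.copy(), Y.copy()                # extrapolated (look-ahead) points
    for _ in range(steps):
        R = Xe @ Ye.T - A                      # residual at the look-ahead point
        X_new = Xe - eta * (R @ Ye)            # gradient step in X
        Y_new = Ye - eta * (R.T @ Xe)          # gradient step in Y
        Xe = X_new + beta * (X_new - X)        # momentum extrapolation
        Ye = Y_new + beta * (Y_new - Y)
        X, Y = X_new, Y_new
    return rel_err(X, Y, A)

A, X0, Y0, r = make_problem()
sv = np.linalg.svd(X0, compute_uv=False)
L, mu = sv[0] ** 2, sv[r - 1] ** 2             # curvature range seen by Y (assumed tuning)
eta = 1.0 / L
beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)

err_gd = run(A, X0, Y0, eta, 0.0, 400)
err_nag = run(A, X0, Y0, eta, beta, 400)
print(f"GD  relative error: {err_gd:.2e}")
print(f"NAG relative error: {err_nag:.2e}")
```

With the same step size and iteration budget, the momentum variant should reach a much smaller relative error, in line with the qualitative gap between the κ² and κ rates described above.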

[LG-202] ReLU’s Revival: On the Entropic Overload in Normalization-Free Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09637
Authors: Nandan Kumar Jha,Brandon Reagen
Keywords: ensuring smooth optimization, modern large language, large language models, smooth optimization, critical component
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop on Attributing Model Behavior at Scale

Abstract:LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend – ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU’s geometrical properties – specialization in input space and intra-class selectivity – lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.

[LG-203] Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health

Link: https://arxiv.org/abs/2410.09635
Authors: Abdullah Mamun,Lawrence D. Devoe,Mark I. Evans,David W. Britt,Judith Klein-Seetharaman,Hassan Ghasemzadeh
Keywords: Early detection, risk enables interventions, adverse labor outcomes, mitigate adverse labor, Explaining Neonatal Health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures

Abstract:Early detection of intrapartum risk enables interventions to potentially prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, there is no accurate automated system to predict such events to assist with clinical decision-making. To fill this gap, we propose “Artificial Intelligence (AI) for Modeling and Explaining Neonatal Health” (AIMEN), a deep learning framework that not only predicts adverse labor outcomes from maternal, fetal, obstetrical, and intrapartum risk factors but also provides the model’s reasoning behind the predictions made. The latter can provide insights into what modifications in the input variables of the model could have changed the predicted outcome. We address the challenges of imbalance and small datasets by synthesizing additional training data using Adaptive Synthetic Sampling (ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN uses an ensemble of fully-connected neural networks as the backbone for its classification with the data augmentation supported by either ADASYN or CTGAN. AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in classification. AIMEN can predict a high risk for adverse labor outcomes with an average F1 score of 0.784. It also provides counterfactual explanations that can be achieved by changing 2 to 3 attributes on average. Resources available: this https URL.

[LG-204] Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09629
Authors: Jiaxin Zhang,Wendi Cui,Yiran Huang,Kamalika Das,Sricharan Kumar
Keywords: Large language models, Large language, Retrieval Augmented Generation, proficient in capturing, knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: EMNLP 2024 main conference long paper

Abstract:Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

[LG-205] An Ensemble Scheme for Proactive Dominant Data Migration of Pervasive Tasks at the Edge

Link: https://arxiv.org/abs/2410.09621
Authors: Georgios Boulougaris,Kostas Kolomvatsos
Keywords: Internet of Things, Edge Computing, autonomous edge nodes, significant focus, research community
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:The research community has recently focused on the intelligent management of data at the confluence of the Internet of Things (IoT) and Edge Computing (EC). In this manuscript, we propose a scheme, to be implemented by autonomous edge nodes, for identifying the appropriate data to be migrated to particular locations within the infrastructure, thereby facilitating the effective processing of requests. Our objective is to equip nodes with the capability to comprehend the access patterns relating to offloaded data-driven tasks and to predict which data ought to be returned to the original nodes associated with those tasks. It is evident that these tasks depend on the processing of data that is absent from the original hosting nodes, thereby underscoring the essential data assets that necessitate access. To infer these data intervals, we utilize an ensemble approach that integrates a statistically oriented model and a machine learning framework. As a result, we are able to identify the dominant data assets in addition to detecting the density of the requests. We provide a detailed analysis of the suggested method by presenting the related formulations, and we assess and compare it with models found in the relevant literature.

[LG-206] SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs

Link: https://arxiv.org/abs/2410.09615
Authors: Mohammad Mozaffari,Maryam Mehri Dehnavi
Keywords: revolutionized natural language, natural language understanding, high memory consumption, slow inference times, inference times due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks but suffer from high memory consumption and slow inference times due to their large parameter sizes. Traditional model compression techniques, such as quantization and pruning, mitigate these issues but often require retraining to maintain accuracy, which is computationally expensive. This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation. SLiM eliminates the need for costly retraining by combining a symmetric quantization method (SLiM-Quant) with a saliency-based low-rank approximation. Our method reduces quantization error while leveraging sparse representations compatible with accelerated hardware architectures. Additionally, we propose a parameter-efficient fine-tuning recipe that significantly reduces overhead compared to conventional quantization-aware training. SLiM achieves up to a 5.4% improvement in model accuracy for sparsity patterns like 2:4, and the fine-tuning step further enhances accuracy by up to 5.8%, demonstrating state-of-the-art performance. This work provides a pathway for efficiently deploying large models in memory-constrained environments without compromising accuracy.
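
For readers unfamiliar with the "2:4" sparsity pattern mentioned in this abstract, the sketch below shows what the pattern means: in every group of four consecutive weights, at most two are nonzero. It uses plain magnitude-based selection, which is an assumption chosen for simplicity. It is not SLiM's saliency-based method, and the function name `prune_2_of_4` is hypothetical.

```python
import numpy as np

def prune_2_of_4(W):
    """Zero the two smallest-magnitude entries in each group of four weights."""
    assert W.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = W.reshape(-1, 4)                          # consecutive groups of 4
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # indices of the 2 largest
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(W.shape)

W = np.array([[1.0, -5.0, 3.0, 0.5, 2.0, 2.5, -0.1, 4.0]])
W_sparse = prune_2_of_4(W)
print(W_sparse)
```

In this example the first group keeps -5.0 and 3.0 and the second keeps 2.5 and 4.0. Exactly half of the entries survive, which is the regular structure that sparse-aware hardware can exploit.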

[LG-207] Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis NEURIPS2024

Link: https://arxiv.org/abs/2410.09605
Authors: Hongru Yang,Bhavya Kailkhura,Zhangyang Wang,Yingbin Liang
Keywords: large language models, important to explain, explain the impressive, impressive capabilities, capabilities behind large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024

Abstract:Understanding the training dynamics of transformers is important to explain the impressive capabilities behind large language models. In this work, we study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words. In the literature of studying training dynamics of transformers, several simplifications are commonly adopted such as weight reparameterization, attention linearization, special initialization, and lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework of analyzing such dynamics via a coupled dynamical system. We establish near minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training process into two phases. In Phase 1, the linear MLP quickly aligns with the two target signals for correct classification, whereas the softmax attention remains almost unchanged. In Phase 2, the attention matrices and the MLP evolve jointly to enlarge the classification margin and reduce the loss to a near minimum value. Technically, we prove a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss. We also conduct experiments to verify our theoretical results.

[LG-208] The Fragility of Fairness: Causal Sensitivity Analysis for Fair Machine Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.09600
Authors: Jake Fawkes,Nic Fishman,Mel Andrews,Zachary C. Lipton
Keywords: fair Real-world data, machine learning literature, fair machine learning, Real-world data, fair Real-world
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Published at NeurIPS 2024 in the Datasets and Benchmarks Track

Abstract:Fairness metrics are a core tool in the fair machine learning literature (FairML), used to determine that ML models are, in some sense, “fair”. Real-world data, however, are typically plagued by various measurement biases and other violated assumptions, which can render fairness assessments meaningless. We adapt tools from causal sensitivity analysis to the FairML context, providing a general framework which (1) accommodates effectively any combination of fairness metric and bias that can be posed in the “oblivious setting”; (2) allows researchers to investigate combinations of biases, resulting in non-linear sensitivity; and (3) enables flexible encoding of domain-specific constraints and assumptions. Employing this framework, we analyze the sensitivity of the most common parity metrics under 3 varieties of classifier across 14 canonical fairness datasets. Our analysis reveals the striking fragility of fairness assessments to even minor dataset biases. We show that causal sensitivity analysis provides a powerful and necessary toolkit for gauging the informativeness of parity metric evaluations. Our repository is available here: this https URL.

[LG-209] A Complete Characterization of Learnability for Stochastic Noisy Bandits

Link: https://arxiv.org/abs/2410.09597
Authors: Steve Hanneke,Kun Wang
Keywords: unknown reward function, model, study the stochastic, mathcal, model class
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We study the stochastic noisy bandit problem with an unknown reward function $f^*$ in a known function class $\mathcal{F}$. Formally, a model $M$ maps arms $\pi$ to a probability distribution $M(\pi)$ of reward. A model class $\mathcal{M}$ is a collection of models. For each model $M$, define its mean reward function $f^M(\pi)=\mathbb{E}_{r\sim M(\pi)}[r]$. In the bandit learning problem, we proceed in rounds, pulling one arm $\pi$ each round and observing a reward sampled from $M(\pi)$. With knowledge of $\mathcal{M}$, supposing that the true model $M\in\mathcal{M}$, the objective is to identify an arm $\hat{\pi}$ of near-maximal mean reward $f^M(\hat{\pi})$ with high probability in a bounded number of rounds. If this is possible, then the model class is said to be learnable. Importantly, a result of \cite{hanneke2023bandit} shows there exist model classes for which learnability is undecidable. However, the model class they consider features deterministic rewards, and they raise the question of whether learnability is decidable for classes containing sufficiently noisy models. For the first time, we answer this question in the positive by giving a complete characterization of learnability for model classes with arbitrary noise. In addition to that, we also describe the full spectrum of possible optimal query complexities. Further, we prove adaptivity is sometimes necessary to achieve the optimal query complexity. Last, we revisit an important complexity measure for interactive decision making, the Decision-Estimation-Coefficient \citep{foster2021statistical,foster2023tight}, and propose a new variant of the DEC which also characterizes learnability in this setting.

[LG-210] Mastering AI: Big Data Deep Learning and the Evolution of Large Language Models – AutoML from Basics to State-of-the-Art Techniques

Link: https://arxiv.org/abs/2410.09596
Authors: Pohsun Feng,Ziqian Bi,Yizhu Wen,Benji Peng,Junyu Liu,Caitlyn Heqi Yin,Tianyang Wang,Keyu Chen,Sen Zhang,Ming Li,Jiawei Xu,Ming Liu,Xuanhe Pan,Jinlang Wang,Qian Niu
Keywords: covering fundamental principles, Automated Machine Learning, guide to Automated, practical implementations, Automated Machine
Categories: Machine Learning (cs.LG)
Comments: This book contains 170 pages and 5 figures

Abstract:This manuscript presents a comprehensive guide to Automated Machine Learning (AutoML), covering fundamental principles, practical implementations, and future trends. The paper is structured to assist both beginners and experienced practitioners, with detailed discussions on popular AutoML tools such as TPOT, AutoGluon, and Auto-Keras. It also addresses emerging topics like Neural Architecture Search (NAS) and AutoML’s applications in deep learning. We believe this work will contribute to ongoing research and development in the field of AI and machine learning.

[LG-211] Bayesian Sheaf Neural Networks

Link: https://arxiv.org/abs/2410.09590
Authors: Patrick Gillespie,Vasileios Maroulas,Ioannis Schizas
Keywords: convolution operation defined, Equipping graph neural, sheaf offers advantages, Bayesian sheaf neural, learning expressive representations
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 19 pages, 2 figures

Abstract:Equipping graph neural networks with a convolution operation defined in terms of a cellular sheaf offers advantages for learning expressive representations of heterophilic graph data. The most flexible approach to constructing the sheaf is to learn it as part of the network as a function of the node features. However, this leaves the network potentially overly sensitive to the learned sheaf. As a counter-measure, we propose a variational approach to learning cellular sheaves within sheaf neural networks, yielding an architecture we refer to as a Bayesian sheaf neural network. As part of this work, we define a novel family of reparameterizable probability distributions on the rotation group SO(n) using the Cayley transform. We evaluate the Bayesian sheaf neural network on several graph datasets, and show that our Bayesian sheaf models outperform deterministic sheaf models when training data is limited, and are less sensitive to the choice of hyperparameters.
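
The Cayley transform mentioned in this abstract maps a skew-symmetric matrix A to the rotation Q = (I - A)(I + A)^{-1}. The sketch below illustrates how it can turn parameter-free Gaussian noise into a random element of SO(n), so that gradients can flow through the sampling step. The Gaussian-then-skew-symmetrize construction and the names `cayley` and `sample_rotation` are assumptions for illustration only. The paper's actual distribution family may differ.

```python
import numpy as np

def cayley(A):
    """Cayley transform of a skew-symmetric matrix: Q = (I - A)(I + A)^{-1}."""
    I = np.eye(A.shape[0])
    # I + A is always invertible because A has purely imaginary eigenvalues.
    return (I - A) @ np.linalg.inv(I + A)

def sample_rotation(mean, log_std, rng):
    """Reparameterized sample: deterministic map applied to standard noise."""
    n = mean.shape[0]
    eps = rng.standard_normal((n, n))      # noise independent of the parameters
    G = mean + np.exp(log_std) * eps       # location-scale reparameterization
    A = (G - G.T) / 2.0                    # project onto skew-symmetric matrices
    return cayley(A)

rng = np.random.default_rng(0)
Q = sample_rotation(np.zeros((3, 3)), log_std=-1.0, rng=rng)
print(np.round(Q.T @ Q, 6))                # approximately the identity
print(np.linalg.det(Q))                    # approximately 1
```

Because `mean` and `log_std` enter only through differentiable operations, the same construction could be written in an autodiff framework to obtain pathwise gradients.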

[LG-212] Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2410.09584
Authors: Guanting Dong,Xiaoshuai Song,Yutao Zhu,Runqi Qiao,Zhicheng Dou,Ji-Rong Wen
Keywords: Retrieval-Augmented Generation, Large Language Models, RAG systems, RAG, effective application
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at this https URL.

[LG-213] Structure of Artificial Neural Networks – Empirical Investigations

Link: https://arxiv.org/abs/2410.09579
Authors: Julian Stier
Keywords: neural architecture, neural architecture search, Deep Learning overtook, neural, deep architectures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: PhD thesis

Abstract:Within one decade, Deep Learning overtook the dominating solution methods of countless problems of artificial intelligence. “Deep” refers to the deep architectures with operations in manifolds of which there are no immediate observations. For these deep architectures some kind of structure is pre-defined -- but what is this structure? With a formal definition for structures of neural networks, neural architecture search problems and solution methods can be formulated under a common framework. Both practical and theoretical questions arise from closing the gap between applied neural architecture search and learning theory. Does structure make a difference or can it be chosen arbitrarily? This work is concerned with deep structures of artificial neural networks and examines automatic construction methods under empirical principles to shed light on the so-called “black-box models”. Our contributions include a formulation of graph-induced neural networks that is used to pose optimisation problems for neural architecture. We analyse structural properties for different neural network objectives such as correctness, robustness or energy consumption and discuss how structure affects them. Selected automation methods for neural architecture optimisation problems are discussed and empirically analysed. With the insights gained from formalising graph-induced neural networks, analysing structural properties and comparing the applicability of neural architecture search methods qualitatively and quantitatively we advance these methods in two ways. First, new predictive models are presented for replacing computationally expensive evaluation schemes, and second, new generative models for informed sampling during neural architecture search are analysed and discussed.
Report number: urn:nbn:de:bvb:739-opus4-14968

[LG-214] Reconstructive Visual Instruction Tuning

Link: https://arxiv.org/abs/2410.09575
Authors: Haochen Wang,Anlin Zheng,Yucheng Zhao,Tiancai Wang,Zheng Ge,Xiangyu Zhang,Zhaoxiang Zhang
Keywords: Large Multimodal Models, Large Multimodal, paper introduces reconstructive, family of Large, visual instruction tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.

[LG-215] GETS: Ensemble Temperature Scaling for Calibration in Graph Neural Networks

Link: https://arxiv.org/abs/2410.09570
Authors: Dingyi Zhuang,Chonghe Jiang,Yunhan Zheng,Shenhao Wang,Jinhua Zhao
Keywords: Neural Networks deliver, Graph Neural Networks, Networks deliver strong, Neural Networks, deliver strong classification
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks deliver strong classification results but often suffer from poor calibration performance, leading to overconfidence or underconfidence. This is particularly problematic in high-stakes applications where accurate uncertainty estimates are essential. Existing post hoc methods, such as temperature scaling, fail to effectively utilize graph structures, while current GNN calibration methods often overlook the potential of leveraging diverse input information and model ensembles jointly. In this paper, we propose Graph Ensemble Temperature Scaling, a novel calibration framework that combines input and model ensemble strategies within a Graph Mixture of Experts architecture [...] SOTA calibration techniques, reducing expected calibration error by 25 percent across 10 GNN benchmark datasets. Additionally, GETS is computationally efficient, scalable, and capable of selecting effective input combinations for improved calibration performance.
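
As background for the post hoc baseline named in this abstract, the following is a minimal sketch of classic temperature scaling: a single scalar T is fitted on held-out logits by minimizing the negative log-likelihood of softmax(logits / T). This is the generic baseline, not GETS itself. The synthetic data, the grid search, and the function names are assumptions made for the example.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, T):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    lp = log_softmax(logits / T)
    return -lp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature on a fixed grid that minimizes validation NLL."""
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: labels follow softmax(true_logits), but the
# "model" reports logits that are 3x too large, so it is overconfident.
rng = np.random.default_rng(0)
true_logits = rng.standard_normal((4000, 3))
probs = np.exp(log_softmax(true_logits))
labels = np.array([rng.choice(3, p=p) for p in probs])
overconfident = 3.0 * true_logits

T_hat = fit_temperature(overconfident, labels)
print("fitted temperature:", T_hat)
print("NLL before:", nll(overconfident, labels, 1.0))
print("NLL after: ", nll(overconfident, labels, T_hat))
```

The fitted temperature should land near the inflation factor of 3. Because dividing by T never changes the argmax, accuracy is unaffected and only the confidence is rescaled. A single global T also ignores the graph structure, which is the limitation the abstract points to.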

[LG-216] Timeseria: an object-oriented time series processing library

Link: https://arxiv.org/abs/2410.09567
Authors: Stefano Alberto Russo,Giuliano Taffonia,Luca Bortolussi
Keywords: series processing library, processing library implemented, object-oriented time series, manipulate time series, time series processing
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Timeseria is an object-oriented time series processing library implemented in Python, which aims at making it easier to manipulate time series data and to build statistical and machine learning models on top of it. Unlike common data analysis frameworks, it builds up from well defined and reusable logical units (objects), which can be easily combined together in order to ensure a high level of consistency. Thanks to this approach, Timeseria can address by design several non-trivial issues often underestimated, such as handling data losses, non-uniform sampling rates, differences between aggregated data and punctual observations, time zones, daylight saving times, and more. Timeseria comes with a comprehensive set of base data structures, common data manipulation operations, and extensible models for data reconstruction, forecasting and anomaly detection. It also integrates a powerful plotting engine capable of handling even millions of data points.

[LG-217] Towards Scalable Semantic Representation for Recommendation

Link: https://arxiv.org/abs/2410.09560
Authors: Taolin Zhang,Junwei Pan,Jinpeng Wang,Yaohua Zha,Tao Dai,Bin Chen,Ruisheng Luo,Xiaoxiang Deng,Yuan Wang,Ming Yue,Jie Jiang,Shu-Tao Xia
Keywords: large language models, developing Semantic IDs, Semantic IDs based, language models, recent advances
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:With recent advances in large language models (LLMs), a growing body of research has developed Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length. Such dimension compression results in inevitable losses in discriminability and dimension robustness of the LLM embeddings, which motivates us to scale up the semantic representation. In this paper, we propose Mixture-of-Codes, which first constructs multiple independent codebooks for LLM representation in the indexing stage, and then utilizes the Semantic Representation along with a fusion module for the downstream recommendation stage. Extensive analysis and experiments demonstrate that our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.

[LG-218] Exploring space efficiency in a tree-based linear model for extreme multi-label classification EMNLP2024

Link: https://arxiv.org/abs/2410.09554
Authors: He-Zhe Lin,Cheng-Hung Liu,Chih-Jen Lin
Keywords: identify relevant subsets, Extreme multi-label classification, Extreme multi-label, aims to identify, numerous labels
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP 2024

Abstract:Extreme multi-label classification (XMC) aims to identify relevant subsets from numerous labels. Among the various approaches for XMC, tree-based linear models are effective due to their superior efficiency and simplicity. However, the space complexity of tree-based methods is not well-studied. Many past works assume that storing the model is not affordable and apply techniques such as pruning to save space, which may lead to performance loss. In this work, we conduct both theoretical and empirical analyses on the space to store a tree model under the assumption of sparse data, a condition frequently met in text data. We find that some features may be unused when training binary classifiers in a tree method, resulting in zero values in the weight vectors. Hence, storing only non-zero elements can greatly save space. Our experimental results indicate that tree models can achieve up to a 95% reduction in storage space compared to the standard one-vs-rest method for multi-label text classification. Our research provides a simple procedure to estimate the size of a tree model before training any classifier in the tree nodes. Then, if the model size is already acceptable, this approach can help avoid modifying the model through weight pruning or other techniques.

[LG-219] TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

Link: https://arxiv.org/abs/2410.09536
Authors: Ge Li,Dong Tian,Hongyi Zhou,Xinkai Jiang,Rudolf Lioutikov,Gerhard Neumann
Keywords (EN): Episodic Reinforcement Learning, Off-Policy Episodic Reinforcement, Episodic Reinforcement, Transformer-based Off-Policy Episodic, work introduces Transformer-based
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract: This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training that is reflected in the empirical results on sophisticated robot learning environments. TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on the model performance.
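
The two ingredients named in this abstract, segmenting a long sequence and computing an n-step return, can be sketched with the textbook definitions. This is only a generic illustration, not the TOP-ERL critic target: the function names `segment` and `n_step_return`, the discount value, and the toy numbers are assumptions.

```python
def segment(sequence, length):
    """Split a sequence into consecutive chunks of at most `length` items."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Textbook n-step return: discounted rewards plus a discounted bootstrap value.

    Computes r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V,
    where n = len(rewards) and V = bootstrap_value.
    """
    g = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical toy episode: 5 per-step rewards, split into segments of 2 steps.
rewards = [1.0, 1.0, 0.0, 2.0, 1.0]
reward_segments = segment(rewards, 2)

# Return for the first segment, bootstrapping from an assumed value estimate
# of 10.0 for the state reached at the end of that segment.
first_target = n_step_return(reward_segments[0], bootstrap_value=10.0, gamma=0.5)
```

In the paper, the value estimates come from a transformer-based critic; here the bootstrap value is just a fixed number.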

[LG-220] Boosting Deductive Reasoning with Step Signals In RLHF

Link: https://arxiv.org/abs/2410.09528
Authors: Jialian Li,Yipin Zhang,Wei Shen,Yuzi Yan,Jian Xie,Dong Yan
Keywords (EN): Large Language Models, Large Language, tackle complex problems, Language Models, complex problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for generating deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in-domain and out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.

[LG-221] HG2P: Hippocampus-inspired High-reward Graph and Model-Free Q-Gradient Penalty for Path Planning and Motion Control

Link: https://arxiv.org/abs/2410.09505
Authors: Haoran Wang,Yaoru Sun,Zeshen Tang
Keywords (EN): hierarchical reinforcement learning, decomposes complex reaching, showing significant promise, Goal-conditioned hierarchical reinforcement, complex reaching tasks
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract: Goal-conditioned hierarchical reinforcement learning (HRL) decomposes complex reaching tasks into a sequence of simple subgoal-conditioned tasks, showing significant promise for addressing long-horizon planning in large-scale environments. This paper bridges the goal-conditioned HRL based on graph-based planning to brain mechanisms, proposing a hippocampus-striatum-like dual-controller hypothesis. Inspired by the brain mechanisms of organisms (i.e., the high-reward preferences observed in hippocampal replay) and instance-based theory, we propose a high-return sampling strategy for constructing memory graphs, improving sample efficiency. Additionally, we derive a model-free lower-level Q-function gradient penalty to resolve the model dependency issues present in prior work, improving the generalization of Lipschitz constraints in applications. Finally, we integrate these two extensions, High-reward Graph and model-free Gradient Penalty (HG2P), into the state-of-the-art framework ACLG, proposing a novel goal-conditioned HRL framework, HG2P+ACLG. Experimentally, the results demonstrate that our method outperforms state-of-the-art goal-conditioned HRL algorithms on a variety of long-horizon navigation tasks and robotic manipulation tasks.

[LG-222] Dying Clusters Is All You Need – Deep Clustering With an Unknown Number of Clusters ICDM

Link: https://arxiv.org/abs/2410.09491
Authors: Collin Leiber,Niklas Strauß,Matthias Schubert,Thomas Seidl
Keywords (EN): Finding meaningful groups, Finding meaningful, meaningful groups, number of clusters, important challenge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Sixth ICDM Workshop on Deep Learning and Clustering

Abstract: Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separately from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components. The code is available at: this https URL

[LG-223] ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Link: https://arxiv.org/abs/2410.09486
Authors: Yarden As,Bhavya Sukhija,Lenart Treven,Carmelo Sferrazza,Stelian Coros,Andreas Krause
Keywords (EN): Reinforcement learning, development of modern, Reinforcement, ActSafe, agents require extensive
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract: Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics, while enforcing pessimism w.r.t. the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on the latest model-based RL advancements and enables safe exploration even in high-dimensional settings such as visual control. We empirically show that ActSafe obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.

[LG-224] Bridging Gaps: Federated Multi-View Clustering in Heterogeneous Hybrid Views

Link: https://arxiv.org/abs/2410.09484
Authors: Xinyue Chen,Yazhou Ren,Jie Xu,Fangfei Lin,Xiaorong Pu,Yang Yang
Categories: Machine Learning (cs.LG)

[LG-225] Distilling Invariant Representations with Dual Augmentation

Link: https://arxiv.org/abs/2410.09474
Authors: Nikolaos Giakoumoglou,Tania Stathaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure, 3 tables. This paper presents preliminary results from a project that we have since discontinued, as our research focus has shifted to new directions

[LG-226] Reinforcement Learning in Hyperbolic Spaces: Models and Experiments

Link: https://arxiv.org/abs/2410.09466
Authors: Vladimir Jaćimović,Zinaid Kapić,Aladin Crnkić
Categories: Machine Learning (cs.LG)

[LG-227] -Fold Cross-Validation for energy-aware Machine Learning Evaluations

Link: https://arxiv.org/abs/2410.09463
Authors: Christopher Mahlich,Tobias Vente,Joeran Beel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-228] Power-Softmax: Towards Secure LLM Inference over Encrypted Data

Link: https://arxiv.org/abs/2410.09457
Authors: Itamar Zimerman,Allon Adir,Ehud Aharoni,Matan Avitan,Moran Baruch,Nir Drucker,Jenny Lerner,Ramy Masalha,Reut Meiri,Omri Soceanu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

[LG-229] VERITAS-NLI : Validation and Extraction of Reliable Information Through Automated Scraping and Natural Language Inference

Link: https://arxiv.org/abs/2410.09455
Authors: Arjun Shah,Hetansh Shah,Vedica Bafna,Charmi Khandor,Sindhu Nair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, 15 pages, 7 figures

[LG-230] Skipping Computations in Multimodal LLMs NEURIPS2024

Link: https://arxiv.org/abs/2410.09454
Authors: Mustafa Shukor,Matthieu Cord
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024 Workshop RBFM. Code: this https URL

[LG-231] MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning

Link: https://arxiv.org/abs/2410.09437
Authors: Yaming Yang,Dilixat Muhtar,Yelong Shen,Yuefeng Zhan,Jianfeng Liu,Yujing Wang,Hao Sun,Denvy Deng,Feng Sun,Qi Zhang,Weizhu Chen,Yunhai Tong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 Pages, 4 Figures

[LG-232] FlatQuant: Flatness Matters for LLM Quantization

Link: https://arxiv.org/abs/2410.09426
Authors: Yuxuan Sun,Ruikang Liu,Haoli Bai,Han Bao,Kang Zhao,Yuening Li,Jiaxin Hu,Xianzhi Yu,Lu Hou,Chun Yuan,Xin Jiang,Wulong Liu,Jun Yao
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages

[LG-233] Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study

Link: https://arxiv.org/abs/2410.09411
Authors: Pengfei He,Yingqian Cui,Han Xu,Hui Liu,Makoto Yamada,Jiliang Tang,Yue Xing
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

[LG-234] C-Adapter: Adapting Deep Classifiers for Efficient Conformal Prediction Sets

Link: https://arxiv.org/abs/2410.09408
Authors: Kangdao Liu,Hao Zeng,Jianguo Huang,Huiping Zhuang,Chi-Man Vong,Hongxin Wei
Categories: Machine Learning (cs.LG)

[LG-235] CAMPHOR: Collaborative Agents for Multi-input Planning and High-Order Reasoning On Device

Link: https://arxiv.org/abs/2410.09407
Authors: Yicheng Fu,Raviteja Anantha,Jianpeng Cheng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-236] Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation

Link: https://arxiv.org/abs/2410.09403
Authors: Haoyang Su,Renqi Chen,Shixiang Tang,Xinzhe Zheng,Jingzhe Li,Zhenfei Yin,Wanli Ouyang,Nanqing Dong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

[LG-237] Text Classification using Graph Convolutional Networks: A Comprehensive Survey

Link: https://arxiv.org/abs/2410.09399
Authors: Syed Mustafa Haider Rizvi,Ramsha Imran,Arif Mahmood
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-238] MITA: Bridging the Gap between Model and Data for Test-time Adaptation

Link: https://arxiv.org/abs/2410.09398
Authors: Yige Yuan,Bingbing Xu,Teng Xiao,Liang Hou,Fei Sun,Huawei Shen,Xueqi Cheng
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

[LG-239] Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes

Link: https://arxiv.org/abs/2410.09397
Authors: Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song,Yufa Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)

[LG-240] Mamba4Cast: Efficient Zero-Shot Time Series Forecasting with State Space Models

Link: https://arxiv.org/abs/2410.09385
Authors: Sathya Kamesh Bhethanabhotla,Omar Swelam,Julien Siems,David Salinas,Frank Hutter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-241] Deep Transfer Learning: Model Framework and Error Analysis

Link: https://arxiv.org/abs/2410.09383
Authors: Yuling Jiao,Huazhen Lin,Yuchen Luo,Jerry Zhijian Yang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

[LG-242] Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

Link: https://arxiv.org/abs/2410.09375
Authors: Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Yufa Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)

[LG-243] Debiasing Vison-Language Models with Text-Only Training

Link: https://arxiv.org/abs/2410.09365
Authors: Yunfan Yang,Chaoquan Jiang,Zhiyu Lin,Jinlin Xiao,Jiaming Zhang,Jitao Sang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

[LG-244] SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Link: https://arxiv.org/abs/2410.09362
Authors: Jongwoo Ko,Saket Dingliwal,Bhavana Ganesh,Sailik Sengupta,Sravan Bodapati,Aram Galstyan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-245] Decision-Point Guided Safe Policy Improvement

Link: https://arxiv.org/abs/2410.09361
Authors: Abhishek Sharma,Leo Benac,Sonali Parbhoo,Finale Doshi-Velez
Categories: Machine Learning (cs.LG)

[LG-246] Towards the Synthesis of Non-speech Vocalizations

Link: https://arxiv.org/abs/2410.09360
Authors: Enjamamul Hoq,Ifeoma Nwogu
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

[LG-247] Fusion Matrix Prompt Enhanced Self-Attention Spatial-Temporal Interactive Traffic Forecasting Framework

Link: https://arxiv.org/abs/2410.09356
Authors: Mu Liu,MingChen Sun,YingJi Li,Ying Wang
Categories: Machine Learning (cs.LG)
Comments: THE WEB CONFERENCE 2025

[LG-248] On Divergence Measures for Training GFlowNets NEURIPS2024

Link: https://arxiv.org/abs/2410.09355
Authors: Tiago da Silva,Eliezer de Souza da Silva,Diego Mesquita
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at NeurIPS 2024, this https URL

[LG-249] Inference and Verbalization Functions During In-Context Learning EMNLP2024

Link: https://arxiv.org/abs/2410.09349
Authors: Junyi Tao,Xiaoyin Chen,Nelson F. Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024 Findings

[LG-250] BANGS: Game-Theoretic Node Selection for Graph Self-Training

Link: https://arxiv.org/abs/2410.09348
Authors: Fangxin Wang,Kay Liu,Sourav Medya,Philip S. Yu
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Preprint

[LG-251] Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Link: https://arxiv.org/abs/2410.09347
Authors: Huayu Chen,Hang Su,Peize Sun,Jun Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

[LG-252] DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models

Link: https://arxiv.org/abs/2410.09344
Authors: Wenlong Deng,Yize Zhao,Vala Vakilian,Minghui Chen,Xiaoxiao Li,Christos Thrampoulidis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-253] Advanced Gesture Recognition in Autism: Integrating YOLOv7 Video Augmentation and VideoMAE for Video Analysis

Link: https://arxiv.org/abs/2410.09339
Authors: Amit Kumar Singh,Trapti Shrivastava,Vrijendra Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-254] Towards Multi-Modal Animal Pose Estimation: An In-Depth Analysis

Link: https://arxiv.org/abs/2410.09312
Authors: Qianyi Deng,Oishi Deb,Amir Patel,Christian Rupprecht,Philip Torr,Niki Trigoni,Andrew Markham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 5 figures, 8 tables

[LG-255] Graph Neural Alchemist: An innovative fully modular architecture for time series-to-graph classification

Link: https://arxiv.org/abs/2410.09307
Authors: Paulo Coelho,Raul Araju,Luís Ramos,Samir Saliba,Renato Vimieiro
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-256] Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Link: https://arxiv.org/abs/2410.09303
Authors: Buu Phan,Brandon Amos,Itai Gat,Marton Havasi,Matthew Muckley,Karen Ullrich
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-257] Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Link: https://arxiv.org/abs/2410.09302
Authors: Guanlin Liu,Kaixuan Ji,Renjie Zheng,Zheng Wu,Chen Dun,Quanquan Gu,Lin Yan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-258] Nudging: Inference-time Alignment via Model Collaboration

Link: https://arxiv.org/abs/2410.09300
Authors: Yu Fei,Yasaman Razeghi,Sameer Singh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-259] Hierarchical uncertainty estimation for learning-based registration in neuroimaging

Link: https://arxiv.org/abs/2410.09299
Authors: Xiaoling Hu,Karthik Gopinath,Peirong Liu,Malte Hoffmann,Koen Van Leemput,Oula Puonti,Juan Eugenio Iglesias
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 15 pages, 6 figures

[LG-260] DeepOSets: Non-Autoregressive In-Context Learning of Supervised Learning Operators

Link: https://arxiv.org/abs/2410.09298
Authors: Shao-Ting Chiu,Junyuan Hong,Ulisses Braga-Neto
Categories: Machine Learning (cs.LG)

[LG-261] Ranking over Regression for Bayesian Optimization and Molecule Selection

Link: https://arxiv.org/abs/2410.09290
Authors: Gary Tom,Stanley Lo,Samantha Corapi,Alan Aspuru-Guzik,Benjamin Sanchez-Lengeling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 14 + 4 pages, 5 + 3 figures

[LG-262] AuD-Former: A Hierarchical Transformer Network for Multimodal Audio-Based Disease Prediction

Link: https://arxiv.org/abs/2410.09289
Authors: Jinjin Cai,Ruiqi Wang,Dezhong Zhao,Ziqin Yuan,Victoria McKenna,Aaron Friedman,Rachel Foot,Susan Storey,Ryan Boente,Sudip Vhaduri,Byung-Cheol Min
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

[LG-263] Enhanced Federated Anomaly Detection Through Autoencoders Using Summary Statistics-Based Thresholding

Link: https://arxiv.org/abs/2410.09284
Authors: Sofiane Laridi,Gregory Palmer,Kam-Ming Mark Tam
Categories: Machine Learning (cs.LG)

[LG-264] Predicting Drug Effects from High-Dimensional Asymmetric Drug Datasets by Using Graph Neural Networks: A Comprehensive Analysis of Multitarget Drug Effect Prediction

Link: https://arxiv.org/abs/2410.09280
Authors: Avishek Bose,Guojing Cong
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, 14 sub-figures, 4 tables

[LG-265] Articulated Animal AI: An Environment for Animal-like Cognition in a Limbed Agent NEURIPS2024

Link: https://arxiv.org/abs/2410.09275
Authors: Jeremy Lucas,Isabeau Prémont-Schwarz
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, accepted to Workshop on Open-World Agents (OWA-2024) at NeurIPS 2024 in Vancouver, Canada

[LG-266] Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

Link: https://arxiv.org/abs/2410.09247
Authors: Jacob Haimes,Cenny Wenner,Kunvar Thaman,Vassil Tashev,Clement Neo,Esben Kran,Jason Schreiber
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[LG-267] DFM: Interpolant-free Dual Flow Matching NEURIPS2024

Link: https://arxiv.org/abs/2410.09246
Authors: Denis Gudovskiy,Tomoyuki Okuno,Yohei Nakata
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Extended Abstract Track at the Unifying Representations in Neural Models Workshop (NeurIPS 2024)

[LG-268] nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

Link: https://arxiv.org/abs/2410.09240
Authors: Maksim Kuznetsov,Airat Valiev,Alex Aliper,Daniil Polykovskiy,Elena Tutubalina,Rim Shayakhmetov,Zulfat Miftahutdinov
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

[LG-269] Scaling Gaussian Processes for Learning Curve Prediction via Latent Kronecker Structure NEURIPS2024

Link: https://arxiv.org/abs/2410.09239
Authors: Jihao Andreas Lin,Sebastian Ament,Maximilian Balandat,Eytan Bakshy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Bayesian Decision-making and Uncertainty Workshop at NeurIPS 2024

[LG-270] M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought EMNLP2024

Link: https://arxiv.org/abs/2410.09220
Authors: Gitanjali Kumari,Kirtan Jain,Asif Ekbal
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 34 Pages. Accepted in The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Main Conference

[LG-271] Continual Learning with Neuromorphic Computing: Theories Methods and Applications

Link: https://arxiv.org/abs/2410.09218
Authors: Mishal Fatima Minhas,Rachmad Vidya Wicaksana Putra,Falah Awwad,Osman Hasan,Muhammad Shafique
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 71 pages, 31 figures, 6 tables

[LG-272] pyhgf: A neural network library for predictive coding

Link: https://arxiv.org/abs/2410.09206
Authors: Nicolas Legrand,Lilian Weber,Peter Thestrup Waade,Anna Hedvig Møller Daugaard,Mojtaba Khodadadi,Nace Mikuš,Chris Mathys
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

[LG-273] Encoding Agent Trajectories as Representations with Sequence Transformers

Link: https://arxiv.org/abs/2410.09204
Authors: Athanasios Tsiligkaridis,Nicholas Kalinowski,Zhongheng Li,Elizabeth Hou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, to be presented at GeoAI workshop at ACM SigSpatial 2024

[LG-274] An Efficient Contrastive Unimodal Pretraining Method for EHR Time Series Data

Link: https://arxiv.org/abs/2410.09199
Authors: Ryan King,Shivesh Kodali,Conrad Krueger,Tianbao Yang,Bobak J. Mortazavi
Categories: Machine Learning (cs.LG)

[LG-275] Scalable Signature-Based Distribution Regression via Reference Sets

Link: https://arxiv.org/abs/2410.09196
Authors: Andrew Alden,Carmine Ventre,Blanka Horvath
Categories: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
Comments: 24 pages, 4 figures

[LG-276] Long Range Named Entity Recognition for Marathi Documents

Link: https://arxiv.org/abs/2410.09192
Authors: Pranita Deshmukh,Nikita Kulkarni,Sanhita Kulkarni,Kareena Manghani,Geetanjali Kale,Raviraj Joshi
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-277] Time to Retrain? Detecting Concept Drifts in Machine Learning Systems

Link: https://arxiv.org/abs/2410.09190
Authors: Tri Minh Triet Pham,Karthikeyan Premkumar,Mohamed Naili,Jinqiu Yang
Categories: Machine Learning (cs.LG)

[LG-278] Automated Rewards via LLM-Generated Progress Functions

Link: https://arxiv.org/abs/2410.09187
Authors: Vishnu Sarukkai,Brennan Shacklett,Zander Majercik,Kush Bhatia,Christopher Ré,Kayvon Fatahalian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 5 figures

[LG-279] Learning Algorithms Made Simple

Link: https://arxiv.org/abs/2410.09186
Authors: Noorbakhsh Amiri Golilarz,Elias Hossain,Abdoljalil Addeh,Keyan Alexander Rahimi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-280] L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Link: https://arxiv.org/abs/2410.09184
Authors: Pranita Deshmukh,Nikita Kulkarni,Sanhita Kulkarni,Kareena Manghani,Raviraj Joshi
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-281] Can a large language model be a gaslighter?

Link: https://arxiv.org/abs/2410.09181
Authors: Wei Li,Luyao Zhu,Yang Song,Ruixi Lin,Rui Mao,Yang You
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 10/26 (Main Body/Total), 8 figures

[LG-282] Learning to Walk from Three Minutes of Real-World Data with Semi-structured Dynamics Models

Link: https://arxiv.org/abs/2410.09163
Authors: Jacob Levy,Tyler Westenbroek,David Fridovich-Keil
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)

[LG-283] On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2410.09156
Authors: Bokun Wang,Yunwen Lei,Yiming Ying,Tianbao Yang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

[LG-284] ACER: Automatic Language Model Context Extension via Retrieval

Link: https://arxiv.org/abs/2410.09141
Authors: Luyu Gao,Yunyi Zhang,Jamie Callan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

[LG-285] Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset

Link: https://arxiv.org/abs/2410.09135
Authors: Victor Radermecker,Andrea Zanon,Nancy Thomas,Annita Vapsi,Saba Rahimi,Rama Ramakrishnan,Daniel Borrajo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

[LG-286] When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning

Link: https://arxiv.org/abs/2410.09132
Authors: Hao Yan,Chaozhuo Li,Zhigang Yu,Jun Yin,Ruochen Liu,Peiyan Zhang,Weihao Han,Mingzheng Li,Zhengxin Zeng,Hao Sun,Weiwei Deng,Feng Sun,Qi Zhang,Senzhang Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

[LG-287] nextlocllm: next location prediction using LLMs

Link: https://arxiv.org/abs/2410.09129
Authors: Shuai Liu,Ning Cao,Yile Chen,Yue Jiang,Gao Cong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages

[LG-288] TIGER: Temporally Improved Graph Entity Linker

Link: https://arxiv.org/abs/2410.09128
Authors: Pengyu Zhang,Congfeng Cao,Paul Groth
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

[LG-289] CYCLE: Cross-Year Contrastive Learning in Entity-Linking

Link: https://arxiv.org/abs/2410.09127
Authors: Pengyu Zhang,Congfeng Cao,Klim Zaporojets,Paul Groth
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-290] Training on Fake Labels: Mitigating Label Leakage in Split Learning via Secure Dimension Transformation

Link: https://arxiv.org/abs/2410.09125
Authors: Yukun Jiang,Peiran Wang,Chengguo Lin,Ziyue Huang,Yong Cheng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

[LG-291] SoK: Verifiable Cross-Silo FL

Link: https://arxiv.org/abs/2410.09124
Authors: Aleksei Korneev(CRIStAL, MAGNET),Jan Ramon(CRIStAL, MAGNET)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

[LG-292] Context-Aware Adapter Tuning for Few-Shot Relation Learning in Knowledge Graphs

Link: https://arxiv.org/abs/2410.09123
Authors: Ran Liu,Zhongzhou Liu,Xiaoli Li,Yuan Fang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

[LG-293] lucie: An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository

Link: https://arxiv.org/abs/2410.09119
Authors: Kenneth Ge,Phuc Nguyen,Ramy Arnaout
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 5 pages, 3 figures

[LG-294] FSW-GNN: A Bi-Lipschitz WL-Equivalent Graph Neural Network

Link: https://arxiv.org/abs/2410.09118
Authors: Yonatan Sverdlov,Yair Davidson,Nadav Dym,Tal Amir
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-295] Optimizing Hard-to-Place Kidney Allocation: A Machine Learning Approach to Center Ranking

Link: https://arxiv.org/abs/2410.09116
Authors: Sean Berry,Berk Gorgulu,Sait Tunc,Mucahit Cevik,Matthew J Ellis
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-296] Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Link: https://arxiv.org/abs/2410.09114
Authors: Andrey Anurin,Jonathan Ng,Kibo Schaffer,Ziyue Wang,Jason Schreiber,Esben Kran
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: this https URL

[LG-297] Compressing high-resolution data through latent representation encoding for downscaling large-scale AI weather forecast model

Link: https://arxiv.org/abs/2410.09109
Authors: Qian Liu,Bing Gong,Xiaoran Zhuang,Xiaohui Zhong,Zhiming Kang,Hao Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 19 pages

[LG-298] Federated Learning for Data Market: Shapley-UCB for Seller Selection and Incentives

Link: https://arxiv.org/abs/2410.09107
Authors: Kongyang Chen,Zeming Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

[LG-299] Parameter-Efficient Fine-Tuning via Selective Discrete Cosine Transform

Link: https://arxiv.org/abs/2410.09103
Authors: Yixian Shen,Qi Bi,Jia-Hong Huang,Hongyi Zhu,Anuj Pathania
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[LG-300] Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

Link: https://arxiv.org/abs/2410.09102
Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

[LG-301] Data Taggants: Dataset Ownership Verification via Harmless Targeted Data Poisoning

Link: https://arxiv.org/abs/2410.09101
Authors: Wassim Bouaziz, El-Mahdi El-Mhamdi, Nicolas Usunier
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 16 pages, 7 figures

[LG-302] Adaptive Active Inference Agents for Heterogeneous and Lifelong Federated Learning

Link: https://arxiv.org/abs/2410.09099
Authors: Anastasiya Danilenka, Alireza Furutanpey, Victor Casamayor Pujol, Boris Sedlak, Anna Lackinger, Maria Ganzha, Marcin Paprzycki, Schahram Dustdar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 11 pages, double column, 15 figures, 2 tables

[LG-303] Reflections on Disentanglement and the Latent Space

Link: https://arxiv.org/abs/2410.09094
Authors: Ludovica Schaerf
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to xCoAx 2024 as part of the School of X

[LG-304] Mechanistic? EMNLP2024

Link: https://arxiv.org/abs/2410.09087
Authors: Naomi Saphra, Sarah Wiegreffe
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Equal contribution. Position paper. Accepted for presentation at the BlackBoxNLP workshop at EMNLP 2024

[LG-305] Diagnosing Robotics Systems Issues with Large Language Models

Link: https://arxiv.org/abs/2410.09084
Authors: Jordis Emilia Herrmann, Aswath Mandakath Gopinath, Mikael Norrlof, Mark Niklas Müller
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

[LG-306] Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Link: https://arxiv.org/abs/2410.09083
Authors: Lu Chen, Yuxuan Huang, Yixing Li, Yaohui Jin, Shuai Zhao, Zilong Zheng, Quanshi Zhang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

[LG-307] Leveraging Social Determinants of Health in Alzheimer's Research Using LLM-Augmented Literature Mining and Knowledge Graphs

Link: https://arxiv.org/abs/2410.09080
Authors: Tianqi Shang, Shu Yang, Weiqing He, Tianhua Zhai, Dawei Li, Bojian Hou, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Li Shen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

[LG-308] BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09079
Authors: Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Findings)

[LG-309] Modeling and Prediction of the UEFA EURO 2024 via Combined Statistical Learning Approaches

Link: https://arxiv.org/abs/2410.09068
Authors: Andreas Groll, Lars M. Hvattum, Christophe Ley, Jonas Sternemann, Gunther Schauberger, Achim Zeileis
Categories: Machine Learning (cs.LG); Applications (stat.AP)

[LG-310] AI versus AI in Financial Crimes and Detection: GenAI Crime Waves to Co-Evolutionary AI

Link: https://arxiv.org/abs/2410.09066
Authors: Eren Kurshan, Dhagash Mehta, Bayan Bruss, Tucker Balch
Categories: Machine Learning (cs.LG)

[LG-311] Geospatial Road Cycling Race Results Data Set

Link: https://arxiv.org/abs/2410.09055
Authors: Bram Janssens, Luca Pappalardo, Jelle De Bock, Matthias Bogaert, Steven Verstockt
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

[LG-312] Separation of Neural Drives to Muscles from Transferred Polyfunctional Nerves using Implanted Micro-electrode Arrays

Link: https://arxiv.org/abs/2410.10694
Authors: Laura Ferrante, Anna Boesendorfer, Deren Yusuf Barsakcioglu, Benedikt Baumgartner, Yazan Al-Ajam, Alex Woollard, Norbert Venantius Kang, Oskar Aszmann, Dario Farina
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)

[LG-313] QueST: Querying Functional and Structural Niches on Spatial Transcriptomics Data via Contrastive Subgraph Embedding

Link: https://arxiv.org/abs/2410.10652
Authors: Mo Chen, Minsheng Hao, Xuegong Zhang, Lei Wei
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

[LG-314] High-Dimensional Differential Parameter Inference in Exponential Family using Time Score Matching

Link: https://arxiv.org/abs/2410.10637
Authors: Daniel J. Williams, Leyang Wang, Qizhen Ying, Song Liu, Mladen Kolar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Daniel J. Williams and Leyang Wang contributed equally to this work

[LG-315] Robust Gradient Descent for Phase Retrieval

Link: https://arxiv.org/abs/2410.10623
Authors: Alex Buna, Patrick Rebeschini
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-316] Online waveform selection for cognitive radar

Link: https://arxiv.org/abs/2410.10591
Authors: Thulasi Tholeti, Avinash Rangarajan, Sheetal Kalyani
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-317] Data-Driven Approaches for Modelling Target Behaviour

Link: https://arxiv.org/abs/2410.10538
Authors: Isabel Schlangen, André Brandenburger, Mengwei Sun, James R. Hopgood
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 11 pages, 9 figures. Submitted to IEEE Transactions on Signal Processing on October 14, 2024

[LG-318] Inverse Problems and Data Assimilation: A Machine Learning Approach

Link: https://arxiv.org/abs/2410.10523
Authors: Eviatar Bach, Ricardo Baptista, Daniel Sanz-Alonso, Andrew Stuart
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 254 pages

[LG-319] Coupled autoregressive active inference agents for control of multi-joint dynamical systems

Link: https://arxiv.org/abs/2410.10415
Authors: Tim N. Nisslbeck, Wouter M. Kouw
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 14 pages, 3 figures, accepted to the International Workshop on Active Inference 2024

[LG-320] Bayesian Optimisation with Unknown Hyperparameters: Regret Bounds Logarithmically Closer to Optimal

Link: https://arxiv.org/abs/2410.10384
Authors: Juliusz Ziomek, Masaki Adachi, Michael A. Osborne
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-321] Collaborative filtering based on nonnegative/binary matrix factorization

Link: https://arxiv.org/abs/2410.10381
Authors: Yukino Terui, Yuka Inoue, Yohei Hamakawa, Kosuke Tatsumura, Kazue Kudo
Categories: Statistical Mechanics (cond-mat.stat-mech); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures

[LG-322] Groningen: Spatial Prediction of Rock Gas Saturation by Leveraging Selected and Augmented Well and Seismic Data with Classifier Ensembles

Link: https://arxiv.org/abs/2410.10371
Authors: Dmitry Ivlev
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures, 7 tables

[LG-323] Learning via Surrogate PAC-Bayes

Link: https://arxiv.org/abs/2410.10230
Authors: Antoine Picard-Weibel, Roman Moscoviz, Benjamin Guedj (UCL, UCL-CS, Inria, Inria-London, MODAL)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-324] Neural Quasiprobabilistic Likelihood Ratio Estimation with Negatively Weighted Data

Link: https://arxiv.org/abs/2410.10216
Authors: Matthew Drnevich, Stephen Jiggins, Judith Katzy, Kyle Cranmer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 59 pages, 29 figures

[LG-325] Queueing Matching Bandits with Preference Feedback NEURIPS2024

Link: https://arxiv.org/abs/2410.10098
Authors: Jung-hun Kim, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: NeurIPS2024

[LG-326] fastHDMI: Fast Mutual Information Estimation for High-Dimensional Data

Link: https://arxiv.org/abs/2410.10082
Authors: Kai Yang, Masoud Asgharian, Nikhil Bhagwat, Jean-Baptiste Poline, Celia M.T. Greenwood
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments: 31 pages, 5 figures

[LG-327] DAG-aware Transformer for Causal Effect Estimation

Link: https://arxiv.org/abs/2410.10044
Authors: Manqing Liu, David R. Bellamy, Andrew L. Beam
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-328] Physics-informed AI and ML-based sparse system identification algorithm for discovery of PDEs representing nonlinear dynamic systems

Link: https://arxiv.org/abs/2410.10023
Authors: Ashish Pal, Sutanu Bhowmick, Satish Nagarajaiah
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)

[LG-329] Phase retrieval: Global convergence of gradient descent with optimal sample complexity

Link: https://arxiv.org/abs/2410.09990
Authors: Théodore Fougereux, Cédric Josz, Xiaopeng Li
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

[LG-330] Gradient Span Algorithms Make Predictable Progress in High Dimension

Link: https://arxiv.org/abs/2410.09973
Authors: Felix Benning, Leif Döring
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)

[LG-331] Variational Diffusion Posterior Sampling with Midpoint Guidance

Link: https://arxiv.org/abs/2410.09945
Authors: Badr Moufad, Yazid Janati, Lisa Bedin, Alain Durmus, Randal Douc, Eric Moulines, Jimmy Olsson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-332] Predicting Molecular Ground-State Conformation via Conformation Optimization

Link: https://arxiv.org/abs/2410.09795
Authors: Fanmeng Wang, Minjie Cheng, Hongteng Xu
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

[LG-333] Learning from the past: predicting critical transitions with machine learning trained on surrogates of historical data

Link: https://arxiv.org/abs/2410.09707
Authors: Zhiqin Ma, Chunhua Zeng, Yi-Cheng Zhang, Thomas M. Bury
Categories: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG)

[LG-334] Universal scaling laws in quantum-probabilistic machine learning by tensor network towards interpreting representation and generalization powers

Link: https://arxiv.org/abs/2410.09703
Authors: Sheng-Chen Bai, Shi-Ju Ran
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 5 pages (main text) + 3 pages (appendices), 5 figures (main text) + 4 figures (appendices)

[LG-335] Transformers as Game Players: Provable In-context Game-playing Capabilities of Pre-trained Models NEURIPS2024

Link: https://arxiv.org/abs/2410.09701
Authors: Chengshuai Shi, Kun Yang, Jing Yang, Cong Shen
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to NeurIPS 2024

[LG-336] Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics

Link: https://arxiv.org/abs/2410.09697
Authors: Omar Chehab, Anna Korba, Austin Stromme, Adrien Vacher
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

[LG-337] Neural Solver Selection for Combinatorial Optimization

Link: https://arxiv.org/abs/2410.09693
Authors: Chengrui Gao, Haopu Shang, Ke Xue, Chao Qian
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-338] EG-SpikeFormer: Eye-Gaze Guided Transformer on Spiking Neural Networks for Medical Image Analysis

Link: https://arxiv.org/abs/2410.09674
Authors: Yi Pan, Hanqi Jiang, Junhao Chen, Yiwei Li, Huaqin Zhao, Yifan Zhou, Peng Shu, Zihao Wu, Zhengliang Liu, Dajiang Zhu, Xiang Li, Yohannes Abate, Tianming Liu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

[LG-339] Structured Regularization for Constrained Optimization on the SPD Manifold

Link: https://arxiv.org/abs/2410.09660
Authors: Andrew Cheng, Melanie Weber
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)

[LG-340] Many-body Expansion Based Machine Learning Models for Octahedral Transition Metal Complexes

Link: https://arxiv.org/abs/2410.09659
Authors: Ralf Meyer, Daniel Benjamin Kasman Chu, Heather J. Kulik
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

[LG-341] On Goodhart's law with an application to value alignment

Link: https://arxiv.org/abs/2410.09638
Authors: El-Mahdi El-Mhamdi, Lê-Nguyên Hoang
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 47 pages, 11 figures

[LG-342] Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?

Link: https://arxiv.org/abs/2410.09636
Authors: Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, accepted for APSIPA 2024 ASC

[LG-343] Exploring Behavior-Relevant and Disentangled Neural Dynamics with Generative Diffusion Models

Link: https://arxiv.org/abs/2410.09614
Authors: Yule Wang, Chengrui Li, Weihan Li, Anqi Wu
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

[LG-344] Second-Order Min-Max Optimization with Lazy Hessians

Link: https://arxiv.org/abs/2410.09568
Authors: Lesi Chen, Chengchang Liu, Jingzhao Zhang
Categories: Optimization and Control (math.OC); Computational Complexity (cs.CC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

[LG-345] Distribution-Aware Mean Estimation under User-level Local Differential Privacy

Link: https://arxiv.org/abs/2410.09506
Authors: Corentin Pla, Hugo Richard, Maxime Vono
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 1 figure

[LG-346] Identification of Non-causal Graphical Models

Link: https://arxiv.org/abs/2410.09480
Authors: Junyao You, Mattia Zorzi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted to the IEEE CDC 2024 conference

[LG-347] Exploring Channel Distinguishability in Local Neighborhoods of the Model Space in Quantum Neural Networks

Link: https://arxiv.org/abs/2410.09470
Authors: Sabrina Herbst, Sandeep Suresh Cranganore, Vincenzo De Maio, Ivona Brandic
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

[LG-348] Anderson Acceleration in Nonsmooth Problems: Local Convergence via Active Manifold Identification

Link: https://arxiv.org/abs/2410.09420
Authors: Kexin Li, Luwei Bai, Xiao Wang, Hao Wang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)

[LG-349] 3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information

Link: https://arxiv.org/abs/2410.09388
Authors: Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[LG-350] Combinatorial optimization of the coefficient of determination

Link: https://arxiv.org/abs/2410.09316
Authors: Marc Harary
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Combinatorics (math.CO)

[LG-351] Data Deletion for Linear Regression with Noisy SGD

Link: https://arxiv.org/abs/2410.09311
Authors: Zhangjie Xia, Chi-Hua Wang, Guang Cheng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-352] MOZART: Ensembling Approach for COVID-19 Detection using Chest X-Ray Imagery

Link: https://arxiv.org/abs/2410.09255
Authors: Mohammed Shabo, Nazar Siddig
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper was originally intended to be published as part of my this http URL . graduation project in Electrical and Electronics Engineering at the University of Khartoum in 2021. However, due to political and economic instability, and most recently, the outbreak of conflict in Sudan in April 2023, the publication process was significantly delayed. But yeah, better late than never

[LG-353] MVG-CRPS: A Robust Loss Function for Multivariate Probabilistic Forecasting

Link: https://arxiv.org/abs/2410.09133
Authors: Vincent Zhihao Zheng, Lijun Sun
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-354] Comparing Quantum Encoding Techniques

Link: https://arxiv.org/abs/2410.09121
Authors: Nidhi Munikote, Ang Li, Chenxu Liu, Samuel Stein
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

[LG-355] IceDiff: High Resolution and High-Quality Sea Ice Forecasting with Generative Diffusion Prior

Link: https://arxiv.org/abs/2410.09111
Authors: Jingyi Xu, Siwei Tu, Weidong Yang, Shuhao Li, Keyi Liu, Yeqi Luo, Lipeng Ma, Ben Fei, Lei Bai
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

[LG-356] An Innovative Attention-based Ensemble System for Credit Card Fraud Detection

Link: https://arxiv.org/abs/2410.09069
Authors: Mehdi Hosseini Chagahi, Niloufar Delfan, Saeed Mohammadi Dashtaki, Behzad Moshiri, Md. Jalil Piran
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)

[LG-357] Volatility Forecasting in Global Financial Markets Using TimeMixer

Link: https://arxiv.org/abs/2410.09062
Authors: Alex Li
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages and 2 figures

Information Retrieval

[IR-0] Generating Model Parameters for Controlling: Parameter Diffusion for Controllable Multi-Task Recommendation

Link: https://arxiv.org/abs/2410.10639
Authors: Chenglei Shen, Jiahao Zhao, Xiao Zhang, Weijie Yu, Ming He, Jianping Fan
Keywords: recommender systems face, task requirements, Commercial recommender systems, model, model parameters
Categories: Information Retrieval (cs.IR)

Abstract:Commercial recommender systems face the challenge that task requirements from platforms or users often change dynamically (e.g., varying preferences for accuracy or diversity). Ideally, the model should be re-trained after resetting a new objective function, adapting to these changes in task requirements. However, in practice, the high computational costs associated with retraining make this process impractical for models already deployed to online environments. This raises a new challenging problem: how to efficiently adapt the learning model to different task requirements by controlling model parameters after deployment, without the need for retraining. To address this issue, we propose a novel controllable learning approach via Parameter Diffusion for controllable multi-task Recommendation (PaDiRec), which allows the customization and adaptation of recommendation model parameters to new task requirements without retraining. Specifically, we first obtain the optimized model parameters through adapter tuning based on the feasible task requirements. Then, we utilize the diffusion model as a parameter generator, employing classifier-free guidance in conditional training to learn the distribution of optimized model parameters under various task requirements. Finally, the diffusion model is applied to effectively generate model parameters in a test-time adaptation manner given task requirements. As a model-agnostic approach, PaDiRec can leverage existing recommendation models as backbones to enhance their controllability. Extensive experiments on public datasets and a dataset from a commercial app indicate that PaDiRec can effectively enhance controllability through efficient model parameter generation. The code is released at this https URL.

[IR-1] VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Link: https://arxiv.org/abs/2410.10594
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords: enables large language, external knowledge sources, utilize external knowledge, large language models, RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25–39% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at this https URL .

[IR-2] Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.10542
Authors: Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
Keywords: context of Indian, study investigates judgment, Indian judgments, including InLegalBERT, utilizing a range
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted on NLLP at EMNLP 2024

Abstract:This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

[IR-3] Advancing Academic Knowledge Retrieval via LLM-enhanced Representation Similarity Fusion KDD

Link: https://arxiv.org/abs/2410.10455
Authors: Wei Dai, Peng Fu, Chunjing Gan
Keywords: swift information renewal, robust technological growth, avant-garde academic insights, academic insights spanning, information renewal
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 2nd Place of KDD Cup 2024 OAG-Challenge AQA

Abstract:In an era marked by robust technological growth and swift information renewal, furnishing researchers and the populace with top-tier, avant-garde academic insights spanning various domains has become an urgent necessity. The KDD Cup 2024 AQA Challenge is geared towards advancing retrieval models to identify pertinent academic terminologies from suitable papers for scientific inquiries. This paper introduces the LLM-KnowSimFuser proposed by Robo Space, which won 2nd place in the competition. With inspiration drawn from the superior performance of LLMs on multiple tasks, after careful analysis of the provided datasets, we first perform fine-tuning and inference using LLM-enhanced pre-trained retrieval models to introduce the tremendous language understanding and open-domain knowledge of LLMs into this task, followed by a weighted fusion based on the similarity matrix derived from the inference results. Finally, experiments conducted on the competition datasets show the superiority of our proposal, which achieved a score of 0.20726 on the final leaderboard.
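
The weighted fusion of similarity matrices mentioned in this abstract can be illustrated with a minimal sketch. It assumes each retrieval model yields a query-by-paper score matrix of the same shape and that fusion is a plain weighted sum; the function names, the weights, and the toy scores are hypothetical and are not taken from the paper.

```python
def fuse_similarities(matrices, weights):
    """Weighted sum of same-shaped query-by-paper similarity matrices."""
    if len(matrices) != len(weights):
        raise ValueError("one weight per matrix is required")
    n_queries = len(matrices[0])
    n_papers = len(matrices[0][0])
    fused = [[0.0] * n_papers for _ in range(n_queries)]
    for matrix, weight in zip(matrices, weights):
        for q in range(n_queries):
            for p in range(n_papers):
                fused[q][p] += weight * matrix[q][p]
    return fused


def rank_papers(fused_row):
    """Return paper indices sorted by descending fused score."""
    return sorted(range(len(fused_row)), key=lambda p: fused_row[p], reverse=True)


# Toy scores from two hypothetical retrievers: one query, three candidate papers.
model_a = [[0.9, 0.2, 0.4]]
model_b = [[0.1, 0.8, 0.9]]
fused = fuse_similarities([model_a, model_b], weights=[0.6, 0.4])
ranking = rank_papers(fused[0])
print(ranking)
```

In this toy case the third paper ranks first after fusion even though neither model alone scores it far ahead, which is the usual motivation for combining retrievers.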

[IR-4] Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion EMNLP2024

Link: https://arxiv.org/abs/2410.10408
Authors: Xinping Zhao, Jindi Yu, Zhenyu Liu, Jifang Wang, Dongfang Li, Yibin Chen, Baotian Hu, Min Zhang
Keywords: Large Language Models, Language Models, Large Language, prevail in Large, factually incorrect
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 3 figures, 6 tables. Accepted by EMNLP 2024’s demo track

Abstract:As we all know, hallucinations prevail in Large Language Models (LLMs), where the generated content is coherent but factually incorrect, which inflicts a heavy blow on the widespread application of LLMs. Previous studies have shown that LLMs could confidently state non-existent facts rather than answering “I don’t know”. Therefore, it is necessary to resort to external knowledge to detect and correct the hallucinated content. Since manual detection and correction of factual errors is labor-intensive, developing an automatic end-to-end hallucination-checking approach is indeed a needful thing. To this end, we present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate) manifest the great potential of Medico. A video demo of Medico can be found at this https URL.
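
For readers unfamiliar with the retrieval metrics quoted in this abstract, the following sketch shows how HR@k and MRR@k are commonly computed. It assumes exactly one relevant evidence item per query, which is a simplification; the paper's exact evaluation protocol is not stated here, and the identifiers below are made up for illustration.

```python
def hit_rate_at_k(ranked_lists, relevant_items, k):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_items)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_lists)


def mrr_at_k(ranked_lists, relevant_items, k):
    """Mean reciprocal rank, counting only hits within the top k results."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_items):
        top_k = ranked[:k]
        if relevant in top_k:
            total += 1.0 / (top_k.index(relevant) + 1)
    return total / len(ranked_lists)


# Three toy queries with hypothetical evidence ids.
ranked = [["e1", "e2", "e3"], ["e4", "e5", "e6"], ["e7", "e8", "e9"]]
relevant = ["e1", "e5", "e0"]
hr = hit_rate_at_k(ranked, relevant, k=2)
mrr = mrr_at_k(ranked, relevant, k=2)
print(hr, mrr)
```

HR@k only records whether the relevant item was retrieved, while MRR@k also rewards placing it near the top, so the two are usually reported together.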

[IR-5] BookWorm: A Dataset for Character Description and Analysis EMNLP2024

Link: https://arxiv.org/abs/2410.10372
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Keywords: driving the plot, engaging readers, plot and engaging, numerous interacting characters, Characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 30 pages, 2 figures, EMNLP 2024 Findings

Abstract:Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.

[IR-6] A Hybrid Filtering for Micro-video Hashtag Recommendation using Graph-based Deep Neural Network

Link: https://arxiv.org/abs/2410.10367
Authors: Shubhi Bansal, Kushaan Gowda, Mohammad Zia Ur Rehman, Chandravardhan Singh Raghaw, Nagendra Kumar
Keywords: manage content efficiently, social media platforms, media platforms, growing volume, indicators to manage
Categories: Information Retrieval (cs.IR)

Abstract:Due to the growing volume of user generated content, hashtags are employed as topic indicators to manage content efficiently on social media platforms. However, finding these vital topics is challenging in microvideos since they contain substantial information in a short duration. Existing methods that recommend hashtags for microvideos primarily focus on content and personalization while disregarding relatedness among users. Moreover, the cold start user issue prevails in hashtag recommendation systems. Considering the above, we propose a hybrid filtering based MIcro-video haSHtag recommendatiON (MISHON) technique to recommend hashtags for micro-videos. Besides content based filtering, we employ user-based collaborative filtering to enhance recommendations. Since hashtags reflect users’ topical interests, we find similar users based on historical tagging behavior to model user relatedness. We employ a graph-based deep neural network to model user to user, modality to modality, and user to modality interactions. We then use refined modality specific and user representations to recommend pertinent hashtags for microvideos. The empirical results on three real world datasets demonstrate that MISHON attains comparative improvements of 3.6%, 2.8%, and 6.5% in F1 score, respectively. Since cold start users exist whose historical tagging information is unavailable, we also propose a content and social influence based technique to model the relatedness of cold start users with influential users. The proposed solution shows a relative improvement of 15.8 percent in the F1 score over its content only counterpart. These results show that the proposed framework mitigates the cold start user problem.
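
The idea of finding similar users from historical tagging behavior can be sketched with plain user-based collaborative filtering. This is only an illustration: it swaps in Jaccard similarity over hashtag sets in place of the graph-based deep neural network the abstract describes, and the user names, hashtags, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hashtags."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_hashtags(target, history, top_n=3):
    """Score unseen hashtags by the summed similarity of the users who used them."""
    target_tags = history[target]
    scores = {}
    for user, tags in history.items():
        if user == target:
            continue
        similarity = jaccard(target_tags, tags)
        if similarity == 0.0:
            continue  # users with no shared tags contribute nothing
        for tag in tags - target_tags:
            scores[tag] = scores.get(tag, 0.0) + similarity
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))[:top_n]


# Toy tagging histories for four hypothetical users.
history = {
    "alice": {"#travel", "#food"},
    "bob": {"#travel", "#food", "#hiking"},
    "carol": {"#food", "#baking"},
    "dave": {"#gaming"},
}
recommendations = recommend_hashtags("alice", history)
print(recommendations)
```

A user with an empty history gets no recommendations from this scheme, which is the cold start situation the abstract addresses with content and social influence signals.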

[IR-7] Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning

Link: https://arxiv.org/abs/2410.10360
Authors: Yongxin Xu, Ruizhe Zhang, Xinke Jiang, Yujie Feng, Yuzhen Xiao, Xinyu Ma, Runchuan Zhu, Xu Chu, Junfeng Zhao, Yasha Wang
Keywords: Large Language Models, Large Language, incorporating externally retrieved, externally retrieved knowledge, faced by Large
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) offers an effective solution to the issues faced by Large Language Models (LLMs) in hallucination generation and knowledge obsolescence by incorporating externally retrieved knowledge. However, due to potential conflicts between internal and external knowledge, as well as retrieval noise, LLMs often struggle to effectively integrate external evidence, leading to a decline in performance. Although existing methods attempt to tackle these challenges, they often struggle to strike a balance between model adherence and robustness, resulting in significant learning variance. Inspired by human cognitive processes, we propose Parenting, a novel framework that decouples adherence and robustness within the parameter space of LLMs. Specifically, Parenting utilizes a key parameter mining method based on forward activation gain to identify and isolate the crucial parameter units that are strongly linked to adherence and robustness. Then, Parenting employs a type-guided tailored tuning strategy, applying specific and appropriate fine-tuning methods to parameter units representing different capabilities, aiming to achieve a balanced enhancement of adherence and robustness. Extensive experiments on various datasets and models validate the effectiveness and generalizability of our methods.

[IR-8] Enhancing Attributed Graph Networks with Alignment and Uniformity Constraints for Session-based Recommendation

Link: https://arxiv.org/abs/2410.10296
Authors: Xinping Zhao, Chaochao Chen, Jiajie Su, Yizhao Zhang, Baotian Hu
Keywords: attribute-agnostic SBR models, existing attribute-agnostic SBR, drawn increasing attention, Session-based Recommendation, SBR models
Categories: Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures, 5 tables. Accepted by ICWS 2024

Abstract:Session-based Recommendation (SBR), seeking to predict a user’s next action based on an anonymous session, has drawn increasing attention for its practicability. Most SBR models only rely on the contextual transitions within a short session to learn item representations while neglecting additional valuable knowledge. As such, their model capacity is largely limited by the data sparsity issue caused by short sessions. A few studies have exploited the Modeling of Item Attributes (MIA) to enrich item representations. However, they usually involve specific model designs that can hardly transfer to existing attribute-agnostic SBR models and thus lack universality. In this paper, we propose a model-agnostic framework, named AttrGAU (Attributed Graph Networks with Alignment and Uniformity Constraints), to bring the MIA’s superiority into existing attribute-agnostic models, to improve their accuracy and robustness for recommendation. Specifically, we first build a bipartite attributed graph and design an attribute-aware graph convolution to exploit the rich attribute semantics hidden in the heterogeneous item-attribute relationship. We then decouple existing attribute-agnostic SBR models into the graph neural network and attention readout sub-modules to satisfy the non-intrusive requirement. Lastly, we design two representation constraints, i.e., alignment and uniformity, to optimize distribution discrepancy in representation between the attribute semantics and collaborative semantics. Extensive experiments on three public benchmark datasets demonstrate that the proposed AttrGAU framework can significantly enhance backbone models’ recommendation performance and robustness against data sparsity and data noise issues. Our implementation codes will be available at this https URL.
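
The abstract names alignment and uniformity constraints but gives no formulas. The sketch below assumes the definitions commonly used in contrastive representation learning (Wang and Isola, 2020): alignment is the mean squared distance between paired unit-norm embeddings, and uniformity is the log of the mean Gaussian potential over all pairs. AttrGAU's exact losses may differ, and the variable names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(xs, ys):
    """Mean squared distance between paired embeddings (lower = better aligned)."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)


def uniformity(xs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform)."""
    pairs = list(combinations(xs, 2))
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))


# Toy item embeddings from two views: attribute semantics and collaborative semantics.
attr = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]]
collab = [normalize(v) for v in [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]]
align_loss = alignment(attr, collab)
unif_loss = uniformity(attr)
```

Minimizing the first term pulls the two views of the same item together, while the second spreads items over the hypersphere so the representations do not collapse.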

[IR-9] FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG

Link: https://arxiv.org/abs/2410.10293
Authors: Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, Min Zhang
Keywords: Large Language Models, Language Models, Large Language, prevails in Large, retrieval
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, 13 tables

Abstract:Retrieval-Augmented Generation (RAG) prevails in Large Language Models. It mainly consists of retrieval and generation. The retrieval modules (a.k.a. retrievers) aim to find useful information used to facilitate generation modules (a.k.a. generators). As such, generators’ performance largely depends on the effectiveness and efficiency of retrievers. However, the retrieval paradigm that we design and use remains flat, which treats the retrieval procedures as a one-off deal with constant granularity. Despite their effectiveness, we argue that flat paradigms suffer from two limitations: (1) flat retrieval exerts a significant burden on one retriever; (2) constant granularity limits the ceiling of retrieval performance. In this work, we propose a progressive retrieval paradigm with coarse-to-fine granularity for RAG, termed FunnelRAG, so as to balance effectiveness and efficiency. Specifically, FunnelRAG establishes a progressive retrieval pipeline by combining coarse-to-fine granularity, large-to-small quantity, and low-to-high capacity, which can relieve the burden on one retriever and also raise the ceiling of retrieval performance. Extensive experiments show that FunnelRAG achieves comparable retrieval performance while the time overhead is reduced by nearly 40 percent.
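
The coarse-to-fine idea can be sketched as a two-stage funnel: a cheap scorer ranks large units first, and a costlier scorer only sees what survives. This is a toy illustration under stated assumptions, not FunnelRAG's pipeline: both scorers are simple token-overlap stand-ins, and every name below is hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def coarse_score(query, document):
    """Cheap scorer for large units: count of shared tokens."""
    return len(tokens(query) & tokens(document))


def fine_score(query, passage):
    """Stand-in for a stronger, costlier scorer: token-level F1."""
    q, p = tokens(query), tokens(passage)
    overlap = len(q & p)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(q)
    return 2 * precision * recall / (precision + recall)


def funnel_retrieve(query, corpus, keep_docs=2, keep_passages=2):
    """corpus maps a document id to its list of passages."""
    # Stage 1 (coarse units, many candidates): rank whole documents cheaply.
    ranked_docs = sorted(
        corpus,
        key=lambda d: coarse_score(query, " ".join(corpus[d])),
        reverse=True,
    )[:keep_docs]
    # Stage 2 (fine units, few candidates): rank passages of surviving documents only.
    candidates = [(d, p) for d in ranked_docs for p in corpus[d]]
    candidates.sort(key=lambda dp: fine_score(query, dp[1]), reverse=True)
    return candidates[:keep_passages]


corpus = {
    "doc_a": ["whittle index for restless bandits", "bandit arms and rewards"],
    "doc_b": ["retrieval augmented generation uses a retriever",
              "the generator reads retrieved passages"],
    "doc_c": ["pdf parsing tools compared", "table detection in pdf files"],
}
top = funnel_retrieve("how does retrieval augmented generation work",
                      corpus, keep_docs=1, keep_passages=1)
```

The expensive scorer runs on two passages here instead of six, which is the source of the time saving the abstract describes.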

[IR-10] Back-of-the-Book Index Automation for Arabic Documents

Link: https://arxiv.org/abs/2410.10286
Authors: Nawal Haidar, Fadi A. Zaraket
Keywords: indexes are crucial, book readability, index, Arabic books, Abstract
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Back-of-the-book indexes are crucial for book readability. Their manual creation is laborious and error-prone. In this paper, we consider automating back-of-the-book index extraction for Arabic books to help simplify both the creation and review tasks. Given a back-of-the-book index, we aim to check and identify the accurate occurrences of index terms relative to the associated pages. To achieve this, we first define a pool of candidates for each term by extracting all possible noun phrases from paragraphs appearing on the relevant index pages. These noun phrases, identified through part-of-speech analysis, are stored in a vector database for efficient retrieval. We use several metrics, including exact matches, lexical similarity, and semantic similarity, to determine the most appropriate occurrence. The candidate with the highest score based on these metrics is chosen as the occurrence of the term. We fine-tuned a heuristic method that considers the above metrics and achieves an F1-score of 0.966 (precision = 0.966, recall = 0.966). These excellent results open the door for future work related to automation of back-of-the-book index generation and checking.
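
The three reported figures are mutually consistent: F1 is the harmonic mean of precision P and recall R, so equal precision and recall give the same F1. As a quick check using only the numbers in the abstract:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.966 \times 0.966}{0.966 + 0.966} = 0.966
```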

[IR-11] DecKG: Decentralized Collaborative Learning with Knowledge Graph Enhancement for POI Recommendation

Link: https://arxiv.org/abs/2410.10130
Authors: Ruiqi Zheng, Liang Qu, Guanhua Ye, Tong Chen, Yuhui Shi, Hongzhi Yin
Keywords: gained research interest, Decentralized collaborative learning, research interest due, leverages collaborative learning, collaborative learning
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Decentralized collaborative learning for Point-of-Interest (POI) recommendation has gained research interest due to its advantages in privacy preservation and efficiency, as it keeps data local and leverages collaborative learning among clients to train models in a decentralized manner. However, since local data is often limited and insufficient for training accurate models, a common solution is integrating external knowledge as auxiliary information to enhance model performance. Nevertheless, this solution poses challenges for decentralized collaborative learning. Due to the private nature of local data, identifying relevant auxiliary information specific to each user is non-trivial. Furthermore, resource-constrained local devices struggle to accommodate all auxiliary information, which places a heavy burden on local storage. To fill the gap, we propose a novel decentralized collaborative learning with knowledge graph enhancement framework for POI recommendation (DecKG). Instead of directly uploading interacted items, users generate desensitized check-in data by uploading general categories of interacted items and sampling similar items from the same category. The server then pretrains the KG without sensitive user-item interactions and deploys relevant partitioned sub-KGs to individual users. Entities are further refined on the device, allowing client-to-client communication to exchange knowledge learned from local data and sub-KGs. Evaluations across two real-world datasets demonstrate DecKG’s effectiveness in recommendation performance.

[IR-12] MAIR: A Massive Benchmark for Evaluating Instructed Retrieval EMNLP2024

Link: https://arxiv.org/abs/2410.10127
Authors: Weiwei Sun, Zhengliang Shi, Jiulong Wu, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, Zhaochun Ren
Keywords: Recent information retrieval, Recent information, Massive Instructed Retrieval, Instructed Retrieval Benchmark, wide range
Categories: Information Retrieval (cs.IR)
Comments: EMNLP 2024

Abstract:Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We benchmark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available at this https URL.

[IR-13] Leveraging Customer Feedback for Multi-modal Insight Extraction NAACL2024

Link: https://arxiv.org/abs/2410.09999
Authors: Sandeep Sricharan Mukku, Abinesh Kanagarajan, Pushpendu Ghosh, Chetan Aggarwal
Keywords: Businesses can benefit, customer feedback, products and services, enhance their products, multi-modal customer feedback
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: NAACL 2024

Abstract:Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by 14 points in F1 score.

[IR-14] Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization

Link: https://arxiv.org/abs/2410.09942
Authors: Alireza Salemi, Hamed Zamani
Keywords: multiple retrieval-augmented generation, backbone large language, unified search engine, search engine, serve multiple retrieval-augmented
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and retrieval-augmentation strategy. We introduce an iterative approach where the search engine generates retrieval results for these RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using a novel expectation-maximization algorithm, with the goal of maximizing each agent’s utility function. Additionally, we adapt this approach to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve the results for each of them. Experiments on diverse datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that our approach, on average, significantly outperforms competitive baselines across 18 RAG models. We also demonstrate that our method effectively “personalizes” the retrieval process for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.

[IR-15] The Role of Fake Users in Sequential Recommender Systems

Link: https://arxiv.org/abs/2410.09936
Authors: Filippo Betello
Keywords: Sequential Recommender Systems, Sequential Recommender, Recommender Systems, Discounted Cumulative Gain, Rank Sensitivity List
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 2 figures

Abstract:Sequential Recommender Systems (SRSs) are widely used to model user behavior over time, yet their robustness remains an under-explored area of research. In this paper, we conduct an empirical study to assess how the presence of fake users, who engage in random interactions, follow popular or unpopular items, or focus on a single genre, impacts the performance of SRSs in real-world scenarios. We evaluate two SRS models across multiple datasets, using established metrics such as Normalized Discounted Cumulative Gain (NDCG) and Rank Sensitivity List (RLS) to measure performance. While traditional metrics like NDCG remain relatively stable, our findings reveal that the presence of fake users severely degrades RLS metrics, often reducing them to near-zero values. These results highlight the need for further investigation into the effects of fake users on training data and emphasize the importance of developing more resilient SRSs that can withstand different types of adversarial attacks.
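
For readers unfamiliar with the first metric, here is a minimal sketch of NDCG@k in its common linear-gain form (each item's relevance is divided by log2 of its rank plus one). The paper's exact evaluation setup is not given in the abstract, so the gain function, cutoff, and example ranking are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    # enumerate() starts at 0, so rank + 2 gives log2(2) for the first position.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(relevances, k):
    """DCG of the top-k list divided by the DCG of the ideal (sorted) top-k list."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Relevance of the items in the order the recommender ranked them (toy data).
ranked = [0, 1, 0, 1]
score = ndcg_at_k(ranked, k=4)
```

NDCG scores one ranking against the ground truth, whereas a sensitivity metric such as RLS compares the rankings produced before and after a perturbation. That difference is consistent with the abstract's finding that NDCG stays stable while RLS collapses.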

[IR-16] Analysis and Design of a Personalized Recommendation System Based on a Dynamic User Interest Model

Link: https://arxiv.org/abs/2410.09923
Authors: Chunyan Mao, Shuaishuai Huang, Mingxiu Sui, Haowei Yang, Xueshe Wang
Keywords: important research topic, explosion of information, user interest model, rapid development, accurate personalized recommendations
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of the internet and the explosion of information, providing users with accurate personalized recommendations has become an important research topic. This paper designs and analyzes a personalized recommendation system based on a dynamic user interest model. The system captures user behavior data, constructs a dynamic user interest model, and combines multiple recommendation algorithms to provide personalized content to users. The research results show that this system significantly improves recommendation accuracy and user satisfaction. This paper discusses the system’s architecture design, algorithm implementation, and experimental results in detail and explores future research directions.

[IR-17] ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification

Link: https://arxiv.org/abs/2410.09875
Authors: Chen Mao, Chong Tan, Jingqi Hu, Min Zheng
Keywords: Person re-identification, personnel counting, field of security, plays a vital, safety inspections
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:Person re-identification (ReID), as a crucial technology in the field of security, plays a vital role in safety inspections, personnel counting, and more. Most current ReID approaches primarily extract features from images, which are easily affected by objective conditions such as clothing changes and occlusions. In addition to cameras, we leverage widely available routers as sensing devices by capturing gait information from pedestrians through the Channel State Information (CSI) in WiFi signals and contribute a multimodal dataset. We employ a two-stream network to separately process video understanding and signal analysis tasks, and conduct multi-modal fusion and contrastive learning on pedestrian video and WiFi data. Extensive experiments in real-world scenarios demonstrate that our method effectively uncovers the correlations between heterogeneous data, bridges the gap between visual and signal modalities, significantly expands the sensing range, and improves ReID accuracy across multiple sensors.

[IR-18] A Comparative Study of PDF Parsing Tools Across Diverse Document Categories

Link: https://arxiv.org/abs/2410.09871
Authors: Narayan S. Adhikari, Shradha Agarwal
Keywords: prominent data formats, making PDF parsing, PDF parsing crucial, RAG systems, rise of RAG
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: 17 pages, 11 figures, 5 tables

Abstract:PDF is one of the most prominent data formats, making PDF parsing crucial for information extraction and retrieval, particularly with the rise of RAG systems. While various PDF parsing tools exist, their effectiveness across different document types remains understudied, especially beyond academic papers. Our research aims to address this gap by comparing 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. These tools include PyPDF, pdfminer.six, PyMuPDF, pdfplumber, pypdfium2, Unstructured, Tabula, Camelot, as well as the deep learning-based tools Nougat and Table Transformer (TATR). We evaluated both text extraction and table detection capabilities. For text extraction, PyMuPDF and pypdfium generally outperformed others, but all parsers struggled with Scientific and Patent documents. For these challenging categories, learning-based tools like Nougat demonstrated superior performance. In table detection, TATR excelled in the Financial, Patent, Law Regulations, and Scientific categories. The table detection tool Camelot performed best for Tender documents, while PyMuPDF performed best in the Manual category. Our findings highlight the importance of selecting appropriate parsing tools based on document type and specific tasks, providing valuable insights for researchers and practitioners working with diverse document sources.

[IR-19] Generating Driving Simulations via Conversation

Link: https://arxiv.org/abs/2410.09829
Authors: Rimvydas Rubavicius, Antonio Valerio Miceli-Barone, Alex Lascarides, Subramanian Ramamoorthy
Keywords: Cyber-physical systems, autonomous vehicles, scenario specification, Cyber-physical, domain-specific programs
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Robotics (cs.RO)
Comments: 6 pages, 6 figures, 2 tables

Abstract:Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than a generation without engaging in extended conversation.

[IR-20] ContextWIN: Whittle Index Based Mixture-of-Experts Neural Model For Restless Bandits Via Deep RL

Link: https://arxiv.org/abs/2410.09781
Authors: Zhanqiu Guo, Wayne Wang
Keywords: Restless Multi-Armed Bandit, address Restless Multi-Armed, Neural Whittle Index, Whittle Index Network, Multi-Armed Bandit
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments:

Abstract:This study introduces ContextWIN, a novel architecture that extends the Neural Whittle Index Network (NeurWIN) model to address Restless Multi-Armed Bandit (RMAB) problems with a context-aware approach. By integrating a mixture of experts within a reinforcement learning framework, ContextWIN adeptly utilizes contextual information to inform decision-making in dynamic environments, particularly in recommendation systems. A key innovation is the model’s ability to assign context-specific weights to a subset of NeurWIN networks, thus enhancing the efficiency and accuracy of the Whittle index computation for each arm. The paper presents a thorough exploration of ContextWIN, from its conceptual foundation to its implementation and potential applications. We delve into the complexities of RMABs and the significance of incorporating context, highlighting how ContextWIN effectively harnesses these elements. The convergence of both the NeurWIN and ContextWIN models is rigorously proven, ensuring theoretical robustness. This work lays the groundwork for future advancements in applying contextual information to complex decision-making scenarios, recognizing the need for comprehensive dataset exploration and environment development for full potential realization.

[IR-21] ChartKG: A Knowledge-Graph-Based Representation for Chart Images

Link: https://arxiv.org/abs/2410.09761
Authors: Zhiguang Zhou, Haoxuan Wang, Zhengqing Zhao, Fengling Zheng, Yongheng Wang, Wei Chen, Yong Wang
Keywords: explosively produced due, Chart images, Chart, explosively produced, produced due
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images into raw data and often ignore their visual encodings and semantic meanings, which can result in information loss for many downstream tasks. In this paper, we propose ChartKG, a novel knowledge graph (KG) based representation for chart images, which can model the visual elements in a chart image and semantic relations among them including visual encodings and visual insights in a unified manner. Further, we develop a general framework to convert chart images to the proposed KG-based representation. It integrates a series of image processing techniques to identify visual elements and relations, e.g., CNNs to classify charts, yolov5 and optical character recognition to parse charts, and rule-based methods to construct graphs. We present four cases to illustrate how our knowledge-graph-based representation can model the detailed visual elements and semantic relations in charts, and further demonstrate how our approach can benefit downstream applications such as semantic-aware chart retrieval and chart question answering. We also conduct quantitative evaluations to assess the two fundamental building blocks of our chart-to-KG framework, i.e., object recognition and optical character recognition. The results provide support for the usefulness and effectiveness of ChartKG.

[IR-22] Agentic Information Retrieval

Link: https://arxiv.org/abs/2410.09713
Authors: Weinan Zhang, Junwei Liao, Ning Li, Kounianhua Du
Keywords: information retrieval, information, Agentic Information Retrieval, Agentic, relevant information
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, position paper

Abstract:What will information entry look like in the next generation of digital products? Since the 1970s, user access to relevant information has relied on domain-specific architectures of information retrieval (IR). Over the past two decades, the advent of modern IR systems, including web search engines and personalized recommender systems, has greatly improved the efficiency of retrieving relevant information from vast data corpora. However, the core paradigm of these IR systems remains largely unchanged, relying on filtering a predefined set of candidate items. Since 2022, breakthroughs in large language models (LLMs) have begun transforming how information is accessed, establishing a new technical paradigm. In this position paper, we introduce Agentic Information Retrieval (Agentic IR), a novel IR paradigm shaped by the capabilities of LLM agents. Agentic IR expands the scope of accessible tasks and leverages a suite of new techniques to redefine information retrieval. We discuss three types of cutting-edge applications of agentic IR and the challenges faced. We propose that agentic IR holds promise for generating innovative applications, potentially becoming a central information entry point in future digital ecosystems.

[IR-23] Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.09629
Authors: Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar
Keywords: Large language models, Large language, Retrieval Augmented Generation, proficient in capturing, knowledge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: EMNLP 2024 main conference long paper

Abstract:Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

[IR-24] Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2410.09584
Authors: Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, Ji-Rong Wen
Keywords: Retrieval-Augmented Generation, Large Language Models, RAG systems, RAG, effective application
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at this https URL.
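
The abstract describes verifying instruction quality with generated code run by a Python executor. A minimal sketch of what such executable checks could look like, assuming each atomic instruction is paired with a small verification function and a composed instruction passes only if all its checks pass. The constraint types and names below are hypothetical, not taken from VIF-RAG.

```python
def max_words(limit):
    """Atomic constraint: the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def must_include(keyword):
    """Atomic constraint: the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()


def ends_with(suffix):
    """Atomic constraint: the response ends with `suffix`."""
    return lambda response: response.rstrip().endswith(suffix)


def verify(response, checks):
    """A composed instruction is satisfied only if every atomic check passes."""
    return all(check(response) for check in checks)


# Hypothetical composed instruction:
# "Answer in at most 12 words, mention Paris, and end with a period."
checks = [max_words(12), must_include("Paris"), ends_with(".")]
good = "The capital of France is Paris."
bad = "It is the largest city in France"
results = (verify(good, checks), verify(bad, checks))
```

Because each check is plain code, the same functions can filter synthesized training samples and score model outputs at evaluation time without a human or model judge.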

[IR-25] Towards Scalable Semantic Representation for Recommendation

Link: https://arxiv.org/abs/2410.09560
Authors: Taolin Zhang, Junwei Pan, Jinpeng Wang, Yaohua Zha, Tao Dai, Bin Chen, Ruisheng Luo, Xiaoxiang Deng, Yuan Wang, Ming Yue, Jie Jiang, Shu-Tao Xia
Keywords: large language models, developing Semantic IDs, Semantic IDs based, language models, recent advances
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:With recent advances in large language models (LLMs), a growing body of research has developed Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length. Such dimension compression results in inevitable losses in discriminability and dimension robustness of the LLM embeddings, which motivates us to scale up the semantic representation. In this paper, we propose Mixture-of-Codes, which first constructs multiple independent codebooks for LLM representation in the indexing stage, and then utilizes the Semantic Representation along with a fusion module for the downstream recommendation stage. Extensive analysis and experiments demonstrate that our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.

[IR-26] Eco-Aware Graph Neural Networks for Sustainable Recommendations

Link: https://arxiv.org/abs/2410.09514
Authors: Antonio Purificato, Fabrizio Silvestri
Keywords: alleviating information overload, Graph Neural Networks, Recommender systems play, providing personalized recommendations, personalized recommendations tailored
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 tables, 3 figures, RecSoGood Workshop

Abstract:Recommender systems play a crucial role in alleviating information overload by providing personalized recommendations tailored to users’ preferences and interests. Recently, Graph Neural Networks (GNNs) have emerged as a promising approach for recommender systems, leveraging their ability to effectively capture complex relationships and dependencies between users and items by representing them as nodes in a graph structure. In this study, we investigate the environmental impact of GNN-based recommender systems, an aspect that has been largely overlooked in the literature. Specifically, we conduct a comprehensive analysis of the carbon emissions associated with training and deploying GNN models for recommendation tasks. We evaluate the energy consumption and carbon footprint of different GNN architectures and configurations, considering factors such as model complexity, training duration, hardware specifications and embedding size. By addressing the environmental impact of resource-intensive algorithms in recommender systems, this study contributes to the ongoing efforts towards sustainable and responsible artificial intelligence, promoting the development of eco-friendly recommendation technologies that balance performance and environmental considerations. Code is available at: this https URL.

[IR-27] Green Recommender Systems: Optimizing Dataset Size for Energy-Efficient Algorithm Performance

Link: https://arxiv.org/abs/2410.09359
Authors: Ardalan Arabzadeh, Tobias Vente, Joeran Beel
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:


[IR-28] ACER: Automatic Language Model Context Extension via Retrieval

Link: https://arxiv.org/abs/2410.09141
Authors: Luyu Gao, Yunyi Zhang, Jamie Callan
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:


[IR-29] TIGER: Temporally Improved Graph Entity Linker

Link: https://arxiv.org/abs/2410.09128
Authors: Pengyu Zhang, Congfeng Cao, Paul Groth
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:


[IR-30] lucie: An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository

Link: https://arxiv.org/abs/2410.09119
Authors: Kenneth Ge, Phuc Nguyen, Ramy Arnaout
Keywords:
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 5 pages, 3 figures


[IR-31] Automating Bibliometric Analysis with Sentence Transformers and Retrieval-Augmented Generation (RAG): A Pilot Study in Semantic and Contextual Search for Customized Literature Characterization for High-Impact Urban Research

Link: https://arxiv.org/abs/2410.09090
Authors: Haowen Xu, Xueping Li, Jose Tupayachi, Jianming (Jamie) Lian, Femi Omitaomu
Keywords:
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:


[IR-32] A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2410.09077
Authors: Yilong Zhao, Daifeng Li
Keywords:
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:


[IR-33] Collaborative filtering based on nonnegative/binary matrix factorization

Link: https://arxiv.org/abs/2410.10381
Authors: Yukino Terui, Yuka Inoue, Yohei Hamakawa, Kosuke Tatsumura, Kazue Kudo
Keywords:
Categories: Statistical Mechanics (cond-mat.stat-mech); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures


Attachment Download

Click to download the full list of today's papers