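This post lists the latest papers retrieved from arXiv.org on 2024-10-08. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 11:00 every morning.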

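Overview (2024-10-08)

A total of 898 papers were updated today, including:

  • Natural language processing: 229 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 241 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 187 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 314 papers (Machine Learning (cs.LG))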

Natural Language Processing

[NLP-0] Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models EMNLP2024

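Quick read: This paper addresses quality problems in LLM-generated data for alignment, in particular underrepresented or missing aspects and low-quality datapoints. The key to the solution is Data Advisor, an enhanced LLM-based method that takes the characteristics of the target dataset into account: it monitors the status of the generated data, identifies weaknesses in the current dataset, and guides the next round of data generation accordingly, improving data quality and coverage. Data Advisor integrates easily into existing data generation methods, and experiments show that it improves model safety without sacrificing model utility.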

Link: https://arxiv.org/abs/2410.05269
Authors: Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Keywords: Data Advisor, Data, large language model, crucial element, element in large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference. Project website: this https URL

Abstract:Data is a crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.

[NLP-1] Grounding Partially-Defined Events in Multimodal Data EMNLP

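Quick read: This paper addresses the problem of understanding complex events from short video snippets, where visual data, unlike natural language, offers no straightforward way to represent partially observable events. The key to the solution is a multimodal formulation of partially-defined events that casts event extraction as a three-stage span retrieval task. Using the new MultiVENT-G benchmark, which contains 14.5 hours of densely annotated current-event videos and 1,168 text documents, the paper evaluates LLM-driven approaches to multimodal event analysis and illustrates both the challenges of abstract event understanding and the promise of event-centric video-language systems.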

Link: https://arxiv.org/abs/2410.05267
Authors: Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, Benjamin Van Durme
Keywords: learn about complex, short snippets, complex current events, complex current, events
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 9 pages; 2024 EMNLP Findings

Abstract: How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.

[NLP-2] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

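Quick read: This paper addresses token-wise outliers in activation quantization for LLM deployment, which force a reliance on costly per-token dynamic quantization. The key to the solution is PrefixQuant, which identifies high-frequency outlier tokens offline and prefixes them in the KV cache, preventing outlier tokens from being generated during inference and simplifying quantization. According to the authors, PrefixQuant is the first method to make efficient per-tensor static quantization outperform per-token dynamic quantization, with gains in both accuracy and inference speed.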

Link: https://arxiv.org/abs/2410.05265
Authors: Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
Keywords: deploying Large Language, Large Language Models, Large Language, enhancing memory efficiency, deploying Large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: A PTQ method to significantly boost the performance of static activation quantization

Abstract: Quantization is essential for deploying Large Language Models (LLMs) by enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, leading to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first to enable efficient per-tensor static quantization to outperform expensive per-token dynamic quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5 common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods like QuaRot with 0.98 perplexity improvement and +5.98 points accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at this https URL.
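
The contrast between per-token dynamic and per-tensor static quantization described in the abstract can be illustrated with a minimal Python sketch. This is not the PrefixQuant implementation: the symmetric rounding scheme, the function names, and the toy activation values are all assumptions made for illustration. The sketch only shows why a single outlier token inflates a shared scale, which is the problem that isolating outlier tokens is meant to avoid.

```python
def scale_for(values, bits=4):
    """Symmetric scale that maps the largest magnitude to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, bits=4):
    """Round to signed integers of the given width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def per_token_dynamic(tokens, bits=4):
    # One scale per token, computed at inference time from that token's values.
    return [quantize(t, scale_for(t, bits), bits) for t in tokens]


def per_tensor_static(tokens, scale, bits=4):
    # One scale for the whole tensor, fixed offline during calibration.
    return [quantize(t, scale, bits) for t in tokens]


def mean_abs_error(original, restored):
    pairs = [(a, b) for row_a, row_b in zip(original, restored)
             for a, b in zip(row_a, row_b)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Hypothetical activations: two ordinary tokens and one outlier token.
normal_tokens = [[0.5, -0.3, 0.7], [0.2, 0.6, -0.4]]
outlier_token = [60.0, -45.0, 30.0]

# Static scale calibrated with the outlier present vs. with it kept apart.
inflated_scale = scale_for([v for t in normal_tokens + [outlier_token] for v in t])
clean_scale = scale_for([v for t in normal_tokens for v in t])

error_inflated = mean_abs_error(
    normal_tokens, per_tensor_static(normal_tokens, inflated_scale))
error_clean = mean_abs_error(
    normal_tokens, per_tensor_static(normal_tokens, clean_scale))
error_dynamic = mean_abs_error(normal_tokens, per_token_dynamic(normal_tokens))
```

In this toy setting the ordinary tokens collapse to zero when the outlier sets the shared scale, whereas a static scale calibrated without the outlier keeps their error small without any per-token work at inference time.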

[NLP-3] TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

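Quick read: This paper addresses the shortcomings of existing LLM evaluation benchmarks in assessing dynamic user interaction and logical reasoning. The key to the solution is TurtleBench, an evaluation benchmark built from real user guesses collected on the authors' own Turtle Soup Puzzle platform. This allows evaluation datasets to be generated relatively dynamically, which mitigates the risk of model cheating, aligns the assessment more closely with users' real needs for reasoning ability, and makes the evaluation more reliable. The benchmark contains 1,532 user guesses with annotated correctness and is used to evaluate nine of the most advanced LLMs. The OpenAI o1 series models did not achieve leading results, and the authors propose hypotheses for further research.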

Link: https://arxiv.org/abs/2410.05262
Authors: Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li
Keywords: Large Language Models, Large Language, reliable evaluations increases, Language Models, application of Large
Categories: Computation and Language (cs.CL)
Comments: 22 pages

Abstract:As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model’s logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, thus enhancing the reliability of evaluations. TurtleBench includes 1,532 user guesses along with the correctness of guesses after annotation. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as “the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques” and “increasing CoT length not only provides reasoning benefits but also incurs noise costs.”

[NLP-4] Differential Transformer

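Quick read: This paper addresses the Transformer's tendency to overallocate attention to irrelevant context. The key to the solution is Diff Transformer, whose differential attention mechanism computes attention scores as the difference between two separate softmax attention maps, which cancels noise and promotes sparse attention patterns. The authors report advantages in long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.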

Link: https://arxiv.org/abs/2410.05258
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
Keywords: Diff Transformer, Transformer, Diff, introduce Diff Transformer, attention
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
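
The core idea stated in the abstract, attention scores computed as the difference between two separate softmax attention maps, can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names, the weighting factor `lam` on the second map, and the single-head, unbatched layout are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=1.0):
    """Subtract a second softmax map from the first, then weight the values.

    `lam` is a hypothetical weighting factor; lam=1.0 gives a plain difference.
    """
    map1 = attention_map(q1, k1)
    map2 = attention_map(q2, k2)
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(map1, map2)]
    width = len(values[0])
    out = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in diff
    ]
    return diff, out
```

With `lam=1.0` each row of the difference map sums to zero, so whatever weight both maps assign to the same position cancels, and only the positions on which the two maps disagree contribute to the output.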

[NLP-5] GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

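Quick read: This paper asks how LLMs behave in economic and strategic interactions: whether they act rationally, whether they can mimic human behavior, whether they tend to reach efficient and fair outcomes, what role natural language plays, and how the characteristics of the economic environment influence these dynamics. The key to the solution is a standardized benchmark for two-player, sequential, language-based games. It defines three base families of games with consistent parameterization, degrees of freedom, and economic measures for evaluating agent performance (self-gain) and game outcomes (efficiency and fairness). The authors also develop an open-source framework for interaction simulation and analysis and collect datasets of LLM vs. LLM and human vs. LLM interactions. Experiments show how the framework and datasets can be used to compare LLM agents with human players, evaluate individual and collective performance measures, and quantify the effect of the environment's economic characteristics on agent behavior.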

Link: https://arxiv.org/abs/2410.05254
Authors: Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
Keywords: Large Language Models, Large Language, Language Models, show significant potential, show significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents’ performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

[NLP-6] Causal Micro-Narratives EMNLP2024

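Quick read: This paper addresses the classification of causal micro-narratives from text, that is, sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The key to the solution is that the approach requires only a subject-specific ontology of causes and effects, with classification performed by large language models such as a fine-tuned Llama 3.1 8B. The best model reaches F1 scores of 0.87 on narrative detection and 0.71 on narrative classification, and the authors see wide-ranging applications in social science research.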

Link: https://arxiv.org/abs/2410.05252
Authors: Mourad Heddaya, Qingcheng Zeng, Chenhao Tan, Rob Voigt, Alexander Zentefis
Keywords: classify causal micro-narratives, micro-narratives from text, classify causal, Abstract, causal micro-narratives
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Workshop on Narrative Understanding

Abstract:We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model–a fine-tuned Llama 3.1 8B–achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.

[NLP-7] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

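Quick read: This paper addresses the high computational and labor costs of relying on high-quality supervised fine-tuning (SFT) datasets when instruction-tuning LLMs. The key to the solution is SFTMix, which uses training dynamics to identify examples with different confidence levels and applies Mixup-based regularization to mitigate overfitting on confident examples while improving learning on relatively unconfident ones. The method needs no well-curated dataset, and the authors report significant gains on instruction-following and healthcare domain-specific SFT tasks across LLM families and dataset sizes.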

Link: https://arxiv.org/abs/2410.05248
Authors: Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao
Keywords: induce desired behaviors, stage typically trains, large language models, instruction-tuning stage typically, typically trains LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets’ intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix’s design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.
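
The abstract combines two ingredients: separating examples by confidence and applying a Mixup-style interpolation between them. The sketch below shows generic Mixup on feature vectors and soft labels under simple assumptions. How SFTMix actually measures confidence from training dynamics, and where in the model it interpolates, is not specified in the abstract, so the median split, the `alpha` value, and all function names here are hypothetical.

```python
import random


def split_by_confidence(examples, confidences):
    """Split examples into a confident half and an unconfident half."""
    order = sorted(range(len(examples)), key=lambda i: confidences[i])
    half = len(order) // 2
    unconfident = [examples[i] for i in order[:half]]
    confident = [examples[i] for i in order[half:]]
    return confident, unconfident


def mixup(vec_a, vec_b, lam):
    """Convex combination of two equally sized vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]


def mixup_pairs(confident, unconfident, alpha=0.4, seed=0):
    """Pair confident with unconfident examples and interpolate each pair.

    Each example is a (features, soft_label) tuple; the mixing weight is
    drawn from Beta(alpha, alpha), as in standard Mixup.
    """
    rng = random.Random(seed)
    mixed = []
    for (feat_c, label_c), (feat_u, label_u) in zip(confident, unconfident):
        lam = rng.betavariate(alpha, alpha)
        mixed.append((mixup(feat_c, feat_u, lam), mixup(label_c, label_u, lam)))
    return mixed
```

The same weight is applied to features and labels, so an interpolated example carries a proportionally interpolated supervision signal, which is the regularizing effect Mixup relies on.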

[NLP-8] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

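Quick read: This paper addresses the insufficient visual grounding ability of GUI agents in complex, multi-platform environments. The key to the solution is a human-like embodiment in which the agent perceives the environment entirely visually and operates on the GUI at the pixel level, relying on a visual grounding model that maps diverse referring expressions of GUI elements to their on-screen coordinates. The paper proposes a simple and effective training recipe that combines web-based synthetic data with a slight adaptation of the LLaVA architecture to train UGround, a universal visual grounding model. UGround substantially outperforms existing models on several benchmarks, and agents using it outperform state-of-the-art agents that rely on additional text-based input, even though they use visual perception only.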

Link: https://arxiv.org/abs/2410.05243
Authors: Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
Keywords: Multimodal large language, graphical user interface, Multimodal large, GUI agents, GUI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.

[NLP-9] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models ACCV2024

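Quick read: This paper addresses how to adapt vision-language segmentation models (VLSMs) to new domains efficiently without expensive fine-tuning. The key to the solution is TuneVLSeg, an open-source benchmarking framework that integrates textual, visual, and multimodal prompt tuning techniques into VLSMs so that they can be applied to downstream segmentation datasets with any number of classes. Tests on 8 medical datasets and two natural-domain datasets show that textual prompt tuning struggles under significant domain shift, while visual prompt tuning, which has fewer hyperparameters than multimodal prompt tuning, often performs competitively with multimodal approaches and is therefore a valuable first attempt for domain-specific segmentation.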

Link: https://arxiv.org/abs/2410.05239
Authors: Rabin Adhikari, Safal Thapaliya, Manish Dhakal, Bishesh Khanal
Keywords: requires expensive fine-tuning, Prompt tuning, Vision-Language Segmentation Models, shown impressive performance, Prompt
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at ACCV 2024 (oral presentation)

Abstract: Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, to integrate various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes 6 prompt tuning strategies on various prompt depths used in 2 VLSMs, for a total of 8 different combinations. We test various prompt tuning on 8 diverse medical datasets, including 3 radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and 5 non-radiology datasets (polyp, ulcer, skin cancer), and two natural domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts, from natural-domain images to medical data. Furthermore, visual prompt tuning, with fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive to multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt-tuning techniques for robust domain-specific segmentation. The source code is available at this https URL.

[NLP-10] CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures

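Quick read: This paper addresses the explanation of decisions in medicine, in particular how to help resident doctors train their ability to explain diagnostic conclusions. The key to the solution is Multilingual CasiMedicos-Arg, a multilingual medical question answering dataset in which correct and incorrect diagnoses are accompanied by natural language explanations written by doctors and manually annotated with argument components (premise, claim) and argument relations (support, attack). The dataset provides a resource for training and evaluating explanatory AI models and supports the use of AI in medical education.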

Link: https://arxiv.org/abs/2410.05235
Authors: katerina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
Keywords: Explaining Artificial Intelligence, Explaining Artificial, Artificial Intelligence, major challenge nowadays, medicine and law
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract: Explaining Artificial Intelligence (AI) decisions is a major challenge nowadays in AI, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is a main issue also for human-based deliberation as it is important to justify why a certain decision has been taken. Resident medical doctors for instance are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached a certain conclusion. Developing new tools to aid residents to train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction, and we present, to the best of our knowledge, the first multilingual dataset for Medical Question Answering where correct and incorrect diagnoses for a clinical case are enriched with a natural language explanation written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), resulting in the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases in four languages (English, Spanish, French, Italian) with explanations, where we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform over this challenging dataset for the argument mining task.

[NLP-11] Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

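Quick read: This paper addresses the cost, time, and potential privacy and legal problems of building instruction datasets for fine-tuning LLMs. The key to the solution is Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, giving a scalable, cost-effective approach that avoids legal and privacy issues. Cookbook uses templates, which are data-generating Python functions, to produce training data that encourages the model to learn explicit pattern-based rules for a target task. It also learns algorithmically how to mix data from different templates to optimize multi-task performance.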

Link: https://arxiv.org/abs/2410.05224
Authors: Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré
Keywords: Fine-tuning large language, LLM generative capabilities, large language models, instruction datasets, improve LLM generative
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: COLM 2024

Abstract:Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or terms of service of LLM providers. Therefore, we seek a way of constructing instruction datasets with samples that are not generated by humans or LLMs but still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template – a data generating Python function – to produce training data that encourages the model to learn an explicit pattern-based rule that corresponds to a desired task. We find that fine-tuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points. Second, since instruction datasets improve performance on multiple downstream tasks simultaneously, Cookbook algorithmically learns how to mix data from various templates to optimize performance on multiple tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned using a Cookbook-generated dataset attains the best accuracy on average compared to other 7B parameter instruction-tuned models and is the best performing model on 3 out of 8 tasks. Finally, we analyze when and why Cookbook improves performance and present a metric that allows us to verify that the improvement is largely explained by the model’s generations adhering better to template rules.
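
The abstract describes a template as a data-generating Python function that produces simple patterns over random tokens. The sketch below is one hypothetical example of such a function, a key-value lookup pattern, written to make the idea concrete. It is not taken from the Cookbook framework: the function name, the token-ID vocabulary, and the output dictionary layout are all assumptions.

```python
import random


def retrieval_template(rng, vocab_size=1000, num_pairs=4):
    """Generate one training sample that follows a key-value lookup rule.

    The input is a flat sequence of random (key, value) token IDs followed by
    one query key; the expected output is the value paired with that key.
    """
    keys = rng.sample(range(vocab_size), num_pairs)  # distinct random keys
    values = [rng.randrange(vocab_size) for _ in keys]
    query = rng.choice(keys)
    context = [token for pair in zip(keys, values) for token in pair]
    answer = values[keys.index(query)]
    return {"input": context + [query], "output": [answer]}


def generate_dataset(num_samples, seed=0):
    """Call the template repeatedly with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [retrieval_template(rng) for _ in range(num_samples)]
```

Because the tokens are random, a sample carries no human-written or LLM-generated text. Only the rule linking input to output is meaningful, which matches the motivation given in the abstract.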

[NLP-12] Precise Model Benchmarking with Only a Few Observations EMNLP2024

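[Quick read]: This paper addresses how to precisely estimate a large language model's (LLM) accuracy on questions about a specific topic within a larger question-answering dataset. The key to the solution is an empirical Bayes (EB) estimator that balances the direct and regression estimates for each subgroup (topic) separately, giving more precise subgroup-level estimates of model performance. Experiments on multiple datasets show that, compared with the direct and regression approaches, the EB estimator substantially reduces mean squared error, and its confidence intervals are narrower than those of the direct estimator.

Link: https://arxiv.org/abs/2410.05222
Authors: Riccardo Fogliato,Pratik Patil,Nil-Jana Akpinar,Mathew Monfort
Keywords: larger question-answering dataset, large language model, model accuracy, larger question-answering, accuracy on questions
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: To appear at EMNLP 2024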

Abstract:How can we precisely estimate a large language model’s (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model’s accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model’s accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches, achieving substantial reductions in the mean squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower compared to those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
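The balance between direct and regression estimates described in the abstract can be sketched as a shrinkage rule. The code below is a generic empirical Bayes sketch under stated assumptions (binomial sampling variance, a method-of-moments prior variance, and a hypothetical `eb_estimates` helper); it is not necessarily the exact estimator used in the paper.

```python
def eb_estimates(direct, regression, sizes, eps=1e-6):
    """Shrink each subgroup's direct accuracy toward its regression estimate.

    direct:     per-subgroup mean accuracy (the direct estimator)
    regression: per-subgroup prediction from a model fit on other subgroups
    sizes:      number of questions in each subgroup
    """
    # Assumed binomial sampling variance of each direct estimate.
    sampling_var = [max(d * (1.0 - d), eps) / n for d, n in zip(direct, sizes)]
    # Method-of-moments estimate of how far true accuracies stray from regression.
    excess = [(d - r) ** 2 - v for d, r, v in zip(direct, regression, sampling_var)]
    prior_var = max(0.0, sum(excess) / len(excess))
    estimates = []
    for d, r, v in zip(direct, regression, sampling_var):
        weight = prior_var / (prior_var + v)  # near 1 for large subgroups
        estimates.append(weight * d + (1.0 - weight) * r)
    return estimates


# A large subgroup keeps its direct estimate; a tiny one is pulled toward regression.
print(eb_estimates(direct=[0.9, 0.5], regression=[0.7, 0.7], sizes=[1000, 4]))
```

The weight is computed per subgroup, which is what lets small subgroups borrow strength from the regression model while large subgroups stay close to their own data.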

[NLP-13] Density estimation with LLMs: a geometric investigation of in-context learning trajectories ICLR2025

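[Quick read]: This paper studies how well large language models (LLMs) estimate probability density functions (PDFs) from data observed in context. The key is to use Intensive Principal Component Analysis (InPCA) to visualize and analyze the in-context learning dynamics of LLaMA-2 models: the models follow similar learning trajectories in a low-dimensional InPCA space, and these trajectories differ markedly from those of traditional density estimators such as histograms and Gaussian kernel density estimation (KDE). The paper interprets LLaMA's in-context density estimation as a KDE with adaptive kernel width and shape; a custom kernel model with only two parameters captures a significant portion of LLaMA's behavior, shedding light on the mechanism of in-context probabilistic reasoning in LLMs.

Link: https://arxiv.org/abs/2410.05218
Authors: Toni J.B. Liu,Nicolas Boullé,Raphaël Sarfati,Christopher J. Earls
Keywords: Large language models, demonstrate remarkable emergent, including time series, time series forecasting, remarkable emergent abilities
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Under review as a conference paper at ICLR 2025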

Abstract:Large language models (LLMs) demonstrate remarkable emergent abilities to perform in-context learning across various tasks, including time series forecasting. This work investigates LLMs’ ability to estimate probability density functions (PDFs) from data observed in-context; such density estimation (DE) is a fundamental task underlying many probabilistic modeling problems. We leverage the Intensive Principal Component Analysis (InPCA) to visualize and analyze the in-context learning dynamics of LLaMA-2 models. Our main finding is that these LLMs all follow similar learning trajectories in a low-dimensional InPCA space, which are distinct from those of traditional density estimation methods like histograms and Gaussian kernel density estimation (KDE). We interpret the LLaMA in-context DE process as a KDE with an adaptive kernel width and shape. This custom kernel model captures a significant portion of LLaMA’s behavior despite having only two parameters. We further speculate on why LLaMA’s kernel width and shape differs from classical algorithms, providing insights into the mechanism of in-context probabilistic reasoning in LLMs.
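As a rough illustration of what a two-parameter kernel with a width and a shape might look like, the sketch below uses a generalized Gaussian kernel. This parameterization is an assumption made for illustration; the abstract does not specify the functional form the authors fit to LLaMA.

```python
import math


def kde(x, data, width, shape=2.0):
    """Kernel density estimate at x with a generalized Gaussian kernel.

    width: kernel bandwidth (assumed parameterization)
    shape: 2.0 gives a Gaussian-shaped kernel, 1.0 a Laplace-shaped one
    """
    # Normalizing constant so each kernel integrates to 1.
    norm = shape / (2.0 * width * math.gamma(1.0 / shape))
    total = sum(norm * math.exp(-((abs(x - xi) / width) ** shape)) for xi in data)
    return total / len(data)


samples = [0.0, 1.0]
print(kde(0.5, samples, width=1.0, shape=2.0))
print(kde(0.5, samples, width=1.0, shape=1.0))
```

Classical Gaussian KDE corresponds to fixing `shape=2.0` and choosing `width` by a bandwidth rule; letting both vary with the amount of observed data is one way to read "adaptive kernel width and shape".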

[NLP-14] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality EMNLP2024

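[Quick read]: This paper addresses the problem that traditional fine-tuning for better compositional understanding in pre-trained vision and language models (VLMs) can degrade performance on zero-shot multi-modal tasks. The key is Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates a local hard negative loss and selective calibrated regularization to provide fine-grained negative supervision while preserving the integrity of the model's representations, improving compositionality without sacrificing multi-modal capabilities.

Link: https://arxiv.org/abs/2410.05210
Authors: Youngtaek Oh,Jae Won Cho,Dong-Jin Kim,In So Kweon,Junmo Kim
Keywords: enhance compositional understanding, method to enhance, understanding in pre-trained, pre-trained vision, vision and language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024 (Long, Main). Project page: this https URL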

Abstract:In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model’s multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model’s representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: this https URL.

[NLP-15] Studying and Mitigating Biases in Sign Language Understanding Models

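[Quick read]: This paper aims to ensure that the benefits of sign language technologies are distributed equitably among community members, in particular by avoiding the biases and inequities that can arise from the design or use of crowd-sourced sign language datasets such as the ASL Citizen dataset. Using the participant demographics and lexical features in ASL Citizen, it studies and documents the biases that models trained on crowd-sourced sign datasets may exhibit. The key to the solution is applying several bias mitigation techniques during model training, which reduce performance disparities without decreasing accuracy. The paper also releases the demographic information about ASL Citizen participants to encourage future bias mitigation work in this area.

Link: https://arxiv.org/abs/2410.05206
Authors: Katherine Atwell,Danielle Bragg,Malihe Alikhani
Keywords: ASL Citizen dataset, ASL Citizen, sign language technologies, Citizen dataset, members is crucial
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: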

Abstract:Ensuring that the benefits of sign language technologies are distributed equitably among all community members is crucial. Thus, it is important to address potential biases and inequities that may arise from the design or use of these resources. Crowd-sourced sign language datasets, such as the ASL Citizen dataset, are great resources for improving accessibility and preserving linguistic diversity, but they must be used thoughtfully to avoid reinforcing existing biases. In this work, we utilize the rich information about participant demographics and lexical features present in the ASL Citizen dataset to study and document the biases that may result from models trained on crowd-sourced sign datasets. Further, we apply several bias mitigation techniques during model training, and find that these techniques reduce performance disparities without decreasing accuracy. With the publication of this work, we release the demographic information about the participants in the ASL Citizen dataset to encourage future bias mitigation work in this space.

[NLP-16] RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

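[Quick read]: This paper addresses the reliability gap between LLM-as-a-Judge and human evaluation when assessing text generation quality. The key is RevisEval, a new evaluation paradigm based on response-adapted references. RevisEval uses the text revision capabilities of large language models (LLMs) to adaptively revise the response under evaluation and treats the revised text as the reference for the subsequent evaluation. Experiments show that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms on natural language generation tasks and open-ended instruction-following tasks, and that response-adapted references further boost classical text metrics such as BLEU and BERTScore.

Link: https://arxiv.org/abs/2410.05193
Authors: Qiyuan Zhang,Yufei Wang,Tiezheng YU,Yuxin Jiang,Chuhan Wu,Liangyou Li,Yasheng Wang,Xin Jiang,Lifeng Shang,Ruiming Tang,Fuyuan Lyu,Chen Ma
Keywords: recent studies, text generation quality, significant efforts, efforts in recent, cost-effective alternative
Categories: Computation and Language (cs.CL)
Comments: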

Abstract:With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing the text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of reference pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via the response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treat the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval’s effectiveness in bias reduction, the impact of inference cost, and reference relevance.

[NLP-17] Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

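[Quick read]: This paper addresses the need to fix a compute budget in advance when training language models. The key is the Warmup-Stable-Decay (WSD) learning rate schedule. WSD uses a constant learning rate to produce a main branch that can continue indefinitely, removing the need for a pre-specified compute budget. For any given budget, one can branch off from the main branch at a suitable point and apply a rapidly decaying learning rate to obtain a strong model. The stable constant-learning-rate phase and the fast decay phase drive progress along the "river" and "mountain" directions respectively, and both are critical. The proposed WSD-S variant reuses the decay phases of previous checkpoints to further improve training efficiency.

Link: https://arxiv.org/abs/2410.05192
Authors: Kaiyue Wen,Zhiyuan Li,Jason Wang,David Hall,Percy Liang,Tengyu Ma
Keywords: typical cosine learning, learning rate, Training language models, fixed compute budget, rate schedule depends
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 45 pages, 13 figures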

Abstract:Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate’s oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints’ decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
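The three phases of a Warmup-Stable-Decay schedule can be sketched as a function of the training step. The linear warmup, the linear decay shape, and the parameter names below are assumptions for illustration; the abstract only specifies a constant stable phase followed by a rapid decay.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Learning rate at `step` under an assumed linear-warmup, linear-decay WSD schedule."""
    if step < warmup_steps:
        # Warmup: ramp up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: constant learning rate; this is the shared main branch.
        return peak_lr
    # Decay: drop quickly from peak_lr to min_lr, then stay there.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Two compute budgets branch from the same stable phase at different steps.
short_run = [wsd_lr(s, 1e-3, 10, 100, 20) for s in range(120)]
long_run = [wsd_lr(s, 1e-3, 10, 400, 80) for s in range(480)]
```

Unlike a cosine schedule, nothing before `decay_start` depends on the total number of steps, so the first 100 steps of `short_run` and `long_run` are identical and could share one checkpointed main branch.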

[NLP-18] Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics EMNLP2024

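[Quick read]: This paper addresses the limited interpretability of machine translation (MT) evaluation metrics in new use cases such as data filtering and translation re-ranking. The key is an interpretable evaluation framework that evaluates MT metrics in two scenarios serving as proxies for data filtering and translation re-ranking, and measures their performance with Precision, Recall, and F-score, giving more intuitive insight than the traditional correlation with human judgment. The paper also questions the reliability of manually curated data that follows the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines, reporting notably low agreement with Multidimensional Quality Metrics (MQM) annotations.

Link: https://arxiv.org/abs/2410.05183
Authors: Stefano Perrella,Lorenzo Proietti,Pere-Lluís Huguet Cabot,Edoardo Barba,Roberto Navigli
Keywords: Machine Translation, translation quality automatically, assess translation quality, metrics assess translation, metrics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024 Main Conference. 26 pages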

Abstract:Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics return assessments as scalar scores that are difficult to interpret, posing a challenge to making informed design choices. Moreover, MT metrics’ capabilities have historically been evaluated using correlation with human judgment, which, despite its efficacy, falls short of providing intuitive insights into metric performance, especially in terms of new metric use cases. To address these issues, we introduce an interpretable evaluation framework for MT metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. Furthermore, by measuring the performance of MT metrics using Precision, Recall, and F-score, we offer clearer insights into their capabilities than correlation with human judgments. Finally, we raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines, reporting a notably low agreement with Multidimensional Quality Metrics (MQM) annotations.
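To show why Precision, Recall, and F-score are easier to read than a correlation coefficient, the sketch below scores a metric used as a data filter: translations whose metric score reaches a threshold are kept, and the kept set is compared with human judgments of which translations are good. The function name, the thresholding rule, and the toy data are assumptions for illustration, not the paper's exact protocol.

```python
def filtering_prf(metric_scores, human_good, threshold):
    """Precision, recall, and F1 of 'keep if score >= threshold' against human labels."""
    kept = [score >= threshold for score in metric_scores]
    tp = sum(k and g for k, g in zip(kept, human_good))        # kept and good
    fp = sum(k and not g for k, g in zip(kept, human_good))    # kept but bad
    fn = sum((not k) and g for k, g in zip(kept, human_good))  # dropped but good
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


scores = [0.9, 0.8, 0.4, 0.3]        # hypothetical metric outputs
good = [True, False, True, False]    # hypothetical human judgments
print(filtering_prf(scores, good, threshold=0.5))
```

A precision of 0.5 here reads directly as "half of the translations the filter keeps are actually good", which is the kind of statement a correlation value cannot give.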

[NLP-19] Enhancing Equity in Large Language Models for Medical Applications

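[Quick read]: This paper addresses the risk that large language models (LLMs) in medical applications worsen health inequities for specific racial, gender, and underrepresented groups. The key is EquityGuard, a new framework that incorporates a bias detection mechanism able to identify and correct unfair predictions in LLM-based applications, promoting health equity across diverse population groups.

Link: https://arxiv.org/abs/2410.05180
Authors: Yuelyu Ji,Wenhe Ma,Sonish Sivarajkumar,Hang Zhang,Eugene Mathew Sadhu,Zhuochun Li,Xizhi Wu,Shyam Visweswaran,Yanshan Wang
Keywords: Clinical Trial Matching, automating Clinical Trial, clinical decision support, large language models, Trial Matching
Categories: Computation and Language (cs.CL)
Comments: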

Abstract:Recent advancements have highlighted the potential of large language models (LLMs) in medical applications, notably in automating Clinical Trial Matching for translational research and providing medical question-answering for clinical decision support. However, our study reveals significant inequities in the use of LLMs, particularly for individuals from specific racial, gender, and underrepresented groups influenced by social determinants of health. These disparities could worsen existing health inequities if LLMs are broadly adopted in healthcare. To address this, we propose and evaluate a novel framework, EquityGuard, designed to detect and mitigate biases in LLM-based medical applications. EquityGuard incorporates a Bias Detection Mechanism capable of identifying and correcting unfair predictions, thus enhancing outcomes and promoting equity across diverse population groups.

[NLP-20] ReasoningRank: Teaching Student Models to Rank through Reasoning-Based Knowledge Distillation

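[Quick read]: This paper addresses the lack of transparency and interpretability in reranking methods for information retrieval. The key is ReasoningRank, a novel reranking approach that improves clarity by generating two types of reasoning: explicit reasoning, which explains how a document addresses the query, and comparison reasoning, which justifies why one document is more relevant than another. Large language models (LLMs) serve as teacher models that generate these explanations, and the knowledge is distilled into smaller, more efficient student models, which achieve competitive performance with far fewer computational resources and suit large-scale or resource-constrained settings.

Link: https://arxiv.org/abs/2410.05168
Authors: Yuelyu Ji,Zhuochun Li,Rui Meng,Daqing He
Keywords: information retrieval, critical in information, Reranking documents based, Reranking, documents based
Categories: Computation and Language (cs.CL)
Comments: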

Abstract:Reranking documents based on their relevance to a given query is critical in information retrieval. Traditional reranking methods often focus on improving the initial rankings but lack transparency, failing to explain why one document is ranked higher. In this paper, we introduce ReasoningRank, a novel reranking approach that enhances clarity by generating two types of reasoning: explicit reasoning, which explains how a document addresses the query, and comparison reasoning, which justifies the relevance of one document over another. We leverage large language models (LLMs) as teacher models to generate these explanations and distill this knowledge into smaller, more resource-efficient student models. While the student models may not outperform LLMs in speed, they significantly reduce the computational burden by requiring fewer resources, making them more suitable for large-scale or resource-constrained settings. These student models are trained to both generate meaningful reasoning and rerank documents, achieving competitive performance across multiple datasets, including MSMARCO and BRIGHT. Experiments demonstrate that ReasoningRank improves reranking accuracy and provides valuable insights into the decision-making process, offering a structured and interpretable solution for reranking tasks.

[NLP-21] Efficient Inference for Large Language Model-based Generative Recommendation

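[Quick read]: This paper addresses the excessive inference latency caused by autoregressive decoding in large language model (LLM)-based generative recommendation. The key is an alignment framework named AtSpeed, which accelerates decoding through two strategies: (1) strengthening top-K sequence alignment between the draft model and the target LLM, so that recommendation lists can still be generated efficiently under strict top-K verification; and (2) a relaxed sampling verification strategy that accepts high-probability non-top-K drafted sequences, significantly reducing calls to the target LLM. Experiments show speedups of nearly 2x under strict verification and up to 2.5x under relaxed verification.

Link: https://arxiv.org/abs/2410.05165
Authors: Xinyu Lin,Chaoqun Yang,Wenjie Wang,Yongqi Li,Cunxiao Du,Fuli Feng,See-Kiong Ng,Tat-Seng Chua
Keywords: Large Language Model, Large Language, achieved notable success, excessive inference latency, inference latency caused
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: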

Abstract:Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The codes and datasets will be released in the near future.
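The strict top-K condition described in the abstract, where every top-K sequence of the target LLM must have been drafted, can be sketched as a set check at a single beam search step. The dictionaries of sequence scores, the `n_draft` parameter, and the function names are hypothetical; the relaxed sampling verification is not shown because the abstract does not give its acceptance rule.

```python
def top_k(seq_scores, k):
    """Return the k highest-scoring sequences from a {sequence: score} mapping."""
    return sorted(seq_scores, key=seq_scores.get, reverse=True)[:k]


def strict_topk_accept(draft_scores, target_scores, k, n_draft=None):
    """Accept a drafted step only if all of the target's top-k sequences were drafted.

    n_draft: how many candidates the draft model proposes (defaults to k).
    """
    drafted = set(top_k(draft_scores, n_draft or k))
    return all(seq in drafted for seq in top_k(target_scores, k))


# Sequences are tuples of token ids; scores are hypothetical beam scores.
target = {(1,): 0.5, (2,): 0.3, (3,): 0.2}
draft = {(1,): 0.4, (3,): 0.35, (2,): 0.25}
print(strict_topk_accept(draft, target, k=2))
print(strict_topk_accept(draft, target, k=2, n_draft=3))
```

The example shows why the check is stringent: a single ranking disagreement between the draft model and the target LLM rejects the whole step unless the draft proposes extra candidates, which is the motivation for aligning the two models' top-K outputs.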

[NLP-22] Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models EMNLP2024

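[Quick read]: Generative language models struggle with specialized or less-discussed knowledge, and Retrieval-Augmented Generation (RAG) models, which retrieve relevant information before generating a response, are a potential solution. This paper studies how Atlas, a RAG model, chooses between what it already knows (parametric) and what it retrieves (non-parametric), using causal mediation analysis and controlled experiments. When the model can use either type of information, it relies more on the retrieved context than on its parametric knowledge. The analysis also reveals the computations by which the model uses the context, including deciding whether the context is relevant and how the encoder computes output representations to support copying when it is.

Link: https://arxiv.org/abs/2410.05162
Authors: Mehrdad Farahani,Richard Johansson
Keywords: Generative language models, Generative language, struggle with specialized, specialized or less-discussed, Generative
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024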

Abstract:Generative language models often struggle with specialized or less-discussed knowledge. A potential solution is found in Retrieval-Augmented Generation (RAG) models which act like retrieving information before generating responses. In this study, we explore how the Atlas approach, a RAG model, decides between what it already knows (parametric) and what it retrieves (non-parametric). We use causal mediation analysis and controlled experiments to examine how internal representations influence information processing. Our findings disentangle the effects of parametric knowledge and the retrieved context. They indicate that in cases where the model can choose between both types of information (parametric and non-parametric), it relies more on the context than the parametric knowledge. Furthermore, the analysis investigates the computations involved in how the model uses the information from the context. We find that multiple mechanisms are active within the model and can be detected with mediation analysis: first, the decision of whether the context is relevant, and second, how the encoder computes output representations to support copying when relevant.

[NLP-23] VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

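[Quick read]: This paper addresses the problem of building universal multimodal embedding models, that is, embedding models that handle a wide range of downstream tasks such as classification, visual question answering, multimodal retrieval, and visual grounding. The key is MMEB (Massive Multimodal Embedding Benchmark) and the VLM2Vec (Vision-Language Model - Vector) framework. MMEB covers 4 meta-tasks and 36 datasets for training and evaluation. VLM2Vec is a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model that processes any combination of images and text and produces a fixed-dimensional vector based on task instructions. Experiments show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing models on both in-distribution and out-of-distribution datasets in MMEB.

Link: https://arxiv.org/abs/2410.05160
Authors: Ziyan Jiang,Rui Meng,Xinyi Yang,Semih Yavuz,Yingbo Zhou,Wenhu Chen
Keywords: Embedding models, multimodal embedding models, multimodal embedding, semantic similarity, Embedding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report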

Abstract:Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model - Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on Phi-3.5-V and evaluate them on MMEB’s evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.

[NLP-24] CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

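[Quick read]: This paper addresses the limited accuracy and latency of streaming speech translation (ST) models when large amounts of manually labeled target-language text are unavailable. The key is Connectionist Temporal Classification guided modality matching (CTC-GMM), which enhances the streaming ST model with extensive machine translation (MT) text data. CTC-GMM uses CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, so that source-target language text pairs from MT corpora can be used to refine the streaming ST model further, improving both translation accuracy and decoding speed.

Link: https://arxiv.org/abs/2410.05146
Authors: Rui Zhao,Jinyu Li,Ruchao Fan,Matt Post
Keywords: achieve high accuracy, Connectionist Temporal Classification, target language, achieve high, low latency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Spoken Language Technology Workshop (SLT 2024)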

Abstract:Models for streaming speech translation (ST) can achieve high accuracy and low latency if they’re developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this paper, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched source-target language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively, while also boosting decoding speed by 59.7% on GPU.
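One plausible reading of "CTC compresses the speech sequence into a compact embedding sequence" is the standard CTC collapse applied to frame embeddings: consecutive frames with the same predicted label are merged and blank frames are dropped. The sketch below follows that reading; averaging the merged frames and the function name `ctc_compress` are assumptions, since the abstract does not say how embeddings are combined.

```python
def ctc_compress(frame_labels, frame_embeddings, blank=0):
    """Collapse frame-level embeddings to one embedding per non-blank CTC segment.

    frame_labels:     per-frame argmax CTC labels
    frame_embeddings: per-frame embedding vectors (lists of floats)
    """
    compressed = []
    i = 0
    while i < len(frame_labels):
        # Find the end of the run of identical labels starting at frame i.
        j = i
        while j < len(frame_labels) and frame_labels[j] == frame_labels[i]:
            j += 1
        if frame_labels[i] != blank:
            segment = frame_embeddings[i:j]
            dim = len(segment[0])
            # Assumed merge rule: average the embeddings within the run.
            mean = [sum(vec[d] for vec in segment) / len(segment) for d in range(dim)]
            compressed.append((frame_labels[i], mean))
        i = j
    return compressed


labels = [0, 5, 5, 0, 7]
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 6.0]]
print(ctc_compress(labels, embeddings))
```

After compression the sequence has one vector per emitted token, so its length is comparable to that of a text sequence, which is what would allow text-only MT pairs to stand in for speech input during training.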

[NLP-25] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

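[Quick read]: This paper addresses the fact that existing Preference Optimization (PO) methods weight all tokens equally in the loss, arguing that human preference depends more on specific words or phrases than on every word equally. The key to the solution is a flexible objective called SparsePO, which automatically learns weights for the KL divergence and reward associated with each token. The paper proposes two weight-mask variants, one derived from the reference model and one learned on the fly. Both induce sparsity during training, letting the model assign meaningful weights to tokens according to the target task, generate more responses with the desired preference, and improve reasoning tasks by up to 2 percentage points over other token- and response-level PO methods.

Link: https://arxiv.org/abs/2410.05102
Authors: Fenia Christopoulou,Ronald Cardenas,Gerasimos Lampouras,Haitham Bou-Ammar,Jun Wang
Keywords: Direct Preference Optimization, Preference Optimization objective, Preference Optimization, aligning language models, offline Direct Preference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 9 figures, 5 tables. Under Review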

Abstract:Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
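
To make the idea of token-level weighting concrete, here is a deliberately simplified sketch. It is not the SparsePO objective (which, per the abstract, weights reward and KL divergence contributions separately); it assumes a single per-token mask applied to the policy/reference log-ratio inside a DPO-style loss. All names and the `beta` value are hypothetical.

```python
import math


def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum of per-token log-ratios, each scaled by its mask weight."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def token_weighted_preference_loss(chosen, rejected, beta=0.1):
    """DPO-style loss on masked log-ratios (simplified assumption).

    `chosen` and `rejected` are (policy_logps, ref_logps, mask) triples.
    """
    margin = beta * (masked_log_ratio(*chosen) - masked_log_ratio(*rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

With a mask of all ones this reduces to a standard sequence-level DPO-style loss; a sparse mask lets only a few tokens contribute.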

[NLP-26] Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances CONLL2024

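[Quick read]: This paper evaluates whether large language models (LLMs) can extract well-structured utterances from transcriptions of noisy dialogues, much as humans separate semantically significant content from speech-specific noise. The key to the approach is a set of linguistically motivated experiments that test whether LLMs acquire and apply grammatical rules to compose abstract syntactic-semantic structures. The results show that not all extracted utterances are correctly structured, which indicates that LLMs comprehend noisy utterances less proficiently than humans do.

Link: https://arxiv.org/abs/2410.05099
Authors: Alina Wróblewska
Keywords: disregarding speech-specific elements, speech-specific elements poses, exhibit remarkable cognitive, separate semantically significant, semantically significant content
Categories: Computation and Language (cs.CL)
Comments: Accepted at CoNLL 2024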

Abstract:Selectively processing noisy utterances while effectively disregarding speech-specific elements poses no considerable challenge for humans, as they exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise (i.e. filled pauses, disfluencies, and restarts). These abilities may be driven by mechanisms based on acquired grammatical rules that compose abstract syntactic-semantic structures within utterances. Segments without syntactic and semantic significance are consistently disregarded in these structures. The structures, in tandem with lexis, likely underpin language comprehension and thus facilitate effective communication. In our study, grounded in linguistically motivated experiments, we investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks. In particular, we examine the ability of LLMs to extract well-structured utterances from transcriptions of noisy dialogues. We conduct two evaluation experiments in the Polish language scenario, using a dataset presumably unfamiliar to LLMs to mitigate the risk of data contamination. Our results show that not all extracted utterances are correctly structured, indicating that either LLMs do not fully acquire syntactic-semantic rules or they acquire them but cannot apply them effectively. We conclude that the ability of LLMs to comprehend noisy utterances is still relatively superficial compared to human proficiency in processing them.

[NLP-27] Explanation sensitivity to the randomness of large language models: the case of journalistic text classification

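[Quick read]: This paper addresses the explainability challenges of large language models (LLMs) that otherwise perform well on natural language processing tasks. By analyzing how random elements in training affect the explainability of predictions, the study finds that models trained with different random seeds reach similar accuracy but produce variable explanations. The key point is that the statistical distribution of explanations needs to be characterized for LLMs to be explainable. The study also explores a simpler model based on textual features, which offers stable explanations but is less accurate. Inserting features derived from CamemBERT's explanations improves it, yielding a different tradeoff between accuracy and explainability.

Link: https://arxiv.org/abs/2410.05085
Authors: Jeremie Bogaert,Marie-Catherine de Marneffe,Antonin Descampe,Louis Escouflaire,Cedrick Fairon,Francois-Xavier Standaert
Keywords: Large language models, natural language processing, Large language, language processing tasks, raise explainability challenges
Categories: Computation and Language (cs.CL)
Comments: This paper is a faithful translation of a paper which was peer-reviewed and published in the French journal Traitement Automatique des Langues, n. 64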

Abstract:Large language models (LLMs) perform very well in several natural language processing tasks but raise explainability challenges. In this paper, we examine the effect of random elements in the training of LLMs on the explainability of their predictions. We do so on a task of opinionated journalistic text classification in French. Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations. We therefore claim that characterizing the explanations’ statistical distribution is needed for the explainability of LLMs. We then explore a simpler model based on textual features which offers stable explanations but is less accurate. Hence, this simpler model corresponds to a different tradeoff between accuracy and explainability. We show that it can be improved by inserting features derived from CamemBERT’s explanations. We finally discuss new research directions suggested by our results, in particular regarding the origin of the sensitivity observed in the training randomness.

[NLP-28] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

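[Quick read]: This paper addresses how to rigorously assess LLM-based language agents for automating scientific discovery. The key to the solution is ScienceAgentBench, a new benchmark for evaluating language agents on individual tasks in data-driven scientific discovery. The benchmark extracts 102 tasks from 44 peer-reviewed publications and has nine subject matter experts validate them to ensure scientific authenticity and real-world relevance. The target output of every task is unified into a self-contained Python program file, and an array of evaluation metrics examines the generated programs, execution results, and costs. Multiple rounds of manual validation by annotators and experts ensure annotation quality and scientific plausibility, and the paper proposes two strategies to mitigate data contamination. Experiments show that the best-performing agent solves only 32.4% of the tasks independently and 34.3% with expert-provided knowledge, underscoring the limits of current language agents in generating code for data-driven discovery.

Link: https://arxiv.org/abs/2410.05080
Authors: Ziru Chen,Shijie Chen,Yuting Ning,Qianheng Zhang,Boshi Wang,Botao Yu,Yifei Li,Zeyi Liao,Chen Wei,Zitong Lu,Vishal Dey,Mingyi Xue,Frazier N. Baker,Benjamin Burns,Daniel Adu-Ampratwum,Xuhui Huang,Xia Ning,Song Gao,Yu Su,Huan Sun
Keywords: piqued growing interest, automate scientific discovery, developing LLM-based language, scientific discovery, piqued growing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 55 pages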

Abstract:The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

[NLP-29] ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering EMNLP2024

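[Quick read]: This paper addresses the opacity of the reasoning process of current large language models (LLMs) in commonsense question answering and proposes ZEBRA, a zero-shot question answering framework. The key to the solution is that ZEBRA combines retrieval, case-based reasoning, and introspection without any additional training of the LLM: it retrieves relevant question-knowledge pairs from a knowledge base and generates new knowledge by reasoning over those pairs, which improves both the model's performance and the interpretability of its outputs.

Link: https://arxiv.org/abs/2410.05077
Authors: Francesco Maria Molfese,Simone Conia,Riccardo Orlando,Roberto Navigli
Keywords: Current Large Language, Large Language Models, Current Large, Large Language, remains largely opaque
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Main Conference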

Abstract:Current Large Language Models (LLMs) have shown strong reasoning capabilities in commonsense question answering benchmarks, but the process underlying their success remains largely opaque. As a consequence, recent approaches have equipped LLMs with mechanisms for knowledge retrieval, reasoning and introspection, not only to improve their capabilities but also to enhance the interpretability of their outputs. However, these methods require additional training, hand-crafted templates or human-written explanations. To address these issues, we introduce ZEBRA, a zero-shot question answering framework that combines retrieval, case-based reasoning and introspection and dispenses with the need for additional training of the LLM. Given an input question, ZEBRA retrieves relevant question-knowledge pairs from a knowledge base and generates new knowledge by reasoning over the relationships in these pairs. This generated knowledge is then used to answer the input question, improving the model’s performance and interpretability. We evaluate our approach across 8 well-established commonsense reasoning benchmarks, demonstrating that ZEBRA consistently outperforms strong LLMs and previous knowledge integration approaches, achieving an average accuracy improvement of up to 4.5 points.
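
The retrieve-then-generate flow described above can be sketched as follows. This is only an illustration: it swaps in a token-overlap (Jaccard) retriever where a real system would more likely use a learned retriever, and the prompt layout and function names are hypothetical. The LLM call that would complete the final `Knowledge:` line is left out.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (stand-in retriever score)."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    union = a_tokens | b_tokens
    return len(a_tokens & b_tokens) / len(union) if union else 0.0


def retrieve_examples(question, knowledge_base, k=2):
    """Return the k (question, knowledge) pairs most similar to the input question."""
    ranked = sorted(knowledge_base, key=lambda pair: jaccard(question, pair[0]), reverse=True)
    return ranked[:k]


def build_knowledge_prompt(question, examples):
    """Format retrieved pairs as in-context examples, ending with the new question."""
    blocks = [f"Question: {q}\nKnowledge: {knowledge}" for q, knowledge in examples]
    blocks.append(f"Question: {question}\nKnowledge:")
    return "\n\n".join(blocks)
```

The knowledge generated for the final slot would then be passed, together with the question, to the answering step.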

[NLP-30] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

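[Quick read]: This paper addresses the memory bottleneck that arises when large language models (LLMs) process long-context inputs, where the key-value (KV) cache required by Transformer architectures grows sharply, particularly during decoding. The key to the solution is TidalDecode, an algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode exploits the spatial coherence of tokens selected by existing sparse attention methods: a few token selection layers perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention over the pre-selected tokens. This design substantially reduces the overhead of token selection without sacrificing generation quality, cutting LLM decoding latency by up to 2.1x.

Link: https://arxiv.org/abs/2410.05076
Authors: Lijie Yang,Zhihao Zhang,Zhuofu Chen,Zikun Li,Zhihao Jia
Keywords: Large language models, long-context models gaining, models gaining prominence, handling extended inputs, driven significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)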

Abstract:Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
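
The layer-wise pattern described above (a few selection layers score all cached tokens, other layers reuse the selected positions) can be sketched for a single decoding step. This is a toy, single-head, pure-Python illustration under stated assumptions, not the TidalDecode system: the `decode_step` interface, and the choice to run full attention in layers before the first selection layer, are assumptions made here.

```python
import math


def attention_scores(query, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Attention output restricted to the given cache positions."""
    weights = softmax(attention_scores(query, [keys[i] for i in positions]))
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions)) for d in range(dim)]


def select_top_k(query, keys, k):
    """Positions of the k highest-scoring cached tokens, in ascending order."""
    scores = attention_scores(query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def decode_step(layers, selection_layers, k):
    """layers: list of (query, keys, values), one per Transformer layer."""
    selected = None
    outputs = []
    for index, (query, keys, values) in enumerate(layers):
        everything = list(range(len(keys)))
        if index in selection_layers:
            # Selection layer: full attention, then pick positions for later layers.
            outputs.append(attend(query, keys, values, everything))
            selected = select_top_k(query, keys, k)
        elif selected is None:
            outputs.append(attend(query, keys, values, everything))
        else:
            # Sparse layer: reuse the persisted positions, no re-selection.
            outputs.append(attend(query, keys, values, selected))
    return outputs, selected
```

The saving comes from the sparse layers, which touch only `k` cache entries and skip the scoring pass entirely.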

[NLP-31] Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes EMNLP2024

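[Quick read]: This paper addresses loss spikes in the pre-training of large language models, where the loss value suddenly diverges. It hypothesizes that the non-uniformity of parameter norms is one of the causes. The key to the solution is a new technique called weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and sets it to the value that keeps the gradient scale constant across layers, which allows the norms of the original parameters to be uniform and stabilizes training. Experiments on Transformer decoders with 130 million, 1.3 billion, and 13 billion parameters show that WeSaR stabilizes and accelerates training and outperforms the compared methods, including popular initialization methods.

Link: https://arxiv.org/abs/2410.05052
Authors: Kosuke Nishida,Kyosuke Nishida,Kuniko Saito
Keywords: large language models, Loss spikes, diverges suddenly, pre-training of large, large language
Categories: Computation and Language (cs.CL)
Comments: EMNLP2024 accepted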

Abstract:Loss spikes, a phenomenon in which the loss value diverges suddenly, is a fundamental issue in the pre-training of large language models. This paper supposes that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. Here, in training of neural networks, the scale of the gradients is required to be kept constant throughout the layers to avoid the vanishing and exploding gradients problem. However, to meet these requirements in the Transformer model, the norm of the model parameters must be non-uniform, and thus, parameters whose norm is smaller are more sensitive to the parameter update. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirements. Because of the gate parameter, WeSaR sets the norm of the original parameters uniformly, which results in stable training. Experimental results with the Transformer decoders consisting of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperformed compared methods including popular initialization methods.
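
A minimal sketch of the reparameterization idea, assuming the effective weight is the product of a scalar gate and a base matrix: every base matrix is drawn with the same standard deviation, and the gate carries the layer-specific scale. The values of `common_std` and `target_std`, and the function names, are hypothetical; in training the gate would be a learnable parameter.

```python
import random


def init_reparameterized(rows, cols, target_std, common_std=0.02, seed=0):
    """Return (gate, base) so that gate * base has roughly target_std.

    Illustrative assumption: base matrices share common_std, so their norms
    are uniform across layers, while the gate absorbs the per-matrix scale.
    """
    rng = random.Random(seed)
    base = [[rng.gauss(0.0, common_std) for _ in range(cols)] for _ in range(rows)]
    gate = target_std / common_std
    return gate, base


def effective_weight(gate, base):
    """Weight actually used in the forward pass."""
    return [[gate * w for w in row] for row in base]
```

Two matrices that need different scales then differ only in their gate, not in the norm of the stored parameters.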

[NLP-32] A test suite of prompt injection attacks for LLM-based machine translation

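[Quick read]: This paper addresses Prompt Injection Attacks (PIAs) against LLM-based NLP systems, in which a malicious user crafts inputs that interfere with the prompt template and cause the LLM to respond in ways the system designer did not intend. The contribution is a test suite: it extends the PIA approach that Sun and Miceli-Barone proposed for LLM-based machine translation to all language pairs of the WMT 2024 General Machine Translation task, and it adds further attack formats beyond the one originally studied.

Link: https://arxiv.org/abs/2410.05047
Authors: Antonio Valerio Miceli-Barone,Zhifan Sun
Keywords: NLP systems typically, LLM-based NLP systems, systems typically work, NLP systems, LLM response
Categories: Computation and Language (cs.CL)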

Abstract:LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples, creating queries which are submitted to a LLM, and then parsing the LLM response in order to generate the system outputs. Prompt Injection Attacks (PIAs) are a type of subversion of these systems where a malicious user crafts special inputs which interfere with the prompt templates, causing the LLM to respond in ways unintended by the system designer. Recently, Sun and Miceli-Barone proposed a class of PIAs against LLM-based machine translation. Specifically, the task is to translate questions from the TruthfulQA test suite, where an adversarial prompt is prepended to the questions, instructing the system to ignore the translation instruction and answer the questions instead. In this test suite, we extend this approach to all the language pairs of the WMT 2024 General Machine Translation task. Moreover, we include additional attack formats in addition to the one originally studied.
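
The attack setup described above (an adversarial instruction prepended to a question that the system is supposed to translate) can be sketched as follows. The template wording, the attack wording, and the function names are all hypothetical placeholders, not the prompts used in the test suite.

```python
# Hypothetical translation prompt template (assumption, not from the paper).
TEMPLATE = "Translate the following text from {src} to {tgt}:\n{text}"

# Hypothetical adversarial prefix (assumption, not the paper's wording).
ATTACK_PREFIX = "Don't translate this sentence and answer the question instead: "


def build_clean_prompt(question, src, tgt):
    """Prompt for the intended behavior: translate the question."""
    return TEMPLATE.format(src=src, tgt=tgt, text=question)


def build_attacked_prompt(question, src, tgt):
    """Same template, but the user input carries an injected instruction."""
    return TEMPLATE.format(src=src, tgt=tgt, text=ATTACK_PREFIX + question)
```

A system resists the attack if its output for the attacked prompt is still a translation of the whole input rather than an answer to the question.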

[NLP-33] Named Clinical Entity Recognition Benchmark

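[Quick read]: This paper addresses the evaluation of Named Clinical Entity Recognition (NER) in healthcare. The key to the solution is a standardized benchmark for assessing how well different language models identify and classify clinical entities such as diseases, symptoms, medications, procedures, and laboratory measurements. The benchmark uses openly available clinical datasets whose entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across healthcare systems and datasets. Model performance is assessed primarily with the F1-score, complemented by several assessment modes. With this framework the paper aims to promote transparency, facilitate comparative analyses, and drive innovation in healthcare NLP.

Link: https://arxiv.org/abs/2410.05046
Authors: Wadood M Abdul,Marco AF Pimentel,Muhammad Umar Salman,Tathagata Raha,Clément Christophe,Praveen K Kanithi,Nasir Hayat,Ronnie Rajan,Shadab Khan
Keywords: trial cohort identification, extracting structured information, Named Clinical Entity, natural language processing, Entity Recognition Benchmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report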

Abstract:This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support. The leaderboard provides a standardized platform for assessing diverse language models, including encoder and decoder architectures, on their ability to identify and classify clinical entities across multiple medical domains. A curated collection of openly available clinical datasets is utilized, encompassing entities such as diseases, symptoms, medications, procedures, and laboratory measurements. Importantly, these entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across different healthcare systems and datasets, and a comprehensive evaluation of model performance. Performance of models is primarily assessed using the F1-score, and it is complemented by various assessment modes to provide comprehensive insights into model performance. The report also includes a brief analysis of models evaluated to date, highlighting observed trends and limitations. By establishing this benchmarking framework, the leaderboard aims to promote transparency, facilitate comparative analyses, and drive innovation in clinical entity recognition tasks, addressing the need for robust evaluation methods in healthcare NLP.
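
Since the benchmark scores models primarily by F1, a small sketch of entity-level F1 may help. It assumes exact-match scoring on `(start, end, type)` spans, which is only one possible assessment mode; the abstract does not specify which modes the leaderboard uses, and the function name is hypothetical.

```python
def span_f1(gold, predicted):
    """Exact-match F1 over (start, end, entity_type) spans (assumed mode)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with two gold entities and two predictions of which one matches exactly, precision and recall are both 0.5, so F1 is 0.5.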

[NLP-34] Can LLMs plan paths with extra hints from solvers?

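[Quick read]: This paper addresses the limited and fragile performance of large language models (LLMs) in long-term planning and higher-order reasoning. The key to the solution is to integrate solver-generated feedback to improve LLM performance on a classical robotic planning task. The study explores four feedback strategies (including visual feedback), uses fine-tuning, and evaluates three LLMs on 10 standard and 100 randomly generated planning problems. The results suggest that solver-generated feedback helps LLMs solve moderately difficult problems, while harder problems remain out of reach.

Link: https://arxiv.org/abs/2410.05045
Authors: Erik Wu,Sayan Mitra
Keywords: Large Language Models, natural language processing, Large Language, Language Models, shown remarkable capabilities
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)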

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language processing, mathematical problem solving, and tasks related to program synthesis. However, their effectiveness in long-term planning and higher-order reasoning has been noted to be limited and fragile. This paper explores an approach for enhancing LLM performance in solving a classical robotic planning task by integrating solver-generated feedback. We explore four different strategies for providing feedback, including visual feedback, we utilize fine-tuning, and we evaluate the performance of three different LLMs across 10 standard and 100 more randomly generated planning problems. Our results suggest that the solver-generated feedback improves the LLM’s ability to solve the moderately difficult problems, but the harder problems still remain out of reach. The study provides detailed analysis of the effects of the different hinting strategies and the different planning tendencies of the evaluated LLMs.

[NLP-35] DEPT: Decoupled Embeddings for Pre-training Language Models

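[Quick read]: This paper addresses negative interference when training on mixtures of multilingual and multi-domain data, known as the "curse of multilinguality". The key to the solution is DEPT, a new pre-training framework that decouples the embedding layers from the transformer body and trains the body in multiple contexts simultaneously, so the model can train without a shared global vocabulary. According to the abstract, DEPT trains robustly and effectively under significant data heterogeneity, reduces token-embedding parameters by up to 80% and communication costs by 675x for billion-scale models, improves generalization and plasticity when adapting to new languages and domains, and allows a custom optimized vocabulary per data source.

Link: https://arxiv.org/abs/2410.05021
Authors: Alex Iacob,Lorenzo Sani,Meghdad Kurmanji,William F. Shen,Xinchi Qiu,Dongqi Cai,Yan Gao,Nicholas D. Lane
Keywords: broader data mixture, Model pre-training benefits, DEPT, broader data, data mixture
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)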

Abstract:Language Model pre-training benefits from a broader data mixture to enhance performance across domains and languages. However, training on such heterogeneous text corpora is complex, requiring extensive and cost-intensive efforts. Since these data sources vary in lexical, syntactic, and semantic aspects, they cause negative interference or the “curse of multilinguality”. We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples the embedding layers from the transformer body while simultaneously training the latter in multiple contexts. DEPT enables the model to train without being bound to a shared global vocabulary. DEPT: (1) can train robustly and effectively under significant data heterogeneity, (2) reduces the parameter count of the token embeddings by up to 80% and the communication costs by 675x for billion-scale models, (3) enhances model generalization and plasticity in adapting to new languages and domains, and (4) allows training with custom optimized vocabulary per data source. We prove DEPT’s potential by performing the first vocabulary-agnostic federated multilingual pre-training of a 1.3 billion-parameter model across high and low-resource languages, reducing its parameter count by 409 million.
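
One way to picture the decoupling is an aggregation step that shares only the transformer body across data sources while each source keeps its own embeddings. The sketch below is an assumption made for illustration (plain averaging, scalar parameters, a hypothetical `aggregate_body` function); the abstract does not describe DEPT's actual aggregation rule.

```python
def aggregate_body(source_models):
    """Average body parameters across sources; embeddings stay local.

    Each model is {"embedding": ..., "body": {name: float}}. Scalars stand in
    for tensors to keep the sketch dependency-free (assumption).
    """
    count = len(source_models)
    body_keys = source_models[0]["body"].keys()
    averaged = {
        key: sum(model["body"][key] for model in source_models) / count
        for key in body_keys
    }
    # Only the averaged body is shared; each source keeps its own vocabulary.
    return [
        {"embedding": model["embedding"], "body": dict(averaged)}
        for model in source_models
    ]
```

Because embeddings never leave their source, their size does not count toward what is communicated, and vocabularies of different sizes can coexist.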

[NLP-36] On the Biased Assessment of Expert Finding Systems RECSYS2024

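[Quick read]: This paper addresses biased evaluation of enterprise expert retrieval systems, which arises because comprehensive ground-truth expert annotations are hard to obtain and annotation therefore relies on automated recommendations of knowledge areas to validate. The study shows on a popular benchmark that such system-validated annotations overestimate the performance of traditional term-based retrieval models and can invalidate comparisons with more recent neural methods. It also augments knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. The key to the solution is a set of constraints on the annotation process that prevent these biased evaluations while still allowing useful annotation suggestions.

Link: https://arxiv.org/abs/2410.05018
Authors: Jens-Joris Decorte,Jeroen Van Hautte,Chris Develder,Thomas Demeester
Keywords: internal knowledge spread, large organisations, teams and departments, topic is crucial, crucial in leveraging
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to the 4th Workshop on Recommender Systems for Human Resources (RecSys in HR 2024) as part of RecSys 2024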

Abstract:In large organisations, identifying experts on a given topic is crucial in leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees’ expertise based on the vast amount of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study provides an analysis of how these recommendations can impact the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints to the annotation process to prevent these biased evaluations, and show that this still allows annotation suggestions of high utility. These findings should inform benchmark creation or selection for expert finding, to guarantee meaningful comparison of methods.

[NLP-37] SkillMatch: Evaluating Self-supervised Learning of Skill Relatedness ECML-PKDD2024

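[Quick read]: This paper addresses the lack of benchmarks for directly evaluating models of skill relatedness, which matter for human resources processes such as recruitment and employee development. The key to the solution is SkillMatch, a publicly released benchmark built from expert knowledge mined from millions of job ads. The paper also proposes a scalable self-supervised learning technique that adapts a Sentence-BERT model using skill co-occurrence in job ads, which greatly surpasses traditional skill relatedness models on SkillMatch.

Link: https://arxiv.org/abs/2410.05006
Authors: Jens-Joris Decorte,Jeroen Van Hautte,Thomas Demeester,Chris Develder
Keywords: human resources processes, Accurately modeling, employee development, modeling the relationships, crucial part
Categories: Computation and Language (cs.CL)
Comments: Accepted to the International workshop on AI for Human Resources and Public Employment Services (AI4HRPES) as part of ECML-PKDD 2024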

Abstract:Accurately modeling the relationships between skills is a crucial part of human resources processes such as recruitment and employee development. Yet, no benchmarks exist to evaluate such methods directly. We construct and release SkillMatch, a benchmark for the task of skill relatedness, based on expert knowledge mining from millions of job ads. Additionally, we propose a scalable self-supervised learning technique to adapt a Sentence-BERT model based on skill co-occurrence in job ads. This new method greatly surpasses traditional models for skill relatedness as measured on SkillMatch. By releasing SkillMatch publicly, we aim to contribute a foundation for research towards increased accuracy and transparency of skill-based recommendation systems.

[NLP-38] On the Rigour of Scientific Writing: Criteria Analysis and Insights EMNLP2024

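【Quick read】: This paper addresses the automatic identification and definition of rigour criteria in scientific research, and whether those criteria effectively signal rigour in real scientific papers. The key is a bottom-up, data-driven framework comprising rigour keyword extraction, detailed rigour definition generation, and salient criteria identification. The framework is domain-agnostic and can be tailored to the salient criteria of different fields. Experiments on datasets collected from two high-impact Machine Learning and NLP venues (ICLR and ACL) demonstrate its effectiveness in modelling rigour.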

Link: https://arxiv.org/abs/2410.04981
Authors: Joseph James, Chenghao Xiao, Yucheng Li, Chenghua Lin
Keywords: results and findings, ensures the reproducibility, reproducibility and validity, validity of results, Rigour
Categories: Computation and Language (cs.CL)
Comments: Accepted Findings at EMNLP 2024

Abstract:Rigour is crucial for scientific research as it ensures the reproducibility and validity of results and findings. Despite its importance, little work exists on modelling rigour computationally, and there is a lack of analysis on whether these criteria can effectively signal or measure the rigour of scientific papers in practice. In this paper, we introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria and assess their relevance in scientific writing. Our framework includes rigour keyword extraction, detailed rigour definition generation, and salient criteria identification. Furthermore, our framework is domain-agnostic and can be tailored to the evaluation of scientific rigour for different areas, accommodating the distinct salient criteria across fields. We conducted comprehensive experiments based on datasets collected from two high impact venues for Machine Learning and NLP (i.e., ICLR and ACL) to demonstrate the effectiveness of our framework in modelling rigour. In addition, we analyse linguistic patterns of rigour, revealing that framing certainty is crucial for enhancing the perception of scientific rigour, while suggestion certainty and probability uncertainty diminish it.

[NLP-39] Activation Scaling for Steering and Interpreting Language Models EMNLP2024

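【Quick read】: This paper asks whether a language model can be steered from an incorrect prediction (e.g., “France”) to the correct one (e.g., “Italy”) by multiplying only a few relevant activation vectors with scalars. The key is a three-term objective: the intervention should flip the correct and wrong tokens (effectiveness), leave other tokens unaffected (faithfulness), and be sparse (minimality). Using gradient-based optimization, the paper learns a specific kind of intervention, activation scaling, which modifies only the signed magnitude of activation vectors to strengthen, weaken, or reverse steering directions already encoded in the model, yielding an efficient and interpretable intervention.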

Link: https://arxiv.org/abs/2410.04962
Authors: Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein
Keywords: steer a language, relevant activation vectors, France, Italy, Rome
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2024

Abstract:Given the prompt “Rome is in”, can we steer a language model to flip its prediction of an incorrect token “France” to a correct token “Italy” by only multiplying a few relevant activation vectors with scalars? We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. Concretely, we establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), and leave other tokens unaffected (faithfulness), all while being sparse (minimality). Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention: activation scaling only modifies the signed magnitude of activation vectors to strengthen, weaken, or reverse the steering directions already encoded in the model. On synthetic tasks, this intervention performs comparably with steering vectors in terms of effectiveness and faithfulness, but is much more minimal allowing us to pinpoint interpretable model components. We evaluate activation scaling from different angles, compare performance on different datasets, and make activation scalars a learnable function of the activation vectors themselves to generalize to varying-length prompts.
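
To make the three-term objective concrete, here is one way it might be written down. The abstract names the three terms but not their exact form, so everything specific here is an assumption: the logit difference for effectiveness, a KL divergence over the remaining tokens for faithfulness, an L1 penalty on the distance of each scalar from 1 for minimality, and the weights `lam_faith` and `lam_min`.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def three_term_objective(base_logits, new_logits, scalars, correct, wrong,
                         lam_faith=1.0, lam_min=0.1):
    """Lower is better. All three term definitions are illustrative assumptions."""
    # Effectiveness: the correct token's logit should exceed the wrong one's.
    effectiveness = new_logits[wrong] - new_logits[correct]
    # Faithfulness: KL divergence over all other tokens, renormalised.
    others = [i for i in range(len(base_logits)) if i not in (correct, wrong)]
    p = softmax([base_logits[i] for i in others])
    q = softmax([new_logits[i] for i in others])
    faithfulness = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    # Minimality: a scalar of 1 leaves an activation vector unchanged.
    minimality = sum(abs(s - 1.0) for s in scalars)
    return effectiveness + lam_faith * faithfulness + lam_min * minimality


# Toy vocabulary of four tokens: index 0 is the wrong token, index 1 the correct one.
base = [2.0, 1.0, 0.5, 0.1]
no_intervention = three_term_objective(base, base, [1.0, 1.0, 1.0], correct=1, wrong=0)
flipped = three_term_objective(base, [1.0, 2.0, 0.5, 0.1], [1.0, 1.0, -1.0],
                               correct=1, wrong=0)
```

In this toy case the flipped intervention scores lower (better) than doing nothing, even after paying the minimality cost for reversing one activation vector.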

[NLP-40] Intent Classification for Bank Chatbots through LLM Fine-Tuning

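【Quick read】: This paper evaluates large language models (LLMs) for intent classification in a chatbot with predetermined responses designed for banking industry websites. The core of the study is a comparison between fine-tuning SlovakBERT and using multilingual generative models (such as Llama 8b instruct and Gemma 7b instruct) in both their pre-trained and fine-tuned versions. SlovakBERT outperforms the other models in in-scope accuracy and out-of-scope false positive rate, establishing it as the benchmark model for this application.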

Link: https://arxiv.org/abs/2410.04925
Authors: Bibiána Lajčinová, Patrik Valábek, Michal Spišiak
Keywords: banking industry websites, predetermined responses designed, large language models, industry websites, study evaluates
Categories: Computation and Language (cs.CL)
Comments: 7 pages, no figures

Abstract:This study evaluates the application of large language models (LLMs) for intent classification within a chatbot with predetermined responses designed for banking industry websites. Specifically, the research examines the effectiveness of fine-tuning SlovakBERT compared to employing multilingual generative models, such as Llama 8b instruct and Gemma 7b instruct, in both their pre-trained and fine-tuned versions. The findings indicate that SlovakBERT outperforms the other models in terms of in-scope accuracy and out-of-scope false positive rate, establishing it as the benchmark for this application.

[NLP-41] Leveraging Grammar Induction for Language Understanding and Generation EMNLP2024

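【Quick read】: This paper addresses how induced grammar can be applied to downstream tasks to improve performance. The key is an unsupervised grammar induction method: a grammar parser induces constituency structures and dependency relations, and the induced grammar features are incorporated into the Transformer as a syntactic mask that guides self-attention. The method performs strongly on multiple machine translation and natural language understanding tasks, highlighting the contribution of explicitly modelling the grammatical structure of text to neural network models.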

Link: https://arxiv.org/abs/2410.04878
Authors: Jushi Kai, Shengyuan Hou, Yusheng Huang, Zhouhan Lin
Keywords: made significant progress, recent years, made significant, significant progress, progress in recent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings

Abstract:Grammar induction has made significant progress in recent years. However, it is not clear how the application of induced grammar could enhance practical performance in downstream tasks. In this work, we introduce an unsupervised grammar induction method for language understanding and generation. We construct a grammar parser to induce constituency structures and dependency relations, which is simultaneously trained on downstream tasks without additional syntax annotations. The induced grammar features are subsequently incorporated into Transformer as a syntactic mask to guide self-attention. We evaluate and apply our method to multiple machine translation tasks and natural language understanding tasks. Our method demonstrates superior performance compared to the original Transformer and other models enhanced with external parsers. Experimental results indicate that our method is effective in both from-scratch and pre-trained scenarios. Additionally, our research highlights the contribution of explicitly modeling the grammatical structure of texts to neural network models.
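
The idea of a syntactic mask guiding self-attention can be illustrated with a masked row-wise softmax. This is a sketch under assumptions: the mask here is a hard 0/1 matrix supplied by hand, whereas the paper's induced mask may well be soft or computed differently, and `masked_attention_weights` is a hypothetical name.

```python
import math


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores, restricted to positions where mask is 1.

    Assumes every row of the mask allows at least one position.
    """
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy 3-token example: the mask says which tokens each token may attend to.
scores = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
mask = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
weights = masked_attention_weights(scores, mask)
```

Masked positions receive exactly zero weight, so attention mass is redistributed only among the positions the mask allows.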

[NLP-42] Rationale-Aware Answer Verification by Pairwise Self-Evaluation EMNLP2024

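【Quick read】: This paper addresses a weakness of existing verifier models: they struggle to tell sound rationales from flawed ones, particularly when a correct answer rests on a flawed rationale. The key is REPS (Rationale Enhancement through Pairwise Selection), which selects valid rationales from candidate solutions by iteratively applying pairwise self-evaluation, improving verifier reliability. Verifiers trained this way attend to the validity of the rationale as well as the correctness of the final answer, which matters for models that assist humans with complex reasoning tasks.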

Link: https://arxiv.org/abs/2410.04838
Authors: Akira Kawabata, Saku Sugawara
Keywords: Answer verification identifies, large language models, verification identifies correct, verification identifies, generated by large
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024

Abstract:Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach neglects any flawed rationale in the solution yielding the correct answer, undermining the verifier’s ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, thus leading to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish valid and flawed rationale. To make a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.
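
Iterative pairwise selection can be sketched as a simple tournament. This is not the authors' procedure: the judge `prefer` stands in for the LLM's pairwise self-evaluation, and the majority vote over `votes_per_pair` calls and the toy quality table are assumptions made for illustration.

```python
def select_rationale(candidates, prefer, votes_per_pair=3):
    """Keep the winner of repeated pairwise comparisons.

    `prefer(a, b)` should return True when rationale `a` is judged more valid
    than rationale `b`.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        # Query the judge several times and take a majority vote, since an
        # LLM judge may be noisy.
        wins = sum(1 for _ in range(votes_per_pair) if prefer(challenger, best))
        if wins * 2 > votes_per_pair:
            best = challenger
    return best


# Hypothetical stand-in judge based on a fixed quality table.
quality = {"guess": 0, "partial chain": 1, "full chain": 2}


def toy_judge(a, b):
    return quality[a] > quality[b]


chosen = select_rationale(["partial chain", "guess", "full chain"], toy_judge)
```

The selected solution would then be used as a training example for the verifier, in place of labelling by final-answer match alone.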

[NLP-43] As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

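【Quick read】: This paper addresses the hyper-parameter sensitivity and instability of Direct Preference Optimization (DPO) and its variants, particularly on mathematical datasets. The key is a new Bidirectional Negative Feedback (BNF) loss that establishes a stable negative feedback during optimization, in place of the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss. The BNF loss needs no extra tunable hyper-parameters or pairwise preference data, which simplifies the alignment pipeline. It maintains value alignment while showing a significantly smaller performance drop on reasoning tasks, striking a better balance between value alignment and reasoning ability.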

Link: https://arxiv.org/abs/2410.04834
Authors: Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, Anh Tuan Luu
Keywords: Proximal Policy Optimization, Reinforcement Learning, Learning from Human, Proximal Policy, computationally efficient alternative
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 9 figures

Abstract:Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online sampling. Despite these benefits, DPO and its variants remain sensitive to hyper-parameters and prone to instability, particularly on mathematical datasets. We argue that these issues arise from the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss function. To address this, we propose a novel LLM alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization. Our proposed BNF loss eliminates the need for pairwise contrastive losses and does not require any extra tunable hyper-parameters or pairwise preference data, streamlining the alignment pipeline to be as simple as supervised fine-tuning. We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks. The experimental results show that BNF achieves comparable performance to the best methods on QA benchmarks, while its performance decrease on the four reasoning benchmarks is significantly lower compared to the best methods, thus striking a better balance between value alignment and reasoning ability. In addition, we further validate the performance of BNF on non-pairwise datasets, and conduct in-depth analysis of log-likelihood and logit shifts across different preference optimization methods.

[NLP-44] MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models

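【Quick read】: This paper addresses the limited explainability of multimodal large language models (MLLMs) in scenarios that require decision transparency. The key is MINER, a framework that mines modality-specific neurons (MSNs) in four stages: modality separation, importance score calculation, importance score aggregation, and modality-specific neuron selection. Experiments show that deactivating only 2% of MSNs significantly reduces model performance, that different modalities mainly converge in the lower layers, that MSNs influence how key information from various modalities converges to the last token, and that two semantic phenomena merit further study.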

Link: https://arxiv.org/abs/2410.04819
Authors: Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, Xuming Hu
Keywords: multimodal large language, large language models, recent years, multimodal large, diverse applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, multimodal large language models (MLLMs) have significantly advanced, integrating more modalities into diverse applications. However, the lack of explainability remains a major barrier to their use in scenarios requiring decision transparency. Current neuron-level explanation paradigms mainly focus on knowledge localization or language- and domain-specific analyses, leaving the exploration of multimodality largely unaddressed. To tackle these challenges, we propose MINER, a transferable framework for mining modality-specific neurons (MSNs) in MLLMs, which comprises four stages: (1) modality separation, (2) importance score calculation, (3) importance score aggregation, (4) modality-specific neuron selection. Extensive experiments across six benchmarks and two representative MLLMs show that (I) deactivating ONLY 2% of MSNs significantly reduces MLLMs performance (0.56 to 0.24 for Qwen2-VL, 0.69 to 0.31 for Qwen2-Audio), (II) different modalities mainly converge in the lower layers, (III) MSNs influence how key information from various modalities converges to the last token, (IV) two intriguing phenomena worth further investigation, i.e., semantic probing and semantic telomeres. The source code is available at this URL.
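
The four stages listed above can be sketched as a small pipeline. The details are assumptions, not the paper's definitions: `token_scores[t][n]` is taken to be a precomputed importance of neuron `n` for token `t`, aggregation is a plain mean per modality, and selection keeps the top `fraction` of all (modality, neuron) pairs.

```python
def select_modality_neurons(token_scores, token_modalities, fraction=0.02):
    """Return the highest-scoring (modality, neuron_index) pairs."""
    # Stage 1, modality separation: group token indices by modality tag.
    groups = {}
    for idx, modality in enumerate(token_modalities):
        groups.setdefault(modality, []).append(idx)
    num_neurons = len(token_scores[0])
    # Stages 2-3, importance scoring and aggregation: the per-token scores are
    # given as input here and averaged within each modality.
    aggregated = {
        m: [sum(token_scores[t][n] for t in idxs) / len(idxs)
            for n in range(num_neurons)]
        for m, idxs in groups.items()
    }
    # Stage 4, selection: rank all (modality, neuron) pairs and keep the top share.
    ranked = sorted(
        ((score, m, n) for m, scores in aggregated.items()
         for n, score in enumerate(scores)),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return [(m, n) for _, m, n in ranked[:keep]]


# Toy input: three tokens, two neurons; the scores are made up.
token_scores = [[4.0, 1.0], [2.0, 1.0], [0.5, 5.0]]
token_modalities = ["image", "image", "text"]
selected = select_modality_neurons(token_scores, token_modalities, fraction=0.5)
```

A deactivation experiment like the one reported in the abstract would then zero out the selected neurons and compare benchmark scores before and after.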

[NLP-45] LPZero: Language Model Zero-cost Proxy Search from Zero

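【Quick read】: This paper addresses two problems with zero-cost (ZC) proxies in neural architecture search (NAS): heavy reliance on expert knowledge and high trial-and-error cost. The key is LPZero, a framework that is the first to automatically design ZC proxies for various tasks. LPZero models a ZC proxy as a symbolic equation, uses genetic programming to find the best symbolic composition within a unified proxy search space, and applies a Rule-based Pruning Strategy (RPS) to eliminate unpromising proxies early, improving ranking consistency and downstream task performance.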

Link: https://arxiv.org/abs/2410.04808
Authors: Peijie Dong, Lujun Li, Xiang Liu, Zhenheng Tang, Xuebo Liu, Qiang Wang, Xiaowen Chu
Keywords: Neural Architecture Search, Neural Architecture, Zero-shot NAS, massive computation, Architecture Search
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 7 figures, 10 appendix

Abstract:In spite of the outstanding performance, Neural Architecture Search (NAS) is criticized for massive computation. Recently, Zero-shot NAS has emerged as a promising approach by exploiting Zero-cost (ZC) proxies, which markedly reduce computational demands. Despite this, existing ZC proxies heavily rely on expert knowledge and incur significant trial-and-error costs. Particularly in NLP tasks, most existing ZC proxies fail to surpass the performance of the naive baseline. To address these challenges, we introduce a novel framework, LPZero, which is the first to automatically design ZC proxies for various tasks, achieving higher ranking consistency than human-designed proxies. Specifically, we model the ZC proxy as a symbolic equation and incorporate a unified proxy search space that encompasses existing ZC proxies, which are composed of a predefined set of mathematical symbols. To heuristically search for the best ZC proxy, LPZero incorporates genetic programming to find the optimal symbolic composition. We propose a Rule-based Pruning Strategy (RPS), which preemptively eliminates unpromising proxies, thereby mitigating the risk of proxy degradation. Extensive experiments on FlexiBERT, GPT-2, and LLaMA-7B demonstrate LPZero’s superior ranking ability and performance on downstream tasks compared to current approaches.
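
Ranking consistency between a proxy and true performance is commonly measured with a rank correlation. The abstract does not name the metric, so the choice of Kendall's tau below is an assumption, as are the toy scores; the sketch only shows how a candidate proxy could be scored during a search.

```python
from itertools import combinations


def kendall_tau(proxy_scores, true_scores):
    """Kendall rank correlation (tau-a) between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(proxy_scores)), 2):
        direction = ((proxy_scores[i] - proxy_scores[j])
                     * (true_scores[i] - true_scores[j]))
        if direction > 0:
            concordant += 1  # both lists order this pair the same way
        elif direction < 0:
            discordant += 1  # the lists disagree on this pair
    total_pairs = len(proxy_scores) * (len(proxy_scores) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up proxy values and accuracies for four candidate architectures.
proxy = [0.2, 0.5, 0.9, 0.4]
accuracy = [61.0, 70.0, 78.0, 72.0]
tau = kendall_tau(proxy, accuracy)
```

A genetic search over symbolic proxies could use a score like this as its fitness function, with higher tau meaning the proxy ranks architectures more like their real performance does.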

[NLP-46] DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

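【Quick read】: This paper addresses the limited performance of Transformers on long sequences, known as the length extrapolation problem. The key is to treat attention scores as a feature map and process neighbouring attention scores across different heads with a convolution operator, increasing the model's expressiveness. Specifically, the paper attributes the limitation to relying solely on the naive query-key dot product to compute attention scores, and applies feature-map processing to those scores to improve performance on long-sequence tasks.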

Link: https://arxiv.org/abs/2410.04798
Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li
Keywords: feed-forward neural networks, earlier feed-forward neural, attention scores, contributing to interactions, distinct tokens
Categories: Computation and Language (cs.CL)
Comments: Tech Report. arXiv admin note: text overlap with arXiv:2405.14722

Abstract:The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, this work’s occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
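
Treating attention scores as a feature map with heads as channels can be sketched with a plain convolution. The specifics are assumptions rather than the paper's design: a 1-D kernel along the key axis, zero padding at the borders, no causal masking, and the hypothetical layout `kernel[out_head][in_head][k]`.

```python
def conv_attention_scores(scores, kernel):
    """Convolve attention scores along the key axis, mixing across heads.

    scores[h][i][j] is the score of head h for query i and key j;
    kernel[o][h][k] has an odd width over the key axis.
    """
    heads, q_len, k_len = len(scores), len(scores[0]), len(scores[0][0])
    half = len(kernel[0][0]) // 2
    out = [[[0.0] * k_len for _ in range(q_len)] for _ in range(len(kernel))]
    for o, head_kernels in enumerate(kernel):
        for i in range(q_len):
            for j in range(k_len):
                acc = 0.0
                for h in range(heads):
                    for k, w in enumerate(head_kernels[h]):
                        jj = j + k - half
                        if 0 <= jj < k_len:  # zero padding at the borders
                            acc += w * scores[h][i][jj]
                out[o][i][j] = acc
    return out


# Two heads, two queries, three keys; the values are made up.
scores = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
]
identity = [[[0, 1, 0], [0, 0, 0]], [[0, 0, 0], [0, 1, 0]]]
head_sum = [[[0, 1, 0], [0, 1, 0]]]
same = conv_attention_scores(scores, identity)
mixed = conv_attention_scores(scores, head_sum)
```

With the identity kernel the scores pass through unchanged, which shows the plain dot-product attention is a special case of this processing; a learned kernel can additionally mix neighbouring keys and heads.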

[NLP-47] Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models

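【Quick read】: This paper addresses the lack of evaluation frameworks for developing Thai large language models (LLMs), Thai being under-represented in LLM development. The key is two new benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Together they assess core capabilities (such as reasoning, knowledge, and commonsense) as well as understanding of Thai culture and language. The paper evaluates various multilingual LLMs on these benchmarks and plans to release the datasets and evaluation code publicly to encourage further research and development.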

Link: https://arxiv.org/abs/2410.04795
Authors: Dahyun Kim, Sukyung Lee, Yungi Kim, Attapol Rutherford, Chanjun Park
Keywords: large language models, widely-used benchmark suites, robust evaluation frameworks, Thai LLM, Thai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of certain widely-used benchmark suites such as the H6 benchmark. However, these benchmark suites are primarily built for the English language, and there exists a lack thereof for under-represented languages, in terms of LLM development, such as Thai. On the other hand, developing LLMs for Thai should also include enhancing the cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multi-lingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.

[NLP-48] GARLIC: LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph for Long Document QA

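【Quick read】: This paper addresses the shortcomings of conventional Retrieval-Augmented Generation (RAG) on long documents. With more powerful models such as Llama 3.1, feeding in the entire document outperforms even recent tree-based RAG methods, although RAG still has the advantage in computational cost. The key is a new retrieval method, LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph (GARLIC). It builds a Hierarchical Weighted Directed Acyclic Graph, retrieves using the LLM's attention weights, and dynamically adjusts the amount and depth of retrieved information, outperforming prior baselines while retaining the computational efficiency of RAG.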

Link: https://arxiv.org/abs/2410.04790
Authors: Xinyu Wang, Yanzheng Xiang, Lin Gui, Yulan He
Keywords: enable language models, Retrieval-Augmented Generation, RAG methods, tree-based RAG methods, Recent tree-based RAG
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the past, Retrieval-Augmented Generation (RAG) methods split text into chunks to enable language models to handle long documents. Recent tree-based RAG methods are able to retrieve detailed information while preserving global context. However, with the advent of more powerful LLMs, such as Llama 3.1, which offer better comprehension and support for longer inputs, we found that even recent tree-based RAG methods perform worse than directly feeding the entire document into Llama 3.1, although RAG methods still hold an advantage in reducing computational costs. In this paper, we propose a new retrieval method, called LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph (GARLIC), which outperforms previous state-of-the-art baselines, including Llama 3.1, while retaining the computational efficiency of RAG methods. Our method introduces several improvements: (1) Rather than using a tree structure, we construct a Hierarchical Weighted Directed Acyclic Graph with many-to-many summarization, where the graph edges are derived from attention mechanisms, and each node focuses on a single event or very few events. (2) We introduce a novel retrieval method that leverages the attention weights of LLMs rather than dense embedding similarity. Our method allows for searching the graph along multiple paths and can terminate at any depth. (3) We use the LLM to control the retrieval process, enabling it to dynamically adjust the amount and depth of information retrieved for different queries. Experimental results show that our method outperforms previous state-of-the-art baselines, including Llama 3.1, on two single-document and two multi-document QA datasets, while maintaining similar computational complexity to traditional RAG methods.

[NLP-49] Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge EMNLP2024

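【Quick read】: This paper investigates how large language models (LLMs) handle misleading and conflicting information in their pretraining data. The key finding is that, when facing conflicting knowledge, LLMs show learning preferences similar to humans: they favour formal texts and texts with fewer spelling errors, learn knowledge from data with these features faster, and treat it more favourably. The finding generalizes across models and languages and is more evident in larger models. Further analysis shows that LLMs tend to trust data whose features signify consistency with the majority of data, and that manipulating the degree of consistency with the majority can instill new preferences or erase old ones.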

Link: https://arxiv.org/abs/2410.04784
Authors: Jiahuan Li, Yiqing Cao, Shujian Huang, Jiajun Chen
Keywords: shown excellent performance, massive pretraining data, knowledge-intensive tasks, trained on massive, shown excellent
Categories: Computation and Language (cs.CL)
Comments: accepted by EMNLP 2024, main conference

Abstract:Having been trained on massive pretraining data, large language models have shown excellent performance on many knowledge-intensive tasks. However, pretraining data tends to contain misleading and even conflicting information, and it is intriguing to understand how LLMs handle these noisy data during training. In this study, we systematically analyze LLMs’ learning preferences for data with conflicting knowledge. We find that pretrained LLMs establish learning preferences similar to humans, i.e., preferences towards formal texts and texts with fewer spelling errors, resulting in faster learning and more favorable treatment of knowledge in data with such features when facing conflicts. This finding is generalizable across models and languages and is more evident in larger models. An in-depth analysis reveals that LLMs tend to trust data with features that signify consistency with the majority of data, and it is possible to instill new preferences and erase old ones by manipulating the degree of consistency with the majority data.

[NLP-50] ImProver: Agent-Based Automated Proof Optimization

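【Quick read】: This paper addresses automated proof optimization: in a proof assistant such as Lean, automatically rewriting the proof of a mathematical theorem to optimize criteria such as length, readability, or modular structure, depending on its downstream use. The key is ImProver, a large-language-model agent that rewrites proofs in Lean to optimize arbitrary user-defined metrics. Because naively applying LLMs falls short, ImProver combines a novel Chain-of-States technique that uses symbolic Lean context with error-correction and retrieval, producing proofs that are substantially shorter, more modular, and more readable.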

Link: https://arxiv.org/abs/2410.04753
Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck
Keywords: Large language models, Large language, generate formal proofs, language models, generate formal
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 19 pages, 21 figures

Abstract:Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

[NLP-51] Document-level Causal Relation Extraction with Knowledge-guided Binary Question Answering EMNLP2024

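【Quick read】: This paper addresses two critical challenges in Event-Event Causal Relation Extraction (ECRE): the lack of document-level modelling and causal hallucinations. The key is a Knowledge-guided binary Question Answering (KnowQA) method with event structures, consisting of two stages: Event Structure Construction and Binary Question Answering. Extensive experiments with large language models (LLMs) under zero-shot and fine-tuning settings demonstrate the usefulness of event structures for document-level ECRE, and KnowQA achieves state-of-the-art performance on the MECI dataset.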

Link: https://arxiv.org/abs/2410.04752
Authors: Zimu Wang, Lei Xia, Wei Wang, Xinya Du
Keywords: Causal Relation Extraction, Event-Event Causal Relation, Relation Extraction, binary Question Answering, information extraction
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2024. Camera-ready version

Abstract:As an essential task in information extraction (IE), Event-Event Causal Relation Extraction (ECRE) aims to identify and classify the causal relationships between event mentions in natural language texts. However, existing research on ECRE has highlighted two critical challenges, including the lack of document-level modeling and causal hallucinations. In this paper, we propose a Knowledge-guided binary Question Answering (KnowQA) method with event structures for ECRE, consisting of two stages: Event Structure Construction and Binary Question Answering. We conduct extensive experiments under both zero-shot and fine-tuning settings with large language models (LLMs) on the MECI and MAVEN-ERE datasets. Experimental results demonstrate the usefulness of event structures on document-level ECRE and the effectiveness of KnowQA by achieving state-of-the-art on the MECI dataset. We observe not only the effectiveness but also the high generalizability and low inconsistency of our method, particularly with complete event structures after fine-tuning the models.
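The binary question answering stage can be pictured as turning every ordered pair of event mentions into a yes/no question. The sketch below is a minimal illustration under stated assumptions: the function name, the prompt wording, and the example sentence are all hypothetical, and the Event Structure Construction stage from the abstract is left out.

```python
from itertools import permutations

def build_binary_questions(document, events):
    """Create one yes/no causal question per ordered pair of event mentions."""
    questions = []
    for cause, effect in permutations(events, 2):
        # Hypothetical prompt template; the paper's actual wording is not given here.
        prompt = (
            f"Document: {document}\n"
            f"Question: Does the event \"{cause}\" cause the event \"{effect}\"? "
            "Answer yes or no."
        )
        questions.append({"cause": cause, "effect": effect, "prompt": prompt})
    return questions

doc = "The storm flooded the road, so the match was cancelled."
qs = build_binary_questions(doc, ["storm", "flooded", "cancelled"])
print(len(qs))  # 3 events give 6 ordered pairs
```

Each prompt would then be sent to an LLM, and a "yes" answer read as a predicted causal link from the first event to the second.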

[NLP-52] Intriguing Properties of Large Language and Vision Models

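[Quick read]: This paper examines why large language and vision models (LLVMs) perform poorly on basic perception tasks, in particular the marked gap between their results on perception-related benchmarks such as MMVP and on advanced reasoning tasks. The approach is a systematic study of LLVM properties, including image processing, math reasoning, and cross-modal alignment, that evaluates common LLVMs such as LLaVA across 10 benchmarks. It finds that these models process images globally, that their math problem solving does not fully depend on perceiving detailed numerical information, and that cross-modal alignment overfits to complex reasoning tasks. The paper also highlights the importance of the lower-layer representation space for visual understanding and, based on these findings, suggests future directions for improving LLVMs and building more challenging evaluation benchmarks.

Link: https://arxiv.org/abs/2410.04751
Authors: Young-Jun Lee,Byungsoo Ko,Han-Gyu Kim,Yechan Hwang,Ho-Jin Choi
Keywords: received significant attention, development efforts due, large language model, tasks requiring perception, remarkable generalization performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Code is available in this https URL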


Abstract:Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preserving and importance, by evaluating the most common LLVM’s families (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby, causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.

[NLP-53] TableRAG: Million-Token Table Understanding with Language Models NEURIPS2024

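[Quick read]: This paper addresses the scalability problems that existing language models face on tabular data, in particular the difficulty of feeding in a complete table given positional bias and context length limits. The core of the solution is TableRAG, a Retrieval-Augmented Generation (RAG) framework designed for LM-based table understanding. TableRAG combines query expansion with schema and cell retrieval to pinpoint the key information before passing it to the language model, which enables more efficient data encoding and more precise retrieval, significantly shortens prompts, and mitigates information loss.

Link: https://arxiv.org/abs/2410.04739
Authors: Si-An Chen,Lesly Miculicich,Julian Martin Eisenschlos,Zifeng Wang,Zilong Wang,Yanfei Chen,Yasuhisa Fujii,Hsuan-Tien Lin,Chen-Yu Lee,Tomas Pfister
Keywords: Recent advancements, language models, primarily through program-aided, advancements in language, notably enhanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024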


Abstract:Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG’s effectiveness at scale. Our results demonstrate that TableRAG’s retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
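The idea of retrieving schema entries and cells before prompting can be sketched in a few lines. This is a toy illustration, not the TableRAG implementation: the word-overlap scorer, the hard-coded query expansion, and the table contents are all assumptions made for this example (per the abstract, the expansion is part of the framework itself).

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query_terms, candidates, top_k):
    """Rank candidate strings by word overlap with the expanded query terms."""
    scored = [(len(tokenize(c) & query_terms), c) for c in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for score, c in scored[:top_k] if score > 0]

table = {
    "columns": ["product name", "unit price", "country"],
    "rows": [
        ["wallet", "25", "france"],
        ["backpack", "80", "japan"],
        ["wallet", "30", "japan"],
    ],
}

question = "What is the average price of wallets?"
expanded = tokenize("price wallet product")  # stand-in for the query expansion step

schema_hits = retrieve(expanded, table["columns"], top_k=2)
cells = sorted({f"{col}: {val}" for row in table["rows"]
                for col, val in zip(table["columns"], row)})
cell_hits = retrieve(expanded, cells, top_k=2)

# Only the retrieved pieces go into the prompt, not the whole table.
prompt = f"Question: {question}\nColumns: {schema_hits}\nCells: {cell_hits}"
print(prompt)
```

The prompt length now depends on `top_k` rather than on the table size, which is the property the abstract attributes to retrieval.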

[NLP-54] TLDR: Token-Level Detective Reward Model for Large Vision Language Models

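[Quick read]: This paper addresses the coarseness and low information content of existing reward models for multimodal large language models: they give only a single binary feedback signal per text and cannot evaluate it at a fine granularity. The core of the solution is the Token-Level Detective Reward Model (TLDR), which provides fine-grained annotations for every text token. TLDR is trained on synthetic hard negatives and their token-level labels, produced by a perturbation-based method. It helps off-the-shelf models self-correct their generations, serves as a hallucination evaluation tool, and speeds up human annotation of high-quality vision language data by 3 times.

Link: https://arxiv.org/abs/2410.04734
Authors: Deqing Fu,Tong Xiao,Rui Wang,Wang Zhu,Pengchuan Zhang,Guan Pang,Robin Jia,Lawrence Chen
Keywords: improving multimodal large, TLDR models, reward models, models, minimal information
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work done at Meta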


Abstract:Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning only one binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations, and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can significantly speed up human annotation by 3 times to acquire a broader range of high-quality vision language data.
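The difference between a single binary label and token-level labels is easy to see on a perturbed caption. The following is a minimal sketch, assuming a simple word-swap perturbation; the function name, the swap table, and the caption are hypothetical, and the paper's actual perturbation method may differ.

```python
def perturb_caption(tokens, swaps):
    """Replace selected words and label every token: 1 = kept, 0 = perturbed."""
    new_tokens, labels = [], []
    for token in tokens:
        if token in swaps:
            new_tokens.append(swaps[token])
            labels.append(0)
        else:
            new_tokens.append(token)
            labels.append(1)
    return new_tokens, labels

caption = "a brown dog sits on the red sofa".split()
negative, token_labels = perturb_caption(caption, {"brown": "white", "sofa": "chair"})

# A single-score reward model would only see this one bit for the whole caption.
binary_label = int(all(token_labels))

print(" ".join(negative))  # a white dog sits on the red chair
print(token_labels)        # [1, 0, 1, 1, 1, 1, 1, 0]
print(binary_label)        # 0
```

The token-level labels point at the two wrong words, while the binary label only says that something in the caption is wrong.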

[NLP-55] Efficient transformer with reinforced position embedding for language models

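[Quick read]: This paper targets the large parameter count and long training time of a conventional Transformer, evaluated on a Portuguese-English translation task. The core of the solution is an efficient Transformer architecture that uses reinforced positional embedding to improve performance while halving the number of encoder and decoder layers. Concretely, it concatenates positional encoding with trainable token embeddings, normalizes the columns of the token embedding matrix, and uses the normalized token embedding matrix as the value of the attention layer. These changes lower training and validation loss, shorten training time, and give lower or comparable losses across 14 translation datasets.

Link: https://arxiv.org/abs/2410.04731
Authors: Yen-Che Hsiao,Abhishek Dutta
Keywords: obtain superior performance, encoder decoder layers, reinforced positional embedding, efficient transformer architecture, token embedding matrix
Categories: Computation and Language (cs.CL)
Comments: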


Abstract:In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.
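The contrast between adding and concatenating positional information can be sketched without any framework. The code below is an illustration under assumptions, not the authors' implementation: it uses the standard sinusoidal encoding, treats rows as tokens and columns as embedding dimensions, uses L2 normalization, and leaves out the step that feeds the normalized embeddings to attention as values.

```python
import math

def sinusoidal_encoding(position, dim):
    """Standard sinusoidal positional encoding for one position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def normalize_columns(matrix):
    """Scale each column to unit L2 norm (all-zero columns are left unchanged)."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[c] ** 2 for row in matrix)) or 1.0
             for c in range(n_cols)]
    return [[row[c] / norms[c] for c in range(n_cols)] for row in matrix]

# Toy embedding matrix: 3 tokens, embedding dimension 2 (values are made up).
token_embeddings = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
normalized = normalize_columns(token_embeddings)

# Baseline: positional encoding is added, so the width stays at 2.
added = [[t + p for t, p in zip(tok, sinusoidal_encoding(pos, 2))]
         for pos, tok in enumerate(token_embeddings)]

# Variant described in the abstract: positional encoding is concatenated.
concatenated = [tok + sinusoidal_encoding(pos, 2)
                for pos, tok in enumerate(normalized)]

print(len(added[0]), len(concatenated[0]))  # 2 4
```

Concatenation keeps token and position information in separate coordinates, at the cost of a wider input vector.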

[NLP-56] Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models

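[Quick read]: This paper addresses limitations of existing methods for evaluating the memorization capability of language models. The core of the solution is a new evaluation method, the forgetting curve, which does not rely on prompts, applies to models of any size, and is robust to the tested corpus and experimental settings. Applying the forgetting curve to a range of Transformer and RNN/SSM based models, the authors provide empirical evidence for the effectiveness of Transformer context-extension techniques and raise questions about the effective length of RNN/SSM based models.

Link: https://arxiv.org/abs/2410.04727
Authors: Xinyu Liu,Runsong Zhao,Pengcheng Huang,Chunyang Xiao,Bei Li,Jingang Wang,Tong Xiao,Jingbo Zhu
Keywords: Numerous recent works, Numerous recent, recent works target, extend effective context, target to extend
Categories: Computation and Language (cs.CL)
Comments: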


Abstract:Numerous recent works target to extend effective context length for language models and various methods, tasks and benchmarks exist to measure model’s effective memorization length. However, through thorough investigations, we find limitations for currently existing evaluations on model’s memorization capability. We provide an extensive survey for limitations in this work and propose a new method called forgetting curve to measure the memorization capability of long-context models. We show that forgetting curve has the advantage of being robust to the tested corpus and the experimental settings, of not relying on prompts and can be applied to any model size. We apply our forgetting curve to a large variety of models involving both transformer and RNN/SSM based architectures. Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raises questions for the effective length of RNN/SSM based models. We also examine the difference between our measurement and existing benchmarks as well as popular metrics for various models. Our code and results can be found at this https URL.

[NLP-57] Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

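[Quick read]: This paper studies what enables large language models (LLMs) to generalize to unseen instructions. The key finding is that diversifying training data across semantic domains is what drives this generalization. Diversifying data only within a limited set of domains does not ensure robust generalization, whereas cross-domain diversification significantly improves adaptability even under a constrained data budget. For both specialist and generalist models, increasing the diversity of the training set is more effective than simply adding more similar data. The paper therefore stresses the strategic importance of data diversification and offers guidelines for improving data quality.

Link: https://arxiv.org/abs/2410.04717
Authors: Dylan Zhang,Justin Wang,Francois Charton
Keywords: Understanding and accurately, large language models, data, large language, Turing-complete Markov algorithm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: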


Abstract:Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model’s adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.
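The abstract mentions controlled experiments inspired by Markov algorithms, which are ordered string-rewriting systems. For readers unfamiliar with them, here is a minimal interpreter sketch; the rule format (pattern, replacement, terminal flag) and the example rule are choices made for this illustration and say nothing about the paper's actual experimental setup.

```python
def run_markov_algorithm(rules, text, max_steps=1000):
    """Apply ordered rewrite rules: first applicable rule, leftmost match, until none applies."""
    for _ in range(max_steps):
        for pattern, replacement, is_terminal in rules:
            if pattern in text:
                # Rewrite only the leftmost occurrence of the pattern.
                text = text.replace(pattern, replacement, 1)
                if is_terminal:
                    return text
                break
        else:
            return text  # no rule matched, so the algorithm halts
    raise RuntimeError("step limit reached")

# One rule that repeatedly moves every "b" in front of every "a".
rules = [("ab", "ba", False)]
print(run_markov_algorithm(rules, "abab"))  # bbaa
```

In this framing, an "instruction" is a rule set and following it means producing the rewritten string, which makes instruction diversity easy to control.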

[NLP-58] Rule-based Data Selection for Large Language Models

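[Quick read]: This paper addresses the weaknesses of conventional rule-based data selection for assessing training data quality: heavy reliance on human heuristics, no effective metric for evaluating rules, and limited adaptability. The core of the solution is a rule-based framework that uses the orthogonality of rule score vectors as a new metric for evaluating rules. The pipeline uses large language models (LLMs) to generate a diverse set of rating rules, applies a determinantal point process (DPP) to select the most orthogonal score vectors and so identify independent rules, then rates all data with these rules and selects the samples with the highest average scores for downstream tasks such as LLM training. Experiments show that the method outperforms other approaches in both rating precision and model performance.

Link: https://arxiv.org/abs/2410.04715
Authors: Xiaomin Li,Mingye Gao,Zhiwei Zhang,Chang Yue,Hong Hu
Keywords: data significantly impacts, large language models, significantly impacts, large language, rules
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: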


Abstract:The quality of training data significantly impacts the performance of large language models (LLMs). There are increasing studies using LLMs to rate and select data based on several human-crafted metrics (rules). However, these conventional rule-based approaches often depend too heavily on human heuristics, lack effective metrics for assessing rules, and exhibit limited adaptability to new tasks. In our study, we introduce an innovative rule-based framework that utilizes the orthogonality of score vectors associated with rules as a novel metric for rule evaluations. Our approach includes an automated pipeline that first uses LLMs to generate a diverse set of rules, encompassing various rating dimensions to evaluate data quality. Then it rates a batch of data based on these rules and uses the determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, thereby identifying a set of independent rules. These rules are subsequently used to evaluate all data, selecting samples with the highest average scores for downstream tasks such as LLM training. We verify the effectiveness of our method through two experimental setups: 1) comparisons with ground truth ratings and 2) benchmarking LLMs trained with the chosen data. Our comprehensive experiments cover a range of scenarios, including general pre-training and domain-specific fine-tuning in areas such as IMDB, Medical, Math, and Code. The outcomes demonstrate that our DPP-based rule rating method consistently outperforms other approaches, including rule-free rating, uniform sampling, importance resampling, and QuRating, in terms of both rating precision and model performance.
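Selecting rules whose score vectors are close to orthogonal can be illustrated with a greedy procedure. Note the substitution: the paper samples with a determinantal point process, whereas this sketch uses a deterministic greedy surrogate (pick the vector with the largest residual after projecting out earlier picks). The rule names and scores are invented, and `k` is assumed not to exceed the rank of the score vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def select_orthogonal_rules(score_vectors, k):
    """Greedily pick k rules whose unit score vectors span the largest volume."""
    residuals = {name: normalize(v) for name, v in score_vectors.items()}
    chosen = []
    for _ in range(k):
        # The rule with the largest residual is least explained by earlier picks.
        # Rounding makes ties deterministic (alphabetical order wins).
        best = max(sorted(residuals),
                   key=lambda name: round(dot(residuals[name], residuals[name]), 9))
        basis = normalize(residuals.pop(best))
        chosen.append(best)
        # Remove the chosen direction from every remaining vector (Gram-Schmidt step).
        for name, v in residuals.items():
            proj = dot(v, basis)
            residuals[name] = [a - proj * b for a, b in zip(v, basis)]
    return chosen

# Each vector holds one rule's ratings of the same four data samples.
scores = {
    "clarity":     [5, 4, 1, 1],
    "readability": [5, 5, 1, 1],   # nearly redundant with "clarity"
    "factuality":  [1, 1, 5, 4],
}
print(select_orthogonal_rules(scores, 2))  # ['clarity', 'factuality']
```

The redundant rule is skipped because, once "clarity" is chosen, almost nothing of "readability" is left to explain.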

[NLP-59] Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

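[Quick read]: This paper asks how to allocate computation dynamically during language model (LM) decoding according to the difficulty of each input, to improve decoding efficiency and output quality. The core of the solution is an adaptive computation allocation method that predicts the distribution of rewards given an input and a computation budget, and then allocates additional computation to the inputs where it is predicted to help most. It is applied in two decoding procedures: an adaptive best-of-k procedure that dynamically chooses how many samples to generate for each input, and a routing procedure that chooses between an expensive but accurate decoding procedure and a cheaper but less capable one. Experiments show computation reductions of up to 50% at no cost to response quality, or quality improvements of up to 10% at a fixed computation budget.

Link: https://arxiv.org/abs/2410.04707
Authors: Mehul Damani,Idan Shenfeld,Andi Peng,Andreea Bobu,Jacob Andreas
Keywords: Computationally intensive decoding, spanning code generation, problems spanning code, Computationally intensive, intensive decoding procedures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: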


Abstract:Computationally intensive decoding procedures–including search, reranking, and self-critique–can improve the quality of language model (LM) outputs in problems spanning code generation, numerical reasoning, and dialog. Existing work typically applies the same decoding procedure for every input to an LM. But not all inputs require the same amount of computation to process. Can we allocate decoding computation adaptively, using more resources to answer questions whose answers will be harder to compute? We present an approach that predicts the distribution of rewards given an input and computation budget, then allocates additional computation to inputs for which it is predicted to be most useful. We apply this approach in two decoding procedures: first, an adaptive best-of-k procedure that dynamically selects the number of samples to generate as input to a reranker; second, a routing procedure that dynamically responds to a query using a decoding procedure that is expensive but accurate, or one that is cheaper but less capable. Across a suite of programming, mathematics, and dialog tasks, we show that accurate computation-allocation procedures can be learned, and reduce computation by up to 50% at no cost to response quality, or improve quality by up to 10% at a fixed computational budget.
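Adaptive best-of-k can be pictured as spending a fixed sample budget where one more sample is predicted to help most. The sketch below makes strong simplifying assumptions that are not from the paper: each query has a known per-sample success probability, samples are independent, and "reward" is just the chance that at least one sample is correct. In the paper these predictions are learned.

```python
import heapq

def allocate_samples(success_probs, budget):
    """Give every query one sample, then spend the rest where the predicted gain is largest."""
    counts = [1] * len(success_probs)

    def marginal_gain(p, k):
        # Increase in P(at least one correct sample) when going from k to k + 1 samples:
        # (1 - (1 - p)^(k + 1)) - (1 - (1 - p)^k) = p * (1 - p)^k
        return p * (1 - p) ** k

    heap = [(-marginal_gain(p, 1), i) for i, p in enumerate(success_probs)]
    heapq.heapify(heap)
    for _ in range(budget - len(success_probs)):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-marginal_gain(success_probs[i], counts[i]), i))
    return counts

# An easy, a medium, and a hard query sharing a budget of 9 samples.
print(allocate_samples([0.95, 0.5, 0.1], budget=9))  # [1, 3, 5]
```

A uniform policy would give each query 3 samples; the adaptive one stops sampling the easy query after a single draw and moves those samples to the hard one.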

[NLP-60] Modeling and Estimation of Vocal Tract and Glottal Source Parameters Using ARMAX-LF Model

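[Quick read]: This paper addresses the inability of the conventional ARX model to accurately estimate speech sounds that contain anti-formants (zeros), such as nasal, fricative, and stop consonants. The core of the solution is the ARMAX-LF model, which combines the Liljencrants-Fant (LF) model with the ARMAX model so that vocal tract filter and glottal source parameters can be estimated together. A deep neural network (DNN) provides the nonlinear fit from glottal source derivatives or speech waveforms to LF parameters, which removes the iterative analysis-by-synthesis loop and reduces both estimation error and estimation time.

Link: https://arxiv.org/abs/2410.04704
Authors: Kai Li,Masato Akagi,Yongwei Li,Masashi Unoki
Keywords: iteration-based estimation approach, vocal tract, model, Auto-Regressive Moving Average, glottal source
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: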


Abstract:Modeling and estimation of the vocal tract and glottal source parameters of vowels from raw speech can be typically done by using the Auto-Regressive with eXogenous input (ARX) model and Liljencrants-Fant (LF) model with an iteration-based estimation approach. However, the all-pole autoregressive model in the modeling of vocal tract filters cannot provide the locations of anti-formants (zeros), which increases the estimation errors in certain classes of speech sounds, such as nasal, fricative, and stop consonants. In this paper, we propose the Auto-Regressive Moving Average eXogenous with LF (ARMAX-LF) model to extend the ARX-LF model to a wider variety of speech sounds, including vowels and nasalized consonants. The LF model represents the glottal source derivative as a parametrized time-domain model, and the ARMAX model represents the vocal tract as a pole-zero filter with an additional exogenous LF excitation as input. To estimate multiple parameters with fewer errors, we first utilize the powerful nonlinear fitting ability of deep neural networks (DNNs) to build a mapping from extracted glottal source derivatives or speech waveforms to corresponding LF parameters. Then, glottal source and vocal tract parameters can be estimated with fewer estimation errors and without any iterations as in the analysis-by-synthesis strategy. Experimental results with synthesized speech using the linear source-filter model, synthesized speech using the physical model, and real speech signals showed that the proposed ARMAX-LF model with a DNN-based estimation method can estimate the parameters of both vowels and nasalized sounds with fewer errors and estimation time.
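For orientation, the textbook ARMAX difference equation shows where the zeros come from. The symbols and orders below (p, q, r, and the signal names) are assumptions made for this note; the paper's exact parameterization may differ.

```latex
% Generic ARMAX difference equation (textbook form; symbols are assumptions).
% s(n): speech sample, u(n): LF-model excitation, e(n): residual noise.
s(n) + \sum_{k=1}^{p} a_k\, s(n-k)
  = \sum_{k=0}^{q} b_k\, u(n-k) + e(n) + \sum_{k=1}^{r} c_k\, e(n-k)

% Transfer function from u to s: poles from A(z), zeros from B(z).
H(z) = \frac{B(z)}{A(z)}
     = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 + \sum_{k=1}^{p} a_k z^{-k}}
```

Setting q = 0 and r = 0 recovers the all-pole ARX case, which has no zeros and therefore cannot place anti-formants.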

[NLP-61] The LLM Effect: Are Humans Truly Using LLMs or Are They Being Influenced By Them Instead? EMNLP

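[Quick read]: This paper examines the efficiency and accuracy of large language models (LLMs) on highly specialized, open-ended tasks such as policy studies. The approach is a structured user study of the partnership between LLMs and expert annotators across two stages, Topic Discovery and Topic Assignment. The results show that LLM-generated topic lists overlap significantly with human-generated ones, although they miss some document-specific topics, and that LLM suggestions can significantly speed up task completion. This gain may come with anchoring bias that affects the depth and nuance of the analysis, which raises the question of the trade-off between efficiency and the risk of biased analysis.

Link: https://arxiv.org/abs/2410.04699
Authors: Alexander S. Choi,Syeda Sabrina Akter,JP Singh,Antonios Anastasopoulos
Keywords: Large Language Models, Large Language, Language Models, shown capabilities close, leading researchers
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to EMNLP Main 2024. First two authors contributed equally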


Abstract:Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks, leading researchers to use them for time and labor-intensive analyses. However, their capability to handle highly specialized and open-ended tasks in domains like policy studies remains in question. This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership. The study, conducted in two stages-Topic Discovery and Topic Assignment-integrates LLMs with expert annotators to observe the impact of LLM suggestions on what is usually human-only analysis. Results indicate that LLM-generated topic lists have significant overlap with human generated topic lists, with minor hiccups in missing document-specific topics. However, LLM suggestions may significantly improve task completion speed, but at the same time introduce anchoring bias, potentially affecting the depth and nuance of the analysis, raising a critical question about the trade-off between increased efficiency and the risk of biased analysis.

[NLP-62] MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

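[Quick read]: This paper addresses the lack of benchmarks for evaluating the mathematical reasoning of large language models (LLMs) over long contexts. The core of the solution is MathHay, an automated benchmark designed to assess long-context mathematical reasoning. Unlike earlier benchmarks that focus mainly on information retrieval within long texts, MathHay requires both information seeking and complex mathematical reasoning, giving a fuller picture of how LLMs perform in real-world applications.

Link: https://arxiv.org/abs/2410.04698
Authors: Lei Wang,Shan Dong,Yuhui Xu,Hanze Dong,Yalu Wang,Amrita Saha,Ee-Peng Lim,Caiming Xiong,Doyen Sahoo
Keywords: Recent large language, demonstrated versatile capabilities, mathematical reasoning abilities, mathematical reasoning, long-context mathematical reasoning
Categories: Computation and Language (cs.CL)
Comments: Work-in-Progress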


Abstract:Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs’ application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.

[NLP-63] Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning EMNLP’24

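[Quick read]: This paper asks why in-context learning (ICL) is more effective than fine-tuning on tasks with implicit patterns. The key finding is that ICL quickly captures and understands these patterns, whereas fine-tuning yields only limited gains in pattern recognition despite using far more training samples. The authors build several datasets containing implicit patterns, compare ICL and fine-tuning across models of different parameter scales, and propose a circuit shift theory that explains the advantage of ICL from a mechanistic interpretability perspective.

Link: https://arxiv.org/abs/2410.04691
Authors: Qingyu Yin,Xuzheng He,Luoao Deng,Chak Tou Leong,Fan Wang,Yanzhao Yan,Xiaoyu Shen,Qiang Zhang
Keywords: imbuing large language, large language models, in-context learning, task-specific knowledge, ICL
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP’24 Findings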


Abstract:Fine-tuning and in-context learning (ICL) are two prevalent methods in imbuing large language models with task-specific knowledge. It is commonly believed that fine-tuning can surpass ICL given sufficient training samples as it allows the model to adjust its internal parameters based on the data. However, this paper presents a counterintuitive finding: For tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning. We developed several datasets featuring implicit patterns, such as sequences determining answers through parity or identifying reducible terms in calculations. We then evaluated the models’ understanding of these patterns under both fine-tuning and ICL across models ranging from 0.5B to 7B parameters. The results indicate that models employing ICL can quickly grasp deep patterns and significantly improve accuracy. In contrast, fine-tuning, despite utilizing thousands of times more training samples than ICL, achieved only limited improvements. We also proposed circuit shift theory from a mechanistic interpretability’s view to explain why ICL wins.

[NLP-64] Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

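[Quick read]: This paper asks how large language models (LLMs) can themselves be used to improve the evaluation of LLM outputs. The core of the solution is a framework that treats LLMs as advocates within an ensemble of interacting agents, which defend their answers and reach conclusions through a judge and jury system. Compared with traditional human assessment or automated metrics, this offers a more dynamic and comprehensive evaluation process, and a probabilistic model is used to evaluate the error reduction achieved by iterative advocate systems.

Link: https://arxiv.org/abs/2410.04663
Authors: Chaithanya Bandi,Hari Bandi,Abir Harrasse
Keywords: paper explores optimal, large language models, explores optimal architectures, paper explores, explores optimal
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: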


Abstract:This paper explores optimal architectures for evaluating the outputs of large language models (LLMs) using LLMs themselves. We propose a novel framework that interprets LLMs as advocates within an ensemble of interacting agents, allowing them to defend their answers and reach conclusions through a judge and jury system. This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics. We discuss the motivation behind this framework, its key components, and comparative advantages. We also present a probabilistic model to evaluate the error reduction achieved by iterative advocate systems. Finally, we outline experiments to validate the effectiveness of multi-advocate architectures and discuss future research directions.

[NLP-65] Contrastive Learning to Improve Retrieval for Real-world Fact Checking EMNLP2024

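[Quick read]: This paper addresses the difficulty traditional retrieval methods have with complex claims in fact-checking. The key to the solution is the Contrastive Fact-Checking Reranker (CFR): using the subquestion annotations and human-written evidence answers in the AVeriTeC dataset, the authors fine-tune Contriever with a contrastive objective that combines several training signals, including distillation from GPT-4, evaluation of subquestion answers, and the dataset's gold labels. The approach improves retrieval of relevant evidence, yields a 6% gain in veracity classification accuracy, and transfers to several other datasets.

Link: https://arxiv.org/abs/2410.04657
Authors: Aniruddh Sriram, Fangyuan Xu, Eunsol Choi, Greg Durrett
Keywords: Recent work, incorporate evidence retrieved, addresses a realistic, web to decide, models incorporate evidence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 FEVER Workshop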

Abstract:Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is in retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires more inferences. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show our gains can be transferred to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.
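
The abstract above mentions fine-tuning Contriever with a contrastive objective but does not spell the objective out. As a rough illustration only, the sketch below computes a generic InfoNCE-style loss for one query, given a similarity score for the positive evidence passage and scores for negative passages. The function name, the temperature value, and the choice of InfoNCE itself are assumptions, not details taken from the paper.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Generic InfoNCE loss for one query (illustrative, not the paper's exact objective).

    pos_score:  similarity between the query and its positive passage
    neg_scores: similarities between the query and negative passages
    """
    # Scale all scores by the temperature; the positive goes first.
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive passage.
    return log_sum - logits[0]


# A well-ranked positive gives a smaller loss than a poorly ranked one.
good = info_nce_loss(0.9, [0.1, 0.2])
bad = info_nce_loss(0.1, [0.9, 0.2])
print(good < bad)
```

In a setup like this, the training signals listed in the abstract (GPT-4 distillation, subquestion answer evaluation, gold labels) would decide which passages count as positives and negatives; how they are combined is not specified here.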

[NLP-66] A Cross-Lingual Meta-Learning Method Based on Domain Adaptation for Speech Emotion Recognition

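[Quick read]: This paper addresses speech emotion recognition when data is scarce. The key to the solution is meta-learning combined with a large pre-trained backbone and a prototypical network, which improves performance in few-shot settings. In particular, the paper proposes an improved fine-tuning technique for the meta-testing stage that significantly boosts performance on out-of-distribution datasets, reaching accuracies of 83.78% and 56.30% on Greek and Romanian speech emotion recognition, respectively.

Link: https://arxiv.org/abs/2410.04633
Authors: David-Gabriel Ion, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Keywords: Best-performing speech models, speech emotion recognition, Best-performing speech, speech emotion, emotion recognition
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 1 figure, Accepted by WISE 2024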

Abstract:Best-performing speech models are trained on large amounts of data in the language they are meant to work for. However, most languages have sparse data, making training models challenging. This shortage of data is even more prevalent in speech emotion recognition. Our work explores the model’s performance with limited data, specifically for speech emotion recognition. Meta-learning specializes in improving few-shot learning. As a result, we employ meta-learning techniques on speech emotion recognition tasks, accent recognition, and person identification. To this end, we propose a series of improvements over the multistage meta-learning method. Unlike other works focusing on smaller models due to the high computational cost of meta-learning algorithms, we take a more practical approach. We incorporate a large pre-trained backbone and a prototypical network, making our methods more feasible and applicable. Our most notable contribution is an improved fine-tuning technique during meta-testing that significantly boosts the performance on out-of-distribution datasets. This result, together with incremental improvements from several other works, helped us achieve accuracy scores of 83.78% and 56.30% for Greek and Romanian speech emotion recognition datasets not included in the training or validation splits in the context of 4-way 5-shot learning.
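
The prototypical network mentioned in the abstract can be illustrated with a minimal sketch: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The embeddings here are toy 2-D vectors; in the paper's setting they would come from the pre-trained backbone, which is not modeled. The function names and the use of Euclidean distance are assumptions for illustration.

```python
import math


def prototypes(support):
    """Map each label to the mean of its support embeddings.

    support: dict of label -> list of equal-length embedding vectors
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos


def classify(query, protos):
    """Return the label whose prototype is closest to the query (Euclidean distance)."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# Toy 2-way 2-shot episode with hypothetical embeddings.
support = {
    "angry": [[1.0, 0.0], [0.8, 0.2]],
    "happy": [[0.0, 1.0], [0.2, 0.8]],
}
print(classify([0.9, 0.1], prototypes(support)))
```

A 4-way 5-shot episode, as in the abstract, would simply use four labels with five support vectors each.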

[NLP-67] Control Large Language Models via Divide and Conquer EMNLP2024

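[Quick read]: This paper addresses the challenges large language models (LLMs) face in Lexically Constrained Generation (LCG) under prompt-based control. It identifies three main limitations of LLMs in satisfying lexical constraints: position bias, low responsiveness to decoding parameters, and difficulty handling complex constraints. The key to the solution is a Divide and Conquer Generation strategy that works for both white-box and black-box LLMs and improves the success rate on the most challenging LCG task by more than 90%. The strategy opens a path to more sophisticated and customized text generation applications.

Link: https://arxiv.org/abs/2410.04628
Authors: Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, Nanyun Peng
Keywords: Lexically Constrained Generation, Lexically Constrained, large language models, paper investigates controllable, focusing on Lexically
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024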

Abstract:This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG). We systematically evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications. We conclude that LLMs face significant challenges in consistently satisfying lexical constraints with prompt-based control. We identified three key limitations of LLMs for LCG, including (1) position bias, where LLMs tend to satisfy constraints that appear in specific positions within the input; (2) low responsiveness to decoding parameters, which have minimal impact on the control of LLMs; and (3) difficulty handling the inherent complexity of certain constraints (e.g., compound words). To address these issues, we introduce a Divide and Conquer Generation strategy, effective for both white-box and black-box LLMs, to enhance LLM performance in LCG tasks, which demonstrates over 90% improvement in success rate in the most challenging LCG task. Our analysis provides valuable insights into the performance of LLMs in LCG with prompt-based control, and our proposed strategy offers a pathway to more sophisticated and customized text generation applications.

[NLP-68] Punctuation Prediction for Polish Texts using Transformers

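[Quick read]: This paper addresses punctuation prediction for Polish text, that is, automatically restoring punctuation in unpunctuated speech recognition output so that it is easier to read and understand. The key to the solution is a single fine-tuned HerBERT model trained on the competition data and an external dataset, which scores 71.44 weighted F1 on Poleval 2022 Task 1.

Link: https://arxiv.org/abs/2410.04621
Authors: Jakub Pokrywka
Keywords: Speech recognition systems, recognition systems typically, systems typically output, Speech recognition, typically output text
Categories: Computation and Language (cs.CL)
Comments: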

Abstract:Speech recognition systems typically output text lacking punctuation. However, punctuation is crucial for written text comprehension. To tackle this problem, Punctuation Prediction models are developed. This paper describes a solution for Poleval 2022 Task 1: Punctuation Prediction for Polish Texts, which scores 71.44 Weighted F1. The method utilizes a single HerBERT model fine-tuned on the competition data and an external dataset.
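
For readers unfamiliar with the metric, the sketch below computes a generic support-weighted F1: per-class F1 scores averaged with weights proportional to how often each class appears in the gold labels. This is the textbook definition; the competition's official scoring script may differ in details such as which punctuation marks are counted, so treat the function and the toy labels as assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (generic definition)."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in support:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each class by its share of the gold labels.
        score += f1 * support[label] / total
    return score


# Toy example: one punctuation label per slot.
gold = [",", ",", ".", "?"]
pred = [",", ".", ".", "?"]
print(round(weighted_f1(gold, pred), 4))  # approximately 0.75
```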

[NLP-69] Passage Retrieval of Polish Texts Using OKAPI BM25 and an Ensemble of Cross Encoders

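[Quick read]: This paper addresses passage retrieval in Poleval 2023 Task 3, which covers Polish texts in three domains (trivia, legal, and customer support), with only the trivia domain available for training and development. The key to the solution is retrieving documents with the OKAPI BM25 algorithm and reranking them with an ensemble of publicly available multilingual Cross Encoders. Fine-tuning the rerankers helped in the training domain but hurt performance in the other domains.

Link: https://arxiv.org/abs/2410.04620
Authors: Jakub Pokrywka
Keywords: Passage Retrieval challenge, Passage Retrieval, traditionally relied, relied on lexical, lexical methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: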

Abstract:Passage Retrieval has traditionally relied on lexical methods like TF-IDF and BM25. Recently, some neural network models have surpassed these methods in performance. However, these models face challenges, such as the need for large annotated datasets and adapting to new domains. This paper presents a winning solution to the Poleval 2023 Task 3: Passage Retrieval challenge, which involves retrieving passages of Polish texts in three domains: trivia, legal, and customer support. However, only the trivia domain was used for training and development data. The method used the OKAPI BM25 algorithm to retrieve documents and an ensemble of publicly available multilingual Cross Encoders for Reranking. Fine-tuning the reranker models slightly improved performance but only in the training domain, while it worsened in other domains.
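
The two-stage pipeline described above (lexical retrieval, then reranking) can be sketched as follows. The BM25 scoring uses the common Okapi formulation with a smoothed, non-negative IDF; the parameter values `k1=1.5` and `b=0.75` are conventional defaults, not values from the paper. The reranker is a toy token-overlap function standing in for the cross-encoder ensemble, which is not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per tokenized document."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_overlap_reranker(query_tokens, doc_tokens):
    """Hypothetical stand-in for a cross-encoder: fraction of doc tokens in the query."""
    return len(set(query_tokens) & set(doc_tokens)) / len(doc_tokens)


def retrieve_then_rerank(query_tokens, docs_tokens, rerank_fn, top_k=2):
    """Stage 1: keep the BM25 top_k. Stage 2: reorder them by rerank_fn."""
    scores = bm25_scores(query_tokens, docs_tokens)
    ranked = sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)
    candidates = ranked[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query_tokens, docs_tokens[i]), reverse=True)


docs = [
    ["court", "issued", "ruling"],
    ["customer", "support", "refund", "policy"],
    ["refund", "request", "form"],
]
query = ["refund", "policy"]
print(retrieve_then_rerank(query, docs, toy_overlap_reranker))  # document indices, best first
```

Keeping the first stage purely lexical means it needs no training data, which matters in a setting where only one of the three domains has training data.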

[NLP-70] Evaluation of Code LLMs on Geospatial Code Generation

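[Quick read]: This paper addresses the lack of evaluation of code-generation large language models (LLMs) in geospatial data science. The key to the solution is an evaluation benchmark built specifically for geospatial tasks: the authors categorize tasks by complexity and required tools, create a dataset of high-quality, manually written geospatial coding problems, and design test scenarios that automatically check generated code for correctness. They also test existing LLMs on geospatial code generation and release the dataset and reproducible evaluation code on GitHub, intended as a benchmark for future LLMs and a step toward models that solve geospatial coding tasks with high accuracy.

Link: https://arxiv.org/abs/2410.04617
Authors: Piotr Gramacki, Bruno Martins, Piotr Szymański
Keywords: Large Language Models, Large Language, approaches using Large, code generation, Language Models
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: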

Abstract:Software development support tools have been studied for a long time, with recent approaches using Large Language Models (LLMs) for code generation. These models can generate Python code for data science and machine learning applications. LLMs are helpful for software engineers because they increase productivity in daily work. An LLM can also serve as a “mentor” for inexperienced software developers, and be a viable learning support. High-quality code generation with LLMs can also be beneficial in geospatial data science. However, this domain poses different challenges, and code generation LLMs are typically not evaluated on geospatial tasks. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. We categorised geospatial tasks based on their complexity and required tools. Then, we created a dataset with tasks that test model capabilities in spatial reasoning, spatial data processing, and geospatial tools usage. The dataset consists of specific coding problems that were manually created for high quality. For every problem, we proposed a set of test scenarios that make it possible to automatically check the generated code for correctness. In addition, we tested a selection of existing code generation LLMs for code generation in the geospatial domain. We share our dataset and reproducible evaluation code on a public GitHub repository, arguing that this can serve as an evaluation benchmark for new LLMs in the future. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy. These models will enable the creation of coding assistants tailored for geospatial applications.

[NLP-71] LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking

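[Quick read]: This paper addresses the fact that expert-driven multimodal misinformation detection is labor-intensive and hard to scale. The key to the solution is LRQ-Fact, a fully automated framework that uses Vision-Language Models (VLMs) and Large Language Models (LLMs) to generate comprehensive questions and answers, and then applies a rule-based decision-maker module to the original content and the generated question-answer pairs to assess overall veracity. Experiments show that LRQ-Fact improves detection accuracy for multimodal misinformation and generalizes across different model backbones.

Link: https://arxiv.org/abs/2410.04616
Authors: Alimohammad Beigi, Bohan Jiang, Dawei Li, Tharindu Kumarage, Zhen Tan, Pouya Shaeri, Huan Liu
Keywords: specialized domain knowledge, Human fact-checkers, formulate precise questions, verify information accuracy, fact-checkers have specialized
Categories: Computation and Language (cs.CL)
Comments: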

Abstract:Human fact-checkers have specialized domain knowledge that allows them to formulate precise questions to verify information accuracy. However, this expert-driven approach is labor-intensive and is not scalable, especially when dealing with complex multimodal misinformation. In this paper, we propose a fully-automated framework, LRQ-Fact, for multimodal fact-checking. Firstly, the framework leverages Vision-Language Models (VLMs) and Large Language Models (LLMs) to generate comprehensive questions and answers for probing multimodal content. Next, a rule-based decision-maker module evaluates both the original content and the generated questions and answers to assess the overall veracity. Extensive experiments on two benchmarks show that LRQ-Fact improves detection accuracy for multimodal misinformation. Moreover, we evaluate its generalizability across different model backbones, offering valuable insights for further refinement.

[NLP-72] Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

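[Quick read]: This paper addresses the weak performance of large language models (LLMs) on multi-turn dialogue tasks, especially those requiring long-term planning. The key to the solution is REgressing the RELative FUture (REFUEL), an efficient policy optimization method that estimates Q-values with a single model and trains on self-generated data, which addresses covariate shift in the training data. REFUEL frames multi-turn reinforcement learning from human feedback (RLHF) as a sequence of regression tasks, which simplifies implementation, and comes with a proof that it can match the performance of any policy covered by the training set. Experiments show that REFUEL outperforms state-of-the-art methods such as DPO and REBEL across settings, and that a smaller model fine-tuned with REFUEL beats a much larger model on long multi-turn dialogues.

Link: https://arxiv.org/abs/2410.04612
Authors: Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, REFUEL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: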

Abstract:Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate Q-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at this https URL, and models trained by REFUEL can be found at this https URL.

[NLP-73] ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks ACL

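[Quick read]: This paper addresses the automatic generation of scientific protocols executable by robots and proposes a flexible automatic evaluation framework, ProtocoLLM. The key to the solution is prompting large language models (LLMs) to produce pseudocode for scientific protocols and using the LLAM-EVAL method to compare the target model's output against pseudocode generated by GPT-4. Beyond flexibility in evaluation model, material, and criteria, the paper introduces the BIOPROT 2.0 dataset to support LLMs in formulating and evaluating scientific protocols.

Link: https://arxiv.org/abs/2410.04601
Authors: Seungjun Yi, Jaeyoung Lim, Juyong Yoon
Keywords: scientific research processes, significantly accelerate scientific, accelerate scientific research, scientific protocols executable, Automated generation
Categories: Computation and Language (cs.CL)
Comments: Submitted to 2024 ACL Rolling Review June Cycle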

Abstract:Automated generation of scientific protocols executable by robots can significantly accelerate scientific research processes. Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT), but the evaluation of their capabilities relies on human evaluation. Here, we propose a flexible, automatic framework to evaluate LLMs’ capability on SPFT: ProtocoLLM. This framework prompts the target model and GPT-4 to extract pseudocode from biology protocols using only predefined lab actions and evaluates the output of the target model using LLAM-EVAL, with the pseudocode generated by GPT-4 serving as a baseline and Llama-3 acting as the evaluator. Our adaptable prompt-based evaluation method, LLAM-EVAL, offers significant flexibility in terms of evaluation model, material, and criteria, and is free of cost. We evaluate GPT variations, Llama, Mixtral, Gemma, Cohere, and Gemini. Overall, we find that GPT and Cohere are powerful scientific protocol formulators. We also introduce BIOPROT 2.0, a dataset with biology protocols and corresponding pseudocodes, which can aid LLMs in formulation and evaluation of SPFT. Our work is extensible to assess LLMs on SPFT across various domains and other fields that require protocol generation for specific goals.

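[NLP-74] Towards the first UD Treebank of Spoken Italian: the KIParla forest

[Quick read]: This paper aims to enrich the linguistic resources available for Italian by building a Universal Dependencies treebank for the KIParla corpus, an existing and well-known resource for spoken Italian. The key to the solution is to build on the existing KIParla corpus and produce a dependency treebank that reflects the grammatical structure of spoken Italian, thereby improving natural language processing for Italian.

Link: https://arxiv.org/abs/2410.04589
Authors: Ludovica Pannitto
Keywords: Universal Dependencies treebank, present project endeavors, Universal Dependencies, constructing a Universal, Dependencies treebank
Categories: Computation and Language (cs.CL)
Comments: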

Abstract:The present project endeavors to enrich the linguistic resources available for Italian by constructing a Universal Dependencies treebank for the KIParla corpus (Mauri et al., 2019, Ballarè et al., 2020), an existing and well known resource for spoken Italian.

[NLP-75] Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval

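[Quick read]: This paper addresses hallucination and the lack of fine-grained medical knowledge in large language models (LLMs) used for clinical decision support, particularly in high-stakes applications such as clinical diagnosis. The key to the solution is KARE, a new framework that combines knowledge graph (KG) community-level retrieval with LLM reasoning to improve the accuracy and interpretability of healthcare predictions. Its core innovations are: 1) a dense medical knowledge structuring approach that enables accurate retrieval of relevant information; 2) a dynamic knowledge retrieval mechanism that enriches patient context with multi-faceted medical insights; and 3) a reasoning-enhanced prediction framework that uses the enriched context to produce accurate and interpretable clinical predictions. Experiments show that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission prediction.

Link: https://arxiv.org/abs/2410.04585
Authors: Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han
Keywords: Large language models, demonstrated significant potential, Large language, clinical decision support, decision support
Categories: Computation and Language (cs.CL)
Comments: under review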

Abstract:Large language models (LLMs) have demonstrated significant potential in clinical decision support. Yet LLMs still suffer from hallucinations and lack fine-grained contextual medical knowledge, limiting their high-stake healthcare applications such as clinical diagnosis. Traditional retrieval-augmented generation (RAG) methods attempt to address these limitations but frequently retrieve sparse or irrelevant information, undermining prediction accuracy. We introduce KARE, a novel framework that integrates knowledge graph (KG) community-level retrieval with LLM reasoning to enhance healthcare predictions. KARE constructs a comprehensive multi-source KG by integrating biomedical databases, clinical literature, and LLM-generated insights, and organizes it using hierarchical graph community detection and summarization for precise and contextually relevant information retrieval. Our key innovations include: (1) a dense medical knowledge structuring approach enabling accurate retrieval of relevant information; (2) a dynamic knowledge retrieval mechanism that enriches patient contexts with focused, multi-faceted medical insights; and (3) a reasoning-enhanced prediction framework that leverages these enriched contexts to produce both accurate and interpretable clinical predictions. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions. In addition to its impressive prediction accuracy, our framework leverages the reasoning capabilities of LLMs, enhancing the trustworthiness of clinical predictions.

[NLP-76] Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

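[Quick read]: This paper addresses imbalanced data distributions in multilingual settings, specifically how to train language models effectively when data sizes differ greatly between high- and low-resource languages. Through theoretical and empirical analysis, it shows that Temperature Sampling and loss upweighting (Scalarization) are equivalent under full gradient descent but not under stochastic gradient descent. The paper proposes Cooldown, a strategy that lowers the sampling temperature during training to speed up convergence while avoiding overfitting to low-resource languages; it is competitive with existing data re-weighting methods and computationally efficient.

Link: https://arxiv.org/abs/2410.04579
Authors: Tianjian Li, Haoran Xu, Weiting Tan, Dongwei Jiang, Kenton Murray, Daniel Khashabi
Keywords: face data scarcity, long-tail distribution, data scarcity, Data, low-resource languages
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages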

Abstract:Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges in training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that these two methods are equivalent under full gradient descent, but this equivalence breaks down with stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in gradient estimations, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting and offers computational efficiency.
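
To make the upsample-versus-upweight comparison concrete, the sketch below uses the common convention that temperature sampling draws language i with probability proportional to n_i^(1/tau), where n_i is its data size. It also computes the per-language loss weights that give the same expected gradient when data is sampled proportionally to size, which is the sense in which the two approaches coincide in expectation. The linear schedule in `cooldown_tau` is a hypothetical illustration of "reducing sampling temperature during training"; the abstract does not give the actual schedule, and all function names and default values are assumptions.

```python
def sampling_probs(sizes, tau):
    """Temperature sampling: p_i proportional to n_i ** (1 / tau)."""
    weights = [n ** (1.0 / tau) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]


def equivalent_loss_weights(sizes, tau):
    """Loss weights w_i = q_i / p_i, so that proportional sampling (p) with
    weighted losses matches temperature sampling (q) in expected gradient."""
    base = sampling_probs(sizes, 1.0)
    target = sampling_probs(sizes, tau)
    return [q / p for q, p in zip(target, base)]


def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    """Hypothetical linear decay of the sampling temperature over training."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)


sizes = [1_000_000, 10_000]  # high-resource vs. low-resource language
for step in (0, 50, 100):
    tau = cooldown_tau(step, 100)
    print(step, tau, [round(p, 4) for p in sampling_probs(sizes, tau)])
```

With a high temperature the low-resource language is sampled far more often than its raw share; as the temperature decays toward 1 the sampler returns to proportional sampling. The equivalence in `equivalent_loss_weights` holds only in expectation, and gradient variance under mini-batch training can differ between the two approaches, which is the gap the abstract analyzes.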

[NLP-77] How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? EMNLP2024

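[Quick read]: This paper asks how disclosing the level and type of AI assistance in the writing process affects people's evaluation of writing quality. Through an experimental study, it finds that disclosing AI assistance, especially when the AI generated new content, lowers average quality ratings and increases variation across individual raters. It also finds that a person's writing confidence and familiarity with AI writing assistants moderate this effect, and that disclosure may reduce the share of writings produced with AI content generation among the top-ranked writings.

Link: https://arxiv.org/abs/2410.04545
Authors: Zhuoyan Li, Chen Liang, Jing Peng, Ming Yin
Keywords: large language models, Recent advances, assistance, writing, advances in generative
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024. arXiv admin note: text overlap with arXiv:2403.12004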

Abstract:Recent advances in generative AI technologies like large language models have boosted the incorporation of AI assistance in writing workflows, leading to the rise of a new paradigm of human-AI co-creation in writing. To understand how people perceive writings that are produced under this paradigm, in this paper, we conduct an experimental study to understand whether and how the disclosure of the level and type of AI assistance in the writing process would affect people’s perceptions of the writing on various aspects, including their evaluation on the quality of the writing and their ranking of different writings. Our results suggest that disclosing the AI assistance in the writing process, especially if AI has provided assistance in generating new content, decreases the average quality ratings for both argumentative essays and creative stories. This decrease in the average quality ratings often comes with an increased level of variations in different individuals’ quality evaluations of the same writing. Indeed, factors such as an individual’s writing confidence and familiarity with AI writing assistants are shown to moderate the impact of AI assistance disclosure on their writing quality evaluations. We also find that disclosing the use of AI assistance may significantly reduce the proportion of writings produced with AI’s content generation assistance among the top-ranked writings.

[NLP-78] Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

[Quick read]: This paper addresses the shortage of speech datasets for Arabic dialects, with the goal of narrowing the technological divide and promoting technological and socioeconomic inclusion. The key to the solution is Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset covering eight dialects (Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni). The dataset includes annotations for transcription, gender, dialect, and code-switching, and the authors develop a number of strong baselines using it.

Link: https://arxiv.org/abs/2410.04527
Authors: Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, Hiba Zayed, Mohamedou cheikh tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mohamed, Abdellah El Mekki, El Moatez Billah Nagoudi, Benelhadj Djelloul Mama Saadia, Hamzah A. Alsayadi, Walid Al-Dhabyani, Sara Shatnawi, Yasir Ech-Chammakhy, Amal Makouar, Yousra Berrachedi, Mustafa Jarrar, Shady Shehata, Ismail Berrada, Muhammad Abdul-Mageed
Keywords: dialects remain uncovered, remain uncovered, recent progress, majority of world, world languages
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: this http URL.

[NLP-79] FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

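[Quick read]: This paper addresses the evaluation of multilingual multimodal question answering in the financial domain. The key to the solution is FAMMA, an open-source benchmark of 1,758 meticulously collected question-answer pairs spanning 8 subfields of finance, presented in a mixed format of text and heterogeneous image types. Evaluating a range of state-of-the-art multimodal large language models (MLLMs), the paper finds that even advanced systems such as GPT-4o and Claude-3.5-Sonnet reach only 42% accuracy, which shows that FAMMA is a significant challenge for current models. The paper also explores GPT o1-style reasoning chains to strengthen reasoning, which significantly improves error correction. The benchmark is intended to support future work on expert systems for financial QA.

Link: https://arxiv.org/abs/2410.04526
Authors: Siqiao Xue, Tingting Chen, Fan Zhou, Qingyang Dai, Zhixuan Chu, Hongyuan Mei
Keywords: multilingual multimodal question, financial multilingual multimodal, multimodal question answering, multilingual multimodal, introduce FAMMA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: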

Abstract:In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of multimodal large language models (MLLMs) in answering questions that require advanced financial knowledge and sophisticated reasoning. It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams, spanning 8 major subfields in finance including corporate finance, asset management, and financial engineering. Some of the QA pairs are written in Chinese or French, while a majority of them are in English. These questions are presented in a mixed format combining text and heterogeneous image types, such as charts, tables, and diagrams. We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models. Even advanced systems like GPT-4o and Claude-35-Sonnet achieve only 42% accuracy. Additionally, the open-source Qwen2-VL lags notably behind its proprietary counterparts. Lastly, we explore GPT o1-style reasoning chains to enhance the models’ reasoning capabilities, which significantly improve error correction. Our FAMMA benchmark will facilitate future research to develop expert systems in financial QA. The leaderboard is available at this https URL .

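[NLP-80] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning

[Quick read]: This paper addresses the drop in the security of large language models (LLMs) after Instruction Fine-Tuning (IFT), which occurs even when the fine-tuning data consists entirely of benign instructions (Benign IFT). The key to the solution is a new strategy called Modular Layer-wise Learning Rate (ML-LR). Through a module robustness analysis, it identifies a robust subset of modules (Mods_Robust) that matters for model security and assigns different learning rates to these modules and to the remaining modules during IFT. Experiments show that ML-LR significantly mitigates the increase in harmfulness after Benign IFT, with little impact on the model's usability or expertise.

Link: https://arxiv.org/abs/2410.04524
Authors: Yanrui Du,Sendong Zhao,Jiawei Cao,Ming Ma,Danyang Zhao,Fenglei Fan,Ting Liu,Bing Qin
Keywords: Large Language Models, base Large Language, adapting base Large, Language Models, Large Language
Categories: Computation and Language (cs.CL)
Comments: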

Abstract:Instruction Fine-Tuning (IFT) has become an essential method for adapting base Large Language Models (LLMs) into variants for professional and private use. However, researchers have raised concerns over a significant decrease in LLMs’ security following IFT, even when the IFT process involves entirely benign instructions (termed Benign IFT). Our study represents a pioneering effort to mitigate the security risks arising from Benign IFT. Specifically, we conduct a Module Robustness Analysis, aiming to investigate how LLMs’ internal modules contribute to their security. Based on our analysis, we propose a novel IFT strategy, called the Modular Layer-wise Learning Rate (ML-LR) strategy. In our analysis, we implement a simple security feature classifier that serves as a proxy to measure the robustness of modules (e.g., Q/K/V). Our findings reveal that the module robustness shows clear patterns, varying regularly with the module type and the layer depth. Leveraging these insights, we develop a proxy-guided search algorithm to identify a robust subset of modules, termed Mods_Robust. During IFT, the ML-LR strategy employs differentiated learning rates for Mods_Robust and the remaining modules. Our experimental results show that in security assessments, the application of our ML-LR strategy significantly mitigates the rise in harmfulness of LLMs following Benign IFT. Notably, our ML-LR strategy has little impact on the usability or expertise of LLMs following Benign IFT. Furthermore, we have conducted comprehensive analyses to verify the soundness and flexibility of our ML-LR strategy.
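
The core mechanic of ML-LR, giving one group of modules a different learning rate from the rest, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the module names, the membership of the robust subset, and both learning-rate values are hypothetical, and the abstract does not say which group receives the smaller rate.

```python
def build_param_groups(module_names, robust_modules, lr_robust, lr_rest):
    """Split modules into two groups that carry different learning rates.

    robust_modules stands in for Mods_Robust; how that set is found
    (the proxy-guided search) is not modelled here.
    """
    robust = [n for n in module_names if n in robust_modules]
    rest = [n for n in module_names if n not in robust_modules]
    return [
        {"name": "mods_robust", "modules": robust, "lr": lr_robust},
        {"name": "rest", "modules": rest, "lr": lr_rest},
    ]


# Hypothetical module names and learning rates, chosen only for the example.
names = ["layer0.q_proj", "layer0.k_proj", "layer0.v_proj", "layer1.q_proj"]
groups = build_param_groups(
    names,
    robust_modules={"layer0.q_proj", "layer1.q_proj"},
    lr_robust=1e-6,
    lr_rest=2e-5,
)
for group in groups:
    print(group["name"], group["lr"], group["modules"])
```

In a real training loop the same grouping would typically be expressed through the optimizer's per-parameter-group options rather than plain dictionaries of names.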

[NLP-81] RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference EMNLP2024

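[Quick read]: This paper addresses the efficiency problem that large language models (LLMs) face when serving highly concurrent customer queries, and in particular how to preserve classification performance under data multiplexing. The key is the RevMUX framework, which introduces a reversible design in the multiplexer so that the demultiplexer can reuse it to perform the reverse operations and restore individual samples for classification. This yields parameter-efficient data multiplexing while retaining strong classification performance.

Link: https://arxiv.org/abs/2410.04519
Authors: Yige Xu,Xu Guo,Zhiwei Zeng,Chunyan Miao
Keywords: Large language models, natural language processing, high throughput demands, handling concurrent customer, concurrent customer queries
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Main Conference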

Abstract:Large language models (LLMs) have brought a great breakthrough to the natural language processing (NLP) community, while leading the challenge of handling concurrent customer queries due to their high throughput demands. Data multiplexing addresses this by merging multiple inputs into a single composite input, allowing more efficient inference through a shared forward pass. However, as distinguishing individuals from a composite input is challenging, conventional methods typically require training the entire backbone, yet still suffer from performance degradation. In this paper, we introduce RevMUX, a parameter-efficient data multiplexing framework that incorporates a reversible design in the multiplexer, which can be reused by the demultiplexer to perform reverse operations and restore individual samples for classification. Extensive experiments on four datasets and three types of LLM backbones demonstrate the effectiveness of RevMUX for enhancing LLM inference efficiency while retaining a satisfactory classification performance.
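
To make the idea of a reversible multiplexer concrete, here is a minimal sketch of an additive coupling transform whose inverse reuses the same functions. This is a generic reversible-network construction swapped in for illustration, not RevMUX's actual architecture; the functions `f` and `g` and the two-input setup are assumptions, and the shared forward pass through the LLM that sits between multiplexing and demultiplexing is omitted.

```python
import math


def f(v):
    """Hypothetical transform applied to the second input."""
    return [math.tanh(a) for a in v]


def g(v):
    """Hypothetical transform applied to the first mixed output."""
    return [0.5 * a for a in v]


def mux(x1, x2):
    """Mix two inputs with additive coupling (forward direction)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def demux(y1, y2):
    """Undo mux by reusing f and g in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [1.0, -2.0], [0.5, 3.0]
r1, r2 = demux(*mux(x1, x2))
print(r1, r2)  # recovers x1 and x2 up to floating-point rounding
```

The point of the coupling form is that the inverse needs no extra parameters: the demultiplexer is fully determined by the multiplexer's own functions.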

[NLP-82] DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination EMNLP2024

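[Quick read]: This paper aims to reduce object hallucination in Large Vision-Language Models (LVLMs). The key is DAMRO, a training-free strategy built on an analysis of the attention mechanism of LVLMs. Specifically, DAMRO uses the classification token (CLS) of ViT to filter out high-attention outlier tokens in the background and eliminates their influence during decoding, which significantly reduces the interference of these tokens with the model output and effectively alleviates object hallucination.

Link: https://arxiv.org/abs/2410.04514
Authors: Xuan Gong,Tianshi Ming,Xinpeng Wang,Zhihua Wei
Keywords: Large Vision-Language Models, Large Language Model, Large Vision-Language, Vision-Language Models, Large Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP2024 (Main Conference)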

Abstract:Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of LLM decoder on image tokens is highly consistent with the visual encoder and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that Dive into Attention Mechanism of LVLM to Reduce Object Hallucination. Specifically, our approach employs classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminate their influence during decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.
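
The filtering step described above, picking out image tokens that receive unusually high attention from the CLS token, can be sketched as a simple top-k selection. This is only an illustration under stated assumptions: the attention values are made up, `top_k` is a hypothetical parameter, and the way DAMRO removes the influence of these tokens during decoding is not modelled.

```python
def outlier_token_indices(cls_attention, top_k):
    """Return the indices of the top_k image tokens by CLS attention."""
    ranked = sorted(
        range(len(cls_attention)),
        key=lambda i: cls_attention[i],
        reverse=True,
    )
    return sorted(ranked[:top_k])


def remaining_token_indices(cls_attention, top_k):
    """Return the indices left after the outlier tokens are filtered out."""
    outliers = set(outlier_token_indices(cls_attention, top_k))
    return [i for i in range(len(cls_attention)) if i not in outliers]


# Made-up CLS-to-patch attention weights for five image tokens.
attention = [0.01, 0.40, 0.02, 0.35, 0.03]
print(outlier_token_indices(attention, top_k=2))
print(remaining_token_indices(attention, top_k=2))
```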

[NLP-83] Realizing Video Summarization from the Path of Language-based Semantic Understanding

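[Quick read]: This paper addresses the limitations of relying on a single Video-based Large Language Model (VideoLLM) for video summarization and proposes a new summarization framework inspired by the Mixture of Experts (MoE) paradigm. The key is an inference-time algorithm that needs no fine-tuning: it integrates multiple VideoLLMs so that their strengths complement one another and produces comprehensive, coherent textual summaries. The method combines visual and audio content, provides detailed background descriptions, and identifies keyframes accurately, which makes retrieval more semantically meaningful than traditional computer vision methods that rely only on visual information. The resulting summaries also improve downstream tasks such as summary video generation.

Link: https://arxiv.org/abs/2410.04511
Authors: Kuan-Chen Mu,Zhi-Yi Chin,Wei-Chen Chiu
Keywords: Video-based Large Language, Large Language Models, Large Language, Video-based Large, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: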

Abstract:The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.

[NLP-84] ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

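[Quick read]: This paper addresses how to evaluate the error detection ability of Multimodal Large Language Models (MLLMs) in complex mathematical reasoning. The key is a new task, multimodal error detection, together with ErrorRadar, the first benchmark designed to assess this ability. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, which together give a comprehensive framework for assessing MLLMs on complex mathematical reasoning. The benchmark contains 2,500 high-quality multimodal K-12 mathematical problems collected from real-world student interactions, with rigorous annotation and rich metadata such as problem type and error category. Experiments show that GPT-4o performs best yet still trails human evaluation by about 10%, so significant challenges remain.

Link: https://arxiv.org/abs/2410.04509
Authors: Yibo Yan,Shen Wang,Jiahao Huo,Hang Li,Boyan Li,Jiamin Su,Xiong Gao,Yi-Fan Zhang,Tianlong Xu,Zhendong Chu,Aoxiao Zhong,Kun Wang,Hui Xiong,Philip S. Yu,Xuming Hu,Qingsong Wen
Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, revolutionize artificial intelligence
Categories: Computation and Language (cs.CL)
Comments: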

Abstract:As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs’ problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs’ capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation. The dataset will be available upon acceptance.

[NLP-85] LRHP: Learning Representations for Human Preferences via Preference Pairs

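[Quick read]: This paper addresses the problem that traditional reward modeling oversimplifies human preferences into a single numerical value, which complicates preference analysis and limits applications. The key is a new preference representation learning task that builds richer, more structured preference representations, together with a general framework named LRHP. The framework goes beyond traditional reward modeling and achieves strong performance on the downstream tasks of preference data selection and preference margin prediction, significantly outperforming baselines.

Link: https://arxiv.org/abs/2410.04503
Authors: Chenglong Wang,Yang Gan,Yifu Huo,Yongyu Mu,Qiaozhi He,Murun Yang,Tong Xiao,Chunliang Zhang,Tongran Liu,Jingbo Zhu
Keywords: human-preference alignment training, improve human-preference alignment, developed numerous preference, numerous preference datasets, preference datasets consisting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: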

Abstract:To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as “preferred” or “dispreferred”. These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing these human preferences as a numerical value complicates the analysis of these preferences and restricts their broader applications other than RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (namely LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building upon the human preferences in representations, we achieve strong performance in both tasks, significantly outperforming baselines.

[NLP-86] Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels

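[Quick read]: This paper addresses the automatic detection of suicidal content on social media platforms. The key is to prompt large language models (LLMs) to generate pseudo-labels for unlabeled data, and to build an ensemble that combines prompting with Qwen2-72B-Instruct and fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. Experiments show that the ensemble significantly improves detection accuracy, reaching weighted F1 scores of 0.770 on the public test set and 0.731 on the private test set, which indicates its potential for identifying suicidal content on social media.

Link: https://arxiv.org/abs/2410.04501
Authors: Vy Nguyen,Chau Pham
Keywords: suicidal thoughts highlights, Social media, increasing frequency, thoughts highlights, highlights the importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: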

Abstract:The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach involving prompting with Qwen2-72B-Instruct, and using fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves the detection accuracy, by 5 percentage points compared with the individual models. It achieves a weighted F1 score of 0.770 on the public test set, and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLMs affects the prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at this https URL.
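
The two quantities this entry turns on, an ensemble prediction and a weighted F1 score, can be sketched in a few lines. The voting rule below is an assumption: the abstract does not say how the ensemble combines its members, so plain majority voting is swapped in for illustration, and the labels are made-up toy data.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine equal-length label lists, one per model, by majority vote.

    Ties go to the label that appears first among the models' votes.
    """
    return [
        Counter(labels).most_common(1)[0][0]
        for labels in zip(*predictions_per_model)
    ]


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Toy labels and three hypothetical model outputs.
y_true = [1, 1, 0, 0]
model_a = [1, 0, 0, 0]
model_b = [1, 1, 0, 1]
model_c = [1, 1, 1, 0]
voted = majority_vote([model_a, model_b, model_c])
print(voted, weighted_f1(y_true, voted), weighted_f1(y_true, model_a))
```

In this toy case each model makes one mistake on a different post, so the vote corrects all three; real ensembles gain less when the members' errors are correlated.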

[NLP-87] Knowledge-Guided Dynamic Modality Attention Fusion Framework for Multimodal Sentiment Analysis EMNLP

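[Quick read]: This paper addresses the unbalanced contribution of modalities in multimodal sentiment analysis: earlier methods do not adjust modality weights dynamically when a particular modality becomes dominant in a given context. The key is the Knowledge-Guided Dynamic Modality Attention Fusion Framework (KuDA), which uses sentiment knowledge to select the dominant modality dynamically and adjust the contribution of each modality, and which further highlights the dominant modality through a correlation evaluation loss. This gives better performance across scenarios with different dominant modalities.

Link: https://arxiv.org/abs/2410.04491
Authors: Xinyu Feng,Yuming Lin,Lihua He,You Li,Liang Chang,Ya Zhou
Keywords: Multimodal Sentiment Analysis, utilizes multimodal data, Attention Fusion Framework, Sentiment Analysis, dominant modality
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted to EMNLP Findings 2024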

Abstract:Multimodal Sentiment Analysis (MSA) utilizes multimodal data to infer the users’ sentiment. Previous methods focus on equally treating the contribution of each modality or statically using text as the dominant modality to conduct interaction, which neglects the situation where each modality may become dominant. In this paper, we propose a Knowledge-Guided Dynamic Modality Attention Fusion Framework (KuDA) for multimodal sentiment analysis. KuDA uses sentiment knowledge to guide the model dynamically selecting the dominant modality and adjusting the contributions of each modality. In addition, with the obtained multimodal representation, the model can further highlight the contribution of dominant modality through the correlation evaluation loss. Extensive experiments on four MSA benchmark datasets indicate that KuDA achieves state-of-the-art performance and is able to adapt to different scenarios of dominant modality.
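
The notion of dynamically weighting modalities so that a dominant one contributes most can be shown with a generic softmax-weighted fusion. This is a sketch under stated assumptions rather than KuDA's design: the per-modality `dominance_scores` are hypothetical stand-ins for whatever the sentiment knowledge produces, the feature vectors are made up, and the correlation evaluation loss is not modelled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, dominance_scores):
    """Weight each modality's feature vector by its softmax-normalised score."""
    names = list(features)
    weights = dict(zip(names, softmax([dominance_scores[n] for n in names])))
    dim = len(features[names[0]])
    fused = [sum(weights[n] * features[n][i] for n in names) for i in range(dim)]
    return fused, weights


# Made-up features and scores; here audio is the dominant modality.
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "vision": [0.5, 0.5]}
scores = {"text": 0.2, "audio": 2.0, "vision": 0.1}
fused, weights = fuse_modalities(features, scores)
print(weights, fused)
```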

[NLP-88] A Pluggable Common Sense-Enhanced Framework for Knowledge Graph Completion

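[Quick read]: This paper addresses the problem that existing embedding-based knowledge graph completion (KGC) methods rely mainly on factual triples, which can produce results that contradict common sense. The key is a pluggable common sense-enhanced KGC framework that combines facts and common sense and automatically generates explicit or implicit common sense depending on how rich the entity concepts are. For knowledge graphs (KGs) with rich entity concepts, it introduces common sense-guided negative sampling and a coarse-to-fine inference approach; for KGs without concepts, it proposes a dual scoring scheme with a relation-aware concept embedding mechanism. The framework can be integrated as a plug-in module into many knowledge graph embedding (KGE) models for joint common sense and fact-driven training and inference, which improves performance and scalability on KGC tasks.

Link: https://arxiv.org/abs/2410.04488
Authors: Guanglin Niu,Bo Li,Siling Feng
Keywords: infer missing facts, knowledge-intensive applications, KGC, aim to infer, infer missing
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 7 figures, 9 tables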

Abstract:Knowledge graph completion (KGC) tasks aim to infer missing facts in a knowledge graph (KG) for many knowledge-intensive applications. However, existing embedding-based KGC approaches primarily rely on factual triples, potentially leading to outcomes inconsistent with common sense. Besides, generating explicit common sense is often impractical or costly for a KG. To address these challenges, we propose a pluggable common sense-enhanced KGC framework that incorporates both fact and common sense for KGC. This framework is adaptable to different KGs based on their entity concept richness and has the capability to automatically generate explicit or implicit common sense from factual triples. Furthermore, we introduce common sense-guided negative sampling and a coarse-to-fine inference approach for KGs with rich entity concepts. For KGs without concepts, we propose a dual scoring scheme involving a relation-aware concept embedding mechanism. Importantly, our approach can be integrated as a pluggable module for many knowledge graph embedding (KGE) models, facilitating joint common sense and fact-driven training and inference. The experiments illustrate that our framework exhibits good scalability and outperforms existing models across various KGC tasks.

[NLP-89] Fine-Grained Prediction of Reading Comprehension from Eye Movements EMNLP

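[Quick read]: This paper addresses whether human reading comprehension can be assessed from eye movements during reading. The key is to use large-scale eye-tracking data and multimodal language models to predict comprehension at the level of a single question over a passage. The study introduces three new multimodal language models alongside a set of prior models from the literature and evaluates how well they generalize to new texts and new participants under two reading regimes, ordinary reading and information seeking. The results suggest that, although the task is highly challenging, eye movements carry useful signal for fine-grained prediction of reading comprehension.

Link: https://arxiv.org/abs/2410.04484
Authors: Omer Shubi,Yoav Meiri,Cfir Avraham Hadar,Yevgeni Berzak
Keywords: human reading comprehension, reading comprehension, eye movements, reading, comprehension
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP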

Abstract:Can human reading comprehension be assessed from eye movements in reading? In this work, we address this longstanding question using large-scale eyetracking data over textual materials that are geared towards behavioral analyses of reading comprehension. We focus on a fine-grained and largely unaddressed task of predicting reading comprehension from eye movements at the level of a single question over a passage. We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature. We evaluate the models’ ability to generalize to new textual items, new participants, and the combination of both, in two different reading regimes, ordinary reading and information seeking. The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension. Code and data will be made publicly available.

[NLP-90] Configurable Multilingual ASR with Speech Summary Representations

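[Quick read]: This paper addresses a challenge in multilingual automatic speech recognition (MASR): deploying multiple monolingual models is difficult when the target language is not known in advance. The key is a new architecture, the Configurable MASR model with Summary Vector (csvMASR), which uses adapters and speech summary vector representations to combine the outputs of language-specific components at the utterance level and thereby improves configurability. The paper also adds an auxiliary language classification loss to further improve configurability. Experiments on the Multilingual Librispeech dataset show that csvMASR outperforms existing MASR models, lowers the word error rate (WER), and performs better on language classification and prompting tasks.

Link: https://arxiv.org/abs/2410.04478
Authors: Harrison Zhu,Ivan Fung,Yingke Zhu,Lahiru Samarakoon
Keywords: making multilingual ASR, Approximately half, multilingual ASR, world population, configurable multilingual MASR
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: A preprint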

Abstract:Approximately half of the world’s population is multilingual, making multilingual ASR (MASR) essential. Deploying multiple monolingual models is challenging when the ground-truth language is unknown in advance. This motivates research efforts on configurable multilingual MASR models that can be prompted manually or adapted automatically to recognise specific languages. In this paper, we present the Configurable MASR model with Summary Vector (csvMASR), a novel architecture designed to enhance configurability. Our approach leverages adapters and introduces speech summary vector representations, inspired by conversational summary representations in speech diarization, to combine outputs from language-specific components at the utterance level. We also incorporate an auxiliary language classification loss to enhance configurability. Using data from 7 languages in the Multilingual Librispeech (MLS) dataset, csvMASR outperforms existing MASR models and reduces the word error rate (WER) from 10.33% to 9.95% when compared with the baseline. Additionally, csvMASR demonstrates superior performance in language classification and prompting tasks.
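
For readers less familiar with the metric, the sketch below shows how word error rate is commonly computed (word-level edit distance divided by reference length) and what the reported change from 10.33% to 9.95% amounts to in relative terms. The example sentences are made up, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which the improved value is lower than the baseline."""
    return (baseline - improved) / baseline


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
reduction = relative_reduction(10.33, 9.95)
print(f"WER = {wer:.3f}, relative reduction = {reduction:.2%}")
```

The absolute gain reported in the abstract is 0.38 percentage points, which corresponds to a relative WER reduction of roughly 3.7%.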

[NLP-91] Collapsed Language Models Promote Fairness

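[Quick read]: This paper addresses the societal biases implicitly encoded in pretrained language models and proposes a fairness improvement method based on the Neural Collapse phenomenon. By evaluating this learning phenomenon in the last-layer representations and classifiers on fairness-related words, the paper finds that debiased language models exhibit collapsed alignment between token representations and word embeddings. Building on this observation, it designs a principled fine-tuning method that improves fairness across a wide range of debiasing methods while preserving performance on standard natural language understanding tasks.

Link: https://arxiv.org/abs/2410.04472
Authors: Jingxuan Xu,Wuyang Chen,Linyi Li,Yao Zhao,Yunchao Wei
Keywords: mitigate societal biases, societal biases implicitly, biases implicitly encoded, recent successful pretrained, successful pretrained language
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: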

Abstract:To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches have been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite the development, it is nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, by rigorous evaluations of Neural Collapse, a learning phenomenon that happens in last-layer representations and classifiers in deep networks, on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. We attach our code at this https URL.

[NLP-92] Revisiting In-context Learning Inference Circuit in Large Language Models ICLR2025

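[Quick read]: This paper addresses the inner mechanism of In-context Learning (ICL) in large language models, and in particular how to explain the ICL inference process comprehensively. The key is a comprehensive inference circuit that divides ICL inference into three major operations: (1) Summarize, which encodes each input text into a linear representation in the hidden states; (2) Semantics Merge, which merges the encoded representations of demonstrations with their corresponding labels; (3) Feature Retrieval and Copy, which searches a task subspace for joint representations similar to the query and copies them into the query. The circuit captures many phenomena observed during ICL, and ablation analysis confirms its importance, suggesting that it is the dominating mechanism behind ICL performance.

Link: https://arxiv.org/abs/2410.04468
Authors: Hakaze Cho,Mariko Kato,Yoshihiro Sakai,Naoya Inoue
Keywords: emerging few-shot learning, few-shot learning paradigm, In-context Learning, ICL, few-shot learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 37 figures, 6 tables, ICLR 2025 under review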

Abstract:In-context Learning (ICL) is an emerging few-shot learning paradigm on Language Models (LMs) with inner mechanisms un-explored. There are already existing works describing the inner processing of ICL, while they struggle to capture all the inference phenomena in large language models. Therefore, this paper proposes a comprehensive circuit to model the inference dynamics and try to explain the observed phenomena of ICL. In detail, we divide ICL inference into 3 major operations: (1) Summarize: LMs encode every input text (demonstrations and queries) into linear representation in the hidden states with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge the encoded representations of demonstrations with their corresponding label tokens to produce joint representations of labels and demonstrations. (3) Feature Retrieval and Copy: LMs search the joint representations similar to the query representation on a task subspace, and copy the searched representations into the query. Then, language model heads capture these copied label representations to a certain extent and decode them into predicted labels. The proposed inference circuit successfully captured many phenomena observed during the ICL process, making it a comprehensive and practical explanation of the ICL inference process. Moreover, ablation analysis by disabling the proposed steps seriously damages the ICL performance, suggesting the proposed inference circuit is a dominating mechanism. Additionally, we confirm and list some bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.

[NLP-93] Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information EMNLP2024

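[Quick read]: This paper addresses two critical issues in current Chain-of-Thought (CoT) methods for large language models (LLMs) that iteratively verify and refine reasoning: reliance on a single verification method, and ignoring wrong information from earlier attempts. The proposed Wrong-of-Thought (WoT) has two core modules. Multi-Perspective Verification checks the reasoning process and result from several perspectives so they can be refined accurately; Wrong Information Utilization feeds previous wrong information back to the LLM as a warning, reducing the probability of repeating the same mistakes. Experiments on 8 datasets and 5 LLMs show that WoT surpasses all previous baselines and is particularly strong on difficult computation tasks.

Link: https://arxiv.org/abs/2410.04463
Authors: Yongheng Zhang, Qiguang Chen, Jingxuan Zhou, Peng Wang, Jiasheng Si, Jin Wang, Wenpeng Lu, Libo Qin
Keywords: Large Language Models, Language Models, Large Language, attracting increasing attention, performance of Large
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 Findings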

Abstract:Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs), attracting increasing attention from researchers. One stream of approaches focuses on the iterative enhancement of LLMs by continuously verifying and refining their reasoning outputs for desired quality. Despite its impressive results, this paradigm faces two critical issues: (1) Simple verification methods: The current paradigm relies solely on a single verification method. (2) Wrong Information Ignorance: Traditional paradigms directly ignore wrong information during reasoning and refine the logic paths from scratch each time. To address these challenges, we propose Wrong-of-Thought (WoT), which includes two core modules: (1) Multi-Perspective Verification: A multi-perspective verification method for accurately refining the reasoning process and result, and (2) Wrong Information Utilization: Utilizing wrong information to alert LLMs and reduce the probability of LLMs making same mistakes. Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines. In addition, WoT exhibits powerful capabilities in difficult computation tasks.
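
The verify-and-refine loop described above can be sketched as follows. This is a minimal illustration, not the WoT implementation: the solver is a deterministic stub standing in for an LLM call, and the two verifiers are hypothetical checks invented for a toy addition task.

```python
def wrong_of_thought(question, solve, verifiers, max_rounds=3):
    """Retry loop that keeps rejected answers and passes them back to the solver."""
    wrong_answers = []                      # wrong information collected so far
    for _ in range(max_rounds):
        answer = solve(question, wrong_answers)
        # Multi-perspective verification: every verifier must accept the answer.
        if all(verify(question, answer) for verify in verifiers):
            return answer, wrong_answers
        # Wrong information utilization: remember the rejected answer.
        wrong_answers.append(answer)
    return None, wrong_answers

def toy_solver(question, wrong_answers):
    """Stub for an LLM: its first guess drops a carry on purpose."""
    a, b = question
    for candidate in (a + b - 10, a + b):
        if candidate not in wrong_answers:
            return candidate
    return a + b

def check_by_subtraction(question, answer):
    a, b = question
    return answer - b == a

def check_last_digit(question, answer):
    a, b = question
    return answer % 10 == (a + b) % 10

answer, rejected = wrong_of_thought(
    (17, 25), toy_solver, [check_by_subtraction, check_last_digit])
print(answer, rejected)
```

The point of the design is that a rejected answer is not discarded: it is handed to the next attempt so the same mistake is not repeated.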

[NLP-94] SWEb: A Large Web Dataset for the Scandinavian Languages

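[Quick read]: This paper addresses the shortage of large-scale pretraining data for the Scandinavian languages by presenting SWEb, the largest such pretraining dataset to date, with over one trillion tokens. The key contribution is a model-based text extractor that significantly reduces complexity compared with rule-based approaches. The paper also introduces a new cloze-style benchmark for evaluating Swedish language models and uses it to compare models trained on SWEb with models trained on FineWeb, with competitive results.

Link: https://arxiv.org/abs/2410.04456
Authors: Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren
Keywords: hitherto largest pretraining, largest pretraining dataset, Scandinavian WEb, Scandinavian languages, trillion tokens
Categories: Computation and Language (cs.CL)
Comments: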

Abstract:This paper presents the hitherto largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb), comprising over one trillion tokens. The paper details the collection and processing pipeline, and introduces a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches. We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results. All data, models and code are shared openly.

[NLP-95] CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs

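[Quick read]: This paper addresses the copyright concerns raised when large language models (LLMs) generate text, and specifically how to assess the influence of pretraining datasets on model outputs. It introduces CopyLens, a framework that analyzes how copyrighted datasets may influence LLM responses in two stages: first, token representations are fused for potentially copyrighted texts, based on the uniqueness of pretraining data in the embedding space; then a lightweight LSTM-based network analyzes dataset contributions. Building on this, a contrastive-learning-based non-copyright out-of-distribution (OOD) detector is designed. Experiments show that CopyLens improves efficiency and accuracy by 15.2% over the authors' proposed baseline and by 58.7% over prompt engineering methods, and improves AUC by 0.21 over OOD detection baselines.

Link: https://arxiv.org/abs/2410.04454
Authors: Qichao Ma, Rui-Jie Zhu, Peiye Liu, Renye Yan, Fahong Zhang, Ling Liang, Meng Li, Zhaofei Yu, Zongwei Wang, Yimao Cai, Tiejun Huang
Keywords: Large Language Models, Large Language, Language Models, text-generation capabilities, pervasive due
Categories: Computation and Language (cs.CL)
Comments: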

Abstract:Large Language Models (LLMs) have become pervasive due to their knowledge absorption and text-generation capabilities. Concurrently, the copyright issue for pretraining datasets has been a pressing concern, particularly when generation includes specific styles. Previous methods either focus on the defense of identical copyrighted outputs or find interpretability by individual tokens with computational burdens. However, the gap between them exists, where direct assessments of how dataset contributions impact LLM outputs are missing. Once the model providers ensure copyright protection for data holders, a more mature LLM community can be established. To address these limitations, we introduce CopyLens, a new framework to analyze how copyrighted datasets may influence LLM responses. Specifically, a two-stage approach is employed: First, based on the uniqueness of pretraining data in the embedding space, token representations are initially fused for potential copyrighted texts, followed by a lightweight LSTM-based network to analyze dataset contributions. With such a prior, a contrastive-learning-based non-copyright OOD detector is designed. Our framework can dynamically face different situations and bridge the gap between current copyright detection methods. Experiments show that CopyLens improves efficiency and accuracy by 15.2% over our proposed baseline, 58.7% over prompt engineering methods, and 0.21 AUC over OOD detection baselines.

[NLP-96] MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems ECAI2024

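[Quick read]: This paper addresses the incomplete detection capabilities and limited range of detectable bias types in current methods for detecting cognitive biases in large language models (LLMs). It introduces the MindScope dataset, which combines static and dynamic elements. The static component contains 5,170 open-ended questions spanning 72 cognitive bias categories; the dynamic component uses a rule-based multi-agent communication framework to generate multi-round dialogues, and the framework is flexible and readily adaptable to psychological experiments involving LLMs. The paper also proposes a multi-agent detection method that integrates Retrieval-Augmented Generation (RAG), competitive debate, and a reinforcement-learning-based decision module, improving detection accuracy by as much as 35.10% compared with GPT-4.

Link: https://arxiv.org/abs/2410.04452
Authors: Zhentao Xie, Jiabao Zhao, Yilei Wang, Jinxin Shi, Yanhong Bai, Xingjiao Wu, Liang He
Keywords: Detecting cognitive biases, existing cognitive biases, large language models, cognitive biases, Detecting cognitive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures. Our paper has been accepted for presentation at the 2024 European Conference on Artificial Intelligence (ECAI 2024)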

Abstract:Detecting cognitive biases in large language models (LLMs) is a fascinating task that aims to probe the existing cognitive biases within these models. Current methods for detecting cognitive biases in language models generally suffer from incomplete detection capabilities and a restricted range of detectable bias types. To address this issue, we introduced the ‘MindScope’ dataset, which distinctively integrates static and dynamic elements. The static component comprises 5,170 open-ended questions spanning 72 cognitive bias categories. The dynamic component leverages a rule-based, multi-agent communication framework to facilitate the generation of multi-round dialogues. This framework is flexible and readily adaptable for various psychological experiments involving LLMs. In addition, we introduce a multi-agent detection method applicable to a wide range of detection tasks, which integrates Retrieval-Augmented Generation (RAG), competitive debate, and a reinforcement learning-based decision module. Demonstrating substantial effectiveness, this method has shown to improve detection accuracy by as much as 35.10% compared to GPT-4. Codes and appendix are available at this https URL.

[NLP-97] CAPEEN: Image Captioning with Early Exits and Knowledge Distillation EMNLP

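[Quick read]: This paper addresses the growing computational burden and inference latency of deep neural networks for image captioning. It introduces CAPEEN, which uses knowledge distillation to improve Early Exit (EE) strategies: inference finishes at an intermediate layer as soon as the prediction confidence exceeds a predefined threshold learned from the training data. For real-world deployments where the target distribution may drift from the training samples, the variant A-CAPEEN adapts the thresholds on the fly using a multi-armed bandit framework. On MS COCO and Flickr30k, CAPEEN achieves a 1.77x speedup while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions.

Link: https://arxiv.org/abs/2410.04433
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Keywords: Deep neural networks, made significant progress, recognizing visual elements, generating descriptive text, Deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in EMNLP (finding) 2024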

Abstract:Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes from increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning as it requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN to improve the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions could drift from that of training samples, we introduce a variant A-CAPEEN to adapt the thresholds on the fly using Multiarmed bandits framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN gains speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at this https URL
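
The confidence-thresholded exit rule described above can be sketched in a few lines. This is an illustration under assumptions, not CAPEEN itself: the per-exit probability lists are hand-written stand-ins for the outputs of exit classifiers, and the function name and threshold values are hypothetical.

```python
def early_exit_predict(exit_probs, threshold):
    """Return (predicted_class, exit_depth) for one input.

    exit_probs: list of probability lists, one per exit, ordered shallow to deep.
    Inference stops at the first exit whose top probability reaches the
    threshold; the final exit always answers if no earlier one is confident.
    """
    if not exit_probs:
        raise ValueError("need at least one exit")
    for depth, probs in enumerate(exit_probs, start=1):
        confidence = max(probs)
        is_last = depth == len(exit_probs)
        if confidence >= threshold or is_last:
            return probs.index(confidence), depth

# Hypothetical outputs of three exits for one input.
exits = [[0.40, 0.60], [0.20, 0.80], [0.05, 0.95]]
print(early_exit_predict(exits, threshold=0.75))
print(early_exit_predict(exits, threshold=0.99))
```

A lower threshold exits earlier and saves computation at some risk to accuracy, which is why the choice of threshold matters when the input distribution shifts.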

[NLP-98] DAdEE: Unsupervised Domain Adaptation in Early Exit PLMs EMNLP

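[Quick read]: This paper addresses two problems with pre-trained language models (PLMs): high inference latency caused by their large size, and the poor generalization of Early Exit (EE) strategies under domain shift. It proposes DADEE, an unsupervised domain adaptation framework for early-exit models that applies multi-level adaptation using knowledge distillation and GAN-based adversarial adaptation at each layer to obtain domain-invariant representations, reducing the gap between source and target domains across all layers. The attached exits not only speed up inference but also improve domain adaptation by reducing catastrophic forgetting and mode collapse, making the approach better suited to real-world scenarios.

Link: https://arxiv.org/abs/2410.04424
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Keywords: Pre-trained Language Models, exhibit good accuracy, large size results, Pre-trained Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in EMNLP (findings) 2024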

Abstract:Pre-trained Language Models (PLMs) exhibit good accuracy and generalization ability across various tasks using self-supervision, but their large size results in high inference latency. Early Exit (EE) strategies handle the issue by allowing the samples to exit from classifiers attached to the intermediary layers, but they do not generalize well, as exit classifiers can be sensitive to domain changes. To address this, we propose Unsupervised Domain Adaptation in EE framework (DADEE) that employs multi-level adaptation using knowledge distillation. DADEE utilizes GAN-based adversarial adaptation at each layer to achieve domain-invariant representations, reducing the domain gap between the source and target domain across all layers. The attached exits not only speed up inference but also enhance domain adaptation by reducing catastrophic forgetting and mode collapse, making it more suitable for real-world scenarios. Experiments on tasks such as sentiment analysis, entailment classification, and natural language inference demonstrate that DADEE consistently outperforms not only early exit methods but also various domain adaptation methods under domain shift scenarios. The anonymized source code is available at this https URL.

[NLP-99] Hyper-multi-step: The Truth Behind Difficult Long-context Tasks

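[Quick read]: This paper investigates why long-context language models (LCLMs) struggle with challenging long-context tasks, and where that difficulty comes from. It finds that the difficulty stems primarily from two basic issues: "multi-matching retrieval", which requires retrieving multiple items simultaneously, and "logic-based retrieval", which requires logical judgment within the retrieval criteria. Both are hyper-multi-step in nature, meaning they demand numerous steps to solve, which exceeds the capabilities of current LCLMs. The paper supports this with experiments and argues that the finding gives a more accurate basis for rethinking solutions to advanced long-context tasks.

Link: https://arxiv.org/abs/2410.04422
Authors: Yijiong Yu
Keywords: extensive context window, Long-context language models, language models, context window, increasingly popular
Categories: Computation and Language (cs.CL)
Comments: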

Abstract:Long-context language models (LCLM), characterized by their extensive context window, is becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of various challenging long-context tasks have seldom been studied. To bridge this gap, we conduct experiments to indicate their difficulty stems primarily from two basic issues: “multi-matching retrieval,” which requires the simultaneous retrieval of multiple items, and “logic-based retrieval,” which necessitates logical judgment within retrieval criteria. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding could explain why LLMs struggle with more advanced long-context tasks, providing a more accurate perspective for rethinking solutions for them.
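
To make the two retrieval types concrete, here is a small sketch over a key-value context. The records, the predicates, and the step counter are illustrative assumptions and are not taken from the paper's benchmarks; the counter only shows that an explicit solution performs one judgment per record, so the work grows with the context length.

```python
def scan(records, predicate):
    """Return the matching keys and the number of elementary checks performed."""
    matches, steps = [], 0
    for key, value in records:
        steps += 1                      # one judgment per record
        if predicate(value):
            matches.append(key)
    return matches, steps

# Hypothetical context of key-value pairs.
records = [("k1", 7), ("k2", 3), ("k3", 7), ("k4", 9)]

# Multi-matching retrieval: find every key whose value equals 7.
multi_matches, multi_steps = scan(records, lambda v: v == 7)

# Logic-based retrieval: find every key whose value lies strictly between 5 and 10.
logic_matches, logic_steps = scan(records, lambda v: 5 < v < 10)

print(multi_matches, multi_steps)
print(logic_matches, logic_steps)
```

Unlike a single-needle lookup, neither task can stop at the first hit: every record has to be examined before the answer is complete.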

[NLP-100] Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History

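[Quick read]: This paper addresses the high computing resource requirements, long running times, and repeated work involved in processing the Wikipedia Revision History (WikiRevHist) dataset. It proposes Blocks Architecture (BloArk), an efficiency-focused data processing architecture built from three infrastructure parts (blocks, segments, and warehouses) and a core data processing pipeline (builder and modifier). The builder converts the original WikiRevHist data from XML into JSON Lines (JSONL) format to improve concurrency and storage efficiency; the modifier applies incremental modifications to previously built warehouses, improving the utilization of existing databases and reducing the cost of reusing others' work. As a result, BloArk scales easily both for processing Wikipedia revision history and for incrementally modifying existing datasets for downstream NLP use cases.

Link: https://arxiv.org/abs/2410.04410
Authors: Lingxi Li, Zonghai Yao, Sunjae Kwon, Hong Yu
Keywords: Wikipedia Revision History, natural language processing, Wikipedia Revision, processing Wikipedia Revision, natural language
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures; for package documentation and usage examples, see this https URL and this https URL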

Abstract:Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be valuable resources for NLP applications. However, there are insufficient tools available to process WikiRevHist without having substantial computing resources, making additional customization, and spending extra time adapting others’ works. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated works in processing WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format for improving the concurrent and storage efficiency. The BloArk modifier takes previously-built warehouses to operate incremental modifications for improving the utilization of existing databases and reducing the cost of reusing others’ works. In the end, BloArk can scale up easily in both processing Wikipedia Revision History and incrementally modifying existing dataset for downstream NLP use cases. The source code, documentations, and example usages are publicly available online and open-sourced under GPL-2.0 license.
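
The builder's XML-to-JSONL step can be illustrated with the Python standard library alone. This sketch does not use the BloArk package or its API: the function name, the record fields, and the simplified sample XML (which omits the namespaces and many fields of a real Wikipedia dump) are all assumptions made for illustration.

```python
import io
import json
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a revision-history export.
SAMPLE_XML = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><timestamp>2024-01-01T00:00:00Z</timestamp><text>first draft</text></revision>
    <revision><id>2</id><timestamp>2024-01-02T00:00:00Z</timestamp><text>second draft</text></revision>
  </page>
</mediawiki>"""

def revisions_to_jsonl(xml_stream, out_stream):
    """Stream <page> elements and write one JSON object per revision."""
    count = 0
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title")
        for rev in elem.findall("revision"):
            record = {
                "title": title,
                "revision_id": int(rev.findtext("id")),
                "timestamp": rev.findtext("timestamp"),
                "text": rev.findtext("text") or "",
            }
            out_stream.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
        elem.clear()                    # release the processed page's children
    return count

out = io.StringIO()
n = revisions_to_jsonl(io.BytesIO(SAMPLE_XML.encode("utf-8")), out)
lines = out.getvalue().splitlines()
print(n, lines[0])
```

Because each JSONL line is an independent record, a file can be split at line boundaries and processed by several workers without parsing a shared XML tree, which is one plausible reading of the concurrency benefit the abstract mentions.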

[NLP-101] Lens: Rethinking Multilingual Enhancement for Large Language Models

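[Quick read]: This paper addresses the performance gap across languages in current large language models (LLMs), most of which remain English-centric and perform worse on other languages. It proposes Lens, which enhances multilingual capabilities by manipulating the language representation spaces inside the LLM. Lens works with the model's language-agnostic and language-specific subspaces: in the language-agnostic subspace, the target language is drawn closer to the central language (for example, English) so that it inherits well-established semantic representations; in the language-specific subspace, the representations of the target and central languages are pushed apart so that the target language can express itself distinctly. Lens improves multilingual performance without sacrificing central-language capabilities and needs far fewer computational resources than existing data-driven post-training methods.

Link: https://arxiv.org/abs/2410.04407
Authors: Weixiang Zhao, Yulin Hu, Jiahe Guo, Xingyu Sui, Tongtong Wu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Ting Liu
Keywords: diverse linguistic backgrounds, growing global demand, remain predominantly English-centric, cutting-edge LLMs remain, LLMs remain predominantly
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures, 5 tables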

Abstract:Despite the growing global demand for large language models (LLMs) that serve users from diverse linguistic backgrounds, most cutting-edge LLMs remain predominantly English-centric. This creates a performance gap across languages, restricting access to advanced AI services for non-English speakers. Current methods to enhance multilingual capabilities largely rely on data-driven post-training techniques, such as multilingual instruction tuning or continual pre-training. However, these approaches encounter significant challenges, including the scarcity of high-quality multilingual datasets and the limited enhancement of multilingual capabilities. They often suffer from off-target issues and catastrophic forgetting of central language abilities. To this end, we propose Lens, a novel approach to enhance multilingual capabilities of LLMs by leveraging their internal language representation spaces. Specially, Lens operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs. Using the central language as a pivot, the target language is drawn closer to it within the language-agnostic subspace, allowing it to inherit well-established semantic representations. Meanwhile, in the language-specific subspace, the representations of the target and central languages are pushed apart, enabling the target language to express itself distinctly. Extensive experiments on one English-centric and two multilingual LLMs demonstrate that Lens effectively improves multilingual performance without sacrificing the original central language capabilities of the backbone model, achieving superior results with much fewer computational resources compared to existing post-training approaches.
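
The pull-together and push-apart idea can be written as a toy objective. The abstract does not state the actual loss, so everything below is an assumption: the squared-distance terms, the `push_weight` parameter, the axis-aligned orthonormal bases, and the function names are hypothetical and serve only to show the two subspaces being treated in opposite ways.

```python
def project(vec, basis):
    """Coordinates of vec in an orthonormal basis (a list of basis vectors)."""
    return [sum(v * b for v, b in zip(vec, axis)) for axis in basis]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lens_style_loss(h_target, h_central, agnostic_basis, specific_basis,
                    push_weight=1.0):
    # Pull: target and central representations should agree in the
    # language-agnostic subspace (smaller distance lowers the loss).
    pull = sq_dist(project(h_target, agnostic_basis),
                   project(h_central, agnostic_basis))
    # Push: they should differ in the language-specific subspace
    # (larger distance lowers the loss).
    push = sq_dist(project(h_target, specific_basis),
                   project(h_central, specific_basis))
    return pull - push_weight * push

# Hypothetical 3-d hidden states: first two axes agnostic, last axis specific.
agnostic = [[1, 0, 0], [0, 1, 0]]
specific = [[0, 0, 1]]
h_central = [1, 2, 0]
loss_aligned = lens_style_loss([1, 2, 3], h_central, agnostic, specific)
loss_misaligned = lens_style_loss([0, 0, 0], h_central, agnostic, specific)
print(loss_aligned, loss_misaligned)
```

The push term in this sketch is unbounded, so a practical objective would need some cap or margin on it; that detail is left open here.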

[NLP-102] CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text

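[Quick read]: This paper addresses how to make effective use of a paper's main text when predicting its future citation count with machine learning models. It proposes CiMaTe, a BERT-based model that leverages the main text by explicitly capturing the paper's sectional structure. In experiments on papers from computational linguistics and biology, CiMaTe outperforms previous methods in Spearman's rank correlation coefficient by 5.1 points and 1.8 points, respectively.

Link: https://arxiv.org/abs/2410.04404
Authors: Jun Hirako, Ryohei Sasano, Koichi Takeda
Keywords: citation count prediction, future citation counts, find interesting papers, count prediction, count prediction model
Categories: Computation and Language (cs.CL)
Comments: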

Abstract:Prediction of the future citation counts of papers is increasingly important to find interesting papers among an ever-growing number of papers. Although a paper’s main text is an important factor for citation count prediction, it is difficult to handle in machine learning models because the main text is typically very long; thus previous studies have not fully explored how to leverage it. In this paper, we propose a BERT-based citation count prediction model, called CiMaTe, that leverages the main text by explicitly capturing a paper’s sectional structure. Through experiments with papers from computational linguistics and biology domains, we demonstrate the CiMaTe’s effectiveness, outperforming the previous methods in Spearman’s rank correlation coefficient; 5.1 points in the computational linguistics domain and 1.8 points in the biology domain.
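
Since the results are reported in Spearman's rank correlation coefficient, a short worked example of that metric may help. The sketch below ranks both lists (averaging ranks for ties) and computes the Pearson correlation of the ranks; the predicted scores and citation counts are made-up numbers, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1           # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical model scores and true citation counts for four papers.
predicted = [0.10, 0.40, 0.35, 0.80]
actual = [3, 10, 12, 50]
rho = spearman(predicted, actual)
print(round(rho, 3))
```

Only the ordering matters for this metric: the model is rewarded for ranking papers correctly, not for predicting the exact citation counts.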

[NLP-103] Suspiciousness of Adversarial Texts to Human

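[Quick read]: This paper studies "human suspiciousness" in adversarial text generation: how to produce adversarial texts that deceive NLP systems without arousing the suspicion of human readers. It analyzes how people perceive adversarial texts, builds a regression-based model to quantify suspiciousness, and uses the regressor's scores to guide adversarial generation so that the resulting texts are less likely to be perceived as computer-generated. The authors also release a new dataset of Likert-scale human evaluations of the suspiciousness of adversarial sentences, providing a baseline for future research.

Link: https://arxiv.org/abs/2410.04377
Authors: Shakila Mahjabin Tonni, Pedro Faustini, Mark Dras
Keywords: deep neural networks, meticulously altered inputs, degrade model performance, Adversarial, neural networks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under review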

Abstract:Adversarial examples pose a significant challenge to deep neural networks (DNNs) across both image and text domains, with the intent to degrade model performance through meticulously altered inputs. Adversarial texts, however, are distinct from adversarial images due to their requirement for semantic similarity and the discrete nature of the textual contents. This study delves into the concept of human suspiciousness, a quality distinct from the traditional focus on imperceptibility found in image-based adversarial examples. Unlike images, where adversarial changes are meant to be indistinguishable to the human eye, textual adversarial content must often remain undetected or non-suspicious to human readers, even when the text’s purpose is to deceive NLP systems or bypass filters. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive adversarial texts. We gather and publish a novel dataset of Likert-scale human evaluations on the suspiciousness of adversarial sentences, crafted by four widely used adversarial attack methods and assess their correlation with the human ability to detect machine-generated alterations. Additionally, we develop a regression-based model to quantify suspiciousness and establish a baseline for future research in reducing the suspiciousness in adversarial text generation. We also demonstrate how the regressor-generated suspicious scores can be incorporated into adversarial generation methods to produce texts that are less likely to be perceived as computer-generated. We make our human suspiciousness annotated data and our code available.

[NLP-104] Algorithmic Capabilities of Random Transformers NEURIPS2024

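[Quick read]: This paper asks how the circuits that implement interpretable procedures for tasks such as arithmetic and associative recall originate in transformer models during training, and to what extent they depend on the supervisory signal versus behavior already present at the beginning of training. The key approach is to study randomly initialized transformers in which only the embedding layers are optimized, so that the model can learn only those input-output mappings that the initial model already implements (up to a choice of encoding scheme). These random transformers can perform a wide range of meaningful algorithmic tasks, indicating that some algorithmic capabilities are present in transformers, and accessible via appropriately structured inputs, even before training.

Link: https://arxiv.org/abs/2410.04368
Authors: Ziqian Zhong, Jacob Andreas
Keywords: implement interpretable procedures, implement interpretable, interpretable procedures, procedures originate, associative recall
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024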

Abstract:Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input–output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at this https URL.

[NLP-105] TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

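[Quick read]: This paper addresses a limitation of Direct Preference Optimization (DPO) for large language models (LLMs): DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring differences in importance between tokens, which may hurt optimization efficiency and make optimal results hard to reach. It proposes TIS-DPO, a token-level importance sampling DPO objective that assigns each token an importance weight based on its reward in order to achieve unbiased optimization. The weights are estimated from the difference in prediction probabilities of a pair of contrastive LLMs, constructed in one of three ways: guiding the original LLM with contrastive prompts, training two separate LLMs on winning and losing responses, or performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baselines on harmlessness and helpfulness alignment and on summarization.

Link: https://arxiv.org/abs/2410.04350
Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao
Keywords: Large Language Models, Direct Preference Optimization, Language Models, Large Language, Direct Preference
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 7 figures, 2 tables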

Abstract:Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired by previous works, we estimate the token importance weights using the difference in prediction probabilities from a pair of contrastive LLMs. We explore three methods to construct these contrastive LLMs: (1) guiding the original LLM with contrastive prompts, (2) training two separate LLMs using winning and losing responses, and (3) performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. We also visualize the estimated weights, demonstrating their ability to identify key token positions.
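
A token-weighted variant of the DPO loss can be sketched as follows. This is not the paper's exact objective: the clipping range, the exponential form of the weight, the `scale` parameter, and the way weights enter the loss are assumptions. The only properties relied on are that weights come from the log-probability gap between two contrastive models and that unit weights recover the standard sequence-level DPO loss.

```python
import math

def estimate_token_weights(logp_pos, logp_neg, low=-1.0, high=1.0, scale=1.0):
    """Per-token weight from the clipped log-probability gap of two contrastive models."""
    return [math.exp(scale * max(low, min(high, p - n)))
            for p, n in zip(logp_pos, logp_neg)]

def weighted_dpo_loss(policy_w, ref_w, weights_w,
                      policy_l, ref_l, weights_l, beta=0.1):
    """-log sigmoid of the weighted log-ratio margin (winning minus losing).

    Each argument is a list of per-token log-probabilities or weights.
    """
    win = sum(w * (p - r) for w, p, r in zip(weights_w, policy_w, ref_w))
    lose = sum(w * (p - r) for w, p, r in zip(weights_l, policy_l, ref_l))
    margin = beta * (win - lose)
    return math.log1p(math.exp(-margin))    # equals -log(sigmoid(margin))

# Hypothetical per-token log-probabilities for a two-token winning and losing response.
policy_w, ref_w = [-1.0, -0.5], [-1.2, -0.9]
policy_l, ref_l = [-2.0, -1.0], [-1.5, -0.8]
unit = [1.0, 1.0]
loss_unit = weighted_dpo_loss(policy_w, ref_w, unit, policy_l, ref_l, unit)
weights = estimate_token_weights([-0.2, -3.0], [-0.9, -0.5])
print(loss_unit, weights)
```

With unit weights the margin is the usual difference of sequence-level log-ratios, so the sketch makes visible what changes when individual tokens are weighted up or down.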

[NLP-106] Latent Feature Mining for Predictive Model Enhancement with Large Language Models

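Quick read: This paper addresses the difficulty traditional machine learning models have in capturing unobserved but critical latent features when data is limited and of poor quality. The key to the solution is FLAME, a framework that uses large language models (LLMs) to perform text-to-text propositional logical reasoning, mining latent features that augment the observed ones and improve the predictive power of ML models in downstream tasks. By incorporating domain-specific contextual information, FLAME is designed to transfer across domains that face similar data availability challenges.

Link: https://arxiv.org/abs/2410.04347
Authors: Bingxuan Li,Pengyi Shi,Amy Ward
Keywords (EN): faces challenges due, latent feature mining, practical difficulties, modeling often faces, weakly correlated
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)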

Abstract:Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where collected features are weakly correlated with outcomes and where additional feature collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce an effective approach to formulate latent feature mining as text-to-text propositional logical reasoning. We propose FLAME (Faithful Latent Feature Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features and enhance the predictive power of ML models in downstream tasks. Our framework is generalizable across various domains with necessary domain-specific adaptation, as it is designed to incorporate contextual information unique to each area, ensuring effective transfer to different areas facing similar data availability challenges. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection; (2) the healthcare domain, where patient privacy concerns and the complexity of medical data limit comprehensive feature collection. Our results show that inferred latent features align well with ground truth labels and significantly enhance the downstream classifier.

[NLP-107] Ordinal Preference Optimization: Aligning Human Preferences via NDCG

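Quick read: This paper addresses the fact that existing methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) do not fully exploit the ranking information provided by reward models or human feedback when multiple responses are available. The key to the solution is a new listwise method, Ordinal Preference Optimization (OPO), which builds on Normalized Discounted Cumulative Gain (NDCG), a widely used ranking metric, and achieves end-to-end preference optimization through a differentiable surrogate loss that approximates NDCG. OPO outperforms existing pairwise and listwise methods on multi-response datasets, and enlarging the pool of negative samples further improves performance by reducing the adverse effect of trivial negatives.

Link: https://arxiv.org/abs/2410.04346
Authors: Yang Zhao,Yixin Wang,Mingzhang Yin
Keywords (EN): Aligning Large Language, Large Language Models, enhancing generation quality, Large Language, diverse human preferences
Categories: Computation and Language (cs.CL)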

Abstract:Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.
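
For readers unfamiliar with the metric, here is a minimal sketch of plain (non-differentiable) NDCG for one prompt with several responses. The exponential gain `2**g - 1` is one common convention and is an assumption here, as are the function names; the paper's differentiable surrogate is not reproduced.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of gains listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(pos + 2) for pos, g in enumerate(gains))

def ndcg(rewards, scores):
    """NDCG of the ranking induced by model `scores`, judged by `rewards`.

    rewards: ground-truth ordinal rewards, one per response.
    scores:  model scores for the same responses (higher = ranked earlier).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_gains = [rewards[i] for i in order]
    ideal = dcg(sorted(rewards, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```

The `sorted` call is what makes this version unusable as a training loss: the ranking changes discontinuously with the scores, which is why the abstract speaks of approximating NDCG with a differentiable surrogate.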

[NLP-108] Inference Scaling for Long-Context Retrieval Augmented Generation

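Quick read: This paper studies how scaling inference computation can improve retrieval augmented generation (RAG) on knowledge-intensive tasks. The key is to explore two inference scaling strategies, in-context learning and iterative prompting, which flexibly scale test-time computation by increasing the number of retrieved documents or generation steps, strengthening the model's ability to acquire and use contextual information. The paper describes inference scaling laws for RAG and develops a computation allocation model that predicts the optimal inference parameters under different compute constraints; applying those configurations yields gains of up to 58.9% on benchmark datasets compared to standard RAG.

Link: https://arxiv.org/abs/2410.04343
Authors: Zhenrui Yue,Honglei Zhuang,Aijun Bai,Kai Hui,Rolf Jagerman,Hansi Zeng,Zhen Qin,Dong Wang,Xuanhui Wang,Michael Bendersky
Keywords (EN): RAG, RAG performance, long-context large language, large language models, inference
Categories: Computation and Language (cs.CL)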

Abstract:The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring strategies beyond simply increasing the quantity of knowledge. We focus on two inference scaling strategies: in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs’ ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.

[NLP-109] ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

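Quick read: This paper addresses the fact that the tokenizer of a large language model (LLM) cannot guarantee a high compression rate in every scenario, so longer average input and output lengths raise training and inference costs. The key to the solution is to replace the tokenizer, reinitialize the parameters of the model's input and output layers from the parameters of the original model, and train only those parameters while keeping all others fixed. This maintains model performance while significantly improving decoding speed on long texts.

Link: https://arxiv.org/abs/2410.04335
Authors: Shuhao Gu,Mengdi Zhao,Bowen Zhang,Liangdong Wang,Jijie Li,Guang Liu
Keywords (EN): high compression rate, large language models, high compression, compression rate, ensure high compression
Categories: Computation and Language (cs.CL)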

Abstract:The tokenizer is an essential component for large language models (LLMs), and a tokenizer with a high compression rate can improve the model’s representation and processing efficiency. However, the tokenizer cannot ensure a high compression rate in all scenarios, and an increase in the average input and output lengths will increase the training and inference costs of the model. Therefore, it is crucial to find ways to improve the model’s efficiency with minimal cost while maintaining the model’s performance. In this work, we propose a method to improve model representation and processing efficiency by replacing the tokenizers of LLMs. We propose replacing and reinitializing the parameters of the model’s input and output layers with the parameters of the original model, and training these parameters while keeping other parameters fixed. We conducted experiments on different LLMs, and the results show that our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
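
The abstract does not spell out how the new input and output layers are initialized "with the parameters of the original model". One plausible reading, sketched below purely as an assumption, is to copy embeddings for tokens shared by both vocabularies and to average the old embeddings of the pieces that the old tokenizer would split a new token into. The greedy `old_tokenize` helper and the dict-of-lists embedding table are hypothetical stand-ins, not the paper's implementation.

```python
def old_tokenize(text, old_vocab):
    """Toy stand-in for the old tokenizer: greedy longest-match segmentation."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no prefix of the remaining text is in the old vocabulary
            raise KeyError(f"cannot segment {text[i:]!r} with the old vocabulary")
    return tokens

def init_new_embeddings(new_vocab, old_emb):
    """Build an embedding table for `new_vocab` from the old table `old_emb`."""
    new_emb = {}
    for tok in new_vocab:
        if tok in old_emb:
            new_emb[tok] = list(old_emb[tok])            # shared token: copy
        else:
            pieces = old_tokenize(tok, old_emb)
            dim = len(old_emb[pieces[0]])
            new_emb[tok] = [                               # new token: mean of pieces
                sum(old_emb[p][d] for p in pieces) / len(pieces)
                for d in range(dim)
            ]
    return new_emb
```

After such an initialization, only the input and output layers would be trained with all other parameters frozen, which is the part the abstract does state explicitly.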

[NLP-110] OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

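Quick read: This paper addresses coverless steganography in which a large language model (LLM) drives an arithmetic coding decoder to generate stego-text that stays natural and fluent while embedding the secret message in as few tokens as possible. The key observation is that, at the level of a single token, the problem is mathematically equivalent to maximizing the entropy of a replacement distribution for next-token generation, subject to a constraint on the KL divergence between that distribution and the LLM's original distribution. The paper gives a closed-form solution to this optimization and also handles several practical issues: tokenization mismatch, combining the optimized distribution with vocabulary truncation, and combining it with other sequence-level selection heuristics to further improve efficiency and reliability.

Link: https://arxiv.org/abs/2410.04328
Authors: Yu-Shin Huang,Peter Just,Krishna Narayanan,Chao Tian
Keywords (EN): Large Language Model, arithmetic coding decoder, Language Model, Large Language, drives an arithmetic
Categories: Information Theory (cs.IT); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 9 figures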

Abstract:We consider coverless steganography where a Large Language Model (LLM) drives an arithmetic coding decoder to generate stego-texts. An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent. We show that on the individual token level, this problem is mathematically equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the KL divergence between the chosen probability distribution and the original distribution given by the LLM. A closed-form solution is provided for the optimization problem, which can be computed efficiently. Several important practical issues are also tackled: 1) An often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) The combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) The combination of the optimized distribution with other sequence-level selection heuristics to further enhance the efficiency and reliability is studied.
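
The per-token optimization described above can be explored numerically. The sketch below assumes the constraint is KL(q‖p) ≤ δ, where p is the LLM's next-token distribution and q is the replacement. Under that assumption, Lagrangian stationarity gives q ∝ p^β for some β in [0, 1], so the problem reduces to a one-dimensional search for β. The function names and the bisection are illustrative; the paper's closed form may be parameterized differently.

```python
import math

def temper(p, beta):
    """Return q proportional to p**beta (restricted to the support of p)."""
    w = [pi ** beta if pi > 0 else 0.0 for pi in p]
    z = sum(w)
    return [wi / z for wi in w]

def kl(q, p):
    """KL(q || p) in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    """Shannon entropy of q in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def max_entropy_within_kl(p, delta, iters=60):
    """Highest-entropy tempered distribution q with KL(q || p) <= delta."""
    if kl(temper(p, 0.0), p) <= delta:
        return temper(p, 0.0)        # uniform over the support already fits
    lo, hi = 0.0, 1.0                # KL at lo exceeds delta; KL at hi is ~0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(temper(p, mid), p) <= delta:
            hi = mid                 # feasible: try flattening further
        else:
            lo = mid
    return temper(p, hi)
```

Bisection is valid here because KL(q_β‖p) decreases monotonically as β rises from 0 to 1, while entropy increases as β falls, so the smallest feasible β gives the flattest admissible distribution.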

[NLP-111] Calibrating Expressions of Certainty

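Quick read: This paper addresses the calibration of linguistic expressions of certainty such as “Maybe” and “Likely”. The key is to model uncertainty as a distribution over the simplex rather than as a single score, which captures the semantics of these phrases more accurately. To support this representation, the paper generalizes existing measures of miscalibration and introduces a new post-hoc calibration method. With these tools, the authors analyze the calibration of humans (e.g., radiologists) and computational models (e.g., language models) and offer interpretable suggestions for improvement.

Link: https://arxiv.org/abs/2410.04315
Authors: Peiqi Wang,Barbara D. Lam,Yingcheng Liu,Ameneh Asgari-Targhi,Rameswar Panda,William M. Wells,Tina Kapur,Polina Golland
Keywords (EN): calibrating linguistic expressions, approach to calibrating, calibrating linguistic, linguistic expressions, Abstract
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)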

Abstract:We present a novel approach to calibrating linguistic expressions of certainty, e.g., “Maybe” and “Likely”. Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.

[NLP-112] Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

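Quick read: This paper addresses how to identify unreliable language subsets in multilingual datasets that were collected with imperfect classifiers such as language identification models. The key is a statistical test, the Preference Proportion Test, which needs only 20 annotated samples per language subset to detect systematic transcription errors. Filtering out the flagged low-quality data when training models for the downstream task of phonetic transcription yields a 25.7% relative improvement on recordings in out-of-distribution languages.

Link: https://arxiv.org/abs/2410.04292
Authors: Farhan Samir,Emily P. Ahn,Shreya Prakash,Márton Soskuthy,Vered Shwartz,Jian Zhu
Keywords (EN): span multiple languages, Curating datasets, span multiple, Preference Proportion Test, Curating
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 6 figures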

Abstract:Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples for a language subset, we’re able to identify systematic transcription errors for 10 language subsets in a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our method lays a path forward for systematic and reliable multilingual dataset auditing.
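
The abstract does not define the Preference Proportion Test, so the following is only a generic stand-in: a one-sided exact binomial test on 20 annotations. It assumes each annotation records whether a reference transcript was preferred over the dataset's transcript, and that a subset is flagged when that happens significantly more often than chance. The null proportion `p0`, the threshold `alpha`, and both function names are assumptions.

```python
from math import comb

def binom_tail(k, n, p0=0.5):
    """P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))

def flag_subset(times_reference_preferred, n=20, alpha=0.05):
    """Flag a language subset as unreliable if the preference is significant."""
    return binom_tail(times_reference_preferred, n) < alpha
```

With 20 samples and these settings, a subset is flagged once the reference is preferred in 15 or more annotations, which shows how a small annotation budget can still support a decision when the errors are systematic rather than sporadic.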

[NLP-113] Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia EMNLP’24

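Quick read: This paper addresses information gaps and inconsistencies in cross-language comparative text analysis, particularly in studies of social phenomena and systematic biases. The key is InfoGap, an efficient and reliable method for locating information gaps and inconsistencies between articles in different languages at the fact level. Analyzing biography pages of LGBT people on English, Russian, and French Wikipedias, the study finds large discrepancies in factual coverage across languages, and that facts with negative connotations are more likely to be highlighted in Russian Wikipedia. InfoGap supports large-scale analysis while also pinpointing document- and fact-level gaps, laying a foundation for targeted, fine-grained cross-language comparison at scale.

Link: https://arxiv.org/abs/2410.04282
Authors: Farhan Samir,Chan Young Park,Anjalie Field,Vered Shwartz,Yulia Tsvetkov
Keywords (EN): identify systematic biases, explain social phenomena, computational social science, social science focuses, systematic biases
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 3 figures. To appear at EMNLP’24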

Abstract:To explain social phenomena and identify systematic biases, much research in computational social science focuses on comparative text analyses. These studies often rely on coarse corpus-level statistics or local word-level analyses, mainly in English. We introduce the InfoGap method – an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level, across languages. We evaluate InfoGap by analyzing LGBT people’s portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias. We find large discrepancies in factual coverage across the languages. Moreover, our analysis reveals that biographical facts carrying negative connotations are more likely to be highlighted in Russian Wikipedia. Crucially, InfoGap both facilitates large scale analyses, and pinpoints local document- and fact-level information gaps, laying a new foundation for targeted and nuanced comparative language analysis at scale.

[NLP-114] Mechanistic Behavior Editing of Language Models

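Quick read: This paper addresses the limited generalizability of large language models (LLMs) caused by spurious features learned from noisy data. The key is TaRot, a new task adaptation method that intervenes in neural circuitries using learnable rotation matrices, optimized with Bayesian Optimization on a number of labelled samples comparable to standard few-shot prompting. Building on prior findings that fine-tuning enhances or suppresses existing abilities rather than introducing new ones, it improves performance in both zero- and few-shot settings. Experiments on multiple classification and generation tasks show average improvements of 23.81% over zero-shot and 11.15% over few-shot performance.

Link: https://arxiv.org/abs/2410.04277
Authors: Joykirat Singh,Subhabrata Dutta,Tanmoy Chakraborty
Keywords (EN): Large Language Models, text acquire language, Large Language, web-scale text acquire, Language Models trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)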

Abstract:Large Language Models trained on web-scale text acquire language generation abilities that can solve a wide range of tasks, particularly when task knowledge is refined into the generative prior using in-context examples. However, spurious features learned from noisy data hinder their generalizability. Supervised finetuning can introduce task specificity, but introduces data inefficiency. Prior studies indicate that (i) noisy neural circuitries coexist with generalizable ones within LLMs, and (ii) finetuning typically enhances (or suppresses) existing abilities without introducing newer ones. Building upon these, we propose TaRot, a novel method for task adaptation. TaRot intervenes in the neural circuitries using learnable rotation matrices that are optimized using Bayesian Optimization, on labelled samples in the order of standard few-shot prompting examples. Experiments on multiple classification and generation tasks using LLMs of varying sizes reveal the efficacy of TaRot, improving upon both zero- as well as few-shot performance, with average improvements (across models and tasks) of 23.81% and 11.15%, respectively. The source code is available at this https URL

[NLP-115] Language Model-Driven Data Pruning Enables Efficient Active Learning

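Quick read: This paper addresses the high computational cost of conventional acquisition functions when active learning is run over large unlabeled pools. The key is ActivePrune, a plug-and-play strategy that uses language models to prune the unlabeled pool in two stages: a fast initial evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data quality metrics computed with a quantized LLM. To increase diversity in the unlabeled pool, the paper also proposes a perplexity reweighting method that systematically brings underrepresented instances forward for selection in later labeling iterations. Experiments across several tasks and datasets show that ActivePrune outperforms existing data pruning methods, is computationally more efficient than other LLM score-based pruning methods, and cuts the end-to-end time of active learning by up to 74%.

Link: https://arxiv.org/abs/2410.04275
Authors: Abdul Hameed Azeemi,Ihsan Ayyub Qazi,Agha Ali Raza
Keywords (EN): unlabeled pool, optimizes data labeling, unlabeled data pools, instances for annotation, unlabeled
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 4 figures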

Abstract:Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies the suitable instances for labeling from the unlabeled pool. However, these acquisition methods suffer from high computational costs with large unlabeled data pools, posing a roadblock to their applicability on large datasets. To address this challenge and bridge this gap, we introduce a novel plug-and-play unlabeled data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using metrics for data quality computed through a quantized LLM. Additionally, to enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks on four diverse datasets and four active learning strategies demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the selection quality vs. efficiency tradeoff of the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, and provides up to 74% reduction in the end-to-end time required for active learning.
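
The first pruning stage relies on perplexity from an n-gram language model. A minimal sketch of that ingredient follows, using an add-one-smoothed bigram model built with the standard library. Keeping the lowest-perplexity fraction of the pool, the `keep` ratio, and all function names are assumptions for illustration; the abstract does not say which direction or threshold ActivePrune uses, and the smoothing here is deliberately crude.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def perplexity(sent, uni, bi):
    """Per-token perplexity under an add-one-smoothed bigram model."""
    toks = ["<s>"] + sent.split()
    vocab = len(uni)
    logp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        logp += math.log((bi[(prev, cur)] + 1) / (uni[prev] + vocab))
    return math.exp(-logp / max(len(toks) - 1, 1))

def prune_pool(pool, uni, bi, keep=0.5):
    """Keep the `keep` fraction of the pool with the lowest perplexity."""
    ranked = sorted(pool, key=lambda s: perplexity(s, uni, bi))
    return ranked[: max(1, int(len(pool) * keep))]
```

Scoring with counts alone is what makes this stage cheap: the more expensive quantized-LLM quality metrics described in the abstract would then only need to run on what survives.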

[NLP-116] Evaluating Language Model Character Traits EMNLP2024

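Quick read: This paper addresses how to describe the human-like behaviour of language models (LMs) without undue anthropomorphism. The key is a behaviourist theory of LM character traits, grounded in empirical demonstrations of LMs exhibiting traits such as truthfulness, sycophancy, and coherent beliefs and intentions, together with an analysis of how consistently those traits appear across model size, fine-tuning, and prompting. The paper also evaluates how traits develop over an interaction: traits such as truthfulness and harmfulness can be stationary (consistent over an interaction) in some contexts and reflective (mirroring the LM's behaviour earlier in the interaction) in others. The formalism allows LM behaviour to be described precisely in intuitive language.

Link: https://arxiv.org/abs/2410.04272
Authors: Francis Rhys Ward,Zejia Yang,Alex Jackson,Randy Brown,Chandler Smith,Grace Colverd,Louis Thomson,Raymond Douglas,Patrik Bartak,Andrew Rowan
Keywords (EN): character traits, exhibit human-like behaviour, traits, character, behaviour
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of EMNLP 2024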

Abstract:Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts, but may be reflective in different contexts, meaning they mirror the LM’s behavior in the preceding interaction. Our formalism enables us to describe LM behaviour precisely in intuitive language, without undue anthropomorphism.

[NLP-117] Fundamental Limitations on Subquadratic Alternatives to Transformers

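Quick read: This paper shows that, assuming a popular conjecture from fine-grained complexity theory, the quadratic running time of the Transformer architecture is unavoidable for certain important tasks. Specifically, it proves that a Transformer can perform document similarity tasks, whereas no model that can be evaluated in truly subquadratic time can, whether it relies on subquadratic-time heuristics for attention or on faster attention replacements such as Mamba. The key is a lower bound on the complexity of the task itself: no subquadratic-time algorithm can solve it, so any model handling tasks that involve document similarity cannot avoid quadratic running time.

Link: https://arxiv.org/abs/2410.04271
Authors: Josh Alman,Hantao Yu
Keywords (EN): impactful Large Language, Large Language Models, impactful Large, Large Language, architecture is widely
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL)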

Abstract:The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
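
To make the document similarity task concrete, here is the straightforward quadratic baseline: compare every pair of documents and return the most similar one. Representing documents as equal-length numeric vectors and scoring similarity by inner product are assumptions for illustration; the paper's formal problem definition may differ.

```python
from itertools import combinations

def most_similar_pair(docs):
    """Brute-force search over all n*(n-1)/2 pairs of document vectors.

    docs: list of equal-length numeric vectors (e.g. bag-of-words counts).
    Returns ((i, j), score) for the pair with the largest inner product.
    """
    best_score, best_pair = float("-inf"), None
    for (i, a), (j, b) in combinations(enumerate(docs), 2):
        score = sum(x * y for x, y in zip(a, b))
        if score > best_score:
            best_score, best_pair = score, (i, j)
    return best_pair, best_score
```

The number of pairs grows quadratically with the number of documents. The abstract's claim is that, under the stated conjecture, no algorithm can do fundamentally better than this for the approximate version of the task.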

[NLP-118] RoQLlama: A Lightweight Romanian Adapted Language Model EMNLP

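Quick read: This paper aims to improve the performance of Llama2 models on Romanian tasks under limited computing resources. The key is to train with QLoRA (quantized low-rank adaptation), which substantially reduces compute requirements while preserving performance. The authors release RoQLlama-7b, a quantized LLM that matches or improves on its full-sized counterpart on seven Romanian downstream tasks in the zero-shot setup and consistently achieves higher average scores across all few-shot prompts. They also introduce RoMedQA, a new Romanian dataset of single-choice medical questions.

Link: https://arxiv.org/abs/2410.04269
Authors: George-Andrei Dima,Andrei-Marius Avram,Cristian-George Crăciun,Dumitru-Clementin Cercel
Keywords (EN): open-source large language, remarkable achievements obtained, large language models, English language, involving the English
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP Findings 2024 (short papers)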

Abstract:The remarkable achievements obtained by open-source large language models (LLMs) in recent years have predominantly been concentrated on tasks involving the English language. In this paper, we aim to advance the performance of Llama2 models on Romanian tasks. We tackle the problem of reduced computing resources by using QLoRA for training. We release RoQLlama-7b, a quantized LLM, which shows equal or improved results compared to its full-sized counterpart when tested on seven Romanian downstream tasks in the zero-shot setup. Also, it consistently achieves higher average scores across all few-shot prompts. Additionally, we introduce a novel Romanian dataset, namely RoMedQA, which contains single-choice medical questions in Romanian.

[NLP-119] Constructing Cloze Questions Generatively IJCNN

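Quick read: This paper addresses the automatic construction of cloze questions from a given article, with an emphasis on generating high-quality multigram distractors. The key is CQG, which combines neural networks with WordNet: building on sense disambiguation, text-to-text transformation, and WordNet's synset taxonomies and lexical labels, it generates instance-level distractor candidates (IDCs), filters and ranks them by contextual embedding similarity and by synset and lexical relatedness, and combines the top-ranked IDCs into legitimate phrases that serve as distractors. Experiments show that the method significantly outperforms the state of the art, and human judges confirm the quality of the generated distractors.

Link: https://arxiv.org/abs/2410.04266
Authors: Yicheng Sun(1),Jie Wang(2)
Keywords (EN): constructing cloze questions, generating multigram distractors, method called CQG, generative method called, called CQG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 5 tables, 2023 International Joint Conference on Neural Networks (IJCNN)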

Abstract:We present a generative method called CQG for constructing cloze questions from a given article using neural networks and WordNet, with an emphasis on generating multigram distractors. Built on sense disambiguation, text-to-text transformation, WordNet’s synset taxonomies and lexical labels, CQG selects an answer key for a given sentence, segments it into a sequence of instances, generates instance-level distractor candidates (IDCs) using a transformer and sibling synsets. It then removes inappropriate IDCs, ranks the remaining IDCs based on contextual embedding similarities, as well as synset and lexical relatedness, forms distractor candidates by combinatorially replacing instances with the corresponding top-ranked IDCs, and checks if they are legitimate phrases. Finally, it selects top-ranked distractor candidates based on contextual semantic similarities to the answer key. Experiments show that this method significantly outperforms SOTA results. Human judges also confirm the high qualities of the generated distractors.
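
The step of "combinatorially replacing instances with the corresponding top-ranked IDCs" can be pictured with a small sketch. The data layout (a list of answer-key segments plus a dict of ranked candidates per segment), the `top_k` cutoff, and the choice to allow partial replacement by keeping the original segment as an option are all assumptions; the ranking and legitimacy checks from the abstract are not modeled.

```python
from itertools import product

def distractor_candidates(instances, idcs, top_k=2):
    """Enumerate phrases formed by swapping instances for their top IDCs.

    instances: ordered segments of the answer key, e.g. ["red", "apple"].
    idcs: hypothetical mapping from a segment to its ranked IDC list.
    """
    options = [[inst] + idcs.get(inst, [])[:top_k] for inst in instances]
    answer = " ".join(instances)
    phrases = (" ".join(choice) for choice in product(*options))
    return [p for p in phrases if p != answer]  # drop the answer key itself
```

Because the number of combinations is the product of the option counts per instance, a small `top_k` keeps the candidate set manageable before any downstream filtering.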

[NLP-120] AI as Humanitys Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

【速读】: 该论文试图解决的问题是如何量化文本的语言创造性,并探讨大型语言模型(如ChatGPT)是否能匹配或超越人类的创造力。解决方案的关键在于提出了CREATIVITY INDEX,这是一种通过从现有网络文本片段中重构文本来量化语言创造性的方法。论文通过引入DJ SEARCH算法,高效地搜索给定文档中的文本片段与网络上的精确或近似匹配,从而计算CREATIVITY INDEX。实验结果表明,专业人类作者的CREATIVITY INDEX平均比LLMs高出66.2%,并且发现CREATIVITY INDEX可以作为零样本机器文本检测的有效标准,显著优于现有的零样本检测系统DetectGPT和监督系统GhostBuster。

Link: https://arxiv.org/abs/2410.04265
Authors: Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, Yejin Choi
Keywords: CREATIVITY INDEX, Large Language Models, Creativity, INDEX, present CREATIVITY INDEX
Categories: Computation and Language (cs.CL)

Abstract:Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit measurably higher CREATIVITY INDEX compared to other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.
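
The idea of scoring a text by how much of it can be reconstructed from existing snippets can be illustrated with a toy sketch. This is not the authors' DJ SEARCH algorithm: it assumes a small in-memory list of strings standing in for the web, handles only verbatim matches, and the names `coverage_creativity` and `min_n` are hypothetical. The score is the fraction of words that are not covered by any matched n-gram.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage_creativity(text, corpus, min_n=3):
    """Fraction of words in `text` not covered by any verbatim min_n-gram
    that also occurs somewhere in `corpus` (a toy stand-in for web text)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # Index every min_n-gram of the reference corpus.
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc.lower().split(), min_n)
    # Mark each word that lies inside at least one matched n-gram.
    covered = [False] * len(tokens)
    for i in range(len(tokens) - min_n + 1):
        if tuple(tokens[i:i + min_n]) in seen:
            for j in range(i, i + min_n):
                covered[j] = True
    return 1.0 - sum(covered) / len(tokens)
```

A text copied from the corpus scores 0, and a text sharing no 3-gram with it scores 1. Near-verbatim matching and web-scale search, which the abstract attributes to DJ SEARCH, are outside this sketch.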

[NLP-121] Is deeper always better? Replacing linear mappings with deep learning networks in the Discriminative Lexicon Model

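[Quick read]: This paper asks whether deep learning can help us understand the learning problem that speakers must solve better than traditional linear methods can. The key step is to replace the linear mappings of the Discriminative Lexicon Model (Linear Discriminative Learning, LDL) with deep dense neural networks (Deep Discriminative Learning, DDL). DDL yields more accurate mappings for large and diverse English and Dutch datasets, but not necessarily for Estonian and Taiwan Mandarin, and it outperforms LDL particularly on words with pseudo-morphological structure such as slend+er. On average reaction times, plain DDL is outperformed by frequency-informed linear mappings (FIL), whereas frequency-informed deep learning (FIDDL) substantially outperforms FIL. Deep mappings, however, cannot be updated from trial to trial as effectively as linear mappings when modelling incremental lexical learning. Overall, both linear and deep mappings are informative for understanding language.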

Link: https://arxiv.org/abs/2410.04259
Authors: Maria Heitmeier, Valeria Schmidt, Hendrik P.A. Lensch, R. Harald Baayen
Keywords: Recently, Discriminative Lexicon Model, learning, DDL, cognitive modelling
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 6 figures

Abstract:Recently, deep learning models have increasingly been used in cognitive modelling of language. This study asks whether deep learning can help us to better understand the learning problem that needs to be solved by speakers, above and beyond linear methods. We utilise the Discriminative Lexicon Model (DLM, Baayen et al., 2019), which models comprehension and production with mappings between numeric form and meaning vectors. While so far, these mappings have been linear (Linear Discriminative Learning, LDL), in the present study we replace them with deep dense neural networks (Deep Discriminative Learning, DDL). We find that DDL affords more accurate mappings for large and diverse datasets from English and Dutch, but not necessarily for Estonian and Taiwan Mandarin. DDL outperforms LDL in particular for words with pseudo-morphological structure such as slend+er. Applied to average reaction times, we find that DDL is outperformed by frequency-informed linear mappings (FIL). However, DDL trained in a frequency-informed way (‘frequency-informed’ deep learning, FIDDL) substantially outperforms FIL. Finally, while linear mappings can very effectively be updated from trial-to-trial to model incremental lexical learning (Heitmeier et al., 2023), deep mappings cannot do so as effectively. At present, both linear and deep mappings are informative for understanding language.
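
The trial-to-trial updating of a linear form-to-meaning mapping can be sketched as follows. The abstract does not spell out the update rule, so this assumes the delta (Widrow-Hoff) rule, a standard choice for incrementally learned linear mappings; the function names and the toy vectors are hypothetical.

```python
def predict(W, cue):
    """Map a form vector `cue` (length n) to a meaning vector via W (n x m)."""
    m = len(W[0])
    return [sum(cue[i] * W[i][j] for i in range(len(cue))) for j in range(m)]


def delta_update(W, cue, target, lr=0.1):
    """One learning trial, in place: W += lr * outer(cue, target - cue @ W)."""
    pred = predict(W, cue)
    err = [t - p for t, p in zip(target, pred)]
    for i, c in enumerate(cue):
        for j, e in enumerate(err):
            W[i][j] += lr * c * e
    return W


# Two toy "words" with overlapping form cues and 2-d meaning vectors (assumed data).
W = [[0.0, 0.0] for _ in range(3)]
trials = [([1, 1, 0], [1.0, 0.0]), ([0, 1, 1], [0.0, 2.0])]
for _ in range(200):
    for cue, target in trials:
        delta_update(W, cue, target, lr=0.2)
```

Each trial nudges the mapping toward the meaning of the word just encountered, which is what makes a linear mapping cheap to update incrementally. A deep network has no equally simple closed-form single-trial update.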

[NLP-122] Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia EMNLP2024

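[Quick read]: This paper addresses the difficulty of inserting new links into information networks, particularly when the source text contains no explicit span that could serve as an anchor. The key contribution is a framework called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI, which locate a suitable position in the text for inserting an entity link. On a benchmark dataset covering 105 languages, XLocEI outperforms all baseline models, including prompt-based ranking with LLMs such as GPT-4, and can be applied zero-shot to languages not seen during training with minimal performance drop. This matters in practice for supporting editors who add links across the more than 300 language versions of Wikipedia.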

Link: https://arxiv.org/abs/2410.04254
Authors: Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West
Keywords: turning isolated pieces, fundamental part, entity insertion, turning isolated, isolated pieces
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: EMNLP 2024; 24 pages; 62 figures

Abstract:Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

[NLP-123] Enhancing Future Link Prediction in Quantum Computing Semantic Networks through LLM-Initiated Node Features

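[Quick read]: This paper studies how semantic networks can be used to identify knowledge gaps and novel concept combinations in quantum computing. The key idea is to initialize node features with large language models (LLMs), enriching the node representations used by graph neural networks for link prediction. This reduces the need for manual feature creation and lowers cost. Evaluated with various link prediction models on a quantum computing semantic network, the method proved effective compared with traditional node embedding techniques.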

Link: https://arxiv.org/abs/2410.04251
Authors: Gilchan Park, Paul Baity, Byung-Jun Yoon, Adolfy Hoisie
Keywords: accelerate computational processes, solve complex problems, computer science, offering the potential, computational processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Quantum Physics (quant-ph)

Abstract:Quantum computing is rapidly evolving in both physics and computer science, offering the potential to solve complex problems and accelerate computational processes. The development of quantum chips necessitates understanding the correlations among diverse experimental conditions. Semantic networks built on scientific literature, representing meaningful relationships between concepts, have been used across various domains to identify knowledge gaps and novel concept combinations. Neural network-based approaches have shown promise in link prediction within these networks. This study proposes initializing node features using LLMs to enhance node representations for link prediction tasks in graph neural networks. LLMs can provide rich descriptions, reducing the need for manual feature creation and lowering costs. Our method, evaluated using various link prediction models on a quantum computing semantic network, demonstrated efficacy compared to traditional node embedding techniques.
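
How initial node features feed into link prediction can be sketched in a few lines. This is a toy under stated assumptions, not the paper's models: the feature vectors below are hand-written placeholders for what an LLM embedding of each concept's description might provide, a single round of mean neighbor aggregation stands in for a graph neural network layer, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def aggregate(features, edges):
    """One round of mean aggregation over each node and its neighbors."""
    neighbors = {node: [node] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[m][k] for m in ns) / len(ns) for k in range(dim)]
        for node, ns in neighbors.items()
    }


def score_link(features, edges, u, v):
    """Score a candidate link (u, v) from the aggregated node representations."""
    h = aggregate(features, edges)
    return cosine(h[u], h[v])


# Placeholder features; in the setting described above they would come from an LLM.
features = {
    "qubit": [1.0, 0.0],
    "transmon": [0.9, 0.1],
    "coherence time": [0.8, 0.2],
    "surface code": [0.0, 1.0],
}
edges = [("qubit", "transmon")]
```

With informative initial features, semantically close concepts already receive high link scores before any training, which is the intuition behind replacing hand-crafted or random initial features.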

[NLP-124] Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations EMNLP2024

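[Quick read]: This paper tackles the dual challenge of handling knowledge conflicts and providing source citations in question answering (QA). The key contribution is a new task: QA with source citation in ambiguous settings where multiple valid answers exist. To support it, the paper builds a framework comprising five new datasets, the first ambiguous multi-hop QA dataset, and two new metrics for evaluating model performance. It also provides several strong baselines based on rule-based, prompting, and fine-tuning approaches, with the aim of advancing QA research toward more trustworthy and interpretable systems.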

Link: https://arxiv.org/abs/2410.04241
Authors: Sagi Shaier, Ari Kobren, Philip Ogren
Keywords: Resolving knowledge conflicts, Question Answering, numerous conflicting facts, Resolving knowledge, challenge in Question
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024

Abstract:Resolving knowledge conflicts is a crucial challenge in Question Answering (QA) tasks, as the internet contains numerous conflicting facts and opinions. While some research has made progress in tackling ambiguous settings where multiple valid answers exist, these approaches often neglect to provide source citations, leaving users to evaluate the factuality of each answer. On the other hand, existing work on citation generation has focused on unambiguous settings with single answers, failing to address the complexity of real-world scenarios. Despite the importance of both aspects, no prior research has combined them, leaving a significant gap in the development of QA systems. In this work, we bridge this gap by proposing the novel task of QA with source citation in ambiguous settings, where multiple valid answers exist. To facilitate research in this area, we create a comprehensive framework consisting of: (1) five novel datasets, obtained by augmenting three existing reading comprehension datasets with citation meta-data across various ambiguous settings, such as distractors and paraphrasing; (2) the first ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; (3) two new metrics to evaluate models’ performances; and (4) several strong baselines using rule-based, prompting, and finetuning approaches over five large language models. We hope that this new task, datasets, metrics, and baselines will inspire the community to push the boundaries of QA research and develop more trustworthy and interpretable systems.

[NLP-125] Persona Knowledge-Aligned Prompt Tuning Method for Online Debate ECAI2024

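[Quick read]: This paper asks how the social personae of the audience can be incorporated into argument quality assessment, that is, into judging how persuasive an argument is. The key contribution is a persona knowledge-aligned framework that uses ChatGPT's simulation and personification abilities to obtain audience persona knowledge and injects it into smaller language models via prompt tuning. The resulting pipeline shows significant and consistent improvement over competitive architectures.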

Link: https://arxiv.org/abs/2410.04239
Authors: Chunkit Chan, Cheng Jiayang, Xin Liu, Yauwai Yim, Yuxin Jiang, Zheye Deng, Haoran Li, Yangqiu Song, Ginny Y. Wong, Simon See
Keywords: process of exchanging, exchanging viewpoints, viewpoints or convincing, Debate, provided empirical evidence
Categories: Computation and Language (cs.CL)
Comments: Accepted to ECAI 2024

Abstract:Debate is the process of exchanging viewpoints or convincing others on a particular issue. Recent research has provided empirical evidence that the persuasiveness of an argument is determined not only by language usage but also by communicator characteristics. Researchers have paid much attention to aspects of languages, such as linguistic features and discourse structures, but combining argument persuasiveness and impact with the social personae of the audience has not been explored due to the difficulty and complexity. We have observed the impressive simulation and personification capability of ChatGPT, indicating a giant pre-trained language model may function as an individual to provide personae and exert unique influences based on diverse background knowledge. Therefore, we propose a persona knowledge-aligned framework for argument quality assessment tasks from the audience side. This is the first work that leverages the emergence of ChatGPT and injects such audience personae knowledge into smaller language models via prompt tuning. The performance of our pipeline demonstrates significant and consistent improvement compared to competitive architectures.

[NLP-126] Overview of Factify5WQA: Fact Verification through 5W Question-Answering AAAI2024

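[Quick read]: This overview paper is motivated by fake news spreading many times faster than real news, especially now that social media is the main news source for many younger people. The Factify5WQA shared task promotes automated fact verification by providing a dataset built around aspect-based question answering: each claim and its supporting document are associated with 5W questions that help compare the two sources. Systems are evaluated by the BLEU score of their answers, followed by classification accuracy. Submissions used custom training setups and pre-trained language models, among other approaches. The best-performing team reached an accuracy of 69.56%, a near 35% improvement over the baseline.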

Link: https://arxiv.org/abs/2410.04236
Authors: Suryavardan Suresh, Anku Rani, Parth Patwa, Aishwarya Reganti, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Keywords: Researchers have found, spreads much times, times faster, faster than real, Fact verification
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at defactify3@aaai2024

Abstract:Researchers have found that fake news spreads many times faster than real news. This is a major problem, especially in today’s world where social media is the key source of news for many among the younger population. Fact verification, thus, becomes an important task and many media sites contribute to the cause. Manual fact verification is a tedious task, given the volume of fake news online. The Factify5WQA shared task aims to increase research towards automated fake news detection by providing a dataset with an aspect-based question answering based fact verification method. Each claim and its supporting document is associated with 5W questions that help compare the two information sources. Performance in the task is measured objectively by first comparing answers using the BLEU score, which gauges the accuracy of the answers, and then by the accuracy of the classification. The task had submissions using custom training setups and pre-trained language models, among others. The best performing team posted an accuracy of 69.56%, which is a near 35% improvement over the baseline.
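
Comparing a predicted 5W answer with a reference answer by BLEU can be sketched as below. This is a simplified, unsmoothed sentence-level BLEU limited to unigrams and bigrams, written from the standard definition (clipped n-gram precision plus brevity penalty). The exact scoring configuration used by the shared task is not given here, so `max_n=2` and the example answers are assumptions.

```python
import math
from collections import Counter


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions for n = 1..max_n, times the brevity penalty. No smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = sum(c_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # any zero precision makes unsmoothed BLEU zero
        log_prec += math.log(overlap / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)
```

For instance, `bleu("New Delhi", "in New Delhi")` has perfect n-gram precision but is penalized for brevity. Because there is no smoothing, a one-word answer scores 0 with `max_n=2`, which is one reason practical setups often add smoothing for short answers.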

[NLP-127] Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

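[Quick read]: This paper addresses the technical and resource challenges of handling long sequences in large language models. The key contribution is an efficient and flexible attention architecture that achieves sparse attention through correlation-aware selection and merging mechanisms, together with a data augmentation technique on positional encodings, the Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRD NTK), which improves generalization to unseen positions. The approach cuts compute and fine-tuning time: Llama2-7B can be fine-tuned at a 32K sequence length on a single A100, and context length can be extended across the pre-training, fine-tuning, and inference phases. The method reaches 100% accuracy on the passkey task at a 4M context length and stable perplexity at 1M, with at least a 64-fold reduction in resource requirements compared with traditional full attention.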

Link: https://arxiv.org/abs/2410.04211
Authors: Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li
Keywords: Modeling long sequences, Modeling long, handle longer sequences, handle longer, extending existing architectures
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures

Abstract:Modeling long sequences is crucial for various large-scale models; however, extending existing architectures to handle longer sequences presents significant technical and resource challenges. In this paper, we propose an efficient and flexible attention architecture that enables the extension of context lengths in large language models with reduced computational resources and fine-tuning time compared to other excellent methods. Specifically, we introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention. In addition, we also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions. The results are as follows: First, using a single A100, we achieve fine-tuning on Llama2-7B with a sequence length of 32K, which is more efficient than other methods that rely on subsets for regression. Second, we present a comprehensive method for extending context lengths across the pre-training, fine-tuning, and inference phases. During pre-training, our attention mechanism partially breaks translation invariance during token selection, so we apply positional encodings only to the selected tokens. This approach achieves relatively high performance and significant extrapolation capabilities. For fine-tuning, we introduce Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRD NTK). This design allows fine-tuning with a sequence length of only 16K, enabling models such as Llama2-7B and Mistral-7B to perform inference with context lengths of up to 1M or even arbitrary lengths. Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length. This represents at least a 64-fold reduction in resource requirements compared to traditional full-attention mechanisms, while still achieving competitive performance.
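
The "NTK" part of the positional embedding can be illustrated with the widely used NTK-aware rescaling of the rotary embedding (RoPE) base. This sketch shows only that general technique. The cyclic, randomly truncated, and dynamically growing components of CRD NTK are not described in enough detail here to reproduce, and the function name and defaults (`dim=128`, `base=10000`) are assumptions.

```python
def rope_inv_freq(dim, base=10000.0, scale=1.0):
    """Inverse frequencies of rotary position embeddings with NTK-aware scaling.

    The base is enlarged to base * scale ** (dim / (dim - 2)), so the highest
    frequency is unchanged while the lowest is divided by exactly `scale`.
    """
    ntk_base = base * scale ** (dim / (dim - 2))
    return [ntk_base ** (-2 * i / dim) for i in range(dim // 2)]


plain = rope_inv_freq(128)               # original frequencies
stretched = rope_inv_freq(128, scale=8)  # aiming at roughly 8x longer contexts
```

Here `stretched[0]` equals `plain[0]`, so nearby tokens are still resolved at full resolution, while the slowest-rotating dimension turns 8 times more slowly and therefore covers a longer position range before wrapping around.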

[NLP-128] LongGenBench: Long-context Generation Benchmark EMNLP2024

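[Quick read]: This paper addresses the fact that existing long-context benchmarks focus mainly on retrieval-based tests and rarely evaluate long-context generation. The key contribution is LongGenBench, a synthetic benchmark that allows flexible configuration of the generation context length and redesigns the question format so that large language models (LLMs) must respond with a single, cohesive long-context answer. Extensive evaluation reveals how performance degrades for different LLMs in long-context generation scenarios.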

Link: https://arxiv.org/abs/2410.04199
Authors: Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu
Keywords: requiring Large Language, Large Language Models, locate specific information, requiring Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024

Abstract:Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API accessed and open source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open source models.

[NLP-129] CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints

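[Quick read]: This paper addresses the evaluation of LLM creativity in story writing, in particular how to tell whether a generated story closely follows existing stories in the training data or is genuinely new. The key contribution is a benchmark dataset, CS4, which raises prompt specificity by increasing the number of requirements and constraints in the prompt, making it harder for LLMs to simply retell high-quality stories from their training data. This lets researchers measure creativity indirectly, without human annotation. Experiments show that different LLMs perform very differently under different numbers of constraints, and that Learning from Human Feedback (LHF) helps LLMs select better stories from their training data but has limited influence on their ability to produce creative stories unseen in the training corpora.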

Link: https://arxiv.org/abs/2410.04197
Authors: Anirudh Atmakuru, Jatin Nainani, Rohith Siddhartha Reddy Bheemreddy, Anirudh Lakkaraju, Zonghai Yao, Hamed Zamani, Haw-Shiuan Chang
Keywords: proprietary training corpus, story writing, writing is difficult, difficult because LLM-generated
Categories: Computation and Language (cs.CL)

Abstract:Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs’ creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model’s instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs’ ability to produce creative stories that are unseen in the training corpora. The benchmark is released at this https URL.


[NLP-130] Consistent Autoformalization for Constructing Mathematical Libraries EMNLP2024

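[Quick read]: This paper addresses the consistency and reliability of large language models (LLMs) in autoformalization as the target mathematical domain becomes more complex and specialized. The key idea is the coordinated use of three mechanisms: most-similar retrieval augmented generation (MS-RAG), denoising steps, and auto-correction with syntax error feedback (Auto-SEF). Together they make the results syntactically, terminologically, and semantically more consistent, and they deliver improvements across different types of LLMs.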

Link: https://arxiv.org/abs/2410.04194
Authors: Lan Zhang, Xin Quan, Andre Freitas
Keywords: formal language expression, automatically translating mathematical, translating mathematical content, mathematical content written, Large Language Models
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 camera-ready

Abstract:Autoformalization is the task of automatically translating mathematical content written in natural language to a formal language expression. The growing language interpretation capabilities of Large Language Models (LLMs), including in formal languages, are lowering the barriers for autoformalization. However, LLMs alone are not capable of consistently and reliably delivering autoformalization, in particular as the complexity and specialization of the target domain grows. As the field evolves in the direction of systematically applying autoformalization to large mathematical libraries, the need to improve syntactic, terminological and semantic control increases. This paper proposes the coordinated use of three mechanisms, most-similar retrieval augmented generation (MS-RAG), denoising steps, and auto-correction with syntax error feedback (Auto-SEF), to improve autoformalization quality. The empirical analysis, across different models, demonstrates that these mechanisms can deliver autoformalization results which are syntactically, terminologically and semantically more consistent. These mechanisms can be applied across different LLMs and have been shown to deliver improved results across different model types.
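
How retrieval and syntax-error feedback might fit together can be sketched as a small control loop. This is a minimal sketch, not the paper's implementation: token-level Jaccard similarity stands in for whatever retriever MS-RAG uses, the denoising step is omitted, and `generate` and `check_syntax` are hypothetical callables that a real system would back with an LLM and a proof assistant's parser.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve_most_similar(query, library, k=2):
    """Return the k (informal, formal) pairs whose informal side is closest to `query`."""
    return sorted(library, key=lambda pair: jaccard(query, pair[0]), reverse=True)[:k]


def autoformalize(statement, library, generate, check_syntax, max_rounds=3):
    """Generate a formalization, retrying with syntax-error feedback.

    generate(statement, examples, feedback) -> candidate string
    check_syntax(candidate) -> None if well-formed, else an error message
    """
    examples = retrieve_most_similar(statement, library)
    feedback = None
    candidate = None
    for _ in range(max_rounds):
        candidate = generate(statement, examples, feedback)
        error = check_syntax(candidate)
        if error is None:
            return candidate
        feedback = error  # pass the error message into the next attempt
    return candidate  # best effort after max_rounds attempts
```

Retrieving examples from the target library is what encourages consistent terminology, while the feedback loop targets syntactic validity. The loop returns the last candidate even if it still fails the check, so a caller should re-validate the result.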

[NLP-131] Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

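[Quick read]: This paper studies how easily the safety mechanisms of large language models (LLMs) can be bypassed by jailbreak attacks. It introduces a scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Specifically, the LLM is first engaged in a resource-intensive preliminary task (a character map lookup and decoding process) that saturates its processing capacity before the target instruction is presented, so that safety protocols are not activated when that instruction is processed. The method requires neither gradient access nor manual prompt engineering, and its attack strength can be adjusted to suit models of different scales.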

Link: https://arxiv.org/abs/2410.04190
Authors: Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang He, Yi Zeng
Keywords: Large Language Models, Large Language, Language Models, remain vulnerable, Large
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored for certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM’s safety policies by occupying its computational resources. Our method involves engaging the LLM in a resource-intensive preliminary task - a Character Map lookup and decoding process - before presenting the target instruction. By saturating the model’s processing capacity, we prevent the activation of safety protocols when processing the subsequent instruction. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access or manual prompt engineering. We verify that our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We show that the safety policies of LLMs might be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intensive conditions.

[NLP-132] DiDOTS: Knowledge Distillation from Large-Language-Models for Dementia Obfuscation in Transcribed Speech

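[Quick read]: This paper addresses privacy leakage from transcribed speech of people with dementia, in particular how to obfuscate dementia-related information without relying on large labeled datasets. The approach uses large language models (LLMs) with several prompt designs (zero-shot, few-shot, and knowledge-based) as obfuscators, and proposes DiDOTS, a method that distills knowledge from LLMs through a teacher-student paradigm and parameter-efficient fine-tuning. DiDOTS has far fewer parameters than its teacher LLM while retaining or even improving privacy performance and preserving the utility of the text.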

Link: https://arxiv.org/abs/2410.04188
Authors: Dominika Woszczyk, Soteris Demetriou
Keywords: neurocognitive disorder affecting, disorder affecting tens, sensitive neurocognitive disorder, neurocognitive disorder, disorder affecting
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at PoPETS '25

Abstract:Dementia is a sensitive neurocognitive disorder affecting tens of millions of people worldwide and its cases are expected to triple by 2050. Alarmingly, recent advancements in dementia classification make it possible for adversaries to violate affected individuals’ privacy and infer their sensitive condition from speech transcriptions. Existing obfuscation methods in text have never been applied to dementia and depend on the availability of large labeled datasets, which are challenging to collect for sensitive medical attributes. In this work, we bridge this research gap and tackle the above issues by leveraging Large-Language-Models (LLMs) with diverse prompt designs (zero-shot, few-shot, and knowledge-based) to obfuscate dementia in speech transcripts. Our evaluation shows that LLMs are more effective dementia obfuscators compared to competing methods. However, they have billions of parameters, which renders them hard to train, store and share, and they are also fragile, suffering from hallucination, refusal and contradiction effects, among others. To further mitigate these, we propose a novel method, DiDOTS. DiDOTS distills knowledge from LLMs using a teacher-student paradigm and parameter-efficient fine-tuning. DiDOTS has one order of magnitude fewer parameters compared to its teacher LLM and can be fine-tuned using three orders of magnitude fewer parameters compared to full fine-tuning. Our evaluation shows that compared to prior work DiDOTS retains the performance of LLMs, achieving 1.3x and 2.2x improvement in privacy performance on two datasets, while humans rate it as better in preserving utility even when compared to state-of-the-art paraphrasing models.

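[NLP-133] Towards Effective Counter-Responses: Aligning Human Preferences with Strategies to Combat Online Trolling EMNLP2024

Quick read: The paper addresses how to choose a response strategy for different types of trolling in online communities. Its key idea is to recommend response strategies (RSs) that match human preferences: the authors analyze the correlation between trolling types and the strategies people prefer, then generate counter-responses guided by the recommended RS. The method is supported by a dataset that aligns strategies with human preferences across various troll contexts, and the experiments show that it guides constructive discussion and reduces the negative effects of trolls.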

Link: https://arxiv.org/abs/2410.04164
Authors: Huije Lee, Hoyun Song, Jisu Shin, Sukmin Cho, SeungYoon Han, Jong C. Park
Keywords: communities typically involves, typically involves disruptive, involves disruptive behaviors, online communities typically, emotional distress
Categories: Computation and Language (cs.CL)
Comments: Findings of EMNLP 2024

Abstract:Trolling in online communities typically involves disruptive behaviors such as provoking anger and manipulating discussions, leading to a polarized atmosphere and emotional distress. Robust moderation is essential for mitigating these negative impacts and maintaining a healthy and constructive community atmosphere. However, effectively addressing trolls is difficult because their behaviors vary widely and require different response strategies (RSs) to counter them. This diversity makes it challenging to choose an appropriate RS for each specific situation. To address this challenge, our research investigates whether humans have preferred strategies tailored to different types of trolling behaviors. Our findings reveal a correlation between the types of trolling encountered and the preferred RS. In this paper, we introduce a methodology for generating counter-responses to trolls by recommending appropriate RSs, supported by a dataset aligning these strategies with human preferences across various troll contexts. The experimental results demonstrate that our proposed approach guides constructive discussion and reduces the negative effects of trolls, thereby enhancing the online community environment.

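[NLP-134] Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

Quick read: The paper addresses toxic content generation by large language models (LLMs). Its key idea is Toxic Subword Pruning (ToxPrune), an algorithm that prunes the subwords contained in toxic words from the BPE (Byte Pair Encoding) vocabulary of a trained LLM rather than retraining the model. Earlier work showed that pruning BPE tokens harms machine translation; this study instead finds that ToxPrune prevents toxic generations and also improves diversity in dialogue response generation.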

Link: https://arxiv.org/abs/2410.04155
Authors: Hongyuan Lu, Wai Lam
Keywords: defend large language, important research area, defend large, large language models, generating toxic content
Categories: Computation and Language (cs.CL)
Comments:

Abstract:How to defend large language models (LLMs) from generating toxic content is an important research area. Yet, most research focused on various model training techniques to remediate LLMs by updating their weights. A typical related research area is safety alignment. This however is often costly and tedious and can expose the model to even more problems such as catastrophic forgetting if the trainings are not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, namely Toxic Subword Pruning (ToxPrune) to prune the subword contained by the toxic words from BPE in trained LLMs. In contrast to the previous work that demonstrates pruning BPE tokens as harmful to the task of machine translation, we surprisingly found its usefulness in preventing toxic content from being generated on LLMs. Fortunately, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on the task of dialogue response generation obviously. We surprisingly found that ToxPrune can even obviously improve official Llama-3.1-6B in the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful for both remediating toxic LLMs and improving non-toxic LLMs on the task of dialogue response generation. We plan to release the resources to facilitate future work.
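
The pruning idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how pruned subwords are selected or how they are removed, so the rule below (ban every vocabulary entry whose text contains a listed word, then mask its logit) is an assumption for illustration, and `toxic_token_ids`, `prune_logits`, and the toy vocabulary are hypothetical names, not the authors' code.

```python
import math

NEG_INF = float("-inf")

def toxic_token_ids(vocab, toxic_words):
    """Collect ids of vocabulary entries whose text contains a listed word.

    Assumption: substring matching stands in for the paper's selection rule.
    """
    return {tid for tok, tid in vocab.items()
            if any(word in tok.lower() for word in toxic_words)}

def prune_logits(logits, banned):
    """Set the logit of every banned token id to -inf so it cannot be sampled."""
    return [NEG_INF if i in banned else x for i, x in enumerate(logits)]

def softmax(logits):
    """Softmax that assigns probability 0 to -inf entries."""
    finite_max = max(x for x in logits if x != NEG_INF)
    exps = [0.0 if x == NEG_INF else math.exp(x - finite_max) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (hypothetical): token text -> token id.
vocab = {"hel": 0, "lo": 1, "dumb": 2, "dumbo": 3, "friend": 4}
banned = toxic_token_ids(vocab, ["dumb"])
probs = softmax(prune_logits([1.0, 0.5, 3.0, 2.0, 0.2], banned))
```

Masking at the logit level leaves the model weights untouched, which matches the abstract's contrast with training-based remediation.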

[NLP-135] Reasoning with Natural Language Explanations EMNLP2024

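Quick read: This tutorial addresses how to model explanatory reasoning in Natural Language Inference (NLI), that is, how to build NLI models that effectively encode and use natural language explanations. It grounds the discussion in the epistemological-linguistic foundations of explanation and systematically describes the main architectural trends and evaluation methodologies for building systems capable of explanatory reasoning.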

Link: https://arxiv.org/abs/2410.04148
Authors: Marco Valentino, André Freitas
Keywords: media supporting scientific, supporting scientific discovery, Natural Language Inference, natural language explanations, explanation-based NLI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Tutorial to be presented at EMNLP 2024. Website: this https URL

Abstract:Explanation constitutes an archetypal feature of human rationality, underpinning learning and generalisation, and representing one of the media supporting scientific discovery and communication. Due to the importance of explanations in human reasoning, an increasing amount of research in Natural Language Inference (NLI) has started reconsidering the role that explanations play in learning and inference, attempting to build explanation-based NLI models that can effectively encode and use natural language explanations on downstream tasks. Research in explanation-based NLI, however, presents specific challenges and opportunities, as explanatory reasoning reflects aspects of both material and formal inference, making it a particularly rich setting to model and deliver complex reasoning. In this tutorial, we provide a comprehensive introduction to the field of explanation-based NLI, grounding this discussion on the epistemological-linguistic foundations of explanations, systematically describing the main architectural trends and evaluation methodologies that can be used to build systems capable of explanatory reasoning.

[NLP-136] Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?

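Quick read: The paper addresses how to select the language of each minibatch when training many-to-one neural machine translation systems on scarce data. Its key idea is an algorithm that monitors the smoothed KL divergence of the weights across all Transformer layers and switches the minibatch language once the weights stop evolving significantly. In translation quality and convergence speed, the method beats alternating monolingual batches but not shuffled batches.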

Link: https://arxiv.org/abs/2410.04147
Authors: Àlex R. Atrio, Alexis Allemann, Ljiljana Dolamic, Andrei Popescu-Belis
Keywords: neural machine translation, machine translation systems, translation systems improve, neural machine, data is scarce
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.
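
The switching criterion described in this abstract could be sketched roughly as follows. The abstract does not specify how weights are turned into distributions or how the KL values are smoothed, so the softmax normalisation, the exponential moving average, and the names `weight_change` and `LanguageSwitcher` are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Normalise a flat list of weights into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def weight_change(prev_layers, curr_layers):
    """Mean per-layer KL between current and previous (softmaxed) weights."""
    values = [kl_divergence(softmax(curr), softmax(prev))
              for prev, curr in zip(prev_layers, curr_layers)]
    return sum(values) / len(values)

class LanguageSwitcher:
    """Moves to the next language once the smoothed weight change is small."""

    def __init__(self, languages, threshold, alpha=0.5):
        self.languages = languages
        self.threshold = threshold  # hypothetical hyperparameter
        self.alpha = alpha          # smoothing factor of the moving average
        self.index = 0
        self.smoothed = None

    def update(self, prev_layers, curr_layers):
        change = weight_change(prev_layers, curr_layers)
        if self.smoothed is None:
            self.smoothed = change
        else:
            self.smoothed = self.alpha * change + (1 - self.alpha) * self.smoothed
        if self.smoothed < self.threshold:
            # Weights have stopped evolving: switch language and reset the average.
            self.index = (self.index + 1) % len(self.languages)
            self.smoothed = None
        return self.languages[self.index]

switcher = LanguageSwitcher(["fr", "de"], threshold=1e-3)
after_no_change = switcher.update([[0.0, 1.0]], [[0.0, 1.0]])
after_big_change = switcher.update([[0.0, 1.0]], [[3.0, -3.0]])
```

In a real training loop the layer lists would be flattened parameter snapshots taken before and after a group of updates.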

[NLP-137] From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression EMNLP2024

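Quick read: The paper addresses the high computational cost and obscured key information that come with long LLM prompts. Its key idea is Reading To Compressing (R2C), a prompt compression method that uses the cross-attention scores of the Fusion-in-Decoder (FiD) architecture to identify the important chunks and sentences in a prompt. R2C captures the global context without compromising semantic consistency and avoids the need for pseudo-labels to train a compressor; it reduces prompt length by 80% while improving LLM performance by 6% in out-of-domain evaluations.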

Link: https://arxiv.org/abs/2410.04139
Authors: Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee
Keywords: Large language models, advanced prompting techniques, Large language, achieved significant performance, significant performance gains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2024; 21 pages; 10 figures and 7 tables

Abstract:Large language models (LLMs) have achieved significant performance gains using advanced prompting techniques over various tasks. However, the increasing length of prompts leads to high computational costs and often obscures crucial information. Prompt compression has been proposed to alleviate these issues, but it faces challenges in (i) capturing the global context and (ii) training the compressor effectively. To tackle these challenges, we introduce a novel prompt compression method, namely Reading To Compressing (R2C), utilizing the Fusion-in-Decoder (FiD) architecture to identify the important information in the prompt. Specifically, the cross-attention scores of the FiD are used to discern essential chunks and sentences from the prompt. R2C effectively captures the global context without compromising semantic consistency while detouring the necessity of pseudo-labels for training the compressor. Empirical results show that R2C retains key contexts, enhancing the LLM performance by 6% in out-of-domain evaluations while reducing the prompt length by 80%.
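
As a rough illustration of score-based compression, the sketch below keeps the highest-scoring chunks that fit a token budget and restores their original order. It assumes the per-chunk importance scores are already available (in R2C they come from FiD cross-attention, which is not reproduced here); `compress_prompt`, the whitespace token count, and the greedy budget rule are hypothetical simplifications.

```python
def compress_prompt(chunks, scores, budget):
    """Keep the highest-scoring chunks whose total length fits the budget.

    chunks: list of text chunks in prompt order
    scores: one importance score per chunk (assumed precomputed)
    budget: maximum number of whitespace-separated tokens to keep
    """
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        length = len(chunks[i].split())
        if used + length <= budget:
            kept.add(i)
            used += length
    # Emit the survivors in their original order to preserve readability.
    return [chunks[i] for i in sorted(kept)]

chunks = ["a b c", "d e", "f g h i", "j"]
scores = [0.1, 0.6, 0.2, 0.9]
compressed = compress_prompt(chunks, scores, budget=4)
```

The abstract also mentions sentence-level selection; the same function could be applied a second time to the sentences inside the kept chunks.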

[NLP-138] Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

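Quick read: The paper addresses two challenges in applying Reinforcement Learning from AI Feedback (RLAIF) to medical dialogue models: the limitations of automated evaluation methods and the difficulty of accurately representing physician preferences. It introduces an evaluation framework based on standardized patient examinations to objectively assess how well LLMs guide users and follow instructions, and it finds that flowcharts are a particularly effective way to express physician preferences with Constitutional AI algorithms. Building on this, it proposes an agent-based annotation approach that autonomously creates medical dialogue flows tailored to the patient's condition, generalizes well, and reduces the need for expert involvement.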

Link: https://arxiv.org/abs/2410.04112
Authors: Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao
Keywords: Reinforcement Learning, current RLAIF research, improve healthcare dialogue, techniques to improve, aim of tackling
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 12 figures

Abstract:This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models, with the aim of tackling the challenges of preference-aligned data annotation while reducing the reliance on medical experts. We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods and the difficulties in accurately representing physician preferences. To address these challenges, we present a new evaluation framework based on standardized patient examinations. This framework is designed to objectively assess the effectiveness of large language models (LLMs) in guiding users and following instructions, enabling a comprehensive comparison across different models. Furthermore, our investigation of effective ways to express physician preferences using Constitutional AI algorithms highlighted the particular effectiveness of flowcharts. Utilizing this finding, we introduce an innovative agent-based approach for annotating preference data. This approach autonomously creates medical dialogue flows tailored to the patient’s condition, demonstrates strong generalization abilities, and reduces the need for expert involvement. Our results show that the agent-based approach outperforms existing RLAIF annotation methods in standardized patient examinations and surpasses current open source medical dialogue LLMs in various test scenarios.

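[NLP-139] TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

Quick read: The paper addresses hallucination in large vision-language models (LVLMs) when they face unanswerable questions, that is, generating content that is incorrect or unfaithful to the visual or textual input. Its key contribution is TUBench, a benchmark designed to evaluate LVLM reliability on unanswerable questions. TUBench contains carefully crafted unanswerable questions built with ten strategies over images from four domains (screenshots of code snippets, natural images, geometry diagrams, and screenshots of statistical tables), which test trustworthiness in code, commonsense, geometric, and table-related mathematical reasoning respectively. The authors evaluate 28 leading foundation models on their ability to judge whether a question is answerable.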

Link: https://arxiv.org/abs/2410.04107
Authors: Xingwei He, Qianru Zhang, A-Long Jin, Yuan Yuan, Siu-Ming Yiu
Keywords: Large Vision-Language Models, achieved remarkable progress, Large Vision-Language, linguistic interpretation, unanswerable questions
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress on visual perception and linguistic interpretation. Despite their impressive capabilities across various tasks, LVLMs still suffer from the issue of hallucination, which involves generating content that is incorrect or unfaithful to the visual or textual inputs. Traditional benchmarks, such as MME and POPE, evaluate hallucination in LVLMs within the scope of Visual Question Answering (VQA) using answerable questions. However, some questions are unanswerable due to insufficient information in the images, and the performance of LVLMs on such unanswerable questions remains underexplored. To bridge this research gap, we propose TUBench, a benchmark specifically designed to evaluate the reliability of LVLMs using unanswerable questions. TUBench comprises an extensive collection of high-quality, unanswerable questions that are meticulously crafted using ten distinct strategies. To thoroughly evaluate LVLMs, the unanswerable questions in TUBench are based on images from four diverse domains as visual contexts: screenshots of code snippets, natural images, geometry diagrams, and screenshots of statistical tables. These unanswerable questions are tailored to test LVLMs’ trustworthiness in code reasoning, commonsense reasoning, geometric reasoning, and mathematical reasoning related to tables, respectively. We conducted a comprehensive quantitative evaluation of 28 leading foundational models on TUBench, with Gemini-1.5-Pro, the top-performing model, achieving an average accuracy of 69.2%, and GPT-4o, the third-ranked model, reaching 66.7% average accuracy, in determining whether questions are answerable. TUBench is available at this https URL.

[NLP-140] A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models EMNLP2024

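Quick read: The paper addresses the performance and training-cost gap between the two paradigms for LLM version updates, pre-training from scratch (PTFS) and continual pre-training (CPT). Its key idea is a learning rate path switching paradigm with one main path and multiple branching paths. The main path pre-trains the LLM at the maximal learning rate, and each branching path updates the LLM with newly added training data. Because a large learning rate in the first stage and a complete learning rate decay in the second stage both matter, the paradigm lowers the total training cost of version updates (to 58% of PTFS when training four versions) while keeping pre-training performance comparable to PTFS.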

Link: https://arxiv.org/abs/2410.04103
Authors: Zhihao Wang, Shiyu Liu, Jianheng Huang, Zheng Wang, Yixuan Liao, Xiaoxin Chen, Junfeng Yao, Jinsong Su
Keywords: Large Language Models, Language Models, Large Language, version updates, learning rate
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 (main, long paper)

Abstract:Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, where we pre-train a LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly-added training data. Extensive experiments demonstrate the effectiveness and generalization of our paradigm. Particularly, when training four versions of LLMs, our paradigm reduces the total training cost to 58% compared to PTFS, while maintaining comparable pre-training performance.
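
The two kinds of path can be pictured as two learning rate functions. The sketch below is only an illustration of the idea in the abstract: the linear warmup, the cosine shape of the decay, and the names `main_path_lr` and `branch_path_lr` are assumptions, since the abstract states only that the main path uses the maximal learning rate and that each branch completes a full decay.

```python
import math

def main_path_lr(step, max_lr, warmup_steps):
    """Main path: warm up linearly, then hold the maximal learning rate."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr

def branch_path_lr(step, branch_start, branch_steps, max_lr, min_lr):
    """Branching path: decay from max_lr to min_lr over the branch.

    Assumption: cosine decay; the abstract does not name the decay shape.
    """
    progress = min(max(step - branch_start, 0) / branch_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# A version update branches off a main-path checkpoint at step 1000
# and trains for 200 further steps on the newly added data.
lr_at_branch_start = branch_path_lr(1000, 1000, 200, max_lr=1e-3, min_lr=1e-5)
lr_at_branch_end = branch_path_lr(1200, 1000, 200, max_lr=1e-3, min_lr=1e-5)
```

Because every branch starts from a main-path checkpoint that never decayed, the next version can branch from a later checkpoint of the same main path instead of restarting from scratch.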

[NLP-141] BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Blooms-Taxonomy-Inspired Prompts

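Quick read: The paper addresses the limited performance of LLMs on mathematical problems and reasoning tasks. Its key idea is BloomWise, a prompting technique inspired by Bloom's Taxonomy that has the model start from simple cognitive skills (remembering) and progress to higher ones (analyzing) until it reaches the correct solution. The model's own self-evaluation decides whether a more sophisticated skill is needed. Experiments on four math reasoning datasets demonstrate the effectiveness of the approach.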

Link: https://arxiv.org/abs/2410.04094
Authors: Maria-Eleni Zoumpoulidi, Georgios Paraskevopoulos, Alexandros Potamianos
Keywords: Large Language Models, tasks remains limited, Language Models, Large Language, reasoning tasks remains
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures

Abstract:Despite the continuous progress of Large Language Models (LLMs) across various tasks, their performance on mathematical problems and reasoning tasks remains limited. This limitation can be attributed, among other factors, to the inherent difficulty of these problems and the fact that solutions often consist of multiple steps, potentially of varying nature, making it challenging for a single prompting technique to execute all required steps. To address this, we introduce BloomWise, a new prompting technique, inspired by Bloom’s Taxonomy, aiming to improve LLMs’ performance in solving such problems by encouraging them to approach the problem starting from simple, i.e., remembering, and progressing to higher cognitive skills, i.e., analyzing, until the correct solution is reached. The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM. Thus, we encourage the LLM to deploy the appropriate cognitive processes. In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach. We also present extensive ablations, analyzing the strengths of each module within our system.
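
The escalation loop described in this abstract might look like the sketch below. The abstract names only remembering and analyzing, so the full six-level list is taken from the revised Bloom's Taxonomy as an assumption, and `bloomwise_solve`, `solve`, and `is_satisfied` are hypothetical stand-ins for the prompting and self-evaluation calls to an LLM.

```python
# Levels of the revised Bloom's Taxonomy, from simplest to most demanding.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def bloomwise_solve(problem, solve, is_satisfied, levels=BLOOM_LEVELS):
    """Try each cognitive level in turn until self-evaluation accepts the answer.

    solve(problem, level) -> answer      (would prompt the LLM at that level)
    is_satisfied(problem, answer) -> bool (would ask the LLM to self-evaluate)
    """
    answer = None
    for level in levels:
        answer = solve(problem, level)
        if is_satisfied(problem, answer):
            return level, answer
    # No level passed self-evaluation: return the last attempt.
    return levels[-1], answer

# Stubs standing in for real LLM calls, for demonstration only.
def fake_solve(problem, level):
    return "42" if BLOOM_LEVELS.index(level) >= 2 else "unsure"

def fake_check(problem, answer):
    return answer == "42"

level_used, final_answer = bloomwise_solve("What is 6 * 7?", fake_solve, fake_check)
```

Stopping at the first accepted level keeps easy problems cheap, while harder ones trigger the more elaborate prompts.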

[NLP-142] GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual Cross-lingual and Multi-document News Summarization EMNLP2024

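Quick read: The paper addresses multi-lingual, cross-lingual, and multi-document news summarization (MCMS), a task that unifies the real-world requirements of news summarization. Its key contribution is the GLOBESUMM benchmark dataset, built by collecting multilingual news reports and restructuring them into an event-centric format, which fills the gap left by the lack of a benchmark. The paper also introduces protocol-guided prompting for high-quality, cost-effective reference annotation, and highlights conflicts between news reports, alongside redundancies and omissions, as a source of the dataset's difficulty.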

Link: https://arxiv.org/abs/2410.04087
Authors: Yangfan Ye, Xiachong Feng, Xiaocheng Feng, Weitao Ma, Libo Qin, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Keywords: today global scene, today global, global scene, content and varied, varied viewpoints
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 main conference, long paper

Abstract:News summarization in today’s global scene can be daunting with its flood of multilingual content and varied viewpoints from different sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task, i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting a wealth of multilingual news reports and restructuring them into event-centric format. Additionally, we introduce the method of protocol-guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between news reports, in addition to the issues of redundancies and omissions, further enhancing the complexity of GLOBESUMM. Through extensive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.

[NLP-143] PsFuture: A Pseudo-Future-based Zero-Shot Adaptive Policy for Simultaneous Machine Translation EMNLP2024

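Quick read: The paper addresses the sophisticated architectures and extensive parameter configurations that traditional Simultaneous Machine Translation (SiMT) methods need to train adaptive read/write policies. Its key idea is PsFuture, a zero-shot adaptive read/write policy that lets the translation model decide read/write actions on its own without additional training. The paper also introduces Prefix-to-Full (P2F), a training strategy that adapts offline translation models to SiMT by exploiting their bidirectional attention mechanism, and achieves a strong trade-off between translation quality and latency.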

Link: https://arxiv.org/abs/2410.04075
Authors: Libo Zhao, Jing Li, Ziqian Zeng
Keywords: Simultaneous Machine Translation, Simultaneous Machine, streaming source tokens, requires target tokens, Machine Translation
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 main conference

Abstract:Simultaneous Machine Translation (SiMT) requires target tokens to be generated in real-time as streaming source tokens are consumed. Traditional approaches to SiMT typically require sophisticated architectures and extensive parameter configurations for training adaptive read/write policies, which in turn demand considerable computational power and memory. We propose PsFuture, the first zero-shot adaptive read/write policy for SiMT, enabling the translation model to independently determine read/write actions without the necessity for additional training. Furthermore, we introduce a novel training strategy, Prefix-to-Full (P2F), specifically tailored to adjust offline translation models for SiMT applications, exploiting the advantages of the bidirectional attention mechanism inherent in offline models. Experiments across multiple benchmarks demonstrate that our zero-shot policy attains performance on par with strong baselines and the P2F method can further enhance performance, achieving an outstanding trade-off between translation quality and latency.

[NLP-144] On Eliciting Syntax from Language Models via Hashing EMNLP-2024

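Quick read: The paper addresses unsupervised parsing (grammar induction), that is, inferring syntactic structure from raw text. Its key idea is to exploit the information-preserving capability of binary representations at the lexicon and syntax levels: it upgrades the bit-level CKY from zero-order to first-order so that lexicon and syntax are encoded in a unified binary representation space, switches training from supervised to unsupervised under the contrastive hashing framework, and introduces a new loss function that imposes stronger yet balanced alignment signals. The model performs competitively on various datasets, which suggests that high-quality parsing trees can be acquired from pre-trained language models at low cost.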

Link: https://arxiv.org/abs/2410.04074
Authors: Yiran Wang, Masao Utiyama
Keywords: infer syntactic structure, aims to infer, infer syntactic, syntactic structure, raw text
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP-2024

Abstract:Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets, therefore, we claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.

[NLP-145] PAD: Personalized Alignment at Decoding-Time

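Quick read: The paper addresses the computational cost and data demands of traditional alignment methods when aligning with personalized preferences that vary across cultural, educational, and political differences. Its key idea is Personalized Alignment at Decoding-time (PAD), a framework whose personalized reward modeling strategy decouples text generation from personalized preferences and yields generalizable token-level personalized rewards. The PAD algorithm uses these rewards to guide decoding, adjusting the base model's predictions during inference without additional training, and it generalizes to unseen preferences and scales across base models.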

Link: https://arxiv.org/abs/2410.04070
Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu
Keywords: significant challenge due, significantly across cultural, political differences, personalized preferences, traditional alignment methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase

Abstract:Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase, eliminating the need for additional training. By introducing a unique personalized reward modeling strategy, this framework decouples the text generation process from personalized preferences, facilitating the generation of generalizable token-level personalized rewards. The PAD algorithm leverages these rewards to guide the decoding process, dynamically tailoring the base model’s predictions to personalized preferences. Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training and scalability across different base models. This work advances the capability of LLMs to meet user needs in real-time applications, presenting a substantial step forward in personalized LLM alignment.
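
Reward-guided decoding of the kind described in this abstract can be sketched as a re-ranking of candidate tokens. The additive combination of base log-probability and weighted reward, the weight `beta`, and the function name `guided_next_token` are assumptions for illustration; the abstract does not give PAD's actual scoring rule or how its token-level rewards are computed.

```python
def guided_next_token(base_logprobs, rewards, beta=1.0):
    """Pick the token maximising base log-probability plus weighted reward.

    base_logprobs: dict token -> log-probability under the frozen base model
    rewards:       dict token -> personalised reward (missing tokens count as 0)
    beta:          hypothetical weight trading fluency against preference fit
    """
    scores = {tok: lp + beta * rewards.get(tok, 0.0)
              for tok, lp in base_logprobs.items()}
    return max(scores, key=scores.get)

base_logprobs = {"formal": -0.5, "casual": -1.0}
rewards = {"casual": 1.0}  # this user prefers a casual register

unguided = guided_next_token(base_logprobs, rewards, beta=0.0)
guided = guided_next_token(base_logprobs, rewards, beta=1.0)
```

The base model is never updated here: swapping in a different `rewards` table changes the behaviour at inference time only, which is the property the abstract emphasises.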

[NLP-146] ECon: On the Detection and Resolution of Evidence Conflicts EMNLP2024

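Quick read: The paper addresses misinformation detection and the management of conflicting information ("inter-evidence conflicts") in decision-making systems affected by LLM-generated content. Its key idea is a method for generating diverse, validated evidence conflicts that simulate real-world misinformation scenarios. On these conflicts it evaluates Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs as detectors, and analyzes how LLMs resolve conflicts. NLI and LLM models detect answer conflicts with high precision, though weaker models have low recall; FC models struggle with lexically similar answer conflicts, which NLI and LLM models handle better; and stronger models such as GPT-4 are robust, especially on nuanced conflicts. When resolving conflicts, LLMs often favor one piece of evidence without justification and rely on internal knowledge when they hold prior beliefs.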

Link: https://arxiv.org/abs/2410.04068
Authors: Cheng Jiayang, Chunkit Chan, Qianqian Zhuang, Lin Qiu, Tianhang Zhang, Tengxiao Liu, Yangqiu Song, Yue Zhang, Pengfei Liu, Zheng Zhang
Keywords: managing conflicting information, Natural Language Inference, large language models, decision-making systems, rise of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 main conference

Abstract:The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or “inter-evidence conflicts.” This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs’ conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.

[NLP-147] LoRTA: Low Rank Tensor Adaptation of Large Language Models

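[Quick read]: The paper addresses the high lower bound on the number of trainable parameters that Low Rank Adaptation (LoRA), a Parameter Efficient Fine Tuning (PEFT) method, incurs by using a low-rank matrix model. The key to the solution is a new low-rank tensor parametrization of model updates, which significantly reduces the number of trainable parameters while allowing finer-grained control over adapter size. Experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization, and Protein Folding benchmarks show that the method maintains comparable performance with substantially fewer parameters.

Link: https://arxiv.org/abs/2410.04060
Authors: Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro
Keywords: Efficient Fine Tuning, Low Rank Adaptation, Parameter Efficient Fine, Fine Tuning, Rank Adaptation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: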

Abstract:Low Rank Adaptation (LoRA) is a popular Parameter Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks. LoRA parameterizes model updates using low-rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, resource requirements during fine-tuning. However, the lower bound on the number of trainable parameters remains high due to the use of the low-rank matrix model. In this paper, we address this limitation by proposing a novel approach that employs a low rank tensor parametrization for model updates. The proposed low rank tensor model can significantly reduce the number of trainable parameters, while also allowing for finer-grained control over adapter size. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization and Protein Folding benchmarks demonstrate that our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
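
The parameter-count argument in this entry can be illustrated with a rough sketch. The abstract only says "low rank tensor parametrization"; the CP-style factorization below, the stacking of all updates into a single 4-way tensor, and the model dimensions are assumptions made for illustration, not details taken from the paper.

```python
def lora_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Trainable parameters when each adapted matrix gets its own A (d_in x r) and B (r x d_out)."""
    return n_matrices * rank * (d_in + d_out)


def cp_tensor_params(mode_sizes: list[int], rank: int) -> int:
    """Trainable parameters of a rank-r CP factorization: one (size x r) factor per tensor mode."""
    return rank * sum(mode_sizes)


# Hypothetical transformer: 32 layers, 4 adapted matrices per layer, each 4096 x 4096.
d, layers, mats = 4096, 32, 4

# Rank 1 is the smallest LoRA adapter, so this is the matrix model's lower bound.
lora = lora_params(d, d, rank=1, n_matrices=layers * mats)

# Assumed alternative: stack every update into one (d, d, layers, mats) tensor
# and factorize it jointly, so factors are shared across layers and matrices.
tensor = cp_tensor_params([d, d, layers, mats], rank=1)

print(lora, tensor)  # prints 1048576 8228
```

Under these assumptions the tensor rank becomes a much finer knob: each additional rank unit costs 8,228 parameters rather than about one million.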

[NLP-148] Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

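[Quick read]: The paper addresses the flawed responses that Vision-Language Models (VLMs) generate and proposes Self-Correction Learning (SCL). The key to the solution is Direct Preference Optimization (DPO) on the model's own self-generated self-correction data, which lets the model improve itself without external feedback. Specifically, initial and refined responses are collected during inference and sorted into preferred and disfavored samples, so that during fine-tuning the VLM learns to avoid its previous mistakes and improve performance. The approach stresses that self-correction is more than refining a response: additional training should strengthen the model's reasoning so that it generates high-quality responses directly.

Link: https://arxiv.org/abs/2410.04055
Authors: Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, Heng Ji
Keywords: shown remarkable abilities, Large Language Models, invariably generate flawed, language reasoning tasks, shown remarkable
Categories: Computation and Language (cs.CL)
Comments: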

Abstract:While Vision-Language Models (VLMs) have shown remarkable abilities in visual and language reasoning tasks, they invariably generate flawed responses. Self-correction that instructs models to refine their outputs presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Specifically, we collect preferred and disfavored samples based on the correctness of initial and refined responses, which are obtained by two-turn self-correction with VLMs during the inference stage. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their self-generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance the reasoning abilities of models through additional training, enabling them to generate high-quality responses directly without further refinement.
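
The data-collection step described in this entry might look roughly like the sketch below. The abstract says samples are labeled "based on the correctness of initial and refined responses" but does not give the exact rule, so the rule used here (keep a pair only when exactly one of the two responses is correct) and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SelfCorrectionSample:
    """One two-turn self-correction trace (hypothetical container)."""
    prompt: str
    initial: str           # first-turn response
    refined: str           # second-turn response after the self-correction instruction
    initial_correct: bool
    refined_correct: bool


def to_preference_pair(s: SelfCorrectionSample) -> Optional[dict]:
    """Assumed rule: the correct response is preferred, the incorrect one disfavored.

    Traces where both responses are correct or both are wrong carry no
    preference signal under this rule and are skipped.
    """
    if s.initial_correct == s.refined_correct:
        return None
    if s.refined_correct:
        return {"prompt": s.prompt, "chosen": s.refined, "rejected": s.initial}
    return {"prompt": s.prompt, "chosen": s.initial, "rejected": s.refined}


fixed = SelfCorrectionSample("How many cats are shown?", "Two", "Three", False, True)
broken = SelfCorrectionSample("How many cats are shown?", "Three", "Two", True, False)
tie = SelfCorrectionSample("How many cats are shown?", "Three", "Three", True, True)
pairs = [p for p in map(to_preference_pair, [fixed, broken, tie]) if p is not None]
```

The `prompt` / `chosen` / `rejected` keys follow a layout commonly used for preference datasets; it is a convenience here, not something the abstract specifies.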

[NLP-149] Large Language Models can Achieve Social Balance

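[Quick read]: The paper studies how interacting large language models (LLMs) can achieve social balance after continuous interactions. Whether balance is reached depends on three factors: (i) whether interactions are updated based on "relationships", "appraisals", or "opinions"; (ii) whether agents update their interactions based on homophily or on influence from their peers; and (iii) the number of simultaneous interactions the LLMs consider. The particular structure of the balanced state depends on these conditions and differs across models and sizes, as do the stability of interactions and the justifications given for updates, which indicates that social balance is driven by the pre-training and alignment particular to each LLM.

Link: https://arxiv.org/abs/2410.04054
Authors: Pedro Cisneros-Velarde
Keywords: Social balance, achieve social balance, population ends, antagonistic factions, concept in sociology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: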

Abstract:Social balance is a concept in sociology which states that if every three individuals in a population achieve certain structures of positive or negative interactions, then the whole population ends up in one faction of positive interactions or divided between two or more antagonistic factions. In this paper, we consider a group of interacting large language models (LLMs) and study how, after continuous interactions, they can achieve social balance. Across three different LLM models, we found that social balance depends on (i) whether interactions are updated based on “relationships”, “appraisals”, or “opinions”; (ii) whether agents update their interactions based on homophily or influence from their peers; and (iii) the number of simultaneous interactions the LLMs consider. When social balance is achieved, its particular structure of positive or negative interactions depends on these three conditions and are different across LLM models and sizes. The stability of interactions and the justification for their update also vary across models. Thus, social balance is driven by the pre-training and alignment particular to each LLM model.
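
The triad condition mentioned in this entry can be made concrete with a small check. The sketch below assumes the classical notion of structural balance, in which a triad is balanced when the product of its three edge signs is positive; the abstract's "two or more antagonistic factions" may correspond to a weaker variant, so treat this as one possible reading. The data layout is hypothetical.

```python
from itertools import combinations


def is_balanced(sign: dict) -> bool:
    """Check classical structural balance on a complete signed graph.

    `sign` maps frozenset({i, j}) to +1 (positive interaction) or -1 (negative).
    Every triad must have a positive product of signs.
    """
    nodes = sorted({n for edge in sign for n in edge})
    for a, b, c in combinations(nodes, 3):
        product = (sign[frozenset((a, b))]
                   * sign[frozenset((b, c))]
                   * sign[frozenset((a, c))])
        if product < 0:
            return False
    return True


def edges(ab: int, bc: int, ac: int) -> dict:
    """Build a three-agent signed graph from its three edge signs."""
    return {frozenset("AB"): ab, frozenset("BC"): bc, frozenset("AC"): ac}


one_faction = edges(+1, +1, +1)    # everyone on good terms
two_factions = edges(+1, -1, -1)   # A and B allied against C
unbalanced = edges(+1, +1, -1)     # B likes both A and C, who dislike each other
```

In the study's setting, the signs would come from the interactions the LLM agents report after each round; here they are fixed by hand.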

[NLP-150] Neuron-Level Sequential Editing for Large Language Models

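[Quick read]: The paper addresses sequential model editing in large language models (LLMs): modifying a model's internal knowledge over many editing rounds without costly retraining, while avoiding model forgetting and model failure. The key to the solution is a new method, Neuron-level Sequential Editing (NSE). It optimizes the target layer's hidden states using the model's original weights to prevent model failure, and iteratively selects neurons in multiple layers for editing based on their activation values to mitigate forgetting. Experiments show that NSE significantly outperforms existing parameter-modifying model editing methods, a substantial advance for sequential model editing.

Link: https://arxiv.org/abs/2410.04045
Authors: Houcheng Jiang, Junfeng Fang, Tianyu Zhang, An Zhang, Ruipeng Wang, Tao Liang, Xiang Wang
Keywords: sequential model editing, work explores sequential, model editing methods, model editing, large language models
Categories: Computation and Language (cs.CL)
Comments: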

Abstract:This work explores sequential model editing in large language models (LLMs), a critical task that involves modifying internal knowledge within LLMs continuously through multi-round editing, each incorporating updates or corrections to adjust the model outputs without the need for costly retraining. Existing model editing methods, especially those that alter model parameters, typically focus on single-round editing and often face significant challenges in sequential model editing, most notably issues of model forgetting and failure. To address these challenges, we introduce a new model editing method, namely Neuron-level Sequential Editing (NSE), tailored for supporting sequential model editing. Specifically, we optimize the target layer’s hidden states using the model’s original weights to prevent model failure. Furthermore, we iteratively select neurons in multiple layers for editing based on their activation values to mitigate model forgetting. Our empirical experiments demonstrate that NSE significantly outperforms current parameter-modifying model editing methods, marking a substantial advancement in the field of sequential model editing. Our code is released on this https URL.
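
The neuron-selection idea in this entry can be sketched in a few lines. The abstract states only that neurons are selected "based on their activation values"; ranking by absolute activation and keeping a fixed top-k, as below, is an assumed criterion for illustration and may differ from what NSE actually does.

```python
def select_neurons(activations: list[float], k: int) -> list[int]:
    """Return the indices of the k neurons with the largest absolute activation.

    Assumed criterion: only these neurons would be edited in the current round,
    leaving the remaining weights untouched to limit forgetting.
    """
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Hypothetical activations of one layer for the fact being edited.
layer_activations = [0.1, -2.0, 0.5, 1.5]
chosen = select_neurons(layer_activations, k=2)
print(chosen)  # prints [1, 3]
```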

[NLP-151] SyllableLM: Learning Coarse Semantic Units for Speech Language Models

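[Quick read]: The paper addresses the large number of tokens that speech language models need because of the high resolution of speech data. The key to the solution is a controllable self-supervised technique that extracts noisy boundaries by analyzing correlations in pretrained encoder losses and then iteratively improves model representations with a novel distillation technique, merging speech representations into coarse, syllable-like units while preserving semantic information. The method produces controllable-rate semantic units at rates as low as 5Hz and 60bps, which makes speech language models markedly more efficient by reducing training compute and speeding up inference.

Link: https://arxiv.org/abs/2410.04029
Authors: Alan Baade, Puyuan Peng, David Harwath
Keywords: require tokenized inputs, models require tokenized, Language models require, tokenized inputs, require tokenized
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 2 figures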

Abstract:Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
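
The rate figures in this entry invite a quick back-of-the-envelope check. The sketch assumes that bitrate equals unit rate times log2 of the codebook size, that is, a fixed codebook with no further entropy coding; the abstract does not state this, so the implied codebook size is an inference rather than a reported setting.

```python
unit_rate_hz = 5          # semantic units per second, from the abstract
bitrate_bps = 60          # bits per second, from the abstract
sample_rate_hz = 16_000   # waveform samples per second, from the abstract

# Assumption: bitrate = unit_rate * log2(codebook_size).
bits_per_unit = bitrate_bps / unit_rate_hz
implied_codebook_size = 2 ** round(bits_per_unit)

# Waveform samples summarized by a single unit at this rate.
samples_per_unit = sample_rate_hz / unit_rate_hz

print(bits_per_unit, implied_codebook_size, samples_per_unit)  # prints 12.0 4096 3200.0
```

At 5Hz each unit spans 200 ms of audio, which is roughly the duration of a syllable and fits the "syllable-like" description.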

[NLP-152] A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models EMNLP2024

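[Quick read]: The paper addresses how to make effective use of large language models (LLMs) for Chinese spelling correction (CSC). The key to the solution is to use the LLM as a pure language model: it reads the input sentence from the beginning and, at each inference step, produces a distribution over its vocabulary to decide the next token. A minimal distortion model, which uses pronunciation or shape similarities between the original and replaced characters, keeps the output faithful to the input. The paper also proposes two reward strategies for practical challenges specific to CSC. Together these significantly improve LLM performance, enabling LLMs to compete with state-of-the-art domain-general CSC models.

Link: https://arxiv.org/abs/2410.04027
Authors: Houquan Zhou, Zhenghua Li, Bo Zhang, Chen Li, Shaopeng Lai, Ji Zhang, Fei Huang, Min Zhang
Keywords: Chinese spelling correction, simple training-free prompt-free, leverage large language, previous CSC approaches, Chinese spelling
Categories: Computation and Language (cs.CL)
Comments: Accepted at Main Conference of EMNLP 2024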

Abstract:This work proposes a simple training-free prompt-free approach to leverage large language models (LLMs) for the Chinese spelling correction (CSC) task, which is totally different from all previous CSC approaches. The key idea is to use an LLM as a pure language model in a conventional manner. The LLM goes through the input sentence from the beginning, and at each inference step, produces a distribution over its vocabulary for deciding the next token, given a partial sentence. To ensure that the output sentence remains faithful to the input sentence, we design a minimal distortion model that utilizes pronunciation or shape similarities between the original and replaced characters. Furthermore, we propose two useful reward strategies to address practical challenges specific to the CSC task. Experiments on five public datasets demonstrate that our approach significantly improves LLM performance, enabling them to compete with state-of-the-art domain-general CSC models.
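
How a language model and a distortion model can be combined at a single character position is sketched below as a toy. The scoring rule (sum of the two log-probabilities), the three similarity classes, and every probability are invented for illustration; the paper's actual distortion model and reward strategies are not reproduced here.

```python
import math

# Hypothetical distortion model: log-probability of emitting `candidate`
# for an observed `original`, by similarity class.
DISTORTION_LOGP = {
    "identical": math.log(0.96),   # keeping the character is by far the most likely
    "similar": math.log(0.03),     # similar pronunciation or shape
    "unrelated": math.log(1e-6),   # anything else is heavily penalized
}


def relation(original: str, candidate: str, similar_pairs: set) -> str:
    if original == candidate:
        return "identical"
    if (original, candidate) in similar_pairs:
        return "similar"
    return "unrelated"


def pick_char(original: str, lm_logp: dict, similar_pairs: set) -> str:
    """Choose the candidate maximizing LM log-prob plus distortion log-prob."""
    def score(candidate: str) -> float:
        return lm_logp[candidate] + DISTORTION_LOGP[relation(original, candidate, similar_pairs)]
    return max(lm_logp, key=score)


# Toy case: the input contains 平 where the context strongly suggests 苹 (as in 苹果, "apple").
similar = {("平", "苹"), ("天", "田")}
lm_apple = {"平": math.log(0.001), "苹": math.log(0.9), "水": math.log(0.05)}
corrected = pick_char("平", lm_apple, similar)

# Toy case: the LM mildly prefers 田, but not by enough to justify changing 天.
lm_sky = {"天": math.log(0.3), "田": math.log(0.4)}
kept = pick_char("天", lm_sky, similar)
```

The second case shows why the distortion term matters: without it, the language model would rewrite characters that were never wrong.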

[NLP-153] Hyperbolic Fine-tuning for Large Language Models ICML2024

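[Quick read]: The paper asks whether the default Euclidean space is the most suitable choice for embedding tokens in large language models (LLMs). It finds that token frequency follows a power-law distribution and that token embeddings are highly hyperbolic, indicating a latent tree-like structure. The key to the solution is a new method, hyperbolic low-rank efficient fine-tuning (HypLoRA), which performs low-rank adaptation directly on the hyperbolic manifold, avoiding the cancellation effect caused by the exponential and logarithmic maps and thereby preserving hyperbolic modeling capabilities. Experiments show that HypLoRA significantly improves LLM performance on complex reasoning tasks, with gains of up to 13.0% on the AQuA dataset.

Link: https://arxiv.org/abs/2410.04010
Authors: Menglin Yang, Aosong Feng, Bo Xiong, Jihong Liu, Irwin King, Rex Ying
Keywords: Large language models, Large language, demonstrated remarkable performance, language models, demonstrated remarkable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: The preliminary work was accepted for the ICML 2024 LLM Cognition Workshop, and this version includes new investigations, analyses, experiments, and results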

Abstract:Large language models (LLMs) have demonstrated remarkable performance on various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for embedding tokens in LLMs. In this study, we first investigate the non-Euclidean characteristics of LLMs. Our findings reveal that token frequency follows a power-law distribution, with high-frequency tokens clustering near the origin and low-frequency tokens positioned farther away. Additionally, token embeddings exhibit a high degree of hyperbolicity, indicating a latent tree-like structure in the embedding space. Building on the observation, we propose to efficiently fine-tune LLMs in hyperbolic space to better exploit the underlying complex structures. However, we found that this fine-tuning in hyperbolic space cannot be achieved with naive application of exponential and logarithmic maps, when the embedding and weight matrices both reside in Euclidean space. To address this technical issue, we introduce a new method called hyperbolic low-rank efficient fine-tuning, HypLoRA, that performs low-rank adaptation directly on the hyperbolic manifold, avoiding the cancellation effect caused by the exponential and logarithmic maps, thus preserving the hyperbolic modeling capabilities. Through extensive experiments, we demonstrate that HypLoRA significantly enhances the performance of LLMs on reasoning tasks, particularly for complex reasoning problems. In particular, HypLoRA improves the performance in the complex AQuA dataset by up to 13.0%, showcasing its effectiveness in handling complex reasoning challenges.
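
The "cancellation effect" mentioned in this entry can be illustrated numerically. The sketch uses the standard exponential and logarithmic maps at the origin of the Poincaré ball with curvature parameter `c`; interpreting the cancellation as "the log map undoes the exp map before a Euclidean weight is applied" is an assumption about what the abstract means, and the vectors and weights are arbitrary.

```python
import math


def exp_map_origin(v: list[float], c: float = 1.0) -> list[float]:
    """Map a tangent vector at the origin onto the Poincare ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def log_map_origin(y: list[float], c: float = 1.0) -> list[float]:
    """Map a point on the Poincare ball back to the tangent space at the origin."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]


def linear(W: list[list[float]], x: list[float]) -> list[float]:
    """Plain Euclidean matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


x = [0.3, -0.2, 0.5]                       # Euclidean token embedding (arbitrary)
W = [[1.0, 2.0, 0.0], [0.5, -1.0, 1.5]]    # Euclidean weight matrix (arbitrary)

# Naive "hyperbolic" layer: lift to the ball, come back, then apply W.
naive = linear(W, log_map_origin(exp_map_origin(x)))
euclidean = linear(W, x)
max_gap = max(abs(a - b) for a, b in zip(naive, euclidean))
```

Because the two maps are inverses of each other, `naive` matches the plain Euclidean result up to floating-point error, so the detour through hyperbolic space adds nothing in this naive form.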

[NLP-154] Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation

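[Quick read]: The paper addresses shortcomings of existing automated fact-checking methods in adapting to specific domains and generating explanations: three-class datasets do not accurately reflect real-world misinformation, and existing explanations fail to address the relationship between the claim and the evidence. The key to the solution is a label-adaptive learning approach that fine-tunes a model in two steps: first to learn veracity prediction, then again to learn self-rationalization, using the same training data plus annotated explanations. This improves veracity prediction by more than ten percentage points (Macro F1) on the PubHealth and AVeriTec datasets and outperforms GPT-4. The paper also shows that few-shot fine-tuning on synthetic explanations performs comparably to the fully fine-tuned model at low cost, a promising direction for explainable fact-checking under different labeling schemes.

Link: https://arxiv.org/abs/2410.04002
Authors: Jing Yang, Anderson Rocha
Keywords: Computational methods, aid journalists, require adapting, specific domains, domains and generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted in the 16th IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS) 2024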

Abstract:Computational methods to aid journalists in the task often require adapting a model to specific domains and generating explanations. However, most automated fact-checking methods rely on three-class datasets, which do not accurately reflect real-world misinformation. Moreover, fact-checking explanations are often generated based on text summarization of evidence, failing to address the relationship between the claim and the evidence. To address these issues, we extend the self-rationalization method–typically used in natural language inference (NLI) tasks–to fact verification. We propose a label-adaptive learning approach: first, we fine-tune a model to learn veracity prediction with annotated labels (step-1 model). Then, we fine-tune the step-1 model again to learn self-rationalization, using the same data and additional annotated explanations. Our results show that our label-adaptive approach improves veracity prediction by more than ten percentage points (Macro F1) on both the PubHealth and AVeriTec datasets, outperforming the GPT-4 model. Furthermore, to address the high cost of explanation annotation, we generated 64 synthetic explanations from three large language models: GPT-4-turbo, GPT-3.5-turbo, and Llama-3-8B and few-shot fine-tune our step-1 model. The few-shot synthetic explanation fine-tuned model performed comparably to the fully fine-tuned self-rationalization model, demonstrating the potential of low-budget learning with synthetic data. Our label-adaptive self-rationalization approach presents a promising direction for future research on real-world explainable fact-checking with different labeling schemes.

[NLP-155] On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models EMNLP2024

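[Quick read]: The paper studies heteronormative biases and prejudice against interracial romantic relationships in large language models. The key to the approach is a set of controlled name-replacement experiments on the relationship prediction task. Models are less likely to predict romantic relationships for same-gender character pairs than for different-gender pairs, and for intra- and inter-racial pairs involving Asian names than for pairs with Black, Hispanic, or White names. The study further finds that gender is less discernible from the contextualized embeddings of Asian first names, and underlines the need for inclusive and equitable technology.

Link: https://arxiv.org/abs/2410.03996
Authors: Abhilasha Sancheti, Haozhe An, Rachel Rudinger
Keywords: performing controlled name-replacement, controlled name-replacement experiments, large language models, interracial romantic relationships, study the presence
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024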

Abstract:We study the presence of heteronormative biases and prejudice against interracial romantic relationships in large language models by performing controlled name-replacement experiments for the task of relationship prediction. We show that models are less likely to predict romantic relationships for (a) same-gender character pairs than different-gender pairs; and (b) intra/inter-racial character pairs involving Asian names as compared to Black, Hispanic, or White names. We examine the contextualized embeddings of first names and find that gender for Asian names is less discernible than non-Asian names. We discuss the social implications of our findings, underlining the need to prioritize the development of inclusive and equitable technology.

[NLP-156] MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

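[Quick read]: The paper presents improvements to the machine translation evaluation metric MetricX, including scoring a translation whether it is given the source segment, the reference, or both. The key to the solution is a hybrid reference-based/reference-free metric trained in two stages, first on DA ratings only and then on a mixture of MQM and DA ratings, with synthetic examples added to make the metric more robust. The result is a significant performance increase over MetricX-23 on the WMT23 MQM ratings and on a new synthetic challenge set.

Link: https://arxiv.org/abs/2410.03983
Authors: Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag
Keywords: Metrics Shared Task, Shared Task, Task and provide, Metrics Shared, version of MetricX
Categories: Computation and Language (cs.CL)
Comments: Accepted to WMT24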

Abstract:In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation, or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.

[NLP-157] Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss Function

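[Quick read]: The paper addresses class imbalance and label correlation in multi-label emotion classification for low-resource languages such as Arabic. The key to the solution is the combination of stacked embeddings, meta-learning, and a hybrid loss function. Contextual embeddings from ArabicBERT, MarBERT, and AraBERT are extracted and stacked into an enriched representation on which a meta-learner is trained. The resulting representations are fed to a Bi-LSTM model followed by a fully connected neural network for multi-label classification. A hybrid loss function that incorporates class weighting, a label correlation matrix, and contrastive learning addresses class imbalance and improves the handling of label correlations. Experiments show improved classification performance, particularly on minority classes, and a more balanced emotion classification.

Link: https://arxiv.org/abs/2410.03979
Authors: Nisar Ahmed, Muhammad Imran Zaman
Keywords: multi-label emotion classification, emotion classification, hybrid loss function, accurately predicting minority, label correlation hinder
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: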

Abstract:In multi-label emotion classification, particularly for low-resource languages like Arabic, the challenges of class imbalance and label correlation hinder model performance, especially in accurately predicting minority emotions. To address these issues, this study proposes a novel approach that combines stacked embeddings, meta-learning, and a hybrid loss function to enhance multi-label emotion classification for the Arabic language. The study extracts contextual embeddings from three fine-tuned language models (ArabicBERT, MarBERT, and AraBERT), which are then stacked to form enriched embeddings. A meta-learner is trained on these stacked embeddings, and the resulting concatenated representations are provided as input to a Bi-LSTM model, followed by a fully connected neural network for multi-label classification. To further improve performance, a hybrid loss function is introduced, incorporating class weighting, label correlation matrix, and contrastive learning, effectively addressing class imbalances and improving the handling of label correlations. Extensive experiments validate the proposed model’s performance across key metrics such as Precision, Recall, F1-Score, Jaccard Accuracy, and Hamming Loss. The class-wise performance analysis demonstrates the hybrid loss function’s ability to significantly reduce disparities between majority and minority classes, resulting in a more balanced emotion classification. An ablation study highlights the contribution of each component, showing the superiority of the model compared to baseline approaches and other loss functions. This study not only advances multi-label emotion classification for Arabic but also presents a generalizable framework that can be adapted to other languages and domains, providing a significant step forward in addressing the challenges of low-resource emotion classification tasks.

[NLP-158] Variational Language Concepts for Interpreting Foundation Language Models EMNLP2024

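[Quick read]: The paper addresses the interpretability of foundation language models (FLMs) such as BERT and its variants. Existing interpretations based on attention weights are word-level only and fail to capture higher-level structures, so they lack readability and intuitiveness. The key to the solution is a variational Bayesian framework, VAriational Language Concept (VALC), that goes beyond word-level interpretations to provide concept-level interpretations. Theoretical analysis shows that VALC finds the optimal language concepts to interpret FLM predictions, and experiments on several real-world datasets show that it successfully provides conceptual interpretation for FLMs.

Link: https://arxiv.org/abs/2410.03964
Authors: Hengyi Wang, Shiwei Tan, Zhiqing Hong, Desheng Zhang, Hao Wang
Keywords: Foundation Language Models, achieved remarkable success, natural language processing, Foundation Language, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted at EMNLP 2024 findings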

Abstract:Foundation Language Models (FLMs) such as BERT and its variants have achieved remarkable success in natural language processing. To date, the interpretability of FLMs has primarily relied on the attention weights in their self-attention layers. However, these attention weights only provide word-level interpretations, failing to capture higher-level structures, and are therefore lacking in readability and intuitiveness. To address this challenge, we first provide a formal definition of conceptual interpretation and then propose a variational Bayesian framework, dubbed VAriational Language Concept (VALC), to go beyond word-level interpretations and provide concept-level interpretations. Our theoretical analysis shows that our VALC finds the optimal language concepts to interpret FLM predictions. Empirical results on several real-world datasets show that our method can successfully provide conceptual interpretation for FLMs.

[NLP-159] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

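[Quick read]: The paper addresses the high prefill cost and increased response latency of large language model (LLM) inference when prompts are far longer than generations. The key to the solution, SwiftKV, combines three mechanisms: SingleInputKV, which prefills later layers' KV cache using an earlier layer's output and thereby reduces computation on prompt tokens; AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch sizes for higher throughput; and a knowledge-preserving distillation procedure that adapts existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirements. Together they significantly reduce the compute required for prefill and the memory required for the KV cache while preserving the quality of generated tokens.

Link: https://arxiv.org/abs/2410.03960
Authors: Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He
Keywords: typically observes orders, longer prompt lengths, magnitude longer prompt, generation lengths, enterprise use cases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: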

Abstract:LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers’ KV cache using a much earlier layer’s output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs.
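
The two throughput figures quoted at the end of this entry can be related with simple arithmetic. The sketch assumes the common rule of thumb of about 2 FLOPs per parameter per token for a dense forward pass and rounds the model to 70 billion parameters; the abstract does not spell out how "normalized" throughput is computed, so this is a plausibility check rather than the authors' formula.

```python
# Assumption: a dense decoder forward pass costs about 2 FLOPs per parameter per token.
FLOPS_PER_PARAM_PER_TOKEN = 2

params = 70e9               # Llama-3.1-70B, rounded to 70 billion parameters
tokens_per_second = 16_000  # "16K tokens/s" from the abstract
gpus = 4                    # 4x H100 from the abstract

total_flops_per_second = tokens_per_second * FLOPS_PER_PARAM_PER_TOKEN * params
tflops_per_gpu = total_flops_per_second / gpus / 1e12

print(f"{tflops_per_gpu:.0f} TFLOPs per GPU")  # prints 560 TFLOPs per GPU
```

Under this reading, "normalized" suggests the figure counts the FLOPs the unmodified model would have needed for the same tokens, which is how skipping prefill computation can show up as a higher effective rate.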

[NLP-160] Grounding Language in Multi-Perspective Referential Communication EMNLP2024

[Quick read]: The paper addresses how agents in multi-agent embodied environments generate and comprehend referring expressions. The key to the solution is training an open-weight speaker model with evidence of communicative success when paired with a listener. This raises communicative success from 58.9% to 69.3% and outperforms the strongest proprietary model.

Link: https://arxiv.org/abs/2410.03959
Authors: Zineng Tang, Lingjun Mao, Alane Suhr
Keywords: multi-agent embodied environments, embodied environments, multi-agent embodied, referring expression generation, human-written referring expressions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to EMNLP2024 Main

Abstract:We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another’s visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9 to 69.3% in communicative success and even outperforming the strongest proprietary model.
摘要:我们介绍了一项在多智能体具身环境中用于指代表达生成与理解的任务和数据集。在该任务中,共享场景中的两个智能体必须考虑彼此的视觉视角,这些视角可能与自身不同,以便生成和理解场景中对象及其空间关系的指代。我们收集了一个包含 2,970 条人类编写的指代表达的数据集,每条表达都配有人类理解判断,并评估了自动化模型作为说话者和听众与人类伙伴配对时的表现,发现模型在指代生成和理解两方面的表现均落后于人类智能体对。最后,我们尝试训练一个开放权重说话者模型,该模型在与听众配对时表现出沟通成功的证据,结果沟通成功率从 58.9% 提升至 69.3%,甚至超过了最强的专有模型。

[NLP-161] LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

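[Quick read]: This paper addresses how to combine multiple large language models (LLMs) to improve overall performance. The key to the solution is LLM-TOPLA, a method with three distinctive properties: (i) a focal diversity metric that captures the correlation between diversity and performance among the component LLMs; (ii) a diversity-optimized ensemble pruning algorithm that selects the top-k sub-ensembles from a pool of N base LLMs, where the sub-ensemble size S is often much smaller than N; and (iii) a learn-to-ensemble approach that generates a new output for each prompt query by detecting and resolving inconsistencies among the outputs of all component LLMs. Experiments on four benchmarks show that LLM-TOPLA outperforms the best existing LLM ensemble methods.

Link: https://arxiv.org/abs/2410.03953
Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Ling Liu
Keywords: Combining large language, large language models, Combining large, shown substantial performance, component LLMs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: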

Abstract: Combining large language models during training or at inference time has shown substantial performance gain over component LLMs. This paper presents LLM-TOPLA, a diversity-optimized LLM ensemble method with three unique properties: (i) We introduce the focal diversity metric to capture the diversity-performance correlation among component LLMs of an ensemble. (ii) We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of N base LLMs. Our pruning method recommends top-performing LLM subensembles of size S, often much smaller than N. (iii) We generate new output for each prompt query by utilizing a learn-to-ensemble approach, which learns to detect and resolve the output inconsistency among all component LLMs of an ensemble. Extensive evaluation on four different benchmarks shows good performance gain over the best LLM ensemble methods: (i) In constrained solution set problems, LLM-TOPLA outperforms the best-performing ensemble (Mixtral) by 2.2% in accuracy on MMLU and the best-performing LLM ensemble (MoreAgent) on GSM8k by 2.1%. (ii) In generative tasks, LLM-TOPLA outperforms the top-2 performers (Llama70b/Mixtral) on SearchQA by 3.9x in F1, and on XSum by more than 38 in ROUGE-1. Our code and dataset, which contains outputs of 8 modern LLMs on 4 benchmarks is available at this https URL
摘要:在训练或推理阶段结合大语言模型(LLM)已显示出显著的性能提升。本文介绍了 LLM-TOPLA,这是一种多样性优化的 LLM 集成方法,具有三个独特特性:(i)我们引入了焦点多样性指标,以捕捉集成中各组件 LLM 之间的多样性与性能相关性。(ii)我们开发了一种多样性优化的集成剪枝算法,从 N 个基础 LLM 池中选择前 k 个子集成。我们的剪枝方法推荐了性能最佳的 LLM 子集成,其规模 S 通常远小于 N。(iii)我们通过采用一种学习集成的方法,为每个提示查询生成新的输出,该方法能够检测并解决集成中所有组件 LLM 之间的输出不一致性。在四个不同基准上的广泛评估显示,LLM-TOPLA 在性能上优于最佳的 LLM 集成方法:(i)在受限解集问题中,LLM-TOPLA 在 MMLU 上的准确率比表现最佳的集成(Mixtral)高出 2.2%,在 GSM8k 上比表现最佳的 LLM 集成(MoreAgent)高出 2.1%。(ii)在生成任务中,LLM-TOPLA 在 SearchQA 上的 F1 值比前两名(Llama70b/Mixtral)高出 3.9 倍,在 XSum 上的 ROUGE-1 值高出 38 以上。我们的代码和数据集(包含 8 个现代 LLM 在 4 个基准上的输出)可在以下链接获取:https URL。
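
The pruning step in (ii), choosing the top-k sub-ensembles of size S from N base models by diversity, can be illustrated with a small sketch. The abstract does not define the focal diversity metric, so the code below substitutes mean pairwise disagreement as a stand-in; the model names, predictions, and function names are all hypothetical.

```python
from itertools import combinations

# Stand-in diversity score: mean pairwise disagreement rate.
# This is NOT the paper's focal diversity metric, only an illustrative proxy.

def pairwise_disagreement(a, b):
    """Fraction of examples on which two models give different answers."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def subset_diversity(preds, subset):
    """Average disagreement over all model pairs in the subset (size >= 2)."""
    pairs = list(combinations(subset, 2))
    return sum(pairwise_disagreement(preds[i], preds[j]) for i, j in pairs) / len(pairs)

def top_k_subensembles(preds, size, k):
    """Score every sub-ensemble of the given size and return the k most diverse."""
    scored = [(subset_diversity(preds, s), s) for s in combinations(sorted(preds), size)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest diversity first, ties by name
    return scored[:k]

# Hypothetical predictions of four base models on four questions.
preds = {
    "model_a": [1, 0, 1, 1],
    "model_b": [1, 0, 1, 1],
    "model_c": [0, 1, 1, 0],
    "model_d": [1, 1, 0, 1],
}
best = top_k_subensembles(preds, size=2, k=2)
print(best)
```

Exhaustive enumeration is only practical for small N; a diversity-only score also ignores accuracy, which the paper's metric is described as correlating with.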

[NLP-162] Structured List-Grounded Question Answering

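[Quick read]: This paper addresses how document-grounded dialogue systems can make better use of the semantic relations in structured lists to improve question answering. The key to the solution is the LIST2QA dataset together with the Intermediate Steps for Lists (ISL) method, which aligns list items with the user's background to mirror how humans interpret a list before generating a response. Experiments show that models using ISL outperform baselines on ROUGE-L, correctness, faithfulness, and completeness.

Link: https://arxiv.org/abs/2410.03950
Authors: Mujeen Sung, Song Feng, James Gung, Raphael Shu, Yi Zhang, Saab Mansour
Keywords: Document-grounded dialogue systems, Document-grounded dialogue, leveraging external information, answer user queries, dialogue systems aim
Categories: Computation and Language (cs.CL)
Comments: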

Abstract:Document-grounded dialogue systems aim to answer user queries by leveraging external information. Previous studies have mainly focused on handling free-form documents, often overlooking structured data such as lists, which can represent a range of nuanced semantic relations. Motivated by the observation that even advanced language models like GPT-3.5 often miss semantic cues from lists, this paper aims to enhance question answering (QA) systems for better interpretation and use of structured lists. To this end, we introduce the LIST2QA dataset, a novel benchmark to evaluate the ability of QA systems to respond effectively using list information. This dataset is created from unlabeled customer service documents using language models and model-based filtering processes to enhance data quality, and can be used to fine-tune and evaluate QA models. Apart from directly generating responses through fine-tuned models, we further explore the explicit use of Intermediate Steps for Lists (ISL), aligning list items with user backgrounds to better reflect how humans interpret list items before generating responses. Our experimental results demonstrate that models trained on LIST2QA with our ISL approach outperform baselines across various metrics. Specifically, our fine-tuned Flan-T5-XL model shows increases of 3.1% in ROUGE-L, 4.6% in correctness, 4.5% in faithfulness, and 20.6% in completeness compared to models without applying filtering and the proposed ISL method.
摘要:基于文档的对话系统旨在通过利用外部信息来回答用户查询。以往的研究主要集中在处理自由形式的文档上,往往忽略了结构化数据(如列表),这些数据可以表示各种细微的语义关系。受观察到即使是像 GPT-3.5 这样的高级语言模型也常常错过列表中的语义线索的启发,本文旨在增强问答 (QA) 系统,以更好地解释和利用结构化列表。为此,我们引入了 LIST2QA 数据集,这是一个新颖的基准,用于评估 QA 系统有效使用列表信息的能力。该数据集通过使用语言模型和基于模型的过滤过程从无标签的客户服务文档中创建,以提高数据质量,并可用于微调和评估 QA 模型。除了通过微调模型直接生成响应外,我们还进一步探索了列表中间步骤 (ISL) 的显式使用,将列表项与用户背景对齐,以更好地反映人类在生成响应前如何解释列表项。我们的实验结果表明,使用 LIST2QA 数据集并结合我们的 ISL 方法训练的模型在各种指标上优于基线模型。具体而言,我们的微调 Flan-T5-XL 模型在 ROUGE-L 上提高了 3.1%,在正确性上提高了 4.6%,在忠实度上提高了 4.5%,在完整性上提高了 20.6%,相较于未应用过滤和 ISL 方法的模型。

[NLP-163] Reverb: Open-Source ASR and Diarization from Rev

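[Quick read]: This paper aims to advance research and innovation in voice technology by open-sourcing Rev's core speech recognition and diarization models. The key to the solution is releasing a full production pipeline for developers together with pared-down research models for experimentation. The released speech recognition models outperform all existing open-source speech recognition models across a variety of long-form speech recognition domains.

Link: https://arxiv.org/abs/2410.03930
Authors: Nishchal Bhandari, Danny Chen, Miguel Ángel del Río Fernández, Natalie Delworth, Jennifer Drexler Fox, Migüel Jetté, Quinten McNamara, Corey Miller, Ondřej Novotný, Ján Profant, Nan Qin, Martin Ratajczak, Jean-Philippe Robichaud
Keywords: open-sourcing our core, speech recognition, core speech recognition, speech recognition models, recognition
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: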

Abstract:Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open source speech recognition models across a variety of long-form speech recognition domains.
摘要:今天,我们开放了核心语音识别和话者分离模型,供非商业用途使用。我们不仅发布了一个完整的生产流水线供开发者使用,还提供了简化后的研究模型供实验使用。Rev 希望这些发布能够推动语音技术这一快速发展领域的研究和创新。今天发布的语音识别模型在各种长篇语音识别领域中,均优于所有现有的开源语音识别模型。

[NLP-164] C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits EMNLP2024

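[Quick read]: This paper addresses the limitations of existing privacy policy analysis tools in identifying and fixing compliance issues, particularly in light of landmark privacy regulations such as the EU's GDPR and California's CCPA. The key to the solution is C3PA, the first open, regulation-aware dataset of expert-annotated privacy policies. It contains over 48,000 expert-labeled privacy policy text segments associated with responses to CCPA-specific disclosure mandates from 411 organizations. The dataset is particularly well suited to supporting automated audits of compliance with CCPA-related disclosure mandates.

Link: https://arxiv.org/abs/2410.03925
Authors: Maaz Bin Musa, Steven M. Winston, Garrison Allen, Jacob Schiller, Kevin Moore, Sean Quick, Johnathan Melvin, Padmini Srinivasan, Mihailis E. Diamantis, Rishab Nithyanand
Keywords: scalable regulatory compliance, extract organizations data, organizations data habits, techniques to analyze, analyze and extract
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, EMNLP 2024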

Abstract: The development of tools and techniques to analyze and extract organizations' data habits from privacy policies are critical for scalable regulatory compliance audits. Unfortunately, these tools are becoming increasingly limited in their ability to identify compliance issues and fixes. After all, most were developed using regulation-agnostic datasets of annotated privacy policies obtained from a time before the introduction of landmark privacy regulations such as EU's GDPR and California's CCPA. In this paper, we describe the first open regulation-aware dataset of expert-annotated privacy policies, C3PA (CCPA Privacy Policy Provision Annotations), aimed to address this challenge. C3PA contains over 48K expert-labeled privacy policy text segments associated with responses to CCPA-specific disclosure mandates from 411 unique organizations. We demonstrate that the C3PA dataset is uniquely suited for aiding automated audits of compliance with CCPA-related disclosure mandates.
摘要:开发分析和提取隐私政策中组织数据习惯的工具和技术对于可扩展的监管合规审计至关重要。然而,这些工具在识别合规问题和修复措施方面的能力正变得越来越有限。毕竟,大多数工具是使用在欧盟通用数据保护条例 (GDPR) 和加利福尼亚消费者隐私法案 (CCPA) 等重要隐私法规出台之前获得的、与法规无关的注释隐私政策数据集开发的。在本文中,我们介绍了首个开放的、与法规相关的专家注释隐私政策数据集,即 C3PA (CCPA 隐私政策条款注释),旨在应对这一挑战。C3PA 包含超过 48,000 个专家标记的隐私政策文本片段,这些片段与 411 个独特组织的 CCPA 特定披露要求响应相关联。我们证明,C3PA 数据集特别适合于辅助自动化审计 CCPA 相关披露要求的合规性。

[NLP-165] Question-Answering System for Bangla: Fine-tuning BERT-Bangla for a Closed Domain

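[Quick read]: This paper addresses the development of a domain-specific question-answering system for Bengali. The key to the solution is a fine-tuned BERT-Bangla model. Data was drawn from the Khulna University of Engineering & Technology (KUET) website and other relevant texts, and 2,500 question-answer pairs generated from it were used for training and evaluation in a closed domain. The system scores 55.26% Exact Match (EM) and 74.21% F1. The results suggest the approach is promising for domain-specific Bengali question answering, although further refinement is needed for more complex queries.

Link: https://arxiv.org/abs/2410.03923
Authors: Subal Chandra Roy, Md Motaleb Hossen Manik
Keywords: fine-tuned BERT-Bangla model, Bengali question-answering systems, fine-tuned BERT-Bangla, domain-specific Bengali question-answering, Question-answering systems
Categories: Computation and Language (cs.CL)
Comments: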

Abstract: Question-answering systems for Bengali have seen limited development, particularly in domain-specific applications. Leveraging advancements in natural language processing, this paper explores a fine-tuned BERT-Bangla model to address this gap. It presents the development of a question-answering system for Bengali using a fine-tuned BERT-Bangla model in a closed domain. The dataset was sourced from Khulna University of Engineering & Technology’s (KUET) website and other relevant texts. The system was trained and evaluated with 2500 question-answer pairs generated from curated data. Key metrics, including the Exact Match (EM) score and F1 score, were used for evaluation, achieving scores of 55.26% and 74.21%, respectively. The results demonstrate promising potential for domain-specific Bengali question-answering systems. Further refinements are needed to improve performance for more complex queries.
摘要:孟加拉语的问答系统发展有限,特别是在特定领域应用方面。本文利用自然语言处理的进展,探讨了一种经过微调的 BERT-Bangla 模型来填补这一空白。本文介绍了在一个封闭领域中使用经过微调的 BERT-Bangla 模型开发的孟加拉语问答系统。数据集来源于 Khulna 科技大学 (KUET) 的网站和其他相关文本。该系统使用从精选数据中生成的 2500 个问答对进行训练和评估。关键指标包括精确匹配 (EM) 分数和 F1 分数,分别达到了 55.26% 和 74.21%。结果显示,特定领域的孟加拉语问答系统具有良好的潜力。为了提高对更复杂查询的性能,还需要进一步的改进。
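
The gap between the two reported scores (55.26% EM versus 74.21% F1) is easier to read with the metrics spelled out. The sketch below follows the usual extractive-QA definitions of Exact Match and token-level F1, simplified to lowercasing and whitespace tokenization; the paper's exact normalization is not given in the abstract, so treat these functions as an assumption.

```python
from collections import Counter

# Simplified extractive-QA metrics. Assumption: normalization is just
# lowercasing plus whitespace collapsing; the paper may normalize differently.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

pred, gold = "Khulna University", "Khulna University of Engineering"
print(exact_match(pred, gold), round(token_f1(pred, gold), 4))
```

A partially correct span earns no EM credit but some F1 credit, which is why F1 is typically the higher of the two numbers.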

[NLP-166] Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis

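[Quick read]: This paper addresses accurate classification of depression-anxiety comorbidity from social media posts. The key to the solution is ANGST, a new benchmark dataset containing 2,876 posts annotated by expert psychologists plus 7,667 silver-labeled posts. It supports multi-label classification, so each post can be marked as indicating depression, anxiety, or both. Benchmarking state-of-the-art language models from Mental-BERT to GPT-4 on ANGST shows both their capabilities and their limits in complex diagnostic scenarios: GPT-4 generally performs best, but no model exceeds an F1 score of 72% in multi-class comorbid classification, underscoring the continuing difficulty of applying language models to mental health diagnostics.

Link: https://arxiv.org/abs/2410.03908
Authors: Amey Hengle, Atharva Kulkarni, Shantanu Patankar, Madhumitha Chandrasekaran, Sneha D’Silva, Jemima Jacob, Rashmi Gupta
Keywords: depression-anxiety comorbidity classification, social media posts, depression-anxiety comorbidity, social media, ANGST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 Pages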

Abstract:In this study, we introduce ANGST, a novel, first-of-its kind benchmark for depression-anxiety comorbidity classification from social media posts. Unlike contemporary datasets that often oversimplify the intricate interplay between different mental health disorders by treating them as isolated conditions, ANGST enables multi-label classification, allowing each post to be simultaneously identified as indicating depression and/or anxiety. Comprising 2876 meticulously annotated posts by expert psychologists and an additional 7667 silver-labeled posts, ANGST posits a more representative sample of online mental health discourse. Moreover, we benchmark ANGST using various state-of-the-art language models, ranging from Mental-BERT to GPT-4. Our results provide significant insights into the capabilities and limitations of these models in complex diagnostic scenarios. While GPT-4 generally outperforms other models, none achieve an F1 score exceeding 72% in multi-class comorbid classification, underscoring the ongoing challenges in applying language models to mental health diagnostics.
摘要:在本研究中,我们介绍了 ANGST,这是一个首创的、用于从社交媒体帖子中分类抑郁-焦虑共病的新型基准。与当代数据集通常通过将不同心理健康障碍视为孤立条件来过度简化其复杂相互作用不同,ANGST 支持多标签分类,允许每个帖子同时被识别为指示抑郁和/或焦虑。ANGST 由 2876 条由专家心理学家精心标注的帖子以及另外 7667 条银标签帖子组成,提供了更具代表性的在线心理健康讨论样本。此外,我们使用多种最先进的语言模型(从 Mental-BERT 到 GPT-4)对 ANGST 进行了基准测试。我们的结果为这些模型在复杂诊断场景中的能力和局限性提供了重要见解。尽管 GPT-4 总体上优于其他模型,但在多类别共病分类中,没有任何模型的 F1 分数超过 72%,这突显了将语言模型应用于心理健康诊断的持续挑战。
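
The relationship between the multi-label annotation (depression and/or anxiety) and the multi-class comorbid evaluation can be sketched as follows. The four class names and the use of macro-averaged F1 are assumptions made for illustration; the abstract does not say how the reported F1 is averaged.

```python
# Map two binary labels to one of four classes, then score with macro F1.
# Class names and macro averaging are illustrative assumptions.

def to_class(depression: bool, anxiety: bool) -> str:
    if depression and anxiety:
        return "comorbid"
    if depression:
        return "depression"
    if anxiety:
        return "anxiety"
    return "control"

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = [to_class(True, True), to_class(True, False), to_class(False, True), to_class(False, False)]
pred = ["comorbid", "depression", "depression", "control"]
print(round(macro_f1(gold, pred), 4))
```

In this toy case a single anxiety post mislabeled as depression zeroes out the anxiety class and lowers the depression class, so one error moves the macro score substantially.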

[NLP-167] ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities EMNLP2024

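[Quick read]: This paper addresses how vision language models (VLMs) perform at procedural planning with multi-modal task inputs, and in particular how well they reason in both normal and counterfactual task situations. The key to the solution is ActPlan-1K, a multi-modal planning benchmark built on ChatGPT and the household activity simulator iGibson2. It comprises 153 activities and 1,187 instances, each with a natural language task description, multiple environment images, and a gold plan for reference. The evaluation finds that current VLMs still struggle to generate human-level procedural plans. The paper also provides automatic evaluation metrics, obtained by fine-tuning a BLEURT model, to support future research on the benchmark.

Link: https://arxiv.org/abs/2410.03907
Authors: Ying Su, Zhan Ling, Haochen Shi, Jiayang Cheng, Yauwai Yim, Yangqiu Song
Keywords: Large language models, Large language, process textual task, adopted to process, process textual
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 9 figures, 8 tables, accepted to EMNLP 2024 main conference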

Abstract: Large language models (LLMs) have been adopted to process textual task description and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning that evaluates the model’s reasoning ability over alternative task situations are also under exploited. In order to evaluate the planning ability of both multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describing one activity has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is action sequences over the objects in provided scenes. Both the correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs are still struggling at generating human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by finetuning over BLEURT model to facilitate future research on our benchmark.
摘要:大语言模型 (LLMs) 因其强大的推理能力,已被应用于处理文本任务描述并完成具身 AI 任务中的程序性规划。然而,关于视觉语言模型 (VLMs) 在考虑多模态任务输入时的表现研究仍显不足。对模型在替代任务情境下的推理能力进行评估的反事实规划也未得到充分开发。为了评估多模态和反事实方面的规划能力,我们提出了 ActPlan-1K。ActPlan-1K 是一个基于 ChatGPT 和家庭活动模拟器 iGibson2 构建的多模态规划基准。该基准包含 153 项活动和 1,187 个实例。每个实例描述一项活动,包含自然语言任务描述和来自模拟器的多个环境图像。每个实例的金标准计划是提供场景中对象的动作序列。典型的 VLMs 在正确性和常识满足度方面进行了评估。结果表明,当前的 VLMs 在生成正常活动和反事实活动的人类级别程序性计划方面仍面临挑战。我们进一步通过在 BLEURT 模型上进行微调,提供了自动评估指标,以促进未来对我们基准的研究。

[NLP-168] PersonalSum: A User-Subjective Guided Personalized Summarization Dataset for Large Language Models NEURIPS2024

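[Quick read]: This paper addresses personalized summarization, that is, whether the generic summaries produced by existing large language models (LLMs) meet the individual needs of ordinary users. The key to the solution is PersonalSum, a high-quality, personalized, manually annotated summarization dataset and the first to examine whether the focus of public readers differs from the generic summaries generated by LLMs. It includes user profiles, personalized summaries with their source sentences, and machine-generated generic summaries with their sources. Using it, the authors study how personal signals such as entities/topics, plot, and article structure affect personalized summary generation. Preliminary results indicate that entities/topics are only one of the key factors behind users' diverse preferences, and that personalized summarization remains a significant challenge for existing LLMs.

Link: https://arxiv.org/abs/2410.03905
Authors: Lemei Zhang, Peng Liu, Marcus Tiedemann Oekland Henriksboe, Even W. Lauvrak, Jon Atle Gulla, Heri Ramampiaro
Keywords: Large Language Models, Natural Language Processing, Language Models, Natural Language, Language Processing
Categories: Computation and Language (cs.CL)
Comments: Accepted at NeurIPS 2024 Track on Datasets and Benchmarks. Code available at this https URL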

Abstract:With the rapid advancement of Natural Language Processing in recent years, numerous studies have shown that generic summaries generated by Large Language Models (LLMs) can sometimes surpass those annotated by experts, such as journalists, according to human evaluations. However, there is limited research on whether these generic summaries meet the individual needs of ordinary people. The biggest obstacle is the lack of human-annotated datasets from the general public. Existing work on personalized summarization often relies on pseudo datasets created from generic summarization datasets or controllable tasks that focus on specific named entities or other aspects, such as the length and specificity of generated summaries, collected from hypothetical tasks without the annotators’ initiative. To bridge this gap, we propose a high-quality, personalized, manually annotated abstractive summarization dataset called PersonalSum. This dataset is the first to investigate whether the focus of public readers differs from the generic summaries generated by LLMs. It includes user profiles, personalized summaries accompanied by source sentences from given articles, and machine-generated generic summaries along with their sources. We investigate several personal signals - entities/topics, plot, and structure of articles - that may affect the generation of personalized summaries using LLMs in a few-shot in-context learning scenario. Our preliminary results and analysis indicate that entities/topics are merely one of the key factors that impact the diverse preferences of users, and personalized summarization remains a significant challenge for existing LLMs.
摘要:近年来,自然语言处理 (Natural Language Processing) 的快速发展使得大量研究表明,大语言模型 (Large Language Models, LLMs) 生成的通用摘要有时在人类评估中能够超越专家(如记者)标注的摘要。然而,关于这些通用摘要是否满足普通人的个性化需求的研究却相对有限。最大的障碍在于缺乏来自普通公众的人工标注数据集。现有的个性化摘要工作通常依赖于从通用摘要数据集创建的伪数据集,或专注于特定命名实体或其他方面(如生成摘要的长度和具体性)的可控任务,这些任务通常在没有标注者主动性的假设任务中收集。为了填补这一空白,我们提出了一种高质量、个性化、人工标注的抽象摘要数据集,称为 PersonalSum。该数据集首次研究了公众读者的关注点是否与 LLMs 生成的通用摘要有所不同。它包括用户档案、个性化摘要及其来源句子的文章,以及机器生成的通用摘要及其来源。我们研究了几种可能影响个性化摘要生成的个人信号——实体/主题、情节和文章结构——在少样本上下文学习 (few-shot in-context learning) 场景中使用 LLMs 生成个性化摘要。我们的初步结果和分析表明,实体/主题仅仅是影响用户多样化偏好的关键因素之一,而个性化摘要对现有的 LLMs 仍然是一个重大挑战。

[NLP-169] KidLM: Advancing Language Models for Children – Early Insights and Future Directions EMNLP2024

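[Quick read]: This paper addresses how to ensure that language models used in educational tools for children respect child-specific properties such as linguistic nuances, cognitive needs, and safety standards. The key to the solution is a user-centric data collection pipeline that gathers and validates a corpus written for, and sometimes by, children, together with a new training objective, Stratified Masking. Stratified Masking dynamically adjusts masking probabilities based on the domain-specific child language data so that the model prioritizes vocabulary and concepts better suited to children. With this approach the model performs well at understanding lower grade-level text, maintaining safety, and capturing children's distinctive preferences.

Link: https://arxiv.org/abs/2410.03884
Authors: Mir Tafseer Nayeem, Davood Rafiei
Keywords: Recent studies highlight, creating educational tools, significant challenges remain, Recent studies, maintaining key child-specific
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Accepted to EMNLP 2024 (long, main)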

Abstract:Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children’s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.
摘要:近期研究表明,大语言模型在为儿童创建教育工具方面具有巨大潜力,然而在保持儿童特有的语言细微差别、认知需求和安全标准等关键属性方面仍存在显著挑战。本文探讨了开发面向儿童的语言模型的基础步骤,强调了高质量预训练数据的必要性。我们引入了一种新颖的用户中心数据收集流程,该流程涉及收集和验证专门为儿童编写(有时由儿童编写)的语料库。此外,我们提出了一种新的训练目标——分层掩码 (Stratified Masking),该目标根据我们特定领域的儿童语言数据动态调整掩码概率,使模型能够优先考虑更适合儿童的词汇和概念。实验评估表明,我们的模型在理解低年级文本方面表现出色,通过避免刻板印象保持了安全性,并捕捉到了儿童独特的偏好。此外,我们为未来在儿童专用语言建模领域的研究和开发提供了可操作的见解。
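
The idea behind Stratified Masking, giving different token groups different masking probabilities, can be sketched in a few lines. The abstract does not specify the strata or the rates, so the three strata, the word lists, and the probabilities below are purely hypothetical placeholders.

```python
import random

# Hypothetical strata and masking rates; the paper's actual strata and
# probabilities are not given in the abstract.
MASK_RATES = {"stopword": 0.15, "kid_vocab": 0.25, "other": 0.20}
STOPWORDS = {"the", "a", "is", "and", "to"}
KID_VOCAB = {"puppy", "play", "school"}

def stratum(token: str) -> str:
    """Assign a token to one of the hypothetical strata."""
    t = token.lower()
    if t in STOPWORDS:
        return "stopword"
    if t in KID_VOCAB:
        return "kid_vocab"
    return "other"

def mask_probabilities(tokens):
    """Per-token masking probability determined by the token's stratum."""
    return [MASK_RATES[stratum(t)] for t in tokens]

def apply_mask(tokens, rng, mask_token="[MASK]"):
    """Independently mask each token with its stratum-specific probability."""
    probs = mask_probabilities(tokens)
    return [mask_token if rng.random() < p else t for t, p in zip(tokens, probs)]

tokens = ["The", "puppy", "runs", "to", "school"]
print(mask_probabilities(tokens))
print(apply_mask(tokens, random.Random(0)))
```

Compared with uniform masking, raising the rate on the child-oriented stratum makes the model predict those words more often during pre-training, which is the prioritization effect the abstract describes.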

[NLP-170] From Pixels to Personas: Investigating and Modeling Self-Anthropomorphism in Human-Robot Dialogues EMNLP2024

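[Quick read]: This paper addresses self-anthropomorphic expression in robot dialogue systems: how to systematically analyze and distinguish self-anthropomorphic from non-self-anthropomorphic responses, and how such expression could be adjusted dynamically to match ethical standards and user expectations. The key to the solution is Pix2Persona, a new dataset that pairs each original bot response with a self-anthropomorphic and a non-self-anthropomorphic version, enabling contrastive study of the two response types. The dataset brings out a previously under-explored category of bot responses and lays the groundwork for future research on dynamically adjusting the level of self-anthropomorphism in AI systems.

Link: https://arxiv.org/abs/2410.03870
Authors: Yu Li, Devamanyu Hazarika, Di Jin, Julia Hirschberg, Yang Liu
Keywords: preferences and emotions, robots manifests, display of human-like, human-like characteristics, expressing preferences
Categories: Computation and Language (cs.CL)
Comments: Findings of EMNLP 2024, 19 pages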

Abstract:Self-anthropomorphism in robots manifests itself through their display of human-like characteristics in dialogue, such as expressing preferences and emotions. Our study systematically analyzes self-anthropomorphic expression within various dialogue datasets, outlining the contrasts between self-anthropomorphic and non-self-anthropomorphic responses in dialogue systems. We show significant differences in these two types of responses and propose transitioning from one type to the other. We also introduce Pix2Persona, a novel dataset aimed at developing ethical and engaging AI systems in various embodiments. This dataset preserves the original dialogues from existing corpora and enhances them with paired responses: self-anthropomorphic and non-self-anthropomorphic for each original bot response. Our work not only uncovers a new category of bot responses that were previously under-explored but also lays the groundwork for future studies about dynamically adjusting self-anthropomorphism levels in AI systems to align with ethical standards and user expectations.
摘要:机器人的自我拟人化通过在对话中展示类似人类的特征,例如表达偏好和情感,得以体现。我们的研究系统地分析了各种对话数据集中的自我拟人化表达,概述了自我拟人化与非自我拟人化响应在对话系统中的对比。我们展示了这两类响应之间的显著差异,并提出了从一种类型过渡到另一种类型的方法。此外,我们还引入了 Pix2Persona,这是一个旨在开发各种形式中符合伦理且引人入胜的 AI 系统的新型数据集。该数据集保留了现有语料库中的原始对话,并通过配对响应(自我拟人化和非自我拟人化)增强了每个原始机器人响应。我们的工作不仅揭示了一个以前未被充分探索的机器人响应新类别,还为未来关于动态调整 AI 系统中自我拟人化水平以符合伦理标准和用户期望的研究奠定了基础。

[NLP-171] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

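[Quick read]: This paper addresses the risk that text-to-image generation models produce harmful content in response to malicious queries. To assess the safety of existing models, it proposes a jailbreaking method called the Chain-of-Jailbreak (CoJ) attack, which decomposes a malicious query into multiple sub-queries and prompts the model to generate and then iteratively edit an image, thereby bypassing existing safeguards. Experiments show that CoJ succeeds far more often than other jailbreaking methods across several image generation services. As a countermeasure, the paper proposes Think Twice Prompting, a prompting-based defense that blocks over 95% of CoJ attacks.

Link: https://arxiv.org/abs/2410.03869
Authors: Wenxuan Wang, Kuiyi Gao, Zihan Jia, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Shuai Wang, Wenxiang Jiao, Zhaopeng Tu
Keywords: Stable Diffusion, Text-based image generation, hold significant potential, Diffusion and DALL-E, Text-based image
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: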

Abstract:Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them the focus in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, encompassing nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro, demonstrate that our CoJ attack method can successfully bypass the safeguards of models for over 60% cases, which significantly outperforms other jailbreaking methods (i.e., 14%). Further, to enhance these models’ safety against our CoJ attack method, we also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend over 95% of CoJ attack. We release our dataset and code to facilitate the AI safety research.
摘要:基于文本的图像生成模型,如 Stable Diffusion 和 DALL-E 3,在内容创作和出版工作流程中具有巨大的潜力,因此近年来成为研究焦点。尽管这些模型在生成多样化和生动的图像方面表现出色,但防止生成有害内容(如辱骂、暴力或色情材料)的努力也在不断加强。为了评估现有模型的安全性,我们引入了一种名为 Chain-of-Jailbreak (CoJ) 攻击的新型破解方法,通过逐步编辑过程破坏图像生成模型。具体来说,对于那些无法通过单一提示绕过安全措施的恶意查询,我们有意将其分解为多个子查询。然后,图像生成模型被提示根据这些子查询生成并迭代编辑图像。为了评估我们的 CoJ 攻击方法的有效性,我们构建了一个综合数据集 CoJ-Bench,涵盖了九种安全场景、三种编辑操作和三种编辑元素。在 GPT-4V、GPT-4o、Gemini 1.5 和 Gemini 1.5 Pro 提供的四个广泛使用的图像生成服务上的实验表明,我们的 CoJ 攻击方法在超过 60% 的情况下能够成功绕过模型的安全措施,显著优于其他破解方法(即 14%)。此外,为了增强这些模型对我们 CoJ 攻击方法的防御能力,我们还提出了一种有效的基于提示的方法,称为 Think Twice Prompting,该方法能够成功防御超过 95% 的 CoJ 攻击。我们发布了数据集和代码,以促进 AI 安全研究。

[NLP-172] Can Language Models Reason about Individualistic Human Values and Preferences?

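[Quick read]: This paper addresses the oversimplification and stereotyping that can arise when AI systems handle human diversity, and proposes a new approach called individualistic alignment. The key to the solution is IndieValueCatalog, a dataset transformed from the World Values Survey (WVS) and used to study how well language models reason about individualistic values. With it, the paper exposes the limits of frontier language models in individualistic value reasoning and proposes the Value Inequity Index (σINEQUITY) to measure their partiality when reasoning about global individualistic values. Finally, the authors train a series of Individualistic Value Reasoners (IndieValueReasoner) to improve this capability and outline directions and challenges for future research.

Link: https://arxiv.org/abs/2410.03868
Authors: Liwei Jiang, Taylor Sorensen, Sydney Levine, Yejin Choi
Keywords: Recent calls, pluralistic alignment emphasize, individualistic, calls for pluralistic, systems should address
Categories: Computation and Language (cs.CL)
Comments: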

Abstract: Recent calls for pluralistic alignment emphasize that AI systems should address the diverse needs of all people. Yet, efforts in this space often require sorting people into fixed buckets of pre-specified diversity-defining dimensions (e.g., demographics, personalities, communication styles), risking smoothing out or even stereotyping the rich spectrum of individualistic variations. To achieve an authentic representation of diversity that respects individuality, we propose individualistic alignment. While individualistic alignment can take various forms, in this paper, we introduce IndieValueCatalog, a dataset transformed from the influential World Values Survey (WVS), to study language models (LMs) on the specific challenge of individualistic value reasoning. Specifically, given a sample of an individual’s value-expressing statements, models are tasked with predicting their value judgments in novel cases. With IndieValueCatalog, we reveal critical limitations in frontier LMs’ abilities to reason about individualistic human values with accuracies, only ranging between 55% to 65%. Moreover, our results highlight that a precise description of individualistic values cannot be approximated only via demographic information. We also identify a partiality of LMs in reasoning about global individualistic values, as measured by our proposed Value Inequity Index (σINEQUITY). Finally, we train a series of Individualistic Value Reasoners (IndieValueReasoner) using IndieValueCatalog to enhance models’ individualistic value reasoning capability, revealing new patterns and dynamics into global human values. We outline future research challenges and opportunities for advancing individualistic alignment.
摘要:近期关于多元对齐的呼吁强调,AI系统应满足所有人的多样化需求。然而,这一领域的努力往往需要将人们分类到预先设定的多样性定义维度(如人口统计学、个性、沟通风格)的固定类别中,这可能会抹平甚至刻板化个体差异的丰富性。为了实现尊重个体性的真实多样性表示,我们提出了个体化对齐。虽然个体化对齐可以有多种形式,但本文中,我们引入了IndieValueCatalog,一个从具有影响力的世界价值观调查(World Values Survey, WVS)转化而来的数据集,用于研究语言模型(Language Models, LMs)在个体化价值推理这一特定挑战上的表现。具体来说,给定一个个体价值表达的样本,模型需要预测其在新颖情境中的价值判断。通过IndieValueCatalog,我们揭示了前沿LMs在个体化人类价值推理能力上的关键局限性,准确率仅在55%到65%之间。此外,我们的结果强调,个体化价值的精确描述不能仅通过人口统计信息来近似。我们还发现了LMs在推理全球个体化价值时存在偏见,这通过我们提出的价值不平等指数(Value Inequity Index, \sigmaINEQUITY)得以衡量。最后,我们使用IndieValueCatalog训练了一系列个体化价值推理器(Individualistic Value Reasoners, IndieValueReasoner),以增强模型的个体化价值推理能力,揭示了全球人类价值的新模式和动态。我们概述了推进个体化对齐的未来研究挑战和机遇。

[NLP-173] DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

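[Quick read]: This paper addresses the lack of reasoning strategies in large language models (LLMs) that adapt to the specific question and to the model's own capability. The key to the solution is DOTS, which adjusts the reasoning strategy dynamically through optimal reasoning trajectory search. The method defines atomic reasoning action modules, searches for the optimal action trajectory for each training question through iterative exploration and evaluation, and uses the collected trajectories to train an LLM to plan reasoning trajectories for unseen questions. Two learning paradigms are proposed: fine-tuning an external LLM as a planner that guides the task-solving LLM, or fine-tuning the task-solving LLM directly so that it internalizes reasoning action planning. Experiments show that the method outperforms static reasoning techniques and vanilla instruction tuning across multiple reasoning tasks, and that it adapts the amount of computation to problem complexity.

Link: https://arxiv.org/abs/2410.03864
Authors: Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu
Keywords: large language models, gained significant attention, task-solving LLM, LLM, specific task-solving LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: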

Abstract:Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.
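
The per-question trajectory search in step (ii) of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three action layers, the action names, and `toy_solve` (a stand-in for running the task-solving LLM and grading its answer) are all hypothetical.

```python
from itertools import product

# Hypothetical atomic reasoning actions, grouped into layers.
# An empty string means "skip this layer".
ANALYSIS = ["", "rewrite", "decompose"]
SOLUTION = ["cot", "program"]
VERIFICATION = ["", "self_verify"]


def enumerate_trajectories():
    """Yield every composition of one choice per layer, dropping skipped layers."""
    for combo in product(ANALYSIS, SOLUTION, VERIFICATION):
        yield tuple(action for action in combo if action)


def search_best_trajectory(question, solve, n_trials=4):
    """Return (trajectory, success_rate) with the highest success rate.

    Ties are broken in favor of shorter trajectories, i.e. less computation.
    """
    best, best_key = None, None
    for traj in enumerate_trajectories():
        successes = sum(solve(question, traj, trial) for trial in range(n_trials))
        key = (successes / n_trials, -len(traj))
        if best_key is None or key > best_key:
            best, best_key = traj, key
    return best, best_key[0]


def toy_solve(question, traj, trial):
    """Hypothetical stand-in for the task-solving LLM plus an answer checker."""
    if question["kind"] == "arithmetic":
        return "program" in traj
    # Other questions need step-by-step thinking; decomposition makes it reliable.
    return "cot" in traj and ("decompose" in traj or trial % 2 == 0)
```

With these toy rules, an arithmetic question ends up with the single-action trajectory `("program",)`, while other questions are assigned `("decompose", "cot")`. Pairs of question and best trajectory collected this way would then serve as training data for the planner described in step (iii).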

[NLP-174] SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

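[Quick read]: This paper addresses the limited ability of existing autonomous software engineering systems to handle visually rich software problems in domains such as front-end development, game development, and DevOps. Its key contribution is SWE-bench Multimodal (SWE-bench M), a benchmark that evaluates how well systems fix bugs in visual, user-facing JavaScript software. SWE-bench M comprises 617 task instances from 17 JavaScript libraries covering web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping, so it tests both visual problem-solving and cross-language generalization. Top-performing SWE-bench systems struggle on SWE-bench M, while SWE-agent, thanks to its flexible, language-agnostic features, resolves 12% of task instances compared to 6% for the next best system.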

Link: https://arxiv.org/abs/2410.03859
Authors: John Yang,Carlos E. Jimenez,Alex L. Zhang,Kilian Lieret,Joyce Yang,Xindi Wu,Ori Press,Niklas Muennighoff,Gabriel Synnaeve,Karthik R. Narasimhan,Diyi Yang,Sida I. Wang,Ofir Press
Keywords: Autonomous systems, capable of fixing, SWE-bench, Autonomous, software engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each SWE-bench M task instance contains at least one image in its problem statement or unit tests. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization. Lastly, we show that SWE-agent’s flexible language-agnostic features enable it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.

[NLP-175] You Know What I'm Saying: Jailbreak Attack via Implicit Reference

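[Quick read]: This paper examines a gap in large language model (LLM) alignment: current methods fail to detect malicious objectives that are expressed through context inside nested harmless objectives. Its key contribution is identifying a previously overlooked vulnerability, Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context, which yields malicious content without triggering refusal responses and bypasses existing detection. Experiments report an attack success rate above 90% on most state-of-the-art LLMs, with larger models being more vulnerable. The paper stresses the urgent need for defenses that understand contextual attacks, and it also introduces a cross-model strategy in which a less secure model generates the malicious context to raise the success rate against other models.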

Link: https://arxiv.org/abs/2410.03857
Authors: Tianyu Wu,Lingrui Mei,Ruibin Yuan,Lujun Li,Wei Xue,Yike Guo
Keywords: involving scene nesting, objectives involving scene, methods remain inadequate, large language model, alignment have enabled
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection mechanisms. Our experiments demonstrate AIR’s effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at this https URL.

[NLP-176] Detecting Machine-Generated Long-Form Content with Latent-Space Variables

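[Quick read]: This paper addresses the growing difficulty of telling long-form text generated by large language models (LLMs) apart from human-written text, which matters for authenticity and trustworthiness. Its key idea is a more robust detector: a latent-space model trained on sequences of events or topics derived from human-written texts, which captures the inherent difference in how machines and humans trigger events and transition between them. Across three domains, this approach improves detection by 31% over strong zero-shot baselines such as DetectGPT.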

Link: https://arxiv.org/abs/2410.03856
Authors: Yufei Tian,Zeyu Pan,Nanyun Peng
Keywords: large language models, distinguishing machine-generated outputs, generate fluent long-form, fluent long-form texts, trustworthiness of expressions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The increasing capability of large language models (LLMs) to generate fluent long-form texts is presenting new challenges in distinguishing machine-generated outputs from human-written ones, which is crucial for ensuring authenticity and trustworthiness of expressions. Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts, including different prompting and decoding strategies, and adversarial attacks. We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts by training a latent-space model on sequences of events or topics derived from human-written texts. In three different domains, machine-generated texts, which are originally inseparable from human texts on the token level, can be better distinguished with our latent-space model, leading to a 31% improvement over strong baselines such as DetectGPT. Our analysis further reveals that, unlike humans, modern LLMs like GPT-4 generate event triggers and their transitions differently, an inherent disparity that helps our method to robustly detect machine-generated texts.
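
The idea of scoring a text by its event transitions rather than by its tokens can be illustrated with a deliberately simplified stand-in. The sketch below swaps the paper's latent-space model for a first-order Markov model with add-one smoothing; the event labels and the sample sequences are hypothetical, and event extraction is assumed to have happened upstream.

```python
import math
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count event-to-event transitions in human-written event sequences."""
    counts = defaultdict(Counter)
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts, vocab


def avg_log_likelihood(seq, counts, vocab):
    """Mean add-one-smoothed log-probability of the transitions in seq."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0  # a single event has no transition to score
    v = len(vocab) + 1  # one extra slot for unseen event types
    total = 0.0
    for prev, nxt in pairs:
        row = counts.get(prev, Counter())
        total += math.log((row[nxt] + 1) / (sum(row.values()) + v))
    return total / len(pairs)


# Hypothetical event sequences extracted from human-written stories.
HUMAN = [
    ["meet", "argue", "reconcile"],
    ["meet", "argue", "leave"],
    ["meet", "travel", "argue", "reconcile"],
]
COUNTS, VOCAB = fit_transitions(HUMAN)
```

A sequence whose transitions resemble the human data, such as `["meet", "argue", "reconcile"]`, receives a higher average log-likelihood than one with unusual transitions, and a threshold on that score would act as the detector. Choosing the threshold and extracting events reliably are the hard parts that this sketch leaves out.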

[NLP-177] Using Prompts to Guide Large Language Models in Imitating a Real Person's Language Style

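[Quick read]: This paper studies how prompt design affects the ability of large language models (LLMs) to imitate a person's language style. It compares three LLMs under the same zero-shot prompt and one LLM under three different prompts, and finds that Llama 3 imitates language styles best and that Tree-of-Thoughts (ToT) prompting is the most effective way to guide it. Using a ToT framework, Llama 3 is guided to converse in the language style of a specific individual without altering its core parameters, which yields a text-based conversational AI that reflects that individual's style.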

Link: https://arxiv.org/abs/2410.03848
Authors: Ziyang Chen,Stylios Moscholios
Keywords: demonstrated strong capabilities, natural language processing, GPT series, Large language models, language style
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large language models (LLMs), such as the GPT series and the Llama series, have demonstrated strong capabilities in natural language processing, contextual understanding, and text generation. In recent years, researchers have been trying to enhance the abilities of LLMs in performing various tasks, and numerous studies have shown that well-designed prompts can significantly improve the performance of LLMs on these tasks. This study compares the language style imitation ability of three different large language models under the guidance of the same zero-shot prompt. It also compares the imitation ability of the same large language model when guided by three different prompts individually. Additionally, by applying a Tree-of-Thoughts (ToT) prompting method to Llama 3, a conversational AI with the language style of a real person was created. In this study, three evaluation methods were used to evaluate LLMs and prompts. The results show that Llama 3 performs best at imitating language styles, and that the ToT prompting method is the most effective way to guide it in imitating language styles. Using a ToT framework, Llama 3 was guided to interact with users in the language style of a specific individual without altering its core parameters, thereby creating a text-based conversational AI that reflects the language style of the individual.

[NLP-178] ORAssistant: A Custom RAG-based Conversational Assistant for OpenROAD

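[Quick read]: Open-source electronic design automation (EDA) tools lower the complexity, cost, and access barriers of commercial EDA tools, and this paper aims to make the OpenROAD flow easier to use. Its key contribution is ORAssistant, a conversational assistant based on Retrieval-Augmented Generation (RAG) that gives context-specific answers to common user queries about installation, command usage, flow setup, and execution. ORAssistant currently integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and KLayout, and its architecture is scalable to other open-source tools, operating modes, and LLMs. It is built and tested with Google Gemini as the base LLM, and early evaluation shows notable improvements in performance and accuracy compared to non-fine-tuned LLMs.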

Link: https://arxiv.org/abs/2410.03845
Authors: Aviral Kaintura,Palaniappan R,Shui Song Luar,Indira Iyer Almeida
Keywords: Electronic Design Automation, Open-source Electronic Design, commercial EDA tools, addressing key barriers, transforming chip design
Categories: Computation and Language (cs.CL); Hardware Architecture (cs.AR)
Comments:

Abstract:Open-source Electronic Design Automation (EDA) tools are rapidly transforming chip design by addressing key barriers of commercial EDA tools such as complexity, costs, and access. Recent advancements in Large Language Models (LLMs) have further enhanced efficiency in chip design by providing user assistance across a range of tasks like setup, decision-making, and flow automation. This paper introduces ORAssistant, a conversational assistant for OpenROAD, based on Retrieval-Augmented Generation (RAG). ORAssistant aims to improve the user experience for the OpenROAD flow, from RTL to GDSII, by providing context-specific responses to common user queries, including installation, command usage, flow setup, and execution, in prose format. Currently, ORAssistant integrates OpenROAD, OpenROAD-flow-scripts, Yosys, OpenSTA, and KLayout. The data model is built from publicly available documentation and GitHub resources. The proposed architecture is scalable, supporting extensions to other open-source tools, operating modes, and LLM models. We use Google Gemini as the base LLM model to build and test ORAssistant. Early evaluation results of the RAG-based model show notable improvements in performance and accuracy compared to non-fine-tuned LLMs.
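
The retrieve-then-prompt pattern behind a RAG assistant can be sketched in a few lines. This is a generic illustration, not ORAssistant's architecture: the document names and contents in `DOCS` are hypothetical placeholders, retrieval uses plain token overlap instead of a learned embedding, and the final LLM call is left out.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "and", "in", "for", "how", "do", "i", "is"}

# Hypothetical documentation snippets; a real assistant would index tool docs.
DOCS = {
    "install.md": "To install the flow, clone the repository and run the setup script.",
    "timing.md": "Static timing analysis reports setup and hold slack for each path.",
    "layout.md": "Open the final layout in the viewer to inspect routed metal layers.",
}


def tokenize(text):
    """Lowercase word tokens with common stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9_]+", text.lower()) if t not in STOPWORDS]


def retrieve(query, docs, k=2):
    """Return up to k document names ranked by token overlap with the query."""
    q = Counter(tokenize(query))
    scored = []
    for name, text in docs.items():
        d = Counter(tokenize(text))
        overlap = sum(min(q[t], d[t]) for t in q)
        scored.append((overlap, name))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


def build_prompt(query, docs, k=2):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n\n".join(f"[{name}]\n{docs[name]}" for name in retrieve(query, docs, k))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

Calling `build_prompt("How do I install the flow?", DOCS)` places only the installation snippet in the context, because the other documents share no content words with the question. The resulting string is what would be sent to the base LLM.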

[NLP-179] FaithCAMERA: Construction of a Faithful Dataset for Ad Text Generation

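[Quick read]: In ad text generation (ATG), good ad text must be faithful to the input document while still containing important information that appeals to potential customers. The existing CAMERA evaluation data suits informativeness, but its references often contain information unfaithful to the input. The key contribution here is FaithCAMERA, an alternative evaluation dataset built by refining the CAMERA references together with in-house ad creators so that reference faithfulness is guaranteed. With FaithCAMERA, existing faithfulness-improving methods can be evaluated on how informative their output remains. Experiments show that removing training data containing unfaithful entities improves faithfulness and informativeness at the entity level but lowers both at the sentence level, which suggests that future ATG research must ensure the faithfulness of training data and not only scale it.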

Link: https://arxiv.org/abs/2410.03839
Authors: Akihiko Kato,Masato Mita,Soichiro Murakami,Ukyo Honda,Sho Hoshino,Peinan Zhang
Keywords: text generation, desirable ad text, ATG, ATG research, faithfulness
Categories: Computation and Language (cs.CL)
Comments: For dataset, see this https URL

Abstract:In ad text generation (ATG), desirable ad text is both faithful and informative. That is, it should be faithful to the input document, while at the same time containing important information that appeals to potential customers. The existing evaluation data, CAMERA (arXiv:2309.12030), is suitable for evaluating informativeness, as it consists of reference ad texts created by ad creators. However, these references often include information unfaithful to the input, which is a notable obstacle in promoting ATG research. In this study, we collaborate with in-house ad creators to refine the CAMERA references and develop an alternative ATG evaluation dataset called FaithCAMERA, in which the faithfulness of references is guaranteed. Using FaithCAMERA, we can evaluate how well existing methods for improving faithfulness can generate informative ad text while maintaining faithfulness. Our experiments show that removing training data that contains unfaithful entities improves the faithfulness and informativeness at the entity level, but decreases both at the sentence level. This result suggests that for future ATG research, it is essential not only to scale the training data but also to ensure their faithfulness. Our dataset will be publicly available.
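
The data-filtering experiment mentioned in the abstract (removing training examples whose ad text contains unfaithful entities) can be sketched as below. This is a rough illustration under assumptions, not the authors' pipeline: the record layout, the `ad_entities` annotations, and the case-insensitive substring check are all hypothetical simplifications.

```python
def unfaithful_entities(ad_entities, source_text):
    """Return the entities of an ad text that never appear in the source document.

    Matching is a crude case-insensitive substring test, so it can miss
    paraphrases and can accept accidental partial matches.
    """
    source = source_text.lower()
    return [entity for entity in ad_entities if entity.lower() not in source]


def filter_faithful(examples):
    """Keep only the training examples whose ad entities all occur in the source."""
    return [
        ex for ex in examples
        if not unfaithful_entities(ex["ad_entities"], ex["source"])
    ]


# Hypothetical training records with pre-annotated ad entities.
EXAMPLES = [
    {
        "source": "Acme Hotel offers free breakfast and parking.",
        "ad": "Free breakfast at Acme Hotel",
        "ad_entities": ["Acme Hotel", "free breakfast"],
    },
    {
        "source": "Acme Hotel offers free breakfast and parking.",
        "ad": "Acme Hotel: 50% off this week",
        "ad_entities": ["Acme Hotel", "50% off"],
    },
]
```

Here `filter_faithful(EXAMPLES)` keeps only the first record, since the discount in the second ad is not supported by its source. A filter of this kind works at the entity level only, which is consistent with the abstract's observation that sentence-level faithfulness and informativeness can still get worse.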

[NLP-180] Learning Code Preference via Synthetic Evolution

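[Quick read]: This paper studies code preference learning: how to train models that predict meaningful preferences between code candidates, and how human and LLM preferences align with verifiable code properties (correctness, efficiency, and security) and with developer taste. Its key contribution is CodeFavor, a framework that trains pairwise code preference models from synthetic evolution data, including code commits and code critiques. The paper also introduces CodePrefBench, a benchmark of 1364 rigorously curated code preference tasks. CodeFavor improves the accuracy of model-based code preferences by up to 28.8%, and its models match the performance of models with 6-9x more parameters while being 34x more cost-effective.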

Link: https://arxiv.org/abs/2410.03837
Authors: Jiawei Liu,Thanh Nguyen,Mingyue Shang,Hantian Ding,Xiaopeng Li,Yu Yu,Varun Kumar,Zijian Wang
Keywords: Large Language Models, Large Language, remarkable coding capabilities, recently demonstrated remarkable, demonstrated remarkable coding
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks to cover three verifiable properties (correctness, efficiency, and security) along with human preference. Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%. Meanwhile, CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective. We also rigorously validate the design choices in CodeFavor via a comprehensive set of controlled experiments. Furthermore, we discover the prohibitive costs and limitations of human-based code preference: despite spending 23.4 person-minutes on each task, 15.1-40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.

[NLP-181] Misinformation with Legal Consequences (MisLC): A New Task Towards Harnessing Societal Harm of Misinformation EMNLP2024

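[Quick read]: Existing misinformation research focuses mainly on assessing veracity and overlooks legal implications and social consequences. The key contribution of this paper is a new task, Misinformation with Legal Consequences (MisLC), which uses legal issues as a measure of societal harm and draws definitions from 4 broad legal topics and 11 fine-grained legal issues, including hate speech, election laws, and privacy regulations. The authors advocate a two-step dataset curation approach that combines crowd-sourced checkworthiness with expert evaluations of misinformation. Empirical results, from problem definition through experiments and expert involvement, show that the latest large language models and retrieval-augmented generation are effective baselines but remain far from expert performance.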

Link: https://arxiv.org/abs/2410.03829
Authors: Chu Fei Luo,Radin Shayanfar,Rohan Bhambhoria,Samuel Dahan,Xiaodan Zhu
Keywords: defined as false, innocuous intent, significant societal harm, false or inaccurate, result in significant
Categories: Computation and Language (cs.CL)
Comments: 8.5 pages of main body, 20 pages total; Accepted to Findings of EMNLP 2024

Abstract:Misinformation, defined as false or inaccurate information, can result in significant societal harm when it is spread with malicious or even innocuous intent. The rapid online information exchange necessitates advanced detection mechanisms to mitigate misinformation-induced harm. Existing research, however, has predominantly focused on assessing veracity, overlooking the legal implications and social consequences of misinformation. In this work, we take a novel angle to consolidate the definition of misinformation detection using legal issues as a measurement of societal ramifications, aiming to bring interdisciplinary efforts to tackle misinformation and its consequence. We introduce a new task: Misinformation with Legal Consequence (MisLC), which leverages definitions from a wide range of legal domains covering 4 broader legal topics and 11 fine-grained legal issues, including hate speech, election laws, and privacy regulations. For this task, we advocate a two-step dataset curation approach that utilizes crowd-sourced checkworthiness and expert evaluations of misinformation. We provide insights about the MisLC task through empirical evidence, from the problem definition to experiments and expert involvement. While the latest large language models and retrieval-augmented generation are effective baselines for the task, we find they are still far from replicating expert performance.

[NLP-182] Large Language Models can be Strong Self-Detoxifiers

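[Quick read]: This paper addresses harmful and toxic output from large language models (LLMs). Its key contribution is Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm. SASA uses the LLM's own contextual representations to learn linear subspaces that separate toxic from non-toxic output, then adjusts the autoregressive sampling strategy during generation to steer away from the toxic subspace. This markedly reduces toxicity without an additional reward model or re-training.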

Link: https://arxiv.org/abs/2410.03818
Authors: Ching-Yun Ko,Pin-Yu Chen,Payel Das,Youssef Mroueh,Soham Dan,Georgios Kollias,Subhajit Chaudhury,Tejaswini Pedapati,Luca Daniel
Keywords: aligning large language, large language models, likelihood of generating, generating harmful, essential task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages

Abstract:Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy. Evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level by only using the LLM’s internal representations.
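
Margin-guided sampling of the kind the abstract describes can be sketched as a re-weighting of next-token logits. This is a simplified illustration, not the exact SASA update: the additive form `logit + beta * margin`, the classifier weights `w` and `b`, and the per-candidate representations are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin(rep, w, b):
    """Signed score of a representation under a linear toxicity classifier.

    Positive values are taken to mean the non-toxic side of the boundary.
    """
    return sum(wi * ri for wi, ri in zip(w, rep)) + b


def steer(logits, candidate_reps, w, b, beta=1.0):
    """Shift each candidate's logit by beta times its margin, then renormalize.

    candidate_reps[i] is assumed to be the contextual representation of the
    current output extended with candidate token i.
    """
    adjusted = [
        logit + beta * margin(rep, w, b)
        for logit, rep in zip(logits, candidate_reps)
    ]
    return softmax(adjusted)
```

With `beta=0.0` the function reduces to ordinary sampling from the model's distribution, and larger values of `beta` move probability mass toward candidates that keep the output on the non-toxic side. In a real decoder the representations would come from the LLM's hidden states, which is why no reward model or re-training is involved.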

[NLP-183] Can Mamba Always Enjoy the “Free Lunch”?

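[Quick read]: This paper analyzes the expressive ability of Mamba from a theoretical standpoint, focusing on the COPY operation and on dynamic programming (DP) problems. Building on the connection between Mamba and linear attention, it shows that a constant-size Mamba may hit a bottleneck on COPY, while a Mamba whose size scales linearly with sequence length can perform it perfectly. For DP problems solved with Chain of Thought (CoT), the total cost of Mamba on arbitrary problems is comparable to standard and efficient Transformers, although Mamba can save overhead on DP problems with favorable properties such as locality.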

Link: https://arxiv.org/abs/2410.03810
Authors: Ruifeng Ren,Zhicong Li,Yong Liu
Keywords: Large Language Models, current Large Language, Language Models, Large Language, current Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Transformers have been the cornerstone of current Large Language Models (LLMs); however, their linear growth in overhead during inference with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-level size during inference, and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask: can Mamba always enjoy the “free lunch”? In this paper, we focus on analyzing the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, while it can achieve perfect performance when the size scales linearly with sequence length. Based on this observation, we analyze Mamba’s ability to tackle DP problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.

[NLP-184] Metadata Matters for Time Series: Informative Forecasting with Transformers

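[Quick read]: Time series forecasting usually relies on numerical data alone and ignores the valuable information carried by metadata such as dataset and variate descriptions. The key contribution of this paper is the Metadata-informed Time Series Transformer (MetaTST). MetaTST formalizes metadata into natural language with pre-designed templates and uses large language models (LLMs) to encode that text into metadata tokens, which supplement the classic series tokens and form an informative embedding. A Transformer encoder then lets series and metadata tokens interact, extending the series representations for more accurate forecasting. The design also lets the model adaptively learn context-specific patterns across scenarios, which is particularly effective for large-scale, diverse-scenario forecasting tasks.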

Link: https://arxiv.org/abs/2410.03806
Authors: Jiaxiang Dong,Haixu Wu,Yuxuan Wang,Li Zhang,Jianmin Wang,Mingsheng Long
Keywords: Time series, extensive real-world applications, Time series forecasting, series, Time
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g., dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer (MetaTST), which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes them into natural languages by pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.
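
The input construction described in the abstract (template, text encoder, then metadata tokens alongside series tokens) can be sketched as follows. Everything here is a hypothetical simplification: the template wording, the metadata fields, the token order, and `toy_text_encoder`, which stands in for the LLM encoder with bucketed character counts.

```python
def describe(meta):
    """Turn structured metadata into text with a fixed, pre-designed template."""
    return (
        f"Dataset: {meta['dataset']}. Variable: {meta['variable']}. "
        f"Sampling interval: {meta['interval']}."
    )


def patch(series, size):
    """Split a series into non-overlapping patches, dropping an incomplete tail."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def toy_text_encoder(text, dim):
    """Stand-in for an LLM encoder: one count vector of width dim per sentence."""
    tokens = []
    for sentence in filter(None, (part.strip() for part in text.split("."))):
        vec = [0.0] * dim
        for ch in sentence:
            vec[ord(ch) % dim] += 1.0
        tokens.append(vec)
    return tokens


def build_input(series, meta, patch_size):
    """Concatenate metadata tokens and series tokens into one token sequence.

    Both token types share the width patch_size here only to keep the toy
    simple; a real model would project them to a common embedding size.
    """
    series_tokens = [[float(x) for x in p] for p in patch(series, patch_size)]
    meta_tokens = toy_text_encoder(describe(meta), dim=patch_size)
    return meta_tokens + series_tokens


# Hypothetical metadata record.
META = {
    "dataset": "hypothetical hourly electricity load",
    "variable": "transformer oil temperature",
    "interval": "1 hour",
}
```

For a ten-step series with `patch_size=4`, `build_input` returns three metadata tokens followed by two series tokens. A Transformer encoder applied to that sequence is what would let each series token attend to the metadata.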

[NLP-185] Mixture of Attentions For Speculative Decoding

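[Quick read]: The growing parameter counts of large language models (LLMs) make them costly and challenging to deploy. Speculative decoding (SD) uses a small model to propose future tokens that the LLM verifies in parallel, but existing small models suffer from a lack of on-policyness during training and from partial observability. The key contribution of this paper is a more grounded small-model architecture, Mixture of Attentions for SD. It applies to conventional single-device deployment, where it improves the speedup of EAGLE-2 by 9.5% and its acceptance length by 25%, and to a novel client-server deployment, where it achieves state-of-the-art latencies with minimal calls to the server and maintains higher accuracy than other SD methods after a complete disconnection.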

Link: https://arxiv.org/abs/2410.03804
Authors: Matthieu Zimmer,Milan Gritta,Gerasimos Lampouras,Haitham Bou Ammar,Jun Wang
Keywords: Large Language Models, Large Language, parameters of Large, Language Models, computational requirements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
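
The draft-then-verify loop that speculative decoding builds on can be sketched with greedy acceptance. This illustrates generic SD only, not the Mixture of Attentions architecture: `toy_target` and `toy_draft` are hypothetical next-token functions over integers, and verification runs sequentially here although a real LLM would check all proposed tokens in one parallel forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Propose k draft tokens and keep the longest run the target agrees with.

    One token from the target is always appended (a correction after a
    mismatch, or a bonus token after full acceptance), so each step
    makes progress.
    """
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposal:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix, draft_next, target_next, k, max_new):
    """Generate max_new tokens and report how many verification steps were used."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[:len(prefix) + max_new], steps


def toy_target(ctx):
    """Hypothetical large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Hypothetical small model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy acceptance the output is identical to decoding with the target alone, and the gain is that several tokens are confirmed per target step. The average number of tokens accepted per step corresponds to the acceptance length that the abstract reports improving.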

[NLP-186] Self-Powered LLM Modality Expansion for Large Speech-Text Models EMNLP2024

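[Quick read]: When large language models (LLMs) are extended into large speech-text models (LSMs), they tend to over-rely on speech inputs and treat the entire speech modality as instructions while neglecting textual instructions, an issue the authors call speech anchor bias. The key contribution is a self-powered LSM, which uses augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. This mitigates speech anchor bias and improves the fusion of the speech and text modalities.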

链接: https://arxiv.org/abs/2410.03798
作者: Tengfei Yu,Xuebo Liu,Zhiyi Hou,Liang Ding,Dacheng Tao,Min Zhang
关键词-EN: exhibit remarkable performance, Large language models, large speech-text models, Large language, integrating speech capabilities
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and multimodal data instruction-tuning offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs, identifying a critical issue termed speech anchor bias, a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives, thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. Our experiments across a range of speech-based tasks demonstrate that self-powered LSM mitigates speech anchor bias and improves the fusion of speech and text modalities in LSMs. Data, code and scripts are freely available at this https URL.

[NLP-187] Searching for Best Practices in Medical Transcription with Large Language Model

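[Quick Read]: This paper addresses transcription accuracy problems in medical monologues caused by dense specialized terminology and accent variation, focusing on monologues by doctors with Indian accents. The key to the solution is to use a Large Language Model (LLM) together with advanced language modeling techniques to lower the Word Error Rate (WER) and ensure precise recognition of critical medical terms. On a dataset of medical recordings, the proposed system substantially improves both overall transcription accuracy and the fidelity of key terms, offering an efficient and accurate tool for clinical documentation.

Link: https://arxiv.org/abs/2410.03797
Authors: Jiafeng Li,Yanda Mu
Keywords: Large Language Model, existing automated systems, Word Error Rate, presents a significant, density of specialized
Categories: Computation and Language (cs.CL)
Comments: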

Abstract:The transcription of medical monologues, especially those containing a high density of specialized terminology and delivered with a distinct accent, presents a significant challenge for existing automated systems. This paper introduces a novel approach leveraging a Large Language Model (LLM) to generate highly accurate medical transcripts from audio recordings of doctors’ monologues, specifically focusing on Indian accents. Our methodology integrates advanced language modeling techniques to lower the Word Error Rate (WER) and ensure the precise recognition of critical medical terms. Through rigorous testing on a comprehensive dataset of medical recordings, our approach demonstrates substantial improvements in both overall transcription accuracy and the fidelity of key medical terminologies. These results suggest that our proposed system could significantly aid in clinical documentation processes, offering a reliable tool for healthcare providers to streamline their transcription needs while maintaining high standards of accuracy.
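Since the abstract reports results in terms of Word Error Rate, a short sketch of how WER is conventionally computed may help: it is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are made up for illustration and are not taken from the paper, whose exact scoring setup is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one medical term is misrecognised out of six words.
reference = "patient denies chest pain or dyspnea"
hypothesis = "patient denies chest pain or dyspepsia"
print(word_error_rate(reference, hypothesis))
```

A single wrong term barely moves the overall WER, which is one reason to track the accuracy of critical medical terms separately, as the abstract does.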

[NLP-188] Reconstructing Human Mobility Pattern: A Semi-Supervised Approach for Cross-Dataset Transfer Learning

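[Quick Read]: This paper addresses two main problems in human mobility research: trajectory data fails to capture the semantic dependencies between activities, and real-world trajectory data is incomplete. The key to the solution is a model built on semantic activity chains, trained with a semi-supervised iterative transfer learning algorithm that lets it adapt to different geographical contexts and cope with data scarcity. The model is validated on comprehensive US datasets, where it reconstructs activity chains and generates high-quality synthetic mobility data with a Jensen-Shannon Divergence (JSD) of only 0.001, indicating that synthetic and real data are highly similar. The algorithm also transfers US mobility patterns to Egypt, with a reported 64% increase in similarity (JSD falls from 0.09 to 0.03), showing strong potential for global human mobility modeling.

Link: https://arxiv.org/abs/2410.03788
Authors: Xishun Liao,Yifan Liu,Chenchen Kuai,Haoxuan Ma,Yueshuai He,Shangqing Cao,Chris Stanford,Jiaqi Ma
Keywords: Understanding human mobility, Understanding human, urban planning, public health, crucial for urban
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 23 pages, 10 figures, 3 tables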

Abstract:Understanding human mobility patterns is crucial for urban planning, transportation management, and public health. This study tackles two primary challenges in the field: the reliance on trajectory data, which often fails to capture the semantic interdependencies of activities, and the inherent incompleteness of real-world trajectory data. We have developed a model that reconstructs and learns human mobility patterns by focusing on semantic activity chains. We introduce a semi-supervised iterative transfer learning algorithm to adapt models to diverse geographical contexts and address data scarcity. Our model is validated using comprehensive datasets from the United States, where it effectively reconstructs activity chains and generates high-quality synthetic mobility data, achieving a low Jensen-Shannon Divergence (JSD) value of 0.001, indicating a close similarity between synthetic and real data. Additionally, sparse GPS data from Egypt is used to evaluate the transfer learning algorithm, demonstrating successful adaptation of US mobility patterns to Egyptian contexts, achieving a 64% increase in similarity, i.e., a JSD reduction from 0.09 to 0.03. This mobility reconstruction model and the associated transfer learning algorithm show significant potential for global human mobility modeling studies, enabling policymakers and researchers to design more effective and culturally tailored transportation solutions.
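For readers unfamiliar with the metric, the sketch below shows one common way to compute the Jensen-Shannon Divergence between two discrete distributions. The abstract does not state the logarithm base or how the activity distributions are built, so base 2 (which bounds JSD between 0 and 1) and the toy distributions are assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Toy activity-type distributions (invented, not from the paper).
real = [0.5, 0.3, 0.2]
synthetic = [0.4, 0.4, 0.2]
print(jensen_shannon_divergence(real, synthetic))

# Relative JSD reduction using the rounded values quoted in the abstract.
print(relative_reduction(0.09, 0.03))
```

Using the rounded figures 0.09 and 0.03, the relative reduction comes out at roughly 67%, so the 64% quoted in the abstract presumably reflects the unrounded JSD values or a different similarity definition.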

[NLP-189] CalliffusionV2: Personalized Natural Calligraphy Generation with Flexible Multi-modal Control

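[Quick Read]: This paper addresses the shortcomings of existing calligraphy generation systems in multi-modal and fine-grained control, in particular their lack of flexibility and their inability to learn new styles quickly. The key to the solution is CalliffusionV2, a system that combines image and natural-language text inputs to control generation at a fine-grained level, adapts quickly to new styles through few-shot learning, and can also generate non-Chinese characters.

Link: https://arxiv.org/abs/2410.03787
Authors: Qisheng Liao,Liang Li,Yulang Fei,Gus Xia
Keywords: flexible multi-modal control, produce natural Chinese, natural Chinese calligraphy, natural Chinese, flexible multi-modal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 7 figures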

Abstract:In this paper, we introduce CalliffusionV2, a novel system designed to produce natural Chinese calligraphy with flexible multi-modal control. Unlike previous approaches that rely solely on image or text inputs and lack fine-grained control, our system leverages both images to guide generations at fine-grained levels and natural language texts to describe the features of generations. CalliffusionV2 excels at creating a broad range of characters and can quickly learn new styles through a few-shot learning approach. It is also capable of generating non-Chinese characters without prior training. Comprehensive tests confirm that our system produces calligraphy that is both stylistically accurate and recognizable by neural network classifiers and human evaluators.

[NLP-190] Reward-RAG: Enhancing RAG with Reward Driven Supervision

Link: https://arxiv.org/abs/2410.03780
Authors: Thang Nguyen,Peter Chin,Yu-Wing Tai
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-191] Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Link: https://arxiv.org/abs/2410.03777
Authors: Yuxuan Yao,Han Wu,Mingyang Liu,Sichun Luo,Xiongwei Han,Jie Liu,Zhijiang Guo,Linqi Song
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-192] Precision Knowledge Editing: Enhancing Safety in Large Language Models

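[Quick Read]: This paper addresses the risk of large language models (LLMs) generating toxic or harmful content. The key to the solution is Precision Knowledge Editing (PKE), an advanced technique that uses neuron weight tracking and activation pathway tracing to identify and modify toxic parameter regions in LLMs more precisely. Compared with the earlier Detoxifying Instance Neuron Modification (DINM) method, PKE manages toxic content at finer granularity, significantly reduces the attack success rate (ASR), and improves safety while maintaining overall model performance.

Link: https://arxiv.org/abs/2410.03772
Authors: Xuying Li,Zhuo Li,Yuji Kosuga,Yasuhiro Yoshida,Victor Bian
Keywords: demonstrated remarkable capabilities, Large language models, pose risks related, Large language, Precision Knowledge Editing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: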

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but they also pose risks related to the generation of toxic or harmful content. This work introduces Precision Knowledge Editing (PKE), an advanced technique that builds upon existing knowledge editing methods to more effectively identify and modify toxic parameter regions within LLMs. By leveraging neuron weight tracking and activation pathway tracing, PKE achieves finer granularity in toxic content management compared to previous methods like Detoxifying Instance Neuron Modification (DINM). Our experiments demonstrate that PKE significantly reduces the attack success rate (ASR) across various models, including Llama2-7b and Llama-3-8b-instruct, while maintaining overall model performance. Additionally, we also compared the performance of some closed-source models (gpt-4-0613 and Claude 3 Sonnet) in our experiments, and found that models adjusted using our method far outperformed the closed-source models in terms of safety. This research contributes to the ongoing efforts to make LLMs safer and more reliable for real-world applications.

[NLP-193] A Two-Stage Proactive Dialogue Generator for Efficient Clinical Information Collection Using Large Language Model

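[Quick Read]: This paper addresses inefficient doctor-patient interaction, particularly during diagnosis, when the doctor must gather complementary diagnostic information such as symptoms and surgical history over multiple dialogue turns, a process that is usually time-consuming and inefficient. The key to the solution is a diagnostic dialogue system that automates patient information collection. It uses the patient's medical history and conversation logic to drive multi-round clinical queries that efficiently collect the most relevant diagnostic information. A two-stage recommendation structure, carefully designed ranking criteria, and an interactive patient agent let the model overcome under-exploration and inflexibility in dialogue generation, so that it produces clinical queries in the style of real doctors that are efficient, professional, and safe.

Link: https://arxiv.org/abs/2410.03770
Authors: Xueshen Li,Xinlong Hou,Nirupama Ravi,Ziyi Huang,Yu Gan
Keywords: successful disease diagnosis, Efficient patient-doctor interaction, disease diagnosis, patient-doctor interaction, key factors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Prepare for submission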

Abstract:Efficient patient-doctor interaction is among the key factors for a successful disease diagnosis. During the conversation, the doctor could query complementary diagnostic information, such as the patient’s symptoms, previous surgery, and other related information that goes beyond medical evidence data (test results) to enhance disease diagnosis. However, this procedure is usually time-consuming and less-efficient, which can be potentially optimized through computer-assisted systems. As such, we propose a diagnostic dialogue system to automate the patient information collection procedure. By exploiting medical history and conversation logic, our conversation agents, particularly the doctor agent, can pose multi-round clinical queries to effectively collect the most relevant disease diagnostic information. Moreover, benefiting from our two-stage recommendation structure, carefully designed ranking criteria, and interactive patient agent, our model is able to overcome the under-exploration and non-flexible challenges in dialogue generation. Our experimental results on a real-world medical conversation dataset show that our model can generate clinical queries that mimic the conversation style of real doctors, with efficient fluency, professionalism, and safety, while effectively collecting relevant disease diagnostic information.

[NLP-194] SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

Link: https://arxiv.org/abs/2410.03769
Authors: Tianhao Li,Jingyu Lu,Chuangxin Chu,Tianyu Zeng,Yujia Zheng,Mei Li,Haotian Huang,Bin Wu,Zuoxian Liu,Kai Ma,Xuejing Yuan,Xingkai Wang,Keyan Ding,Huajun Chen,Qiang Zhang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

[NLP-195] Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Link: https://arxiv.org/abs/2410.03768
Authors: Yohan Mathew,Ollie Matthews,Robert McCarthy,Joan Velja,Christian Schroeder de Witt,Dylan Cope,Nandi Schoots
Keywords:
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

[NLP-196] Reasoning Elicitation in Language Models via Counterfactual Feedback

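[Quick Read]: This paper addresses the weakness of language models in causal reasoning, especially counterfactual question answering. The key to the solution is a set of new evaluation metrics that balance accuracy on factual and counterfactual questions, giving a more complete assessment of reasoning ability. The paper also proposes several fine-tuning approaches intended to elicit stronger reasoning mechanisms under these metrics, and evaluates the fine-tuned models in a variety of realistic scenarios, in particular their generalization on problems that require inductive and deductive reasoning.

Link: https://arxiv.org/abs/2410.03767
Authors: Alihan Hüyük,Xinnuo Xu,Jacqueline Maasch,Aditya V. Nori,Javier González
Keywords: capabilities remain underdeveloped, remain underdeveloped, increasing effectiveness, language models, reasoning capabilities remain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: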

Abstract:Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

[NLP-197] FutureFill: Fast Generation from Convolutional Sequence Models

Link: https://arxiv.org/abs/2410.03766
Authors: Naman Agarwal,Xinyi Chen,Evan Dogariu,Vlad Feinberg,Daniel Suo,Peter Bartlett,Elad Hazan
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-198] Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

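[Quick Read]: This paper addresses the excessive memory requirements of Large Language Models (LLMs) at inference time caused by their large parameter counts. The key to the solution is to share basis vectors across layers using singular value decomposition (SVD), which achieves more effective compression. Specifically, weight matrices from different layers are decomposed into a set of shared basis vectors combined with layer-specific coefficients. Experiments show that this basis-sharing approach preserves performance under heavy compression and outperforms existing SVD-based and parameter-sharing techniques.

Link: https://arxiv.org/abs/2410.03765
Authors: Jingcun Wang,Yu-Guang Chen,Ing-Chao Lin,Bing Li,Grace Li Zhang
Keywords: Large Language Models, Language Models, achieved remarkable breakthroughs, Large Language, remarkable breakthroughs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: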

Abstract:Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs requires a significant amount of memory storage during inference, which prevents their practical deployment in many applications. To reduce memory storage of LLMs, singular value decomposition (SVD) provides a promising solution to approximate weight matrices for compressing LLMs. In this paper, we take a step further to explore parameter sharing across different layers with SVD to achieve more effective compression for LLMs. Specifically, weight matrices in different layers are decomposed and represented as a linear combination of a set of shared basis vectors and unique coefficients. The types of weight matrices and the layer selection for basis sharing are examined when compressing LLMs to maintain the performance. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: this https URL

[NLP-199] Words that Represent Peace

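[Quick Read]: This paper analyzes the vocabulary of news media to distinguish higher-peace from lower-peace countries and to identify the social processes that affect peace levels. The key to the solution is using data from the LexisNexis database to determine which news words best classify a country's level of peace. Higher-peace news mostly involves finance, daily activities, and health, whereas lower-peace news mostly involves politics, government, and legal issues. The study provides a starting point for measuring peace levels and points to the social processes behind these words.

Link: https://arxiv.org/abs/2410.03764
Authors: T. Prasad(1),L. S. Liebovitch(1),M. Wild(1),H. West(1),P. T. Coleman(1) ((1) Columbia University)
Keywords: data from LexisNexis, LexisNexis to determine, classifies countries, lower peace, characterized by themes
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures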

Abstract:We used data from LexisNexis to determine the words in news media that best classify countries as higher or lower peace. We found that higher peace news is characterized by themes of finance, daily activities, and health and that lower peace news is characterized by themes of politics, government, and legal issues. This work provides a starting point to measure levels of peace and identify the social processes that underlie those words.

[NLP-200] Getting in the Door: Streamlining Intake in Civil Legal Services with Large Language Models

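[Quick Read]: This paper addresses the time- and resource-intensive process of screening applicants for eligibility in legal aid programs. The key to the solution is a digital platform that combines large language models (LLMs) with logical rules to offer eligibility recommendations. Evaluating 8 different LLMs, the authors find that the best model reaches an F1 score of 0.82 while minimizing false negatives, which helps narrow the access-to-justice gap.

Link: https://arxiv.org/abs/2410.03762
Authors: Quinten Steenhuis,Hannes Westermann
Keywords: legal aid program, free legal aid, free legal, legal aid, aid program
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: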

Abstract:Legal intake, the process of finding out if an applicant is eligible for help from a free legal aid program, takes significant time and resources. In part this is because eligibility criteria are nuanced, open-textured, and require frequent revision as grants start and end. In this paper, we investigate the use of large language models (LLMs) to reduce this burden. We describe a digital intake platform that combines logical rules with LLMs to offer eligibility recommendations, and we evaluate the ability of 8 different LLMs to perform this task. We find promising results for this approach to help close the access to justice gap, with the best model reaching an F1 score of .82, while minimizing false negatives.
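To make the reported metric concrete, here is a small sketch of how precision, recall, and F1 follow from confusion counts, and why false negatives (eligible applicants wrongly turned away) pull recall down. The counts below are invented purely for illustration and are not the paper's data; `precision_recall_f1` is a hypothetical helper name.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 8 eligible applicants correctly accepted, 3 ineligible
# applicants wrongly accepted, 1 eligible applicant wrongly rejected.
precision, recall, f1 = precision_recall_f1(tp=8, fp=3, fn=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In an intake setting a false positive costs some staff review time, while a false negative may deny help to someone who qualifies, which is presumably why the authors emphasise minimizing false negatives alongside the F1 score.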

[NLP-201] HiReview: Hierarchical Taxonomy-Driven Automatic Literature Review Generation

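[Quick Read]: This paper addresses automatic generation of literature reviews at scale, in particular the difficulty traditional methods have in producing comprehensive, well-structured reviews. The key to the solution is HiReview, a two-stage taxonomy-then-generation framework that combines graph-based hierarchical clustering with retrieval-augmented large language models (LLMs). It first retrieves the relevant sub-community and clusters papers by textual content and citation relationships to build a hierarchical taxonomy tree. An LLM then generates coherent, contextually accurate summaries for the clusters or topics at each level, ensuring comprehensive coverage and logical organization of the literature.

Link: https://arxiv.org/abs/2410.03761
Authors: Yuntong Hu,Zhuofeng Li,Zheng Zhang,Chen Ling,Raasikh Kanjiani,Boxin Zhao,Liang Zhao
Keywords: literature review generation, automatic literature review, literature review, taxonomy-driven automatic literature, literature
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: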

Abstract:In this work, we present HiReview, a novel framework for hierarchical taxonomy-driven automatic literature review generation. With the exponential growth of academic documents, manual literature reviews have become increasingly labor-intensive and time-consuming, while traditional summarization models struggle to generate comprehensive document reviews effectively. Large language models (LLMs), with their powerful text processing capabilities, offer a potential solution; however, research on incorporating LLMs for automatic document generation remains limited. To address key challenges in large-scale automatic literature review generation (LRG), we propose a two-stage taxonomy-then-generation approach that combines graph-based hierarchical clustering with retrieval-augmented LLMs. First, we retrieve the most relevant sub-community within the citation network, then generate a hierarchical taxonomy tree by clustering papers based on both textual content and citation relationships. In the second stage, an LLM generates coherent and contextually accurate summaries for clusters or topics at each hierarchical level, ensuring comprehensive coverage and logical organization of the literature. Extensive experiments demonstrate that HiReview significantly outperforms state-of-the-art methods, achieving superior hierarchical organization, content relevance, and factual accuracy in automatic literature review generation tasks.

[NLP-202] Enhancing Retrieval in QA Systems with Derived Feature Association

Link: https://arxiv.org/abs/2410.03754
Authors: Keyush Shah,Abhishek Goyal,Isaac Wasserman
Keywords:
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

[NLP-203] Efficient Streaming LLM for Speech Recognition

Link: https://arxiv.org/abs/2410.03752
Authors: Junteng Jia,Gil Keren,Wei Zhou,Egor Lakomkin,Xiaohui Zhang,Chunyang Wu,Frank Seide,Jay Mahadeokar,Ozlem Kalinli
Keywords:
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

[NLP-204] Recent Advances in Speech Language Models: A Survey

Link: https://arxiv.org/abs/2410.03751
Authors: Wenqian Cui,Dianzhi Yu,Xiaoqi Jiao,Ziqiao Meng,Guangyan Zhang,Qichao Wang,Yiwen Guo,Irwin King
Keywords:
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Work in progress

[NLP-205] SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models EMNLP-24

Link: https://arxiv.org/abs/2410.03750
Authors: Juan Pablo Muñoz,Jinjie Yuan,Nilesh Jain
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To be published in EMNLP-24 Findings

[NLP-206] Machine Learning Classification of Peaceful Countries: A Comparative Analysis and Dataset Optimization

Link: https://arxiv.org/abs/2410.03749
Authors: K. Lian(1),L. S. Liebovitch(1),M. Wild(1),H. West(1),P. T. Coleman(1),F. Chen(2),E. Kimani(2),K. Sieck(2) ((1) Columbia University, (2) Toyota Research Institute)
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures

[NLP-207] Khattat: Enhancing Readability and Concept Representation of Semantic Typography

Link: https://arxiv.org/abs/2410.03748
Authors: Ahmed Hussein,Alaa Elsetohy,Sama Hadhoud,Tameem Bakr,Yasser Rohaim,Badr AlKhamissi
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-208] Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging EMNLP2024

Link: https://arxiv.org/abs/2410.03743
Authors: Yiming Ju,Ziyi Ni,Xingrun Xing,Zhixiong Zeng,Hanyu Zhao,Siqi Fan,Zheng Zhang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024

[NLP-209] Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Link: https://arxiv.org/abs/2410.03742
Authors: Ziyi Ye,Xiangsheng Li,Qiuchi Li,Qingyao Ai,Yujia Zhou,Wei Shen,Dong Yan,Yiqun Liu
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-210] Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model

Link: https://arxiv.org/abs/2410.03740
Authors: Aidan Gilson,Xuguang Ai,Qianqian Xie,Sahana Srinivasan,Krithi Pushpanathan,Maxwell B. Singer,Jimin Huang,Hyunjae Kim,Erping Long,Peixing Wan,Luciano V. Del Priore,Lucila Ohno-Machado,Hua Xu,Dianbo Liu,Ron A. Adelman,Yih-Chung Tham,Qingyu Chen
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-211] Grammar Induction from Visual Speech and Text

Link: https://arxiv.org/abs/2410.03739
Authors: Yu Zhao,Hao Fei,Shengqiong Wu,Meishan Zhang,Min Zhang,Tat-seng Chua
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-212] ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation

Link: https://arxiv.org/abs/2410.03738
Authors: Fillipe dos Santos Silva,Gabriel Kenzo Kakimoto,Julio Cesar dos Reis,Marcelo S. Reis
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures, published in BRACIS 2024 conference

[NLP-213] Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

Link: https://arxiv.org/abs/2410.03735
Authors: David Grangier,Simin Fan,Skyler Seto,Pierre Ablin
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-214] Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

Link: https://arxiv.org/abs/2410.03734
Authors: Tuan Nam Nguyen,Ngoc Quan Pham,Alexander Waibel
Keywords:
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted at Syndata4genAI

[NLP-215] Unsupervised Human Preference Learning EMNLP2024

Link: https://arxiv.org/abs/2410.03731
Authors: Sumuk Shashidhar,Abhinav Chinta,Vaibhav Sahai,Dilek Hakkani Tur
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Main Conference

[NLP-216] Progress Report: Towards European LLMs

Link: https://arxiv.org/abs/2410.03730
Authors: Mehdi Ali,Michael Fromm,Klaudia Thellmann,Jan Ebert,Alexander Arno Weber,Richard Rutmann,Charvi Jain,Max Lübbering,Daniel Steinigen,Johannes Leveling,Katrin Klug,Jasper Schulze Buschhoff,Lena Jurkschat,Hammam Abdelwahab,Benny Jörg Stein,Karl-Heinz Sylla,Pavel Denisov,Nicolo Brandizzi,Qasid Saleem,Bhowmick Anirban,Chelsea John,Pedro Ortiz Suarez,Malte Ostendorff,Alex Jude,Lalith Manjunath,Samuel Weinbach,Carolin Penke,Shima Asaadi,Fabio Barth,Rafet Sifa,Fabian Küch,René Jäkel,Georg Rehm,Stefan Kesselheim,Joachim Köhler,Nicolas Flores-Herr
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-217] FaithEval: Can Your Language Model Stay Faithful to Context Even If “The Moon is Made of Marshmallows”

Link: https://arxiv.org/abs/2410.03727
Authors: Yifei Ming,Senthil Purushwalkam,Shrey Pandit,Zixuan Ke,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-218] Neurosymbolic AI approach to Attribution in Large Language Models

Link: https://arxiv.org/abs/2410.03726
Authors: Deepa Tilwani,Revathy Venkataramanan,Amit P. Sheth
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Six pages, three figures, Paper under review

[NLP-219] Realtime multimodal invasive ventilation risk monitoring using language models and BoXHED

Link: https://arxiv.org/abs/2410.03725
Authors: Arash Pakbin,Aaron Su,Donald K.K. Lee,Bobak J. Mortazavi
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-220] Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation

Link: https://arxiv.org/abs/2410.03723
Authors: Tiffany Zhu,Iain Weissburg,Kexun Zhang,William Yang Wang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 2 figures

[NLP-221] Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Link: https://arxiv.org/abs/2410.03721
Authors: Andrew Katz,Gabriella Coloyan Fleming,Joyce Main
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

[NLP-222] FluentEditor: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency

Link: https://arxiv.org/abs/2410.03719
Authors: Rui Liu,Jiatian Xi,Ziyue Jiang,Haizhou Li
Keywords:
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Work in progress

[NLP-223] Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

Link: https://arxiv.org/abs/2410.03718
Authors: Sagar Tamang,Dibya Jyoti Bora
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-224] Revisiting the Superficial Alignment Hypothesis

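[Quick Read]: This paper tests the Superficial Alignment Hypothesis, which holds that a language model learns almost all of its abilities and knowledge during pre-training and that post-training only adjusts style and format. The authors study how post-training scales as the number of finetuning examples grows, evaluating on objective, task-specific standardized benchmarks. The key finding is that task performance follows a power law in the number of finetuning examples, similar to pre-training scaling laws. For tasks such as mathematical and multihop reasoning, a handful of examples only aligns the model stylistically, and further gains require more examples. Performance is correlated with reasoning ability, and post-training markedly improves the model's ability to integrate new knowledge on downstream tasks. These findings challenge the hypothesis and suggest that it is an over-simplification.

Link: https://arxiv.org/abs/2410.03717
Authors: Mohit Raghavendra,Vaskar Nath,Sean Hendryx
Keywords: Superficial Alignment Hypothesis, Alignment Hypothesis posits, style and format, Superficial Alignment, Alignment Hypothesis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: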

Abstract:The Superficial Alignment Hypothesis posits that almost all of a language model’s abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model’s ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
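A power-law relationship of the kind described here is usually checked by fitting a straight line in log-log space. The sketch below does that with ordinary least squares on synthetic data. Modelling benchmark error as `a * n**b` is an assumption made for this example (the abstract does not say which quantity is fitted), the data points are generated rather than measured, and `fit_power_law` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ns: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * n**b by least squares on (log n, log y); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    # Slope of the log-log regression line is the power-law exponent b.
    covariance = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    # Intercept of the line is log(a).
    a = math.exp(mean_l - b * mean_x)
    return a, b


# Synthetic "error vs. number of finetuning examples" points, generated from
# a known power law so the fit can be checked against the true parameters.
num_examples = [100, 1_000, 10_000, 100_000]
errors = [2.0 * n ** -0.25 for n in num_examples]

a, b = fit_power_law(num_examples, errors)
print(f"a={a:.3f}, b={b:.3f}")
```

A negative exponent `b` means error keeps falling as examples are added. If post-training only adjusted style, one would instead expect the curve to flatten after a handful of examples.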

[NLP-225] Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering

Link: https://arxiv.org/abs/2210.02627
Authors: Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, Suranga Nanayakkara
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This paper is awaiting publication at Transactions of the Association for Computational Linguistics. This is a pre-MIT Press publication version. For associated huggingface transformers code, see this https URL

[NLP-226] Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering

Link: https://arxiv.org/abs/2106.11517
Authors: Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Suranga Nanayakkara
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: For associated code, see this https URL

[NLP-227] BrainCodec: Neural fMRI codec for the decoding of cognitive brain states

Link: https://arxiv.org/abs/2410.04383
Authors: Yuto Nishimura, Masataka Sawayama, Ayumu Yamashita, Hideki Nakayama, Kaoru Amano
Keywords:
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments:

[NLP-228] Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition INTERSPEECH2020

Link: https://arxiv.org/abs/2008.06682
Authors: Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to INTERSPEECH 2020

Artificial Intelligence

[AI-0] Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.05269
Authors: Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Keywords: Data Advisor, Data, large language model, crucial element, element in large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference. Project website: this https URL

Abstract:Data is a crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.

[AI-1] TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

Link: https://arxiv.org/abs/2410.05261
Authors: Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu
Keywords: Reading dense text, Large Vision-Language Models, Reading dense, abilities for Large, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) tasked with advanced jobs. Previous LVLMs, including superior proprietary models like GPT-4o, have struggled to excel in both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 significantly reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual Encoder Reinforcement: We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks like Chinese OCR and grounding. (3) Data Diversity: We maintain a comparable scale of 100 million samples while diversifying the sources of pre-training data. We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, such as achieving 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.

[AI-2] GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Link: https://arxiv.org/abs/2410.05254
Authors: Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
Keywords: Large Language Models, Large Language, Language Models, show significant potential, show significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents’ performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

[AI-3] Causal Micro-Narratives EMNLP2024

Link: https://arxiv.org/abs/2410.05252
Authors: Mourad Heddaya, Qingcheng Zeng, Chenhao Tan, Rob Voigt, Alexander Zentefis
Keywords: classify causal micro-narratives, micro-narratives from text, classify causal, Abstract, causal micro-narratives
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Workshop on Narrative Understanding

Abstract:We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model–a fine-tuned Llama 3.1 8B–achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.

[AI-4] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Link: https://arxiv.org/abs/2410.05248
Authors: Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao
Keywords: induce desired behaviors, stage typically trains, large language models, instruction-tuning stage typically, typically trains LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets’ intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix’s design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.
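The abstract mentions a Mixup-based regularization between examples with different confidence levels. The sketch below shows only the generic Mixup interpolation (a convex combination of two vectors with a Beta-distributed weight), not the SFTMix recipe itself; the function name `mixup`, the toy vectors, and `alpha=0.5` are assumptions made for illustration.

```python
import random

def mixup(confident, unconfident, lam):
    """Generic Mixup: convex combination of two equal-length vectors."""
    return [lam * c + (1.0 - lam) * u for c, u in zip(confident, unconfident)]

# Toy stand-ins for a confident and an unconfident example's representation.
confident_vec = [1.0, 0.0]
unconfident_vec = [0.0, 1.0]
mixed = mixup(confident_vec, unconfident_vec, lam=0.25)

# In Mixup, the weight is typically drawn from Beta(alpha, alpha).
rng = random.Random(0)
sampled_lam = rng.betavariate(0.5, 0.5)  # alpha = 0.5 is an illustrative choice
```

The same weight is normally applied to the training targets as well, so the interpolated input is paired with an interpolated label.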

[AI-5] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Link: https://arxiv.org/abs/2410.05243
Authors: Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
Keywords: Multimodal large language, graphical user interface, Multimodal large, GUI agents, GUI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.

[AI-6] CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures

Link: https://arxiv.org/abs/2410.05235
Authors: Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
Keywords: Explaining Artificial Intelligence, Explaining Artificial, Artificial Intelligence, major challenge nowadays, medicine and law
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Explaining Artificial Intelligence (AI) decisions is a major challenge nowadays in AI, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is a main issue also for human-based deliberation as it is important to justify why a certain decision has been taken. Resident medical doctors for instance are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached a certain conclusion. Developing new tools to aid residents to train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction, and we present, to the best of our knowledge, the first multilingual dataset for Medical Question Answering where correct and incorrect diagnoses for a clinical case are enriched with a natural language explanation written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), resulting in the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases in four languages (English, Spanish, French, Italian) with explanations, where we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform over this challenging dataset for the argument mining task.

[AI-7] SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning

Link: https://arxiv.org/abs/2410.05233
Authors: Taha Bouhsine, Imad El Aaroussi, Atik Faysal, Wang Huaxia
Keywords: anchor-free contrastive learning, proposed Similarity-Orthogonality, fine-grained contrastive learning, contrastive learning, anchor-free contrastive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a novel anchor-free contrastive learning (AFCL) method leveraging our proposed Similarity-Orthogonality (SimO) loss. Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives: reducing the distance and orthogonality between embeddings of similar inputs while maximizing these metrics for dissimilar inputs, facilitating more fine-grained contrastive learning. The AFCL method, powered by SimO loss, creates a fiber bundle topological structure in the embedding space, forming class-specific, internally cohesive yet orthogonal neighborhoods. We validate the efficacy of our method on the CIFAR-10 dataset, providing visualizations that demonstrate the impact of SimO loss on the embedding space. Our results illustrate the formation of distinct, orthogonal class neighborhoods, showcasing the method’s ability to create well-structured embeddings that balance class separation with intra-class variability. This work opens new avenues for understanding and leveraging the geometric properties of learned representations in various machine learning tasks.

[AI-8] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Link: https://arxiv.org/abs/2410.05229
Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
Keywords: Large Language Models, Large Language, advancements in Large, Language Models, formal reasoning capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn’t contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs’ capabilities and limitations in mathematical reasoning.
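The idea of generating many instances of one question from a symbolic template can be sketched as follows. This is a hypothetical template written for illustration, not one taken from the GSM-Symbolic benchmark; the names `TEMPLATE` and `instantiate` and the value ranges are assumptions.

```python
import random

# Hypothetical grade-school template; placeholders are filled with sampled values.
TEMPLATE = (
    "{name} has {x} apples and buys {y} bags with {z} apples each. "
    "How many apples does {name} have now?"
)

def instantiate(rng):
    """Sample one concrete question and compute its ground-truth answer."""
    values = {
        "name": rng.choice(["Ava", "Liam", "Noor"]),
        "x": rng.randint(2, 20),
        "y": rng.randint(2, 9),
        "z": rng.randint(2, 12),
    }
    question = TEMPLATE.format(**values)
    answer = values["x"] + values["y"] * values["z"]
    return question, answer, values

rng = random.Random(0)  # fixed seed so the generated set is reproducible
instances = [instantiate(rng) for _ in range(3)]
```

Because the answer is computed from the same sampled values that fill the template, every instance carries a correct label, which makes it possible to compare a model's accuracy across instantiations of the same underlying problem.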

[AI-9] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality EMNLP2024

Link: https://arxiv.org/abs/2410.05210
Authors: Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim
Keywords: enhance compositional understanding, method to enhance, understanding in pre-trained, pre-trained vision, vision and language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024 (Long, Main). Project page: this https URL

Abstract:In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model’s multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model’s representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: this https URL.

[AI-10] Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Link: https://arxiv.org/abs/2410.05203
Authors: Ge Ya (Olga) Luo, Gian Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
Keywords: Fréchet Video Distance, generation distribution quality, Fréchet Video, evaluating video generation, video generation distribution
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD’s reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
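The abstract says JEDi measures distance with Maximum Mean Discrepancy (MMD) under a polynomial kernel. A minimal sketch of a biased squared-MMD estimate follows; the kernel parameters (`degree`, `gamma`, `coef0`), the choice of the biased estimator, and the toy 2-D vectors are assumptions, since the abstract does not specify them, and real inputs would be JEPA features rather than hand-written lists.

```python
def poly_kernel(a, b, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel k(a, b) = (gamma * <a, b> + coef0) ** degree."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return (gamma * dot + coef0) ** degree

def mmd2_biased(xs, ys, kernel=poly_kernel):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Toy feature sets standing in for embeddings of real and generated videos.
real = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
same = [list(v) for v in real]
shifted = [[v[0] + 1.0, v[1]] for v in real]

identical_score = mmd2_biased(real, same)
shifted_score = mmd2_biased(real, shifted)
```

Unlike a Fréchet distance, this estimate does not assume the features are Gaussian, which relates to the first limitation of FVD listed in the abstract.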

[AI-11] LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation

Link: https://arxiv.org/abs/2410.05191
Authors: Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma
Keywords: Large Language Models, Vision Language Models, VLA models, advancements of Large, Large Language
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Abstract:Building on the advancements of Large Language Models (LLMs) and Vision Language Models (VLMs), recent research has introduced Vision-Language-Action (VLA) models as an integrated solution for robotic manipulation tasks. These models take camera images and natural language task instructions as input and directly generate control actions for robots to perform specified tasks, greatly improving both decision-making capabilities and interaction with human users. However, the data-driven nature of VLA models, combined with their lack of interpretability, makes the assurance of their effectiveness and robustness a challenging task. This highlights the need for a reliable testing and evaluation platform. For this purpose, in this work, we propose LADEV, a comprehensive and efficient platform specifically designed for evaluating VLA models. We first present a language-driven approach that automatically generates simulation environments from natural language inputs, mitigating the need for manual adjustments and significantly improving testing efficiency. Then, to further assess the influence of language input on the VLA models, we implement a paraphrase mechanism that produces diverse natural language task instructions for testing. Finally, to expedite the evaluation process, we introduce a batch-style method for conducting large-scale testing of VLA models. Using LADEV, we conducted experiments on several state-of-the-art VLA models, demonstrating its effectiveness as a tool for evaluating these models. Our results showed that LADEV not only enhances testing efficiency but also establishes a solid baseline for evaluating VLA models, paving the way for the development of more intelligent and advanced robotic systems.

[AI-12] Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics EMNLP2024

Link: https://arxiv.org/abs/2410.05183
Authors: Stefano Perrella, Lorenzo Proietti, Pere-Lluís Huguet Cabot, Edoardo Barba, Roberto Navigli
Keywords: Machine Translation, translation quality automatically, assess translation quality, metrics assess translation, metrics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024 Main Conference. 26 pages

Abstract:Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics return assessments as scalar scores that are difficult to interpret, posing a challenge to making informed design choices. Moreover, MT metrics’ capabilities have historically been evaluated using correlation with human judgment, which, despite its efficacy, falls short of providing intuitive insights into metric performance, especially in terms of new metric use cases. To address these issues, we introduce an interpretable evaluation framework for MT metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. Furthermore, by measuring the performance of MT metrics using Precision, Recall, and F-score, we offer clearer insights into their capabilities than correlation with human judgments. Finally, we raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines, reporting a notably low agreement with Multidimensional Quality Metrics (MQM) annotations.
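Evaluating a metric by Precision, Recall, and F-score in a data-filtering setting can be sketched as below. This is an assumed setup for illustration: a translation is kept when its metric score reaches a threshold, and human labels mark which translations are actually good. The function name `filter_prf`, the threshold, and the toy data are hypothetical, and the paper's actual protocol may differ.

```python
def filter_prf(metric_scores, is_good, threshold):
    """Precision, recall, and F1 of 'keep if score >= threshold' against human labels."""
    kept = [score >= threshold for score in metric_scores]
    tp = sum(1 for k, g in zip(kept, is_good) if k and g)
    fp = sum(1 for k, g in zip(kept, is_good) if k and not g)
    fn = sum(1 for k, g in zip(kept, is_good) if not k and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy metric scores and human judgments (made up for illustration).
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
labels = [True, True, False, False, True]
precision, recall, f1 = filter_prf(scores, labels, threshold=0.5)
```

Numbers of this kind state directly how many bad translations pass the filter and how many good ones are lost, which a correlation coefficient does not.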

[AI-13] MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain ECCV2024

Link: https://arxiv.org/abs/2410.05182
Authors: Timothy Chase Jr, Karthik Dantu
Keywords: celestial objects, tracking of surface, surface terrain, terrain is required, required for spacecraft
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ECCV 2024. Project page available at this https URL

Abstract:The visual detection and tracking of surface terrain is required for spacecraft to safely land on or navigate within close proximity to celestial objects. Current approaches rely on template matching with pre-gathered patch-based features, which are expensive to obtain and a limiting factor in perceptual capability. While recent literature has focused on in-situ detection methods to enhance navigation and operational autonomy, robust description is still needed. In this work, we explore metric learning as the lightweight feature description mechanism and find that current solutions fail to address inter-class similarity and multi-view observational geometry. We attribute this to the view-unaware attention mechanism and introduce Multi-view Attention Regularizations (MARs) to constrain the channel and spatial attention across multiple feature views, regularizing the what and where of attention focus. We thoroughly analyze many modern metric learning losses with and without MARs and demonstrate improved terrain-feature recognition performance by upwards of 85%. We additionally introduce the Luna-1 dataset, consisting of Moon crater landmarks and reference navigation frames from NASA mission data to support future research in this difficult task. Luna-1 and source code are publicly available at this https URL.

[AI-14] Presto! Distilling Steps and Layers for Accelerating Music Generation

Link: https://arxiv.org/abs/2410.05167
Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
Keywords: high-quality generation remains, advances in diffusion-based, remains a challenge, generation remains, layer distillation methods
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) – the fastest high-quality TTM to our knowledge. Sound examples can be found at this https URL.

[AI-15] VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Link: https://arxiv.org/abs/2410.05160
Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
Keywords: Embedding models, multimodal embedding models, multimodal embedding, semantic similarity, Embedding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report

Abstract:Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model - Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on Phi-3.5-V and evaluate them on MMEB’s evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.

[AI-16] CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Link: https://arxiv.org/abs/2410.05146
Authors: Rui Zhao, Jinyu Li, Ruchao Fan, Matt Post
Keywords: achieve high accuracy, Connectionist Temporal Classification, target language, achieve high, low latency
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Spoken Language Technology Workshop (SLT 2024)

Abstract:Models for streaming speech translation (ST) can achieve high accuracy and low latency if they’re developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this paper, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched source-target language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively, while also boosting decoding speed by 59.7% on GPU.
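The CTC-based compression step described above can be illustrated with a minimal sketch. It assumes the usual CTC collapse rule (merge consecutive frames that share a label, then drop blanks) and, as a further assumption not stated in the abstract, averages the embeddings of each merged segment. The names `ctc_compress` and `BLANK` are hypothetical.

```python
from itertools import groupby

BLANK = 0  # hypothetical id of the CTC blank label


def ctc_compress(frame_labels, frame_embeddings, blank=BLANK):
    """Merge consecutive frames with the same CTC label, then drop blank segments.

    Returns a list of (label, mean_embedding) pairs, one per surviving segment.
    Averaging within a segment is an illustrative assumption.
    """
    compressed = []
    index = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        segment = frame_embeddings[index:index + length]
        index += length
        if label == blank:
            continue  # blank frames carry no token and are discarded
        dim = len(segment[0])
        mean = [sum(vec[d] for vec in segment) / length for d in range(dim)]
        compressed.append((label, mean))
    return compressed


# Nine speech frames collapse to two token-level embeddings.
labels = [0, 7, 7, 0, 0, 3, 3, 3, 0]
embeddings = [[float(i), 1.0] for i in range(len(labels))]
result = ctc_compress(labels, embeddings)
```

The compressed sequence has one entry per predicted token, which is what makes its length comparable to that of a text sequence.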

[AI-17] Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents

Link: https://arxiv.org/abs/2410.05130
Authors: Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu
Keywords: Large Language Models, Large Language, Recent research, tackling complex graph, Language Models
Subjects: Artificial Intelligence (cs.AI)

Abstract:Recent research has explored the use of Large Language Models (LLMs) for tackling complex graph reasoning tasks. However, due to the intricacies of graph structures and the inherent limitations of LLMs in handling long text, current approaches often fail to deliver satisfactory accuracy, even on small-scale graphs and simple tasks. To address these challenges, we introduce GraphAgent-Reasoner, a fine-tuning-free framework that utilizes a multi-agent collaboration strategy for explicit and precise graph reasoning. Inspired by distributed graph computation theory, our framework decomposes graph problems into smaller, node-centric tasks that are distributed among multiple agents. The agents collaborate to solve the overall problem, significantly reducing the amount of information and complexity handled by a single LLM, thus enhancing the accuracy of graph reasoning. By simply increasing the number of agents, GraphAgent-Reasoner can efficiently scale to accommodate larger graphs with over 1,000 nodes. Evaluated on the GraphInstruct dataset, our framework demonstrates near-perfect accuracy on polynomial-time graph reasoning tasks, significantly outperforming the best available models, both closed-source and fine-tuned open-source variants. Our framework also demonstrates the capability to handle real-world graph reasoning applications such as webpage importance analysis.

[AI-18] Last Iterate Convergence in Monotone Mean Field Games

Link: https://arxiv.org/abs/2410.05127
Authors: Noboru Isobe, Kenshi Abe, Kaito Ariu
Keywords: Field Game, number of agents, subject of interest, framework utilized, utilized to model
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: Under review, 25 pages, 2 figures

Abstract:Mean Field Game (MFG) is a framework utilized to model and approximate the behavior of a large number of agents, and the computation of equilibria in MFG has been a subject of interest. Despite the proposal of methods to approximate the equilibria, algorithms where the sequence of updated policy converges to equilibrium, specifically those exhibiting last-iterate convergence, have been limited. We propose the use of a simple, proximal-point-type algorithm to compute equilibria for MFGs. Subsequently, we provide the first last-iterate convergence guarantee under the Lasry–Lions-type monotonicity condition. We further employ the Mirror Descent algorithm for the regularized MFG to efficiently approximate the update rules of the proximal point method for MFGs. We demonstrate that the algorithm can approximate with an accuracy of $\varepsilon$ after $\mathcal{O}(\log(1/\varepsilon))$ iterations. This research offers a tractable approach for large-scale and large-population games.

[AI-19] Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Link: https://arxiv.org/abs/2410.05116
Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
Keywords: Stable Diffusion, Controllable generation, improve fidelity, aims to improve, human feedback
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD’s refined initialization samples, enabling faster convergence towards the evaluator’s intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.

[AI-20] Synthetic Generation of Dermatoscopic Images with GAN and Closed-Form Factorization

Link: https://arxiv.org/abs/2410.05114
Authors: Rohan Reddy Mekala, Frederik Pahde, Simon Baur, Sneha Chandrashekar, Madeline Diep, Markus Wenzel, Eric L. Wisotzky, Galip Ümit Yolcu, Sebastian Lapuschkin, Jackie Ma, Peter Eisert, Mikael Lindvall, Adam Porter, Wojciech Samek
Keywords: Generative Adversarial Network, high-quality annotated datasets, machine learning models, harnesses Generative Adversarial, microscopic skin lesion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This preprint has been submitted to the Workshop on Synthetic Data for Computer Vision (SyntheticData4CV 2024, a side event of the 18th European Conference on Computer Vision 2024). This preprint has not undergone peer review or any post-submission improvements or corrections.

Abstract:In the realm of dermatological diagnoses, where the analysis of dermatoscopic and microscopic skin lesion images is pivotal for the accurate and early detection of various medical conditions, the costs associated with creating diverse and high-quality annotated datasets have hampered the accuracy and generalizability of machine learning models. We propose an innovative unsupervised augmentation solution that harnesses Generative Adversarial Network (GAN) based models and associated techniques over their latent space to generate controlled semiautomatically-discovered semantic variations in dermatoscopic images. We created synthetic images to incorporate the semantic variations and augmented the training data with these images. With this approach, we were able to increase the performance of machine learning models and set a new benchmark amongst non-ensemble based models in skin lesion classification on the HAM10000 dataset; and used the observed analytics and generated models for detailed studies on model explainability, affirming the effectiveness of our solution.

[AI-21] AI-Enhanced Ethical Hacking: A Linux-Focused Experiment

Link: https://arxiv.org/abs/2410.05105
Authors: Haitham S. Al-Sinani, Chris J. Mitchell
Keywords: comprehensive experimental study, technical report investigates, specifically ChatGPT, conceptual analysis, study evaluates GenAI
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:This technical report investigates the integration of generative AI (GenAI), specifically ChatGPT, into the practice of ethical hacking through a comprehensive experimental study and conceptual analysis. Conducted in a controlled virtual environment, the study evaluates GenAI’s effectiveness across the key stages of penetration testing on Linux-based target machines operating within a virtual local area network (LAN), including reconnaissance, scanning and enumeration, gaining access, maintaining access, and covering tracks. The findings confirm that GenAI can significantly enhance and streamline the ethical hacking process while underscoring the importance of balanced human-AI collaboration rather than the complete replacement of human input. The report also critically examines potential risks such as misuse, data biases, hallucination, and over-reliance on AI. This research contributes to the ongoing discussion on the ethical use of AI in cybersecurity and highlights the need for continued innovation to strengthen security defences.

[AI-22] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Link: https://arxiv.org/abs/2410.05102
Authors: Fenia Christopoulou, Ronald Cardenas, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang
Keywords: Direct Preference Optimization, Preference Optimization objective, Preference Optimization, aligning language models, offline Direct Preference
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 9 figures, 5 tables. Under Review

Abstract:Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.

[AI-23] On the Structure of Game Provenance and its Applications

Link: https://arxiv.org/abs/2410.05094
Authors: Shawn Bowers, Yilin Xia, Bertram Ludäscher
Keywords: studied for positive, recursive queries, Provenance, queries, text
Subjects: Artificial Intelligence (cs.AI)

Abstract:Provenance in databases has been thoroughly studied for positive and for recursive queries, then for first-order (FO) queries, i.e., having negation but no recursion. Query evaluation can be understood as a two-player game where the opponents argue whether or not a tuple is in the query answer. This game-theoretic approach yields a natural provenance model for FO queries, unifying how and why-not provenance. Here, we study the fine-grain structure of game provenance. A game $G=(V,E)$ consists of positions $V$ and moves $E$ and can be solved by computing the well-founded model of a single, unstratifiable rule: $$\text{win}(X) \leftarrow \text{move}(X, Y), \neg\, \text{win}(Y).$$ In the solved game $G^\lambda$, the value of a position $x \in V$ is either won, lost, or drawn. This value is explained by the provenance $\mathscr{P}(x)$, i.e., certain (annotated) edges reachable from $x$. We identify seven edge types that give rise to new kinds of provenance, i.e., potential, actual, and primary, and demonstrate that “not all moves are created equal”. We describe the new provenance types, show how they can be computed while solving games, and discuss applications, e.g., for abstract argumentation frameworks.
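To make the win-move rule concrete, here is a minimal sketch of solving a game by fixpoint iteration. It is an illustration of the standard reading of the rule (a position is won if some move leads to a lost position, lost if every move leads to a won position, and drawn otherwise), not the authors' implementation, and it does not compute the provenance or edge types discussed in the abstract. The function name `solve_game` is hypothetical.

```python
def solve_game(positions, moves):
    """Label each position 'won', 'lost', or 'drawn'.

    positions: iterable of hashable position ids (V)
    moves: iterable of (source, target) pairs (E)
    """
    positions = list(positions)
    successors = {p: [] for p in positions}
    for src, dst in moves:
        successors[src].append(dst)

    value = {}
    changed = True
    while changed:
        changed = False
        for p in positions:
            if p in value:
                continue
            succ_values = [value.get(q) for q in successors[p]]
            if any(v == "lost" for v in succ_values):
                value[p] = "won"   # some move reaches a lost position
                changed = True
            elif all(v == "won" for v in succ_values):
                value[p] = "lost"  # every move reaches a won position (or no moves exist)
                changed = True
    # Positions never decided by the fixpoint are drawn.
    return {p: value.get(p, "drawn") for p in positions}


solution = solve_game(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e"), ("e", "d")],
)
```

In this example `c` has no moves and is lost, so `b` is won and `a` is lost, while `d` and `e` only move to each other and stay drawn.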

[AI-24] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Link: https://arxiv.org/abs/2410.05080
Authors: Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun
Keywords: piqued growing interest, automate scientific discovery, developing LLM-based language, scientific discovery, piqued growing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 55 pages

Abstract:The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

[AI-25] Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data

Link: https://arxiv.org/abs/2410.05078
Authors: David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein
Keywords: recently been shown, strong data compressors, compression, parameter count, compression algorithms
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:Foundation models have recently been shown to be strong data compressors. However, when accounting for their excessive parameter count, their compression ratios are actually inferior to standard compression algorithms. Moreover, naively reducing the number of parameters may not necessarily help as it leads to worse predictions and thus weaker compression. In this paper, we conduct a large-scale empirical study to investigate whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible. To this end, we train families of models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG 2000, FLAC) - even when factoring in parameter count. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). To study the impact of model- and dataset scale, we conduct extensive ablations and hyperparameter sweeps, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but, in contrast to previously reported results with large-scale foundation models, transfer to unseen modalities is generally weak.
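The point about "factoring in parameter count" can be illustrated with a small calculation. The sketch below assumes the adjusted ratio is (compressed size + model size) / raw size, with lower being better. All byte counts are made-up illustrative numbers, not results from the paper, and `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(raw_bytes, compressed_bytes, model_bytes=0):
    """Compression ratio (lower is better), optionally charging for the model's size."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return (compressed_bytes + model_bytes) / raw_bytes


RAW = 1_000_000_000  # 1 GB of data to compress (illustrative)

# Hypothetical small model: 20M parameters stored at 2 bytes each.
small = compression_ratio(RAW, compressed_bytes=400_000_000, model_bytes=20_000_000 * 2)

# Hypothetical large model: compresses better, but 1B parameters at 2 bytes each.
large = compression_ratio(RAW, compressed_bytes=100_000_000, model_bytes=1_000_000_000 * 2)

print(f"small model adjusted ratio: {small:.2f}")
print(f"large model adjusted ratio: {large:.2f}")
```

Under these assumed numbers the large model's adjusted ratio exceeds 1, meaning model plus output is bigger than the raw data, while the small model stays well below 1.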

[AI-26] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Link: https://arxiv.org/abs/2410.05076
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
Keywords: Large language models, long-context models gaining, models gaining prominence, handling extended inputs, driven significant advancements
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
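The idea of position persistent sparse attention can be sketched in a few lines. This is a simplified illustration, not the TidalDecode system: it tracks only which token positions each layer attends to during one decoding step. The assumptions are that a selection layer scores every cached token and keeps the `top_k` highest, that later layers reuse those positions unchanged, and that layers before the first selection layer use all tokens. All names are hypothetical.

```python
import math


def attention_scores(query, keys):
    """Softmax-normalized dot-product scores of `query` against every key."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attended_positions(queries, keys_per_layer, selection_layers, top_k):
    """Return, for each layer, the token positions attended to in one decoding step."""
    n_tokens = len(keys_per_layer[0])
    all_positions = list(range(n_tokens))
    selected = all_positions  # assumption: full attention until the first selection layer
    attended = []
    for layer, (query, keys) in enumerate(zip(queries, keys_per_layer)):
        if layer in selection_layers:
            # Selection layer: full attention, then keep the top-k scoring positions.
            scores = attention_scores(query, keys)
            ranked = sorted(all_positions, key=lambda p: scores[p], reverse=True)
            selected = sorted(ranked[:top_k])
            attended.append(all_positions)
        else:
            # Sparse layer: reuse the positions chosen by the latest selection layer.
            attended.append(selected)
    return attended


queries = [[1.0, 0.0]] * 3
layer_keys = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
plan = attended_positions(queries, [layer_keys] * 3, selection_layers={0}, top_k=2)
```

Only the selection layer pays the cost of scoring every token; the two following layers attend to the same two positions without any further selection work.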

[AI-27] FreSh: Frequency Shifting for Accelerated Neural Representation Learning

Link: https://arxiv.org/abs/2410.05050
Authors: Adam Kania, Marko Mihajlovic, Sergey Prokudin, Jacek Tabor, Przemysław Spurek
Keywords: recently gained attention, Implicit Neural Representations, Implicit Neural, continuously representing signals, shapes using multilayer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Implicit Neural Representations (INRs) have recently gained attention as a powerful approach for continuously representing signals such as images, videos, and 3D shapes using multilayer perceptrons (MLPs). However, MLPs are known to exhibit a low-frequency bias, limiting their ability to capture high-frequency details accurately. This limitation is typically addressed by incorporating high-frequency input embeddings or specialized activation layers. In this work, we demonstrate that these embeddings and activations are often configured with hyperparameters that perform well on average but are suboptimal for specific input signals under consideration, necessitating a costly grid search to identify optimal settings. Our key observation is that the initial frequency spectrum of an untrained model’s output correlates strongly with the model’s eventual performance on a given target signal. Leveraging this insight, we propose frequency shifting (or FreSh), a method that selects embedding hyperparameters to align the frequency spectrum of the model’s initial output with that of the target signal. We show that this simple initialization technique improves performance across various neural representation methods and tasks, achieving results comparable to extensive hyperparameter sweeps but with only marginal computational overhead compared to training a single model with default hyperparameters.

[AI-28] Named Clinical Entity Recognition Benchmark

Link: https://arxiv.org/abs/2410.05046
Authors: Wadood M Abdul, Marco AF Pimentel, Muhammad Umar Salman, Tathagata Raha, Clément Christophe, Praveen K Kanithi, Nasir Hayat, Ronnie Rajan, Shadab Khan
Keywords: trial cohort identification, extracting structured information, Named Clinical Entity, natural language processing, Entity Recognition Benchmark
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report

Abstract:This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare, addressing the crucial natural language processing (NLP) task of extracting structured information from clinical narratives to support applications like automated coding, clinical trial cohort identification, and clinical decision support. The leaderboard provides a standardized platform for assessing diverse language models, including encoder and decoder architectures, on their ability to identify and classify clinical entities across multiple medical domains. A curated collection of openly available clinical datasets is utilized, encompassing entities such as diseases, symptoms, medications, procedures, and laboratory measurements. Importantly, these entities are standardized according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, ensuring consistency and interoperability across different healthcare systems and datasets, and a comprehensive evaluation of model performance. Performance of models is primarily assessed using the F1-score, and it is complemented by various assessment modes to provide comprehensive insights into model performance. The report also includes a brief analysis of models evaluated to date, highlighting observed trends and limitations. By establishing this benchmarking framework, the leaderboard aims to promote transparency, facilitate comparative analyses, and drive innovation in clinical entity recognition tasks, addressing the need for robust evaluation methods in healthcare NLP.

[AI-29] Can LLMs plan paths with extra hints from solvers?

Link: https://arxiv.org/abs/2410.05045
Authors: Erik Wu, Sayan Mitra
Keywords: Large Language Models, natural language processing, Large Language, Language Models, shown remarkable capabilities
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language processing, mathematical problem solving, and tasks related to program synthesis. However, their effectiveness in long-term planning and higher-order reasoning has been noted to be limited and fragile. This paper explores an approach for enhancing LLM performance in solving a classical robotic planning task by integrating solver-generated feedback. We explore four different strategies for providing feedback, including visual feedback, we utilize fine-tuning, and we evaluate the performance of three different LLMs across 10 standard and 100 additional randomly generated planning problems. Our results suggest that the solver-generated feedback improves the LLM’s ability to solve the moderately difficult problems, but the harder problems still remain out of reach. The study provides detailed analysis of the effects of the different hinting strategies and the different planning tendencies of the evaluated LLMs.

[AI-30] PhotoReg: Photometrically Registering 3D Gaussian Splatting Models

Link: https://arxiv.org/abs/2410.05044
Authors: Ziwen Yuan, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Keywords: Building accurate representations, Building accurate, decisions during deployment, accurate representations, make decisions
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Building accurate representations of the environment is critical for intelligent robots to make decisions during deployment. Advances in photorealistic environment models have enabled robots to develop hyper-realistic reconstructions, which can be used to generate images that are intuitive for human inspection. In particular, the recently introduced 3D Gaussian Splatting (3DGS), which describes the scene with up to millions of primitive ellipsoids, can be rendered in real time. 3DGS has rapidly gained prominence. However, a critical unsolved problem persists: how can we fuse multiple 3DGS into a single coherent model? Solving this problem will enable robot teams to jointly build 3DGS models of their surroundings. A key insight of this work is to leverage the duality between photorealistic reconstructions, which render realistic 2D images from 3D structure, and 3D foundation models, which predict 3D structure from image pairs. To this end, we develop PhotoReg, a framework to register multiple photorealistic 3DGS models with 3D foundation models. As 3DGS models are generally built from monocular camera images, they have arbitrary scale. To resolve this, PhotoReg actively enforces scale consistency among the different 3DGS models by considering depth estimates within these models. Then, the alignment is iteratively refined with fine-grained photometric losses to produce high-quality fused 3DGS models. We rigorously evaluate PhotoReg on both standard benchmark datasets and our custom-collected datasets, including with two quadruped robots. The code is released at this http URL.

[AI-31] Stage-Wise and Prior-Aware Neural Speech Phase Prediction

Link: https://arxiv.org/abs/2410.04990
Authors: Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling
Keywords: Prior-aware Neural Speech, Stage-wise and Prior-aware, Speech Phase Prediction, Neural Speech Phase, input amplitude spectrum
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by SLT2024

Abstract:This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from input amplitude spectrum by two-stage neural networks. In the initial prior-construction stage, we preliminarily predict a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined high-quality phase spectrum conditioned on the prior phase. Networks in both stages use ConvNeXt v2 blocks as the backbone and adopt adversarial training by innovatively introducing a phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared to neural network-based no-prior phase prediction methods, the proposed SP-NSPP achieves higher phase prediction accuracy, thanks to introducing the coarse phase priors and diverse training criteria. Compared to iterative phase estimation algorithms, our proposed SP-NSPP does not require multiple rounds of staged iterations, resulting in higher generation efficiency.

[AI-32] 6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering

Link: https://arxiv.org/abs/2410.04974
Authors: Zhongpai Gao, Benjamin Planche, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Keywords: Gaussian splatting, view synthesis, synthesis has advanced, development of neural, neural radiance fields
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Demo Video: this https URL

Abstract:Novel view synthesis has advanced significantly with the development of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS). However, achieving high quality without compromising real-time rendering remains challenging, particularly for physically-based ray tracing with view-dependent effects. Recently, N-dimensional Gaussians (N-DG) introduced a 6D spatial-angular representation to better incorporate view-dependent effects, but the Gaussian representation and control scheme are sub-optimal. In this paper, we revisit 6D Gaussians and introduce 6D Gaussian Splatting (6DGS), which enhances color and opacity representations and leverages the additional directional information in the 6D space for optimized Gaussian control. Our approach is fully compatible with the 3DGS framework and significantly improves real-time radiance field rendering by better modeling view-dependent effects and fine details. Experiments demonstrate that 6DGS significantly outperforms 3DGS and N-DG, achieving up to a 15.73 dB improvement in PSNR with a reduction of 66.5% Gaussian points compared to 3DGS.

[AI-33] Collaboration! Towards Robust Neural Methods for Routing Problems NEURIPS2024

Link: https://arxiv.org/abs/2410.04968
Authors: Jianan Zhou, Yaoxin Wu, Zhiguang Cao, Wen Song, Jie Zhang, Zhiqi Shen
Keywords: vehicle routing problems, enjoying desirable efficiency, performance significantly deteriorates, neural VRP methods, severe robustness issues
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024

Abstract:Despite enjoying desirable efficiency and reduced reliance on domain expertise, existing neural methods for vehicle routing problems (VRPs) suffer from severe robustness issues – their performance significantly deteriorates on clean instances with crafted perturbations. To enhance robustness, we propose an ensemble-based Collaborative Neural Framework (CNF) w.r.t. the defense of neural VRP methods, which is crucial yet underexplored in the literature. Given a neural VRP method, we adversarially train multiple models in a collaborative manner to synergistically promote robustness against attacks, while boosting standard generalization on clean instances. A neural router is designed to adeptly distribute training instances among models, enhancing overall load balancing and collaborative efficacy. Extensive experiments verify the effectiveness and versatility of CNF in defending against various attacks across different neural VRP methods. Notably, our approach also achieves impressive out-of-distribution generalization on benchmark instances.

[AI-34] Activation Scaling for Steering and Interpreting Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.04962
Authors: Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein
Keywords: steer a language, relevant activation vectors, France, Italy, Rome
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2024

Abstract:Given the prompt “Rome is in”, can we steer a language model to flip its prediction of an incorrect token “France” to a correct token “Italy” by only multiplying a few relevant activation vectors with scalars? We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings. Concretely, we establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa (effectiveness), and leave other tokens unaffected (faithfulness), all while being sparse (minimality). Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention: activation scaling only modifies the signed magnitude of activation vectors to strengthen, weaken, or reverse the steering directions already encoded in the model. On synthetic tasks, this intervention performs comparably with steering vectors in terms of effectiveness and faithfulness, but is much more minimal allowing us to pinpoint interpretable model components. We evaluate activation scaling from different angles, compare performance on different datasets, and make activation scalars a learnable function of the activation vectors themselves to generalize to varying-length prompts.
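
The intervention described above amounts to multiplying selected activation vectors by scalars. The following is a minimal sketch of that single operation only, not the authors' implementation: the function name `scale_activations` and the toy values are hypothetical, and the gradient-based learning of the scalars under the three-term objective is not modelled.

```python
def scale_activations(activations, scalars):
    """Multiply each activation vector by its own scalar (one scalar per vector)."""
    if len(activations) != len(scalars):
        raise ValueError("need exactly one scalar per activation vector")
    return [[s * x for x in vec] for vec, s in zip(activations, scalars)]


# Toy activation vectors for three hypothetical model components.
activations = [[1.0, -2.0], [0.5, 0.5], [3.0, 0.0]]
# 1.0 leaves a vector unchanged, 0.0 removes it, a negative value reverses its direction.
scalars = [1.0, 0.0, -2.0]
scaled = scale_activations(activations, scalars)
```

Because a scalar can only strengthen, weaken, or reverse a direction the model already encodes, this kind of intervention never introduces a new direction, which is the contrast with steering vectors drawn in the abstract.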

[AI-35] Leverage Knowledge Graph and Large Language Model for Law Article Recommendation: A Case Study of Chinese Criminal Law

Link: https://arxiv.org/abs/2410.04949
Authors: Yongming Chen, Miner Chen, Ye Zhu, Juan Pei, Siyu Chen, Yu Zhou, Yi Wang, Yifan Zhou, Hao Li, Songan Zhang
Keywords: Article Knowledge Graph, Large Language Model, Knowledge Graph, social stability, law article
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Court efficiency is vital for social stability. However, in most countries around the world, the grassroots courts face case backlogs, with decisions relying heavily on judicial personnel’s cognitive labor, lacking intelligent tools to improve efficiency. To address this issue, we propose an efficient law article recommendation approach utilizing a Knowledge Graph (KG) and a Large Language Model (LLM). Firstly, we propose a Case-Enhanced Law Article Knowledge Graph (CLAKG) as a database to store current law statutes, historical case information, and correspondence between law articles and historical cases. Additionally, we introduce an automated CLAKG construction method based on LLM. On this basis, we propose a closed-loop law article recommendation method. Finally, through a series of experiments using judgment documents from the website “China Judgements Online”, we have improved the accuracy of law article recommendation in cases from 0.549 to 0.694, demonstrating that our proposed method significantly outperforms baseline approaches.

[AI-36] Real-time Ship Recognition and Georeferencing for the Improvement of Maritime Situational Awareness

Link: https://arxiv.org/abs/2410.04946
Authors: Borja Carrillo Perez
Keywords: advanced situational awareness, situational awareness solutions, infrastructures are crucial, increasingly important, solutions are increasingly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In an era where maritime infrastructures are crucial, advanced situational awareness solutions are increasingly important. The use of optical camera systems can allow real-time usage of maritime footage. This thesis presents an investigation into leveraging deep learning and computer vision to advance real-time ship recognition and georeferencing for the improvement of maritime situational awareness. A novel dataset, ShipSG, is introduced, containing 3,505 images and 11,625 ship masks with corresponding class and geographic position. After an exploration of state-of-the-art, a custom real-time segmentation architecture, ScatYOLOv8+CBAM, is designed for the NVIDIA Jetson AGX Xavier embedded system. This architecture adds the 2D scattering transform and attention mechanisms to YOLOv8, achieving an mAP of 75.46% and 25.3 ms per frame, outperforming state-of-the-art methods by over 5%. To improve small and distant ship recognition in high-resolution images on embedded systems, an enhanced slicing mechanism is introduced, improving mAP by 8% to 11%. Additionally, a georeferencing method is proposed, achieving positioning errors of 18 m for ships up to 400 m away and 44 m for ships between 400 m and 1200 m. The findings are also applied in real-world scenarios, such as the detection of abnormal ship behaviour, camera integrity assessment and 3D reconstruction. The approach of this thesis outperforms existing methods and provides a framework for integrating recognized and georeferenced ships into real-time systems, enhancing operational effectiveness and decision-making for maritime stakeholders. This thesis contributes to the maritime computer vision field by establishing a benchmark for ship segmentation and georeferencing research, demonstrating the viability of deep-learning-based recognition and georeferencing methods for real-time maritime monitoring.

[AI-37] Detecting and Approximating Redundant Computational Blocks in Neural Networks

Link: https://arxiv.org/abs/2410.04941
Authors: Irene Cannistraci, Emanuele Rodolà, Bastian Rieck
Keywords: Deep neural networks, learn similar internal, Deep neural, learn similar, Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures, 7 tables

Abstract:Deep neural networks often learn similar internal representations, both across different models and within their own layers. While inter-network similarities have enabled techniques such as model stitching and merging, intra-network similarities present new opportunities for designing more efficient architectures. In this paper, we investigate the emergence of these internal similarities across different layers in diverse neural architectures, showing that similarity patterns emerge independently of the dataset used. We introduce a simple metric, Block Redundancy, to detect redundant blocks, providing a foundation for future architectural optimization methods. Building on this, we propose Redundant Blocks Approximation (RBA), a general framework that identifies and approximates one or more redundant computational blocks using simpler transformations. We show that the transformation $\mathcal{T}$ between two representations can be efficiently computed in closed-form, and it is enough to replace the redundant blocks from the network. RBA reduces model parameters and time complexity while maintaining good performance. We validate our method on classification tasks in the vision domain using a variety of pretrained foundational models and datasets.

[AI-38] Training Interactive Agent in Large FPS Game Map with Rule-enhanced Reinforcement Learning

Link: https://arxiv.org/abs/2410.04936
Authors: Chen Zhang, Huan Hu, Yuan Zhou, Qiyang Cao, Ruochen Liu, Wenya Wei, Elvis S. Liu
Keywords: gained immense popularity, FPS games, first-person shooter, complex FPS games, immense popularity
Categories: Artificial Intelligence (cs.AI)

Abstract:In the realm of competitive gaming, 3D first-person shooter (FPS) games have gained immense popularity, prompting the development of game AI systems to enhance gameplay. However, deploying game AI in practical scenarios still poses challenges, particularly in large-scale and complex FPS games. In this paper, we focus on the practical deployment of game AI in the online multiplayer competitive 3D FPS game called Arena Breakout, developed by Tencent Games. We propose a novel gaming AI system named Private Military Company Agent (PMCA), which is interactable within a large game map and engages in combat with players while utilizing tactical advantages provided by the surrounding terrain. To address the challenges of navigation and combat in modern 3D FPS games, we introduce a method that combines navigation mesh (Navmesh) and shooting-rule with deep reinforcement learning (NSRL). The integration of Navmesh enhances the agent’s global navigation capabilities while shooting behavior is controlled using rule-based methods to ensure controllability. NSRL employs a DRL model to predict when to enable the navigation mesh, resulting in a diverse range of behaviors for the game AI. Customized rewards for human-like behaviors are also employed to align PMCA’s behavior with that of human players.

[AI-39] The Role of Governments in Increasing Interconnected Post-Deployment Monitoring of AI

Link: https://arxiv.org/abs/2410.04931
Authors: Merlin Stein, Jamie Bernardi, Connor Dunlop
Keywords: Language-based AI systems, diffusing into society, bringing positive, systems are diffusing, Mitigating negative impacts
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 2 figures, 1 table

Abstract:Language-based AI systems are diffusing into society, bringing positive and negative impacts. Mitigating negative impacts depends on accurate impact assessments, drawn from an empirical evidence base that makes causal connections between AI usage and impacts. Interconnected post-deployment monitoring combines information about model integration and use, application use, and incidents and impacts. For example, inference time monitoring of chain-of-thought reasoning can be combined with long-term monitoring of sectoral AI diffusion, impacts and incidents. Drawing on information sharing mechanisms in other industries, we highlight example data sources and specific data points that governments could collect to inform AI risk management.

[AI-40] Defense-as-a-Service: Black-box Shielding against Backdoored Graph Models

Link: https://arxiv.org/abs/2410.04916
Authors: Xiao Yang, Kai Zhou, Yuni Lai, Gaolei Li
Keywords: deliver business services, large graph learning, business owners tend, trend of large, tend to employ
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:With the trend of large graph learning models, business owners tend to employ a model provided by a third party to deliver business services to users. However, these models might be backdoored, and malicious users can submit trigger-embedded inputs to manipulate the model predictions. Current graph backdoor defenses have several limitations: 1) depending on model-related details, 2) requiring additional model fine-tuning, and 3) relying upon extra explainability tools, all of which are infeasible under stringent privacy policies. To address those limitations, we propose GraphProt, which allows resource-constrained business owners to rely on third parties to avoid backdoor attacks on GNN-based graph classifiers. Our GraphProt is model-agnostic and only relies on the input graph. The key insight is to leverage subgraph information for prediction, thereby mitigating backdoor effects induced by triggers. GraphProt comprises two components: clustering-based trigger elimination and robust subgraph ensemble. Specifically, we first propose feature-topology clustering that aims to remove most of the anomalous subgraphs (triggers). Moreover, we design subgraph sampling strategies based on feature-topology clustering to build a robust classifier via majority vote. Experimental results across three backdoor attacks and six benchmark datasets demonstrate that GraphProt significantly reduces the backdoor attack success rate while preserving the model accuracy on regular graph classification tasks.
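
The "robust subgraph ensemble" idea above, classifying several sampled subgraphs with the black-box model and taking a majority vote, can be sketched as follows. This is an illustrative sketch under stated assumptions, not GraphProt itself: the uniform node sampler stands in for the paper's clustering-based sampling strategies, and `sample_subgraph`, `ensemble_predict`, and `toy_classifier` are hypothetical names.

```python
import random
from collections import Counter


def sample_subgraph(nodes, keep_ratio, rng):
    """Return a random subset of node ids (a stand-in for a real subgraph sampler)."""
    k = max(1, int(len(nodes) * keep_ratio))
    return rng.sample(nodes, k)


def ensemble_predict(nodes, classify, n_samples=15, keep_ratio=0.5, seed=0):
    """Classify several sampled subgraphs and return the majority label."""
    rng = random.Random(seed)
    votes = Counter(
        classify(sample_subgraph(nodes, keep_ratio, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


# Hypothetical black-box classifier: predicts label 1 only when trigger node 99 is present.
def toy_classifier(subgraph_nodes):
    return 1 if 99 in subgraph_nodes else 0


nodes = list(range(9)) + [99]
label = ensemble_predict(nodes, toy_classifier, keep_ratio=0.2)
```

The sketch only needs the input graph and query access to the classifier, which matches the model-agnostic, black-box setting the abstract describes.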

[AI-41] Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models

Link: https://arxiv.org/abs/2410.04884
Authors: Dehong Kong, Siyuan Liang, Xiaopeng Zhu, Yuansheng Zhong, Wenqi Ren
Keywords: Visual language pre-training, Visual language, demonstrated significant success, language pre-training, demonstrated significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by Visual Intelligence

Abstract:Visual language pre-training (VLP) models have demonstrated significant success across various domains, yet they remain vulnerable to adversarial attacks. Addressing these adversarial vulnerabilities is crucial for enhancing security in multimodal learning. Traditionally, adversarial methods targeting VLP models involve simultaneously perturbing images and text. However, this approach faces notable challenges: first, adversarial perturbations often fail to translate effectively into real-world scenarios; second, direct modifications to the text are conspicuously visible. To overcome these limitations, we propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Moreover, to optimize patch placement and improve the efficacy of our attacks, we utilize the cross-attention mechanism, which encapsulates intermodal interactions by generating attention maps to guide strategic patch placements. Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate. Additionally, it demonstrates commendable performance in transfer tasks involving text-to-image configurations.

[AI-42] Leveraging Grammar Induction for Language Understanding and Generation EMNLP2024

Link: https://arxiv.org/abs/2410.04878
Authors: Jushi Kai, Shengyuan Hou, Yusheng Huang, Zhouhan Lin
Keywords: made significant progress, recent years, made significant, significant progress, progress in recent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings

Abstract:Grammar induction has made significant progress in recent years. However, it is not clear how the application of induced grammar could enhance practical performance in downstream tasks. In this work, we introduce an unsupervised grammar induction method for language understanding and generation. We construct a grammar parser to induce constituency structures and dependency relations, which is simultaneously trained on downstream tasks without additional syntax annotations. The induced grammar features are subsequently incorporated into Transformer as a syntactic mask to guide self-attention. We evaluate and apply our method to multiple machine translation tasks and natural language understanding tasks. Our method demonstrates superior performance compared to the original Transformer and other models enhanced with external parsers. Experimental results indicate that our method is effective in both from-scratch and pre-trained scenarios. Additionally, our research highlights the contribution of explicitly modeling the grammatical structure of texts to neural network models.

[AI-43] Mastering Chinese Chess AI (Xiangqi) Without Search

Link: https://arxiv.org/abs/2410.04865
Authors: Yu Chen, Juntong Lin, Zhichao Shu
Keywords: high-performance Chinese Chess, Carlo Tree Search, Monte Carlo Tree, Chinese Chess, developed a high-performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We have developed a high-performance Chinese Chess AI that operates without reliance on search algorithms. This AI has demonstrated the capability to compete at a level commensurate with the top 0.1% of human players. By eliminating the search process typically associated with such systems, this AI achieves a Queries Per Second (QPS) rate that exceeds those of systems based on the Monte Carlo Tree Search (MCTS) algorithm by over a thousandfold and surpasses those based on the AlphaBeta pruning algorithm by more than a hundredfold. The AI training system consists of two parts: supervised learning and reinforcement learning. Supervised learning provides an initial human-like Chinese chess AI, while reinforcement learning, based on supervised learning, elevates the strength of the entire AI to a new level. Based on this training system, we carried out sufficient ablation experiments and discovered that: 1. With the same number of parameters, the Transformer architecture performs better than a CNN on Chinese chess; 2. Using the possible moves of both sides as features can greatly improve the training process; 3. A selective opponent pool, compared to pure self-play training, results in a faster improvement curve and a higher strength limit; 4. Value Estimation with Cutoff (VECT) improves the original PPO algorithm training process, and we will give the explanation.

[AI-44] Unsupervised Skill Discovery for Robotic Manipulation through Automatic Task Generation

Link: https://arxiv.org/abs/2410.04855
Authors: Paul Jansonnie, Bingbing Wu, Julien Perez, Jan Peters
Keywords: manipulation tasks, major importance, Hierarchical Reinforcement Learning, unseen manipulation tasks, Skill Learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 2024 IEEE-RAS International Conference on Humanoid Robots

Abstract:Learning skills that interact with objects is of major importance for robotic manipulation. These skills can indeed serve as an efficient prior for solving various manipulation tasks. We propose a novel Skill Learning approach that discovers composable behaviors by solving a large and diverse number of autonomously generated tasks. Our method learns skills allowing the robot to consistently and robustly interact with objects in its environment. The discovered behaviors are embedded in primitives which can be composed with Hierarchical Reinforcement Learning to solve unseen manipulation tasks. In particular, we leverage Asymmetric Self-Play to discover behaviors and Multiplicative Compositional Policies to embed them. We compare our method to Skill Learning baselines and find that our skills are more interactive. Furthermore, the learned skills can be used to solve a set of unseen manipulation tasks, in simulation as well as on a real robotic platform.

[AI-45] TimeCNN: Refining Cross-Variable Interaction on Time Point for Time Series Forecasting

Link: https://arxiv.org/abs/2410.04853
Authors: Ao Hu, Dongkai Wang, Yong Dai, Shiyi Qi, Liangjian Wen, Jun Wang, Zhi Chen, Xun Zhou, Zenglin Xu, Jiang Duan
Keywords: diverse domains, Time series forecasting, extensively applied, applied across diverse, Time
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Time series forecasting is extensively applied across diverse domains. Transformer-based models demonstrate significant potential in modeling cross-time and cross-variable interaction. However, we notice that the cross-variable correlation of multivariate time series demonstrates multifaceted (positive and negative correlations) and dynamic progression over time, which is not well captured by existing Transformer-based models. To address this issue, we propose a TimeCNN model to refine cross-variable interactions to enhance time series forecasting. Its key innovation is timepoint-independent, where each time point has an independent convolution kernel, allowing each time point to have its independent model to capture relationships among variables. This approach effectively handles both positive and negative correlations and adapts to the evolving nature of variable relationships over time. Extensive experiments conducted on 12 real-world datasets demonstrate that TimeCNN consistently outperforms state-of-the-art models. Notably, our model achieves significant reductions in computational requirements (approximately 60.46%) and parameter count (about 57.50%), while delivering inference speeds 3 to 4 times faster than the benchmark iTransformer model.
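
The "timepoint-independent" idea, one kernel per time point mixing the variables observed at that time, can be illustrated with a small sketch. It assumes the kernel acts as a full variable-by-variable weight matrix, which the abstract does not specify; `timepoint_independent_mix` and the toy numbers are hypothetical.

```python
def timepoint_independent_mix(series, kernels):
    """Apply a separate variable-mixing kernel at every time point.

    series[t][v]     : value of variable v at time t
    kernels[t][o][v] : weight from input variable v to output variable o at time t
    """
    if len(series) != len(kernels):
        raise ValueError("need one kernel per time point")
    return [
        [sum(w * x for w, x in zip(row, x_t)) for row in k_t]
        for x_t, k_t in zip(series, kernels)
    ]


# Two variables observed at two time points.
series = [[1.0, 2.0], [3.0, 4.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],   # t = 0: identity, variables kept separate
    [[0.0, -1.0], [1.0, 0.0]],  # t = 1: a negative weight models a negative correlation
]
mixed = timepoint_independent_mix(series, kernels)
```

Since the weights are signed and differ per time point, the same pair of variables can interact positively at one step and negatively at another, which is the behaviour the abstract says shared weights fail to capture.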

[AI-46] PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

Link: https://arxiv.org/abs/2410.04844
Authors: Feng Tian, Yixuan Li, Yichao Yan, Shanyan Guan, Yanhao Ge, Xiaokang Yang
Keywords: core challenges persist, initial features, background preservation, challenges persist, core challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In the field of image editing, three core challenges persist: controllability, background preservation, and efficiency. Inversion-based methods rely on time-consuming optimization to preserve the features of the initial images, which results in low efficiency due to the requirement for extensive network inference. Conversely, inversion-free methods lack theoretical support for background similarity, as they circumvent the issue of maintaining initial features to achieve efficiency. As a consequence, none of these methods can achieve both high efficiency and background consistency. To tackle the challenges and the aforementioned disadvantages, we introduce PostEdit, a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a corresponding measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated by the given target prompt. Extensive experimental results indicate that the proposed PostEdit achieves state-of-the-art editing performance while accurately preserving unedited regions. Furthermore, the method is both inversion- and training-free, necessitating approximately 1.5 seconds and 18 GB of GPU memory to generate high-quality results.

[AI-47] Multimodal Fusion Strategies for Mapping Biophysical Landscape Features ECCV2024

Link: https://arxiv.org/abs/2410.04833
Authors: Lucia Gordon, Nico Lang, Catherine Ressijac, Andrew Davies
Keywords: Multimodal aerial data, monitor natural systems, Multimodal aerial, natural systems, ecology and conservation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, ECCV 2024 Workshop in CV for Ecology

Abstract:Multimodal aerial data are used to monitor natural systems, and machine learning can significantly accelerate the classification of landscape features within such imagery to benefit ecology and conservation. It remains under-explored, however, how these multiple modalities ought to be fused in a deep learning model. As a step towards filling this gap, we study three strategies (Early fusion, Late fusion, and Mixture of Experts) for fusing thermal, RGB, and LiDAR imagery using a dataset of spatially-aligned orthomosaics in these three modalities. In particular, we aim to map three ecologically-relevant biophysical landscape features in African savanna ecosystems: rhino middens, termite mounds, and water. The three fusion strategies differ in whether the modalities are fused early or late, and if late, whether the model learns fixed weights per modality for each class or generates weights for each class adaptively, based on the input. Overall, the three methods have similar macro-averaged performance with Late fusion achieving an AUC of 0.698, but their per-class performance varies strongly, with Early fusion achieving the best recall for middens and water and Mixture of Experts achieving the best recall for mounds.
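
The difference between the two late-fusion variants above, fixed per-class modality weights versus weights generated from the input, can be sketched as follows. This is a simplified illustration, not the paper's architecture: the softmax gating and the names `late_fusion` and `mixture_of_experts` are assumptions, and the gating network that would produce `gate_logits` is not modelled.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(scores, weights):
    """Combine per-modality class scores with fixed weights.

    scores[m][c]  : score of class c from the model for modality m
    weights[m][c] : fixed weight of modality m for class c (learned once)
    """
    n_classes = len(scores[0])
    return [
        sum(weights[m][c] * scores[m][c] for m in range(len(scores)))
        for c in range(n_classes)
    ]


def mixture_of_experts(scores, gate_logits):
    """Same combination, but the weights come from input-dependent gate logits.

    gate_logits[c][m] : logit for modality m on class c, produced per input
    """
    weights_per_class = [softmax(g) for g in gate_logits]
    return [
        sum(weights_per_class[c][m] * scores[m][c] for m in range(len(scores)))
        for c in range(len(gate_logits))
    ]


# Toy scores for two classes from thermal, RGB, and LiDAR models (made-up values).
scores = [[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]]
uniform = [[1 / 3, 1 / 3] for _ in scores]
fused_fixed = late_fusion(scores, uniform)
fused_gated = mixture_of_experts(scores, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

With all gate logits equal, the gated variant reduces to uniform fixed weights; the two only diverge once the gate responds to the input.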

[AI-48] Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders

Link: https://arxiv.org/abs/2410.04817
Authors: Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Calheiros, Teng Joon Lim
Keywords: modern computer vision, offering advanced capabilities, computer vision, offering advanced, understanding and analysis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 10 pages, conference

Abstract:Multiview systems have become a key technology in modern computer vision, offering advanced capabilities in scene understanding and analysis. However, these systems face critical challenges in bandwidth limitations and computational constraints, particularly for resource-limited camera nodes like drones. This paper presents a novel approach for communication-efficient distributed multiview detection and tracking using masked autoencoders (MAEs). We introduce a semantic-guided masking strategy that leverages pre-trained segmentation models and a tunable power function to prioritize informative image regions. This approach, combined with an MAE, reduces communication overhead while preserving essential visual information. We evaluate our method on both virtual and real-world multiview datasets, demonstrating comparable performance in terms of detection and tracking performance metrics compared to state-of-the-art techniques, even at high masking ratios. Our selective masking algorithm outperforms random masking, maintaining higher accuracy and precision as the masking ratio increases. Furthermore, our approach achieves a significant reduction in transmission data volume compared to baseline methods, thereby balancing multiview tracking performance with communication efficiency.

[AI-49] Learning Interpretable Hierarchical Dynamical Systems Models from Time Series Data

Link: https://arxiv.org/abs/2410.04814
Authors: Manuel Brenner, Elias Weber, Georgia Koppe, Daniel Durstewitz
Keywords: observed time series, interested in obtaining, obtaining a generative, generative model, time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Preprint

Abstract:In science, we are often interested in obtaining a generative model of the underlying system dynamics from observed time series. While powerful methods for dynamical systems reconstruction (DSR) exist when data come from a single domain, how to best integrate data from multiple dynamical regimes and leverage it for generalization is still an open question. This becomes particularly important when individual time series are short, and group-level information may help to fill in for gaps in single-domain data. At the same time, averaging is not an option in DSR, as it will wipe out crucial dynamical properties (e.g., limit cycles in one domain vs. chaos in another). Hence, a framework is needed that enables to efficiently harvest group-level (multi-domain) information while retaining all single-domain dynamical characteristics. Here we provide such a hierarchical approach and showcase it on popular DSR benchmarks, as well as on neuroscientific and medical time series. In addition to faithful reconstruction of all individual dynamical regimes, our unsupervised methodology discovers common low-dimensional feature spaces in which datasets with similar dynamics cluster. The features spanning these spaces were further dynamically highly interpretable, surprisingly in often linear relation to control parameters that govern the dynamics of the underlying system. Finally, we illustrate transfer learning and generalization to new parameter regimes.

[AI-50] Transforming Color: A Novel Image Colorization Method

Link: https://arxiv.org/abs/2410.04799
Authors: Hamza Shafiq, Bumshik Lee
Keywords: appealing colorized images, generative adversarial networks, generating visually appealing, visually appealing colorized, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces a novel method for image colorization that utilizes a color transformer and generative adversarial networks (GANs) to address the challenge of generating visually appealing colorized images. Conventional approaches often struggle with capturing long-range dependencies and producing realistic colorizations. The proposed method integrates a transformer architecture to capture global information and a GAN framework to improve visual quality. In this study, a color encoder that utilizes a random normal distribution to generate color features is applied. These features are then integrated with grayscale image features to enhance the overall representation of the images. Our method demonstrates superior performance compared with existing approaches by combining the capacity of the transformer to capture long-range dependencies with the ability of the GAN to generate realistic colorizations. Experimental results show that the proposed network significantly outperforms other state-of-the-art colorization techniques, highlighting its potential for image colorization. This research opens new possibilities for precise and visually compelling image colorization in domains such as digital restoration and historical image analysis.

[AI-51] Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models

Link: https://arxiv.org/abs/2410.04795
Authors: Dahyun Kim, Sukyung Lee, Yungi Kim, Attapol Rutherford, Chanjun Park
Keywords: large language models, widely-used benchmark suites, robust evaluation frameworks, Thai LLM, Thai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of certain widely-used benchmark suites such as the H6 benchmark. However, these benchmark suites are primarily built for the English language, and there exists a lack thereof for under-represented languages, in terms of LLM development, such as Thai. On the other hand, developing LLMs for Thai should also include enhancing the cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multi-lingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.

[AI-52] Analysis of Hybrid Compositions in Animation Film with Weakly Supervised Learning ECCV

Link: https://arxiv.org/abs/2410.04789
Authors: Mónica Apellaniz Portos, Roberto Labadie-Tamayo, Claudius Stemmler, Erwin Feyersinger, Andreas Babic, Franziska Bruckner, Vrääth Öhner, Matthias Zeppelzauer
Keywords: hybrid visual compositions, domain of ephemeral, hybrid compositions, visual compositions, hybrid visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Vision for Art (VISART VII) Workshop at the European Conference of Computer Vision (ECCV)

Abstract:We present an approach for the analysis of hybrid visual compositions in animation in the domain of ephemeral film. We combine ideas from semi-supervised and weakly supervised learning to train a model that can segment hybrid compositions without requiring pre-labeled segmentation masks. We evaluate our approach on a set of ephemeral films from 13 film archives. Results demonstrate that the proposed learning strategy yields a performance close to a fully supervised baseline. On a qualitative level the performed analysis provides interesting insights on hybrid compositions in animation film.

[AI-53] Fast Training of Sinusoidal Neural Fields via Scaling Initialization

Link: https://arxiv.org/abs/2410.04779
Authors: Taesun Yeom, Sangyoon Lee, Jaeho Lee
Keywords: continuous functions parameterized, Neural fields, emerging paradigm, paradigm that represent, continuous functions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Neural fields are an emerging paradigm that represents data as continuous functions parameterized by neural networks. Despite many advantages, neural fields often have a high training cost, which prevents a broader adoption. In this paper, we focus on a popular family of neural fields, called sinusoidal neural fields (SNFs), and study how it should be initialized to maximize the training speed. We find that the standard initialization scheme for SNFs (designed based on the signal propagation principle) is suboptimal. In particular, we show that by simply multiplying each weight (except for the last layer) by a constant, we can accelerate SNF training by 10×. This method, coined weight scaling, consistently provides a significant speedup over various data domains, allowing the SNFs to train faster than more recently proposed architectures. To understand why the weight scaling works well, we conduct extensive theoretical and empirical analyses which reveal that the weight scaling not only resolves the spectral bias quite effectively but also enjoys a well-conditioned optimization trajectory.
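
The weight scaling idea described in this abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the commonly used SIREN-style uniform initialization as the "standard" scheme, and the function names, `omega0 = 30.0`, and the scale factor `alpha` are hypothetical choices made for the example.

```python
import math
import random


def siren_style_layer(fan_in, fan_out, rng, omega0=30.0, first=False):
    """Draw one weight matrix with SIREN-style uniform bounds (assumed baseline)."""
    bound = 1.0 / fan_in if first else math.sqrt(6.0 / fan_in) / omega0
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def init_scaled_snf(layer_sizes, alpha, seed=0):
    """Initialize every layer, then multiply all but the last layer by alpha."""
    rng = random.Random(seed)
    n_layers = len(layer_sizes) - 1
    weights = []
    for i in range(n_layers):
        w = siren_style_layer(layer_sizes[i], layer_sizes[i + 1], rng,
                              first=(i == 0))
        if i < n_layers - 1:  # the last layer keeps its original scale
            w = [[alpha * value for value in row] for row in w]
        weights.append(w)
    return weights
```

Setting `alpha = 1.0` recovers the assumed baseline, so the scaled and unscaled networks can be compared under the same seed. The abstract does not state which constant works best, so `alpha` is left as a free parameter here.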

[AI-54] Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM

Link: https://arxiv.org/abs/2410.04759
Authors: Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z. Zhao, Zhiwen Wu, Jiaqi Ma
Keywords: enables seamless adaptation, integrates traffic regulations, Traffic Regulation Retrieval, safety guidelines comprehensively, relevant traffic rules
Categories: Artificial Intelligence (cs.AI)

Abstract:This work presents an interpretable decision-making framework for autonomous vehicles that integrates traffic regulations, norms, and safety guidelines comprehensively and enables seamless adaptation to different regions. While traditional rule-based methods struggle to incorporate the full scope of traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on Retrieval-Augmented Generation (RAG) to automatically retrieve relevant traffic rules and guidelines from extensive regulation documents and relevant records based on the ego vehicle’s situation. Given the semantic complexity of the retrieved rules, we also design a reasoning module powered by a Large Language Model (LLM) to interpret these rules, differentiate between mandatory rules and safety guidelines, and assess actions on legal compliance and safety. Additionally, the reasoning is designed to be interpretable, enhancing both transparency and reliability. The framework demonstrates robust performance on both hypothesized and real-world cases across diverse scenarios, along with the ability to adapt to different regions with ease.

[AI-55] Item Cluster-aware Prompt Learning for Session-based Recommendation

Link: https://arxiv.org/abs/2410.04756
Authors: Wooseong Yang, Chen Wang, Zihe Song, Weizhi Zhang, Philip S. Yu
Keywords: dynamic user preferences, capture dynamic user, analyzing item sequences, dynamic user, user preferences
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Session-based recommendation (SBR) aims to capture dynamic user preferences by analyzing item sequences within individual sessions. However, most existing approaches focus mainly on intra-session item relationships, neglecting the connections between items across different sessions (inter-session relationships), which limits their ability to fully capture complex item interactions. While some methods incorporate inter-session information, they often suffer from high computational costs, leading to longer training times and reduced efficiency. To address these challenges, we propose the CLIP-SBR (Cluster-aware Item Prompt learning for Session-Based Recommendation) framework. CLIP-SBR is composed of two modules: 1) an item relationship mining module that builds a global graph to effectively model both intra- and inter-session relationships, and 2) an item cluster-aware prompt learning module that uses soft prompts to integrate these relationships into SBR models efficiently. We evaluate CLIP-SBR across eight SBR models and three benchmark datasets, consistently demonstrating improved recommendation performance and establishing CLIP-SBR as a robust solution for session-based recommendation tasks.
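
To make the notion of a global item graph concrete, here is a minimal sketch of how intra- and inter-session relationships could be merged into one structure. It is an assumption-laden illustration rather than the CLIP-SBR implementation: linking consecutive items and counting co-occurrences as edge weights are choices made for this example, and `build_global_item_graph` is a hypothetical name.

```python
from collections import defaultdict


def build_global_item_graph(sessions):
    """Merge all sessions into one undirected, weighted item graph.

    Consecutive items inside a session are linked; because every session
    writes into the same graph, items from different sessions become
    connected through the items they share.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for left, right in zip(session, session[1:]):
            if left != right:  # ignore immediate repeats of the same item
                graph[left][right] += 1
                graph[right][left] += 1
    return {item: dict(neighbors) for item, neighbors in graph.items()}
```

With sessions `["a", "b", "c"]` and `["b", "c", "d"]`, item `a` and item `d` never share a session, yet they end up two hops apart via the shared edge between `b` and `c`.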

[AI-56] ImProver: Agent-Based Automated Proof Optimization

Link: https://arxiv.org/abs/2410.04753
Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck
Keywords: Large language models, Large language, generate formal proofs, language models, generate formal
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 19 pages, 21 figures

Abstract:Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

[AI-57] Evaluating the Generalization Ability of Spatiotemporal Model in Urban Scenario

Link: https://arxiv.org/abs/2410.04740
Authors: Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong, Lingyu Zhang, Renhe Jiang, Xuan Song
Keywords: shown great promise, effectively capturing temporal, Spatiotemporal neural networks, spatial correlations, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)

Abstract:Spatiotemporal neural networks have shown great promise in urban scenarios by effectively capturing temporal and spatial correlations. However, urban environments are constantly evolving, and current model evaluations are often limited to traffic scenarios and use data mainly collected only a few weeks after the training period to evaluate model performance. The generalization ability of these models remains largely unexplored. To address this, we propose a Spatiotemporal Out-of-Distribution (ST-OOD) benchmark, which comprises six urban scenarios: bike-sharing, 311 services, pedestrian counts, traffic speed, traffic flow, and ride-hailing demand, each with in-distribution (same year) and out-of-distribution (subsequent years) settings. We extensively evaluate state-of-the-art spatiotemporal models and find that their performance degrades significantly in out-of-distribution settings, with most models performing even worse than a simple Multi-Layer Perceptron (MLP). Our findings suggest that current leading methods tend to over-rely on parameters to overfit training data, which may lead to good performance on in-distribution data but often results in poor generalization. We also investigated whether dropout could mitigate the negative effects of overfitting. Our results showed that a slight dropout rate could significantly improve generalization performance on most datasets, with minimal impact on in-distribution performance. However, balancing in-distribution and out-of-distribution performance remains a challenging problem. We hope that the proposed benchmark will encourage further research on this critical issue.

[AI-58] TableRAG: Million-Token Table Understanding with Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.04739
Authors: Si-An Chen, Lesly Miculicich, Julian Martin Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister
Keywords: Recent advancements, language models, primarily through program-aided, advancements in language, notably enhanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG’s effectiveness at scale. Our results demonstrate that TableRAG’s retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.

[AI-59] ProtoNAM: Prototypical Neural Additive Models for Interpretable Deep Tabular Learning

Link: https://arxiv.org/abs/2410.04723
Authors: Guangzhi Xiong, Sanchit Sinha, Aidong Zhang
Keywords: Generalized additive models, powerful white-box tool, Generalized additive, Prototypical Neural Additive, Neural Additive Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Generalized additive models (GAMs) have long been a powerful white-box tool for the intelligible analysis of tabular data, revealing the influence of each feature on the model predictions. Despite the success of neural networks (NNs) in various domains, their application as NN-based GAMs in tabular data analysis remains suboptimal compared to tree-based ones, and the opacity of encoders in NN-GAMs also prevents users from understanding how networks learn the functions. In this work, we propose a new deep tabular learning method, termed Prototypical Neural Additive Model (ProtoNAM), which introduces prototypes into neural networks in the framework of GAMs. With the introduced prototype-based feature activation, ProtoNAM can flexibly model the irregular mapping from tabular features to the outputs while maintaining the explainability of the final prediction. We also propose a gradient-boosting inspired hierarchical shape function modeling method, facilitating the discovery of complex feature patterns and bringing transparency into the learning process of each network layer. Our empirical evaluations demonstrate that ProtoNAM outperforms all existing NN-based GAMs, while providing additional insights into the shape function learned for each feature. The source code of ProtoNAM is available at this https URL.

[AI-60] Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

Link: https://arxiv.org/abs/2410.04717
Authors: Dylan Zhang, Justin Wang, Francois Charton
Keywords: Understanding and accurately, large language models, data, large language, Turing-complete Markov algorithm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model’s adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.

[AI-61] Rule-based Data Selection for Large Language Models

Link: https://arxiv.org/abs/2410.04715
Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
Keywords: data significantly impacts, large language models, significantly impacts, large language, rules
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The quality of training data significantly impacts the performance of large language models (LLMs). There are increasing studies using LLMs to rate and select data based on several human-crafted metrics (rules). However, these conventional rule-based approaches often depend too heavily on human heuristics, lack effective metrics for assessing rules, and exhibit limited adaptability to new tasks. In our study, we introduce an innovative rule-based framework that utilizes the orthogonality of score vectors associated with rules as a novel metric for rule evaluations. Our approach includes an automated pipeline that first uses LLMs to generate a diverse set of rules, encompassing various rating dimensions to evaluate data quality. Then it rates a batch of data based on these rules and uses the determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, thereby identifying a set of independent rules. These rules are subsequently used to evaluate all data, selecting samples with the highest average scores for downstream tasks such as LLM training. We verify the effectiveness of our method through two experimental setups: 1) comparisons with ground truth ratings and 2) benchmarking LLMs trained with the chosen data. Our comprehensive experiments cover a range of scenarios, including general pre-training and domain-specific fine-tuning in areas such as IMDB, Medical, Math, and Code. The outcomes demonstrate that our DPP-based rule rating method consistently outperforms other approaches, including rule-free rating, uniform sampling, importance resampling, and QuRating, in terms of both rating precision and model performance.
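
The rule-selection step can be illustrated with a small sketch. Note the substitution: instead of sampling from a determinantal point process, the code below uses a greedy procedure that repeatedly picks the rule whose score vector has the largest component orthogonal to the rules already chosen, which greedily enlarges the volume (Gram determinant) spanned by the selection. The function name, the rule names in the usage note, and the `eps` threshold are all hypothetical.

```python
import math


def _norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def select_orthogonal_rules(score_vectors, k, eps=1e-9):
    """Greedily pick up to k rules with mutually near-orthogonal score vectors.

    score_vectors maps a rule name to its scores over one batch of samples.
    """
    # Normalize so that only the direction of each score vector matters.
    residuals = {}
    for name, vector in score_vectors.items():
        length = _norm(vector)
        if length > eps:
            residuals[name] = [x / length for x in vector]

    selected = []
    while residuals and len(selected) < k:
        best = max(residuals, key=lambda name: _norm(residuals[name]))
        best_norm = _norm(residuals[best])
        if best_norm <= eps:  # every remaining rule is redundant
            break
        basis = [x / best_norm for x in residuals.pop(best)]
        selected.append(best)
        # Remove the chosen direction from every remaining score vector.
        for name in list(residuals):
            vector = residuals[name]
            projection = sum(a * b for a, b in zip(vector, basis))
            residuals[name] = [a - projection * b
                               for a, b in zip(vector, basis)]
    return selected
```

For example, if a hypothetical `clarity` rule and a `fluency` rule rate a batch almost identically while a `factuality` rule rates it differently, asking for two rules returns `factuality` plus only one of the two near-duplicates.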

[AI-62] Tight Stability, Convergence, and Robustness Bounds for Predictive Coding Networks

Link: https://arxiv.org/abs/2410.04708
Authors: Ankur Mali, Tommaso Salvatori, Alexander Ororbia
Keywords: garnered significant attention, biologically plausible mechanisms, Energy-based learning algorithms, machine learning community, predictive coding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 29 pages, 9 theorems

Abstract:Energy-based learning algorithms, such as predictive coding (PC), have garnered significant attention in the machine learning community due to their theoretical properties, such as local operations and biologically plausible mechanisms for error correction. In this work, we rigorously analyze the stability, robustness, and convergence of PC through the lens of dynamical systems theory. We show that, first, PC is Lyapunov stable under mild assumptions on its loss and residual energy functions, which implies intrinsic robustness to small random perturbations due to its well-defined energy-minimizing dynamics. Second, we formally establish that the PC updates approximate quasi-Newton methods by incorporating higher-order curvature information, which makes them more stable and able to converge with fewer iterations compared to models trained via backpropagation (BP). Furthermore, using this dynamical framework, we provide new theoretical bounds on the similarity between PC and other algorithms, i.e., BP and target propagation (TP), by precisely characterizing the role of higher-order derivatives. These bounds, derived through detailed analysis of the Hessian structures, show that PC is significantly closer to quasi-Newton updates than TP, providing a deeper understanding of the stability and efficiency of PC compared to conventional learning methods.

[AI-63] Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

Link: https://arxiv.org/abs/2410.04707
Authors: Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas
Keywords: Computationally intensive decoding, spanning code generation, problems spanning code, Computationally intensive, intensive decoding procedures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Computationally intensive decoding procedures–including search, reranking, and self-critique–can improve the quality of language model (LM) outputs in problems spanning code generation, numerical reasoning, and dialog. Existing work typically applies the same decoding procedure for every input to an LM. But not all inputs require the same amount of computation to process. Can we allocate decoding computation adaptively, using more resources to answer questions whose answers will be harder to compute? We present an approach that predicts the distribution of rewards given an input and computation budget, then allocates additional computation to inputs for which it is predicted to be most useful. We apply this approach in two decoding procedures: first, an adaptive best-of-k procedure that dynamically selects the number of samples to generate as input to a reranker; second, a routing procedure that dynamically responds to a query using a decoding procedure that is expensive but accurate, or one that is cheaper but less capable. Across a suite of programming, mathematics, and dialog tasks, we show that accurate computation-allocation procedures can be learned, and reduce computation by up to 50% at no cost to response quality, or improve quality by up to 10% at a fixed computational budget.
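
A simplified sketch of adaptive best-of-k allocation follows. It is not the paper's method: it assumes a hypothetical predictor has already produced, for each query, the probability that a single sample is acceptable, and it replaces the predicted reward distribution with this one number. Under that assumption, the chance that at least one of k samples succeeds is 1 - (1 - p)^k, so the gain from one more sample is p(1 - p)^k, and a greedy loop can hand out a shared budget one sample at a time.

```python
import heapq


def allocate_samples(success_probs, total_budget, min_per_query=1):
    """Split a sample budget across queries by largest marginal gain.

    success_probs[i] is the (assumed, predicted) probability that one
    sample for query i is acceptable.
    """
    allocation = [min_per_query] * len(success_probs)
    remaining = total_budget - sum(allocation)
    if remaining < 0:
        raise ValueError("budget is too small for the per-query minimum")

    # Max-heap (via negated gains) of the benefit of one extra sample.
    heap = [(-p * (1.0 - p) ** k, i)
            for i, (p, k) in enumerate(zip(success_probs, allocation))]
    heapq.heapify(heap)

    for _ in range(remaining):
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        p = success_probs[i]
        heapq.heappush(heap, (-p * (1.0 - p) ** allocation[i], i))
    return allocation
```

Because the marginal gain shrinks with every extra sample, the greedy choice is optimal for this simplified objective. Easy queries (high p) and near-hopeless ones (very low p) both receive few samples, while medium-difficulty queries receive the most.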

[AI-64] Towards Measuring Goal-Directedness in AI Systems

Link: https://arxiv.org/abs/2410.04683
Authors: Dylan Xu, Juan-Pablo Rivera
Keywords: Recent advances, creating advanced, advances in deep, brought attention, possibility of creating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in deep learning have brought attention to the possibility of creating advanced, general AI systems that outperform humans across many tasks. However, if these systems pursue unintended goals, there could be catastrophic consequences. A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner in the first place, optimizing for some unknown goal; there exists significant research trying to evaluate systems for said behaviors. However, the most rigorous definitions of goal-directedness we currently have are difficult to compute in real-world settings. Drawing upon this previous literature, we explore policy goal-directedness within reinforcement learning (RL) environments. In our findings, we propose a different family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many (sparse) reward functions. We operationalize this preliminary definition of goal-directedness and test it in toy Markov decision process (MDP) environments. Furthermore, we explore how goal-directedness could be measured in frontier large-language models (LLMs). Our contribution is a definition of goal-directedness that is simpler and more easily computable in order to approach the question of whether AI systems could pursue dangerous goals. We recommend further exploration of measuring coherence and goal-directedness, based on our findings.

[AI-65] Knowledge Graph Based Agent for Complex Knowledge-Intensive QA in Medicine

Link: https://arxiv.org/abs/2410.04660
Authors: Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, Marinka Zitnik
Keywords: requiring distinct reasoning, requiring distinct, physics or chemistry, distinct reasoning strategies, reasoning strategies compared
Categories: Artificial Intelligence (cs.AI)

Abstract:Biomedical knowledge is uniquely complex and structured, requiring distinct reasoning strategies compared to other scientific disciplines like physics or chemistry. Biomedical scientists do not rely on a single approach to reasoning; instead, they use various strategies, including rule-based, prototype-based, and case-based reasoning. This diversity calls for flexible approaches that accommodate multiple reasoning strategies while leveraging in-domain knowledge. We introduce KGARevion, a knowledge graph (KG) based agent designed to address the complexity of knowledge-intensive medical queries. Upon receiving a query, KGARevion generates relevant triplets by using the knowledge base of the LLM. These triplets are then verified against a grounded KG to filter out erroneous information and ensure that only accurate, relevant data contribute to the final answer. Unlike RAG-based models, this multi-step process ensures robustness in reasoning while adapting to different models of medical reasoning. Evaluations on four gold-standard medical QA datasets show that KGARevion improves accuracy by over 5.2%, outperforming 15 models in handling complex medical questions. To test its capabilities, we curated three new medical QA datasets with varying levels of semantic complexity, where KGARevion achieved a 10.4% improvement in accuracy.

[AI-66] Contrastive Learning to Improve Retrieval for Real-world Fact Checking EMNLP2024

Link: https://arxiv.org/abs/2410.04657
Authors: Aniruddh Sriram, Fangyuan Xu, Eunsol Choi, Greg Durrett
Keywords: Recent work, incorporate evidence retrieved, addresses a realistic, web to decide, models incorporate evidence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 FEVER Workshop

Abstract:Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is in retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires more inferences. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show our gains can be transferred to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.
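
The abstract does not spell out the contrastive objective, so the following is only a generic InfoNCE-style sketch of what fine-tuning a retriever with positive and negative evidence typically looks like. The dot-product similarity, the `temperature` value, and the function names are assumptions for illustration.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query embedding.

    The loss is small when the query is closer to the positive evidence
    embedding than to any of the negative embeddings.
    """
    logits = [_dot(query, positive) / temperature]
    logits += [_dot(query, negative) / temperature for negative in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

In a setup like the one described, the positive could be a document that answers an annotated subquestion, and the negatives could be other retrieved documents that do not.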

[AI-67] Graph Fourier Neural Kernels (G-FuNK): Learning Solutions of Nonlinear Diffusive Parametric PDEs on Multiple Domains

Link: https://arxiv.org/abs/2410.04655
Authors: Shane E. Loeffler, Zan Ahmad, Syed Yusuf Ali, Carolyna Yamamoto, Dan M. Popescu, Alana Yee, Yash Lal, Natalia Trayanova, Mauro Maggioni
Keywords: Predicting time-dependent dynamics, non-linear partial differential, challenging task motivated, Predicting time-dependent, Fourier Neural Kernels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Spectral Theory (math.SP); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Predicting time-dependent dynamics of complex systems governed by non-linear partial differential equations (PDEs) with varying parameters and domains is a challenging task motivated by applications across various fields. We introduce a novel family of neural operators based on our Graph Fourier Neural Kernels, designed to learn solution generators for nonlinear PDEs in which the highest-order term is diffusive, across multiple domains and parameters. G-FuNK combines components that are parameter- and domain-adapted with others that are not. The domain-adapted components are constructed using a weighted graph on the discretized domain, where the graph Laplacian approximates the highest-order diffusive term, ensuring boundary condition compliance and capturing the parameter and domain-specific behavior. Meanwhile, the learned components transfer across domains and parameters via Fourier Neural Operators. This approach naturally embeds geometric and directional information, improving generalization to new test domains without need for retraining the network. To handle temporal dynamics, our method incorporates an integrated ODE solver to predict the evolution of the system. Experiments show G-FuNK’s capability to accurately approximate heat, reaction diffusion, and cardiac electrophysiology equations across various geometries and anisotropic diffusivity fields. G-FuNK achieves low relative errors on unseen domains and fiber fields, significantly accelerating predictions compared to traditional finite-element solvers.
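
To illustrate why a graph Laplacian on the discretized domain can stand in for the diffusive term, here is a minimal sketch of heat diffusion on a weighted graph. It covers only that one ingredient, with no learned components and a plain explicit Euler step instead of the integrated ODE solver mentioned in the abstract; the function names and the step size are assumptions.

```python
def laplacian_apply(weights, u):
    """Return L u for the graph Laplacian L = D - W.

    weights is a symmetric matrix of non-negative edge weights and u holds
    one value per node.
    """
    result = []
    for i, row in enumerate(weights):
        degree = sum(row)
        neighbor_sum = sum(w * u_j for w, u_j in zip(row, u))
        result.append(degree * u[i] - neighbor_sum)
    return result


def diffuse(weights, u0, dt, steps):
    """Integrate du/dt = -L u with explicit Euler steps of size dt."""
    u = list(u0)
    for _ in range(steps):
        lu = laplacian_apply(weights, u)
        u = [u_i - dt * lu_i for u_i, lu_i in zip(u, lu)]
    return u
```

On a three-node path graph with unit weights, heat placed on the first node spreads until all nodes hold the same value, and the total is conserved at every step. Explicit Euler is stable here only when `dt` is small relative to the largest Laplacian eigenvalue.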

[AI-68] Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Link: https://arxiv.org/abs/2410.04652
Authors: Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer
Keywords: augmented reality benefits, Seamless integration, integration of virtual, worlds in augmented, augmented reality
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, accepted to IEEE ISMAR 2024

Abstract:Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically “understanding” the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose “in-situ” machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at (this https URL) to encourage further exploration and research in spatially aware AI.

[AI-69] DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications

Link: https://arxiv.org/abs/2410.04631
Authors: Mathias Jackermeier, Alessandro Abate
Keywords: Linear temporal logic, temporally extended tasks, Linear temporal, temporal logic, temporally extended
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Linear temporal logic (LTL) has recently been adopted as a powerful formalism for specifying complex, temporally extended tasks in reinforcement learning (RL). However, learning policies that efficiently satisfy arbitrary specifications not observed during training remains a challenging problem. Existing approaches suffer from several shortcomings: they are often only applicable to finite-horizon fragments of LTL, are restricted to suboptimal solutions, and do not adequately handle safety constraints. In this work, we propose a novel learning approach to address these concerns. Our method leverages the structure of Büchi automata, which explicitly represent the semantics of LTL specifications, to learn policies conditioned on sequences of truth assignments that lead to satisfying the desired formulae. Experiments in a variety of discrete and continuous domains demonstrate that our approach is able to zero-shot satisfy a wide range of finite- and infinite-horizon specifications, and outperforms existing methods in terms of both satisfaction probability and efficiency.

[AI-70] Passage Retrieval of Polish Texts Using OKAPI BM25 and an Ensemble of Cross Encoders

Link: https://arxiv.org/abs/2410.04620
Authors: Jakub Pokrywka
Keywords: Passage Retrieval challenge, Passage Retrieval, traditionally relied, relied on lexical, lexical methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Passage Retrieval has traditionally relied on lexical methods like TF-IDF and BM25. Recently, some neural network models have surpassed these methods in performance. However, these models face challenges, such as the need for large annotated datasets and adapting to new domains. This paper presents a winning solution to the PolEval 2023 Task 3: Passage Retrieval challenge, which involves retrieving passages of Polish texts in three domains: trivia, legal, and customer support. However, only the trivia domain was used for training and development data. The method used the OKAPI BM25 algorithm to retrieve documents and an ensemble of publicly available multilingual Cross Encoders for Reranking. Fine-tuning the reranker models slightly improved performance but only in the training domain, while it worsened in other domains.
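
The two-stage pipeline in this abstract (BM25 retrieval, then reranking with an ensemble) can be sketched as follows. This is an illustrative outline rather than the competition system: the BM25 parameters `k1` and `b`, whitespace tokenization, score averaging as the ensembling rule, and the two toy scoring functions that stand in for real cross-encoders are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (the "+ 1" inside the logarithm).
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


def retrieve_and_rerank(query, docs, rerankers, top_n=2):
    """Take the BM25 top_n, then reorder them by the mean reranker score."""
    tokenized = [doc.lower().split() for doc in docs]
    first_stage = bm25_scores(query.lower().split(), tokenized)
    candidates = sorted(range(len(docs)), key=lambda i: first_stage[i],
                        reverse=True)[:top_n]

    def ensemble_score(i):
        return sum(r(query, docs[i]) for r in rerankers) / len(rerankers)

    return sorted(candidates, key=ensemble_score, reverse=True)


# Toy stand-ins for cross-encoders: any callable (query, passage) -> float.
def overlap_reranker(query, passage):
    return float(len(set(query.split()) & set(passage.split())))


def jaccard_reranker(query, passage):
    q, p = set(query.split()), set(passage.split())
    return len(q & p) / len(q | p)
```

In practice each reranker would wrap a multilingual cross-encoder model. Averaging raw scores assumes the models produce comparable ranges, so a real ensemble may need score normalization or rank-based fusion instead.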

[AI-71] Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Link: https://arxiv.org/abs/2410.04612
Authors: Zhaolin Gao,Wenhao Zhan,Jonathan D. Chang,Gokul Swamy,Kianté Brantley,Jason D. Lee,Wen Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, REFUEL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate Q-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at this https URL, and models trained by REFUEL can be found at this https URL.

[AI-72] Hammer: Robust Function-Calling for On-Device Language Models via Function Masking

Link: https://arxiv.org/abs/2410.04587
Authors: Qiqiang Lin,Muning Wen,Qiuying Peng,Guanyu Nie,Junwei Liao,Jun Wang,Xiaoyun Mo,Jiamu Zhou,Cheng Cheng,Yin Zhao,Jun Wang,Weinan Zhang
Keywords: Large language models, API calls, Large language, tools and API, demonstrated impressive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Large language models have demonstrated impressive value in performing as autonomous agents when equipped with external tools and API calls. Nonetheless, effectively harnessing their potential for executing complex tasks crucially relies on enhancements in their function calling capabilities. This paper identifies a critical gap in existing function calling models, where performance varies significantly across benchmarks, often due to being misled by specific naming conventions. To address such an issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances models’ sensitivity to irrelevant functions and incorporates function masking techniques to minimize being misled. Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks, achieving state-of-the-art results. Our open source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function calling performance.

[AI-73] Ranking Policy Learning via Marketplace Expected Value Estimation From Observational Data

Link: https://arxiv.org/abs/2410.04568
Authors: Ehsan Ebrahimzadeh,Nikhil Monga,Hang Gao,Alex Cozzi,Abraham Bagherjeiran
Keywords: decision making framework, reward optimization problem, expected reward optimization, expected reward, ranking policy
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 9 pages

Abstract:We develop a decision-making framework to cast the problem of learning a ranking policy for search or recommendation engines in a two-sided e-commerce marketplace as an expected reward optimization problem using observational data. As a value allocation mechanism, the ranking policy allocates retrieved items to the designated slots so as to maximize the user utility from the slotted items, at any given stage of the shopping journey. The objective of this allocation can in turn be defined with respect to the underlying probabilistic user browsing model as the expected number of interaction events on presented items matching the user intent, given the ranking context. Through recognizing the effect of ranking as an intervention action to inform users’ interactions with slotted items and the corresponding economic value of the interaction events for the marketplace, we formulate the expected reward of the marketplace as the collective value from all presented ranking actions. The key element in this formulation is a notion of context value distribution, which signifies not only the attribution of value to ranking interventions within a session but also the distribution of marketplace reward across user sessions. We build empirical estimates for the expected reward of the marketplace from observational data that account for the heterogeneity of economic value across session contexts as well as the distribution shifts in learning from observational user activity data. The ranking policy can then be trained by optimizing the empirical expected reward estimates via standard Bayesian inference techniques. We report empirical results for a product search ranking task in a major e-commerce platform demonstrating the fundamental trade-offs governed by ranking policies trained on empirical reward estimates with respect to extreme choices of the context value distribution.

[AI-74] Modeling Social Media Recommendation Impacts Using Academic Networks: A Graph Neural Network Approach

Link: https://arxiv.org/abs/2410.04552
Authors: Sabrina Guidotti,Gregor Donabauer,Simone Somazzi,Udo Kruschwitz,Davide Taibi,Dimitri Ognibene
Keywords: highlighted potential negative, potential negative impacts, shape user behavior, society and individuals, largely driven
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The widespread use of social media has highlighted potential negative impacts on society and individuals, largely driven by recommendation algorithms that shape user behavior and social dynamics. Understanding these algorithms is essential but challenging due to the complex, distributed nature of social media networks as well as limited access to real-world data. This study proposes to use academic social networks as a proxy for investigating recommendation systems in social media. By employing Graph Neural Networks (GNNs), we develop a model that separates the prediction of academic infosphere from behavior prediction, allowing us to simulate recommender-generated infospheres and assess the model’s performance in predicting future co-authorships. Our approach aims to improve our understanding of recommendation systems’ roles and social networks modeling. To support the reproducibility of our work, we make our implementations publicly available: this https URL

[AI-75] Pullback Flow Matching on Data Manifolds

Link: https://arxiv.org/abs/2410.04543
Authors: Friso de Kruiff,Erik Bekkers,Ozan Öktem,Carola-Bibiane Schönlieb,Willem Diepeveen
Keywords: Pullback Flow Matching, Riemannian Flow Matching, propose Pullback Flow, Flow Matching, training Riemannian Flow
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Biomolecules (q-bio.BM)

Abstract:We propose Pullback Flow Matching (PFM), a novel framework for generative modeling on data manifolds. Unlike existing methods that assume or learn restrictive closed-form manifold mappings for training Riemannian Flow Matching (RFM) models, PFM leverages pullback geometry and isometric learning to preserve the underlying manifold’s geometry while enabling efficient generation and precise interpolation in latent space. This approach not only facilitates closed-form mappings on the data manifold but also allows for designable latent spaces, using assumed metrics on both data and latent manifolds. By enhancing isometric learning through Neural ODEs and proposing a scalable training objective, we achieve a latent space more suitable for interpolation, leading to improved manifold learning and generative performance. We demonstrate PFM’s effectiveness through applications in synthetic data, protein dynamics and protein sequence data, generating novel proteins with specific properties. This method shows strong potential for drug discovery and materials science, where generating novel samples with specific properties is of great interest.

[AI-76] On Evaluating LLMs Capabilities as Functional Approximators: A Bayesian Perspective

Link: https://arxiv.org/abs/2410.04541
Authors: Shoaib Ahmed Siddiqui,Yanzhi Chen,Juyeon Heo,Menglin Xia,Adrian Weller
Keywords: Large Language Models, applied Large Language, successfully applied Large, Language Models, Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent works have successfully applied Large Language Models (LLMs) to function modeling tasks. However, the reasons behind this success remain unclear. In this work, we propose a new evaluation framework to comprehensively assess LLMs’ function modeling abilities. By adopting a Bayesian perspective of function modeling, we discover that LLMs are relatively weak in understanding patterns in raw data, but excel at utilizing prior knowledge about the domain to develop a strong understanding of the underlying function. Our findings offer new insights about the strengths and limitations of LLMs in the context of function modeling.

[AI-77] FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

Link: https://arxiv.org/abs/2410.04526
Authors: Siqiao Xue,Tingting Chen,Fan Zhou,Qingyang Dai,Zhixuan Chu,Hongyuan Mei
Keywords: multilingual multimodal question, financial multilingual multimodal, multimodal question answering, multilingual multimodal, introduce FAMMA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of multimodal large language models (MLLMs) in answering questions that require advanced financial knowledge and sophisticated reasoning. It includes 1,758 meticulously collected question-answer pairs from university textbooks and exams, spanning 8 major subfields in finance including corporate finance, asset management, and financial engineering. Some of the QA pairs are written in Chinese or French, while a majority of them are in English. These questions are presented in a mixed format combining text and heterogeneous image types, such as charts, tables, and diagrams. We evaluate a range of state-of-the-art MLLMs on our benchmark, and our analysis shows that FAMMA poses a significant challenge for these models. Even advanced systems like GPT-4o and Claude-35-Sonnet achieve only 42% accuracy. Additionally, the open-source Qwen2-VL lags notably behind its proprietary counterparts. Lastly, we explore GPT o1-style reasoning chains to enhance the models’ reasoning capabilities, which significantly improve error correction. Our FAMMA benchmark will facilitate future research to develop expert systems in financial QA. The leaderboard is available at this https URL .

[AI-78] Semi-Markovian Planning to Coordinate Aerial and Maritime Medical Evacuation Platforms

Link: https://arxiv.org/abs/2410.04523
Authors: Mahdi Al-Husseini,Kyle H. Wray,Mykel J. Kochenderfer
Keywords: watercraft exchange points, exchange points, watercraft exchange, maritime environments, exchange
Categories: Artificial Intelligence (cs.AI)

Abstract:The transfer of patients between two aircraft using an underway watercraft increases medical evacuation reach and flexibility in maritime environments. The selection of any one of multiple underway watercraft for patient exchange is complicated by participating aircraft utilization history and a participating watercraft position and velocity. The selection problem is modeled as a semi-Markov decision process with an action space including both fixed land and moving watercraft exchange points. Monte Carlo tree search with root parallelization is used to select optimal exchange points and determine aircraft dispatch times. Model parameters are varied in simulation to identify representative scenarios where watercraft exchange points reduce incident response times. We find that an optimal policy with watercraft exchange points outperforms an optimal policy without watercraft exchange points and a greedy policy by 35% and 40%, respectively. In partnership with the United States Army, we deploy for the first time the watercraft exchange point by executing a mock patient transfer with a manikin between two HH-60M medical evacuation helicopters and an underway Army Logistic Support Vessel south of the Hawaiian island of Oahu. Both helicopters were dispatched in accordance with our optimized decision strategy.

[AI-79] LRHP: Learning Representations for Human Preferences via Preference Pairs

Link: https://arxiv.org/abs/2410.04503
Authors: Chenglong Wang,Yang Gan,Yifu Huo,Yongyu Mu,Qiaozhi He,Murun Yang,Tong Xiao,Chunliang Zhang,Tongran Liu,Jingbo Zhu
Keywords: human-preference alignment training, improve human-preference alignment, developed numerous preference, numerous preference datasets, preference datasets consisting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as “preferred” or “dispreferred”. These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing these human preferences as a numerical value complicates the analysis of these preferences and restricts their broader applications other than RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (namely LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building upon the human preferences in representations, we achieve strong performance in both tasks, significantly outperforming baselines.

[AI-80] Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels

Link: https://arxiv.org/abs/2410.04501
Authors: Vy Nguyen,Chau Pham
Keywords: suicidal thoughts highlights, Social media, increasing frequency, thoughts highlights, highlights the importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach involving prompting with Qwen2-72B-Instruct, and using fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves the detection accuracy, by 5 percentage points compared with the individual models. It achieves a weighted F1 score of 0.770 on the public test set, and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLMs affects the prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at this https URL.
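The F1 figures reported above are presumably support-weighted averages of per-class F1, which is the usual meaning of a weighted F1 score. A minimal sketch of that metric, assuming plain Python lists of labels (the function name `weighted_f1` is hypothetical and not taken from the paper's code):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score
```

Weighting by support means that frequent classes dominate the score, which matters when risk levels are imbalanced in the data.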

[AI-81] Adjusting Pretrained Backbones for Performativity

Link: https://arxiv.org/abs/2410.04499
Authors: Berker Demirel,Lingjing Kong,Kun Zhang,Theofanis Karaletsos,Celestine Mendler-Dünner,Francesco Locatello
Keywords: widespread deployment, influence their environment, deep learning models, deep learning, models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the widespread deployment of deep learning models, they influence their environment in various ways. The induced distribution shifts can lead to unexpected performance degradation in deployed models. Existing methods to anticipate performativity typically incorporate information about the deployed model into the feature vector when predicting future outcomes. While enjoying appealing theoretical properties, modifying the input dimension of the prediction task is often not practical. To address this, we propose a novel technique to adjust pretrained backbones for performativity in a modular way, achieving better sample efficiency and enabling the reuse of existing deep learning assets. Focusing on performative label shift, the key idea is to train a shallow adapter module to perform a Bayes-optimal label shift correction to the backbone’s logits given a sufficient statistic of the model to be deployed. As such, our framework decouples the construction of input-specific feature embeddings from the mechanism governing performativity. Motivated by dynamic benchmarking as a use-case, we evaluate our approach under adversarial sampling, for vision and language tasks. We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations. More broadly, our work provides a first baseline for addressing performativity in deep learning.
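For intuition, the classical closed-form label shift correction on logits can be sketched as below. This is not the paper's learned adapter: it assumes the source and target label priors are known, and that only the label marginal changes between them. The function name `label_shift_correct` and its arguments are hypothetical.

```python
import math


def label_shift_correct(logits, source_prior, target_prior):
    """Reweight class logits by log(target_prior / source_prior), then apply softmax."""
    adjusted = [
        z + math.log(q) - math.log(p)
        for z, p, q in zip(logits, source_prior, target_prior)
    ]
    # Numerically stable softmax over the adjusted logits.
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]
```

With uninformative logits the corrected prediction reduces to the target prior, which is a quick sanity check for the adjustment.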

[AI-82] Generalizability analysis of deep learning predictions of human brain responses to augmented and semantically novel visual stimuli

Link: https://arxiv.org/abs/2410.04497
Authors: Valentyn Piskovskyi,Riccardo Chimisso,Sabrina Patania,Tom Foulsham,Giuseppe Vizzari,Dimitri Ognibene
Keywords: neural network-based approach, image enhancement techniques, investigate the soundness, soundness and utility, network-based approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:The purpose of this work is to investigate the soundness and utility of a neural network-based approach as a framework for exploring the impact of image enhancement techniques on visual cortex activation. In a preliminary study, we prepare a set of state-of-the-art brain encoding models, selected among the top 10 methods that participated in The Algonauts Project 2023 Challenge [16]. We analyze their ability to make valid predictions about the effects of various image enhancement techniques on neural responses. Given the impossibility of acquiring the actual data due to the high costs associated with brain imaging procedures, our investigation builds upon a series of experiments. Specifically, we analyze the ability of brain encoders to estimate the cerebral reaction to various augmentations by evaluating the response to augmentations targeting objects (i.e., faces and words) with known impact on specific areas. Moreover, we study the predicted activation in response to objects unseen during training, exploring the impact of semantically out-of-distribution stimuli. We provide relevant evidence for the generalization ability of the models forming the proposed framework, which appears to be promising for the identification of the optimal visual augmentation filter for a given task, model-driven design strategies as well as for AR and VR applications.

[AI-83] Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification NEURIPS2024

Link: https://arxiv.org/abs/2410.04492
Authors: Zhaorui Tan,Xi Yang,Qiufeng Wang,Anh Nguyen,Kaizhu Huang
Keywords: Vision models excel, Vision models, discovering novel categories, struggle to generalize, Vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024 as a Spotlight

Abstract:Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. A logical regularization termed L-Reg is derived which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract the salient features, such as faces to persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.

[AI-84] Knowledge-Guided Dynamic Modality Attention Fusion Framework for Multimodal Sentiment Analysis EMNLP

Link: https://arxiv.org/abs/2410.04491
Authors: Xinyu Feng,Yuming Lin,Lihua He,You Li,Liang Chang,Ya Zhou
Keywords: Multimodal Sentiment Analysis, utilizes multimodal data, Attention Fusion Framework, Sentiment Analysis, dominant modality
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted to EMNLP Findings 2024

Abstract:Multimodal Sentiment Analysis (MSA) utilizes multimodal data to infer the users’ sentiment. Previous methods focus on equally treating the contribution of each modality or statically using text as the dominant modality to conduct interaction, which neglects the situation where each modality may become dominant. In this paper, we propose a Knowledge-Guided Dynamic Modality Attention Fusion Framework (KuDA) for multimodal sentiment analysis. KuDA uses sentiment knowledge to guide the model in dynamically selecting the dominant modality and adjusting the contributions of each modality. In addition, with the obtained multimodal representation, the model can further highlight the contribution of the dominant modality through the correlation evaluation loss. Extensive experiments on four MSA benchmark datasets indicate that KuDA achieves state-of-the-art performance and is able to adapt to different scenarios of dominant modality.

[AI-85] A Pluggable Common Sense-Enhanced Framework for Knowledge Graph Completion

Link: https://arxiv.org/abs/2410.04488
Authors: Guanglin Niu,Bo Li,Siling Feng
Keywords: infer missing facts, knowledge-intensive applications, KGC, aim to infer, infer missing
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 7 figures, 9 tables

Abstract:Knowledge graph completion (KGC) tasks aim to infer missing facts in a knowledge graph (KG) for many knowledge-intensive applications. However, existing embedding-based KGC approaches primarily rely on factual triples, potentially leading to outcomes inconsistent with common sense. Besides, generating explicit common sense is often impractical or costly for a KG. To address these challenges, we propose a pluggable common sense-enhanced KGC framework that incorporates both fact and common sense for KGC. This framework is adaptable to different KGs based on their entity concept richness and has the capability to automatically generate explicit or implicit common sense from factual triples. Furthermore, we introduce common sense-guided negative sampling and a coarse-to-fine inference approach for KGs with rich entity concepts. For KGs without concepts, we propose a dual scoring scheme involving a relation-aware concept embedding mechanism. Importantly, our approach can be integrated as a pluggable module for many knowledge graph embedding (KGE) models, facilitating joint common sense and fact-driven training and inference. The experiments illustrate that our framework exhibits good scalability and outperforms existing models across various KGC tasks.

[AI-86] Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench

Link: https://arxiv.org/abs/2410.04485
Authors: Anton Cheshkov,Pavel Zadorozhny,Rodion Levichev,Evgeny Maslov,Ronaldo Franco Jaldin
Keywords: Automatic program repair, Automatic program, conversational patch generation, human activity, Patch generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 3 pages, 2 figures, 1 algorithm, appendix

Abstract:Automatic program repair at project level may open yet-to-be-seen opportunities in various fields of human activity. Since the SWE-Bench challenge was presented, we have seen numerous solutions. Patch generation is a part of program repair, and test suite-based conversational patch generation has proven its effectiveness. However, the potential of conversational patch generation has not yet been specifically estimated on SWE-Bench. This study reports experimental results aimed at evaluating the individual effectiveness of conversational patch generation on problems from SWE-Bench. The experiments show that a simple conversational pipeline based on LLaMA 3.1 70B can generate valid patches in 47% of cases, which is comparable to the state-of-the-art in program repair on SWE-Bench.

[AI-87] Learning to Solve Abstract Reasoning Problems with Neurosymbolic Program Synthesis and Task Generation

Link: https://arxiv.org/abs/2410.04480
Authors: Jakub Bednarek,Krzysztof Krawiec
Keywords: tackle newly encountered, newly encountered problems, solve problems comprehensively, tackle newly, abstractly and reason
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 18th International Conference on Neural-Symbolic Learning and Reasoning

Abstract:The ability to think abstractly and reason by analogy is a prerequisite to rapidly adapt to new conditions, tackle newly encountered problems by decomposing them, and synthesize knowledge to solve problems comprehensively. We present TransCoder, a method for solving abstract problems based on neural program synthesis, and conduct a comprehensive analysis of decisions made by the generative module of the proposed architecture. At the core of TransCoder is a typed domain-specific language, designed to facilitate feature engineering and abstract reasoning. In training, we use the programs that failed to solve tasks to generate new tasks and gather them in a synthetic dataset. As each synthetic task created in this way has a known associated program (solution), the model is trained on them in supervised mode. Solutions are represented in a transparent programmatic form, which can be inspected and verified. We demonstrate TransCoder’s performance using the Abstract Reasoning Corpus dataset, for which our framework generates tens of thousands of synthetic problems with corresponding solutions and facilitates systematic progress in learning.

[AI-88] Revisiting In-context Learning Inference Circuit in Large Language Models ICLR2025

Link: https://arxiv.org/abs/2410.04468
Authors: Hakaze Cho,Mariko Kato,Yoshihiro Sakai,Naoya Inoue
Keywords: emerging few-shot learning, few-shot learning paradigm, In-context Learning, ICL, few-shot learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 37 figures, 6 tables, ICLR 2025 under review

Abstract:In-context Learning (ICL) is an emerging few-shot learning paradigm on Language Models (LMs) with inner mechanisms unexplored. Existing works describe the inner processing of ICL, but they struggle to capture all the inference phenomena in large language models. Therefore, this paper proposes a comprehensive circuit to model the inference dynamics and tries to explain the observed phenomena of ICL. In detail, we divide ICL inference into 3 major operations: (1) Summarize: LMs encode every input text (demonstrations and queries) into linear representation in the hidden states with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge the encoded representations of demonstrations with their corresponding label tokens to produce joint representations of labels and demonstrations. (3) Feature Retrieval and Copy: LMs search the joint representations similar to the query representation on a task subspace, and copy the searched representations into the query. Then, language model heads capture these copied label representations to a certain extent and decode them into predicted labels. The proposed inference circuit successfully captures many phenomena observed during the ICL process, making it a comprehensive and practical explanation of the ICL inference process. Moreover, ablation analysis by disabling the proposed steps seriously damages the ICL performance, suggesting the proposed inference circuit is a dominating mechanism. Additionally, we confirm and list some bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.

[AI-89] An Attention-Based Algorithm for Gravity Adaptation Zone Calibration

Link: https://arxiv.org/abs/2410.04457
Authors: Chen Yu
Keywords: gravity adaptation zone, adaptation zone calibration, gravity field, Accurate calibration, gravity adaptation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments: 15 pages

Abstract:Accurate calibration of gravity adaptation zones is of great significance in fields such as underwater navigation, geophysical exploration, and marine engineering. With the increasing application of gravity field data in these areas, traditional calibration methods based on single features are becoming inadequate for capturing the complex characteristics of gravity fields and addressing the intricate interrelationships among multidimensional data. This paper proposes an attention-enhanced algorithm for gravity adaptation zone calibration. By introducing an attention mechanism, the algorithm adaptively fuses multidimensional gravity field features and dynamically assigns feature weights, effectively solving the problems of multicollinearity and redundancy inherent in traditional feature selection methods, significantly improving calibration accuracy and robustness. In addition, a large-scale gravity field dataset with over 10,000 sampling points was constructed, and Kriging interpolation was used to enhance the spatial resolution of the data, providing a reliable data foundation for model training and evaluation. We conducted both qualitative and quantitative experiments on several classical machine learning models (such as SVM, GBDT, and RF), and the results demonstrate that the proposed algorithm significantly improves performance across these models, outperforming other traditional feature selection methods. The method proposed in this paper provides a new solution for gravity adaptation zone calibration, showing strong generalization ability and potential for application in complex environments. The code is available at this https URL.

[AI-90] MindScope: Exploring cognitive biases in large language models through Multi-Agent Systems ECAI2024

Link: https://arxiv.org/abs/2410.04452
Authors: Zhentao Xie, Jiabao Zhao, Yilei Wang, Jinxin Shi, Yanhong Bai, Xingjiao Wu, Liang He
Keywords: Detecting cognitive biases, existing cognitive biases, large language models, cognitive biases, Detecting cognitive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures. Accepted for presentation at the 2024 European Conference on Artificial Intelligence (ECAI 2024)

Abstract:Detecting cognitive biases in large language models (LLMs) is a fascinating task that aims to probe the existing cognitive biases within these models. Current methods for detecting cognitive biases in language models generally suffer from incomplete detection capabilities and a restricted range of detectable bias types. To address this issue, we introduced the ‘MindScope’ dataset, which distinctively integrates static and dynamic elements. The static component comprises 5,170 open-ended questions spanning 72 cognitive bias categories. The dynamic component leverages a rule-based, multi-agent communication framework to facilitate the generation of multi-round dialogues. This framework is flexible and readily adaptable for various psychological experiments involving LLMs. In addition, we introduce a multi-agent detection method applicable to a wide range of detection tasks, which integrates Retrieval-Augmented Generation (RAG), competitive debate, and a reinforcement learning-based decision module. Demonstrating substantial effectiveness, this method has shown to improve detection accuracy by as much as 35.10% compared to GPT-4. Codes and appendix are available at this https URL.

[AI-91] Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Link: https://arxiv.org/abs/2410.04444
Authors: Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, William Yang Wang
Keywords: large language models, language models, Gödel Agent, rapid advancement, advancement of large
Categories: Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of AI-driven agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the globally optimal agent design. In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. Gödel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on mathematical reasoning and complex agent tasks demonstrate that implementation of Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.

[AI-92] Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Link: https://arxiv.org/abs/2410.04439
Authors: Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su
Keywords: demonstrated impressive achievements, backbone models, empower backbone models, demonstrated impressive, impressive achievements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.

[AI-93] CAPEEN: Image Captioning with Early Exits and Knowledge Distillation EMNLP

Link: https://arxiv.org/abs/2410.04433
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Keywords: Deep neural networks, made significant progress, recognizing visual elements, generating descriptive text, Deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in EMNLP (Findings) 2024

Abstract:Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes from increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning as it requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN to improve the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions could drift from that of training samples, we introduce a variant A-CAPEEN to adapt the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN gains a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at this https URL.
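
The confidence-based exit rule described in this abstract can be sketched roughly as follows. This is only an illustration, not CAPEEN's implementation: the function names, the use of the maximum softmax probability as the confidence score, and the single fixed threshold are all assumptions (the paper learns its thresholds from training data and trains the exits with knowledge distillation).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(per_layer_logits, threshold):
    """Return (predicted_index, exit_layer) for one input.

    per_layer_logits holds one logit vector per exit, ordered from the
    shallowest exit to the final layer (hypothetical input format).
    Inference stops at the first exit whose confidence exceeds the
    threshold; the final layer always answers if no earlier exit does.
    """
    if not per_layer_logits:
        raise ValueError("at least one exit is required")
    last = len(per_layer_logits) - 1
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        confidence = max(probs)  # assumed confidence measure
        if confidence > threshold or layer == last:
            return probs.index(confidence), layer
```

With a lower threshold more inputs leave at shallow exits, which trades accuracy for speed; the threshold is the knob that A-CAPEEN, per the abstract, adapts online.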

[AI-94] DAdEE: Unsupervised Domain Adaptation in Early Exit PLMs EMNLP

Link: https://arxiv.org/abs/2410.04424
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Keywords: Pre-trained Language Models, exhibit good accuracy, large size results, Pre-trained Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in EMNLP (Findings) 2024

Abstract:Pre-trained Language Models (PLMs) exhibit good accuracy and generalization ability across various tasks using self-supervision, but their large size results in high inference latency. Early Exit (EE) strategies handle the issue by allowing the samples to exit from classifiers attached to the intermediary layers, but they do not generalize well, as exit classifiers can be sensitive to domain changes. To address this, we propose Unsupervised Domain Adaptation in EE framework (DADEE) that employs multi-level adaptation using knowledge distillation. DADEE utilizes GAN-based adversarial adaptation at each layer to achieve domain-invariant representations, reducing the domain gap between the source and target domain across all layers. The attached exits not only speed up inference but also enhance domain adaptation by reducing catastrophic forgetting and mode collapse, making it more suitable for real-world scenarios. Experiments on tasks such as sentiment analysis, entailment classification, and natural language inference demonstrate that DADEE consistently outperforms not only early exit methods but also various domain adaptation methods under domain shift scenarios. The anonymized source code is available at this https URL.

[AI-95] Disentangling Regional Primitives for Image Generation

Link: https://arxiv.org/abs/2410.04421
Authors: Zhengting Chen, Lei Cheng, Lianghui Ding, Quanshi Zhang
Keywords: internal representation structure, feature component, neural network, image regions, paper presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a method to explain the internal representation structure of a neural network for image generation. Specifically, our method disentangles primitive feature components from the intermediate-layer feature of the neural network, which ensures that each feature component is exclusively used to generate a specific set of image regions. In this way, the generation of the entire image can be considered as the superposition of different pre-encoded primitive regional patterns, each being generated by a feature component. We find that the feature component can be represented as an OR relationship between the demands for generating different image regions, which is encoded by the neural network. Therefore, we extend the Harsanyi interaction to represent such an OR interaction to disentangle the feature component. Experiments show a clear correspondence between each feature component and the generation of specific image regions.

[AI-96] Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering

Link: https://arxiv.org/abs/2410.04415
Authors: Javier Marin
Keywords: Hamiltonian mechanics, paper introduces, introduces an innovative, innovative approach, approach to analyzing
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces an innovative approach to analyzing and improving multi-hop reasoning in AI systems by drawing inspiration from Hamiltonian mechanics. We propose a novel framework that maps reasoning chains in embedding spaces to Hamiltonian systems, allowing us to leverage powerful analytical tools from classical physics. Our method defines a Hamiltonian function that balances the progression of reasoning (kinetic energy) against the relevance to the question at hand (potential energy). Using this framework, we analyze a large dataset of reasoning chains from a multi-hop question-answering task, revealing intriguing patterns that distinguish valid from invalid reasoning. We show that valid reasoning chains have lower Hamiltonian energy and move in ways that make the best trade-off between getting more information and answering the right question. Furthermore, we demonstrate the application of this framework to steer the creation of more efficient reasoning algorithms within AI systems. Our results not only provide new insights into the nature of valid reasoning but also open up exciting possibilities for physics-inspired approaches to understanding and improving artificial intelligence.

[AI-97] Towards Understanding and Enhancing Security of Proof-of-Training for DNN Model Ownership Verification USENIX-SECURITY2025

Link: https://arxiv.org/abs/2410.04397
Authors: Yijia Chang, Hanrui Jiang, Chao Lin, Xinyi Huang, Jian Weng
Keywords: deep neural networks, neural networks, intellectual property, great economic, deep neural
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by USENIX Security 2025 (Major Revision - Accept)

Abstract:The great economic values of deep neural networks (DNNs) urge AI enterprises to protect their intellectual property (IP) for these models. Recently, proof-of-training (PoT) has been proposed as a promising solution to DNN IP protection, through which AI enterprises can utilize the record of DNN training process as their ownership proof. To prevent attackers from forging ownership proof, a secure PoT scheme should be able to distinguish honest training records from those forged by attackers. Although existing PoT schemes provide various distinction criteria, these criteria are based on intuitions or observations. The effectiveness of these criteria lacks clear and comprehensive analysis, resulting in existing schemes initially deemed secure being swiftly compromised by simple ideas. In this paper, we make the first move to identify distinction criteria in the style of formal methods, so that their effectiveness can be explicitly demonstrated. Specifically, we conduct systematic modeling to cover a wide range of attacks and then theoretically analyze the distinctions between honest and forged training records. The analysis results not only induce a universal distinction criterion, but also provide detailed reasoning to demonstrate its effectiveness in defending against attacks covered by our model. Guided by the criterion, we propose a generic PoT construction that can be instantiated into concrete schemes. This construction sheds light on the realization that trajectory matching algorithms, previously employed in data distillation, possess significant advantages in PoT construction. Experimental results demonstrate that our scheme can resist attacks that have compromised existing PoT schemes, which corroborates its superiority in security.

[AI-98] Algorithmic Capabilities of Random Transformers NEURIPS2024

Link: https://arxiv.org/abs/2410.04368
Authors: Ziqian Zhong, Jacob Andreas
Keywords: implement interpretable procedures, implement interpretable, interpretable procedures, procedures originate, associative recall
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024

Abstract:Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input–output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at this https URL.

[AI-99] VideoGuide: Improving Video Diffusion Models without Training Through a Teachers Guide

Link: https://arxiv.org/abs/2410.04364
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
Keywords: visual content creation, revolutionized visual content, preserving temporal consistency, content creation, generation remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 14 figures, Project Page: this http URL

Abstract:Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model’s denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: this http URL
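
The guidance step mentioned in this abstract, interpolating the guiding model's denoised samples into the sampling model's denoising process during early inference steps, might look roughly like the sketch below. The linear blend, the fixed weight `beta`, the `guided_steps` cutoff, and the function name are assumptions for illustration; the abstract does not specify the interpolation scheme.

```python
def blend_denoised(sample_x0, guide_x0, step, guided_steps, beta=0.5):
    """Blend two denoised estimates for one sampling step.

    sample_x0 and guide_x0 are flat lists of floats standing in for the
    denoised predictions of the sampling model and the guiding model.
    Guidance is applied only while step < guided_steps (the "early
    stages" of inference); afterwards the sampler's estimate is used
    unchanged. Linear interpolation is an assumed choice.
    """
    if len(sample_x0) != len(guide_x0):
        raise ValueError("estimates must have the same length")
    if step >= guided_steps:
        return list(sample_x0)
    return [(1.0 - beta) * s + beta * g for s, g in zip(sample_x0, guide_x0)]
```

Setting `beta=0.0` or `guided_steps=0` recovers the unguided sampler, which makes it easy to compare outputs with and without guidance.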

[AI-100] GenSim: A General Social Simulation Platform with Large Language Model based Agents

Link: https://arxiv.org/abs/2410.04360
Authors: Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Ji-Rong Wen
Keywords: large language models, human social behavior, leveraging LLM-based agents, language models, recent years
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) Abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) Supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) Incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.

[AI-101] MVP-Bench: Can Large Vision–Language Models Conduct Multi-level Visual Perception Like Humans?

Link: https://arxiv.org/abs/2410.04345
Authors: Guanzhen Li, Yuxi Xie, Min-Yen Kan
Keywords: including low-level object, multiple levels, low-level object recognition, perception, perform visual perception
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Visual-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perceptions. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. The state-of-the-art GPT-4o only achieves an accuracy of 56% on Yes/No questions, compared with 74% in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do. Our data and code are publicly available at this https URL.

[AI-102] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Link: https://arxiv.org/abs/2410.04332
Authors: Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner
Keywords: trained primarily based, inputs and outputs, trained primarily, primarily based, gradient routing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a neural network. Gradient routing applies data-dependent, weighted masks to gradients during backpropagation. These masks are supplied by the user in order to configure which parameters are updated by which data points. We show that gradient routing can be used to (1) learn representations which are partitioned in an interpretable way; (2) enable robust unlearning via ablation of a pre-specified network subregion; and (3) achieve scalable oversight of a reinforcement learner by localizing modules responsible for different behaviors. Throughout, we find that gradient routing localizes capabilities even when applied to a limited, ad-hoc subset of the data. We conclude that the approach holds promise for challenging, real-world applications where quality data are scarce.
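
To make the idea of data-dependent gradient masks concrete, here is a minimal sketch on a one-layer linear model with squared-error loss. It is an illustration under stated assumptions, not the authors' code: the function name, the per-parameter mask applied directly to parameter gradients, and the plain SGD update are all hypothetical simplifications (in a deep network the masks would be applied inside backpropagation).

```python
def routed_sgd_step(weights, x, y, mask, lr=0.1):
    """One SGD step on loss = (w . x - y)^2 with a gradient mask.

    mask has one weight per parameter and is chosen by the user for
    this particular data point: 1.0 lets the example update that
    parameter, 0.0 blocks it, and values in between scale the update.
    """
    if not (len(weights) == len(x) == len(mask)):
        raise ValueError("weights, x and mask must have the same length")
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - y
    grads = [2.0 * error * xi for xi in x]          # d(loss)/d(w_i)
    routed = [m * g for m, g in zip(mask, grads)]   # apply the mask
    return [w - lr * g for w, g in zip(weights, routed)]
```

Routing one class of examples with mask `[1, 0]` and another with `[0, 1]` means each parameter only ever learns from its own class, which is the kind of localization that later allows a subregion to be ablated.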

[AI-103] SONAR: A Synthetic AI-Audio Detection Framework and Benchmark

Link: https://arxiv.org/abs/2410.04324
Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei
Keywords: generative Artificial Intelligence, Artificial Intelligence, Recent advances, generative Artificial, realistic human-like audio
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent advances in Text-to-Speech (TTS) and Voice-Conversion (VC) using generative Artificial Intelligence (AI) technology have made it possible to generate high-quality and realistic human-like audio. This introduces significant challenges to distinguishing AI-synthesized speech from the authentic human voice and could raise potential issues of misuse for malicious purposes such as impersonation and fraud, spreading misinformation, deepfakes, and scams. However, existing detection techniques for AI-synthesized audio have not kept pace and often exhibit poor generalization across diverse datasets. In this paper, we introduce SONAR, a synthetic AI-Audio Detection Framework and Benchmark, aiming to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset sourced from 9 diverse audio synthesis platforms, including leading TTS providers and state-of-the-art TTS models. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems. Through extensive experiments, we reveal the generalization limitations of existing detection methods and demonstrate that foundation models exhibit stronger generalization capabilities, which can be attributed to their model size and the scale and quality of pretraining data. Additionally, we explore the effectiveness and efficiency of few-shot fine-tuning in improving generalization, highlighting its potential for tailored applications, such as personalized detection systems for specific entities or individuals. Code and dataset are available at this https URL.

[AI-104] Toward Debugging Deep Reinforcement Learning Programs with RLExplorer

Link: https://arxiv.org/abs/2410.04322
Authors: Rached Bouchoucha, Ahmed Haj Yahmed, Darshan Patil, Janarthanan Rajendran, Amin Nikanjam, Sarath Chandar, Foutse Khomh
Keywords: Deep reinforcement learning, Deep reinforcement, computer games, shown success, success in diverse
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in The International Conference on Software Maintenance and Evolution (ICSME 2024)

Abstract:Deep reinforcement learning (DRL) has shown success in diverse domains such as robotics, computer games, and recommendation systems. However, like any other software system, DRL-based software systems are susceptible to faults that pose unique challenges for debugging and diagnosing. These faults often result in unexpected behavior without explicit failures and error messages, making debugging difficult and time-consuming. Therefore, automating the monitoring and diagnosis of DRL systems is crucial to alleviate the burden on developers. In this paper, we propose RLExplorer, the first fault diagnosis approach for DRL-based software systems. RLExplorer automatically monitors training traces and runs diagnosis routines based on properties of the DRL learning dynamics to detect the occurrence of DRL-specific faults. It then logs the results of these diagnoses as warnings that cover theoretical concepts, recommended practices, and potential solutions to the identified faults. We conducted two sets of evaluations to assess RLExplorer. Our first evaluation of faulty DRL samples from Stack Overflow revealed that our approach can effectively diagnose real faults in 83% of the cases. Our second evaluation of RLExplorer with 15 DRL experts/developers showed that (1) RLExplorer could identify 3.6 times more defects than manual debugging and (2) RLExplorer is easily integrated into DRL applications.

[AI-105] Channel-Aware Throughput Maximization for Cooperative Data Fusion in CAV

Link: https://arxiv.org/abs/2410.04320
Authors: Haonan An, Zhengru Fang, Yuang Zhang, Senkang Hu, Xianhao Chen, Guowen Xu, Yuguang Fang
Keywords: enhanced sensing coverage, garnered significant attention, significant attention due, Connected and autonomous, extended perception range
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Connected and autonomous vehicles (CAVs) have garnered significant attention due to their extended perception range and enhanced sensing coverage. To address challenges such as blind spots and obstructions, CAVs employ vehicle-to-vehicle (V2V) communications to aggregate sensory data from surrounding vehicles. However, cooperative perception is often constrained by the limitations of achievable network throughput and channel quality. In this paper, we propose a channel-aware throughput maximization approach to facilitate CAV data fusion, leveraging a self-supervised autoencoder for adaptive data compression. We formulate the problem as a mixed integer programming (MIP) model, which we decompose into two sub-problems to derive optimal data rate and compression ratio solutions under given link conditions. An autoencoder is then trained to minimize bitrate with the determined compression ratio, and a fine-tuning strategy is employed to further reduce spectrum resource consumption. Experimental evaluation on the OpenCOOD platform demonstrates the effectiveness of our proposed algorithm, showing more than 20.19% improvement in network throughput and a 9.38% increase in average precision (AP@IoU) compared to state-of-the-art methods, with an optimal latency of 19.99 ms.

[AI-106] Self-Supervised Anomaly Detection in the Wild: Favor Joint Embeddings Methods

Link: https://arxiv.org/abs/2410.04289
Authors: Daniel Otero, Rafael Mateus, Randall Balestriero
Keywords: prevent costly failures, Accurate anomaly detection, vision-based infrastructure inspection, Accurate anomaly, SSL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate anomaly detection is critical in vision-based infrastructure inspection, where it helps prevent costly failures and enhances safety. Self-Supervised Learning (SSL) offers a promising approach by learning robust representations from unlabeled data. However, its application in anomaly detection remains underexplored. This paper addresses this gap by providing a comprehensive evaluation of SSL methods for real-world anomaly detection, focusing on sewer infrastructure. Using the Sewer-ML dataset, we evaluate lightweight models such as ViT-Tiny and ResNet-18 across SSL frameworks, including BYOL, Barlow Twins, SimCLR, DINO, and MAE, under varying class imbalance levels. Through 250 experiments, we rigorously assess the performance of these SSL methods to ensure a robust and comprehensive evaluation. Our findings highlight the superiority of joint-embedding methods like SimCLR and Barlow Twins over reconstruction-based approaches such as MAE, which struggle to maintain performance under class imbalance. Furthermore, we find that the SSL model choice is more critical than the backbone architecture. Additionally, we emphasize the need for better label-free assessments of SSL representations, as current methods like RankMe fail to adequately evaluate representation quality, making cross-validation without labels infeasible. Despite the remaining performance gap between SSL and supervised models, these findings highlight the potential of SSL to enhance anomaly detection, paving the way for further research in this underexplored area of SSL applications.

[AI-107] Mechanistic Behavior Editing of Language Models

Link: https://arxiv.org/abs/2410.04277
Authors: Joykirat Singh, Subhabrata Dutta, Tanmoy Chakraborty
Keywords: Large Language Models, text acquire language, Large Language, web-scale text acquire, Language Models trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models trained on web-scale text acquire language generation abilities that can solve a wide range of tasks, particularly when task knowledge is refined into the generative prior using in-context examples. However, spurious features learned from noisy data hinder their generalizability. Supervised finetuning can introduce task specificity, but introduces data inefficiency. Prior studies indicate that (i) noisy neural circuitries coexist with generalizable ones within LLMs, and (ii) finetuning typically enhances (or suppresses) existing abilities without introducing newer ones. Building upon these, we propose TaRot, a novel method for task adaptation. TaRot intervenes in the neural circuitries using learnable rotation matrices that are optimized using Bayesian Optimization, on labelled samples in the order of standard few-shot prompting examples. Experiments on multiple classification and generation tasks using LLMs of varying sizes reveal the efficacy of TaRot, improving upon both zero- as well as few-shot performance, with average improvements (across models and tasks) of 23.81% and 11.15%, respectively. The source code is available at this https URL.

[AI-108] Constructing Cloze Questions Generatively IJCNN

Link: https://arxiv.org/abs/2410.04266
Authors: Yicheng Sun, Jie Wang
Keywords: constructing cloze questions, generating multigram distractors, method called CQG, generative method called, called CQG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 5 tables, 2023 International Joint Conference on Neural Networks (IJCNN)

Abstract:We present a generative method called CQG for constructing cloze questions from a given article using neural networks and WordNet, with an emphasis on generating multigram distractors. Built on sense disambiguation, text-to-text transformation, WordNet’s synset taxonomies and lexical labels, CQG selects an answer key for a given sentence, segments it into a sequence of instances, and generates instance-level distractor candidates (IDCs) using a transformer and sibling synsets. It then removes inappropriate IDCs, ranks the remaining IDCs based on contextual embedding similarities, as well as synset and lexical relatedness, forms distractor candidates by combinatorially replacing instances with the corresponding top-ranked IDCs, and checks if they are legitimate phrases. Finally, it selects top-ranked distractor candidates based on contextual semantic similarities to the answer key. Experiments show that this method significantly outperforms SOTA results. Human judges also confirm the high quality of the generated distractors.

[AI-109] Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels

Link: https://arxiv.org/abs/2410.04256
Authors: Maria Marrium, Arif Mahmood, Mohammed Bennamoun
Keywords: Convolutional Neural Networks, deep neural networks, neural networks, Automatic annotation, introduce noisy training
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Automatic annotation of large-scale datasets can introduce noisy training data labels, which adversely affect the learning process of deep neural networks (DNNs). Consequently, Noisy Labels Learning (NLL) has become a critical research field for Convolutional Neural Networks (CNNs), though it remains less explored for Vision Transformers (ViTs). In this study, we evaluate the vulnerability of ViT fine-tuning to noisy labels and compare its robustness with CNNs. We also investigate whether NLL methods developed for CNNs are equally effective for ViTs. Using linear probing and MLP-K fine-tuning, we benchmark two ViT backbones (ViT-B/16 and ViT-L/16) using three commonly used classification losses: Cross Entropy (CE), Focal Loss (FL), and Mean Absolute Error (MAE), alongside six robust NLL methods: GCE, SCE, NLNL, APL, NCE+AGCE, and ANL-CE. The evaluation is conducted across six datasets including MNIST, CIFAR-10/100, WebVision, Clothing1M, and Food-101N. Furthermore, we explore whether implicit prediction entropy minimization contributes to ViT robustness against noisy labels, noting a general trend of prediction entropy reduction across most NLL methods. Building on this observation, we examine whether explicit entropy minimization could enhance ViT resilience to noisy labels. Our findings indicate that incorporating entropy regularization enhances the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods across both ViT backbones.
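To make the idea of explicit entropy regularization concrete, here is a minimal sketch in plain Python. It adds a prediction-entropy penalty to the cross-entropy loss for a single example. The function names and the weight `lam` are assumptions chosen for illustration; the paper's exact formulation and hyperparameters are not given in the abstract and may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p + eps) for p in probs)

def entropy_regularized_ce(logits, label, lam=0.1, eps=1e-12):
    """Cross entropy plus lam * prediction entropy (hypothetical weighting).

    Minimizing the second term pushes the model toward confident,
    low-entropy predictions.
    """
    probs = softmax(logits)
    ce = -math.log(probs[label] + eps)
    return ce + lam * entropy(probs)
```

With `lam=0.0` the function reduces to ordinary cross entropy, so the penalty can be switched on or off when comparing against a baseline loss.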

[AI-110] Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia EMNLP2024

Link: https://arxiv.org/abs/2410.04254
Authors: Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West
Keywords: turning isolated pieces, fundamental part, entity insertion, turning isolated, isolated pieces
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: EMNLP 2024; 24 pages; 62 figures

Abstract:Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is, both, relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

[AI-111] Contrastive Explanations That Anticipate Human Misconceptions Can Improve Human Decision-Making Skills

Link: https://arxiv.org/abs/2410.04253
Authors: Zana Buçinca, Siddharth Swaroop, Amanda E. Paluch, Finale Doshi-Velez, Krzysztof Z. Gajos
Keywords: People decision-making abilities, abilities often fail, fail to improve, contrastive explanations, explanations
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:People’s decision-making abilities often fail to improve or may even erode when they rely on AI for decision-support, even when the AI provides informative explanations. We argue this is partly because people intuitively seek contrastive explanations, which clarify the difference between the AI’s decision and their own reasoning, while most AI systems offer “unilateral” explanations that justify the AI’s decision but do not account for users’ thinking. To align human-AI knowledge on decision tasks, we introduce a framework for generating human-centered contrastive explanations that explain the difference between AI’s choice and a predicted, likely human choice about the same task. Results from a large-scale experiment (N = 628) demonstrate that contrastive explanations significantly enhance users’ independent decision-making skills compared to unilateral explanations, without sacrificing decision accuracy. Amid rising deskilling concerns, our research demonstrates that incorporating human reasoning into AI design can foster human skill development.

[AI-112] Enhancing Future Link Prediction in Quantum Computing Semantic Networks through LLM-Initiated Node Features

Link: https://arxiv.org/abs/2410.04251
Authors: Gilchan Park, Paul Baity, Byung-Jun Yoon, Adolfy Hoisie
Keywords: accelerate computational processes, solve complex problems, computer science, offering the potential, computational processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Quantum Physics (quant-ph)

Abstract:Quantum computing is rapidly evolving in both physics and computer science, offering the potential to solve complex problems and accelerate computational processes. The development of quantum chips necessitates understanding the correlations among diverse experimental conditions. Semantic networks built on scientific literature, representing meaningful relationships between concepts, have been used across various domains to identify knowledge gaps and novel concept combinations. Neural network-based approaches have shown promise in link prediction within these networks. This study proposes initializing node features using LLMs to enhance node representations for link prediction tasks in graph neural networks. LLMs can provide rich descriptions, reducing the need for manual feature creation and lowering costs. Our method, evaluated using various link prediction models on a quantum computing semantic network, demonstrated efficacy compared to traditional node embedding techniques.

[AI-113] Towards Propositional KLM-Style Defeasible Standpoint Logics

Link: https://arxiv.org/abs/2410.04245
Authors: Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph
Keywords: propositional KLM, weakened form, form of implication, implication into classical, KLM
Categories: Artificial Intelligence (cs.AI)

Abstract:The KLM approach to defeasible reasoning introduces a weakened form of implication into classical logic. This allows one to incorporate exceptions to general rules into a logical system, and for old conclusions to be withdrawn upon learning new contradictory information. Standpoint logics are a group of logics, introduced to the field of Knowledge Representation in the last 5 years, which allow for multiple viewpoints to be integrated into the same ontology, even when certain viewpoints may hold contradicting beliefs. In this paper, we aim to integrate standpoints into KLM propositional logic in a restricted setting. We introduce the logical system of Defeasible Restricted Standpoint Logic (DRSL) and define its syntax and semantics. Specifically, we integrate ranked interpretations and standpoint structures, which provide the semantics for propositional KLM and propositional standpoint logic respectively, in order to introduce ranked standpoint structures for DRSL. Moreover, we extend the non-monotonic entailment relation of rational closure from the propositional KLM case to the DRSL case. The main contribution of this paper is to characterize rational closure for DRSL both algorithmically and semantically, showing that rational closure can be characterized through a single representative ranked standpoint structure. Finally, we conclude that the semantic and algorithmic characterizations of rational closure are equivalent, and that entailment-checking for DRSL under rational closure is in the same complexity class as entailment-checking for propositional KLM.

[AI-114] Overview of Factify5WQA: Fact Verification through 5W Question-Answering AAAI2024

Link: https://arxiv.org/abs/2410.04236
Authors: Suryavardan Suresh, Anku Rani, Parth Patwa, Aishwarya Reganti, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Keywords: Researchers have found, spreads much times, times faster, faster than real, Fact verification
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at defactify3@aaai2024

Abstract:Researchers have found that fake news spreads many times faster than real news. This is a major problem, especially in today’s world where social media is the key source of news for many among the younger population. Fact verification, thus, becomes an important task and many media sites contribute to the cause. Manual fact verification is a tedious task, given the volume of fake news online. The Factify5WQA shared task aims to increase research towards automated fake news detection by providing a dataset with an aspect-based question answering based fact verification method. Each claim and its supporting document is associated with 5W questions that help compare the two information sources. Performance in the task is measured objectively by comparing answers using the BLEU score to gauge answer accuracy, followed by an accuracy measure of the classification. The task had submissions using custom training setups and pre-trained language models, among others. The best performing team posted an accuracy of 69.56%, which is a nearly 35% improvement over the baseline.

[AI-115] Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

Link: https://arxiv.org/abs/2410.04234
Authors: Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha
Keywords: undesired model responses, mitigate undesired model, widely employed, employed in deep, deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the functional homotopy method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20%-30% improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.

[AI-116] Improving Portfolio Optimization Results with Bandit Networks

Link: https://arxiv.org/abs/2410.04217
Authors: Gustavo de Freitas Fonseca, Lucas Coelho e Silva, Paulo André Lima de Castro
Keywords: Reinforcement Learning, Discounted Thompson Sampling, Adaptive Discounted Thompson, recommender systems, Thompson Sampling
Categories: Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)

Abstract:In Reinforcement Learning (RL), multi-armed Bandit (MAB) problems have found applications across diverse domains such as recommender systems, healthcare, and finance. Traditional MAB algorithms typically assume stationary reward distributions, which limits their effectiveness in real-world scenarios characterized by non-stationary dynamics. This paper addresses this limitation by introducing and evaluating novel Bandit algorithms designed for non-stationary environments. First, we present the Adaptive Discounted Thompson Sampling (ADTS) algorithm, which enhances adaptability through relaxed discounting and sliding window mechanisms to better respond to changes in reward distributions. We then extend this approach to the Portfolio Optimization problem by introducing the Combinatorial Adaptive Discounted Thompson Sampling (CADTS) algorithm, which addresses computational challenges within Combinatorial Bandits and improves dynamic asset allocation. Additionally, we propose a novel architecture called Bandit Networks, which integrates the outputs of ADTS and CADTS, thereby mitigating computational limitations in stock selection. Through extensive experiments using real financial market data, we demonstrate the potential of these algorithms and architectures in adapting to dynamic environments and optimizing decision-making processes. For instance, the proposed bandit network instances present superior performance when compared to classic portfolio optimization approaches, such as the capital asset pricing model, equal weights, risk parity, and Markowitz, with the best network presenting an out-of-sample Sharpe Ratio 20% higher than the best performing classical model.
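For readers unfamiliar with discounting in bandits, the following is a minimal sketch of plain discounted Thompson Sampling for Bernoulli rewards. It is not the paper's ADTS or CADTS algorithm, whose relaxed discounting and sliding-window details are not specified in the abstract. The class name, the discount factor `gamma`, and the Beta(1, 1) prior are assumptions made for illustration.

```python
import random

class DiscountedThompsonSampling:
    """Bernoulli bandit whose past evidence decays by `gamma` each step."""

    def __init__(self, n_arms, gamma=0.95, seed=0):
        self.gamma = gamma
        self.successes = [0.0] * n_arms  # discounted success counts
        self.failures = [0.0] * n_arms   # discounted failure counts
        self.rng = random.Random(seed)

    def select_arm(self):
        # Sample a plausible mean reward per arm from Beta(S + 1, F + 1)
        # and play the arm with the largest sample.
        samples = [
            self.rng.betavariate(s + 1.0, f + 1.0)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Decay every arm's counts so old observations fade, then add
        # the new observation (reward in [0, 1]) to the played arm.
        self.successes = [self.gamma * s for s in self.successes]
        self.failures = [self.gamma * f for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1.0 - reward
```

Setting `gamma=1.0` recovers standard Thompson Sampling. Smaller values forget faster, which helps when reward distributions drift but increases variance in the estimates.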

[AI-117] Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

Link: https://arxiv.org/abs/2410.04211
Authors: Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li
Keywords: Modeling long sequences, Modeling long, handle longer sequences, handle longer, extending existing architectures
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures

Abstract:Modeling long sequences is crucial for various large-scale models; however, extending existing architectures to handle longer sequences presents significant technical and resource challenges. In this paper, we propose an efficient and flexible attention architecture that enables the extension of context lengths in large language models with reduced computational resources and fine-tuning time compared to other excellent methods. Specifically, we introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention. In addition, we also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions. The results are as follows: First, using a single A100, we achieve fine-tuning on Llama2-7B with a sequence length of 32K, which is more efficient than other methods that rely on subsets for regression. Second, we present a comprehensive method for extending context lengths across the pre-training, fine-tuning, and inference phases. During pre-training, our attention mechanism partially breaks translation invariance during token selection, so we apply positional encodings only to the selected tokens. This approach achieves relatively high performance and significant extrapolation capabilities. For fine-tuning, we introduce Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRD NTK). This design allows fine-tuning with a sequence length of only 16K, enabling models such as Llama2-7B and Mistral-7B to perform inference with context lengths of up to 1M or even arbitrary lengths. Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length. This represents at least a 64-fold reduction in resource requirements compared to traditional full-attention mechanisms, while still achieving competitive performance.

[AI-118] RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

Link: https://arxiv.org/abs/2410.04203
Authors: Hanyang Zhao, Genta Indra Winata, Anirban Das, Shi-Xiong Zhang, David D. Yao, Wenpin Tang, Sambit Sahu
Keywords: Direct Preference Optimization, numerous preference optimization, preference optimization algorithms, preference optimization, Direct Preference
Categories: Artificial Intelligence (cs.AI)

Abstract:Recently, numerous preference optimization algorithms have been introduced as extensions to the Direct Preference Optimization (DPO) family. While these methods have successfully aligned models with human preferences, there is a lack of understanding regarding the contributions of their additional components. Moreover, fair and consistent comparisons are scarce, making it difficult to discern which components genuinely enhance downstream performance. In this work, we propose RainbowPO, a unified framework that demystifies the effectiveness of existing DPO methods by categorizing their key components into seven broad directions. We integrate these components into a single cohesive objective, enhancing the performance of each individual element. Through extensive experiments, we demonstrate that RainbowPO outperforms existing DPO variants. Additionally, we provide insights to guide researchers in developing new DPO methods and assist practitioners in their implementations.

[AI-119] LongGenBench: Long-context Generation Benchmark EMNLP2024

Link: https://arxiv.org/abs/2410.04199
Authors: Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu
Keywords: requiring Large Language, Large Language Models, locate specific information, requiring Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024

Abstract:Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API accessed and open source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open source models.

[AI-120] Accelerating Diffusion Models with One-to-Many Knowledge Distillation

Link: https://arxiv.org/abs/2410.04191
Authors: Linfeng Zhang, Kaisheng Ma
Keywords: diffusion models, diffusion, advancements in image, models, image generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Significant advancements in image generation have been made with diffusion models. Nevertheless, compared with previous generative models, diffusion models incur substantial computational overhead, which prevents real-time generation. Recent approaches have aimed to accelerate diffusion models by reducing the number of sampling steps through improved sampling techniques or step distillation. However, methods to reduce the computational cost of each timestep remain relatively unexplored. Observing that diffusion models exhibit varying input distributions and feature distributions at different timesteps, we introduce one-to-many knowledge distillation (O2MKD), which distills a single teacher diffusion model into multiple student diffusion models, where each student diffusion model is trained to learn the teacher’s knowledge for a subset of continuous timesteps. Experiments on CIFAR10, LSUN Church, CelebA-HQ with DDPM and COCO30K with Stable Diffusion show that O2MKD can be applied to previous knowledge distillation and fast sampling methods to achieve significant acceleration. Code will be released on GitHub.

[AI-121] Non-monotonic Extensions to Formal Concept Analysis via Object Preferences

Link: https://arxiv.org/abs/2410.04184
Authors: Lucas Carr, Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph
Keywords: Formal Concept Analysis, formal context, set of objects, Concept Analysis, textit
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)

Abstract:Formal Concept Analysis (FCA) is an approach to creating a conceptual hierarchy in which a concept lattice is generated from a formal context, that is, a triple consisting of a set of objects, G, a set of attributes, M, and an incidence relation I on G × M. A concept is then modelled as a pair consisting of a set of objects (the extent), and a set of shared attributes (the intent). Implications in FCA describe how one set of attributes follows from another. The semantics of these implications closely resemble that of logical consequence in classical logic. In that sense, it describes a monotonic conditional. The contributions of this paper are two-fold. First, we introduce a non-monotonic conditional between sets of attributes, which assumes a preference over the set of objects. We show that this conditional gives rise to a consequence relation that is consistent with the postulates for non-monotonicity proposed by Kraus, Lehmann, and Magidor (commonly referred to as the KLM postulates). We argue that our contribution establishes a strong characterisation of non-monotonicity in FCA. Typical concepts represent concepts where the intent aligns with expectations from the extent, allowing for an exception-tolerant view of concepts. To this end, we show that the set of all typical concepts is a meet semi-lattice of the original concept lattice. This notion of typical concepts is a further introduction of KLM-style typicality into FCA, and is foundational towards developing an algebraic structure representing a concept lattice of prototypical concepts.
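The classical FCA notions in this abstract (extent, intent, formal concept, and attribute implication) can be illustrated with a short sketch. The code below covers only the standard monotonic setting, not the paper's preference-based conditional. The toy context and all function names are hypothetical, chosen for illustration.

```python
def intent(objects, context):
    """Attributes shared by every object in `objects`."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared

def extent(attributes, context):
    """Objects that have every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

def is_formal_concept(objects, attributes, context):
    """A pair (extent, intent) is a concept when each determines the other."""
    return (intent(objects, context) == attributes
            and extent(attributes, context) == objects)

def implication_holds(premise, conclusion, context):
    """Classical implication: every object with `premise` has `conclusion`."""
    return extent(premise, context) <= extent(conclusion, context)

# Hypothetical toy context: object -> set of attributes (the relation I).
context = {
    "sparrow": {"bird", "flies"},
    "penguin": {"bird"},
    "bat": {"flies"},
}
```

In this toy context the implication from {"bird"} to {"flies"} fails because of the single exception "penguin". This is the kind of strictness that an exception-tolerant, preference-based conditional is meant to relax.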

[AI-122] IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

Link: https://arxiv.org/abs/2410.04171
Authors: Shitong Shao, Zikai Zhou, Lichen Bai, Haoyi Xiond, Zeke Xie
Keywords: multi-step sampling mechanism, inference computational cost, OpenAI Strawberry, Strawberry in enhancing, visual diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI’s Strawberry in enhancing performance by increasing the inference computational cost. Numerous prior studies have demonstrated that correctly scaling up computation in the sampling process can lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on 4 benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, close to the 223.1 of the closed-source Pika-2.0.

[AI-123] Applying Quantum Autoencoders for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2410.04154
Authors: Robin Frehner, Kurt Stockinger
Keywords: Anomaly detection, pattern recognition, medical diagnosis, quantum, recognition or medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments: 22 pages, 16 figures

Abstract:Anomaly detection is an important problem with applications in various domains such as fraud detection, pattern recognition or medical diagnosis. Several algorithms have been introduced using classical computing approaches. However, using quantum computing for solving anomaly detection problems in time series data is a widely unexplored research field. This paper explores the application of quantum autoencoders to time series anomaly detection. We investigate two primary techniques for classifying anomalies: (1) Analyzing the reconstruction error generated by the quantum autoencoder and (2) latent representation analysis. Our simulated experimental results, conducted across various ansaetze, demonstrate that quantum autoencoders consistently outperform classical deep learning-based autoencoders across multiple datasets. Specifically, quantum autoencoders achieve superior anomaly detection performance while utilizing 60-230 times fewer parameters and requiring five times fewer training iterations. In addition, we implement our quantum encoder on real quantum hardware. Our experimental results demonstrate that quantum autoencoders achieve anomaly detection performance on par with their simulated counterparts.

[AI-124] Neuro-Symbolic Entity Alignment via Variational Inference

Link: https://arxiv.org/abs/2410.04153
Authors: Shengyuan Chen, Qinggang Zhang, Junnan Dong, Wen Hua, Jiannong Cao, Xiao Huang
Keywords: equivalent entity pairs, identifying equivalent entity, entity pairs, equivalent entity, aims to merge
Categories: Artificial Intelligence (cs.AI)

Abstract:Entity alignment (EA) aims to merge two knowledge graphs (KGs) by identifying equivalent entity pairs. Existing methods can be categorized into symbolic and neural models. Symbolic models, while precise, struggle with substructure heterogeneity and sparsity, whereas neural models, although effective, generally lack interpretability and cannot handle uncertainty. We propose NeuSymEA, a probabilistic neuro-symbolic framework that combines the strengths of both methods. NeuSymEA models the joint probability of all possible pairs’ truth scores in a Markov random field, regulated by a set of rules, and optimizes it with the variational EM algorithm. In the E-step, a neural model parameterizes the truth score distributions and infers missing alignments. In the M-step, the rule weights are updated based on the observed and inferred alignments. To facilitate interpretability, we further design a path-ranking-based explainer upon this framework that generates supporting rules for the inferred alignments. Experiments on benchmarks demonstrate that NeuSymEA not only significantly outperforms baselines in terms of effectiveness and robustness, but also provides interpretable results.

[AI-125] DAMMI: Daily Activities in a Psychologically Annotated Multi-Modal IoT dataset

Link: https://arxiv.org/abs/2410.04152
Authors: Mohsen Falah Rad, Kamrad Khoshhal Roudposhti, Mohammad Hassan Khoobkar, Mohsen Shirali, Zahra Ahmadi, Carlos Fernandez-Llatas
Keywords: well-being services, age pyramid, pyramid have increased, increased the demand, data
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:The growth in the elderly population and the shift in the age pyramid have increased the demand for healthcare and well-being services. To address this concern, alongside the rising cost of medical care, the concept of ageing at home has emerged, driven by recent advances in medical and technological solutions. Experts in computer science, communication technology, and healthcare have collaborated to develop affordable health solutions by employing sensors in living environments, wearable devices, and smartphones, in association with advanced data mining and intelligent systems with learning capabilities, to monitor, analyze, and predict the health status of elderly individuals. However, implementing intelligent healthcare systems and developing analytical techniques requires testing and evaluating algorithms on real-world data. Despite the need, there is a shortage of publicly available datasets that meet these requirements. To address this gap, we present the DAMMI dataset in this work, designed to support researchers in the field. The dataset includes daily activity data of an elderly individual collected via home-installed sensors, smartphone data, and a wristband over 146 days. It also contains daily psychological reports provided by a team of psychologists. Furthermore, the data collection spans significant events such as the COVID-19 pandemic, New Year’s holidays, and the religious month of Ramadan, offering additional opportunities for analysis. In this paper, we outline detailed information about the data collection system, the types of data recorded, and pre-processed event logs. This dataset is intended to assist professionals in IoT and data mining in evaluating and implementing their research ideas.

[AI-126] Reasoning with Natural Language Explanations EMNLP2024

Link: https://arxiv.org/abs/2410.04148
Authors: Marco Valentino, André Freitas
Keywords: media supporting scientific, supporting scientific discovery, Natural Language Inference, natural language explanations, explanation-based NLI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Tutorial to be presented at EMNLP 2024. Website: this https URL

Abstract:Explanation constitutes an archetypal feature of human rationality, underpinning learning and generalisation, and representing one of the media supporting scientific discovery and communication. Due to the importance of explanations in human reasoning, an increasing amount of research in Natural Language Inference (NLI) has started reconsidering the role that explanations play in learning and inference, attempting to build explanation-based NLI models that can effectively encode and use natural language explanations on downstream tasks. Research in explanation-based NLI, however, presents specific challenges and opportunities, as explanatory reasoning reflects aspects of both material and formal inference, making it a particularly rich setting to model and deliver complex reasoning. In this tutorial, we provide a comprehensive introduction to the field of explanation-based NLI, grounding this discussion on the epistemological-linguistic foundations of explanations, systematically describing the main architectural trends and evaluation methodologies that can be used to build systems capable of explanatory reasoning.

[AI-127] From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression EMNLP2024

Link: https://arxiv.org/abs/2410.04139
Authors: Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee
Keywords: Large language models, advanced prompting techniques, Large language, achieved significant performance, significant performance gains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2024; 21 pages; 10 figures and 7 tables

Abstract:Large language models (LLMs) have achieved significant performance gains using advanced prompting techniques over various tasks. However, the increasing length of prompts leads to high computational costs and often obscures crucial information. Prompt compression has been proposed to alleviate these issues, but it faces challenges in (i) capturing the global context and (ii) training the compressor effectively. To tackle these challenges, we introduce a novel prompt compression method, namely Reading To Compressing (R2C), utilizing the Fusion-in-Decoder (FiD) architecture to identify the important information in the prompt. Specifically, the cross-attention scores of the FiD are used to discern essential chunks and sentences from the prompt. R2C effectively captures the global context without compromising semantic consistency while detouring the necessity of pseudo-labels for training the compressor. Empirical results show that R2C retains key contexts, enhancing the LLM performance by 6% in out-of-domain evaluations while reducing the prompt length by 80%.

[AI-128] From Hospital to Portables: A Universal ECG Foundation Model Built on 10 Million Diverse Recordings

Link: https://arxiv.org/abs/2410.04133
Authors: Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, Shenda Hong
Keywords: Artificial Intelligence, shown great promise, ECG, promise in electrocardiogram, shown great
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Work in progress

Abstract:Artificial Intelligence (AI) has shown great promise in electrocardiogram (ECG) analysis and cardiovascular disease detection. However, developing a general AI-ECG model has been challenging due to inter-individual variability and the diversity of ECG diagnoses, limiting existing models to specific diagnostic tasks and datasets. Moreover, current AI-ECG models struggle to achieve comparable performance between single-lead and 12-lead ECGs, limiting the application of AI-ECG to portable and wearable ECG devices. To address these limitations, we introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder is trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis through ECG analysis. The model is designed to be both effective out-of-the-box and fine-tunable for downstream tasks, maximizing usability. More importantly, we extend its application to single-lead ECGs, enabling complex condition diagnoses and supporting various downstream tasks in mobile and remote monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets for both 12-lead and single-lead ECGs, while also exhibiting strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographics detection, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through the this http URL. Our code is available at this https URL.

[AI-129] Riemann Sum Optimization for Accurate Integrated Gradients Computation

Link: https://arxiv.org/abs/2410.04118
Authors: Swadesh Swain, Shree Singhi
Keywords: Integrated Gradients, deep neural network, inaccurate Riemann Sum, input features, Riemann Sum approximations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:Integrated Gradients (IG) is a widely used algorithm for attributing the outputs of a deep neural network to its input features. Due to the absence of closed-form integrals for deep learning models, inaccurate Riemann Sum approximations are used to calculate IG. This often introduces undesirable errors in the form of high levels of noise, leading to false insights into the model’s decision-making process. We introduce a framework, RiemannOpt, that minimizes these errors by optimizing the sample point selection for the Riemann Sum. Our algorithm is highly versatile and applicable to IG as well as its derivatives like Blur IG and Guided IG. RiemannOpt achieves up to 20% improvement in Insertion Scores. Additionally, it enables its users to curtail computational costs by up to fourfold, thereby making it highly functional for constrained environments.
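As a rough illustration of the Riemann Sum approximation this abstract refers to, the sketch below computes IG for a toy two-input function with a uniform midpoint rule and measures how far the attributions are from satisfying completeness (attributions summing to f(x) - f(baseline)). The toy function `f`, its hand-written gradient `grad_f`, and the uniform sample points are assumptions made for illustration only. This is not RiemannOpt, which according to the abstract optimizes the sample point selection rather than spacing the points uniformly.

```python
import math

def f(x):
    # Toy stand-in for a model output (an assumption, not a real network).
    return math.tanh(x[0] * x[1]) + x[0] ** 2

def grad_f(x):
    # Analytic gradient of the toy function above.
    t = 1.0 - math.tanh(x[0] * x[1]) ** 2
    return [t * x[1] + 2.0 * x[0], t * x[0]]

def integrated_gradients(x, baseline, steps):
    # Midpoint Riemann sum over the straight path baseline -> x.
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by the input difference.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

def completeness_gap(x, baseline, steps):
    # |sum of attributions - (f(x) - f(baseline))|; zero for an exact integral.
    ig = integrated_gradients(x, baseline, steps)
    return abs(sum(ig) - (f(x) - f(baseline)))

if __name__ == "__main__":
    x, baseline = [1.5, -2.0], [0.0, 0.0]
    for steps in (5, 20, 200):
        print(steps, completeness_gap(x, baseline, steps))
```

The completeness gap shrinks as the number of sample points grows, which is the cost and accuracy trade-off the abstract says sample point selection is meant to improve.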

[AI-130] Transport-Embedded Neural Architecture: Redefining the Landscape of physics aware neural models in fluid mechanics

Link: https://arxiv.org/abs/2410.04114
Authors: Amirmahdi Jafari
Keywords: physics-informed neural network, standard physics-informed neural, neural network, equation by design, transport-embedded neural network
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces a new neural model which follows the transport equation by design. A physical problem, the Taylor-Green vortex, defined on a bi-periodic domain, is used as a benchmark to evaluate the performance of both the standard physics-informed neural network and our model (transport-embedded neural network). Results exhibit that while the standard physics-informed neural network fails to predict the solution accurately and merely returns the initial condition for the entire time span, our model successfully captures the temporal changes in the physics, particularly for high Reynolds numbers of the flow. Additionally, the ability of our model to prevent false minima can pave the way for addressing multiphysics problems, which are more prone to false minima, and help them accurately predict complex physics.

[AI-131] On the Sample Complexity of a Policy Gradient Algorithm with Occupancy Approximation for General Utility Reinforcement Learning

Link: https://arxiv.org/abs/2410.04108
Authors: Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, Amrit Singh Bedi
Keywords: including imitation learning, recently gained attention, Reinforcement learning, pure exploration, imitation learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures

Abstract:Reinforcement learning with general utilities has recently gained attention thanks to its ability to unify several problems, including imitation learning, pure exploration, and safe RL. However, prior work for solving this general problem in a unified way has mainly focused on the tabular setting. This is restrictive when considering larger state-action spaces because of the need to estimate occupancy measures during policy optimization. In this work, we address this issue and propose to approximate occupancy measures within a function approximation class using maximum likelihood estimation (MLE). We propose a simple policy gradient algorithm (PG-OMA) where an actor updates the policy parameters to maximize the general utility objective whereas a critic approximates the occupancy measure using MLE. We provide a sample complexity analysis of PG-OMA showing that our occupancy measure estimation error only scales with the dimension of our function approximation class rather than the size of the state action space. Under suitable assumptions, we establish first order stationarity and global optimality performance bounds for the proposed PG-OMA algorithm for nonconcave and concave general utilities respectively. We complement our methodological and theoretical findings with promising empirical results showing the scalability potential of our approach compared to existing tabular count-based approaches.

[AI-132] The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities

Link: https://arxiv.org/abs/2410.04098
Authors: Stefano Giacomelli, Marco Giordano, Claudia Rinaldi
Keywords: Automatic Speech Recognition, Automatic Speech, supervised classification tasks, vowel phonemes classification, Speech Recognition
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at “IEEE 5th International Symposium on the Internet of Sounds, 30 Sep / 2 Oct 2024, Erlangen, Germany”

Abstract:This paper explores a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks, focusing on vowel phonemes classification and speakers recognition for the Automatic Speech Recognition (ASR) domain. For our case-study, the ASR model runs on a proprietary sensing and lightning system, exploited to monitor acoustic and air pollution on urban streets. We formalize combinations of pseudo-Neural Architecture Search and Hyper-Parameters Tuning experiments, using an informed grid-search methodology, to achieve classification accuracy comparable to nowadays most complex architectures, delving into the speaker recognition and energy efficiency aspects. Despite its simplicity, our model proposal has a very good chance to generalize the language and speaker genders context for widespread applicability in computational constrained contexts, proved by relevant statistical and performance metrics. Our experiments code is openly accessible on our GitHub.

[AI-133] Sinc Kolmogorov-Arnold Network and Its Applications on Physics-informed Neural Networks

Link: https://arxiv.org/abs/2410.04096
Authors: Tianchi Yu, Jingwei Qiu, Jiang Yang, Ivan Oseledets
Keywords: recently gained attention, Sinc interpolation proposes, Sinc interpolation, learnable activation functions, multilayer perceptron
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:

Abstract:In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation functions, which recently gained attention as alternatives to multilayer perceptron. Many different function representations have already been tried, but we show that Sinc interpolation proposes a viable alternative, since it is known in numerical analysis to represent well both smooth functions and functions with singularities. This is important not only for function approximation but also for the solutions of partial differential equations with physics-informed neural networks. Through a series of experiments, we show that SincKANs provide better results in almost all of the examples we have considered.
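For readers unfamiliar with the Sinc interpolation mentioned in this abstract, the following is a minimal sketch of the classical truncated cardinal series, f(x) ≈ Σ_k f(kh) · sinc((x − kh)/h), applied to a Gaussian. The function names, the step size `h`, and the truncation `n_terms` are illustrative assumptions. The sketch shows only the interpolation formula and is not the SincKAN architecture from the paper.

```python
import math

def sinc(t):
    # Normalized sinc: sin(pi t) / (pi t), with sinc(0) = 1.
    return 1.0 if t == 0.0 else math.sin(math.pi * t) / (math.pi * t)

def sinc_interpolate(f, x, h, n_terms):
    # Truncated cardinal series using samples f(k*h), k = -n_terms..n_terms.
    return sum(
        f(k * h) * sinc((x - k * h) / h)
        for k in range(-n_terms, n_terms + 1)
    )

def gaussian(x):
    # Smooth, rapidly decaying test function.
    return math.exp(-x * x)

if __name__ == "__main__":
    for x in (0.0, 0.3, 1.0, 1.7):
        approx = sinc_interpolate(gaussian, x, h=0.5, n_terms=20)
        print(x, approx, gaussian(x))
```

In a learnable setting one would presumably treat the sample values as trainable coefficients, but that step is an assumption here and is left out of the sketch.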

[AI-134] GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual Cross-lingual and Multi-document News Summarization EMNLP2024

Link: https://arxiv.org/abs/2410.04087
Authors: Yangfan Ye, Xiachong Feng, Xiaocheng Feng, Weitao Ma, Libo Qin, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Keywords: today global scene, today global, global scene, content and varied, varied viewpoints
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 main conference, long paper

Abstract:News summarization in today’s global scene can be daunting with its flood of multilingual content and varied viewpoints from different sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task, i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting a wealth of multilingual news reports and restructuring them into event-centric format. Additionally, we introduce the method of protocol-guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between news reports, in addition to the issues of redundancies and omissions, further enhancing the complexity of GLOBESUMM. Through extensive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.

[AI-135] Taming the Tail: Leveraging Asymmetric Loss and Pade Approximation to Overcome Medical Image Long-Tailed Class Imbalance BMVC24

Link: https://arxiv.org/abs/2410.04084
Authors: Pankhi Kashyap, Pavni Tandon, Sunny Gupta, Abhishek Tiwari, Ritwik Kulkarni, Kshitij Sharad Jadhav
Keywords: dependable classification methods, data imbalance due, warranting the requirement, problems in healthcare, healthcare emerge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure. Accepted in The 35th British Machine Vision Conference (BMVC24)

Abstract:Long-tailed problems in healthcare emerge from data imbalance due to variability in the prevalence and representation of different medical conditions, warranting the requirement of precise and dependable classification methods. Traditional loss functions such as cross-entropy and binary cross-entropy are often inadequate due to their inability to address the imbalances between the classes with high representation and the classes with low representation found in medical image datasets. We introduce a novel polynomial loss function based on Pade approximation, designed specifically to overcome the challenges associated with long-tailed classification. This approach incorporates asymmetric sampling techniques to better classify under-represented classes. We conducted extensive evaluations on three publicly available medical datasets and a proprietary medical dataset. Our implementation of the proposed loss function is open-sourced in the public repository:this https URL.

[AI-136] epsilon-VAE: Denoising as Visual Decoding

Link: https://arxiv.org/abs/2410.04081
Authors: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu
Keywords: simplifies complex data, tokenization simplifies complex, learnable space, generative modeling, simplifies complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approach. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.

[AI-137] On Eliciting Syntax from Language Models via Hashing EMNLP-2024

Link: https://arxiv.org/abs/2410.04074
Authors: Yiran Wang, Masao Utiyama
Keywords: infer syntactic structure, aims to infer, infer syntactic, syntactic structure, raw text
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP-2024

Abstract:Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets, therefore, we claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.

[AI-138] Multi-Round Region-Based Optimization for Scene Sketching

Link: https://arxiv.org/abs/2410.04072
Authors: Yiqi Liang, Ying Liu, Dandan Long, Ruihui Li
Keywords: abstract representation, representation that captures, captures the essential, Scene, Scene sketching
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Scene sketching converts a scene into a simplified, abstract representation that captures the essential elements and composition of the original scene. It requires semantic understanding of the scene and consideration of different regions within the scene. Since scenes often contain diverse visual information across various regions, such as foreground objects, background elements, and spatial divisions, dealing with these different regions poses unique difficulties. In this paper, we define a sketch as some sets of Bezier curves. We optimize the different regions of the input scene in multiple rounds. In each round of optimization, strokes sampled from the next region can seamlessly be integrated into the sketch generated in the previous round of optimization. We propose an additional stroke initialization method to ensure the integrity of the scene and the convergence of optimization. A novel CLIP-Based Semantic loss and a VGG-Based Feature loss are utilized to guide our multi-round optimization. Extensive experimental results on the quality and quantity of the generated sketches confirm the effectiveness of our method.

[AI-139] PAD: Personalized Alignment at Decoding-Time

Link: https://arxiv.org/abs/2410.04070
Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu
Keywords: significant challenge due, significantly across cultural, political differences, personalized preferences, traditional alignment methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase

Abstract:Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase, eliminating the need for additional training. By introducing a unique personalized reward modeling strategy, this framework decouples the text generation process from personalized preferences, facilitating the generation of generalizable token-level personalized rewards. The PAD algorithm leverages these rewards to guide the decoding process, dynamically tailoring the base model’s predictions to personalized preferences. Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training and scalability across different base models. This work advances the capability of LLMs to meet user needs in real-time applications, presenting a substantial step forward in personalized LLM alignment.

[AI-140] ECon: On the Detection and Resolution of Evidence Conflicts EMNLP2024

Link: https://arxiv.org/abs/2410.04068
Authors: Cheng Jiayang, Chunkit Chan, Qianqian Zhuang, Lin Qiu, Tianhang Zhang, Tengxiao Liu, Yangqiu Song, Yue Zhang, Pengfei Liu, Zheng Zhang
Keywords: managing conflicting information, Natural Language Inference, large language models, decision-making systems, rise of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 main conference

Abstract:The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or “inter-evidence conflicts.” This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs’ conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.

[AI-141] Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback EMNLP2024

Link: https://arxiv.org/abs/2410.04064
Authors: Fatemeh Pesaran Zadeh, Juyeon Kim, Jin-Hwa Kim, Gunhee Kim
Keywords: Large language models, demonstrated strong capabilities, Large language, notably through instruction-tuning, demonstrated strong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Main. Code and dataset are released at this https URL

Abstract:Large language models (LLMs) have demonstrated strong capabilities across various language tasks, notably through instruction-tuning methods. However, LLMs face challenges in visualizing complex, real-world data through charts and plots. Firstly, existing datasets rarely cover a full range of chart types, such as 3D, volumetric, and gridded charts. Secondly, supervised fine-tuning methods do not fully leverage the intricate relationships within rich datasets, including text, code, and figures. To address these challenges, we propose a hierarchical pipeline and a new dataset for chart generation. Our dataset, Text2Chart31, includes 31 unique plot types referring to the Matplotlib library, with 11.1K tuples of descriptions, code, data tables, and plots. Moreover, we introduce a reinforcement learning-based instruction tuning technique for chart generation tasks without requiring human feedback. Our experiments show that this approach significantly enhances the model performance, enabling smaller models to outperform larger open-source models and be comparable to state-of-the-art proprietary models in data visualization tasks. We make the code and dataset available at this https URL.

[AI-142] Enhancing Graph Self-Supervised Learning with Graph Interplay

Link: https://arxiv.org/abs/2410.04061
Authors: Xinjian Zhao, Wei Pang, Xiangru Jian, Yaoyao Xu, Chaolong Ying, Tianshu Yu
Keywords: extracting informative representations, introduce Graph Interplay, Graph self-supervised learning, labeled inputs, compelling framework
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 27 pages, 12 figures

Abstract:Graph self-supervised learning (GSSL) has emerged as a compelling framework for extracting informative representations from graph-structured data without extensive reliance on labeled inputs. In this study, we introduce Graph Interplay (GIP), an innovative and versatile approach that significantly enhances the performance equipped with various existing GSSL methods. To this end, GIP advocates direct graph-level communications by introducing random inter-graph edges within standard batches. Against GIP’s simplicity, we further theoretically show that GIP essentially performs a principled manifold separation via combining inter-graph message passing and GSSL, bringing about more structured embedding manifolds and thus benefits a series of downstream tasks. Our empirical study demonstrates that GIP surpasses the performance of prevailing GSSL methods across multiple benchmarks by significant margins, highlighting its potential as a breakthrough approach. Besides, GIP can be readily integrated into a series of GSSL methods and consistently offers additional performance gain. This advancement not only amplifies the capability of GSSL but also potentially sets the stage for a novel graph learning paradigm in a broader sense.

[AI-143] LoRTA: Low Rank Tensor Adaptation of Large Language Models

Link: https://arxiv.org/abs/2410.04060
Authors: Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro
Keywords: Efficient Fine Tuning, Low Rank Adaptation, Parameter Efficient Fine, Fine Tuning, Rank Adaptation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Low Rank Adaptation (LoRA) is a popular Parameter Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks. LoRA parameterizes model updates using low-rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, resource requirements during fine-tuning. However, the lower bound on the number of trainable parameters remains high due to the use of the low-rank matrix model. In this paper, we address this limitation by proposing a novel approach that employs a low rank tensor parametrization for model updates. The proposed low rank tensor model can significantly reduce the number of trainable parameters, while also allowing for finer-grained control over adapter size. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization and Protein Folding benchmarks demonstrate that our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
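To make the parameter-count argument in this abstract concrete, the sketch below compares a dense update, a standard LoRA update (two factors of rank r per weight matrix), and a generic rank-r CP factorization of a tensor. The helper names, the layer sizes, and the choice to stack per-layer updates into a single d × d × L tensor are assumptions for illustration. The abstract does not specify the paper's tensor model, so the CP count here should not be read as LoRTA's actual parametrization.

```python
def full_update_params(d_out, d_in):
    # Dense update: one parameter per weight entry.
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    # LoRA: delta W = B @ A with B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)

def cp_tensor_params(shape, rank):
    # Generic rank-`rank` CP factorization: one (dim x rank) factor per mode.
    return rank * sum(shape)

if __name__ == "__main__":
    d, layers, rank = 4096, 32, 8  # hypothetical model dimensions
    print("dense, per matrix:", full_update_params(d, d))
    print("LoRA, per matrix:", lora_params(d, d, rank))
    print("LoRA, all layers:", layers * lora_params(d, d, rank))
    print("CP over d x d x L:", cp_tensor_params((d, d, layers), rank))
```

Even at rank 1, LoRA needs d_out + d_in parameters for every adapted matrix, which is the lower bound the abstract describes. A factorization shared across layers grows with the sum of the mode sizes instead.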

[AI-144] Large Language Models can Achieve Social Balance

Link: https://arxiv.org/abs/2410.04054
Authors: Pedro Cisneros-Velarde
Keywords: Social balance, achieve social balance, population ends, antagonistic factions, concept in sociology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments:

Abstract:Social balance is a concept in sociology which states that if every three individuals in a population achieve certain structures of positive or negative interactions, then the whole population ends up in one faction of positive interactions or divided between two or more antagonistic factions. In this paper, we consider a group of interacting large language models (LLMs) and study how, after continuous interactions, they can achieve social balance. Across three different LLM models, we found that social balance depends on (i) whether interactions are updated based on “relationships”, “appraisals”, or “opinions”; (ii) whether agents update their interactions based on homophily or influence from their peers; and (iii) the number of simultaneous interactions the LLMs consider. When social balance is achieved, its particular structure of positive or negative interactions depends on these three conditions and are different across LLM models and sizes. The stability of interactions and the justification for their update also vary across models. Thus, social balance is driven by the pre-training and alignment particular to each LLM model.
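The triad condition described in this abstract can be sketched with the classical strong form of structural balance, in which a triad is balanced when the product of its three edge signs is positive. The data layout (a dict keyed by unordered pairs) and the helper names are assumptions for illustration. The abstract also mentions outcomes with more than two factions, which corresponds to a weaker notion of balance that this sketch does not check.

```python
from itertools import combinations

def is_balanced_triad(signs, a, b, c):
    # Balanced when the product of the three pairwise signs is positive.
    return (
        signs[frozenset((a, b))]
        * signs[frozenset((b, c))]
        * signs[frozenset((a, c))]
    ) > 0

def is_strongly_balanced(agents, signs):
    # Every triple of agents must form a balanced triad.
    return all(
        is_balanced_triad(signs, a, b, c)
        for a, b, c in combinations(agents, 3)
    )

def two_faction_signs(faction_a, faction_b):
    # +1 inside a faction, -1 across factions.
    agents = list(faction_a) + list(faction_b)
    signs = {}
    for u, v in combinations(agents, 2):
        same_side = (u in faction_a) == (v in faction_a)
        signs[frozenset((u, v))] = 1 if same_side else -1
    return agents, signs

if __name__ == "__main__":
    agents, signs = two_faction_signs(["a", "b"], ["c", "d"])
    print(is_strongly_balanced(agents, signs))
    signs[frozenset(("a", "b"))] = -1  # break an in-faction tie
    print(is_strongly_balanced(agents, signs))
```

A population split into two mutually hostile factions passes the check, and flipping a single in-faction relationship makes at least one triad unbalanced.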

[AI-145] Beyond Forecasting: Compositional Time Series Reasoning for End-to-End Task Execution

Link: https://arxiv.org/abs/2410.04047
Authors: Wen Ye, Yizhou Zhang, Wei Yang, Lumingyuan Tang, Defu Cao, Jie Cai, Yan Liu
Keywords: time series, time series data, time series forecasting, Time Series Reasoning, time series models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent decades, there have been substantial advances in time series models and benchmarks across various individual tasks, such as time series forecasting, classification, and anomaly detection. Meanwhile, compositional reasoning in time series, prevalent in real-world applications (e.g., decision-making and compositional question answering), is in great demand. Unlike simple tasks that primarily focus on predictive accuracy, compositional reasoning emphasizes the synthesis of diverse information from both time series data and various domain knowledge, making it distinct and considerably more challenging. In this paper, we introduce Compositional Time Series Reasoning, a new task of handling intricate multistep reasoning tasks from time series data. Specifically, this new task focuses on various question instances requiring structural and compositional reasoning abilities on time series data, such as decision-making and compositional question answering. As an initial attempt to tackle this novel task, we developed TS-Reasoner, a program-aided approach that utilizes a large language model (LLM) to decompose a complex task into steps of programs that leverage existing time series models and numerical subroutines. Unlike existing reasoning work which only calls off-the-shelf modules, TS-Reasoner allows for the creation of custom modules and provides greater flexibility to incorporate domain knowledge as well as user-specified constraints. We demonstrate the effectiveness of our method through a comprehensive set of experiments. These promising results indicate potential opportunities in the new task of time series reasoning and highlight the need for further research.

[AI-146] BlockFound: Customized blockchain foundation model for anomaly detection

Link: https://arxiv.org/abs/2410.04039
Authors: Jiahao Yu, Xian Wu, Hao Liu, Wenbo Guo, Xinyu Xing
Keywords: blockchain, blockchain transaction, detection, customized, customized foundation model
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:We propose BlockFound, a customized foundation model for anomalous blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized masked language learning mechanism for pretraining, with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound’s exceptional capability in anomaly detection while maintaining a low false positive rate. Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.

[AI-147] Gamified crowd-sourcing of high-quality data for visual fine-tuning

Link: https://arxiv.org/abs/2410.04038
Authors: Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan
Keywords: Gamified Adversarial Prompting, introduces Gamified Adversarial, Adversarial Prompting, visual instruction tuning, paper introduces Gamified
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model’s knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model’s knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B, increasing its GPT score from 0.147 to 0.477 on our dataset, approaching the benchmark set by the much larger GPT-4V. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B also enhances its performance across other benchmarks, and exhibits cross-model benefits. Specifically, the same data improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on the same multiple benchmarks.

[AI-148] SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Link: https://arxiv.org/abs/2410.04029
Authors: Alan Baade, Puyuan Peng, David Harwath
Keywords: require tokenized inputs, models require tokenized, Language models require, tokenized inputs, require tokenized
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 2 figures

Abstract:Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
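
The rates quoted in the abstract can be related with a little arithmetic. The sketch below is only a back-of-the-envelope reading: it assumes the 60 bps bitrate is spread evenly over the 5 Hz units and that each unit is drawn from a uniform code, which the abstract does not state. The variable names are illustrative.

```python
# Figures quoted in the abstract.
WAVEFORM_RATE_HZ = 16_000  # raw audio samples per second
UNIT_RATE_HZ = 5           # lowest reported unit rate
BITRATE_BPS = 60           # lowest reported bitrate

# Bits carried by each unit if the bitrate is spread evenly over units.
bits_per_unit = BITRATE_BPS / UNIT_RATE_HZ

# Assumption: a uniform code over a fixed inventory, so size = 2 ** bits.
implied_inventory_size = 2 ** int(bits_per_unit)

# Average number of waveform samples covered by one unit.
samples_per_unit = WAVEFORM_RATE_HZ // UNIT_RATE_HZ

print(bits_per_unit, implied_inventory_size, samples_per_unit)
```

Under these assumptions each unit carries 12 bits and spans 3,200 waveform samples, which gives a sense of how coarse the units are relative to the raw signal.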

[AI-149] IdeaSynth: Iterative Research Idea Development Through Evolving and Composing Idea Facets with Literature-Grounded Feedback

Link: https://arxiv.org/abs/2410.04025
Authors: Kevin Pu, K. J. Kevin Feng, Tovi Grossman, Tom Hope, Bhavana Dalvi Mishra, Matt Latzke, Jonathan Bragg, Joseph Chee Chang, Pao Siangliulue
Keywords: deep refining ideas, involves broad exploring, ideation involves broad, deep refining, Research ideation involves
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Research ideation involves broadly exploring and deeply refining ideas. Both require deep engagement with literature. Existing tools focus primarily on broad idea generation, yet offer little support for the iterative specification, refinement, and evaluation needed to further develop initial ideas. To bridge this gap, we introduce IdeaSynth, a research idea development system that uses LLMs to provide literature-grounded feedback for articulating research problems, solutions, evaluations, and contributions. IdeaSynth represents these idea facets as nodes on a canvas, and allows researchers to iteratively refine them by creating and exploring variations and composing them. Our lab study (N=20) showed that participants, while using IdeaSynth, explored more alternative ideas and expanded initial ideas with more details compared to a strong LLM-based baseline. Our deployment study (N=7) demonstrated that participants effectively used IdeaSynth for real-world research projects at various ideation stages from developing initial ideas to revising framings of mature manuscripts, highlighting the possibilities to adopt IdeaSynth in researchers’ workflows.

[AI-150] Efficient Large-Scale Urban Parking Prediction: Graph Coarsening Based on Real-Time Parking Service Capability

Link: https://arxiv.org/abs/2410.04022
Authors: Yixuan Wang, Zhenwu Chen, Kangshuai Zhang, Yunduan Cui, Lei Peng
Keywords: large-scale urban parking, parking, urban parking, predicting large-scale urban, number of vehicles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the sharp increase in the number of vehicles, the issue of parking difficulties has emerged as an urgent challenge that many cities need to address promptly. In the task of predicting large-scale urban parking data, existing research often lacks effective deep learning models and strategies. To tackle this challenge, this paper proposes an innovative framework for predicting large-scale urban parking graphs leveraging real-time service capabilities, aimed at improving the accuracy and efficiency of parking predictions. Specifically, we introduce a graph attention mechanism that assesses the real-time service capabilities of parking lots to construct a dynamic parking graph that accurately reflects real preferences in parking behavior. To effectively handle large-scale parking data, this study combines graph coarsening techniques with temporal convolutional autoencoders to achieve unified dimension reduction of the complex urban parking graph structure and features. Subsequently, we use a spatio-temporal graph convolutional model to make predictions based on the coarsened graph, and a pre-trained autoencoder-decoder module restores the predicted results to their original data dimensions, completing the task. Our methodology has been rigorously tested on a real dataset from parking lots in Shenzhen. The experimental results indicate that compared to traditional parking prediction models, our framework achieves improvements of 46.8% and 30.5% in accuracy and efficiency, respectively. Remarkably, with the expansion of the graph’s scale, our framework’s advantages become even more apparent, showcasing its substantial potential for solving complex urban parking dilemmas in practical scenarios.

[AI-151] JAM: A Comprehensive Model for Age Estimation, Verification, and Comparability

Link: https://arxiv.org/abs/2410.04012
Authors: François David, Alexey A. Novikov, Ruslan Parkhomenko, Artem Voronin, Alix Melchy
Keywords: offering a comprehensive, introduces a comprehensive, comprehensive solution, paper introduces, age estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces a comprehensive model for age estimation, verification, and comparability, offering a comprehensive solution for a wide range of applications. It employs advanced learning techniques to understand age distribution and uses confidence scores to create probabilistic age ranges, enhancing its ability to handle ambiguous cases. The model has been tested on both proprietary and public datasets and compared against one of the top-performing models in the field. Additionally, it has recently been evaluated by NIST as part of the FATE challenge, achieving top places in many categories.

[AI-152] Hyperbolic Fine-tuning for Large Language Models ICML2024

Link: https://arxiv.org/abs/2410.04010
Authors: Menglin Yang, Aosong Feng, Bo Xiong, Jihong Liu, Irwin King, Rex Ying
Keywords: Large language models, Large language, demonstrated remarkable performance, language models, demonstrated remarkable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: The preliminary work was accepted for the ICML 2024 LLM Cognition Workshop, and this version includes new investigations, analyses, experiments, and results

Abstract:Large language models (LLMs) have demonstrated remarkable performance on various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for embedding tokens in LLMs. In this study, we first investigate the non-Euclidean characteristics of LLMs. Our findings reveal that token frequency follows a power-law distribution, with high-frequency tokens clustering near the origin and low-frequency tokens positioned farther away. Additionally, token embeddings exhibit a high degree of hyperbolicity, indicating a latent tree-like structure in the embedding space. Building on this observation, we propose to efficiently fine-tune LLMs in hyperbolic space to better exploit the underlying complex structures. However, we found that this fine-tuning in hyperbolic space cannot be achieved with naive application of exponential and logarithmic maps when the embedding and weight matrices both reside in Euclidean space. To address this technical issue, we introduce a new method called hyperbolic low-rank efficient fine-tuning, HypLoRA, that performs low-rank adaptation directly on the hyperbolic manifold, avoiding the cancellation effect caused by the exponential and logarithmic maps, thus preserving the hyperbolic modeling capabilities. Through extensive experiments, we demonstrate that HypLoRA significantly enhances the performance of LLMs on reasoning tasks, particularly for complex reasoning problems. In particular, HypLoRA improves the performance on the complex AQuA dataset by up to 13.0%, showcasing its effectiveness in handling complex reasoning challenges.
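
The cancellation effect mentioned in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes the Poincaré ball model with maps taken at the origin (the paper may use a different hyperbolic model), and all function names and the toy matrices are hypothetical. It shows that wrapping ordinary Euclidean matrix products in logarithmic and exponential maps yields the same result as the plain Euclidean computation, because each logarithmic map undoes the preceding exponential map.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def exp_map_origin(v, c=1.0):
    """Map a tangent vector at the origin onto the Poincare ball (curvature -c)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]

def log_map_origin(y, c=1.0):
    """Inverse of exp_map_origin: pull a ball point back to the tangent space."""
    n = norm(y)
    if n == 0.0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in y]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def naive_hyperbolic_linear(W, x_hyp):
    """Naive tangent-space layer: log map, Euclidean matmul, exp map."""
    return exp_map_origin(matvec(W, log_map_origin(x_hyp)))

# Toy Euclidean input and weights (small norms keep atanh well defined).
x = [0.3, -0.2]
W1 = [[1.0, 0.5], [-0.5, 2.0]]
W2 = [[0.2, 0.0], [0.1, -0.3]]

# "Hyperbolic" pipeline built only from exp/log maps around Euclidean weights.
h = exp_map_origin(x)
h = naive_hyperbolic_linear(W1, h)
h = naive_hyperbolic_linear(W2, h)
hyp_out = log_map_origin(h)

# Plain Euclidean pipeline with the same weights.
euc_out = matvec(W2, matvec(W1, x))

print(hyp_out, euc_out)
```

Up to floating-point error the two outputs coincide, so this naive construction adds no hyperbolic modeling capacity; this is the motivation the abstract gives for adapting directly on the manifold.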

[AI-153] Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation

Link: https://arxiv.org/abs/2410.04002
Authors: Jing Yang, Anderson Rocha
Keywords: Computational methods, aid journalists, require adapting, specific domains, domains and generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted at the 16th IEEE International Workshop on Information Forensics and Security (WIFS) 2024

Abstract:Computational methods to aid journalists in the task often require adapting a model to specific domains and generating explanations. However, most automated fact-checking methods rely on three-class datasets, which do not accurately reflect real-world misinformation. Moreover, fact-checking explanations are often generated based on text summarization of evidence, failing to address the relationship between the claim and the evidence. To address these issues, we extend the self-rationalization method, typically used in natural language inference (NLI) tasks, to fact verification. We propose a label-adaptive learning approach: first, we fine-tune a model to learn veracity prediction with annotated labels (step-1 model). Then, we fine-tune the step-1 model again to learn self-rationalization, using the same data and additional annotated explanations. Our results show that our label-adaptive approach improves veracity prediction by more than ten percentage points (Macro F1) on both the PubHealth and AVeriTec datasets, outperforming the GPT-4 model. Furthermore, to address the high cost of explanation annotation, we generated 64 synthetic explanations from three large language models (GPT-4-turbo, GPT-3.5-turbo, and Llama-3-8B) and used them to few-shot fine-tune our step-1 model. The few-shot synthetic explanation fine-tuned model performed comparably to the fully fine-tuned self-rationalization model, demonstrating the potential of low-budget learning with synthetic data. Our label-adaptive self-rationalization approach presents a promising direction for future research on real-world explainable fact-checking with different labeling schemes.

[AI-154] FastLRNR and Sparse Physics Informed Backpropagation

Link: https://arxiv.org/abs/2410.04001
Authors: Woojin Cho, Kookjin Lee, Noseong Park, Donsub Rim, Gerrit Welper
Keywords: Rank Neural Representation, introduce Sparse Physics, called Low Rank, architecture called Low, Sparse Physics Informed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 10 pages, 3 figures

Abstract:We introduce Sparse Physics Informed Backpropagation (SPInProp), a new class of methods for accelerating backpropagation for a specialized neural network architecture called Low Rank Neural Representation (LRNR). The approach exploits the low rank structure within LRNR and constructs a reduced neural network approximation that is much smaller in size. We call the smaller network FastLRNR. We show that backpropagation of FastLRNR can be substituted for that of LRNR, enabling a significant reduction in complexity. We apply SPInProp to a physics informed neural networks framework and demonstrate how the solution of parametrized partial differential equations is accelerated.

[AI-155] Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Link: https://arxiv.org/abs/2410.03977
Authors: Hongjun Wang, Jiyuan Chen, Zhengwei Yin, Xuan Song, Yinqiang Zheng
Keywords: Cloth-Changing Person Re-Identification, involves recognizing individuals, Cloth-Changing Person, Person Re-Identification, involves recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Cloth-Changing Person Re-Identification (CC-ReID) involves recognizing individuals in images regardless of clothing status. In this paper, we empirically and experimentally demonstrate that completely eliminating or fully retaining clothing features is detrimental to the task. Existing work, whether relying on clothing labels, silhouettes, or other auxiliary data, fundamentally aims to balance the learning of clothing and identity features. However, we find in practice that achieving this balance is challenging and nuanced. In this study, we introduce a novel module called Diverse Norm, which expands personal features into orthogonal spaces and employs channel attention to separate clothing and identity features. A sample re-weighting optimization strategy is also introduced to guarantee the opposite optimization direction. Diverse Norm presents a simple yet effective approach that does not require additional data. Furthermore, Diverse Norm can be seamlessly integrated into ResNet50 and significantly outperforms the state-of-the-art methods.

[AI-156] Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Link: https://arxiv.org/abs/2410.03968
Authors: Sijin Chen, Omar Hagrass, Jason M. Klusowski
Keywords: modern language models, puzzling gap divides, gap divides theory, Decoding strategies play, Decoding Game
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
Comments: 17 pages

Abstract:Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-k and Nucleus sampling, which employ truncation and normalization of the conditional next-token probabilities, have achieved great empirical success but lack theoretical justifications. In this paper, we propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist, who seeks to produce text credible in the true distribution, and Nature, who distorts the true distribution adversarially. After discussing the decomposability of multi-step generation, we derive the optimal strategy in closed form for one-step Decoding Game. It is shown that the adversarial Nature imposes an implicit regularization on likelihood maximization, and truncation-normalization methods are first-order approximations to the optimal strategy under this regularization. Additionally, by generalizing the objective and parameters of Decoding Game, near-optimal strategies encompass diverse methods such as greedy search, temperature scaling, and hybrids thereof. Numerical experiments are conducted to complement our theoretical analysis.
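
For readers unfamiliar with the truncation-normalization heuristics the abstract refers to, the following is a minimal sketch of Top-k and Nucleus (top-p) truncation applied to a single next-token distribution. It illustrates only these standard heuristics, not the paper's game-theoretic derivation; the function names and the toy distribution are illustrative.

```python
def top_k_truncate(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_truncate(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Toy conditional next-token distribution over five tokens.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]

top2 = top_k_truncate(probs, 2)        # keeps the two most likely tokens
nucleus = nucleus_truncate(probs, 0.8)  # keeps three tokens (mass 0.9 >= 0.8)

print(top2)
print(nucleus)
```

Sampling then proceeds from the renormalized distribution; greedy search corresponds to the special case k = 1.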

[AI-157] Variational Language Concepts for Interpreting Foundation Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.03964
Authors: Hengyi Wang, Shiwei Tan, Zhiqing Hong, Desheng Zhang, Hao Wang
Keywords: Foundation Language Models, achieved remarkable success, natural language processing, Foundation Language, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted at EMNLP 2024 findings

Abstract:Foundation Language Models (FLMs) such as BERT and its variants have achieved remarkable success in natural language processing. To date, the interpretability of FLMs has primarily relied on the attention weights in their self-attention layers. However, these attention weights only provide word-level interpretations, failing to capture higher-level structures, and are therefore lacking in readability and intuitiveness. To address this challenge, we first provide a formal definition of conceptual interpretation and then propose a variational Bayesian framework, dubbed VAriational Language Concept (VALC), to go beyond word-level interpretations and provide concept-level interpretations. Our theoretical analysis shows that our VALC finds the optimal language concepts to interpret FLM predictions. Empirical results on several real-world datasets show that our method can successfully provide conceptual interpretation for FLMs.

[AI-158] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Link: https://arxiv.org/abs/2410.03960
Authors: Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He
Keywords: typically observes orders, longer prompt lengths, magnitude longer prompt, generation lengths, enterprise use cases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers’ KV cache using a much earlier layer’s output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs.
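
The throughput figures in the last sentence of the abstract can be cross-checked with simple arithmetic. The sketch below assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token; that approximation is not stated in the abstract, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
FLOPS_PER_GPU = 560e12   # 560 TFLOPs per GPU (normalized throughput)
NUM_GPUS = 4             # 4x H100
PARAMS = 70e9            # Llama-3.1-70B

# Assumption: roughly 2 FLOPs per parameter per processed token.
flops_per_token = 2 * PARAMS

tokens_per_second = FLOPS_PER_GPU * NUM_GPUS / flops_per_token
print(tokens_per_second)
```

Under this approximation the result is 16,000 tokens per second, consistent with the 16K tokens/s quoted for the 4-GPU setup.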

[AI-159] Grounding Language in Multi-Perspective Referential Communication EMNLP2024

Link: https://arxiv.org/abs/2410.03959
Authors: Zineng Tang, Lingjun Mao, Alane Suhr
Keywords: multi-agent embodied environments, embodied environments, multi-agent embodied, referring expression generation, human-written referring expressions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to EMNLP 2024 Main

Abstract:We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another’s visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9% to 69.3% in communicative success and even outperforming the strongest proprietary model.

[AI-160] Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Link: https://arxiv.org/abs/2410.03955
Authors: Gang Li, Wendi Yu, Yao Yao, Wei Tong, Yingbin Liang, Qihang Lin, Tianbao Yang
Keywords: undergoes multiple cycles, model developmental safety, model development, model, model development process
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 41 pages, 8 figures

Abstract:In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system’s ability to handle difficult or emerging tasks. This continual model development process raises a significant issue that the model development for acquiring new or improving existing capabilities may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications especially in safety-critical domains, as failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in the re-improving and re-validation of existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that in the model development process the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure the model developmental safety, we present a safety-centric framework by formulating the model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (aka the CLIP model) for acquiring new capabilities or improving existing capabilities of image classification. We propose an efficient constrained optimization algorithm with theoretical guarantee and use its insights to finetune a CLIP model with task-dependent heads for promoting the model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.

[AI-161] SDA-GRIN for Adaptive Spatial-Temporal Multivariate Time Series Imputation

Link: https://arxiv.org/abs/2410.03954
Authors: Amir Eskandari, Aman Anand, Drishti Sharma, Farhana Zulkernine
Keywords: missing data, multivariate time series, Spatial Dynamic Aware, Spatial, Recurrent Imputation Network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In various applications, multivariate time series often suffer from missing data. This issue can significantly disrupt systems that rely on the data. Spatial and temporal dependencies can be leveraged to impute the missing samples. Existing imputation methods often ignore dynamic changes in spatial dependencies. We propose a Spatial Dynamic Aware Graph Recurrent Imputation Network (SDA-GRIN) which is capable of capturing dynamic changes in spatial dependencies. SDA-GRIN leverages a multi-head attention mechanism to adapt graph structures with time. SDA-GRIN models multivariate time series as a sequence of temporal graphs and uses a recurrent message-passing architecture for imputation. We evaluate SDA-GRIN on four real-world datasets: SDA-GRIN improves MSE by 9.51% for the AQI and 9.40% for AQI-36. On the PEMS-BAY dataset, it achieves a 1.94% improvement in MSE. A detailed ablation study demonstrates the effect of window sizes and missing data on the performance of the method. Project page: this https URL

[AI-162] A Brain-Inspired Regularizer for Adversarial Robustness

Link: https://arxiv.org/abs/2410.03952
Authors: Elie Attias, Cengiz Pehlevan, Dina Obeid
Keywords: Convolutional Neural Networks, slight input perturbations, Convolutional Neural, task failures, visual tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: 10 pages plus appendix, 10 figures (main text), 15 figures (appendix), 3 tables (appendix)

Abstract:Convolutional Neural Networks (CNNs) excel in many visual tasks, but they tend to be sensitive to slight input perturbations that are imperceptible to the human eye, often resulting in task failures. Recent studies indicate that training CNNs with regularizers that promote brain-like representations, using neural recordings, can improve model robustness. However, the requirement to use neural data severely restricts the utility of these methods. Is it possible to develop regularizers that mimic the computational function of neural regularizers without the need for neural recordings, thereby expanding the usability and effectiveness of these techniques? In this work, we inspect a neural regularizer introduced in Li et al. (2019) to extract its underlying strength. The regularizer uses neural representational similarities, which we find also correlate with pixel similarities. Motivated by this finding, we introduce a new regularizer that retains the essence of the original but is computed using image pixel similarities, eliminating the need for neural recordings. We show that our regularization method 1) significantly increases model robustness to a range of black box attacks on various datasets and 2) is computationally inexpensive and relies only on original datasets. Our work explores how biologically motivated loss functions can be used to drive the performance of artificial neural networks.

[AI-163] Learning Truncated Causal History Model for Video Restoration NEURIPS2024

Link: https://arxiv.org/abs/2410.03936
Authors: Amirhosein Ghasemabadi, Muhammad Kamran Janjua, Mohammad Salameh, Di Niu
Keywords: video frames governed, key challenge, transition dynamics, video, video restoration
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024. 24 pages

Abstract:One key challenge to video restoration is to model the transition dynamics of video frames governed by motion. In this work, we propose TURTLE to learn the truncated causal history model for efficient and high-performing video restoration. Unlike traditional methods that process a range of contextual frames in parallel, TURTLE enhances efficiency by storing and summarizing a truncated history of the input frame latent representation into an evolving historical state. This is achieved through a sophisticated similarity-based retrieval mechanism that implicitly accounts for inter-frame motion and alignment. The causal design in TURTLE enables recurrence in inference through state-memorized historical features while allowing parallel training by sampling truncated video clips. We report new state-of-the-art results on a multitude of video restoration benchmark tasks, including video desnowing, nighttime video deraining, video raindrops and rain streak removal, video super-resolution, real-world and synthetic video deblurring, and blind video denoising while reducing the computational cost compared to existing best contextual methods on all these tasks.

[AI-164] Learning Object Properties Using Robot Proprioception via Differentiable Robot-Object Interaction

Link: https://arxiv.org/abs/2410.03920
Authors: Peter Yichen Chen, Chao Liu, Pingchuan Ma, John Eastman, Daniela Rus, Dylan Randle, Yuri Ivanov, Wojciech Matusik
Keywords: robot, properties, system identification, manipulated objects, objects
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

Abstract:Differentiable simulation has become a powerful tool for system identification. While prior work has focused on identifying robot properties using robot-specific data or object properties using object-specific data, our approach calibrates object properties by using information from the robot, without relying on data from the object itself. Specifically, we utilize robot joint encoder information, which is commonly available in standard robotic systems. Our key observation is that by analyzing the robot’s reactions to manipulated objects, we can infer properties of those objects, such as inertia and softness. Leveraging this insight, we develop differentiable simulations of robot-object interactions to inversely identify the properties of the manipulated objects. Our approach relies solely on proprioception – the robot’s internal sensing capabilities – and does not require external measurement tools or vision-based tracking systems. This general method is applicable to any articulated robot and requires only joint position information. We demonstrate the effectiveness of our method on a low-cost robotic platform, achieving accurate mass and elastic modulus estimations of manipulated objects with just a few seconds of computation on a laptop.

[AI-165] Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis

Link: https://arxiv.org/abs/2410.03908
Authors: Amey Hengle,Atharva Kulkarni,Shantanu Patankar,Madhumitha Chandrasekaran,Sneha D’Silva,Jemima Jacob,Rashmi Gupta
Keywords-EN: depression-anxiety comorbidity classification, social media posts, depression-anxiety comorbidity, social media, ANGST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 Pages

Abstract:In this study, we introduce ANGST, a novel, first-of-its-kind benchmark for depression-anxiety comorbidity classification from social media posts. Unlike contemporary datasets that often oversimplify the intricate interplay between different mental health disorders by treating them as isolated conditions, ANGST enables multi-label classification, allowing each post to be simultaneously identified as indicating depression and/or anxiety. Comprising 2876 posts meticulously annotated by expert psychologists and an additional 7667 silver-labeled posts, ANGST offers a more representative sample of online mental health discourse. Moreover, we benchmark ANGST using various state-of-the-art language models, ranging from Mental-BERT to GPT-4. Our results provide significant insights into the capabilities and limitations of these models in complex diagnostic scenarios. While GPT-4 generally outperforms other models, none achieves an F1 score exceeding 72% in multi-class comorbid classification, underscoring the ongoing challenges in applying language models to mental health diagnostics.

[AI-166] Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection

Link: https://arxiv.org/abs/2410.03904
Authors: Ksheeraja Raghavan,Samiran Gode,Ankit Shah,Surabhi Raghavan,Wolfram Burgard,Bhiksha Raj,Rita Singh
Keywords-EN: generation framework specifically, framework specifically designed, general-purpose audio generation, specifically designed, audio generation framework
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, under review

Abstract:We introduce a novel, general-purpose audio generation framework specifically designed for anomaly detection and localization. Unlike existing datasets that predominantly focus on industrial and machine-related sounds, our framework covers a broader range of environments, which is particularly useful in real-world scenarios where only audio data are available, such as video-derived or telephonic audio. To generate such data, we propose a new method inspired by the LLM-Modulo framework, which leverages large language models (LLMs) as world models to simulate such real-world scenarios. This tool is modular, allowing a plug-and-play approach. It operates by first using LLMs to predict plausible real-world scenarios. An LLM then extracts the constituent sounds, their order, and the way in which they should be merged to create coherent wholes. Much like the LLM-Modulo framework, we include rigorous verification of each output stage, ensuring the reliability of the generated data. The data produced using the framework serves as a benchmark for anomaly detection applications, potentially enhancing the performance of models trained on audio data, particularly in handling out-of-distribution cases. Our contributions thus fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.

[AI-167] Improving Node Representation by Boosting Target-Aware Contrastive Loss

Link: https://arxiv.org/abs/2410.03901
Authors: Ying-Chun Lin,Jennifer Neville
Keywords-EN: capturing intricate connections, edges capturing intricate, model complex relationships, relationships between entities, intricate connections
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graphs model complex relationships between entities, with nodes and edges capturing intricate connections. Node representation learning involves transforming nodes into low-dimensional embeddings. These embeddings are typically used as features for downstream tasks. Therefore, their quality has a significant impact on task performance. Existing approaches for node representation learning span (semi-)supervised, unsupervised, and self-supervised paradigms. In graph domains, (semi-)supervised learning often only optimizes models based on class labels, neglecting other abundant graph signals, which limits generalization. While self-supervised or unsupervised learning produces representations that better capture underlying graph signals, the usefulness of these captured signals for downstream target tasks can vary. To bridge this gap, we introduce Target-Aware Contrastive Learning (Target-aware CL) which aims to enhance target task performance by maximizing the mutual information between the target task and node representations with a self-supervised learning process. This is achieved through a sampling function, XGBoost Sampler (XGSampler), to sample proper positive examples for the proposed Target-Aware Contrastive Loss (XTCL). By minimizing XTCL, Target-aware CL increases the mutual information between the target task and node representations, such that model generalization is improved. Additionally, XGSampler enhances the interpretability of each signal by showing the weights for sampling the proper positive examples. We show experimentally that XTCL significantly improves the performance on two target tasks: node classification and link prediction tasks, compared to state-of-the-art models.

[AI-168] Human-aligned Chess with a Bit of Search

Link: https://arxiv.org/abs/2410.03893
Authors: Yiming Zhang,Athul Paul Jacob,Vivian Lai,Daniel Fried,Daphne Ippolito
Keywords-EN: recent years, surpassed the strongest, match human intelligence, quest to match, Chess
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Chess has long been a testbed for AI’s quest to match human intelligence, and in recent years, chess AI systems have surpassed the strongest humans at the game. However, these systems are not human-aligned; they are unable to match the skill levels of all human partners or model human-like behaviors beyond piece movement. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game. Allie is trained on log sequences of real chess games to model the behaviors of human chess players across the skill spectrum, including non-move behaviors such as pondering times and resignations. In offline evaluations, we find that Allie exhibits humanlike behavior: it outperforms the existing state-of-the-art in human chess move prediction and “ponders” at critical positions. The model learns to reliably assign reward at each game state, which can be used at inference as a reward function in a novel time-adaptive Monte-Carlo tree search (MCTS) procedure, where the amount of search depends on how long humans would think in the same positions. Adaptive search enables remarkable skill calibration; in a large-scale online evaluation against players with ratings from 1000 to 2600 Elo, our adaptive search method leads to a skill gap of only 49 Elo on average, substantially outperforming search-free and standard MCTS baselines. Against grandmaster-level (2500 Elo) opponents, Allie with adaptive search exhibits the strength of a fellow grandmaster, all while learning exclusively from humans.

[AI-169] Towards Cost Sensitive Decision Making

Link: https://arxiv.org/abs/2410.03892
Authors: Yang Li,Junier Oliva
Keywords-EN: additional relevant information, real-world situations, additional relevant, relevant information, information when making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many real-world situations allow for the acquisition of additional relevant information when making decisions with limited or uncertain data. However, traditional RL approaches either require all features to be acquired beforehand (e.g. in an MDP) or regard part of them as missing data that cannot be acquired (e.g. in a POMDP). In this work, we consider RL models that may actively acquire features from the environment to improve decision quality and certainty, while automatically balancing the cost of the feature acquisition process against the reward of the task decision process. We propose the Active-Acquisition POMDP (AA-POMDP) and identify two types of acquisition process for different application domains. To assist the agent in the actively-acquired, partially-observed environment and alleviate the exploration-exploitation dilemma, we develop a model-based approach, where a deep generative model is used to capture the dependencies among the features and impute the unobserved features. The imputations essentially represent the beliefs of the agent. Equipped with the dynamics model, we develop hierarchical RL algorithms to solve both types of AA-POMDPs. Empirical results demonstrate that our approach achieves considerably better performance than existing POMDP-RL solutions.

[AI-170] Solving Dual Sourcing Problems with Supply Mode Dependent Failure Rates

Link: https://arxiv.org/abs/2410.03887
Authors: Fabian Akkerman,Nils Knofius,Matthieu van der Heijden,Martijn Mes
Keywords-EN: supply mode dependent, mode dependent failure, paper investigates dual, managing spare parts, investigates dual sourcing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper investigates dual sourcing problems with supply mode dependent failure rates, particularly relevant in managing spare parts for downtime-critical assets. To enhance resilience, businesses increasingly adopt dual sourcing strategies using both conventional and additive manufacturing techniques. This paper explores how these strategies can optimise sourcing by addressing variations in part properties and failure rates. A significant challenge is the distinct failure characteristics of parts produced by these methods, which influence future demand. To tackle this, we propose a new iterative heuristic and several reinforcement learning techniques combined with an endogenous parameterised learning (EPL) approach. This EPL approach - compatible with any learning method - allows a single policy to handle various input parameters for multiple items. In a stylised setting, our best policy achieves an average optimality gap of 0.4%. In a case study within the energy sector, our policies outperform the baseline in 91.1% of instances, yielding average cost savings up to 22.6%.

[AI-171] KidLM: Advancing Language Models for Children – Early Insights and Future Directions EMNLP2024

Link: https://arxiv.org/abs/2410.03884
Authors: Mir Tafseer Nayeem,Davood Rafiei
Keywords-EN: Recent studies highlight, creating educational tools, significant challenges remain, Recent studies, maintaining key child-specific
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Accepted to EMNLP 2024 (long, main)

Abstract:Recent studies highlight the potential of large language models in creating educational tools for children, yet significant challenges remain in maintaining key child-specific properties such as linguistic nuances, cognitive needs, and safety standards. In this paper, we explore foundational steps toward the development of child-specific language models, emphasizing the necessity of high-quality pre-training data. We introduce a novel user-centric data collection pipeline that involves gathering and validating a corpus specifically written for and sometimes by children. Additionally, we propose a new training objective, Stratified Masking, which dynamically adjusts masking probabilities based on our domain-specific child language data, enabling models to prioritize vocabulary and concepts more suitable for children. Experimental evaluations demonstrate that our model excels in understanding lower grade-level text, maintains safety by avoiding stereotypes, and captures children’s unique preferences. Furthermore, we provide actionable insights for future research and development in child-specific language modeling.

[AI-172] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Link: https://arxiv.org/abs/2410.03869
Authors: Wenxuan Wang,Kuiyi Gao,Zihan Jia,Youliang Yuan,Jen-tse Huang,Qiuzhi Liu,Shuai Wang,Wenxiang Jiao,Zhaopeng Tu
Keywords-EN: Stable Diffusion, Text-based image generation, hold significant potential, Diffusion and DALL-E, Text-based image
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them a focus of attention in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, encompassing nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro demonstrate that our CoJ attack method can successfully bypass the safeguards of models in over 60% of cases, significantly outperforming other jailbreaking methods (14%). Further, to enhance these models’ safety against our CoJ attack method, we also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend against over 95% of CoJ attacks. We release our dataset and code to facilitate AI safety research.

[AI-173] Empowering Domain-Specific Language Models with Graph-Oriented Databases: A Paradigm Shift in Performance and Model Maintenance

Link: https://arxiv.org/abs/2410.03867
Authors: Ricardo Di Pasquale,Soledad Represa
Keywords-EN: domain-specific language models, domain-specific language, application domains, specific application domains, industry-specific requirements
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:In an era dominated by data, the management and utilization of domain-specific language have emerged as critical challenges in various application domains, particularly those with industry-specific requirements. Our work is driven by the need to effectively manage and process large volumes of short text documents inherent in specific application domains. By leveraging domain-specific knowledge and expertise, our approach aims to shape factual data within these domains, thereby facilitating enhanced utilization and understanding by end-users. Central to our methodology is the integration of domain-specific language models with graph-oriented databases, facilitating seamless processing, analysis, and utilization of textual data within targeted domains. Our work underscores the transformative potential of the partnership between domain-specific language models and graph-oriented databases. This cooperation aims to assist researchers and engineers in metric usage, mitigating latency issues, boosting explainability, improving debugging, and improving overall model performance. Moving forward, we envision our work as a guide for AI engineers, providing valuable insights for the implementation of domain-specific language models in conjunction with graph-oriented databases, as well as valuable experience in the full life-cycle maintenance of this kind of product.

[AI-174] DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Link: https://arxiv.org/abs/2410.03864
Authors: Murong Yue,Wenlin Yao,Haitao Mi,Dian Yu,Ziyu Yao,Dong Yu
Keywords-EN: large language models, gained significant attention, task-solving LLM, LLM, specific task-solving LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

[AI-175] SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Link: https://arxiv.org/abs/2410.03859
Authors: John Yang,Carlos E. Jimenez,Alex L. Zhang,Kilian Lieret,Joyce Yang,Xindi Wu,Ori Press,Niklas Muennighoff,Gabriel Synnaeve,Karthik R. Narasimhan,Diyi Yang,Sida I. Wang,Ofir Press
Keywords-EN: Autonomous systems, capable of fixing, SWE-bench, Autonomous, software engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each SWE-bench M task instance contains at least one image in its problem statement or unit tests. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization. Lastly, we show that SWE-agent’s flexible language-agnostic features enable it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.

[AI-176] A Survey on Group Fairness in Federated Learning: Challenges, Taxonomy of Solutions and Directions for Future Research

Link: https://arxiv.org/abs/2410.03855
Authors: Teresa Salazar,Helder Araújo,Alberto Cano,Pedro Henriques Abreu
Keywords-EN: Group fairness, achieving equitable outcomes, race or gender, equitable outcomes, Federated learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Group fairness in machine learning is a critical area of research focused on achieving equitable outcomes across different groups defined by sensitive attributes such as race or gender. Federated learning, a decentralized approach to training machine learning models across multiple devices or organizations without sharing raw data, amplifies the need for fairness due to the heterogeneous data distributions across clients, which can exacerbate biases. The intersection of federated learning and group fairness has attracted significant interest, with 47 research works specifically dedicated to addressing this issue. However, no dedicated survey has focused comprehensively on group fairness in federated learning. In this work, we present an in-depth survey on this topic, addressing the critical challenges and reviewing related works in the field. We create a novel taxonomy of these approaches based on key criteria such as data partitioning, location, and applied strategies. Additionally, we explore broader concerns related to this problem and investigate how different approaches handle the complexities of various sensitive groups and their intersections. Finally, we review the datasets and applications commonly used in current research. We conclude by highlighting key areas for future research, emphasizing the need for more methods to address the complexities of achieving group fairness in federated systems.

[AI-177] Model-Based Reward Shaping for Adversarial Inverse Reinforcement Learning in Stochastic Environments

Link: https://arxiv.org/abs/2410.03847
Authors: Simon Sinong Zhan,Qingyuan Wu,Philip Wang,Yixuan Wang,Ruochen Jiao,Chao Huang,Qi Zhu
Keywords-EN: Inverse Reinforcement Learning, Adversarial Inverse Reinforcement, Reinforcement Learning, Adversarial Inverse, Inverse Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we aim to tackle the limitation of the Adversarial Inverse Reinforcement Learning (AIRL) method in stochastic environments where theoretical results cannot hold and performance is degraded. To address this issue, we propose a novel method which infuses the dynamics information into the reward shaping with the theoretical guarantee for the induced optimal policy in the stochastic environments. Incorporating our novel model-enhanced rewards, we present a novel Model-Enhanced AIRL framework, which integrates transition model estimation directly into reward shaping. Furthermore, we provide a comprehensive theoretical analysis of the reward error bound and performance difference bound for our method. The experimental results in MuJoCo benchmarks show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines.

[AI-178] Explaining the (Not So) Obvious: Simple and Fast Explanation of STAN, a Next Point of Interest Recommendation System

Link: https://arxiv.org/abs/2410.03841
Authors: Fajrian Yunus,Talel Abdessalem
Keywords-EN: explain machine learning, machine learning systems, machine learning, machine learning methods, lot of effort
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A lot of effort has been expended in recent years to explain machine learning systems. However, some machine learning methods are inherently explainable, and thus are not completely black box. This enables developers to make sense of the output without developing a complex and expensive explainability technique. Besides that, explainability should be tailored to suit the context of the problem. In a recommendation system which relies on collaborative filtering, the recommendation is based on the behaviors of similar users, so the explanation should tell which other users are similar to the current user. Similarly, if the recommendation system is based on sequence prediction, the explanation should also tell which input timesteps are the most influential. We demonstrate this philosophy/paradigm in STAN (Spatio-Temporal Attention Network for Next Location Recommendation), a next Point of Interest recommendation system based on collaborative filtering and sequence prediction. We also show that the explanation helps to “debug” the output.

[AI-179] GraphRouter: A Graph-based Router for LLM Selections

Link: https://arxiv.org/abs/2410.03834
Authors: Tao Feng,Yanzhen Shen,Jiaxuan You
Keywords-EN: Large Language Models, Language Models, Large Language, rapidly growing number, variety of Large
Categories: Artificial Intelligence (cs.AI)

Abstract:The rapidly growing number and variety of Large Language Models (LLMs) present significant challenges in efficiently selecting the appropriate LLM for a given query, especially considering the trade-offs between performance and computational cost. Current LLM selection methods often struggle to generalize across new LLMs and different tasks because of their limited ability to leverage contextual interactions among tasks, queries, and LLMs, as well as their dependence on a transductive learning framework. To address these shortcomings, we introduce a novel inductive graph framework, named GraphRouter, which fully utilizes the contextual information among tasks, queries, and LLMs to enhance the LLM selection process. GraphRouter constructs a heterogeneous graph comprising task, query, and LLM nodes, with interactions represented as edges, which efficiently captures the contextual information between the query’s requirements and the LLM’s capabilities. Through an innovative edge prediction mechanism, GraphRouter is able to predict attributes (the effect and cost of LLM response) of potential edges, allowing for optimized recommendations that adapt to both existing and newly introduced LLMs without requiring retraining. Comprehensive experiments across three distinct effect-cost weight scenarios have shown that GraphRouter substantially surpasses existing routers, delivering a minimum performance improvement of 12.3%. In addition, it achieves enhanced generalization across new LLM settings and supports diverse tasks with at least a 9.5% boost in effect and a significant reduction in computational demands. This work endeavors to apply a graph-based approach for the contextual and adaptive selection of LLMs, offering insights for real-world applications. Our code for GraphRouter will soon be released at this https URL.
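
To make the effect-cost trade-off concrete, here is a minimal sketch of the final selection step only, assuming the per-LLM effect and cost have already been predicted. The linear scoring rule, the function name `route`, the parameter `effect_weight`, and the model names are all hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, Tuple


def route(predicted: Dict[str, Tuple[float, float]], effect_weight: float) -> str:
    """Return the LLM name with the best weighted effect-minus-cost score.

    predicted maps an LLM name to a (predicted_effect, predicted_cost) pair.
    The linear scoring rule below is an assumption made for illustration.
    """
    if not 0.0 <= effect_weight <= 1.0:
        raise ValueError("effect_weight must lie in [0, 1]")

    def score(name: str) -> float:
        effect, cost = predicted[name]
        return effect_weight * effect - (1.0 - effect_weight) * cost

    return max(predicted, key=score)


# Hypothetical predictions for a single query.
predicted = {"small-llm": (0.60, 0.10), "large-llm": (0.90, 0.80)}
quality_first = route(predicted, effect_weight=1.0)  # ignores cost entirely
balanced = route(predicted, effect_weight=0.5)       # weighs cost equally
```

With these made-up numbers, a quality-only weighting selects the larger model, while an equal weighting prefers the cheaper one.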

[AI-180] Large Language Models can be Strong Self-Detoxifiers

Link: https://arxiv.org/abs/2410.03818
Authors: Ching-Yun Ko,Pin-Yu Chen,Payel Das,Youssef Mroueh,Soham Dan,Georgios Kollias,Subhajit Chaudhury,Tejaswini Pedapati,Luca Daniel
Keywords-EN: aligning large language, large language models, likelihood of generating, generating harmful, essential task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages

Abstract:Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy. Evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level by only using the LLM’s internal representations.

[AI-181] Can Mamba Always Enjoy the “Free Lunch”?

Link: https://arxiv.org/abs/2410.03810
Authors: Ruifeng Ren,Zhicong Li,Yong Liu
Keywords-EN: Large Language Models, current Large Language, Language Models, Large Language, current Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Transformers have been the cornerstone of current Large Language Models (LLMs); however, their linear growth in overhead during inference with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-level size during inference, and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask: can Mamba always enjoy the “free lunch”? In this paper, we focus on analyzing the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, while it can achieve perfect performance when the size scales linearly with sequence length. Based on this observation, we analyze Mamba’s ability to tackle DP problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.

[AI-182] Mixture of Attentions For Speculative Decoding

Link: https://arxiv.org/abs/2410.03804
Authors: Matthieu Zimmer,Milan Gritta,Gerasimos Lampouras,Haitham Bou Ammar,Jun Wang
Keywords-EN: Large Language Models, Large Language, parameters of Large, Language Models, computational requirements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
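
As a rough illustration of the propose-then-verify loop that speculative decoding relies on (not the Mixture of Attentions architecture itself), the sketch below uses greedy verification with two toy next-token functions. The names `speculative_decode`, `draft_next`, `target_next`, and `k` are hypothetical, the toy "models" are assumptions, and the bonus token that real systems emit after a fully accepted draft is omitted for brevity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; the output matches greedy decoding with target_next."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The small draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2) The target model checks each proposed position. A real system scores
        #    all positions in one parallel forward pass; this toy version loops.
        accepted: List[int] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
        out.extend(accepted)
    return out


# Toy models (assumptions): the target counts upward modulo 10;
# the draft agrees except when the last token is 4.
def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = speculative_decode(target_next, draft_next, [0], max_new=8)
```

Because every emitted token is either confirmed or supplied by the target function, the draft only affects how many target checks are needed per accepted token, not the final sequence.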

[AI-183] Text-guided Diffusion Model for 3D Molecule Generation

Link: https://arxiv.org/abs/2410.03803
Authors: Yanchen Luo, Junfeng Fang, Sihang Li, Zhiyuan Liu, Jiancan Wu, An Zhang, Wenjie Du, Xiang Wang
Keywords: Text-guided Small Molecule, Small Molecule Generation, crucial in biology, drug discovery, targeted properties
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)

Abstract: The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions, struggling with complex customizations described in detailed human language. To address this, we instead propose text guidance and introduce TextSMOG, a new Text-guided Small Molecule Generation Approach via 3D Diffusion Model, which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG’s proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.

[AI-184] Dynamic Evidence Decoupling for Trusted Multi-view Learning

Link: https://arxiv.org/abs/2410.03796
Authors: Ying Liu, Lihong Liu, Cai Xu, Xiangyu Song, Ziyu Guan, Wei Zhao
Keywords: Multi-view learning methods, trusted multi-view learning, Multi-view learning, improving decision accuracy, improving decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Multi-view learning methods often focus on improving decision accuracy, while neglecting the decision uncertainty, limiting their suitability for safety-critical applications. To mitigate this, researchers propose trusted multi-view learning methods that estimate classification probabilities and uncertainty by learning the class distributions for each instance. However, these methods assume that the data from each view can effectively differentiate all categories, ignoring the semantic vagueness phenomenon in real-world multi-view data. Our findings demonstrate that this phenomenon significantly suppresses the learning of view-specific evidence in existing methods. We propose a Consistent and Complementary-aware trusted Multi-view Learning (CCML) method to solve this problem. We first construct view opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Next, we dynamically decouple the consistent and complementary evidence. The consistent evidence is derived from the shared portions across all views, while the complementary evidence is obtained by averaging the differing portions across all views. We ensure that the opinion constructed from the consistent evidence strictly aligns with the ground-truth category. For the opinion constructed from the complementary evidence, we allow for potential vagueness in the evidence. We compare CCML with state-of-the-art baselines on one synthetic and six real-world datasets. The results validate the effectiveness of the dynamic evidence decoupling strategy and show that CCML significantly outperforms baselines on accuracy and reliability. The code is released at this https URL.
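
Sketch (not from the paper): in evidential deep learning an "opinion" is commonly built from non-negative per-class evidence with the subjective-logic formulas b_k = e_k / S and u = K / S, where S = Σ(e_k + 1). The code below shows that mapping, plus one simple reading of the decoupling step. Taking the element-wise minimum across views as the "shared portion" and averaging the remainders as the "differing portion" is an assumption made for illustration only, not CCML's actual definition; the function names are hypothetical.

```python
from typing import List, Tuple


def opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative class evidence to (belief masses, uncertainty)."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def decouple(views: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Split per-view evidence into consistent and complementary parts.

    Assumption for illustration: consistent = element-wise minimum over views,
    complementary = mean of what each view holds beyond that minimum.
    """
    columns = list(zip(*views))  # one tuple of per-view evidence per class
    consistent = [min(col) for col in columns]
    complementary = [
        sum(e - c for e in col) / len(views)
        for col, c in zip(columns, consistent)
    ]
    return consistent, complementary
```

For example, with two views holding evidence `[4.0, 1.0, 0.0]` and `[2.0, 3.0, 0.0]`, both views support the first class with at least 2 units of evidence, so that amount lands in the consistent part, while the disagreement about the second class ends up in the complementary part.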

[AI-185] People are poorly equipped to detect AI-powered voice clones

Link: https://arxiv.org/abs/2410.03791
Authors: Sarah Barrington, Hany Farid
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)

[AI-186] CalliffusionV2: Personalized Natural Calligraphy Generation with Flexible Multi-modal Control

Link: https://arxiv.org/abs/2410.03787
Authors: Qisheng Liao, Liang Li, Yulang Fei, Gus Xia
Keywords: flexible multi-modal control, produce natural Chinese, natural Chinese calligraphy, natural Chinese, flexible multi-modal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 7 figures

Abstract:In this paper, we introduce CalliffusionV2, a novel system designed to produce natural Chinese calligraphy with flexible multi-modal control. Unlike previous approaches that rely solely on image or text inputs and lack fine-grained control, our system leverages both images to guide generations at fine-grained levels and natural language texts to describe the features of generations. CalliffusionV2 excels at creating a broad range of characters and can quickly learn new styles through a few-shot learning approach. It is also capable of generating non-Chinese characters without prior training. Comprehensive tests confirm that our system produces calligraphy that is both stylistically accurate and recognizable by neural network classifiers and human evaluators.

[AI-187] AI-rays: Exploring Bias in the Gaze of AI Through a Multimodal Interactive Installation SIGGRAPH

Link: https://arxiv.org/abs/2410.03786
Authors: Ziyao Gao, Yiwen Zhang, Ling Li, Theodoros Papatheodorou, Wei Zeng
Keywords: biased social classifications, social classifications, Data surveillance, covert and pervasive, result in biased
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Siggraph Asia 2024 Art Paper

Abstract:Data surveillance has become more covert and pervasive with AI algorithms, which can result in biased social classifications. Appearance offers intuitive identity signals, but what does it mean to let AI observe and speculate on them? We introduce AI-rays, an interactive installation where AI generates speculative identities from participants’ appearance which are expressed through synthesized personal items placed in participants’ bags. It uses speculative X-ray visions to contrast reality with AI-generated assumptions, metaphorically highlighting AI’s scrutiny and biases. AI-rays promotes discussions on modern surveillance and the future of human-machine reality through a playful, immersive experience exploring AI biases.

[AI-188] Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure

Link: https://arxiv.org/abs/2410.03781
Authors: Romain Puech, Jakub Macina, Julia Chatain, Mrinmaya Sachan, Manu Kapur
Keywords: Large Language Models, Productive Failure, Large Language, Productive Failure tutoring, Pedagogical Steering
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 18 pages, 9 figures, 6 tables

Abstract:One-to-one tutoring is one of the most efficient methods of teaching. Following the rise in popularity of Large Language Models (LLMs), there have been efforts to use them to create conversational tutoring systems, which can make the benefits of one-to-one tutoring accessible to everyone. However, current LLMs are primarily trained to be helpful assistants and thus lack crucial pedagogical skills. For example, they often quickly reveal the solution to the student and fail to plan for a richer multi-turn pedagogical interaction. To use LLMs in pedagogical scenarios, they need to be steered towards using effective teaching strategies: a problem we introduce as Pedagogical Steering and believe to be crucial for the efficient use of LLMs as tutors. We address this problem by formalizing a concept of tutoring strategy, and introducing StratL, an algorithm to model a strategy and use prompting to steer the LLM to follow this strategy. As a case study, we create a prototype tutor for high school math following Productive Failure (PF), an advanced and effective learning design. To validate our approach in a real-world setting, we run a field study with 17 high school students in Singapore. We quantitatively show that StratL succeeds in steering the LLM to follow a Productive Failure tutoring strategy. We also thoroughly investigate the existence of spillover effects on desirable properties of the LLM, like its ability to generate human-like answers. Based on these results, we highlight the challenges in Pedagogical Steering and suggest opportunities for further improvements. We further encourage follow-up research by releasing a dataset of Productive Failure problems and the code of our prototype and algorithm.

[AI-189] Discovering Message Passing Hierarchies for Mesh-Based Physics Simulation

Link: https://arxiv.org/abs/2410.03779
Authors: Huayu Deng, Xiangming Zhu, Yunbo Wang, Xiaokang Yang
Keywords: large-scale mesh-based physics, message passing, powerful tool, tool for large-scale, large-scale mesh-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract: Graph neural networks have emerged as a powerful tool for large-scale mesh-based physics simulation. Existing approaches primarily employ hierarchical, multi-scale message passing to capture long-range dependencies within the graph. However, these graph hierarchies are typically fixed and manually designed, and therefore do not adapt to the evolving dynamics present in complex physical systems. In this paper, we introduce a novel neural network named DHMP, which learns Dynamic Hierarchies for Message Passing networks through a differentiable node selection method. The key component is the anisotropic message passing mechanism, which operates at both intra-level and inter-level interactions. Unlike existing methods, it first supports directionally non-uniform aggregation of dynamic features between adjacent nodes within each graph hierarchy. Second, it determines node selection probabilities for the next hierarchy according to different physical contexts, thereby creating more flexible message shortcuts for learning remote node relations. Our experiments demonstrate the effectiveness of DHMP, achieving 22.7% improvement on average compared to recent fixed-hierarchy message passing networks across five classic physics simulation datasets.

[AI-190] Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Link: https://arxiv.org/abs/2410.03777
Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-191] Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Link: https://arxiv.org/abs/2410.03775
Authors: Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth
Keywords: automatic evaluation, human, automatic evaluation methods, automatic, evaluation
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: The effectiveness of automatic evaluation of generative models is typically measured by comparing it to human evaluation using correlation metrics. However, metrics like Krippendorff’s $\alpha$ and Randolph’s $\kappa$, originally designed to measure the reliability of human labeling, make assumptions about human behavior and the labeling process. In this paper, we show how relying on a single aggregate correlation score can obscure fundamental differences between human behavior and automatic evaluation methods, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human labels (gathered during human evaluation) is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label compared to human-to-human (HH) correlation. This can create the misleading impression that automatic evaluation is accurate enough to approximate the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine labels and human majority labels declines, falling below HH correlation. Based on these findings, we first propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric, binned Jensen-Shannon Divergence for perception, to better measure the effectiveness of automatic evaluations in such scenarios. Third, we present visualization techniques, perception charts, to compare the strengths and limitations of automatic evaluation and to contextualize correlation measures appropriately.
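
Sketch (not from the paper): the standard Jensen-Shannon divergence between a human label distribution and a machine label distribution can be computed as below. This is only the textbook JSD with base-2 logarithms, so values lie in [0, 1]; the paper's binning scheme is not reproduced here, and the helper names and example labels are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def distribution(labels: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Relative frequency of each category among the given labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, and bounded by 1 with log base 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


categories = ["good", "bad"]
human = distribution(["good", "good", "bad", "good"], categories)
machine = distribution(["good", "bad", "bad", "bad"], categories)
divergence = js_divergence(human, machine)
```

Unlike agreement with a single majority label, a divergence of this kind compares the whole spread of human labels with the spread of machine labels, which is the property the abstract argues matters for perception-based tasks.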

[AI-192] Human-Based Risk Model for Improved Driver Support in Interactive Driving Scenarios

Link: https://arxiv.org/abs/2410.03774
Authors: Tim Puphal, Benedict Flade, Matti Krüger, Ryohei Hirano, Akihito Kimata
Keywords: driver, driver support, addresses the problem, risk model, human-based risk model
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:This paper addresses the problem of human-based driver support. Nowadays, driver support systems help users to operate safely in many driving situations. Nevertheless, these systems do not fully use the rich information that is available from sensing the human driver. In this paper, we therefore present a human-based risk model that uses driver information for improved driver support. In contrast to state of the art, our proposed risk model combines a) the current driver perception based on driver errors, such as the driver overlooking another vehicle (i.e., notice error), and b) driver personalization, such as the driver being defensive or confident. In extensive simulations of multiple interactive driving scenarios, we show that our novel human-based risk model achieves earlier warning times and reduced warning errors compared to a baseline risk model not using human driver information.

[AI-193] Precision Knowledge Editing: Enhancing Safety in Large Language Models

Link: https://arxiv.org/abs/2410.03772
Authors: Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian
Keywords: demonstrated remarkable capabilities, Large language models, pose risks related, Large language, Precision Knowledge Editing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but they also pose risks related to the generation of toxic or harmful content. This work introduces Precision Knowledge Editing (PKE), an advanced technique that builds upon existing knowledge editing methods to more effectively identify and modify toxic parameter regions within LLMs. By leveraging neuron weight tracking and activation pathway tracing, PKE achieves finer granularity in toxic content management compared to previous methods like Detoxifying Instance Neuron Modification (DINM). Our experiments demonstrate that PKE significantly reduces the attack success rate (ASR) across various models, including Llama2-7b and Llama-3-8b-instruct, while maintaining overall model performance. We also compared against several closed-source models (gpt-4-0613 and Claude 3 Sonnet) in our experiments, and found that models adjusted using our method far outperformed the closed-source models in terms of safety. This research contributes to the ongoing efforts to make LLMs safer and more reliable for real-world applications.

[AI-194] A Two-Stage Proactive Dialogue Generator for Efficient Clinical Information Collection Using Large Language Model

Link: https://arxiv.org/abs/2410.03770
Authors: Xueshen Li, Xinlong Hou, Nirupama Ravi, Ziyi Huang, Yu Gan
Keywords: successful disease diagnosis, Efficient patient-doctor interaction, disease diagnosis, patient-doctor interaction, key factors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Prepare for submission

Abstract: Efficient patient-doctor interaction is among the key factors for a successful disease diagnosis. During the conversation, the doctor could query complementary diagnostic information, such as the patient’s symptoms, previous surgery, and other related information that goes beyond medical evidence data (test results) to enhance disease diagnosis. However, this procedure is usually time-consuming and inefficient, and could potentially be optimized through computer-assisted systems. As such, we propose a diagnostic dialogue system to automate the patient information collection procedure. By exploiting medical history and conversation logic, our conversation agents, particularly the doctor agent, can pose multi-round clinical queries to effectively collect the most relevant disease diagnostic information. Moreover, benefiting from our two-stage recommendation structure, carefully designed ranking criteria, and interactive patient agent, our model is able to overcome the under-exploration and non-flexible challenges in dialogue generation. Our experimental results on a real-world medical conversation dataset show that our model can generate clinical queries that mimic the conversation style of real doctors, with efficient fluency, professionalism, and safety, while effectively collecting relevant disease diagnostic information.

[AI-195] SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

Link: https://arxiv.org/abs/2410.03769
Authors: Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

[AI-196] Reasoning Elicitation in Language Models via Counterfactual Feedback

Link: https://arxiv.org/abs/2410.03767
Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González
Keywords: capabilities remain underdeveloped, remain underdeveloped, increasing effectiveness, language models, reasoning capabilities remain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

[AI-197] FutureFill: Fast Generation from Convolutional Sequence Models

Link: https://arxiv.org/abs/2410.03766
Authors: Naman Agarwal, Xinyi Chen, Evan Dogariu, Vlad Feinberg, Daniel Suo, Peter Bartlett, Elad Hazan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[AI-198] Getting in the Door: Streamlining Intake in Civil Legal Services with Large Language Models

Link: https://arxiv.org/abs/2410.03762
Authors: Quinten Steenhuis, Hannes Westermann
Keywords: legal aid program, free legal aid, free legal, legal aid, aid program
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Legal intake, the process of finding out if an applicant is eligible for help from a free legal aid program, takes significant time and resources. In part this is because eligibility criteria are nuanced, open-textured, and require frequent revision as grants start and end. In this paper, we investigate the use of large language models (LLMs) to reduce this burden. We describe a digital intake platform that combines logical rules with LLMs to offer eligibility recommendations, and we evaluate the ability of 8 different LLMs to perform this task. We find promising results for this approach to help close the access to justice gap, with the best model reaching an F1 score of .82, while minimizing false negatives.

[AI-199] Towards a Deeper Understanding of Transformer for Residential Non-intrusive Load Monitoring

Link: https://arxiv.org/abs/2410.03758
Authors: Minhajur Rahman, Yasir Arafat
Keywords: Non-Intrusive Load Monitoring, Load Monitoring, Non-Intrusive Load, demonstrated impressive performance, recent years
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: Accepted to 2024 International Conference on Innovation in Science, Engineering and Technology (ICISET)

Abstract: Transformer models have demonstrated impressive performance in Non-Intrusive Load Monitoring (NILM) applications in recent years. Despite their success, existing studies have not thoroughly examined the impact of various hyper-parameters on model performance, which is crucial for advancing high-performing transformer models. In this work, a comprehensive series of experiments has been conducted to analyze the influence of these hyper-parameters in the context of residential NILM. This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance. Furthermore, the role of the masking ratio has been explored in BERT-style transformer training, providing a detailed investigation into its impact on NILM tasks. Based on these experiments, the optimal hyper-parameters have been selected and used to train a transformer model, which surpasses the performance of existing models. The experimental findings offer valuable insights and guidelines for optimizing transformer architectures, aiming to enhance their effectiveness and efficiency in NILM applications. It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models for NILM.

[AI-200] Real-World Data and Calibrated Simulation Suite for Offline Training of Reinforcement Learning Agents to Optimize Energy and Emission in Buildings for Environmental Sustainability

Link: https://arxiv.org/abs/2410.03756
Authors: Judah Goldfeder, John Sipple
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)

[AI-201] Efficient Streaming LLM for Speech Recognition

Link: https://arxiv.org/abs/2410.03752
Authors: Junteng Jia, Gil Keren, Wei Zhou, Egor Lakomkin, Xiaohui Zhang, Chunyang Wu, Frank Seide, Jay Mahadeokar, Ozlem Kalinli
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

[AI-202] SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models EMNLP-24

Link: https://arxiv.org/abs/2410.03750
Authors: Juan Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To be published in EMNLP-24 Findings

[AI-203] Distributed AI Platform for the 6G RAN

Link: https://arxiv.org/abs/2410.03747
Authors: Ganesh Ananthanarayanan, Xenofon Foukas, Bozidar Radunovic, Yongguang Zhang
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

[AI-204] Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging EMNLP2024

Link: https://arxiv.org/abs/2410.03743
Authors: Yiming Ju, Ziyi Ni, Xingrun Xing, Zhixiong Zeng, Hanyu Zhao, Siqi Fan, Zheng Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024

[AI-205] Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Link: https://arxiv.org/abs/2410.03742
Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-206] Towards Democratization of Subspeciality Medical Expertise

Link: https://arxiv.org/abs/2410.03741
Authors: Jack W. O’Sullivan, Anil Palepu, Khaled Saab, Wei-Hung Weng, Yong Cheng, Emily Chu, Yaanik Desai, Aly Elezaby, Daniel Seung Kim, Roy Lan, Wilson Tang, Natalie Tapaskar, Victoria Parikh, Sneha S. Jain, Kavita Kulkarni, Philip Mansfield, Dale Webster, Juraj Gottweis, Joelle Barral, Mike Schaekermann, Ryutaro Tanno, S. Sara Mahdavi, Vivek Natarajan, Alan Karthikesalingam, Euan Ashley, Tao Tu
Keywords: Articulate Medical Intelligence, Medical Intelligence Explorer, AMIE, life-threatening diseases, poses a significant
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract: The scarcity of subspecialist medical expertise, particularly in rare, complex and life-threatening diseases, poses a significant challenge for healthcare delivery. This issue is particularly acute in cardiology where timely, accurate management determines outcomes. We explored the potential of AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based experimental AI system optimized for diagnostic dialogue, to augment and support clinical decision-making in this challenging context. We curated a real-world dataset of 204 complex cases from a subspecialist cardiology practice, including results for electrocardiograms, echocardiograms, cardiac MRI, genetic tests, and cardiopulmonary stress tests. We developed a ten-domain evaluation rubric used by subspecialists to evaluate the quality of diagnosis and clinical management plans produced by general cardiologists or AMIE, the latter enhanced with web-search and self-critique capabilities. AMIE was rated superior to general cardiologists for 5 of the 10 domains (with preference ranging from 9% to 20%), and equivalent for the rest. Access to AMIE’s response improved cardiologists’ overall response quality in 63.7% of cases while lowering quality in just 3.4%. Cardiologists’ responses with access to AMIE were superior to cardiologist responses without access to AMIE for all 10 domains. Qualitative examinations suggest that AMIE and general cardiologists could complement each other, with AMIE being thorough and sensitive and general cardiologists being concise and specific. Overall, our results suggest that specialized medical LLMs have the potential to augment general cardiologists’ capabilities by bridging gaps in subspecialty expertise, though further research and validation are essential for wide clinical utility.

[AI-207] Grammar Induction from Visual Speech and Text

Link: https://arxiv.org/abs/2410.03739
Authors: Yu Zhao, Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-seng Chua
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-208] ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation

Link: https://arxiv.org/abs/2410.03738
Authors: Fillipe dos Santos Silva, Gabriel Kenzo Kakimoto, Julio Cesar dos Reis, Marcelo S. Reis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures, published in BRACIS 2024 conference

[AI-209] Meta Reinforcement Learning Approach for Adaptive Resource Optimization in O-RAN

Link: https://arxiv.org/abs/2410.03737
Authors: Fatemeh Lotfi, Fatemeh Afghah
Keywords: RAN Intelligent Controller, Open Radio Access, smart RAN Intelligent, Radio Access Network, Intelligent Controller
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)

Abstract:As wireless networks grow to support more complex applications, the Open Radio Access Network (O-RAN) architecture, with its smart RAN Intelligent Controller (RIC) modules, becomes a crucial solution for real-time network data collection, analysis, and dynamic management of network resources including radio resource blocks and downlink power allocation. Utilizing artificial intelligence (AI) and machine learning (ML), O-RAN addresses the variable demands of modern networks with unprecedented efficiency and adaptability. Despite progress in using ML-based strategies for network optimization, challenges remain, particularly in the dynamic allocation of resources in unpredictable environments. This paper proposes a novel Meta Deep Reinforcement Learning (Meta-DRL) strategy, inspired by Model-Agnostic Meta-Learning (MAML), to advance resource block and downlink power allocation in O-RAN. Our approach leverages O-RAN’s disaggregated architecture with virtual distributed units (DUs) and meta-DRL strategies, enabling adaptive and localized decision-making that significantly enhances network efficiency. By integrating meta-learning, our system quickly adapts to new network conditions, optimizing resource allocation in real-time. This results in a 19.8% improvement in network management performance over traditional methods, advancing the capabilities of next-generation wireless networks.

[AI-210] CliMB: An AI-enabled Partner for Clinical Predictive Modeling

Link: https://arxiv.org/abs/2410.03736
Authors: Evgeny Saveliev, Tim Schubert, Thomas Pouplin, Vasilis Kosmoliaptsis, Mihaela van der Schaar
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: * Evgeny Saveliev and Tim Schubert contributed equally to this work

[AI-211] Evaluating the Effects of AI Directors for Quest Selection

Link: https://arxiv.org/abs/2410.03733
Authors: Kristen K. Yu, Matthew Guzdial, Nathan Sturtevant
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

[AI-212] Unsupervised Human Preference Learning EMNLP2024

Link: https://arxiv.org/abs/2410.03731
Authors: Sumuk Shashidhar, Abhinav Chinta, Vaibhav Sahai, Dilek Hakkani Tur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Main Conference

[AI-213] Progress Report: Towards European LLMs

Link: https://arxiv.org/abs/2410.03730
Authors: Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolo Brandizzi, Qasid Saleem, Bhowmick Anirban, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Shima Asaadi, Fabio Barth, Rafet Sifa, Fabian Küch, René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-214] Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis

Link: https://arxiv.org/abs/2410.03728
Authors: Barak Gahtan, Robert J. Sahala, Alex M. Bronstein, Reuven Cohen
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The dataset and the supplementary material can be provided upon request

[AI-215] FaithEval: Can Your Language Model Stay Faithful to Context Even If “The Moon is Made of Marshmallows”

Link: https://arxiv.org/abs/2410.03727
Authors: Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-216] Large Language Models Overcome the Machine Penalty When Acting Fairly but Not When Acting Selfishly or Altruistically

Link: https://arxiv.org/abs/2410.03724
Authors: Zhen Wang(1), Ruiqi Song(1), Chen Shen(2), Shiya Yin(1), Zhao Song(3), Balaraju Battu(4), Lei Shi(5), Danyang Jia(1), Talal Rahwan(4), Shuyue Hu(6) ((1) School of Cybersecurity, and School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, China, (2) Faculty of Engineering Sciences, Kyushu University, Japan, (3) School of Computing, Engineering and Digital Technologies, Teesside University, United Kingdom, (4) Computer Science, Science Division, New York University Abu Dhabi, UAE, (5) School of Statistics and Mathematics, Yunnan University of Finance and Economics, China, (6) Shanghai Artificial Intelligence Laboratory, China)
Keywords:
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
Comments:

[AI-217] Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation

Link: https://arxiv.org/abs/2410.03723
Authors: Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 2 figures

[AI-218] Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Link: https://arxiv.org/abs/2410.03721
Authors: Andrew Katz, Gabriella Coloyan Fleming, Joyce Main
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

[AI-219] Revisiting the Superficial Alignment Hypothesis

Link: https://arxiv.org/abs/2410.03717
Authors: Mohit Raghavendra, Vaskar Nath, Sean Hendryx
Keywords: Superficial Alignment Hypothesis, Alignment Hypothesis posits, style and format, Superficial Alignment, Alignment Hypothesis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The Superficial Alignment Hypothesis posits that almost all of a language model’s abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model’s ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
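
The power-law relationship described in this abstract can be illustrated with a small log-log fit. This is only a sketch under stated assumptions: the function name `fit_power_law`, the synthetic data, and the choice to model a benchmark error rate as `a * n**b` are hypothetical illustrations and are not taken from the paper.

```python
import math


def fit_power_law(ns, values):
    """Fit values ~ a * n**b by ordinary least squares in log-log space.

    Returns the pair (a, b). All inputs must be strictly positive.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic (made-up) error rates for increasing numbers of finetuning examples.
num_examples = [100, 1_000, 10_000, 100_000]
error_rates = [0.5 * n ** -0.2 for n in num_examples]

a, b = fit_power_law(num_examples, error_rates)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

A straight line in log-log space is the usual visual signature of such a relationship; with real benchmark scores the fit would be noisy and the exponent would differ by task.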

[AI-220] Topological Foundations of Reinforcement Learning

Link: https://arxiv.org/abs/2410.03706
Authors: David Krame Kadurha
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
Comments: Supervisor: Yae Ulrich Gaba, Mentor: Domini Jocema Leko

[AI-221] Combining Open-box Simulation and Importance Sampling for Tuning Large-Scale Recommenders RECSYS’24

Link: https://arxiv.org/abs/2410.03697
Authors: Kaushal Paneri, Michael Munje, Kailash Singh Maurya, Adith Swaminathan, Yifan Shi
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

[AI-222] Improving Emotion Recognition Accuracy with Personalized Clustering

Link: https://arxiv.org/abs/2410.03696
Authors: Laura Gutierrez-Martin(1), Celia Lopez Ongil(1 and 2), Jose M. Lanza-Gutierrez(3), Jose A. Miranda Calero(4) ((1) Department of Electronics, Universidad Carlos III de Madrid, Spain, (2) Gender Studies Institute, Universidad Carlos III de Madrid, Spain, (3) Department of Computer Science, Universidad de Alcala, Spain, (4) Embedded Systems Laboratory, Ecole Polytechnique Federale de Lausanne, Switzerland)
Keywords:
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 11 pages, 2 figures

[AI-223] LLM Agents as 6G Orchestrator: A Paradigm for Task-Oriented Physical-Layer Automation

Link: https://arxiv.org/abs/2410.03688
Authors: Zhuoran Xiao, Chenhui Ye, Yunbo Hu, Honggang Yuan, Yihang Huang, Yijia Feng, Liyu Cai, Jiang Chang
Keywords:
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

[AI-224] AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation

Link: https://arxiv.org/abs/2409.20398
Authors: Boyu Han, Qianqian Xu, Zhiyong Yang, Shilong Bao, Peisong Wen, Yangbangyan Jiang, Qingming Huang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[AI-225] Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering

Link: https://arxiv.org/abs/2210.02627
Authors: Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, Suranga Nanayakkara
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This paper is awaiting publication at Transactions of the Association for Computational Linguistics. This is a pre-MIT Press publication version. For associated huggingface transformers code, see this https URL

[AI-226] Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering

Link: https://arxiv.org/abs/2106.11517
Authors: Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Suranga Nanayakkara
Keywords:
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: For associated code, see this https URL

[AI-227] Regression Conformal Prediction under Bias

Link: https://arxiv.org/abs/2410.05263
Authors: Matt Y. Cheung, Tucker J. Netherton, Laurence E. Court, Ashok Veeraraghavan, Guha Balakrishnan
Keywords:
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 17 pages, 6 figures, code available at: this https URL

[AI-228] AlphaRouter: Quantum Circuit Routing with Reinforcement Learning and Tree Search

Link: https://arxiv.org/abs/2410.05115
Authors: Wei Tang, Yiheng Duan, Yaroslav Kharkov, Rasool Fakoor, Eric Kessler, Yunong Shi
Keywords:
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 11 pages, 11 figures, International Conference on Quantum Computing and Engineering - QCE24

[AI-229] Transition of alpha-mixing in Random Iterations with Applications in Queuing Theory

Link: https://arxiv.org/abs/2410.05056
Authors: Attila Lovas
Keywords:
Categories: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Probability (math.PR)
Comments: 33 pages, 1 figure

[AI-230] A Review of Artificial Intelligence based Biological-Tree Construction: Priorities Methods Applications and Trends

Link: https://arxiv.org/abs/2410.04815
Authors: Zelin Zang, Yongjie Xu, Chenrui Duan, Jinlin Wu, Stan Z. Li, Zhen Lei
Keywords:
Categories: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI)
Comments: 83 pages, 15 figures

[AI-231] Molecular topological deep learning for polymer property prediction

Link: https://arxiv.org/abs/2410.04765
Authors: Cong Shen, Yipeng Zhang, Fei Han, Kelin Xia
Keywords:
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[AI-232] Multi-Tiered Self-Contrastive Learning for Medical Microwave Radiometry (MWR) Breast Cancer Detection

Link: https://arxiv.org/abs/2410.04636
Authors: Christoforos Galazis, Huiyi Wu, Igor Goryanin
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[AI-233] RespDiff: An End-to-End Multi-scale RNN Diffusion Model for Respiratory Waveform Estimation from PPG Signals

Link: https://arxiv.org/abs/2410.04366
Authors: Yuyang Miao, Zehua Chen, Chang Li, Danilo Mandic
Keywords:
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

[AI-234] Pareto Control Barrier Function for Inner Safe Set Maximization Under Input Constraints

Link: https://arxiv.org/abs/2410.04260
Authors: Xiaoyang Cao, Zhe Fu, Alexandre M. Bayen
Keywords:
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Submitted to ACC 2025

[AI-235] IceCloudNet: 3D reconstruction of cloud ice from Meteosat SEVIRI

Link: https://arxiv.org/abs/2410.04135
Authors: Kai Jeggle, Mikolaj Czerkawski, Federico Serva, Bertrand Le Saux, David Neubauer, Ulrike Lohmann
Keywords:
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was submitted to Artificial Intelligence for the Earth Systems

[AI-236] Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport

Link: https://arxiv.org/abs/2410.03974
Authors: Milena Gazdieva, Jaemoo Choi, Alexander Kolesov, Jaewoong Choi, Petr Mokrov, Alexander Korotin
Keywords:
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures

[AI-237] Leveraging Fundamental Analysis for Stock Trend Prediction for Profit

Link: https://arxiv.org/abs/2410.03913
Authors: John Phan, Hung-Fu Chang
Keywords:
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages

[AI-238] Example-Based Framework for Perceptually Guided Audio Texture Generation

Link: https://arxiv.org/abs/2308.11859
Authors: Purnima Kamath, Chitralekha Gupta, Lonce Wyse, Suranga Nanayakkara
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted for publication at IEEE Transactions on Audio, Speech and Language Processing

[AI-239] Towards Controllable Audio Texture Morphing ICASSP2023

Link: https://arxiv.org/abs/2304.11648
Authors: Chitralekha Gupta, Purnima Kamath, Yize Wei, Zhuoyao Li, Suranga Nanayakkara, Lonce Wyse
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted to ICASSP 2023

[AI-240] Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition INTERSPEECH2020

Link: https://arxiv.org/abs/2008.06682
Authors: Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to INTERSPEECH 2020

Computer Vision

[CV-0] Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Link: https://arxiv.org/abs/2410.05270
Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
Keywords: contrastively pretrained vision-language, pretrained vision-language model, vision-language model, learning external feature, Radford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, under review

Abstract:We consider the problem of adapting a contrastively pretrained vision-language model like CLIP (Radford et al., 2021) for few-shot classification. The existing literature addresses this problem by learning a linear classifier of the frozen visual features, optimizing word embeddings, or learning external feature adapters. This paper introduces an alternative way for CLIP adaptation without adding ‘external’ parameters to optimize. We find that simply fine-tuning the last projection matrix of the vision encoder leads to strong performance compared to the existing baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices adds reliability for adapting CLIP through this layer. Perhaps surprisingly, this approach, coined ProLIP, yields performances on par or better than state of the art on 11 few-shot classification benchmarks, few-shot domain generalization, cross-dataset transfer and test-time adaptation. Code will be made available at this https URL .
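
The regularization idea in this abstract (penalizing the distance between the fine-tuned and pretrained projection matrices) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the squared Frobenius distance, the weight `lam`, and the function names `frobenius_sq`, `regularized_loss`, and `penalty_grad` are all assumptions made for the example.

```python
def frobenius_sq(W, W0):
    """Squared Frobenius distance between two equally shaped matrices (lists of rows)."""
    return sum(
        (w - w0) ** 2
        for row, row0 in zip(W, W0)
        for w, w0 in zip(row, row0)
    )


def regularized_loss(task_loss, W, W0, lam):
    """Task loss plus lam * ||W - W0||_F^2 (assumed form of the penalty)."""
    return task_loss + lam * frobenius_sq(W, W0)


def penalty_grad(W, W0, lam):
    """Gradient of the penalty term with respect to W: 2 * lam * (W - W0)."""
    return [
        [2.0 * lam * (w - w0) for w, w0 in zip(row, row0)]
        for row, row0 in zip(W, W0)
    ]


# Hypothetical 2x2 projection matrices: pretrained (W0) and fine-tuned (W).
W0 = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.5, 0.0], [0.0, 0.5]]

loss = regularized_loss(task_loss=2.0, W=W, W0=W0, lam=0.1)
grad = penalty_grad(W, W0, lam=0.1)
print(loss, grad)
```

The penalty gradient always points away from the pretrained weights, so a gradient-descent step on it pulls `W` back toward `W0`; a larger `lam` keeps the fine-tuned matrix closer to its initialization.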

[CV-1] Grounding Partially-Defined Events in Multimodal Data EMNLP

Link: https://arxiv.org/abs/2410.05267
Authors: Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, Benjamin Van Durme
Keywords: learn about complex, short snippets, complex current events, complex current, events
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; 9 pages; 2024 EMNLP Findings

Abstract:How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrates promise in event-centric video-language systems.

[CV-2] Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers

Link: https://arxiv.org/abs/2410.05266
Authors: Andrew F. Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Dewan, Margaret M. Henderson, Leila Wehbe, Michael J. Tarr
Keywords: large-scale artificial neural, artificial neural networks, visual, large-scale artificial, networks have facilitated
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Advances in large-scale artificial neural networks have facilitated novel insights into the functional topology of the brain. Here, we leverage this approach to study how semantic categories are organized in the human visual cortex. To overcome the challenge presented by the co-occurrence of multiple categories in natural images, we introduce BrainSAIL (Semantic Attribution and Image Localization), a method for isolating specific neurally-activating visual concepts in images. BrainSAIL exploits semantically consistent, dense spatial features from pre-trained vision models, building upon their demonstrated ability to robustly predict neural activity. This method derives clean, spatially dense embeddings without requiring any additional training, and employs a novel denoising process that leverages the semantic consistency of images under random augmentations. By unifying the space of whole-image embeddings and dense visual features and then applying voxel-wise encoding models to these features, we enable the identification of specific subregions of each image which drive selectivity patterns in different areas of the higher visual cortex. We validate BrainSAIL on cortical regions with known category selectivity, demonstrating its ability to accurately localize and disentangle selectivity to diverse visual concepts. Next, we demonstrate BrainSAIL’s ability to characterize high-level visual selectivity to scene properties and low-level visual features such as depth, luminance, and saturation, providing insights into the encoding of complex visual information. Finally, we use BrainSAIL to directly compare the feature selectivity of different brain encoding models across different regions of interest in visual cortex. Our innovative method paves the way for significant advances in mapping and decomposing high-level visual representations in the human brain.

[CV-3] TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

Link: https://arxiv.org/abs/2410.05261
Authors: Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu
Keywords: Reading dense text, Large Vision-Language Models, Reading dense, abilities for Large, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) tasked with advanced jobs. Previous LVLMs, including superior proprietary models like GPT-4o, have struggled to excel in both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 significantly reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual Encoder Reinforcement: We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks like Chinese OCR and grounding. (3) Data Diversity: We maintain a comparable scale of 100 million samples while diversifying the sources of pre-training data. We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, such as achieving 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.

[CV-4] DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

Link: https://arxiv.org/abs/2410.05260
Authors: Kaifeng Zhao, Gen Li, Siyu Tang
Keywords: motion, increasingly popular, user interaction, Text-conditioned human motion, motion primitive
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model, DART, effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model’s versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: this https URL.

[CV-5] GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting

Link: https://arxiv.org/abs/2410.05259
Authors: Yukang Cao, Masoud Hadi, Liang Pan, Ziwei Liu
Keywords: VTON, demonstrated strong performance, recently demonstrated strong, virtual try-on, techniques have recently
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 11 figures

Abstract:Diffusion-based 2D virtual try-on (VTON) techniques have recently demonstrated strong performance, while the development of 3D VTON has largely lagged behind. Despite recent advances in text-guided 3D scene editing, integrating 2D VTON into these pipelines to achieve vivid 3D VTON remains challenging. The reasons are twofold. First, text prompts cannot provide sufficient details in describing clothing. Second, 2D VTON results generated from different viewpoints of the same 3D scene lack coherence and spatial relationships, hence frequently leading to appearance inconsistencies and geometric distortions. To resolve these problems, we introduce an image-prompted 3D VTON method (dubbed GS-VTON) which, by leveraging 3D Gaussian Splatting (3DGS) as the 3D representation, enables the transfer of pre-trained knowledge from 2D VTON models to 3D while improving cross-view consistency. (1) Specifically, we propose a personalized diffusion model that utilizes low-rank adaptation (LoRA) fine-tuning to incorporate personalized information into pre-trained 2D VTON models. To achieve effective LoRA training, we introduce a reference-driven image editing approach that enables the simultaneous editing of multi-view images while ensuring consistency. (2) Furthermore, we propose a persona-aware 3DGS editing framework to facilitate effective editing while maintaining consistent cross-view appearance and high-quality 3D geometry. (3) Additionally, we have established a new 3D VTON benchmark, 3D-VTONBench, which facilitates comprehensive qualitative and quantitative 3D VTON evaluations. Through extensive experiments and comparative analyses with existing methods, the proposed GS-VTON has demonstrated superior fidelity and advanced editing capabilities, affirming its effectiveness for 3D VTON.

[CV-6] SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

Link: https://arxiv.org/abs/2410.05255
Authors: Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
Keywords: fine-tune diffusion models, Reinforcement learning, paired human-annotated data, visual generation, human feedback
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace “losing images” in preference pairs. This approach allows us to optimize using only off-policy “winning images.” Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released in this https URL.

[CV-7] LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

Link: https://arxiv.org/abs/2410.05249
Authors: Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha
Keywords: Understanding long text, long text understanding, short text understanding, understanding short text, language-image pre-training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding long text is of great demands in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason causing such an issue is that the training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. Towards this problem, our initial attempt is to relabel the data with long captions, however, directly learning with which may lead to performance degradation in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short text understanding yet greatly enhance its capability of long text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between the performance and the efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset, which consists of 100M long caption oriented text-image pairs. It is noteworthy that, on the task of long-text image retrieval, we beat the competitor using long captions with 11.1% improvement (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate the reproducibility and further research. The project page is available at this https URL.

[CV-8] Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Link: https://arxiv.org/abs/2410.05243
Authors: Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
Keywords: Multimodal large language, graphical user interface, Multimodal large, GUI agents, GUI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly utilize text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, despite the fact that existing agents use additional text-based input while ours only uses visual perception. These results provide strong support for the feasibility and promises of GUI agents that navigate the digital world as humans do.

[CV-9] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models ACCV2024

Link: https://arxiv.org/abs/2410.05239
Authors: Rabin Adhikari, Safal Thapaliya, Manish Dhakal, Bishesh Khanal
Keywords: requires expensive fine-tuning, Prompt tuning, Vision-Language Segmentation Models, shown impressive performance, Prompt
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at ACCV 2024 (oral presentation)

Abstract:Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, to integrate various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes 6 prompt tuning strategies on various prompt depths used in 2 VLSMs totaling of 8 different combinations. We test various prompt tuning on 8 diverse medical datasets, including 3 radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and 5 non-radiology datasets (polyp, ulcer, skin cancer), and two natural domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts, from natural-domain images to medical data. Furthermore, visual prompt tuning, with fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive to multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt-tuning techniques for robust domain-specific segmentation. The source code is available at this https URL.

[CV-10] DiffuseReg: Denoising Diffusion Model for Obtaining Deformation Fields in Unsupervised Deformable Image Registration MICCAI2024

Link: https://arxiv.org/abs/2410.05234
Authors: Yongtai Zhuo, Yiqing Shen
Keywords: precisely align medical, Deformable image registration, Deformable image, align medical images, aims to precisely
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024, W-AM-067, this https URL

Abstract: Deformable image registration aims to precisely align medical images from different modalities or times. Traditional deep learning methods, while effective, often lack interpretability, real-time observability and adjustment capacity during registration inference. Denoising diffusion models present an alternative by reformulating registration as iterative image denoising. However, existing diffusion registration approaches do not fully harness the capabilities of diffusion models, neglecting the critical sampling phase that enables continuous observability during inference. Hence, we introduce DiffuseReg, an innovative diffusion-based method that denoises deformation fields instead of images for improved transparency. We also propose a novel denoising network built upon Swin Transformer, which better integrates moving and fixed images with the diffusion time step throughout the denoising process. Furthermore, we enhance control over the denoising registration process with a novel similarity consistency regularization. Experiments on ACDC datasets demonstrate DiffuseReg outperforms existing diffusion registration methods by 1.32 in Dice score. The sampling process in DiffuseReg enables real-time output observability and adjustment unmatched by previous deep models.
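
For readers unfamiliar with the metric quoted above, the Dice score measures the overlap between two segmentation masks as twice the intersection divided by the sum of the mask sizes. The following is a minimal sketch of that standard definition, not code from the paper; the function name, the flat 0/1 mask representation, and the convention for two empty masks are assumptions made for illustration. Since Dice lies in [0, 1], the reported gain of 1.32 is presumably on a 0 to 100 scale.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical masks: a warped moving-image label and the fixed-image label.
warped_mask = [0, 1, 1, 1, 0, 0]
fixed_mask = [0, 0, 1, 1, 1, 0]
print(dice_score(warped_mask, fixed_mask))
```

In registration benchmarks the score is typically computed per anatomical structure after warping the moving image's labels with the predicted deformation field, then averaged.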

[CV-11] SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning

Link: https://arxiv.org/abs/2410.05233
Authors: Taha Bouhsine, Imad El Aaroussi, Atik Faysal, Wang Huaxia
Keywords: anchor-free contrastive learning, proposed Similarity-Orthogonality, fine-grained contrastive learning, contrastive learning, anchor-free contrastive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce a novel anchor-free contrastive learning (AFCL) method leveraging our proposed Similarity-Orthogonality (SimO) loss. Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives: reducing the distance and orthogonality between embeddings of similar inputs while maximizing these metrics for dissimilar inputs, facilitating more fine-grained contrastive learning. The AFCL method, powered by SimO loss, creates a fiber bundle topological structure in the embedding space, forming class-specific, internally cohesive yet orthogonal neighborhoods. We validate the efficacy of our method on the CIFAR-10 dataset, providing visualizations that demonstrate the impact of SimO loss on the embedding space. Our results illustrate the formation of distinct, orthogonal class neighborhoods, showcasing the method’s ability to create well-structured embeddings that balance class separation with intra-class variability. This work opens new avenues for understanding and leveraging the geometric properties of learned representations in various machine learning tasks.

[CV-12] The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

Link: https://arxiv.org/abs/2410.05227
Authors: Ailing Zeng, Yuhang Yang, Weidong Chen, Wei Liu
Keywords: High-quality video generation, holds considerable significance, High-quality video, world simulation, holds considerable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project: this https URL

Abstract: High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance both in content creation, where it helps anyone express their inherent creativity in new ways, and in world simulation, where it supports modeling and understanding the world. Models like SORA have advanced video generation with higher resolution, more natural motion, better vision-language alignment, and increased controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter-rich DiT models, along with large-scale data expansion and refined training strategies. However, despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking. Furthermore, the rapid development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements. Additionally, evaluation metrics often fail to align with human preferences.

[CV-13] Precise Model Benchmarking with Only a Few Observations EMNLP2024

Link: https://arxiv.org/abs/2410.05222
Authors: Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, Mathew Monfort
Keywords: larger question-answering dataset, large language model, model accuracy, larger question-answering, accuracy on questions
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: To appear at EMNLP 2024

Abstract:How can we precisely estimate a large language model’s (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model’s accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model’s accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches, achieving substantial reductions in the mean squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower compared to those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
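
The idea of balancing a direct and a regression estimate per subgroup can be illustrated with a classical shrinkage rule. The sketch below is only an assumption about the general form (a Fay-Herriot style weighting); the abstract does not give the paper's exact estimator, and the function name, the fixed `prior_var`, and the example numbers are hypothetical.

```python
def eb_shrinkage(direct, regression, sampling_var, prior_var):
    """Blend a subgroup's direct estimate with its regression estimate.

    sampling_var: variance of the direct estimate (large for small subgroups).
    prior_var: assumed variance of true subgroup accuracy around the regression fit.
    """
    if sampling_var < 0 or prior_var < 0:
        raise ValueError("variances must be non-negative")
    if sampling_var + prior_var == 0:
        return direct
    # Weight on the direct estimate grows as its sampling variance shrinks.
    weight = prior_var / (prior_var + sampling_var)
    return weight * direct + (1.0 - weight) * regression


def binomial_var(accuracy, n):
    """Sampling variance of an accuracy averaged over n independent questions."""
    return accuracy * (1.0 - accuracy) / n


# Hypothetical topic with observed accuracy 0.8 and regression prediction 0.6.
small = eb_shrinkage(0.8, 0.6, binomial_var(0.8, 5), prior_var=0.01)
large = eb_shrinkage(0.8, 0.6, binomial_var(0.8, 500), prior_var=0.01)
print(round(small, 4), round(large, 4))
```

With only 5 questions the estimate is pulled most of the way toward the regression prediction, while with 500 questions it stays close to the observed accuracy, which is the per-subgroup balancing the abstract describes.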

[CV-14] Organizing Unstructured Image Collections using Natural Language

Link: https://arxiv.org/abs/2410.05217
Authors: Mingxuan Liu, Zhun Zhong, Jun Li, Gianni Franchi, Subhankar Roy, Elisa Ricci
Keywords: unstructured visual data, Organizing unstructured visual, Semantic Multiple Clustering, computer vision, multiple clustering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project webpage: this https URL

Abstract:Organizing unstructured visual data into semantic clusters is a key challenge in computer vision. Traditional deep clustering (DC) approaches focus on a single partition of data, while multiple clustering (MC) methods address this limitation by uncovering distinct clustering solutions. The rise of large language models (LLMs) and multimodal LLMs (MLLMs) has enhanced MC by allowing users to define clustering criteria in natural language. However, manually specifying criteria for large datasets is impractical. In this work, we introduce the task Semantic Multiple Clustering (SMC) that aims to automatically discover clustering criteria from large image collections, uncovering interpretable substructures without requiring human input. Our framework, Text Driven Semantic Multiple Clustering (TeDeSC), uses text as a proxy to concurrently reason over large image collections, discover partitioning criteria, expressed in natural language, and reveal semantic substructures. To evaluate TeDeSC, we introduce the COCO-4c and Food-4c benchmarks, each containing four grouping criteria and ground-truth annotations. We apply TeDeSC to various applications, such as discovering biases and analyzing social media image popularity, demonstrating its utility as a tool for automatically organizing image collections and revealing novel insights.

[CV-15] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality EMNLP2024

Link: https://arxiv.org/abs/2410.05210
Authors: Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim
Keywords: enhance compositional understanding, method to enhance, understanding in pre-trained, pre-trained vision, vision and language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2024 (Long, Main). Project page: this https URL

Abstract:In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model’s multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model’s representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: this https URL.

[CV-16] Studying and Mitigating Biases in Sign Language Understanding Models

Link: https://arxiv.org/abs/2410.05206
Authors: Katherine Atwell, Danielle Bragg, Malihe Alikhani
Keywords: ASL Citizen dataset, ASL Citizen, sign language technologies, Citizen dataset, members is crucial
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Ensuring that the benefits of sign language technologies are distributed equitably among all community members is crucial. Thus, it is important to address potential biases and inequities that may arise from the design or use of these resources. Crowd-sourced sign language datasets, such as the ASL Citizen dataset, are great resources for improving accessibility and preserving linguistic diversity, but they must be used thoughtfully to avoid reinforcing existing biases. In this work, we utilize the rich information about participant demographics and lexical features present in the ASL Citizen dataset to study and document the biases that may result from models trained on crowd-sourced sign datasets. Further, we apply several bias mitigation techniques during model training, and find that these techniques reduce performance disparities without decreasing accuracy. With the publication of this work, we release the demographic information about the participants in the ASL Citizen dataset to encourage future bias mitigation work in this space.

[CV-17] Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Link: https://arxiv.org/abs/2410.05203
Authors: Ge Ya (Olga) Luo, Gian Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
Keywords: Fréchet Video Distance, generation distribution quality, Fréchet Video, evaluating video generation, video generation distribution
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD’s reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
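
To make the distance in JEDi concrete, the sketch below computes a squared Maximum Mean Discrepancy with a polynomial kernel between two sets of feature vectors. It is a generic illustration, not the authors' implementation: the kernel hyperparameters (`degree=3`, `coef0=1`, `gamma = 1/d`), the use of the biased estimator, and the toy feature vectors are all assumptions, and in JEDi the features would come from a JEPA video encoder rather than being hand-written.

```python
def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree with gamma = 1 / dim."""
    gamma = 1.0 / len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, kernel=poly_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))


# Toy 2-D "features" standing in for encoder embeddings of real and generated videos.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
generated_near = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]
generated_far = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]
print(mmd2(real, generated_near), mmd2(real, generated_far))
```

Unlike the Fréchet distance, this estimate does not assume the features are Gaussian, which matches the first limitation of FVD listed in the abstract.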

[CV-18] MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain ECCV2024

Link: https://arxiv.org/abs/2410.05182
Authors: Timothy Chase Jr, Karthik Dantu
Keywords: celestial objects, tracking of surface, surface terrain, terrain is required, required for spacecraft
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ECCV 2024. Project page available at this https URL

Abstract:The visual detection and tracking of surface terrain is required for spacecraft to safely land on or navigate within close proximity to celestial objects. Current approaches rely on template matching with pre-gathered patch-based features, which are expensive to obtain and a limiting factor in perceptual capability. While recent literature has focused on in-situ detection methods to enhance navigation and operational autonomy, robust description is still needed. In this work, we explore metric learning as the lightweight feature description mechanism and find that current solutions fail to address inter-class similarity and multi-view observational geometry. We attribute this to the view-unaware attention mechanism and introduce Multi-view Attention Regularizations (MARs) to constrain the channel and spatial attention across multiple feature views, regularizing the what and where of attention focus. We thoroughly analyze many modern metric learning losses with and without MARs and demonstrate improved terrain-feature recognition performance by upwards of 85%. We additionally introduce the Luna-1 dataset, consisting of Moon crater landmarks and reference navigation frames from NASA mission data to support future research in this difficult task. Luna-1 and source code are publicly available at this https URL.

[CV-19] VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Link: https://arxiv.org/abs/2410.05160
Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
Keywords: Embedding models, multimodal embedding models, multimodal embedding, semantic similarity, Embedding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report

Abstract: Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model - Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on Phi-3.5-V and evaluate them on MMEB’s evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.
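
The abstract calls VLM2Vec a contrastive training framework without stating the loss. A common choice for such frameworks is InfoNCE with in-batch negatives, sketched below purely as an assumption; the function names, the temperature value, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i], the rest are negatives."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings standing in for (instruction + query) and candidate vectors.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(query_vecs, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(query_vecs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

The loss is near zero when each query is closest to its own target and large when the pairs are mismatched, which is the signal that pulls matching query and candidate embeddings together.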

[CV-20] MIBench: A Comprehensive Benchmark for Model Inversion Attack and Defense

Link: https://arxiv.org/abs/2410.05159
Authors: Yixiang Qiu, Hongyao Yu, Hao Fang, Wenbo Yu, Bin Chen, Xuan Wang, Shu-Tao Xia, Ke Xu
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, privacy-sensitive training data, raising widespread concerns
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 23 pages

Abstract: Model Inversion (MI) attacks aim at leveraging the output information of target models to reconstruct privacy-sensitive training data, raising widespread concerns on privacy threats of Deep Neural Networks (DNNs). Unfortunately, in tandem with the rapid evolution of MI attacks, the lack of a comprehensive, aligned, and reliable benchmark has emerged as a formidable challenge. This deficiency leads to inadequate comparisons between different attack methods and inconsistent experimental setups. In this paper, we introduce the first practical benchmark for model inversion attacks and defenses to address this critical gap, which is named MIBench. This benchmark serves as an extensible and reproducible modular-based toolbox and currently integrates a total of 16 state-of-the-art attack and defense methods. Moreover, we furnish a suite of assessment tools encompassing 9 commonly used evaluation protocols to facilitate standardized and fair evaluation and analysis. Capitalizing on this foundation, we conduct extensive experiments from multiple perspectives to holistically compare and analyze the performance of various methods across different scenarios, which overcomes the misalignment issues and discrepancy prevalent in previous works. Based on the collected attack methods and defense strategies, we analyze the impact of target resolution, defense robustness, model predictive power, model architectures, transferability and loss function. Our hope is that MIBench could provide a unified, practical and extensible toolbox and is widely utilized by researchers in the field to rigorously test and compare their novel methods, ensuring equitable evaluations and thereby propelling further advancements in the future development.

[CV-21] Leveraging Multimodal Diffusion Models to Accelerate Imaging with Side Information

Link: https://arxiv.org/abs/2410.05143
Authors: Timofey Efimov, Harry Dong, Megna Shah, Jeff Simmons, Sean Donegan, Yuejie Chi
Keywords: domains remains limited, found phenomenal success, structured scientific domains, scientific domains remains, solving inverse problems
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Diffusion models have found phenomenal success as expressive priors for solving inverse problems, but their extension beyond natural images to more structured scientific domains remains limited. Motivated by applications in materials science, we aim to reduce the number of measurements required from an expensive imaging modality of interest, by leveraging side information from an auxiliary modality that is much cheaper to obtain. To deal with the non-differentiable and black-box nature of the forward model, we propose a framework to train a multimodal diffusion model over the joint modalities, turning inverse problems with black-box forward models into simple linear inpainting problems. Numerically, we demonstrate the feasibility of training diffusion models over materials imagery data, and show that our approach achieves superior image reconstruction by leveraging the available side information, requiring significantly less data from the expensive microscopy modality.

[CV-22] Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Link: https://arxiv.org/abs/2410.05116
Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
Keywords: Stable Diffusion, Controllable generation, improve fidelity, aims to improve, human feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD’s refined initialization samples, enabling faster convergence towards the evaluator’s intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.

[CV-23] Synthetic Generation of Dermatoscopic Images with GAN and Closed-Form Factorization

Link: https://arxiv.org/abs/2410.05114
Authors: Rohan Reddy Mekala, Frederik Pahde, Simon Baur, Sneha Chandrashekar, Madeline Diep, Markus Wenzel, Eric L. Wisotzky, Galip Ümit Yolcu, Sebastian Lapuschkin, Jackie Ma, Peter Eisert, Mikael Lindvall, Adam Porter, Wojciech Samek
Keywords: Generative Adversarial Network, high-quality annotated datasets, machine learning models, harnesses Generative Adversarial, microscopic skin lesion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This preprint has been submitted to the Workshop on Synthetic Data for Computer Vision (SyntheticData4CV 2024 is a side event on 18th European Conference on Computer Vision 2024). This preprint has not undergone peer review or any post-submission improvements or corrections

Abstract:In the realm of dermatological diagnoses, where the analysis of dermatoscopic and microscopic skin lesion images is pivotal for the accurate and early detection of various medical conditions, the costs associated with creating diverse and high-quality annotated datasets have hampered the accuracy and generalizability of machine learning models. We propose an innovative unsupervised augmentation solution that harnesses Generative Adversarial Network (GAN) based models and associated techniques over their latent space to generate controlled semiautomatically-discovered semantic variations in dermatoscopic images. We created synthetic images to incorporate the semantic variations and augmented the training data with these images. With this approach, we were able to increase the performance of machine learning models and set a new benchmark amongst non-ensemble based models in skin lesion classification on the HAM10000 dataset; and used the observed analytics and generated models for detailed studies on model explainability, affirming the effectiveness of our solution.

[CV-24] LiDAR-GS:Real-time LiDAR Re-Simulation using Gaussian Splatting

Link: https://arxiv.org/abs/2410.05111
Authors: Qifeng Chen, Sheng Yang, Sicong Du, Tao Tang, Peng Chen, Yuchi Huo
Keywords: LiDAR simulation plays, Neural Radiance Fields, Neural Gaussian Fields, simulation plays, closed-loop simulation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: LiDAR simulation plays a crucial role in closed-loop simulation for autonomous driving. Although recent advancements, such as the use of reconstructed mesh and Neural Radiance Fields (NeRF), have made progress in simulating the physical properties of LiDAR, these methods have struggled to achieve satisfactory frame rates and rendering quality. To address these limitations, we present LiDAR-GS, the first LiDAR Gaussian Splatting method, for real-time high-fidelity re-simulation of LiDAR sensor scans in public urban road scenes. The vanilla Gaussian Splatting, designed for camera models, cannot be directly applied to LiDAR re-simulation. To bridge the gap between passive camera and active LiDAR, our LiDAR-GS designs a differentiable laser beam splatting, grounded in the LiDAR range view model. This innovation allows for precise surface splatting by projecting lasers onto micro cross-sections, effectively eliminating artifacts associated with local affine approximations. Additionally, LiDAR-GS leverages Neural Gaussian Fields, which further integrate view-dependent clues, to represent key LiDAR properties that are influenced by the incident angle and external factors. Combining these practices with some essential adaptations, e.g., dynamic instances decomposition, our approach succeeds in simultaneously re-simulating depth, intensity, and ray-drop channels, achieving state-of-the-art results in both rendering frame rate and quality on publicly available large scene datasets. Our source code will be made publicly available.

[CV-25] MetaDD: Boosting Dataset Distillation with Neural Network Architecture-Invariant Generalization

Link: https://arxiv.org/abs/2410.05103
Authors: Yunlong Zhao, Xiaoheng Deng, Xiu Su, Hongyan Xu, Xiuxing Li, Yijing Liu, Shan You
Keywords: facilitate efficient training, compact distilled dataset, distilled dataset, entails creating, creating a refined
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Dataset distillation (DD) entails creating a refined, compact distilled dataset from a large-scale dataset to facilitate efficient training. A significant challenge in DD is the dependency between the distilled dataset and the neural network (NN) architecture used. A dataset distilled with one specific architecture often yields diminished training performance when it is used to train a different NN architecture. This paper introduces MetaDD, designed to enhance the generalizability of DD across various NN architectures. Specifically, MetaDD partitions distilled data into meta features (i.e., the data’s common characteristics that remain consistent across different NN architectures) and heterogeneous features (i.e., the data’s unique feature to each NN architecture). Then, MetaDD employs an architecture-invariant loss function for multi-architecture feature alignment, which increases meta features and reduces heterogeneous features in distilled data. As a low-memory consumption component, MetaDD can be seamlessly integrated into any DD methodology. Experimental results demonstrate that MetaDD significantly improves performance across various DD methods. On the Distilled Tiny-Imagenet with Sre2L (50 IPC), MetaDD achieves cross-architecture NN accuracy of up to 30.1%, surpassing the second-best method (GLaD) by 1.7%.

[CV-26] IGroupSS-Mamba: Interval Group Spatial-Spectral Mamba for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2410.05100
Authors: Yan He, Bing Tu, Puzhao Jiang, Bo Liu, Jun Li, Antonio Plaza
Keywords: remote sensing fields, garnered substantial attention, State Space Models, Selective State Space, Interval Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract: Hyperspectral image (HSI) classification has garnered substantial attention in remote sensing fields. Recent Mamba architectures built upon the Selective State Space Models (S6) have demonstrated enormous potential in long-range sequence modeling. However, the high dimensionality of hyperspectral data and information redundancy pose challenges to the application of Mamba in HSI classification, leading to suboptimal performance and computational efficiency. In light of this, this paper investigates a lightweight Interval Group Spatial-Spectral Mamba framework (IGroupSS-Mamba) for HSI classification, which allows for multi-directional and multi-scale global spatial-spectral information extraction in a grouping and hierarchical manner. Technically, an Interval Group S6 Mechanism (IGSM) is developed as the core component, which partitions high-dimensional features into multiple non-overlapping groups at intervals, and then integrates a unidirectional S6 for each group with a specific scanning direction to achieve non-redundant sequence modeling. Compared with the conventional approach of applying multi-directional scanning to all bands, this grouping strategy leverages the complementary strengths of different scanning directions while decreasing computational costs. To adequately capture the spatial-spectral contextual information, an Interval Group Spatial-Spectral Block (IGSSB) is introduced, in which two IGSM-based spatial and spectral operators are cascaded to characterize the global spatial-spectral relationship along the spatial and spectral dimensions, respectively. IGroupSS-Mamba is constructed as a hierarchical structure stacked by multiple IGSSB blocks, integrating a pixel aggregation-based downsampling strategy for multiscale spatial-spectral semantic learning from shallow to deep stages. Extensive experiments demonstrate that IGroupSS-Mamba outperforms the state-of-the-art methods.
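
One plausible reading of "non-overlapping groups at intervals" is a strided split of the feature channels, with each group then scanned in a single direction. The sketch below illustrates only that reading; the grouping rule (channel index modulo the number of groups), the direction names, and the function names are assumptions, not details confirmed by the abstract, and the S6 scan itself is omitted.

```python
def interval_groups(channels, num_groups):
    """Split channels into num_groups strided, non-overlapping groups."""
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    # Group g takes channels g, g + num_groups, g + 2 * num_groups, ...
    return [channels[g::num_groups] for g in range(num_groups)]


# Hypothetical scanning directions, one per group.
SCAN_DIRECTIONS = ["left-to-right", "right-to-left", "top-to-bottom", "bottom-to-top"]


def assign_directions(channels, directions=SCAN_DIRECTIONS):
    """Pair each interval group with exactly one scanning direction."""
    groups = interval_groups(channels, len(directions))
    return list(zip(directions, groups))


band_indices = list(range(8))
for direction, group in assign_directions(band_indices):
    print(direction, group)
```

Under this reading each channel is processed by one unidirectional scan instead of all four, which is where the computational saving relative to scanning every band in every direction would come from.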

[CV-27] DreamSat: Towards a General 3D Model for Novel View Synthesis of Space Objects

Link: https://arxiv.org/abs/2410.05097
Authors: Nidhi Mathihalli, Audrey Wei, Giovanni Lavezzi, Peng Mun Siew, Victor Rodriguez-Fernandez, Hodei Urrutxua, Richard Linares
Keywords: Space Domain Awareness, view synthesis, enables to generate, Domain Awareness, Structural Similarity Index
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Presented at the 75th International Astronautical Congress, October 2024, Milan, Italy

Abstract: Novel view synthesis (NVS) makes it possible to generate new images of a scene or convert a set of 2D images into a comprehensive 3D model. In the context of Space Domain Awareness, since space is becoming increasingly congested, NVS can accurately map space objects and debris, improving the safety and efficiency of space operations. Similarly, in Rendezvous and Proximity Operations missions, 3D models can provide details about a target object’s shape, size, and orientation, allowing for better planning and prediction of the target’s behavior. In this work, we explore the generalization abilities of these reconstruction techniques, aiming to avoid the necessity of retraining for each new scene, by presenting a novel approach to 3D spacecraft reconstruction from single-view images, DreamSat, by fine-tuning Zero123 XL, a state-of-the-art single-view reconstruction model, on a high-quality dataset of 190 spacecraft models and integrating it into the DreamGaussian framework. We demonstrate consistent improvements in reconstruction quality across multiple metrics, including Contrastive Language-Image Pretraining (CLIP) score (+0.33%), Peak Signal-to-Noise Ratio (PSNR) (+2.53%), Structural Similarity Index (SSIM) (+2.38%), and Learned Perceptual Image Patch Similarity (LPIPS) (+0.16%) on a test set of 30 previously unseen spacecraft images. Our method addresses the lack of domain-specific 3D reconstruction tools in the space industry by leveraging state-of-the-art diffusion models and 3D Gaussian splatting techniques. This approach maintains the efficiency of the DreamGaussian framework while enhancing the accuracy and detail of spacecraft reconstructions. The code for this work can be accessed on GitHub (this https URL).

[CV-28] Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava

Link: https://arxiv.org/abs/2410.05096
Authors: Mehdi Azarafza, Fatima Idrees, Ali Ehteshami Bejnordi, Charles Steinmetz, Stefan Henkler, Achim Rettberg
Keywords: Traffic Sign Recognition, Sign Recognition, autonomous vehicles, crucial component, component of autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Traffic Sign Recognition (TSR) is a crucial component of autonomous vehicles. While You Only Look Once (YOLO) is a popular real-time object detection algorithm, factors like training data quality and adverse weather conditions (e.g., heavy rain) can lead to detection failures. These failures can be particularly dangerous when objects are visually similar, such as when a 30 km/h sign is mistaken for a higher speed limit sign. This paper proposes a method that combines video analysis and reasoning, prompting a large vision model with human-in-the-loop guidance to improve YOLO's accuracy in detecting road speed limit signs, especially in semi-real-world conditions. It is hypothesized that the guided prompting and reasoning abilities of Video-LLava can enhance YOLO's traffic sign detection capabilities. This hypothesis is supported by an evaluation based on human-annotated accuracy metrics within a dataset of recorded videos from the CARLA car simulator. The results demonstrate that a collaborative approach combining YOLO with Video-LLava and reasoning can effectively address challenging situations such as heavy rain and overcast conditions that hinder YOLO's detection capabilities.

[CV-29] xLSTM-FER: Enhancing Student Expression Recognition with Extended Vision Long Short-Term Memory Network APWEB

Link: https://arxiv.org/abs/2410.05074
Authors: Qionghao Huang, Jili Chen
Keywords: assessing learning experiences, Extended Long Short-Term, emotional states, Student expression recognition, expression recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper, consisting of 10 pages and 3 figures, has been accepted by the AIEDM Workshop at the 8th APWeb-WAIM Joint International Conference on Web and Big Data

Abstract:Student expression recognition has become an essential tool for assessing learning experiences and emotional states. This paper introduces xLSTM-FER, a novel architecture derived from the Extended Long Short-Term Memory (xLSTM), designed to enhance the accuracy and efficiency of student facial expression recognition through advanced sequence processing capabilities. xLSTM-FER processes input images by segmenting them into a series of patches and leveraging a stack of xLSTM blocks to handle these patches. xLSTM-FER can capture subtle changes in real-world students’ facial expressions and improve recognition accuracy by learning spatial-temporal relationships within the sequence. Experiments on CK+, RAF-DB, and FERplus demonstrate the potential of xLSTM-FER in expression recognition tasks, showing better performance compared to state-of-the-art methods on standard datasets. The linear computational and memory complexity of xLSTM-FER makes it particularly suitable for handling high-resolution images. Moreover, the design of xLSTM-FER allows for efficient processing of non-sequential inputs such as images without additional computation.
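
The patch segmentation step mentioned in this abstract can be illustrated with a short sketch. This is a generic, ViT-style patch split written under stated assumptions (non-overlapping square patches, row-major ordering, image size divisible by the patch size); the function name `patchify` is hypothetical and the paper's actual preprocessing may differ.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a row-major sequence of flattened patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate the patch grid axes from the within-patch axes.
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch, ready to feed to a sequence model.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)
```

The resulting array has one row per patch, so a sequence model can consume the image as a sequence of length `rows * cols`.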

[CV-30] Control-oriented Clustering of Visual Latent Representation

Link: https://arxiv.org/abs/2410.05063
Authors: Han Qi(1), Haocheng Yin(1 and 2), Heng Yang(2) ((1) Harvard University, (2) ETH Zürich)
Keywords: visual representation space, visual representation, control pipeline learned, control-oriented visual representation, representation space
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:We initiate a study of the geometry of the visual representation space – the information channel from the vision encoder to the action decoder – in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification, we investigate whether a similar law of clustering emerges in the visual representation space. Since image-based control is a regression task without explicitly defined classes, the central piece of the puzzle lies in determining according to what implicit classes the visual features cluster, if such a law exists. Focusing on image-based planar pushing, we posit the most important role of the visual representation in a control task is to convey a goal to the action decoder. We then classify training samples of expert demonstrations into eight “control-oriented” classes based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output, where one class corresponds to one relative pose orthant (REPO). Across four different instantiations of architecture, we report the prevalent emergence of control-oriented clustering in the visual representation space according to the eight REPOs. Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35% in the low-data regime. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.
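
To make the idea of eight relative pose orthants concrete, here is a minimal sketch. It assumes that the planar relative pose is a triple (dx, dy, dtheta) and that the class is determined by the sign of each component; the abstract does not spell out this encoding, so the function `orthant_class` and its label numbering are illustrative assumptions only.

```python
def orthant_class(relative_pose) -> int:
    """Map a planar relative pose (dx, dy, dtheta) to one of 8 orthant labels (0..7)."""
    dx, dy, dtheta = relative_pose
    # One bit per pose component: 1 if non-negative, 0 if negative (assumed convention).
    bits = (dx >= 0.0, dy >= 0.0, dtheta >= 0.0)
    return int(bits[0]) * 4 + int(bits[1]) * 2 + int(bits[2])
```

Three sign bits give 2^3 = 8 classes, which matches the number of control-oriented classes reported in the abstract.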

[CV-31] Improving Object Detection via Local-global Contrastive Learning BMVC2024

Link: https://arxiv.org/abs/2410.05058
Authors: Danai Triantafyllidou, Sarah Parisot, Ales Leonardis, Steven McDonagh
Keywords: Visual domain gaps, Visual domain, object, impact object detection, gaps often impact
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024 - Project page: this https URL

Abstract:Visual domain gaps often impact object detection performance. Image-to-image translation can mitigate this effect, where contrastive approaches enable learning of the image-to-image mapping under unsupervised regimes. However, existing methods often fail to handle content-rich scenes with multiple object instances, which manifests in unsatisfactory detection performance. Sensitivity to such instance-level content is typically only gained through object annotations, which can be expensive to obtain. Towards addressing this issue, we present a novel image-to-image translation method that specifically targets cross-domain object detection. We formulate our approach as a contrastive learning framework with an inductive prior that optimises the appearance of object instances through spatial attention masks, implicitly delineating the scene into foreground regions associated with the target object instances and background non-object regions. Instead of relying on object annotations to explicitly account for object instances during translation, our approach learns to represent objects by contrasting local-global information. This affords investigation of an under-explored challenge: obtaining performant detection, under domain shifts, without relying on object annotations or detector model fine-tuning. We experiment with multiple cross-domain object detection settings across three challenging benchmarks and report state-of-the-art performance. Project page: this https URL

[CV-32] SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification NEURIPS2024

Link: https://arxiv.org/abs/2410.05057
Authors: Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde
Keywords: supports efficient learning, Data curation, collect and organize, organize samples, supports efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2024, Datasets and Benchmarks Track

Abstract:Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted towards a large-scale, systematic comparison of various curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. In order to generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch (ii) using the data itself to fit a pretrained self-supervised representation. Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at this https URL.

[CV-33] HE-Drive: Human-Like End-to-End Driving with Vision Language Models

Link: https://arxiv.org/abs/2410.05051
Authors: Junming Wang, Xingyu Zhang, Zebin Xing, Songen Gu, Xiaoyang Guo, Yang Hu, Ziying Song, Qian Zhang, Xiaoxiao Long, Wei Yin
Keywords: autonomous driving system, Diffusion Probabilistic Models, Denoising Diffusion Probabilistic, Conditional Denoising Diffusion, temporally consistent
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:In this paper, we propose HE-Drive, the first human-like-centric end-to-end autonomous driving system that generates trajectories that are both temporally consistent and comfortable. Recent studies have shown that imitation learning-based planners and learning-based trajectory scorers can effectively generate and select accurate trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers tend to produce temporally inconsistent and uncomfortable trajectories. To solve these problems, HE-Drive first extracts key 3D spatial representations through sparse perception, which then serve as conditional inputs for a motion planner based on Conditional Denoising Diffusion Probabilistic Models (DDPMs) that generates temporally consistent multi-modal trajectories. A trajectory scorer guided by Vision-Language Models (VLMs) subsequently selects the most comfortable trajectory from these candidates to control the vehicle, ensuring human-like end-to-end driving. Experiments show that HE-Drive not only achieves state-of-the-art performance (i.e., reduces the average collision rate by 71% compared to VAD) and efficiency (i.e., 1.9X faster than SparseDrive) on the challenging nuScenes and OpenScene datasets but also provides the most comfortable driving experience on real-world data. For more information, visit the project website: this https URL.

[CV-34] PhotoReg: Photometrically Registering 3D Gaussian Splatting Models

Link: https://arxiv.org/abs/2410.05044
Authors: Ziwen Yuan, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Keywords: Building accurate representations, Building accurate, decisions during deployment, accurate representations, make decisions
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Building accurate representations of the environment is critical for intelligent robots to make decisions during deployment. Advances in photorealistic environment models have enabled robots to develop hyper-realistic reconstructions, which can be used to generate images that are intuitive for human inspection. In particular, the recently introduced 3D Gaussian Splatting (3DGS), which describes the scene with up to millions of primitive ellipsoids, can be rendered in real time. 3DGS has rapidly gained prominence. However, a critical unsolved problem persists: how can we fuse multiple 3DGS into a single coherent model? Solving this problem will enable robot teams to jointly build 3DGS models of their surroundings. A key insight of this work is to leverage the duality between photorealistic reconstructions, which render realistic 2D images from 3D structure, and 3D foundation models, which predict 3D structure from image pairs. To this end, we develop PhotoReg, a framework to register multiple photorealistic 3DGS models with 3D foundation models. As 3DGS models are generally built from monocular camera images, they have arbitrary scale. To resolve this, PhotoReg actively enforces scale consistency among the different 3DGS models by considering depth estimates within these models. Then, the alignment is iteratively refined with fine-grained photometric losses to produce high-quality fused 3DGS models. We rigorously evaluate PhotoReg on both standard benchmark datasets and our custom-collected datasets, including with two quadruped robots. The code is released at this http URL.

[CV-35] Systematic Literature Review of Vision-Based Approaches to Outdoor Livestock Monitoring with Lessons from Wildlife Studies

Link: https://arxiv.org/abs/2410.05041
Authors: Stacey D. Scott, Zayn J. Abbas, Feerass Ellid, Eli-Henry Dykhne, Muhammad Muhaiminul Islam, Weam Ayad, Kristina Kacmorova, Dan Tulpan, Minglun Gong
Keywords: Precision livestock farming, Precision livestock, farming outcomes, health and welfare, aims to improve
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 5 figures, 2 tables

Abstract:Precision livestock farming (PLF) aims to improve the health and welfare of livestock animals and farming outcomes through the use of advanced technologies. Computer vision, combined with recent advances in machine learning and deep learning artificial intelligence approaches, offers a possible solution to the PLF ideal of 24/7 livestock monitoring that helps facilitate early detection of animal health and welfare issues. However, a significant number of livestock species are raised in large outdoor habitats that pose technological challenges for computer vision approaches. This review provides a comprehensive overview of computer vision methods and open challenges in outdoor animal monitoring. We include research from both the livestock and wildlife fields in the review because of the similarities in appearance, behaviour, and habitat for many livestock and wildlife. We focus on large terrestrial mammals, such as cattle, horses, deer, goats, sheep, koalas, giraffes, and elephants. We use an image processing pipeline to frame our discussion and highlight the current capabilities and open technical challenges at each stage of the pipeline. The review found a clear trend towards the use of deep learning approaches for animal detection, counting, and multi-species classification. We discuss in detail the applicability of current vision-based methods to PLF contexts and promising directions for future research.

[CV-36] Conditional Variational Autoencoders for Probabilistic Pose Regression IROS2024

Link: https://arxiv.org/abs/2410.04989
Authors: Fereidoon Zangeneh, Leonard Bruns, Amit Dekel, Alessandro Pieropan, Patric Jensfelt
Keywords: lose track, visual relocalization, Robots rely, Abstract, visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IROS 2024

Abstract:Robots rely on visual relocalization to estimate their pose from camera images when they lose track. One of the challenges in visual relocalization is repetitive structures in the operation environment of the robot. This calls for probabilistic methods that support multiple hypotheses for the robot’s pose. We propose such a probabilistic method to predict the posterior distribution of camera poses given an observed image. Our proposed training strategy results in a generative model of camera poses given an image, which can be used to draw samples from the pose posterior distribution. Our method is streamlined and well-founded in theory and outperforms existing methods on localization in the presence of ambiguities.

[CV-37] RoWeeder: Unsupervised Weed Mapping through Crop-Row Detection ECCV2024

Link: https://arxiv.org/abs/2410.04983
Authors: Pasquale De Marinis, Rino Vessio, Giovanna Castellano
Keywords: Precision agriculture relies, agriculture relies heavily, Precision agriculture, robust crop yields, ensure robust crop
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Computer Vision for Plant Phenotyping and Agriculture (CVPPA) workshop at ECCV 2024

Abstract:Precision agriculture relies heavily on effective weed management to ensure robust crop yields. This study presents RoWeeder, an innovative framework for unsupervised weed mapping that combines crop-row detection with a noise-resilient deep learning model. By leveraging crop-row information to create a pseudo-ground truth, our method trains a lightweight deep learning model capable of distinguishing between crops and weeds, even in the presence of noisy data. Evaluated on the WeedMap dataset, RoWeeder achieves an F1 score of 75.3, outperforming several baselines. Comprehensive ablation studies further validated the model’s performance. By integrating RoWeeder with drone technology, farmers can conduct real-time aerial surveys, enabling precise weed management across large fields. The code is available at: this https URL.

[CV-38] Comparison of marker-less 2D image-based methods for infant pose estimation

Link: https://arxiv.org/abs/2410.04980
Authors: Lennart Jahn, Sarah Flügge, Dajie Zhang, Luise Poustka, Sven Bölte, Florentin Wörgötter, Peter B Marschik, Tomas Kulvicius
Keywords: General Movement Assessment, General Movement, pose estimation accuracy, pose estimation, Movement Assessment
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:There are increasing efforts to automate clinical methods for early diagnosis of developmental disorders, among them the General Movement Assessment (GMA), a video-based tool to classify infant motor functioning. Optimal pose estimation is a crucial part of the automated GMA. In this study we compare the performance of available generic- and infant-pose estimators, and the choice of viewing angle for optimal recordings, i.e., conventional diagonal view used in GMA vs. top-down view. For this study, we used 4500 annotated video-frames from 75 recordings of infant spontaneous motor functions from 4 to 26 weeks. To determine which available pose estimation method and camera angle yield the best pose estimation accuracy on infants in a GMA related setting, the distance to human annotations as well as the percentage of correct key-points (PCK) were computed and compared. The results show that the best performing generic model trained on adults, ViTPose, also performs best on infants. We see no improvement from using specialized infant-pose estimators over the generic pose estimators on our own infant dataset. However, when retraining a generic model on our data, there is a significant improvement in pose estimation accuracy. The pose estimation accuracy obtained from the top-down view is significantly better than that obtained from the diagonal view, especially for the detection of the hip key-points. The results also indicate only limited generalization capabilities of infant-pose estimators to other infant datasets, which hints that one should be careful when choosing infant pose estimators and using them on infant datasets which they were not trained on. While the standard GMA method uses a diagonal view for assessment, pose estimation accuracy significantly improves using a top-down view. This suggests that a top-down view should be included in recording setups for automated GMA research.
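
The percentage of correct key-points (PCK) used in this study can be sketched as follows. This is a generic formulation, assuming a fixed distance threshold in pixels; the study may normalise the threshold differently (for example by body size), and the function name `pck` is an illustrative choice.

```python
import numpy as np


def pck(predicted, annotated, threshold: float) -> float:
    """Fraction of key-points whose prediction lies within `threshold` of the annotation."""
    predicted = np.asarray(predicted, dtype=np.float64)
    annotated = np.asarray(annotated, dtype=np.float64)
    # Euclidean distance between each predicted key-point and its human annotation.
    distances = np.linalg.norm(predicted - annotated, axis=-1)
    return float(np.mean(distances <= threshold))
```

A value of 1.0 means every predicted key-point falls within the threshold of its annotation.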

[CV-39] 6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering

Link: https://arxiv.org/abs/2410.04974
Authors: Zhongpai Gao, Benjamin Planche, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Keywords: Gaussian splatting, view synthesis, synthesis has advanced, development of neural, neural radiance fields
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Demo Video: this https URL

Abstract:Novel view synthesis has advanced significantly with the development of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS). However, achieving high quality without compromising real-time rendering remains challenging, particularly for physically-based ray tracing with view-dependent effects. Recently, N-dimensional Gaussians (N-DG) introduced a 6D spatial-angular representation to better incorporate view-dependent effects, but the Gaussian representation and control scheme are sub-optimal. In this paper, we revisit 6D Gaussians and introduce 6D Gaussian Splatting (6DGS), which enhances color and opacity representations and leverages the additional directional information in the 6D space for optimized Gaussian control. Our approach is fully compatible with the 3DGS framework and significantly improves real-time radiance field rendering by better modeling view-dependent effects and fine details. Experiments demonstrate that 6DGS significantly outperforms 3DGS and N-DG, achieving up to a 15.73 dB improvement in PSNR with a reduction of 66.5% Gaussian points compared to 3DGS.

[CV-40] L-C4: Language-Based Video Colorization for Creative and Consistent Color

Link: https://arxiv.org/abs/2410.04972
Authors: Zheng Chang, Shuchen Weng, Huan Ouyang, Yu Li, Si Li, Boxin Shi
Keywords: Automatic video colorization, Automatic video, optional color candidates, multiple optional color, video colorization
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Automatic video colorization is inherently an ill-posed problem because each monochrome frame has multiple optional color candidates. Previous exemplar-based video colorization methods restrict the user’s imagination due to the elaborate retrieval process. Alternatively, conditional image colorization methods combined with post-processing algorithms still struggle to maintain temporal consistency. To address these issues, we present Language-based video Colorization for Creative and Consistent Colors (L-C4) to guide the colorization process using user-provided language descriptions. Our model is built upon a pre-trained cross-modality generative model, leveraging its comprehensive language understanding and robust color representation abilities. We introduce the cross-modality pre-fusion module to generate instance-aware text embeddings, enabling the application of creative colors. Additionally, we propose temporally deformable attention to prevent flickering or color shifts, and cross-clip fusion to maintain long-term color consistency. Extensive experimental results demonstrate that L-C4 outperforms relevant methods, achieving semantically accurate colors, unrestricted creative correspondence, and temporally robust consistency.

[CV-41] Revealing Directions for Text-guided 3D Face Editing

Link: https://arxiv.org/abs/2410.04965
Authors: Zhuo Chen, Yichao Yan, Sehngqi Liu, Yuhao Cheng, Weiming Zhao, Lincheng Li, Mengxiao Bi, Xiaokang Yang
Keywords: control signals, significant task, face, Face Clan, face editing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D face editing is a significant task in multimedia, aimed at the manipulation of 3D face models across various control signals. The success of 3D-aware GANs provides expressive 3D models learned from 2D single-view images only, encouraging researchers to discover semantic editing directions in their latent space. However, previous methods face challenges in balancing quality, efficiency, and generalization. To solve this problem, we explore the possibility of bringing the strengths of diffusion models into 3D-aware GANs. In this paper, we present Face Clan, a fast and text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions. To achieve disentangled editing, we propose to diffuse on the latent space under a pair of opposite prompts to estimate the mask indicating the region of interest on latent codes. Based on the mask, we then apply denoising to the masked latent codes to reveal the editing direction. Our method offers precisely controllable manipulation, allowing users to intuitively customize regions of interest with a text description. Experiments demonstrate the effectiveness and generalization of Face Clan for various pre-trained GANs. It offers an intuitive and widely applicable tool for text-guided face editing that contributes to the landscape of multimedia content creation.

[CV-42] On Efficient Variants of Segment Anything Model: A Survey

Link: https://arxiv.org/abs/2410.04960
Authors: Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu
Keywords: image segmentation tasks, segmentation tasks, diverse applications, image segmentation, strong generalization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Report in progress

Abstract:The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as mobile devices. To address this, a variety of SAM variants have been proposed to enhance efficiency without sacrificing accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present core techniques used in SAM and model acceleration. This is followed by an in-depth analysis of various acceleration strategies, categorized by approach. Finally, we offer a unified and extensive evaluation of these methods, assessing their efficiency and accuracy on representative benchmarks, and providing a clear comparison of their overall performance.

[CV-43] Real-time Ship Recognition and Georeferencing for the Improvement of Maritime Situational Awareness

Link: https://arxiv.org/abs/2410.04946
Authors: Borja Carrillo Perez
Keywords: advanced situational awareness, situational awareness solutions, infrastructures are crucial, increasingly important, solutions are increasingly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In an era where maritime infrastructures are crucial, advanced situational awareness solutions are increasingly important. The use of optical camera systems can allow real-time usage of maritime footage. This thesis presents an investigation into leveraging deep learning and computer vision to advance real-time ship recognition and georeferencing for the improvement of maritime situational awareness. A novel dataset, ShipSG, is introduced, containing 3,505 images and 11,625 ship masks with corresponding class and geographic position. After an exploration of the state of the art, a custom real-time segmentation architecture, ScatYOLOv8+CBAM, is designed for the NVIDIA Jetson AGX Xavier embedded system. This architecture adds the 2D scattering transform and attention mechanisms to YOLOv8, achieving an mAP of 75.46% at 25.3 ms per frame, outperforming state-of-the-art methods by over 5%. To improve small and distant ship recognition in high-resolution images on embedded systems, an enhanced slicing mechanism is introduced, improving mAP by 8% to 11%. Additionally, a georeferencing method is proposed, achieving positioning errors of 18 m for ships up to 400 m away and 44 m for ships between 400 m and 1200 m. The findings are also applied in real-world scenarios, such as the detection of abnormal ship behaviour, camera integrity assessment and 3D reconstruction. The approach of this thesis outperforms existing methods and provides a framework for integrating recognized and georeferenced ships into real-time systems, enhancing operational effectiveness and decision-making for maritime stakeholders. This thesis contributes to the maritime computer vision field by establishing a benchmark for ship segmentation and georeferencing research, demonstrating the viability of deep-learning-based recognition and georeferencing methods for real-time maritime monitoring.
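
Slicing a high-resolution image into overlapping tiles is a common way to help a detector find small, distant objects. The sketch below shows only the generic tiling idea, not the enhanced mechanism proposed in the thesis; the function name `slice_windows` and its parameters are hypothetical.

```python
def slice_windows(width: int, height: int, tile: int, overlap: int):
    """Return (x0, y0, x1, y1) boxes of overlapping tiles that cover the whole image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(length: int):
        # A single tile suffices when the image is no larger than the tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        # Place the last tile flush with the image border so nothing is missed.
        positions.append(length - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

Each box can be cropped and passed to the detector separately, with the detections shifted back by `(x0, y0)` before merging.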

[CV-44] Next state prediction gives rise to entangled yet compositional representations of objects

Link: https://arxiv.org/abs/2410.04940
Authors: Tankred Saanum, Luca M. Schulze Buschoff, Peter Dayan, Eric Schulz
Keywords: vast state spaces, combinatorially vast state, state spaces, humans to generalize, generalize across combinatorially
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Compositional representations are thought to enable humans to generalize across combinatorially vast state spaces. Models with learnable object slots, which encode information about objects in separate latent codes, have shown promise for this type of generalization but rely on strong architectural priors. Models with distributed representations, on the other hand, use overlapping, potentially entangled neural codes, and their ability to support compositional generalization remains underexplored. In this paper we examine whether distributed models can develop linearly separable representations of objects, like slotted models, through unsupervised training on videos of object interactions. We show that, surprisingly, models with distributed representations often match or outperform models with object slots in downstream prediction tasks. Furthermore, we find that linearly separable object representations can emerge without object-centric priors, with auxiliary objectives like next-state prediction playing a key role. Finally, we observe that distributed models’ object representations are never fully disentangled, even if they are linearly separable: Multiple objects can be encoded through partially overlapping neural populations while still being highly separable with a linear classifier. We hypothesize that maintaining partially shared codes enables distributed models to better compress object dynamics, potentially enhancing generalization.

[CV-45] PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion

Link: https://arxiv.org/abs/2410.04939
Authors: Sijie Wang, Qiyu Kang, Rui She, Kai Zhao, Yang Song, Wee Peng Tay
Keywords: Place recognition plays, Place recognition, computer vision, finding applications, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IEEE TITS 2024

Abstract:Place recognition plays a crucial role in the fields of robotics and computer vision, finding applications in areas such as autonomous driving, mapping, and localization. Place recognition identifies a place using query sensor data and a known database. One of the main challenges is to develop a model that can deliver accurate results while being robust to environmental variations. We propose two multi-modal place recognition models, namely PRFusion and PRFusion++. PRFusion utilizes global fusion with manifold metric attention, enabling effective interaction between features without requiring camera-LiDAR extrinsic calibrations. In contrast, PRFusion++ assumes the availability of extrinsic calibrations and leverages pixel-point correspondences to enhance feature learning on local windows. Additionally, both models incorporate neural diffusion layers, which enable reliable operation even in challenging environments. We verify the state-of-the-art performance of both models on three large-scale benchmarks. Notably, they outperform existing models by a substantial margin of +3.0 AR@1 on the demanding Boreas dataset. Furthermore, we conduct ablation studies to validate the effectiveness of our proposed methods. The codes are available at: this https URL
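
Place recognition, as described in this abstract, retrieves the database entry closest to a query descriptor. The sketch below shows a simple recall@1 computation under stated assumptions: descriptors are compared by Euclidean distance, and a retrieval counts as correct when the retrieved place lies within `radius` of the query's true position. The function name `recall_at_1` and this success criterion are illustrative and may differ from the paper's AR@1 protocol.

```python
import numpy as np


def recall_at_1(query_desc, db_desc, query_pos, db_pos, radius: float) -> float:
    """Fraction of queries whose nearest database descriptor is geographically close."""
    query_desc = np.asarray(query_desc, dtype=np.float64)
    db_desc = np.asarray(db_desc, dtype=np.float64)
    query_pos = np.asarray(query_pos, dtype=np.float64)
    db_pos = np.asarray(db_pos, dtype=np.float64)
    # Pairwise squared Euclidean distances between query and database descriptors.
    diffs = query_desc[:, None, :] - db_desc[None, :, :]
    nearest = np.argmin(np.sum(diffs ** 2, axis=-1), axis=1)
    # Geographic error between each query and its top-1 retrieved place.
    geo_error = np.linalg.norm(query_pos - db_pos[nearest], axis=-1)
    return float(np.mean(geo_error <= radius))
```

Averaging this quantity over several query sets would give an average recall figure of the kind reported in the abstract.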

[CV-46] OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Link: https://arxiv.org/abs/2410.04932
Authors: Leheng Li,Weichao Qiu,Xu Yan,Jing He,Kaiqiang Zhou,Yingjie Cai,Qing Lian,Bingbing Liu,Ying-Cong Chen
Keywords: present OmniBooth, image, instance-level multi-modal customization, image generation framework, multi-modal customization
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability. In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method empowers users with more flexibility in controllable generation, as users can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets. Project page: this https URL

[CV-47] Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation ECCV2024

Link: https://arxiv.org/abs/2410.04906
Authors: Ivan Rinaldi,Nicola Fanelli,Giovanna Castellano,Gennaro Vessio
Keywords: Artificial Intelligence, Intelligence and generative, revolutionized music creation
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Presented at the AI for Visual Arts (AI4VA) workshop at ECCV 2024

Abstract:Artificial Intelligence and generative models have revolutionized music creation, with many models leveraging textual or visual prompts for guidance. However, existing image-to-music models are limited to simple images, lacking the capability to generate music from complex digitized artworks. To address this gap, we introduce $\mathcal{A}\textit{rt2}\mathcal{M}\textit{us}$, a novel model designed to create music from digitized artworks or text inputs. $\mathcal{A}\textit{rt2}\mathcal{M}\textit{us}$ extends the AudioLDM 2 architecture, a text-to-audio model, and employs our newly curated datasets, created via ImageBind, which pair digitized artworks with music. Experimental results demonstrate that $\mathcal{A}\textit{rt2}\mathcal{M}\textit{us}$ can generate music that resonates with the input stimuli. These findings suggest promising applications in multimedia art, interactive installations, and AI-driven creative tools.

[CV-48] D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation

Link: https://arxiv.org/abs/2410.04889
Authors: Nikolaos Vasilikopoulos,Drosakis Drosakis,Antonis Argyros
Keywords: single RGB image, estimates human pose, Human Pose, RGB image, Shape Estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present D-PoSE (Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation), a one-stage method that estimates human pose and SMPL-X shape parameters from a single RGB image. Recent works use larger models with transformer backbones and decoders to improve the accuracy in human pose and shape (HPS) benchmarks. D-PoSE proposes a vision based approach that uses the estimated human depth-maps as an intermediate representation for HPS and leverages training with synthetic data and the ground-truth depth-maps provided with them for depth supervision during training. Although trained on synthetic datasets, D-PoSE achieves state-of-the-art performance on the real-world benchmark datasets, EMDB and 3DPW. Despite its simple lightweight design and the CNN backbone, it outperforms ViT-based models that have a number of parameters that is larger by almost an order of magnitude. D-PoSE code is available at: this https URL

[CV-49] Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models

Link: https://arxiv.org/abs/2410.04884
Authors: Dehong Kong,Siyuan Liang,Xiaopeng Zhu,Yuansheng Zhong,Wenqi Ren
Keywords: Visual language pre-training, Visual language, demonstrated significant success, language pre-training, demonstrated significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: accepted by Visual Intelligence

Abstract:Visual language pre-training (VLP) models have demonstrated significant success across various domains, yet they remain vulnerable to adversarial attacks. Addressing these adversarial vulnerabilities is crucial for enhancing security in multimodal learning. Traditionally, adversarial methods targeting VLP models involve simultaneously perturbing images and text. However, this approach faces notable challenges: first, adversarial perturbations often fail to translate effectively into real-world scenarios; second, direct modifications to the text are conspicuously visible. To overcome these limitations, we propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Moreover, to optimize patch placement and improve the efficacy of our attacks, we utilize the cross-attention mechanism, which encapsulates intermodal interactions by generating attention maps to guide strategic patch placements. Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate. Additionally, it demonstrates commendable performance in transfer tasks involving text-to-image configurations.

[CV-50] Improved detection of discarded fish species through BoxAL active learning

Link: https://arxiv.org/abs/2410.04880
Authors: Maria Sokolova,Pieter M. Blok,Angelo Mencarelli,Arjan Vroegop,Aloysius van Helmond,Gert Kootstra
Keywords: automated catch registration, powerful data-driven deep-learning, data-driven deep-learning techniques, recent years, powerful data-driven
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, powerful data-driven deep-learning techniques have been developed and applied for automated catch registration. However, these methods are dependent on labelled data, which is time-consuming, labour-intensive, and expensive to collect and requires expert knowledge. In this study, we present an active learning technique, named BoxAL, which includes estimation of epistemic certainty of the Faster R-CNN object-detection model. The method allows selecting the most uncertain training images from an unlabeled pool, which are then used to train the object-detection model. To evaluate the method, we used an open-source image dataset obtained with a dedicated image-acquisition system developed for commercial trawlers targeting demersal species. We demonstrated that our approach allows reaching the same object-detection performance as with random sampling using 400 fewer labelled images. Besides, the mean AP score was significantly higher at the last training iteration with 1100 training images, specifically, 39.0 ± 1.6 and 34.8 ± 1.8 for certainty-based sampling and random sampling, respectively. Additionally, we showed that epistemic certainty is a suitable method to sample images that the current iteration of the model cannot deal with yet. Our study additionally showed that the sampled new data is more valuable for training than the remaining unlabeled data. Our software is available on this https URL.
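
The selection step described in this abstract (pick the images the detector is least certain about) can be sketched roughly as follows. This is only an illustration: the abstract does not say how BoxAL computes epistemic certainty, so the per-image certainty scores, the image ids, and the function name `select_most_uncertain` are all hypothetical.

```python
def select_most_uncertain(certainty_by_image, k):
    """Return the ids of the k images with the lowest certainty score.

    certainty_by_image maps an image id to a certainty value (higher means
    the detector is more certain). How this value is estimated is NOT taken
    from the paper; it is assumed to be given.
    """
    # Sort by certainty first, then by id so that ties are broken deterministically.
    ranked = sorted(certainty_by_image.items(), key=lambda item: (item[1], item[0]))
    return [image_id for image_id, _ in ranked[:k]]


# Hypothetical unlabeled pool with made-up certainty scores.
pool = {"img_a": 0.91, "img_b": 0.35, "img_c": 0.62, "img_d": 0.18}
selected = select_most_uncertain(pool, k=2)
print(selected)
```

In an active learning loop, the selected images would be sent for annotation, added to the training set, and the detector retrained before the next round of selection.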

[CV-51] TeX-NeRF: Neural Radiance Fields from Pseudo-TeX Vision

Link: https://arxiv.org/abs/2410.04873
Authors: Chonghao Zhong,Chao Xu
Keywords: Neural radiance fields, exceptional visual effects, gained significant attention, Neural radiance, visible light cameras
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Neural radiance fields (NeRF) has gained significant attention for its exceptional visual effects. However, most existing NeRF methods reconstruct 3D scenes from RGB images captured by visible light cameras. In practical scenarios like darkness, low light, or bad weather, visible light cameras become ineffective. Therefore, we propose TeX-NeRF, a 3D reconstruction method using only infrared images, which introduces the object material emissivity as a priori, preprocesses the infrared images using Pseudo-TeX vision, and maps the temperatures (T), emissivities (e), and textures (X) of the scene into the saturation (S), hue (H), and value (V) channels of the HSV color space, respectively. Novel view synthesis using the processed images has yielded excellent results. Additionally, we introduce 3D-TeX Datasets, the first dataset comprising infrared images and their corresponding Pseudo-TeX vision images. Experiments demonstrate that our method not only matches the quality of scene reconstruction achieved with high-quality RGB images but also provides accurate temperature estimations for objects in the scene.
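
As a rough illustration of the channel mapping stated in this abstract (temperature to saturation, emissivity to hue, texture to value), a single pixel could be encoded as below. The normalization ranges, the clamping, and the function names are assumptions made for this sketch; the abstract does not specify how the paper scales each quantity.

```python
import colorsys


def clamp01(value):
    """Clamp a float to the closed interval [0, 1]."""
    return min(1.0, max(0.0, value))


def pseudo_tex_pixel(temperature, emissivity, texture, t_range=(250.0, 350.0)):
    """Map (T, e, X) of one pixel to an RGB triple via HSV.

    Follows the order given in the abstract: T -> S, e -> H, X -> V.
    t_range (in kelvin) is a hypothetical normalization window, and
    emissivity and texture are assumed to already lie in [0, 1].
    """
    low, high = t_range
    if high <= low:
        raise ValueError("t_range must satisfy low < high")
    saturation = clamp01((temperature - low) / (high - low))
    hue = clamp01(emissivity)
    value = clamp01(texture)
    return colorsys.hsv_to_rgb(hue, saturation, value)


# Hypothetical pixel: 300 K, emissivity 0.6, texture intensity 0.8.
print(pseudo_tex_pixel(300.0, 0.6, 0.8))
```

Applying such a mapping to every pixel gives a three-channel image that an RGB-based NeRF pipeline can consume without architectural changes, which appears to be the motivation for the preprocessing step.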

[CV-52] Art Forgery Detection using Kolmogorov Arnold and Convolutional Neural Networks ECCV2024

Link: https://arxiv.org/abs/2410.04866
Authors: Sandro Boccuzzo,Deborah Desirée Meyer,Ludovica Schaerf
Keywords: requiring profound connoisseurship, task requiring profound, Wolfgang Beltracchi, forger Wolfgang Beltracchi, historically established
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ECCV 2024 workshop AI4VA, oral presentation

Abstract:Art authentication has historically established itself as a task requiring profound connoisseurship of one particular artist. Nevertheless, famous art forgers such as Wolfgang Beltracchi were able to deceive dozens of art experts. In recent years Artificial Intelligence algorithms have been successfully applied to various image processing tasks. In this work, we leverage the growing improvements in AI to present an art authentication framework for the identification of the forger Wolfgang Beltracchi. Differently from existing literature on AI-aided art authentication, we focus on a specialized model of a forger, rather than an artist, flipping the approach of traditional AI methods. We use a carefully compiled dataset of known artists forged by Beltracchi and a set of known works by the forger to train a multiclass image classification model based on EfficientNet. We compare the results with Kolmogorov Arnold Networks (KAN) which, to the best of our knowledge, have never been tested in the art domain. The results show a general agreement between the different models’ predictions on artworks flagged as forgeries, which are then closely studied using visual analysis.

[CV-53] PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing

Link: https://arxiv.org/abs/2410.04844
Authors: Feng Tian,Yixuan Li,Yichao Yan,Shanyan Guan,Yanhao Ge,Xiaokang Yang
Keywords: core challenges persist, initial features, background preservation, challenges persist, core challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In the field of image editing, three core challenges persist: controllability, background preservation, and efficiency. Inversion-based methods rely on time-consuming optimization to preserve the features of the initial images, which results in low efficiency due to the requirement for extensive network inference. Conversely, inversion-free methods lack theoretical support for background similarity, as they circumvent the issue of maintaining initial features to achieve efficiency. As a consequence, none of these methods can achieve both high efficiency and background consistency. To tackle the challenges and the aforementioned disadvantages, we introduce PostEdit, a method that incorporates a posterior scheme to govern the diffusion sampling process. Specifically, a corresponding measurement term related to both the initial features and Langevin dynamics is introduced to optimize the estimated image generated by the given target prompt. Extensive experimental results indicate that the proposed PostEdit achieves state-of-the-art editing performance while accurately preserving unedited regions. Furthermore, the method is both inversion- and training-free, necessitating approximately 1.5 seconds and 18 GB of GPU memory to generate high-quality results.

[CV-54] A Simple Image Segmentation Framework via In-Context Examples NEURIPS

Link: https://arxiv.org/abs/2410.04842
Authors: Yang Liu,Chenchen Jing,Hengtao Li,Muzhi Zhu,Hao Chen,Xinlong Wang,Chunhua Shen
Keywords: unified in-context learning, explorations of generalist, tackle a variety, in-context, in-context learning framework
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to Proc. Conference on Neural Information Processing Systems (NeurIPS) 2024. Webpage: this https URL

Abstract:Recently, there have been explorations of generalist segmentation models that can effectively tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, as not all in-context examples can accurately convey the task information. In order to address this issue, we present SINE, a simple image Segmentation framework utilizing in-context examples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations, and the decoder is designed to yield multiple task-specific output masks to effectively eliminate task ambiguity. Specifically, we introduce an In-context Interaction module to complement in-context information and produce correlations between the target image and the in-context example and a Matching Transformer that uses fixed matching and a Hungarian algorithm to eliminate differences between different tasks. In addition, we have further perfected the current evaluation system for in-context image segmentation, aiming to facilitate a holistic appraisal of these models. Experiments on various segmentation tasks show the effectiveness of the proposed method.

[CV-55] Multimodal Fusion Strategies for Mapping Biophysical Landscape Features ECCV2024

Link: https://arxiv.org/abs/2410.04833
Authors: Lucia Gordon,Nico Lang,Catherine Ressijac,Andrew Davies
Keywords: Multimodal aerial data, monitor natural systems, Multimodal aerial, natural systems, ecology and conservation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 9 pages, 4 figures, ECCV 2024 Workshop in CV for Ecology

Abstract:Multimodal aerial data are used to monitor natural systems, and machine learning can significantly accelerate the classification of landscape features within such imagery to benefit ecology and conservation. It remains under-explored, however, how these multiple modalities ought to be fused in a deep learning model. As a step towards filling this gap, we study three strategies (Early fusion, Late fusion, and Mixture of Experts) for fusing thermal, RGB, and LiDAR imagery using a dataset of spatially-aligned orthomosaics in these three modalities. In particular, we aim to map three ecologically-relevant biophysical landscape features in African savanna ecosystems: rhino middens, termite mounds, and water. The three fusion strategies differ in whether the modalities are fused early or late, and if late, whether the model learns fixed weights per modality for each class or generates weights for each class adaptively, based on the input. Overall, the three methods have similar macro-averaged performance with Late fusion achieving an AUC of 0.698, but their per-class performance varies strongly, with Early fusion achieving the best recall for middens and water and Mixture of Experts achieving the best recall for mounds.
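
The difference between the two late-fusion variants described in this abstract (fixed per-modality, per-class weights versus weights generated adaptively from the input) can be sketched as follows. This is a minimal illustration, not the authors' model: the softmax normalization over modalities, the gate logits, and all names and numbers are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(scores, weights):
    """Weighted sum of per-modality class scores.

    scores[m][c] and weights[m][c] are indexed by modality name m and class index c.
    """
    modalities = list(scores)
    n_classes = len(scores[modalities[0]])
    return [
        sum(weights[m][c] * scores[m][c] for m in modalities)
        for c in range(n_classes)
    ]


def adaptive_weights(gate_logits):
    """Turn input-dependent gate logits into weights that sum to 1 over modalities per class."""
    modalities = list(gate_logits)
    n_classes = len(gate_logits[modalities[0]])
    weights = {m: [0.0] * n_classes for m in modalities}
    for c in range(n_classes):
        probs = softmax([gate_logits[m][c] for m in modalities])
        for m, p in zip(modalities, probs):
            weights[m][c] = p
    return weights


# Hypothetical per-modality scores for the classes (midden, mound, water).
scores = {"thermal": [0.2, 0.9, 0.1], "rgb": [0.7, 0.4, 0.3], "lidar": [0.1, 0.8, 0.6]}

# Late fusion: weights are learned once and stay the same for every input.
fixed = {"thermal": [0.5, 0.2, 0.1], "rgb": [0.3, 0.2, 0.6], "lidar": [0.2, 0.6, 0.3]}
print(fuse(scores, fixed))

# Mixture of Experts: a gating network (not shown) would emit these logits per input.
gate_logits = {"thermal": [2.0, 0.0, -1.0], "rgb": [0.0, 0.0, 1.0], "lidar": [-1.0, 1.0, 0.0]}
print(fuse(scores, adaptive_weights(gate_logits)))
```

Early fusion, by contrast, would combine the modalities at the input (for example by stacking channels) before any per-modality scores exist, so it has no counterpart in this sketch.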

[CV-56] CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models

Link: https://arxiv.org/abs/2410.04823
Authors: Songning Lai,Jiayu Yang,Yu Huang,Lijie Hu,Tianlang Xue,Zhangyi Hu,Jiaxu Li,Haicheng Liao,Yutao Yue
Keywords: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, development of Explainable, Concept Bottleneck Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Abstract:Despite the transformative impact of deep learning across multiple domains, the inherent opacity of these models has driven the development of Explainable Artificial Intelligence (XAI). Among these efforts, Concept Bottleneck Models (CBMs) have emerged as a key approach to improve interpretability by leveraging high-level semantic information. However, CBMs, like other machine learning models, are susceptible to security threats, particularly backdoor attacks, which can covertly manipulate model behaviors. Understanding that the community has not yet studied the concept level backdoor attack of CBM, because of “Better the devil you know than the devil you don’t know.”, we introduce CAT (Concept-level Backdoor ATtacks), a methodology that leverages the conceptual representations within CBMs to embed triggers during training, enabling controlled manipulation of model predictions at inference time. An enhanced attack pattern, CAT+, incorporates a correlation function to systematically select the most effective and stealthy concept triggers, thereby optimizing the attack’s impact. Our comprehensive evaluation framework assesses both the attack success rate and stealthiness, demonstrating that CAT and CAT+ maintain high performance on clean data while achieving significant targeted effects on backdoored datasets. This work underscores the potential security risks associated with CBMs and provides a robust testing methodology for future security assessments.

[CV-57] Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders

Link: https://arxiv.org/abs/2410.04817
Authors: Kosta Dakic,Kanchana Thilakarathna,Rodrigo N. Calheiros,Teng Joon Lim
Keywords: modern computer vision, offering advanced capabilities, computer vision, offering advanced, understanding and analysis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Notes: 10 pages, conference

Abstract:Multiview systems have become a key technology in modern computer vision, offering advanced capabilities in scene understanding and analysis. However, these systems face critical challenges in bandwidth limitations and computational constraints, particularly for resource-limited camera nodes like drones. This paper presents a novel approach for communication-efficient distributed multiview detection and tracking using masked autoencoders (MAEs). We introduce a semantic-guided masking strategy that leverages pre-trained segmentation models and a tunable power function to prioritize informative image regions. This approach, combined with an MAE, reduces communication overhead while preserving essential visual information. We evaluate our method on both virtual and real-world multiview datasets, demonstrating comparable performance in terms of detection and tracking performance metrics compared to state-of-the-art techniques, even at high masking ratios. Our selective masking algorithm outperforms random masking, maintaining higher accuracy and precision as the masking ratio increases. Furthermore, our approach achieves a significant reduction in transmission data volume compared to baseline methods, thereby balancing multiview tracking performance with communication efficiency.
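
One way to picture a semantic-guided masking step with a tunable power function is the sketch below: each patch gets an importance score, the score is raised to an exponent, and patches to keep are sampled in proportion to the result. The exact form used in the paper is not given in the abstract, so the exponent `gamma`, the sampling scheme, and the scores are all assumptions for illustration.

```python
import random


def semantic_keep_indices(scores, mask_ratio, gamma=2.0, seed=0):
    """Sample patch indices to keep, favouring patches with high semantic scores.

    scores: per-patch importance in [0, 1], e.g. derived from a segmentation
        model (assumed to be given; not computed here).
    mask_ratio: fraction of patches to drop, in [0, 1).
    gamma: hypothetical power-function exponent; larger values concentrate
        the kept patches on high-score regions.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    if not scores:
        return []
    rng = random.Random(seed)
    n_keep = min(len(scores), max(1, round(len(scores) * (1.0 - mask_ratio))))
    # A small epsilon keeps every weight positive so sampling never fails.
    weights = [max(s, 0.0) ** gamma + 1e-8 for s in scores]
    candidates = list(range(len(scores)))
    kept = []
    for _ in range(n_keep):
        # Weighted draw without replacement: pick one, then remove it.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        kept.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(kept)


# Hypothetical scores for 8 patches; keep 25% of them.
patch_scores = [0.9, 0.8, 0.1, 0.05, 0.0, 0.2, 0.7, 0.0]
print(semantic_keep_indices(patch_scores, mask_ratio=0.75))
```

Only the kept patches would then be transmitted, with the masked autoencoder reconstructing the rest on the receiving side.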

[CV-58] Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration

Link: https://arxiv.org/abs/2410.04811
Authors: Zhiyu Zhu,Jinhui Hou,Hui Liu,Huanqiang Zeng,Junhui Hou
Keywords: Gaussian distribution, establish learnable trajectories, learnable trajectories connecting, trajectories connecting high-quality, restoration approach aims
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The differential equation-based image restoration approach aims to establish learnable trajectories connecting high-quality images to a tractable distribution, e.g., low-quality images or a Gaussian distribution. In this paper, we reformulate the trajectory optimization of this kind of method, focusing on enhancing both reconstruction quality and efficiency. Initially, we navigate effective restoration paths through a reinforcement learning process, gradually steering potential trajectories toward the most precise options. Additionally, to mitigate the considerable computational burden associated with iterative sampling, we propose cost-aware trajectory distillation to streamline complex paths into several manageable steps with adaptable sizes. Moreover, we fine-tune a foundational diffusion model (FLUX) with 12B parameters by using our algorithms, producing a unified framework for handling 7 kinds of image restoration tasks. Extensive experiments showcase the significant superiority of the proposed method, achieving a maximum PSNR improvement of 2.1 dB over state-of-the-art methods, while also greatly enhancing visual perceptual quality. Project page: this https URL.

[CV-59] FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

Link: https://arxiv.org/abs/2410.04810
Authors: Haokun Chen,Hang Li,Yao Zhang,Gengyuan Zhang,Jinhe Bi,Philip Torr,Jindong Gu,Denis Krompass,Volker Tresp
Keywords: machine learning paradigm, decentralized machine learning, One-Shot Federated Learning, special decentralized machine, learning paradigm
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Multimedia (cs.MM)

Abstract:One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM’s pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. Hereby, FedBiP synthesizes images following the client’s local data distribution without compromising the privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with feature space heterogeneity, as well as on challenging medical and satellite image datasets with label heterogeneity. The results demonstrate the effectiveness of FedBiP, which substantially outperforms other OSFL methods.

[CV-60] Building Damage Assessment in Conflict Zones: A Deep Learning Approach Using Geospatial Sub-Meter Resolution Data

Link: https://arxiv.org/abs/2410.04802
Authors: Matteo Risso,Alessia Goffi,Beatrice Alessandra Motetti,Alessio Burrello,Jean Baptiste Bove,Enrico Macii,Massimo Poncino,Daniele Jahier Pagliari,Giuseppe Maffeis
Keywords: geospatial image analysis, Deep Neural Networks, Convolutional Neural Networks, anthropogenic crises, High Resolution
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: This paper has been accepted for publication in the Sixth IEEE International Conference on Image Processing Applications and Systems 2024 copyright IEEE

Abstract:Very High Resolution (VHR) geospatial image analysis is crucial for humanitarian assistance in both natural and anthropogenic crises, as it allows to rapidly identify the most critical areas that need support. Nonetheless, manually inspecting large areas is time-consuming and requires domain expertise. Thanks to their accuracy, generalization capabilities, and highly parallelizable workload, Deep Neural Networks (DNNs) provide an excellent way to automate this task. Nevertheless, there is a scarcity of VHR data pertaining to conflict situations, and consequently, of studies on the effectiveness of DNNs in those scenarios. Motivated by this, our work extensively studies the applicability of a collection of state-of-the-art Convolutional Neural Networks (CNNs) originally developed for natural disasters damage assessment in a war scenario. To this end, we build an annotated dataset with pre- and post-conflict images of the Ukrainian city of Mariupol. We then explore the transferability of the CNN models in both zero-shot and learning scenarios, demonstrating their potential and limitations. To the best of our knowledge, this is the first study to use sub-meter resolution imagery to assess building damage in combat zones.

[CV-61] Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering ACCV2024

Link: https://arxiv.org/abs/2410.04801
Authors: Kazumoto Nakamura,Yuji Nozawa,Yu-Chieh Lin,Kengo Nakata,Youyang Ng
Keywords: pretrained Vision Transformer, Vision Transformer, pretrained Vision, Transformer, attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted to ACCV 2024

Abstract:The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, in image clustering task without requiring re-training or fine-tuning. As model size increases, high-norm artifacts anomaly appears in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.

[CV-62] Transforming Color: A Novel Image Colorization Method

Link: https://arxiv.org/abs/2410.04799
Authors: Hamza Shafiq,Bumshik Lee
Keywords: appealing colorized images, generative adversarial networks, generating visually appealing, visually appealing colorized, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces a novel method for image colorization that utilizes a color transformer and generative adversarial networks (GANs) to address the challenge of generating visually appealing colorized images. Conventional approaches often struggle with capturing long-range dependencies and producing realistic colorizations. The proposed method integrates a transformer architecture to capture global information and a GAN framework to improve visual quality. In this study, a color encoder that utilizes a random normal distribution to generate color features is applied. These features are then integrated with grayscale image features to enhance the overall representation of the images. Our method demonstrates superior performance compared with existing approaches by utilizing the capacity of the transformer, which can capture long-range dependencies and generate a realistic colorization of the GAN. Experimental results show that the proposed network significantly outperforms other state-of-the-art colorization techniques, highlighting its potential for image colorization. This research opens new possibilities for precise and visually compelling image colorization in domains such as digital restoration and historical image analysis.

[CV-63] Analysis of Hybrid Compositions in Animation Film with Weakly Supervised Learning ECCV

Link: https://arxiv.org/abs/2410.04789
Authors: Mónica Apellaniz Portos,Roberto Labadie-Tamayo,Claudius Stemmler,Erwin Feyersinger,Andreas Babic,Franziska Bruckner,Vrääth Öhner,Matthias Zeppelzauer
Keywords: hybrid visual compositions, domain of ephemeral, hybrid compositions, visual compositions, hybrid visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Vision for Art (VISART VII) Workshop at the European Conference of Computer Vision (ECCV)

Abstract:We present an approach for the analysis of hybrid visual compositions in animation in the domain of ephemeral film. We combine ideas from semi-supervised and weakly supervised learning to train a model that can segment hybrid compositions without requiring pre-labeled segmentation masks. We evaluate our approach on a set of ephemeral films from 13 film archives. Results demonstrate that the proposed learning strategy yields a performance close to a fully supervised baseline. On a qualitative level the performed analysis provides interesting insights on hybrid compositions in animation film.

[CV-64] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Link: https://arxiv.org/abs/2410.04780
Authors: Guanyu Zhou,Yibo Yan,Xin Zou,Kun Wang,Aiwei Liu,Xuming Hu
Keywords: Large Language Models, Multimodal Large Language, Large Language, Multimodal Large, industry and academia
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal Large Language Models (MLLMs) have emerged as a central focus in both industry and academia, but often suffer from biases introduced by visual and language priors, which can lead to multimodal hallucination. These biases arise from the visual encoder and the Large Language Model (LLM) backbone, affecting the attention mechanism responsible for aligning multimodal inputs. Existing decoding-based mitigation methods focus on statistical correlations and overlook the causal relationships between attention mechanisms and model output, limiting their effectiveness in addressing these biases. To tackle this issue, we propose a causal inference framework termed CausalMM that applies structural causal modeling to MLLMs, treating modality priors as a confounder between attention mechanisms and output. Specifically, by employing backdoor adjustment and counterfactual reasoning at both the visual and language attention levels, our method mitigates the negative effects of modality priors and enhances the alignment of MLLM’s inputs and outputs, with a maximum score improvement of 65.3% on 6 VLind-Bench indicators and 164 points on MME Benchmark compared to conventional methods. Extensive experiments validate the effectiveness of our approach while being a plug-and-play solution. Our code is available at: this https URL

[CV-65] MM-R3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

Link: https://arxiv.org/abs/2410.04778
Authors: Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal
Keywords (EN): Large Language Models, Large Language, Visual Question Answering, advent of Large, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the advent of Large Language Models (LLMs) and Multimodal (Visio-lingual) LLMs, a flurry of research has emerged, analyzing the performance of such models across a diverse array of tasks. While most studies focus on evaluating the capabilities of state-of-the-art (SoTA) MLLM models through task accuracy (e.g., Visual Question Answering, grounding) across various datasets, our work explores the related but complementary aspect of consistency - the ability of an MLLM model to produce semantically similar or identical responses to semantically similar queries. We note that consistency is a fundamental prerequisite (necessary but not sufficient condition) for robustness and trust in MLLMs. Humans, in particular, are known to be highly consistent (even if not always accurate) in their responses, and consistency is inherently expected from AI systems. Armed with this perspective, we propose the MM-R$^3$ benchmark, which analyses the performance in terms of consistency and accuracy in SoTA MLLMs with three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa. Furthermore, we propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts. With our proposed strategy, we are able to achieve absolute improvements of 5.7% and 12.5%, on average on widely used MLLMs such as BLIP-2 and LLaVa 1.5M in terms of consistency over their existing counterparts.

[CV-66] WTCL-Dehaze: Rethinking Real-world Image Dehazing via Wavelet Transform and Contrastive Learning

Link: https://arxiv.org/abs/2410.04762
Authors: Divine Joseph Appiah, Donghai Guan, Abdul Nasser Kasule, Mingqiang Wei
Keywords (EN): high-level vision tasks, impair high-level vision, hazy outdoor conditions, Discrete Wavelet Transform, low contrast
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Images captured in hazy outdoor conditions often suffer from colour distortion, low contrast, and loss of detail, which impair high-level vision tasks. Single image dehazing is essential for applications such as autonomous driving and surveillance, with the aim of restoring image clarity. In this work, we propose WTCL-Dehaze, an enhanced semi-supervised dehazing network that integrates Contrastive Loss and Discrete Wavelet Transform (DWT). We incorporate contrastive regularization to enhance feature representation by contrasting hazy and clear image pairs. Additionally, we utilize DWT for multi-scale feature extraction, effectively capturing high-frequency details and global structures. Our approach leverages both labelled and unlabelled data to mitigate the domain gap and improve generalization. The model is trained on a combination of synthetic and real-world datasets, ensuring robust performance across different scenarios. Extensive experiments demonstrate that our proposed algorithm achieves superior performance and improved robustness compared to state-of-the-art single image dehazing methods on both benchmark datasets and real-world images.

[CV-67] Intriguing Properties of Large Language and Vision Models

Link: https://arxiv.org/abs/2410.04751
Authors: Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi
Keywords (EN): received significant attention, development efforts due, large language model, tasks requiring perception, remarkable generalization performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Code is available in this https URL

Abstract:Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preserving and importance, by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.

[CV-68] LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies

Link: https://arxiv.org/abs/2410.04749
Authors: Ameer Hamza, Abdullah, Yong Hyun Ahn, Sungyoung Lee, Seong Tae Kim
Keywords (EN): Natural Language Explanations, Generating Natural Language, Natural Language, domain-specific medical knowledge, Language Explanations
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models’ insufficient domain-specific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model’s understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.

[CV-69] Diffusion Models in 3D Vision: A Survey

Link: https://arxiv.org/abs/2410.04738
Authors: Zhen Wang, Dongyuan Li, Renhe Jiang
Keywords (EN): augmented reality, recent years, powering a wide, autonomous driving, medical imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent years, 3D vision has become a crucial field within computer vision, powering a wide range of applications such as autonomous driving, robotics, augmented reality (AR), and medical imaging. This field relies on the accurate perception, understanding, and reconstruction of 3D scenes from 2D data sources like images and videos. Diffusion models, originally designed for 2D generative tasks, offer the potential for more flexible, probabilistic approaches that can better capture the variability and uncertainty present in real-world 3D data. However, traditional methods often struggle with efficiency and scalability. In this paper, we review the state-of-the-art approaches that leverage diffusion models for 3D visual tasks, including but not limited to 3D object generation, shape completion, point cloud reconstruction, and scene understanding. We provide an in-depth discussion of the underlying mathematical principles of diffusion models, outlining their forward and reverse processes, as well as the various architectural advancements that enable these models to work with 3D datasets. We also discuss the key challenges in applying diffusion models to 3D vision, such as handling occlusions and varying point densities, and the computational demands of high-dimensional data. Finally, we discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining for better generalization across 3D tasks. This paper serves as a foundation for future exploration and development in this rapidly evolving field.
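
For orientation, the forward and reverse processes mentioned in the abstract are commonly written as below. This is the standard DDPM-style formulation, offered as a sketch only: the survey's own notation may differ, and the symbols $\beta_t$, $\bar\alpha_t$, $\mu_\theta$, and $\Sigma_\theta$ are assumptions not taken from the abstract.

```latex
% Forward (noising) step with an assumed variance schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Closed form of the forward process after t steps
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)

% Learned reverse (denoising) step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```

For 3D data, $x_0$ would stand for whatever representation is being generated (for example point coordinates or voxel values); that mapping is likewise an assumption here.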

[CV-70] TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Link: https://arxiv.org/abs/2410.04734
Authors: Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
Keywords (EN): improving multimodal large, TLDR models, reward models, models, minimal information
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work done at Meta

Abstract:Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning only one binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations, and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can significantly speed up human annotation by 3 times to acquire a broader range of high-quality vision language data.

[CV-71] PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Link: https://arxiv.org/abs/2410.04733
Authors: Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang
Keywords (EN): convolutional neural networks, employ convolutional neural, Spatiotemporal predictive learning, methods generally fall, recurrent-based approaches
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures

Abstract:Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved- spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer’s potential for real-world applications. The source code will be released at this https URL.
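
To put the reported throughput figures side by side, the short sketch below converts the FPS pairs quoted in the abstract into speedup factors and the MSE reductions into remaining-error fractions. The helper names are hypothetical; only the input numbers come from the abstract.

```python
def speedup(fps_before: float, fps_after: float) -> float:
    """Return how many times faster the new frame rate is than the old one."""
    return fps_after / fps_before


def remaining_fraction(reduction_percent: float) -> float:
    """Return the fraction of the baseline MSE left after a relative reduction."""
    return 1.0 - reduction_percent / 100.0


if __name__ == "__main__":
    # Figures quoted in the abstract (FPS before/after, MSE reduction in percent).
    print("TaxiBJ speedup:", round(speedup(533, 2364), 2))
    print("WeatherBench speedup:", round(speedup(196, 404), 2))
    print("Moving MNIST remaining MSE fraction:", round(remaining_fraction(51.3), 3))
```

Read this way, the quoted numbers correspond to roughly a 4.4x throughput gain on TaxiBJ and about 2.1x on WeatherBench.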

[CV-72] ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction

Link: https://arxiv.org/abs/2410.04721
Authors: Hyungjin Chung, Dohun Lee, Jong Chul Ye
Keywords (EN): global context modeling, generative modeling, distinct areas, generating high-quality local, paradigms in generative
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 10 figures. Project page: this https URL

Abstract:Autoregressive models (ARMs) and diffusion models (DMs) represent two leading paradigms in generative modeling, each excelling in distinct areas: ARMs in global context modeling and long-sequence generation, and DMs in generating high-quality local contexts, especially for continuous data such as images and short videos. However, ARMs often suffer from exponential error accumulation over long sequences, leading to physically implausible results, while DMs are limited by their local context generation capabilities. In this work, we introduce Autoregressive Coherent multimodal generation with Diffusion Correction (ACDC), a zero-shot approach that combines the strengths of both ARMs and DMs at the inference stage without the need for additional fine-tuning. ACDC leverages ARMs for global context generation and memory-conditioned DMs for local correction, ensuring high-quality outputs by correcting artifacts in generated multimodal tokens. In particular, we propose a memory module based on large language models (LLMs) that dynamically adjusts the conditioning texts for the DMs, preserving crucial global context information. Our experiments on multimodal tasks, including coherent multi-frame story generation and autoregressive video generation, demonstrate that ACDC effectively mitigates the accumulation of errors and significantly enhances the quality of generated outputs, achieving superior performance while remaining agnostic to specific ARM and DM architectures. Project page: this https URL

[CV-73] H-SIREN: Improving implicit neural representations with hyperbolic periodic functions

Link: https://arxiv.org/abs/2410.04716
Authors: Rui Gao, Rajeev K. Jaiman
Keywords (EN): Implicit neural representations, partial differential equations, solving partial differential, Implicit neural, neural representations
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Implicit neural representations (INR) have been recently adopted in various applications ranging from computer vision tasks to physics simulations by solving partial differential equations. Among existing INR-based works, multi-layer perceptrons with sinusoidal activation functions find widespread applications and are also frequently treated as a baseline for the development of better activation functions for INR applications. Recent investigations claim that the use of sinusoidal activation functions could be sub-optimal due to their limited supported frequency set as well as their tendency to generate over-smoothed solutions. We provide a simple solution to mitigate such an issue by changing the activation function at the first layer from $\sin(x)$ to $\sin(\sinh(2x))$. We demonstrate H-SIREN in various computer vision and fluid flow problems, where it surpasses the performance of several state-of-the-art INRs.
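
The activation swap described in the abstract can be illustrated with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: it applies both activations to a scalar pre-activation and counts sign changes on an interval to show that the hyperbolic variant oscillates more often as the input grows. The function names, the interval, and the grid resolution are all assumptions chosen for illustration.

```python
import math


def siren_activation(z: float) -> float:
    """Plain sinusoidal activation, sin(z)."""
    return math.sin(z)


def h_siren_first_layer_activation(z: float) -> float:
    """First-layer activation from the abstract, sin(sinh(2z))."""
    return math.sin(math.sinh(2.0 * z))


def count_sign_changes(fn, start: float, stop: float, steps: int) -> int:
    """Count sign changes of fn on a uniform grid over [start, stop]."""
    count = 0
    prev = fn(start)
    for i in range(1, steps + 1):
        cur = fn(start + (stop - start) * i / steps)
        if prev * cur < 0.0:
            count += 1
        if cur != 0.0:
            # Skip exact zeros so a crossing is not lost on the next step.
            prev = cur
    return count


if __name__ == "__main__":
    # On (0, 2] sin(z) never crosses zero, while sin(sinh(2z)) crosses it
    # every time sinh(2z) passes a multiple of pi.
    print(count_sign_changes(siren_activation, 0.001, 2.0, 20000))
    print(count_sign_changes(h_siren_first_layer_activation, 0.001, 2.0, 20000))
```

Because the derivative of sinh(2z) is 2cosh(2z), the local oscillation rate grows with |z|, which is one way to read the abstract's point about a wider supported frequency set. Weight initialization and the rest of the network are outside the scope of this sketch.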

[CV-74] Low-Rank Continual Pyramid Vision Transformer: Incrementally Segment Whole-Body Organs in CT with Light-Weighted Adaptation MICCAI2024

Link: https://arxiv.org/abs/2410.04689
Authors: Vince Zhu, Zhanghexuan Ji, Dazhou Guo, Puyang Wang, Yingda Xia, Le Lu, Xianghua Ye, Wei Zhu, Dakai Jin
Keywords (EN): Deep segmentation networks, Deep segmentation, Deep, segmentation, model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Medical Image Computing and Computer Assisted Intervention – MICCAI 2024

Abstract:Deep segmentation networks achieve high performance when trained on specific datasets. However, in clinical practice, it is often desirable that pretrained segmentation models can be dynamically extended to enable segmenting new organs without access to previous training datasets or without training from scratch. This would ensure a much more efficient model development and deployment paradigm accounting for the patient privacy and data storage issues. This clinically preferred process can be viewed as a continual semantic segmentation (CSS) problem. Previous CSS works would either experience catastrophic forgetting or lead to unaffordable memory costs as models expand. In this work, we propose a new continual whole-body organ segmentation model with light-weighted low-rank adaptation (LoRA). We first train and freeze a pyramid vision transformer (PVT) base segmentation model on the initial task, then continually add light-weighted trainable LoRA parameters to the frozen model for each new learning task. Through a holistic exploration of the architecture modification, we identify the three most important layers (i.e., patch-embedding, multi-head attention and feed forward layers) that are critical in adapting to the new segmentation tasks, while keeping the majority of the pretrained parameters fixed. Our proposed model continually segments new organs without catastrophic forgetting while maintaining a low parameter increasing rate. Continually trained and tested on four datasets covering different body parts of a total of 121 organs, results show that our model achieves high segmentation accuracy, closely reaching the PVT and nnUNet upper bounds, and significantly outperforms other regularization-based CSS methods. When compared to the leading architecture-based CSS method, our model has a substantially lower parameter increasing rate while achieving comparable performance.
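
As a rough illustration of why adding LoRA parameters to a frozen model keeps the parameter growth small, the sketch below implements a generic low-rank update y = Wx + B(Ax) in plain Python and counts the trainable parameters it adds. It is not the paper's architecture: the layer size (768), the rank (8), and all function names are hypothetical, and zero-initializing B follows the usual LoRA convention rather than anything stated in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(frozen_w, lora_a, lora_b, x, scale=1.0):
    """Frozen linear layer plus a trainable low-rank correction B(Ax)."""
    base = matvec(frozen_w, x)                  # frozen pretrained path
    delta = matvec(lora_b, matvec(lora_a, x))   # low-rank trainable path
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]   # frozen weights (toy values)
    a = [[1.0, 1.0]]               # rank-1 down-projection
    b_zero = [[0.0], [0.0]]        # zero-initialized up-projection
    # With B = 0 the adapted layer reproduces the frozen layer exactly.
    print(lora_forward(w, a, b_zero, [1.0, 1.0]))
    # Added parameters for a hypothetical 768x768 layer at rank 8,
    # compared with the 768 * 768 weights of the full matrix.
    print(lora_param_count(768, 768, 8), 768 * 768)
```

In this toy setting the adapter adds 12,288 parameters against 589,824 in the full matrix, about 2% per adapted layer.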

[CV-75] On the Adversarial Risk of Test Time Adaptation: An Investigation into Realistic Test-Time Data Poisoning

Link: https://arxiv.org/abs/2410.04682
Authors: Yongyi Su, Yushu Li, Nanqing Liu, Kui Jia, Xulei Yang, Chuan-Sheng Foo, Xun Xu
Keywords (EN): updates the model, enhance generalization, TTA, model weights, inference stage
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures, 8 tables

Abstract:Test-time adaptation (TTA) updates the model weights during the inference stage using testing data to enhance generalization. However, this practice exposes TTA to adversarial risks. Existing studies have shown that when TTA is updated with crafted adversarial test samples, also known as test-time poisoned data, the performance on benign samples can deteriorate. Nonetheless, the perceived adversarial risk may be overstated if the poisoned data is generated under overly strong assumptions. In this work, we first review realistic assumptions for test-time data poisoning, including white-box versus grey-box attacks, access to benign data, attack budget, and more. We then propose an effective and realistic attack method that better produces poisoned samples without access to benign samples, and derive an effective in-distribution attack objective. We also design two TTA-aware attack objectives. Our benchmarks of existing attack methods reveal that the TTA methods are more robust than previously believed. In addition, we analyze effective defense strategies to help develop adversarially robust TTA methods.

[CV-76] Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2410.04680
Authors: Matthew Strong, Boshu Lei, Aiden Swann, Wen Jiang, Kostas Daniilidis, Monroe Kennedy III
Keywords (EN): Gaussian Splatting, propose a framework, Gaussian, Splatting, view selection
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it has the ability to represent scenes in a both photorealistic and geometrically accurate manner. However, in real-world, online robotic scenes where the number of views is limited given efficiency requirements, random view selection for 3DGS becomes impractical as views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline, which enhances the performance of 3DGS in few-view robotics settings. We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2) that we supplement with Pearson depth and surface normal loss to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes, and extend depth-based FisherRF to them, where we demonstrate both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at this https URL.
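
The abstract mentions a Pearson depth loss without defining it. One common form, shown here purely as an assumption, is one minus the Pearson correlation between rendered and reference depth values, which is insensitive to an unknown positive scale and offset in the reference depth. The function name and the toy inputs are hypothetical, and the paper's exact loss may differ.

```python
import math


def pearson_depth_loss(rendered, reference):
    """Return 1 - Pearson correlation between two equal-length depth lists."""
    if len(rendered) != len(reference) or len(rendered) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_t = sum(reference) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(rendered, reference))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_t = sum((t - mean_t) ** 2 for t in reference)
    if var_r == 0.0 or var_t == 0.0:
        raise ValueError("correlation is undefined for constant depth")
    return 1.0 - cov / math.sqrt(var_r * var_t)


if __name__ == "__main__":
    rendered = [1.0, 2.0, 3.0, 4.0]
    # Same depth ordering up to an unknown scale and shift: loss near 0.
    print(pearson_depth_loss(rendered, [2.0 * d + 5.0 for d in rendered]))
    # Reversed depth ordering: loss near 2, the maximum.
    print(pearson_depth_loss(rendered, [4.0, 3.0, 2.0, 1.0]))
```

The scale and shift invariance is what makes a correlation-style loss a natural fit for supervision from relative depth estimates, though whether the paper uses it for that reason is not stated in the abstract.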

[CV-77] CAR: Controllable Autoregressive Modeling for Visual Generation ACL

Link: https://arxiv.org/abs/2410.04671
Authors: Ziyu Yao, Jialin Li, Yifeng Zhou, Yong Liu, Xi Jiang, Chengjie Wang, Feng Zheng, Yuexian Zou, Lei Li
Keywords (EN): enables fine-grained control, generated outputs, enables fine-grained, critical focus, Controllable AutoRegressive Modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at: this https URL

Abstract:Controllable generation, which enables fine-grained control over generated outputs, has emerged as a critical focus in visual generative models. Currently, there are two primary technical approaches in visual generation: diffusion models and autoregressive models. Diffusion models, as exemplified by ControlNet and T2I-Adapter, offer advanced control mechanisms, whereas autoregressive models, despite showcasing impressive generative quality and scalability, remain underexplored in terms of controllability and flexibility. In this study, we introduce Controllable AutoRegressive Modeling (CAR), a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling, enabling efficient control generation within a pre-trained visual autoregressive model. CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process. Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods. Additionally, CAR achieves robust generalization with significantly fewer training resources compared to those required for pre-training the model. To the best of our knowledge, we are the first to propose a control framework for pre-trained autoregressive visual generation models.

[CV-78] ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Link: https://arxiv.org/abs/2410.04659
Authors: Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu
Keywords (EN): Large Language Models, Active perception, Multimodal Large Language, crucial human capability, involves setting
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. Since comprehensively assessing active perception is challenging, we focus on a specialized form of Visual Question Answering (VQA) that eases the evaluation yet remains challenging for existing MLLMs. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct extensive evaluation over 27 models, including proprietary and open-source models, and observe that the ability to read and comprehend multiple images simultaneously plays a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that our benchmark could help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.

[CV-79] Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Link: https://arxiv.org/abs/2410.04652
Authors: Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer
Keywords (EN): augmented reality benefits, Seamless integration, integration of virtual, worlds in augmented, augmented reality
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, accepted to IEEE ISMAR 2024

Abstract:Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically “understanding” the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose “in-situ” machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at (this https URL) to encourage further exploration and research in spatially aware AI.

[CV-80] AdaptDiff: Cross-Modality Domain Adaptation via Weak Conditional Semantic Diffusion for Retinal Vessel Segmentation

Link: https://arxiv.org/abs/2410.04648
Authors: Dewei Hu, Hao Li, Han Liu, Jiacheng Wang, Xing Yao, Daiwei Lu, Ipek Oguz
Keywords (EN): Deep learning, shown remarkable performance, shown remarkable, domain, learning has shown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning has shown remarkable performance in medical image segmentation. However, despite its promise, deep learning has many challenges in practice due to its inability to effectively transition to unseen domains, caused by the inherent data distribution shift and the lack of manual annotations to guide domain adaptation. To tackle this problem, we present an unsupervised domain adaptation (UDA) method named AdaptDiff that enables a retinal vessel segmentation network trained on fundus photography (FP) to produce satisfactory results on unseen modalities (e.g., OCT-A) without any manual labels. For all our target domains, we first adopt a segmentation model trained on the source domain to create pseudo-labels. With these pseudo-labels, we train a conditional semantic diffusion probabilistic model to represent the target domain distribution. Experimentally, we show that even with low quality pseudo-labels, the diffusion model can still capture the conditional semantic information. Subsequently, we sample on the target domain with binary vessel masks from the source domain to get paired data, i.e., target domain synthetic images conditioned on the binary vessel map. Finally, we fine-tune the pre-trained segmentation network using the synthetic paired data to mitigate the domain gap. We assess the effectiveness of AdaptDiff on seven publicly available datasets across three distinct modalities. Our results demonstrate a significant improvement in segmentation performance across all unseen datasets. Our code is publicly available at this https URL.

[CV-81] Mode-GS: Monocular Depth Guided Anchored 3D Gaussian Splatting for Robust Ground-View Scene Rendering

Link: https://arxiv.org/abs/2410.04646
Authors: Yonghan Lee, Jaehoon Choi, Dongki Jung, Jaeseong Yun, Soohyun Ryu, Dinesh Manocha, Suyong Yeon
Keywords (EN): novel-view rendering algorithm, Gaussian splatting algorithms, present a novel-view, Gaussian splats, Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:We present a novel-view rendering algorithm, Mode-GS, for ground-robot trajectory datasets. Our approach is based on using anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms. Prior neural rendering methods suffer from severe splat drift due to scene complexity and insufficient multi-view observation, and can fail to fix splats on the true geometry in ground-robot datasets. Our method integrates pixel-aligned anchors from monocular depths and generates Gaussian splats around these anchors using residual-form Gaussian decoders. To address the inherent scale ambiguity of monocular depth, we parameterize anchors with per-view depth-scales and employ scale-consistent depth loss for online scale calibration. Our method results in improved rendering performance, based on PSNR, SSIM, and LPIPS metrics, in ground scenes with free trajectory patterns, and achieves state-of-the-art rendering performance on the R3LIVE odometry dataset and the Tanks and Temples dataset.

[CV-82] Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models

Link: https://arxiv.org/abs/2410.04634
Authors: Salma Abdel Magid, Weiwei Pan, Simon Warchol, Grace Guo, Junsik Kim, Mahia Rahman, Hanspeter Pfister
Keywords (EN): impactful real-life applications, real-life applications, impactful real-life, models, cs.CV
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image (T2I) models are increasingly used in impactful real-life applications. As such, there is a growing need to audit these models to ensure that they generate desirable, task-appropriate images. However, systematically inspecting the associations between prompts and generated content in a human-understandable way remains challenging. To address this, we propose Concept2Concept, a framework where we characterize conditional distributions of vision language models using interpretable concepts and metrics that can be defined in terms of these concepts. This characterization allows us to use our framework to audit models and prompt-datasets. To demonstrate, we investigate several case studies of conditional distributions of prompts, such as user defined distributions or empirical, real world distributions. Lastly, we implement Concept2Concept as an open-source interactive visualization tool facilitating use by non-technical end-users. Warning: This paper contains discussions of harmful content, including CSAM and NSFW material, which may be disturbing to some readers.

[CV-83] Towards Unsupervised Blind Face Restoration using Diffusion Prior

Link: https://arxiv.org/abs/2410.04618
Authors: Tianshu Kuai, Sina Honari, Igor Gilitschenski, Alex Levinshtein
Keywords: shown remarkable performance, supervised learning, Blind face restoration, face restoration methods, methods have shown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Blind face restoration methods have shown remarkable performance, particularly when trained on large-scale synthetic datasets with supervised learning. These datasets are often generated by simulating low-quality face images with a handcrafted image degradation pipeline. The models trained on such synthetic degradations, however, cannot handle inputs with unseen degradations. In this paper, we address this issue by using only a set of input images, with unknown degradations and without ground truth targets, to fine-tune a restoration model that learns to map them to clean and contextually consistent outputs. We utilize a pre-trained diffusion model as a generative prior through which we generate high quality images from the natural image distribution while maintaining the input image content through consistency constraints. These generated images are then used as pseudo targets to fine-tune a pre-trained restoration model. Unlike many recent approaches that employ diffusion models at test time, we only do so during training and thus maintain an efficient inference-time performance. Extensive experiments show that the proposed approach can consistently improve the perceptual quality of pre-trained blind face restoration models while maintaining great consistency with the input contents. Our best model also achieves state-of-the-art results on both synthetic and real-world datasets.

[CV-84] VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models

Link: https://arxiv.org/abs/2410.04609
Authors: Harshit, Tolga Tasdizen
Keywords: natural language processing, powerful integrated Vision, deep learning led, language processing, natural language
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent developments in deep learning have led to the integration of natural language processing (NLP) with computer vision, resulting in powerful integrated Vision and Language Models (VLMs). Despite their remarkable capabilities, these models are frequently regarded as black boxes within the machine learning research community. This raises a critical question: which parts of an image correspond to specific segments of text, and how can we decipher these associations? Understanding these connections is essential for enhancing model transparency, interpretability, and trustworthiness. To answer this question, we present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments. We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model’s decision-making process. This approach aims to enhance model transparency, interpretability, and trustworthiness by providing insights into how these models align visual and linguistic information. We conducted a comprehensive study on text-guided visual saliency detection in these VL models. This study aims to understand how different models prioritize and focus on specific visual elements in response to corresponding text segments, providing deeper insights into their internal mechanisms and improving our ability to interpret their outputs.

[CV-85] Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion

Link: https://arxiv.org/abs/2410.04574
Authors: Mehwish Ghafoor, Arif Mahmood, Muhammad Bilal
Keywords: Human Pose Estimation, Pose Estimation, Human Pose, diverse occlusion types, occlusion types presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the field of 3D Human Pose Estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation, even in the presence of severe occlusions. Confronting the issue of occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views. Each intermediate-view undergoes spatial refinement through a self-refinement schema. Subsequently, these intermediate-views are fused to yield the final 3D human pose estimation. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets, our method’s performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements. The code is available here: this https URL.

[CV-86] Learning De-Biased Representations for Remote-Sensing Imagery

Link: https://arxiv.org/abs/2410.04546
Authors: Zichen Tian, Zhaozheng Chen, Qianru Sun
Keywords: requiring specialized satellites, Remote sensing, data scarcity, requiring specialized, difficult to annotate
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Remote sensing (RS) imagery, requiring specialized satellites to collect and being difficult to annotate, suffers from data scarcity and class imbalance in certain spectrums. Due to data scarcity, training any large-scale RS model from scratch is unrealistic, and the alternative is to transfer pre-trained models by fine-tuning or by a more data-efficient method, LoRA. Due to class imbalance, transferred models exhibit strong bias, where features of the major class dominate over those of the minor class. In this paper, we propose debLoRA, a generic training approach that works with any LoRA variant to yield debiased features. It is an unsupervised learning approach that can diversify minor class features based on the shared attributes with major classes, where the attributes are obtained by a simple step of clustering. To evaluate it, we conduct extensive experiments in two transfer learning scenarios in the RS domain: from natural to optical RS images, and from optical RS to multi-spectrum RS images. We perform object classification and oriented object detection tasks on the optical RS dataset DOTA and the SAR dataset FUSRS. Results show that our debLoRA consistently surpasses prior art across these RS adaptation settings, yielding gains of up to 3.3 and 4.7 percentage points on the tail classes for natural to optical RS and optical RS to multi-spectrum RS adaptations, respectively, while preserving the performance on head classes, substantiating its efficacy and adaptability.

[CV-87] UniMuMo: Unified Text, Music and Motion Generation

Link: https://arxiv.org/abs/2410.04534
Authors: Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan
Keywords: taking arbitrary text, multimodal model capable, capable of taking, taking arbitrary, input conditions
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Abstract:We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page: this https URL.

[CV-88] In-Place Panoptic Radiance Field Segmentation with Perceptual Prior for 3D Scene Understanding

Link: https://arxiv.org/abs/2410.04529
Authors: Shenghao Li
Keywords: panoptic understanding, scene representation, panoptic, virtual reality, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate 3D scene representation and panoptic understanding are essential for applications such as virtual reality, robotics, and autonomous driving. However, challenges persist with existing methods, including precise 2D-to-3D mapping, handling complex scene characteristics like boundary ambiguity and varying scales, and mitigating noise in panoptic pseudo-labels. This paper introduces a novel perceptual-prior-guided 3D scene representation and panoptic understanding method, which reformulates panoptic understanding within neural radiance fields as a linear assignment problem involving 2D semantics and instance recognition. Perceptual information from pre-trained 2D panoptic segmentation models is incorporated as prior guidance, thereby synchronizing the learning processes of appearance, geometry, and panoptic understanding within neural radiance fields. An implicit scene representation and understanding model is developed to enhance generalization across indoor and outdoor scenes by extending the scale-encoded cascaded grids within a reparameterized domain distillation framework. This model effectively manages complex scene attributes and generates 3D-consistent scene representations and panoptic understanding outcomes for various scenes. Experiments and ablation studies under challenging conditions, including synthetic and real-world scenes, demonstrate the proposed method’s effectiveness in enhancing 3D scene representation and panoptic segmentation accuracy.

[CV-89] Look Around and Find Out: OOD Detection with Relative Angles

Link: https://arxiv.org/abs/2410.04525
Authors: Berker Demirel, Marco Fumero, Francesco Locatello
Keywords: Deep learning systems, Deep learning, learning systems deployed, deployed in real-world, real-world applications
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning systems deployed in real-world applications often encounter data that differs from their in-distribution (ID) data. A reliable system should ideally abstain from making decisions in this out-of-distribution (OOD) setting. Existing state-of-the-art methods primarily focus on feature distances, such as k-th nearest neighbors and distances to decision boundaries, either overlooking or ineffectively using in-distribution statistics. In this work, we propose a novel angle-based metric for OOD detection that is computed relative to the in-distribution structure. We demonstrate that the angles between feature representations and decision boundaries, viewed from the mean of in-distribution features, serve as an effective discriminative factor between ID and OOD data. Our method achieves state-of-the-art performance on CIFAR-10 and ImageNet benchmarks, reducing FPR95 by 0.88% and 7.74%, respectively. Our score function is compatible with existing feature space regularization techniques, enhancing performance. Additionally, its scale-invariance property enables creating an ensemble of models for OOD detection via simple score summation.
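
The FPR95 figure quoted in this abstract is the false-positive rate at a 95% true-positive rate: the share of OOD samples still accepted as in-distribution when the threshold is set so that 95% of ID samples pass. A minimal Python sketch of that metric follows, assuming a higher score means "more in-distribution". The function name `fpr_at_95_tpr` and the toy scores are hypothetical, and this is not the authors' angle-based score or their evaluation code.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False-positive rate on OOD data when at least 95% of ID data is accepted.

    Assumption: a higher score means the sample looks more in-distribution.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that covers at least 95% of them
    # (integer arithmetic avoids floating-point rounding surprises).
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    # OOD samples at or above the threshold are wrongly accepted as ID.
    accepted = sum(1 for score in ood_scores if score >= threshold)
    return accepted / len(ood_scores)


# Toy example with made-up scores.
id_scores = list(range(1, 21))   # 20 in-distribution scores: 1..20
ood_scores = [0, 1, 2, 3]        # 4 out-of-distribution scores
fpr = fpr_at_95_tpr(id_scores, ood_scores)
print(fpr)
```

A lower value is better: it means fewer OOD inputs slip through at the operating point where almost all ID inputs are kept.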

[CV-90] MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration

Link: https://arxiv.org/abs/2410.04521
Authors: Lai Wei, Wenkai Wang, Xiaoyu Shen, Yu Xie, Zhihao Fan, Xiaojin Zhang, Zhongyu Wei, Wei Chen
Keywords: visual question answering, address medical visual, medical visual question, multimodal large language, large language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 14 figures, 6 tables

Abstract:In recent advancements, multimodal large language models (MLLMs) have been fine-tuned on specific medical image datasets to address medical visual question answering (Med-VQA) tasks. However, this common approach of task-specific fine-tuning is costly and necessitates separate models for each downstream task, limiting the exploration of zero-shot capabilities. In this paper, we introduce MC-CoT, a modular cross-modal collaboration Chain-of-Thought (CoT) framework designed to enhance the zero-shot performance of MLLMs in Med-VQA by leveraging large language models (LLMs). MC-CoT improves reasoning and information extraction by integrating medical knowledge and task-specific guidance, where LLM provides various complex medical reasoning chains and MLLM provides various observations of medical images based on instructions of the LLM. Our experiments on datasets such as SLAKE, VQA-RAD, and PATH-VQA show that MC-CoT surpasses standalone MLLMs and various multimodality CoT frameworks in recall rate and accuracy. These findings highlight the importance of incorporating background information and detailed guidance in addressing complex zero-shot Med-VQA tasks.

[CV-91] DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination EMNLP2024

Link: https://arxiv.org/abs/2410.04514
Authors: Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei
Keywords: Large Vision-Language Models, Large Language Model, Large Vision-Language, Vision-Language Models, Large Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP2024 (Main Conference)

Abstract:Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that Dives into the Attention Mechanism of LVLM to Reduce Object Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code of our method will be released soon.

[CV-92] Realizing Video Summarization from the Path of Language-based Semantic Understanding

Link: https://arxiv.org/abs/2410.04511
Authors: Kuan-Chen Mu, Zhi-Yi Chin, Wei-Chen Chiu
Keywords: Video-based Large Language, Large Language Models, Large Language, Video-based Large, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.

[CV-93] MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network ACCV2024

Link: https://arxiv.org/abs/2410.04507
Authors: Doanh C. Bui, Jin Tae Kwak
Keywords: slide image, clinics and hospitals, diagnostics in clinics, WSI, Expert Consultation Network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at ACCV2024

Abstract:Whole slide image (WSI) classification is a crucial problem for cancer diagnostics in clinics and hospitals. A WSI, acquired at gigapixel size, is commonly tiled into patches and processed by multiple-instance learning (MIL) models. Previous MIL-based models designed for this problem have only been evaluated on individual tasks for specific organs, and the ability to handle multiple tasks within a single model has not been investigated. In this study, we propose MECFormer, a generative Transformer-based model designed to handle multiple tasks within one model. To leverage the power of learning multiple tasks simultaneously and to enhance the model’s effectiveness in focusing on each individual task, we introduce an Expert Consultation Network, a projection layer placed at the beginning of the Transformer-based model. Additionally, to enable flexible classification, autoregressive decoding is incorporated by a language decoder for WSI classification. Through extensive experiments on five datasets involving four different organs, one cancer classification task, and four cancer subtyping tasks, MECFormer demonstrates superior performance compared to individual state-of-the-art multiple-instance learning models.

[CV-94] Generalizability analysis of deep learning predictions of human brain responses to augmented and semantically novel visual stimuli

Link: https://arxiv.org/abs/2410.04497
Authors: Valentyn Piskovskyi, Riccardo Chimisso, Sabrina Patania, Tom Foulsham, Giuseppe Vizzari, Dimitri Ognibene
Keywords: neural network-based approach, image enhancement techniques, investigate the soundness, soundness and utility, network-based approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:The purpose of this work is to investigate the soundness and utility of a neural network-based approach as a framework for exploring the impact of image enhancement techniques on visual cortex activation. In a preliminary study, we prepare a set of state-of-the-art brain encoding models, selected among the top 10 methods that participated in The Algonauts Project 2023 Challenge [16]. We analyze their ability to make valid predictions about the effects of various image enhancement techniques on neural responses. Given the impossibility of acquiring the actual data due to the high costs associated with brain imaging procedures, our investigation builds up on a series of experiments. Specifically, we analyze the ability of brain encoders to estimate the cerebral reaction to various augmentations by evaluating the response to augmentations targeting objects (i.e., faces and words) with known impact on specific areas. Moreover, we study the predicted activation in response to objects unseen during training, exploring the impact of semantically out-of-distribution stimuli. We provide relevant evidence for the generalization ability of the models forming the proposed framework, which appears to be promising for the identification of the optimal visual augmentation filter for a given task, model-driven design strategies as well as for AR and VR applications.

[CV-95] Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification NEURIPS2024

Link: https://arxiv.org/abs/2410.04492
Authors: Zhaorui Tan, Xi Yang, Qiufeng Wang, Anh Nguyen, Kaizhu Huang
Keywords: Vision models excel, Vision models, discovering novel categories, struggle to generalize, Vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS2024 as Spotlight

Abstract:Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. A logical regularization termed L-Reg is derived which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract the salient features, such as faces to persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.

[CV-96] Tensor-Train Point Cloud Compression and Efficient Approximate Nearest-Neighbor Search

Link: https://arxiv.org/abs/2410.04462
Authors: Georgii Novikov, Alexander Gneushev, Alexey Kadeishvili, Ivan Oseledets
Keywords: machine learning applications, large vector databases, learning applications, approximate nearest-neighbor searches, large vector
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Nearest-neighbor search in large vector databases is crucial for various machine learning applications. This paper introduces a novel method using tensor-train (TT) low-rank tensor decomposition to efficiently represent point clouds and enable fast approximate nearest-neighbor searches. We propose a probabilistic interpretation and utilize density estimation losses like Sliced Wasserstein to train TT decompositions, resulting in robust point cloud compression. We reveal an inherent hierarchical structure within TT point clouds, facilitating efficient approximate nearest-neighbor searches. In our paper, we provide detailed insights into the methodology and conduct comprehensive comparisons with existing methods. We demonstrate its effectiveness in various scenarios, including out-of-distribution (OOD) detection problems and approximate nearest-neighbor (ANN) search tasks.

[CV-97] Video Summarization Techniques: A Comprehensive Review

Link: https://arxiv.org/abs/2410.04449
Authors: Toqa Alaa, Ahmad Mongy, Assem Bakr, Mariam Diab, Walid Gomaa
Keywords: including social media, made video summarization, variety of industries, including social, social media
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The rapid expansion of video content across a variety of industries, including social media, education, entertainment, and surveillance, has made video summarization an essential field of study. The current work is a survey that explores the various approaches and methods created for video summarization, emphasizing both abstractive and extractive strategies. Extractive summarization identifies key frames or segments from the source video, utilizing methods such as shot boundary recognition and clustering. Abstractive summarization, on the other hand, creates new content by distilling the essential content of the video, using machine learning models like deep neural networks and natural language processing, reinforcement learning, attention mechanisms, generative adversarial networks, and multi-modal learning. We also include approaches that combine the two methodologies, and we discuss the uses and difficulties encountered in real-world implementations. The paper also covers the datasets used to benchmark these techniques. This review attempts to provide a thorough, up-to-date account of the current state and future directions of video summarization research.

[CV-98] Attention Shift: Steering AI Away from Unsafe Content

Link: https://arxiv.org/abs/2410.04447
Authors: Shivank Garg, Manyana Tiwari
Keywords: generative models, study investigates, investigates the generation, restricting such generations
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach using attention reweighting to remove unsafe concepts without additional training during inference. We compare our method against existing ablation methods, evaluating the performance on both direct and adversarial jailbreak prompts, using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.

[CV-99] Optimising for the Unknown: Domain Alignment for Cephalometric Landmark Detection MICCAI

Link: https://arxiv.org/abs/2410.04445
Authors: Julian Wyatt, Irina Voiculescu
Keywords: Cephalometric Landmark Detection, identifying key areas, Cephalometric Landmark, Landmark Detection, areas for cephalometry
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI CL-Detection2024: Cephalometric Landmark Detection in Lateral X-ray Images

Abstract:Cephalometric Landmark Detection is the process of identifying key areas for cephalometry. Each landmark is a single ground-truth (GT) point labelled by a clinician. A machine learning model predicts the probability locus of a landmark represented by a heatmap. This work, for the 2024 CL-Detection MICCAI Challenge, proposes a domain alignment strategy with a regional facial extraction module and an X-ray artefact augmentation procedure. On the online validation leaderboard, the challenge ranks our method’s results as the best in mean radial error (MRE), at 1.186 mm, and third in the 2 mm successful detection rate (SDR), at 82.04%. The code is available at this https URL.
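
The two leaderboard metrics named in this abstract can be made concrete with a short calculation. The Python sketch below computes the mean radial error (MRE) and the successful detection rate (SDR) within a radius, assuming landmark coordinates are already expressed in millimetres. The function name `mre_and_sdr` and the sample points are hypothetical, and the challenge's official scoring script may differ in details such as pixel-spacing conversion.

```python
import math


def mre_and_sdr(predicted, ground_truth, radius_mm=2.0):
    """Return (mean radial error in mm, % of landmarks within radius_mm).

    Assumption: both inputs are equal-length lists of (x, y) points in mm.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    # Radial error = Euclidean distance between prediction and ground truth.
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    mre = sum(errors) / len(errors)
    hits = sum(1 for e in errors if e <= radius_mm)
    sdr = 100.0 * hits / len(errors)
    return mre, sdr


# Made-up example: three landmarks with errors of 0 mm, 5 mm and 1 mm.
predicted = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
mre, sdr = mre_and_sdr(predicted, ground_truth)
print(f"MRE = {mre:.3f} mm, SDR@2mm = {sdr:.2f}%")
```

MRE rewards small average error, while SDR counts how many landmarks land within a clinically acceptable radius, so a method can rank differently on the two.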

[CV-100] Automated Detection of Defects on Metal Surfaces using Vision Transformers

Link: https://arxiv.org/abs/2410.04440
Authors: Toqa Alaa, Mostafa Kotb, Arwa Zakaria, Mariam Diab, Walid Gomaa
Keywords: defective products, production of defective, Vision Transformers, Metal manufacturing, operational challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Metal manufacturing often results in the production of defective products, leading to operational challenges. Since traditional manual inspection is time-consuming and resource-intensive, automatic solutions are needed. The study utilizes deep learning techniques to develop a model for detecting metal surface defects using Vision Transformers (ViTs). The proposed model focuses on the classification and localization of defects using a ViT for feature extraction. The architecture branches into two paths: classification and localization. The model must approach high classification accuracy while keeping the Mean Square Error (MSE) and Mean Absolute Error (MAE) as low as possible in the localization process. Experimental results show that it can be utilized for automated defect detection, improving operational efficiency and reducing errors in metal manufacturing.

[CV-101] Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Link: https://arxiv.org/abs/2410.04439
Authors: Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su
Keywords: demonstrated impressive achievements, backbone models, empower backbone models, demonstrated impressive, impressive achievements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.

[CV-102] A Mathematical Explanation of UNet

Link: https://arxiv.org/abs/2410.04434
Authors: Xue-Cheng Tai, Hao Liu, Raymond H. Chan, Lingfeng Li
Keywords: transformed image segmentation, UNet, image segmentation, transformed image, UNet architecture
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The UNet architecture has transformed image segmentation. UNet’s versatility and accuracy have driven its widespread adoption, significantly advancing fields reliant on machine learning problems with images. In this work, we give a clear and concise mathematical explanation of UNet. We explain the meaning and function of each of the components of UNet. We show that UNet is solving a control problem. We decompose the control variables using multigrid methods. Then, operator-splitting techniques are used to solve the problem, and the resulting algorithm exactly recovers the UNet architecture. Our result shows that UNet is a one-step operator-splitting algorithm for the control problem.

[CV-103] CAPEEN: Image Captioning with Early Exits and Knowledge Distillation EMNLP

Link: https://arxiv.org/abs/2410.04433
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Keywords: Deep neural networks, made significant progress, recognizing visual elements, generating descriptive text, Deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in EMNLP (finding) 2024

Abstract:Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning as it requires varying levels of semantic information for accurate predictions. To overcome this, we introduce CAPEEN to improve the performance of EE strategies using knowledge distillation. Inference in CAPEEN is completed at intermediary layers if prediction confidence exceeds a predefined value learned from the training data. To account for real-world deployments, where target distributions could drift from that of training samples, we introduce a variant A-CAPEEN to adapt the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and Flickr30k datasets show that CAPEEN achieves a speedup of 1.77x while maintaining competitive performance compared to the final layer, and A-CAPEEN additionally offers robustness against distortions. The source code is available at this https URL
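The early-exit rule described in this abstract (finish inference at an intermediate layer once prediction confidence exceeds a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `early_exit`, the use of the maximum softmax probability as the confidence score, and a single shared threshold are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(per_layer_logits, threshold):
    """Return (exit_layer, predicted_index).

    Assumption: confidence is the max softmax probability at each exit.
    The last layer always produces an answer, so it acts as the fallback.
    """
    if not per_layer_logits:
        raise ValueError("need logits from at least one exit")
    last = len(per_layer_logits) - 1
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence > threshold or layer == last:
            return layer, probs.index(confidence)


# Toy logits from three exits for one decoding step.
exits = [[0.1, 0.2, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 9.0]]
print(early_exit(exits, threshold=0.9))
```

A lower threshold exits earlier and saves more computation, at the risk of committing to a less reliable prediction; the adaptive variant in the abstract tunes this trade-off online, which the sketch does not attempt.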

[CV-104] CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection ACCV2024

Link: https://arxiv.org/abs/2410.04426
Authors: Devank, Jayateja Kalla, Soma Biswas
Keywords: misinformation detection, image is paired, incorrect caption, caption for creating, labeled data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ACCV 2024

Abstract:In this work, we address the real-world, challenging task of out-of-context misinformation detection, where a real image is paired with an incorrect caption to create fake news. Existing approaches for this task assume the availability of large amounts of labeled data, which is often impractical in the real world, since it requires extensive manual intervention and domain expertise. In contrast, since obtaining a large corpus of unlabeled image-text pairs is much easier, we propose a semi-supervised protocol, where the model has access to a limited number of labeled image-text pairs and a large corpus of unlabeled pairs. Additionally, because fake news occurs much less often than real news, the datasets tend to be highly imbalanced, making the task even more challenging. Towards this goal, we propose a novel framework, Consensus from Vision-Language Models (CoVLM), which generates robust pseudo-labels for unlabeled pairs using thresholds derived from the labeled data. This approach can automatically determine the right threshold parameters of the model for selecting the confident pseudo-labels. Experimental results on benchmark datasets across challenging conditions and comparisons with state-of-the-art approaches demonstrate the effectiveness of our framework.

[CV-105] Disentangling Regional Primitives for Image Generation

Link: https://arxiv.org/abs/2410.04421
Authors: Zhengting Chen, Lei Cheng, Lianghui Ding, Quanshi Zhang
Keywords: internal representation structure, feature component, neural network, image regions, paper presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper presents a method to explain the internal representation structure of a neural network for image generation. Specifically, our method disentangles primitive feature components from the intermediate-layer feature of the neural network, which ensures that each feature component is exclusively used to generate a specific set of image regions. In this way, the generation of the entire image can be considered as the superposition of different pre-encoded primitive regional patterns, each being generated by a feature component. We find that the feature component can be represented as an OR relationship between the demands for generating different image regions, which is encoded by the neural network. Therefore, we extend the Harsanyi interaction to represent such an OR interaction to disentangle the feature component. Experiments show a clear correspondence between each feature component and the generation of specific image regions.

[CV-106] LiteVLoc: Map-Lite Visual Localization for Image Goal Navigation

Link: https://arxiv.org/abs/2410.04419
Authors: Jianhao Jiao, Jinhao He, Changkun Liu, Sebastian Aegidius, Xiangcheng Hu, Tristan Braud, Dimitrios Kanoulas
Keywords: lightweight topo-metric map, paper presents LiteVLoc, hierarchical visual localization, visual localization framework, represent the environment
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:This paper presents LiteVLoc, a hierarchical visual localization framework that uses a lightweight topo-metric map to represent the environment. The method consists of three sequential modules that estimate camera poses in a coarse-to-fine manner. Unlike mainstream approaches relying on detailed 3D representations, LiteVLoc reduces storage overhead by leveraging learning-based feature matching and geometric solvers for metric pose estimation. A novel dataset for the map-free relocalization task is also introduced. Extensive experiments, including localization and navigation in both simulated and real-world scenarios, validate the system’s performance and demonstrate its precision and efficiency for large-scale deployment. Code and data will be made publicly available.

[CV-107] SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Link: https://arxiv.org/abs/2410.04417
Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
Keywords: sparser information density, information density compared, vision-language models, computational overhead, consume a significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. In contrast, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61% to 67% with a compression ratio of 78% while maintaining 93% of the accuracy. Our code is available at this https URL.
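The core idea of rating visual tokens by the attention they receive from selected text tokens, then keeping only the highest-rated ones, might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `prune_visual_tokens`, averaging attention over the rater rows, and a fixed `keep` budget. The rank-based per-layer ratio and the token recycling step mentioned in the abstract are not reproduced.

```python
def prune_visual_tokens(attn, rater_rows, keep):
    """Pick the visual tokens that selected text tokens attend to most.

    attn[i][j] is the attention weight from text token i to visual token j.
    rater_rows lists the text-token rows used as raters (hypothetical input).
    Returns the kept visual-token indices in their original order.
    """
    n_visual = len(attn[0])
    scores = [
        sum(attn[i][j] for i in rater_rows) / len(rater_rows)
        for j in range(n_visual)
    ]
    # Rank visual tokens by score, highest first, and keep the top `keep`.
    ranked = sorted(range(n_visual), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Three text tokens attending to three visual tokens (toy numbers).
attention = [
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
]
print(prune_visual_tokens(attention, rater_rows=[0, 1], keep=2))
```

Because the scores come from an attention matrix the model already computes, a scheme of this shape adds no trainable parameters, which matches the training-free claim in the abstract.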

[CV-108] Deformable NeRF using Recursively Subdivided Tetrahedra

Link: https://arxiv.org/abs/2410.04402
Authors: Zherui Qiu, Chenqu Ren, Kaiwen Song, Xiaoyi Zeng, Leyuan Yang, Juyong Zhang
Keywords: neural radiance fields, limits explicit control, implicit representation limits, representation limits explicit, radiance fields
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by ACM Multimedia 2024. Project Page: this https URL

Abstract:While neural radiance fields (NeRF) have shown promise in novel view synthesis, their implicit representation limits explicit control over object manipulation. Existing research has proposed the integration of explicit geometric proxies to enable deformation. However, these methods face two primary challenges: firstly, the time-consuming and computationally demanding tetrahedralization process; and secondly, handling complex or thin structures often leads to either excessive, storage-intensive tetrahedral meshes or poor-quality ones that impair deformation capabilities. To address these challenges, we propose DeformRF, a method that seamlessly integrates the manipulability of tetrahedral meshes with the high-quality rendering capabilities of feature grid representations. To avoid ill-shaped tetrahedra and tetrahedralization for each object, we propose a two-stage training strategy. Starting with an almost-regular tetrahedral grid, our model initially retains key tetrahedra surrounding the object and subsequently refines object details using finer-granularity mesh in the second stage. We also present the concept of recursively subdivided tetrahedra to create higher-resolution meshes implicitly. This enables multi-resolution encoding while only necessitating the storage of the coarse tetrahedral mesh generated in the first training stage. We conduct a comprehensive evaluation of our DeformRF on both synthetic and real-captured datasets. Both quantitative and qualitative results demonstrate the effectiveness of our method for novel view synthesis and deformation tasks. Project page: this https URL

[CV-109] DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion NEURIPS2024

Link: https://arxiv.org/abs/2410.04372
Authors: Ke Sun, Shen Chen, Taiping Yao, Hong Liu, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
Keywords: swapping highly realistic, fabricated facial content, made face swapping, face swapping highly, highly realistic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:The rapid progress of Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content. Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations. In this paper, we revisit the generation process and identify a universal principle: Deepfake images inherently contain information from both source and target identities, while genuine faces maintain a consistent identity. Building upon this insight, we introduce DiffusionFake, a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. DiffusionFake achieves this by injecting the features extracted by the detection model into a frozen pre-trained Stable Diffusion model, compelling it to reconstruct the corresponding target and source images. This guided reconstruction process constrains the detection network to capture the source and target related features to facilitate the reconstruction, thereby learning rich and disentangled representations that are more resilient to unseen forgeries. Extensive experiments demonstrate that DiffusionFake significantly improves cross-domain generalization of various detector architectures without introducing additional parameters during inference. Our code is available at this https URL.

[CV-110] VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Link: https://arxiv.org/abs/2410.04364
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
Keywords: visual content creation, revolutionized visual content, preserving temporal consistency, content creation, generation remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 14 figures, Project Page: this http URL

Abstract:Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model’s denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: this http URL
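The guidance step described in this abstract, where the guiding model's denoised sample is interpolated into the sampling model's denoising process during early inference steps, could be sketched as below. The linear blend, the weight `beta`, and the cutoff `guide_steps` are illustrative assumptions; the abstract does not specify the interpolation form or schedule.

```python
def blend_denoised(x0_sampler, x0_guide, step, guide_steps, beta):
    """Mix the guide's denoised estimate into the sampler's estimate.

    Assumption: a plain linear interpolation with weight `beta`, applied only
    while `step < guide_steps` (the early part of inference). After that the
    sampler's own estimate is returned unchanged.
    """
    if step >= guide_steps:
        return list(x0_sampler)
    return [(1.0 - beta) * s + beta * g for s, g in zip(x0_sampler, x0_guide)]


# Flattened toy "latents" standing in for denoised video frames.
print(blend_denoised([0.0, 2.0], [1.0, 4.0], step=0, guide_steps=5, beta=0.5))
print(blend_denoised([0.0, 2.0], [1.0, 4.0], step=5, guide_steps=5, beta=0.5))
```

Restricting the blend to early steps reflects the abstract's statement that guidance is used during the early stages of inference; later steps are left to the sampling model alone.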

[CV-111] StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Link: https://arxiv.org/abs/2410.04354
Authors: Xiao Cui, Weicai Ye, Yifan Wang, Guofeng Zhang, Wengang Zhou, Tong He, Houqiang Li
Keywords: Reconstructing urban street, Reconstructing urban, crucial due, vital role, role in applications
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reconstructing urban street scenes is crucial due to its vital role in applications such as autonomous driving and urban planning. These scenes are characterized by long and narrow camera trajectories, occlusion, complex object relationships, and data sparsity across multiple scales. Despite recent advancements, existing surface reconstruction methods, which are primarily designed for object-centric scenarios, struggle to adapt effectively to the unique characteristics of street scenes. To address this challenge, we introduce StreetSurfGS, the first method to employ Gaussian Splatting specifically tailored for scalable urban street scene surface reconstruction. StreetSurfGS utilizes a planar-based octree representation and segmented training to reduce memory costs, accommodate unique camera characteristics, and ensure scalability. Additionally, to mitigate depth inaccuracies caused by object overlap, we propose a guided smoothing strategy within regularization to eliminate inaccurate boundary points and outliers. Furthermore, to address sparse views and multi-scale challenges, we use a dual-step matching strategy that leverages adjacent and long-term information. Extensive experiments validate the efficacy of StreetSurfGS in both novel view synthesis and surface reconstruction.

[CV-112] MVP-Bench: Can Large Vision–Language Models Conduct Multi-level Visual Perception Like Humans?

Link: https://arxiv.org/abs/2410.04345
Authors: Guanzhen Li, Yuxi Xie, Min-Yen Kan
Keywords: including low-level object, multiple levels, low-level object recognition, perception, perform visual perception
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Visual-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perceptions. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. The state-of-the-art GPT-4o only achieves an accuracy of 56% on Yes/No questions, compared with 74% in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do. Our data and code are publicly available at this https URL.

[CV-113] Accelerating Inference of Networks in the Frequency Domain

Link: https://arxiv.org/abs/2410.04342
Authors: Chenqiu Zhao, Guanfang Dong, Anup Basu
Keywords: frequency, frequency inference chain, frequency domain, demonstrated that networks’, small decrease
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ACM Multimedia Asia 2024

Abstract:It has been demonstrated that networks’ parameters can be significantly reduced in the frequency domain with a very small decrease in accuracy. However, given the cost of frequency transforms, the computational complexity is not significantly decreased. In this work, we propose performing network inference in the frequency domain to speed up networks whose frequency parameters are sparse. In particular, we propose a frequency inference chain that is dual to the network inference in the spatial domain. In order to handle the non-linear layers, we make a compromise to apply non-linear operations on frequency data directly, which works effectively. Enabled by the frequency inference chain and the strategy for non-linear layers, the proposed approach completes the entire inference in the frequency domain. Unlike previous approaches which require extra frequency or inverse transforms for all layers, the proposed approach only needs the frequency transform and its inverse once at the beginning and once at the end of a network. Comparisons with state-of-the-art methods demonstrate that the proposed approach significantly improves accuracy in the case of a high speedup ratio (over 100x). The source code is available at this https URL.
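As background for why a linear layer has a frequency-domain dual at all, the convolution theorem states that circular convolution in the spatial domain equals a pointwise product in the DFT domain. The sketch below checks this on a tiny 1-D signal with a naive DFT. It illustrates only that standard identity; it is not the authors' frequency inference chain, and the helper names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv(x, h):
    """Direct circular convolution in the spatial domain."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


def conv_via_frequency(x, h):
    """Same result via one forward transform per input and one inverse."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0, 0.0]
print(circular_conv(signal, kernel))
print([round(v, 6) for v in conv_via_frequency(signal, kernel)])
```

If consecutive linear layers stay in the frequency domain, the transform and its inverse are needed only at the network's boundaries, which is the cost saving the abstract describes; the non-linear layers are the part that requires the compromise mentioned there.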

[CV-114] Test-Time Adaptation for Keypoint-Based Spacecraft Pose Estimation Based on Predicted-View Synthesis

Link: https://arxiv.org/abs/2410.04298
Authors: Juan Ignacio Bravo Pérez-Villar, Álvaro García-Martín, Jesús Bescós, Juan C. SanMiguel
Keywords: real operational data, real conditions, synthetic data, real operational, operational data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:Due to the difficulty of replicating the real conditions during training, supervised algorithms for spacecraft pose estimation experience a drop in performance when trained on synthetic data and applied to real operational data. To address this issue, we propose a test-time adaptation approach that leverages the temporal redundancy between images acquired during close proximity operations. Our approach involves extracting features from sequential spacecraft images, estimating their poses, and then using this information to synthesise a reconstructed view. We establish a self-supervised learning objective by comparing the synthesised view with the actual one. During training, we supervise both pose estimation and image synthesis, while at test-time, we optimise the self-supervised objective. Additionally, we introduce a regularisation loss to prevent solutions that are not consistent with the keypoint structure of the spacecraft. Our code is available at: this https URL.

[CV-115] Self-Supervised Anomaly Detection in the Wild: Favor Joint Embeddings Methods

Link: https://arxiv.org/abs/2410.04289
Authors: Daniel Otero, Rafael Mateus, Randall Balestriero
Keywords: prevent costly failures, Accurate anomaly detection, vision-based infrastructure inspection, Accurate anomaly, SSL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Accurate anomaly detection is critical in vision-based infrastructure inspection, where it helps prevent costly failures and enhances safety. Self-Supervised Learning (SSL) offers a promising approach by learning robust representations from unlabeled data. However, its application in anomaly detection remains underexplored. This paper addresses this gap by providing a comprehensive evaluation of SSL methods for real-world anomaly detection, focusing on sewer infrastructure. Using the Sewer-ML dataset, we evaluate lightweight models such as ViT-Tiny and ResNet-18 across SSL frameworks, including BYOL, Barlow Twins, SimCLR, DINO, and MAE, under varying class imbalance levels. Through 250 experiments, we rigorously assess the performance of these SSL methods to ensure a robust and comprehensive evaluation. Our findings highlight the superiority of joint-embedding methods like SimCLR and Barlow Twins over reconstruction-based approaches such as MAE, which struggle to maintain performance under class imbalance. Furthermore, we find that the SSL model choice is more critical than the backbone architecture. Additionally, we emphasize the need for better label-free assessments of SSL representations, as current methods like RankMe fail to adequately evaluate representation quality, making cross-validation without labels infeasible. Despite the remaining performance gap between SSL and supervised models, these findings highlight the potential of SSL to enhance anomaly detection, paving the way for further research in this underexplored area of SSL applications.

[CV-116] Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels

Link: https://arxiv.org/abs/2410.04256
Authors: Maria Marrium, Arif Mahmood, Mohammed Bennamoun
Keywords: Convolutional Neural Networks, deep neural networks, neural networks, Automatic annotation, introduce noisy training
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Automatic annotation of large-scale datasets can introduce noisy training data labels, which adversely affect the learning process of deep neural networks (DNNs). Consequently, Noisy Labels Learning (NLL) has become a critical research field for Convolutional Neural Networks (CNNs), though it remains less explored for Vision Transformers (ViTs). In this study, we evaluate the vulnerability of ViT fine-tuning to noisy labels and compare its robustness with CNNs. We also investigate whether NLL methods developed for CNNs are equally effective for ViTs. Using linear probing and MLP-K fine-tuning, we benchmark two ViT backbones (ViT-B/16 and ViT-L/16) using three commonly used classification losses: Cross Entropy (CE), Focal Loss (FL), and Mean Absolute Error (MAE), alongside six robust NLL methods: GCE, SCE, NLNL, APL, NCE+AGCE, and ANL-CE. The evaluation is conducted across six datasets including MNIST, CIFAR-10/100, WebVision, Clothing1M, and Food-101N. Furthermore, we explore whether implicit prediction entropy minimization contributes to ViT robustness against noisy labels, noting a general trend of prediction entropy reduction across most NLL methods. Building on this observation, we examine whether explicit entropy minimization could enhance ViT resilience to noisy labels. Our findings indicate that incorporating entropy regularization enhances the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods across both ViT backbones.
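Explicit entropy regularization of the kind examined in this abstract can be illustrated by adding the Shannon entropy of the predicted distribution to a standard cross-entropy loss. The sketch below assumes an additive form with a weight `lam`; both the form and the name are hypothetical, since the abstract does not give the exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_with_entropy(logits, label, lam):
    """Cross-entropy plus lam times the prediction entropy.

    Minimizing the entropy term pushes the model toward confident
    (low-entropy) predictions. `lam` is an assumed trade-off weight.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return cross_entropy + lam * entropy


# A uniform prediction is penalized more than a confident correct one.
print(ce_with_entropy([0.0, 0.0], label=0, lam=0.5))
print(ce_with_entropy([4.0, 0.0], label=0, lam=0.5))
```

Setting `lam` to zero recovers plain cross-entropy, so the regularizer can be attached to an existing loss without changing the rest of the training loop.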

[CV-117] Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2410.04224
Authors: Jianze Li, Jiezhang Cao, Zichen Zou, Xiongfei Su, Xin Yuan, Yulun Zhang, Yong Guo, Xiaokang Yang
Keywords: real-world image super-resolution, considerable computational costs, achieving excellent performance, image super-resolution, one-step diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models have been achieving excellent performance for real-world image super-resolution (Real-ISR), but at considerable computational cost. Current approaches try to derive one-step diffusion models from multi-step counterparts through knowledge distillation. However, these methods incur substantial training costs and may constrain the performance of the student model by the teacher’s limitations. To tackle these issues, we propose DFOSD, a Distillation-Free One-Step Diffusion model. Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training, further enhancing the authenticity of the generated content. Additionally, we improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model’s ability to generate fine details. Our experiments demonstrate that, compared with previous diffusion-based methods requiring dozens or even hundreds of steps, our DFOSD attains comparable or even superior results in both quantitative metrics and qualitative evaluations. Our DFOSD also obtains higher performance and efficiency compared with other one-step diffusion methods. We will release code and models at this https URL.

[CV-118] TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

Link: https://arxiv.org/abs/2410.04221
Authors: Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, Takafumi Taketomi
Keywords: generating co-speech body-gesture, co-speech body-gesture videos, present TANGO, generating co-speech, co-speech body-gesture
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

Abstract:We present TANGO, a framework for generating co-speech body-gesture videos. Given a few-minute, single-speaker reference video and target speech audio, TANGO produces high-fidelity videos with synchronized body gestures. TANGO builds on Gesture Video Reenactment (GVR), which splits and retrieves video clips using a directed graph structure, representing video frames as nodes and valid transitions as edges. We address two key limitations of GVR: audio-motion misalignment and visual artifacts in GAN-generated transition frames. In particular, (i) we propose retrieving gestures using latent feature distance to improve cross-modal alignment. To ensure the latent features can effectively model the relationship between speech audio and gesture motion, we implement a hierarchical joint embedding space (AuMoCLIP); (ii) we introduce a diffusion-based model to generate high-quality transition frames. Our diffusion model, Appearance Consistent Interpolation (ACInterp), is built upon AnimateAnyone and includes a reference motion module and homography background flow to preserve appearance consistency between generated and reference videos. By integrating these components into the graph-based retrieval framework, TANGO reliably produces realistic, audio-synchronized videos and outperforms all existing generative and retrieval methods. Our codes and pretrained models are available: this https URL

[CV-119] Exploring Strengths and Weaknesses of Super-Resolution Attack in Deepfake Detection ECCV2024

Link: https://arxiv.org/abs/2410.04205
Authors: Davide Alessandro Coccomini, Roberto Caldelli, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
Keywords: rapidly evolving, allowing the creation, bend reality, manipulation is rapidly, creation of credible
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Trust What You learN (TWYN) Workshop at European Conference on Computer Vision ECCV 2024

Abstract:Image manipulation is rapidly evolving, allowing the creation of credible content that can be used to bend reality. Although the results of deepfake detectors are promising, deepfakes can be made even more complicated to detect through adversarial attacks. They aim to further manipulate the image to camouflage deepfakes’ artifacts or to insert signals making the image appear pristine. In this paper, we further explore the potential of super-resolution attacks based on different super-resolution techniques and with different scales that can impact the performance of deepfake detectors with more or less intensity. We also evaluated the impact of the attack on more diverse datasets discovering that the super-resolution process is effective in hiding the artifacts introduced by deepfake generation models but fails in hiding the traces contained in fully synthetic images. Finally, we propose some changes to the detectors’ training process to improve their robustness to this kind of attack.

[CV-120] IT3: Idempotent Test-Time Training

Link: https://arxiv.org/abs/2410.04201
Authors: Nikita Durasov, Assaf Shocher, Doruk Oner, Gal Chechik, Alexei A. Efros, Pascal Fua
Keywords: paper introduces Idempotent, introduces Idempotent Test-Time, paper introduces, addressing the challenge, Idempotent Test-Time Training
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper introduces Idempotent Test-Time Training (IT$^3$), a novel approach to addressing the challenge of distribution shift. While supervised-learning methods assume matching train and test distributions, this is rarely the case for machine learning systems deployed in the real world. Test-Time Training (TTT) approaches address this by adapting models during inference, but they are limited by a domain specific auxiliary task. IT$^3$ is based on the universal property of idempotence. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, that is $f(f(x))=f(x)$. At training, the model receives an input $x$ along with another signal that can either be the ground truth label $y$ or a neutral “don’t know” signal $0$. At test time, the additional signal can only be $0$. When sequentially applying the model, first predicting $y_0 = f(x, 0)$ and then $y_1 = f(x, y_0)$, the distance between $y_0$ and $y_1$ measures certainty and indicates out-of-distribution input $x$ if high. We use this distance, that can be expressed as $\|f(x, f(x, 0)) - f(x, 0)\|$, as our TTT loss during inference. By carefully optimizing this objective, we effectively train $f(x,\cdot)$ to be idempotent, projecting the internal representation of the input onto the training distribution. We demonstrate the versatility of our approach across various tasks, including corrupted image classification, aerodynamic predictions, tabular data with missing information, age prediction from face, and large-scale aerial photo segmentation. Moreover, these tasks span different architectures such as MLPs, CNNs, and GNNs.
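The test-time objective stated in this abstract, the distance between f(x, 0) and f(x, f(x, 0)), can be computed as in the following sketch. The two toy models and the Euclidean norm are assumptions chosen for illustration; in practice f would be a neural network and the loss would be minimized with gradient steps at inference time, which is not shown here.

```python
import math


def idempotence_loss(f, x):
    """Return ||f(x, f(x, 0)) - f(x, 0)|| for a two-argument model f.

    The neutral "don't know" signal is taken to be a zero vector with the
    same length as x (an assumption that suits the toy models below).
    """
    neutral = [0.0] * len(x)
    y0 = f(x, neutral)   # first prediction, made without a label hint
    y1 = f(x, y0)        # second prediction, conditioned on the first
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y0)))


def drifting_model(x, y):
    """Toy model whose output shifts when fed its own prediction."""
    return [2.0 * xi + 0.5 * yi for xi, yi in zip(x, y)]


def idempotent_model(x, y):
    """Toy model that ignores the hint, so reapplying it changes nothing."""
    return [2.0 * xi for xi in x]


print(idempotence_loss(drifting_model, [3.0, 4.0]))
print(idempotence_loss(idempotent_model, [3.0, 4.0]))
```

A large value signals that the prediction is not a fixed point of the model, which the abstract interprets as low certainty or an out-of-distribution input.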

[CV-121] Accelerating Diffusion Models with One-to-Many Knowledge Distillation

Link: https://arxiv.org/abs/2410.04191
Authors: Linfeng Zhang, Kaisheng Ma
Keywords: diffusion models, diffusion, advancements in image, models, image generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Significant advancements in image generation have been made with diffusion models. Nevertheless, compared with previous generative models, diffusion models face substantial computational overhead, which prevents real-time generation. Recent approaches have aimed to accelerate diffusion models by reducing the number of sampling steps through improved sampling techniques or step distillation. However, methods that reduce the computational cost of each timestep remain relatively unexplored. Observing that diffusion models exhibit varying input distributions and feature distributions at different timesteps, we introduce one-to-many knowledge distillation (O2MKD), which distills a single teacher diffusion model into multiple student diffusion models, where each student diffusion model is trained to learn the teacher’s knowledge for a subset of continuous timesteps. Experiments on CIFAR10, LSUN Church, CelebA-HQ with DDPM and COCO30K with Stable Diffusion show that O2MKD can be applied to previous knowledge distillation and fast sampling methods to achieve significant acceleration. Code will be released on GitHub.
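Since each student in O2MKD covers a contiguous subset of timesteps, sampling needs a rule that maps a timestep to the student responsible for it. A minimal sketch of such a routing rule follows. The equal-sized split and the function name `student_for_timestep` are assumptions; the abstract does not say how the timestep range is partitioned.

```python
def student_for_timestep(t, num_timesteps, num_students):
    """Map timestep t to the index of the student that handles it.

    Assumption: timesteps 0..num_timesteps-1 are split into contiguous,
    near-equal chunks, one chunk per student.
    """
    if not 0 <= t < num_timesteps:
        raise ValueError("timestep out of range")
    chunk = -(-num_timesteps // num_students)  # ceiling division
    return t // chunk


# With 1000 timesteps and 4 students, each student covers 250 steps.
for t in (0, 249, 250, 999):
    print(t, student_for_timestep(t, num_timesteps=1000, num_students=4))
```

Only one student runs at any given timestep, so the per-step cost is that of a single small model even though several students exist in total.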

[CV-122] Unsupervised Assessment of Landscape Shifts Based on Persistent Entropy and Topological Preservation KDD’2024

Link: https://arxiv.org/abs/2410.04183
Authors: Sebastian Basterrech
Keywords: drift typically refers, Concept drift typically, Concept drift, typically refers, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: KDD’2024. Workshop on Drift Detection and Landscape Shifts

Abstract:Concept drift typically refers to the analysis of changes in data distribution. A drift in the input data can have negative consequences on a learning predictor and the system’s stability. The majority of concept drift methods emphasize the analysis of statistical changes in non-stationary data over time. In this context, we consider another perspective, where the concept drift also integrates substantial changes in the topological characteristics of the data stream. In this article, we introduce a novel framework for monitoring changes in multi-dimensional data streams. We explore a generalization of the standard concept drift focusing on the changes in the topological characteristics of the data. Our developed approach is based on persistent entropy and topology-preserving projections in a continual learning scenario. The framework operates in both unsupervised and supervised environments. To demonstrate the utility of the proposed framework, we analyze the model across three scenarios using data streams generated with MNIST samples. The obtained results reveal the potential of applying topological data analysis for shift detection and encourage further research in this area.
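
For readers unfamiliar with persistent entropy, the sketch below uses the commonly cited definition: the Shannon entropy of the normalized bar lengths in a persistence barcode. The abstract does not spell out the formula, so this definition and the function name `persistent_entropy` are assumptions, and computing the barcode itself is out of scope here.

```python
import math


def persistent_entropy(intervals):
    """Shannon entropy of normalized bar lengths for (birth, death) intervals."""
    lengths = [death - birth for birth, death in intervals if death > birth]
    total = sum(lengths)
    if total == 0:
        return 0.0  # no bars with positive length
    return -sum((l / total) * math.log(l / total) for l in lengths)


# Two bars of equal length give the maximum entropy for two bars, log(2).
print(persistent_entropy([(0.0, 1.0), (0.0, 1.0)]))
# One dominant bar and one short bar give a lower value.
print(persistent_entropy([(0.0, 9.0), (0.0, 1.0)]))
```

Tracking how this value changes between windows of a data stream is one plausible way such a quantity could serve as a shift signal.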

[CV-123] Artistic Portrait Drawing with Vector Strokes

Link: https://arxiv.org/abs/2410.04182
Authors: Yiqi Liang, Ying Liu, Dandan Long, Ruihui Li
Keywords: human face image, Crop-based Shadow Loss, portrait sketch, human face, face image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 12 figures

Abstract:In this paper, we present a method, VectorPD, for converting a given human face image into a vector portrait sketch. VectorPD supports different levels of abstraction by simply controlling the number of strokes. Since vector graphics are composed of different shape primitives, it is challenging to render complex faces in a way that accurately expresses facial details and structure. To address this, VectorPD employs a novel two-round optimization mechanism. We first initialize the strokes with facial keypoints and generate a basic portrait sketch using a CLIP-based Semantic Loss. Then we complete the face structure through a VGG-based Structure Loss, and propose a novel Crop-based Shadow Loss to enrich the shadow details of the sketch, achieving a visually pleasing portrait sketch. Quantitative and qualitative evaluations both demonstrate that the portrait sketches generated by VectorPD produce better visual effects than existing state-of-the-art methods, maintaining as much fidelity as possible at different levels of abstraction.

[CV-124] Fast Object Detection with a Machine Learning Edge Device

Link: https://arxiv.org/abs/2410.04173
Authors: Richard C. Rodriguez, Jonah Elijah P. Bardos
Keywords: lowcost edge device, edge device integrated, learning study investigates, inferencing time, detection and classification
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This machine learning study investigates a low-cost edge device integrated with an embedded system for computer vision, which improves the inference time and precision of object detection and classification. A primary aim of this study was to reduce inference time and achieve low power consumption, so that an embedded device on a competition-ready autonomous humanoid robot can support real-time object recognition, scene understanding, visual navigation, motion planning, and autonomous navigation. This study compares inference time across a central processing unit (CPU), a graphics processing unit (GPU), and a tensor processing unit (TPU). CPUs, GPUs, and TPUs are all processors that can be used for machine learning tasks. Related to the aim of supporting an autonomous humanoid robot, there was an additional effort to observe whether there was a significant difference between using a camera with monocular vision and one with stereo vision capability. The TPU inference time results in this study reflect a 25% reduction in time relative to the GPU and an 87.5% reduction relative to the CPU. Much of the information in this paper contributed to the final selection of Google’s Coral brand Edge TPU device. The Arduino Nano 33 BLE Sense Tiny ML Kit was also considered for comparison, but due to initial incompatibilities and in the interest of time, the decision was made to review the kit in a future experiment.
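
The percentage reductions quoted above can be converted into speedup factors with a line of arithmetic. The helper name `speedup_from_reduction` is hypothetical, and the calculation assumes both percentages refer to the same workload.

```python
def speedup_from_reduction(reduction_percent):
    """Convert 'X% less time' into 'N times faster' (old time / new time)."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


print(speedup_from_reduction(87.5))  # 8.0  (TPU vs. CPU)
print(speedup_from_reduction(25.0))  # roughly 1.33 (TPU vs. GPU)
```

Under that same-workload assumption, the two figures together would imply that the GPU runs about 6 times faster than the CPU (8 divided by roughly 1.33).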

[CV-125] IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

Link: https://arxiv.org/abs/2410.04171
Authors: Shitong Shao, Zikai Zhou, Lichen Bai, Haoyi Xiond, Zeke Xie
Keywords: multi-step sampling mechanism, inference computational cost, OpenAI Strawberry, Strawberry in enhancing, visual diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI’s Strawberry in enhancing performance by increasing the inference computational cost. Many prior studies have demonstrated that correctly scaling up computation in the sampling process can lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on 4 benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 achieved by the closed-source Pika-2.0.

[CV-126] Overcoming False Illusions in Real-World Face Restoration with Multi-Modal Guided Diffusion Model

Link: https://arxiv.org/abs/2410.04161
Authors: Keda Tao, Jinjin Gu, Yulun Zhang, Xiucheng Wang, Nan Cheng
Keywords: Guided Real-World Face, Multi-modal Guided Real-World, Guided Real-World, Multi-modal Guided, Real-World Face Restoration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 Pages, 28 Figures

Abstract:We introduce a novel Multi-modal Guided Real-World Face Restoration (MGFR) technique designed to improve the quality of facial image restoration from low-quality inputs. Leveraging a blend of attribute text prompts, high-quality reference images, and identity information, MGFR can mitigate the generation of false facial attributes and identities often associated with generative face restoration methods. By incorporating a dual-control adapter and a two-stage training strategy, our method effectively utilizes multi-modal prior information for targeted restoration tasks. We also present the Reface-HQ dataset, comprising over 23,000 high-resolution facial images across 5,000 identities, to address the need for reference face training images. Our approach achieves superior visual quality in restoring facial details under severe degradation and allows for controlled restoration processes, enhancing the accuracy of identity preservation and attribute correction. Including negative quality samples and attribute prompts in the training further refines the model’s ability to generate detailed and perceptually accurate images.

[CV-127] Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

Link: https://arxiv.org/abs/2410.04140
Authors: Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen
Keywords: Knowledge distillation aims, transfer knowledge, compact student counterpart, significant performance gap, performance gap
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages for the main paper

Abstract:Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, often coming with a significant performance gap between them. We find that too large a performance gap can hamper the training process, which is also verified in recent studies. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with training the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student during the whole distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy by forcing them to share parameters and encouraging parameter inheritance. Besides the hard strategy, we also build soft bidirectional mappings between them, based on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that our IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio, while preserving exactly the same accuracy as the given student model. In this way, it guarantees that the dynamic teacher and student start from the same point and avoid too large a gap in the early stage of training. As for our CBR, with parameter sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on both CNN and transformer architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively.

[CV-128] TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

Link: https://arxiv.org/abs/2410.04107
Authors: Xingwei He, Qianru Zhang, A-Long Jin, Yuan Yuan, Siu-Ming Yiu
Keywords: Large Vision-Language Models, achieved remarkable progress, Large Vision-Language, linguistic interpretation, unanswerable questions
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress on visual perception and linguistic interpretation. Despite their impressive capabilities across various tasks, LVLMs still suffer from the issue of hallucination, which involves generating content that is incorrect or unfaithful to the visual or textual inputs. Traditional benchmarks, such as MME and POPE, evaluate hallucination in LVLMs within the scope of Visual Question Answering (VQA) using answerable questions. However, some questions are unanswerable due to insufficient information in the images, and the performance of LVLMs on such unanswerable questions remains underexplored. To bridge this research gap, we propose TUBench, a benchmark specifically designed to evaluate the reliability of LVLMs using unanswerable questions. TUBench comprises an extensive collection of high-quality, unanswerable questions that are meticulously crafted using ten distinct strategies. To thoroughly evaluate LVLMs, the unanswerable questions in TUBench are based on images from four diverse domains as visual contexts: screenshots of code snippets, natural images, geometry diagrams, and screenshots of statistical tables. These unanswerable questions are tailored to test LVLMs’ trustworthiness in code reasoning, commonsense reasoning, geometric reasoning, and mathematical reasoning related to tables, respectively. We conducted a comprehensive quantitative evaluation of 28 leading foundational models on TUBench, with Gemini-1.5-Pro, the top-performing model, achieving an average accuracy of 69.2%, and GPT-4o, the third-ranked model, reaching 66.7% average accuracy, in determining whether questions are answerable. TUBench is available at this https URL.

[CV-129] High-Speed Stereo Visual SLAM for Low-Powered Computing Devices

Link: https://arxiv.org/abs/2410.04090
Authors: Ashish Kumar, Jaesik Park, Laxmidhar Behera
Keywords: Visual SLAM design, Stereo Visual SLAM, GPU-accelerated Stereo Visual, SLAM design called, Stereo Visual
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present an accurate and GPU-accelerated Stereo Visual SLAM design called Jetson-SLAM. It exhibits frame-processing rates above 60FPS on NVIDIA’s low-powered 10W Jetson-NX embedded computer and above 200FPS on desktop-grade 200W GPUs, even in stereo configuration and in the multiscale setting. Our contributions are threefold: (i) A Bounded Rectification technique to prevent tagging many non-corner points as corners in FAST detection, improving SLAM accuracy. (ii) A novel Pyramidal Culling and Aggregation (PyCA) technique that yields robust features while suppressing redundant ones at high speeds by harnessing a GPU device. PyCA uses our new Multi-Location Per Thread culling strategy (MLPT) and Thread-Efficient Warp-Allocation (TEWA) scheme for GPU to enable Jetson-SLAM to achieve high accuracy and speed on embedded devices. (iii) The Jetson-SLAM library achieves resource efficiency through a data-sharing mechanism. Our experiments on three challenging datasets (KITTI, EuRoC, and KAIST-VIO) and two highly accurate SLAM backends (Full-BA and ICE-BA) show that Jetson-SLAM is the fastest available accurate and GPU-accelerated SLAM system (Fig. 1).

[CV-130] Designing Concise ConvNets with Columnar Stages

Link: https://arxiv.org/abs/2410.04089
Authors: Ashish Kumar, Jaesik Park
Keywords: convolutional neural networks, concise convolutional neural, Columnar Stage Network, era of vision, recent success
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the era of vision Transformers, the recent success of VanillaNet shows the huge potential of simple and concise convolutional neural networks (ConvNets). While such models mainly focus on runtime, it is also crucial to simultaneously focus on other aspects, e.g., FLOPs, parameters, etc., to strengthen their utility further. To this end, we introduce a refreshing ConvNet macro design called Columnar Stage Network (CoSNet). CoSNet has a systematically developed simple and concise structure, smaller depth, low parameter count, low FLOPs, and attention-less operations, well suited for resource-constrained deployment. The key novelty of CoSNet is deploying parallel convolutions with fewer kernels fed by input replication, using columnar stacking of these convolutions, and minimizing the use of 1x1 convolution layers. Our comprehensive evaluations show that CoSNet rivals many renowned ConvNets and Transformer designs under resource-constrained scenarios. Code: this https URL

[CV-131] Cross Resolution Encoding-Decoding For Detection Transformers

Link: https://arxiv.org/abs/2410.04088
Authors: Ashish Kumar, Jaesik Park
Keywords: object detection pipelines, renowned object detection, Detection Transformers, computationally efficient multiscale, DETR
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules: the Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of low-resolution encoder output to a high-resolution feature. OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically, state-of-the-art DN-DETR, when used with CRED (which we call CRED-DETR), becomes 76% faster, with ~50% fewer FLOPs than its high-resolution counterpart, which requires 202 GFLOPs on the MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: this https URL

[CV-132] Taming the Tail: Leveraging Asymmetric Loss and Pade Approximation to Overcome Medical Image Long-Tailed Class Imbalance BMVC24

Link: https://arxiv.org/abs/2410.04084
Authors: Pankhi Kashyap, Pavni Tandon, Sunny Gupta, Abhishek Tiwari, Ritwik Kulkarni, Kshitij Sharad Jadhav
Keywords: dependable classification methods, data imbalance due, warranting the requirement, problems in healthcare, healthcare emerge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure. Accepted in The 35th British Machine Vision Conference (BMVC24)

Abstract:Long-tailed problems in healthcare emerge from data imbalance due to variability in the prevalence and representation of different medical conditions, which calls for precise and dependable classification methods. Traditional loss functions such as cross-entropy and binary cross-entropy are often inadequate because they cannot address the imbalance between highly represented and poorly represented classes found in medical image datasets. We introduce a novel polynomial loss function based on Padé approximation, designed specifically to overcome the challenges associated with long-tailed classification. This approach incorporates asymmetric sampling techniques to better classify under-represented classes. We conducted extensive evaluations on three publicly available medical datasets and a proprietary medical dataset. Our implementation of the proposed loss function is open-sourced in the public repository: this https URL.

[CV-133] $\epsilon$-VAE: Denoising as Visual Decoding

Link: https://arxiv.org/abs/2410.04081
Authors: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu
Keywords: simplifies complex data, tokenization simplifies complex, learnable space, generative modeling, simplifies complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction quality (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.

[CV-134] Multi-Round Region-Based Optimization for Scene Sketching

Link: https://arxiv.org/abs/2410.04072
Authors: Yiqi Liang, Ying Liu, Dandan Long, Ruihui Li
Keywords: abstract representation, representation that captures, captures the essential, Scene, Scene sketching
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Scene sketching converts a scene into a simplified, abstract representation that captures the essential elements and composition of the original scene. It requires semantic understanding of the scene and consideration of different regions within the scene. Since scenes often contain diverse visual information across various regions, such as foreground objects, background elements, and spatial divisions, dealing with these different regions poses unique difficulties. In this paper, we define a sketch as sets of Bézier curves. We optimize the different regions of the input scene in multiple rounds. In each round of optimization, strokes sampled from the next region can be seamlessly integrated into the sketch generated in the previous round of optimization. We propose an additional stroke initialization method to ensure the integrity of the scene and the convergence of optimization. A novel CLIP-Based Semantic loss and a VGG-Based Feature loss are utilized to guide our multi-round optimization. Extensive experimental results on the quality and quantity of the generated sketches confirm the effectiveness of our method.
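
Since the sketch is represented with Bézier curves, a small example of evaluating one stroke may help. The cubic degree and the function name `cubic_bezier` are assumptions for illustration; the abstract does not state the curve degree or how strokes are rasterized.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bezier curve with control points p0..p3 at t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    return tuple(sum(w * p[axis] for w, p in zip(weights, points)) for axis in range(2))


# A single arch-shaped stroke sampled at its start, midpoint, and end.
stroke = ((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))
print(cubic_bezier(*stroke, 0.0))  # (0.0, 0.0)
print(cubic_bezier(*stroke, 0.5))  # (0.5, 0.75)
print(cubic_bezier(*stroke, 1.0))  # (1.0, 0.0)
```

In an optimization-based sketching method, the control points would be the parameters being optimized; that step is not shown here.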

[CV-135] RetCompletion: High-Speed Inference Image Completion with Retentive Network

Link: https://arxiv.org/abs/2410.04056
Authors: Yueyang Cang, Pingge Hu, Xiaoteng Zhang, Xingtong Wang, Yuhang Liu
Keywords: Retentive Network, major challenge, pluralistic image completion, high-quality pluralistic image, achieving high-quality pluralistic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Time cost is a major challenge in achieving high-quality pluralistic image completion. Recently, the Retentive Network (RetNet) in natural language processing offers a novel approach to this problem with its low-cost inference capabilities. Inspired by this, we apply RetNet to the pluralistic image completion task in computer vision. We present RetCompletion, a two-stage framework. In the first stage, we introduce Bi-RetNet, a bidirectional sequence information fusion model that integrates contextual information from images. During inference, we employ a unidirectional pixel-wise update strategy to restore consistent image structures, achieving both high reconstruction quality and fast inference speed. In the second stage, we use a CNN for low-resolution upsampling to enhance texture details. Experiments on ImageNet and CelebA-HQ demonstrate that our inference speed is $10\times$ faster than ICT and $15\times$ faster than RePaint. The proposed RetCompletion significantly improves inference speed and delivers strong performance, especially when masks cover large areas of the image.

[CV-136] Beyond Imperfections: A Conditional Inpainting Approach for End-to-End Artifact Removal in VTON and Pose Transfer

Link: https://arxiv.org/abs/2410.04052
Authors: Aref Tabatabaei, Zahra Dehghanian, Maryam Amirmazlaghani
Keywords: impacting user experience, pose transfer applications, virtual try-on, impacting user, user experience
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Artifacts often degrade the visual quality of virtual try-on (VTON) and pose transfer applications, impacting user experience. This study introduces a novel conditional inpainting technique designed to detect and remove such distortions, improving image aesthetics. Our work is the first to present an end-to-end framework addressing this specific issue, and we developed a specialized dataset of artifacts in VTON and pose transfer tasks, complete with masks highlighting the affected areas. Experimental results show that our method not only effectively removes artifacts but also significantly enhances the visual quality of the final images, setting a new benchmark in computer vision and image processing.

[CV-137] Lane Detection System for Driver Assistance in Vehicles

Link: https://arxiv.org/abs/2410.04046
Authors: Kauan Divino Pouso Mariano, Fernanda de Castro Fernandes, Luan Gabriel Silva Oliveira, Lyan Eduardo Sakuno Rodrigues, Matheus Andrade Brandão
Keywords: detection system aimed, work presents, presents the development, aimed at assisting, assisting the driving
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This work presents the development of a lane detection system aimed at assisting the driving of conventional and autonomous vehicles. The system was implemented using traditional computer vision techniques, focusing on robustness and efficiency to operate in real-time, even under adverse conditions such as worn-out lanes and weather variations. The methodology employs an image processing pipeline that includes camera calibration, distortion correction, perspective transformation, and binary image generation. Lane detection is performed using sliding window techniques and segmentation based on gradients and color channels, enabling the precise identification of lanes in various road scenarios. The results indicate that the system can effectively detect and track lanes, performing well under different lighting conditions and road surfaces. However, challenges were identified in extreme situations, such as intense shadows and sharp curves. It is concluded that, despite its limitations, the traditional computer vision approach shows significant potential for application in driver assistance systems and autonomous navigation, with room for future improvements.
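
A sliding-window lane search needs a starting column for each lane line. One common choice, assumed here and not stated in the abstract, is to take the peaks of a column histogram over the lower half of the binary image. The function name `lane_base_columns` and the tiny test image are hypothetical.

```python
def lane_base_columns(binary):
    """Return (left, right) starting columns for a sliding-window lane search.

    `binary` is a list of rows of 0/1 pixels (a thresholded bird's-eye view).
    """
    height, width = len(binary), len(binary[0])
    bottom_half = binary[height // 2:]
    # Count lit pixels per column in the lower half of the image.
    histogram = [sum(row[col] for row in bottom_half) for col in range(width)]
    middle = width // 2
    left = max(range(middle), key=histogram.__getitem__)
    right = max(range(middle, width), key=histogram.__getitem__)
    return left, right


image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
print(lane_base_columns(image))  # (1, 6)
```

From these base columns, windows would then be stacked upward and re-centered on the lit pixels they contain.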

[CV-138] Gamified crowd-sourcing of high-quality data for visual fine-tuning

Link: https://arxiv.org/abs/2410.04038
Authors: Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan
Keywords: Gamified Adversarial Prompting, introduces Gamified Adversarial, Adversarial Prompting, visual instruction tuning, paper introduces Gamified
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model’s knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model’s knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B, increasing its GPT score from 0.147 to 0.477 on our dataset, approaching the benchmark set by the much larger GPT-4V. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B also enhances its performance across other benchmarks, and exhibits cross-model benefits. Specifically, the same data improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on the same multiple benchmarks.

[CV-139] ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training

Link: https://arxiv.org/abs/2410.04032
Authors: Weihuang Liu, Xi Shen, Chi-Man Pun, Xiaodong Cun
Keywords: Social media, realistic fake images, making it hard, trust content, media is increasingly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during the training-time training using a large synthetic dataset. Precisely, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used for the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
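
The step of dividing tokens into manipulated and genuine groups from a predicted mask can be sketched as follows. The 0.5 threshold, the per-token mask score, and the name `split_tokens_by_mask` are assumptions for illustration; the abstract does not describe how the mask is aligned with tokens.

```python
def split_tokens_by_mask(tokens, mask_scores, threshold=0.5):
    """Split tokens into (manipulated, genuine) groups using per-token mask scores."""
    if len(tokens) != len(mask_scores):
        raise ValueError("tokens and mask_scores must have the same length")
    manipulated = [tok for tok, score in zip(tokens, mask_scores) if score >= threshold]
    genuine = [tok for tok, score in zip(tokens, mask_scores) if score < threshold]
    return manipulated, genuine


# Placeholder token ids with hypothetical mask scores.
tokens = ["t0", "t1", "t2", "t3"]
scores = [0.9, 0.1, 0.6, 0.2]
print(split_tokens_by_mask(tokens, scores))  # (['t0', 't2'], ['t1', 't3'])
```

Each group would then be passed to the classification head, as the abstract describes.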

[CV-140] JAM: A Comprehensive Model for Age Estimation Verification and Comparability

Link: https://arxiv.org/abs/2410.04012
Authors: François David, Alexey A. Novikov, Ruslan Parkhomenko, Artem Voronin, Alix Melchy
Keywords: offering a comprehensive, introduces a comprehensive, comprehensive solution, paper introduces, age estimation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper introduces a comprehensive model for age estimation, verification, and comparability, offering a comprehensive solution for a wide range of applications. It employs advanced learning techniques to understand age distribution and uses confidence scores to create probabilistic age ranges, enhancing its ability to handle ambiguous cases. The model has been tested on both proprietary and public datasets and compared against one of the top-performing models in the field. Additionally, it has recently been evaluated by NIST as part of the FATE challenge, achieving top places in many categories.

[CV-141] Impact of Regularization on Calibration and Robustness: from the Representation Space Perspective

Link: https://arxiv.org/abs/2410.03999
Authors: Jonghyun Park, Juyeop Kim, Jong-Seok Lee
Keywords: enhance image classification, image classification accuracy, Recent studies, improve model calibration, adversarial attacks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent studies have shown that regularization techniques using soft labels, e.g., label smoothing, Mixup, and CutMix, not only enhance image classification accuracy but also improve model calibration and robustness against adversarial attacks. However, the underlying mechanisms of such improvements remain underexplored. In this paper, we offer a novel explanation from the perspective of the representation space (i.e., the space of the features obtained at the penultimate layer). Our investigation first reveals that the decision regions in the representation space form cone-like shapes around the origin after training regardless of the presence of regularization. However, applying regularization causes changes in the distribution of features (or representation vectors). The magnitudes of the representation vectors are reduced and subsequently the cosine similarities between the representation vectors and the class centers (minimal loss points for each class) become higher, which acts as a central mechanism inducing improved calibration and robustness. Our findings provide new insights into the characteristics of the high-dimensional representation space in relation to training and regularization using soft labels.
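
For context, the soft labels produced by two of the regularizers named above can be written in a few lines. This sketch uses the standard formulations of label smoothing and Mixup label interpolation; the function names and the example values are chosen for illustration and are not taken from the paper.

```python
def smooth_labels(class_index, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    off_value = eps / num_classes
    return [
        (1.0 - eps) + off_value if k == class_index else off_value
        for k in range(num_classes)
    ]


def mixup_labels(labels_a, labels_b, lam):
    """Mixup target: convex combination of two label vectors with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(labels_a, labels_b)]


print(smooth_labels(0, 4, eps=0.1))                # close to [0.925, 0.025, 0.025, 0.025]
print(mixup_labels([1.0, 0.0], [0.0, 1.0], 0.75))  # [0.75, 0.25]
```

In both cases the target is no longer one-hot, which is the sense in which these methods train on soft labels.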

[CV-142] Mamba Capsule Routing Towards Part-Whole Relational Camouflaged Object Detection

链接: https://arxiv.org/abs/2410.03987
作者: Dingwen Zhang,Liangbo Cheng,Yi Liu,Xinggang Wang,Junwei Han
关键词-EN: relational property endowed, object detection due, Capsule Networks, pixel-level capsule routing, previous Expectation Maximization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The part-whole relational property endowed by Capsule Networks (CapsNets) is known to be successful for camouflaged object detection due to its segmentation integrity. However, the previous Expectation Maximization (EM) capsule routing algorithm, with its heavy computation and large number of parameters, obstructs this trend. The primary cause lies in the pixel-level capsule routing. Alternatively, in this paper, we propose a novel mamba capsule routing at the type level. Specifically, we first extract the implicit latent state in mamba as capsule vectors, which abstract type-level capsules from pixel-level versions. These type-level mamba capsules are fed into the EM routing algorithm to get the high-layer mamba capsules, which greatly reduces the computation and parameters caused by the pixel-level capsule routing for part-whole relationship exploration. On top of that, to retrieve the pixel-level capsule features for further camouflaged prediction, we build on the low-layer pixel-level capsules with the guidance of the correlations from adjacent-layer type-level mamba capsules. Extensive experiments on three widely used COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods. Code is available at this https URL_capsule.

[CV-143] Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss Function

链接: https://arxiv.org/abs/2410.03979
作者: Nisar Ahmed,Muhammad Imran Zaman
关键词-EN: multi-label emotion classification, emotion classification, hybrid loss function, accurately predicting minority, label correlation hinder
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In multi-label emotion classification, particularly for low-resource languages like Arabic, the challenges of class imbalance and label correlation hinder model performance, especially in accurately predicting minority emotions. To address these issues, this study proposes a novel approach that combines stacked embeddings, meta-learning, and a hybrid loss function to enhance multi-label emotion classification for the Arabic language. The study extracts contextual embeddings from three fine-tuned language models-ArabicBERT, MarBERT, and AraBERT-which are then stacked to form enriched embeddings. A meta-learner is trained on these stacked embeddings, and the resulting concatenated representations are provided as input to a Bi-LSTM model, followed by a fully connected neural network for multi-label classification. To further improve performance, a hybrid loss function is introduced, incorporating class weighting, label correlation matrix, and contrastive learning, effectively addressing class imbalances and improving the handling of label correlations. Extensive experiments validate the proposed model’s performance across key metrics such as Precision, Recall, F1-Score, Jaccard Accuracy, and Hamming Loss. The class-wise performance analysis demonstrates the hybrid loss function’s ability to significantly reduce disparities between majority and minority classes, resulting in a more balanced emotion classification. An ablation study highlights the contribution of each component, showing the superiority of the model compared to baseline approaches and other loss functions. This study not only advances multi-label emotion classification for Arabic but also presents a generalizable framework that can be adapted to other languages and domains, providing a significant step forward in addressing the challenges of low-resource emotion classification tasks.

[CV-144] Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

链接: https://arxiv.org/abs/2410.03977
作者: Hongjun Wang,Jiyuan Chen,Zhengwei Yin,Xuan Song,Yinqiang Zheng
关键词-EN: Cloth-Changing Person Re-Identification, involves recognizing individuals, Cloth-Changing Person, Person Re-Identification, involves recognizing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cloth-Changing Person Re-Identification (CC-ReID) involves recognizing individuals in images regardless of clothing status. In this paper, we empirically and experimentally demonstrate that completely eliminating or fully retaining clothing features is detrimental to the task. Existing work, whether relying on clothing labels, silhouettes, or other auxiliary data, fundamentally aims to balance the learning of clothing and identity features. However, we find in practice that achieving this balance is challenging and nuanced. In this study, we introduce a novel module called Diverse Norm, which expands personal features into orthogonal spaces and employs channel attention to separate clothing and identity features. A sample re-weighting optimization strategy is also introduced to guarantee the opposite optimization direction. Diverse Norm presents a simple yet effective approach that does not require additional data. Furthermore, Diverse Norm can be seamlessly integrated into ResNet50 and significantly outperforms the state-of-the-art methods.

[CV-145] Grounding Language in Multi-Perspective Referential Communication EMNLP2024

链接: https://arxiv.org/abs/2410.03959
作者: Zineng Tang,Lingjun Mao,Alane Suhr
关键词-EN: multi-agent embodied environments, embodied environments, multi-agent embodied, referring expression generation, human-written referring expressions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted to EMNLP2024 Main

点击查看摘要

Abstract:We introduce a task and dataset for referring expression generation and comprehension in multi-agent embodied environments. In this task, two agents in a shared scene must take into account one another’s visual perspective, which may be different from their own, to both produce and understand references to objects in a scene and the spatial relations between them. We collect a dataset of 2,970 human-written referring expressions, each paired with human comprehension judgments, and evaluate the performance of automated models as speakers and listeners paired with human partners, finding that model performance in both reference generation and comprehension lags behind that of pairs of human agents. Finally, we experiment with training an open-weight speaker model with evidence of communicative success when paired with a listener, resulting in an improvement from 58.9% to 69.3% in communicative success and even outperforming the strongest proprietary model.

[CV-146] A Brain-Inspired Regularizer for Adversarial Robustness

链接: https://arxiv.org/abs/2410.03952
作者: Elie Attias,Cengiz Pehlevan,Dina Obeid
关键词-EN: Convolutional Neural Networks, slight input perturbations, Convolutional Neural, task failures, visual tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注: 10 pages plus appendix, 10 figures (main text), 15 figures (appendix), 3 tables (appendix)

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) excel in many visual tasks, but they tend to be sensitive to slight input perturbations that are imperceptible to the human eye, often resulting in task failures. Recent studies indicate that training CNNs with regularizers that promote brain-like representations, using neural recordings, can improve model robustness. However, the requirement to use neural data severely restricts the utility of these methods. Is it possible to develop regularizers that mimic the computational function of neural regularizers without the need for neural recordings, thereby expanding the usability and effectiveness of these techniques? In this work, we inspect a neural regularizer introduced in Li et al. (2019) to extract its underlying strength. The regularizer uses neural representational similarities, which we find also correlate with pixel similarities. Motivated by this finding, we introduce a new regularizer that retains the essence of the original but is computed using image pixel similarities, eliminating the need for neural recordings. We show that our regularization method 1) significantly increases model robustness to a range of black box attacks on various datasets and 2) is computationally inexpensive and relies only on original datasets. Our work explores how biologically motivated loss functions can be used to drive the performance of artificial neural networks.

[CV-147] Interpolation-Free Deep Learning for Meteorological Downscaling on Unaligned Grids Across Multiple Domains with Application to Wind Power

链接: https://arxiv.org/abs/2410.03945
作者: Jean-Sébastien Giroux,Simon-Philippe Breton,Julie Carreau
关键词-EN: cleaner energy sources, climate change intensifies, change intensifies, increasingly urgent, shift to cleaner
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As climate change intensifies, the shift to cleaner energy sources becomes increasingly urgent. With wind energy production set to accelerate, reliable probabilistic wind forecasts are essential to ensure its efficient use. However, since numerical weather prediction models are computationally expensive, probabilistic forecasts are produced at resolutions too coarse to capture all mesoscale wind behaviors. Statistical downscaling, typically applied to enhance the resolution of climate model simulations, presents a viable solution with lower computational costs by learning a mapping from low-resolution (LR) variables to high-resolution (HR) meteorological variables. Leveraging deep learning, we evaluate a downscaling model based on a state-of-the-art U-Net architecture, applied to an ensemble member from a coarse-scale probabilistic forecast of wind velocity. The architecture is modified to incorporate (1) a learned grid alignment strategy to resolve LR-HR grid mismatches and (2) a processing module for multi-level atmospheric predictors. To extend the downscaling model’s applicability from fixed spatial domains to the entire Canadian region, we assess a transfer learning approach. Our results show that the learned grid alignment strategy performs as well as conventional pre-processing interpolation steps and that LR wind speed at multiple levels is sufficient as a predictor, enabling a more compact architecture. Additionally, they suggest that extending to new spatial domains using transfer learning is promising, and that downscaled wind velocities demonstrate potential in improving the detection of wind power ramps, a critical phenomenon for wind energy.

[CV-148] AutoLoRA: AutoGuidance Meets Low-Rank Adaptation for Diffusion Models

链接: https://arxiv.org/abs/2410.03941
作者: Artur Kasymov,Marcin Sendera,Michał Stypułkowski,Maciej Zięba,Przemysław Spurek
关键词-EN: Low-rank adaptation, generative diffusion models, Low-rank, conditional generative diffusion, LoRA
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) is a fine-tuning technique that can be applied to conditional generative diffusion models. LoRA utilizes a small number of context examples to adapt the model to a specific domain, character, style, or concept. However, due to the limited data utilized during training, the fine-tuned model performance is often characterized by strong context bias and a low degree of variability in the generated images. To solve this issue, we introduce AutoLoRA, a novel guidance technique for diffusion models fine-tuned with the LoRA approach. Inspired by other guidance techniques, AutoLoRA searches for a trade-off between consistency in the domain represented by LoRA weights and sample diversity from the base conditional diffusion model. Moreover, we show that incorporating classifier-free guidance for both LoRA fine-tuned and base models leads to generating samples with higher diversity and better quality. The experimental results for several fine-tuned LoRA domains show superiority over existing guidance techniques on selected metrics.

[CV-149] Clustering Alzheimer’s Disease Subtypes via Similarity Learning and Graph Diffusion

链接: https://arxiv.org/abs/2410.03937
作者: Tianyi Wei,Shu Yang,Davoud Ataee Tarzanagh,Jingxuan Bao,Jia Xu,Patryk Orzechowski,Joost B. Wagenaar,Qi Long,Li Shen
关键词-EN: complex neurodegenerative disorder, Alzheimer disease, people worldwide, complex neurodegenerative, neurodegenerative disorder
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注: ICIBM’23: International Conference on Intelligent Biology and Medicine, Tampa, FL, USA, July 16-19, 2023

点击查看摘要

Abstract:Alzheimer’s disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Due to the heterogeneous nature of AD, its diagnosis and treatment pose critical challenges. Consequently, there is a growing research interest in identifying homogeneous AD subtypes that can assist in addressing these challenges in recent years. In this study, we aim to identify subtypes of AD that represent distinctive clinical features and underlying pathology by utilizing unsupervised clustering with graph diffusion and similarity learning. We adopted SIMLR, a multi-kernel similarity learning framework, and graph diffusion to perform clustering on a group of 829 patients with AD and mild cognitive impairment (MCI, a prodromal stage of AD) based on their cortical thickness measurements extracted from magnetic resonance imaging (MRI) scans. Although the clustering approach we utilized has not been explored for the task of AD subtyping before, it demonstrated significantly better performance than several commonly used clustering methods. Specifically, we showed the power of graph diffusion in reducing the effects of noise in the subtype detection. Our results revealed five subtypes that differed remarkably in their biomarkers, cognitive status, and some other clinical features. To evaluate the resultant subtypes further, a genetic association study was carried out and successfully identified potential genetic underpinnings of different AD subtypes. Our source code is available at: this https URL.

[CV-150] Learning Truncated Causal History Model for Video Restoration NEURIPS2024

链接: https://arxiv.org/abs/2410.03936
作者: Amirhosein Ghasemabadi,Muhammad Kamran Janjua,Mohammad Salameh,Di Niu
关键词-EN: video frames governed, key challenge, transition dynamics, video, video restoration
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2024. 24 pages

点击查看摘要

Abstract:One key challenge to video restoration is to model the transition dynamics of video frames governed by motion. In this work, we propose TURTLE to learn the truncated causal history model for efficient and high-performing video restoration. Unlike traditional methods that process a range of contextual frames in parallel, TURTLE enhances efficiency by storing and summarizing a truncated history of the input frame latent representation into an evolving historical state. This is achieved through a sophisticated similarity-based retrieval mechanism that implicitly accounts for inter-frame motion and alignment. The causal design in TURTLE enables recurrence in inference through state-memorized historical features while allowing parallel training by sampling truncated video clips. We report new state-of-the-art results on a multitude of video restoration benchmark tasks, including video desnowing, nighttime video deraining, video raindrops and rain streak removal, video super-resolution, real-world and synthetic video deblurring, and blind video denoising while reducing the computational cost compared to existing best contextual methods on all these tasks.

[CV-151] Learning Object Properties Using Robot Proprioception via Differentiable Robot-Object Interaction

链接: https://arxiv.org/abs/2410.03920
作者: Peter Yichen Chen,Chao Liu,Pingchuan Ma,John Eastman,Daniela Rus,Dylan Randle,Yuri Ivanov,Wojciech Matusik
关键词-EN: robot, properties, system identification, manipulated objects, objects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Differentiable simulation has become a powerful tool for system identification. While prior work has focused on identifying robot properties using robot-specific data or object properties using object-specific data, our approach calibrates object properties by using information from the robot, without relying on data from the object itself. Specifically, we utilize robot joint encoder information, which is commonly available in standard robotic systems. Our key observation is that by analyzing the robot’s reactions to manipulated objects, we can infer properties of those objects, such as inertia and softness. Leveraging this insight, we develop differentiable simulations of robot-object interactions to inversely identify the properties of the manipulated objects. Our approach relies solely on proprioception – the robot’s internal sensing capabilities – and does not require external measurement tools or vision-based tracking systems. This general method is applicable to any articulated robot and requires only joint position information. We demonstrate the effectiveness of our method on a low-cost robotic platform, achieving accurate mass and elastic modulus estimations of manipulated objects with just a few seconds of computation on a laptop.

[CV-152] STONE: A Submodular Optimization Framework for Active 3D Object Detection

链接: https://arxiv.org/abs/2410.03918
作者: Ruiyu Mao,Sarthak Kumar Maharana,Rishabh K Iyer,Yunhui Guo
关键词-EN: point cloud data, including autonomous driving, emerging applications, driving and robotics, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D object detection is fundamentally important for various emerging applications, including autonomous driving and robotics. A key requirement for training an accurate 3D object detector is the availability of a large amount of LiDAR-based point cloud data. Unfortunately, labeling point cloud data is extremely challenging, as accurate 3D bounding boxes and semantic labels are required for each potential object. This paper proposes a unified active 3D object detection framework that greatly reduces the labeling cost of training a 3D object detector. Our framework is based on a novel formulation of submodular optimization, specifically tailored to the problem of active 3D object detection. In particular, we address two fundamental challenges associated with active 3D object detection: data imbalance and the need to cover the distribution of the data, including LiDAR-based point cloud data of varying difficulty levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance with high computational efficiency compared to existing active learning methods.
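
To make the notion of submodular selection concrete, here is a minimal sketch of the classic greedy algorithm for a set-coverage objective, a standard example of a monotone submodular function. It is a generic illustration and not the STONE formulation: the function name, the candidate names, and the idea of each sample "covering" a set of integer region IDs are all assumptions made for the example.

```python
def greedy_max_coverage(candidates: dict[str, set[int]], budget: int) -> list[str]:
    """Pick up to `budget` candidates, each time taking the largest marginal coverage gain."""
    selected: list[str] = []
    covered: set[int] = set()
    for _ in range(budget):
        best_name, best_gain = None, 0
        for name, regions in candidates.items():
            if name in selected:
                continue
            # Marginal gain: regions this candidate adds beyond what is already covered.
            gain = len(regions - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            # No remaining candidate adds new coverage; stop early.
            break
        selected.append(best_name)
        covered |= candidates[best_name]
    return selected


# Hypothetical unlabeled samples and the (made-up) regions each one covers.
pool = {
    "scan_a": {1, 2, 3},
    "scan_b": {3, 4},
    "scan_c": {4, 5, 6, 7},
    "scan_d": {1},
}
chosen = greedy_max_coverage(pool, budget=2)
```

In an active-learning setting the chosen samples would be the ones sent for labeling. The diminishing marginal gain is what discourages picking redundant samples.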

[CV-153] The Wallpaper is Ugly: Indoor Localization using Vision and Language

链接: https://arxiv.org/abs/2410.03900
作者: Seth Pate,Lawson L.S. Wong
关键词-EN: natural language queries, mapped indoor environment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: RO-MAN 2023

点击查看摘要

Abstract:We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment. Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment. This score allows us to identify locations that best match the language query, estimating the user’s location. Our approach is capable of localizing on environments, text, and images that were not seen during training. One model, finetuned CLIP, outperformed humans in our evaluation.

[CV-154] SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models

链接: https://arxiv.org/abs/2410.03878
作者: Yue Zhang,Zhiyang Xu,Ying Shen,Parisa Kordjamshidi,Lifu Huang
关键词-EN: promising research direction, large language models, world into large, promising research, research direction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: 1) existing 3D datasets are constructed from a global perspective of the 3D scenes and lack situated context. 2) the architectures of existing 3D-based LLMs lack explicit alignment between the spatial representations of 3D scenes and natural language, limiting their performance in tasks requiring precise spatial reasoning. We address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. Furthermore, we propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module, aiming to enhance the alignment between 3D visual representations and their corresponding textual descriptions. Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs.

[CV-155] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

链接: https://arxiv.org/abs/2410.03869
作者: Wenxuan Wang,Kuiyi Gao,Zihan Jia,Youliang Yuan,Jen-tse Huang,Qiuzhi Liu,Shuai Wang,Wenxiang Jiao,Zhaopeng Tu
关键词-EN: Stable Diffusion, Text-based image generation, hold significant potential, Diffusion and DALL-E, Text-based image
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them a focus of attention in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, encompassing nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro demonstrate that our CoJ attack method can successfully bypass the safeguards of models in over 60% of cases, which significantly outperforms other jailbreaking methods (i.e., 14%). Further, to enhance these models’ safety against our CoJ attack method, we also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend against over 95% of CoJ attacks. We release our dataset and code to facilitate AI safety research.

[CV-156] Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering

链接: https://arxiv.org/abs/2410.03861
作者: Laura Fink,Linus Franke,Joachim Keinert,Marc Stamminger
关键词-EN: depth, depth maps, computer graphics, computer vision, per-pixel depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9.5 pages main paper + 3 pages of references + 1.5 pages appendix

点击查看摘要

Abstract:The accurate reconstruction of per-pixel depth for an image is vital for many tasks in computer graphics, computer vision, and robotics. In this paper, we present a novel approach to generate view consistent and detailed depth maps from a number of posed images. We leverage advances in monocular depth estimation, which generate topologically complete, but metrically inaccurate depth maps and refine them in a two-stage optimization process based on a differentiable renderer. Taking the monocular depth map as input, we first scale this map to absolute distances based on structure-from-motion and transform the depths to a triangle surface mesh. We then refine this depth mesh in a local optimization, enforcing photometric and geometric consistency. Our evaluation shows that our method is able to generate dense, detailed, high-quality depth maps, also in challenging indoor scenarios, and outperforms state-of-the-art depth reconstruction approaches. Overview and supplemental material of this project can be found at this https URL.

[CV-157] MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty

链接: https://arxiv.org/abs/2410.03860
作者: Leo Bringer,Joey Wilson,Kira Barton,Maani Ghaffari
关键词-EN: synchronizes skeletal data, generate refined long-term, refined long-term motion, paper introduces, integrates and synchronizes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty. Existing methods for motion forecasting or motion generation rely solely on either prior motions or text prompts, facing limitations with precision or control, particularly over extended durations. The multi-modal nature of our approach enhances the contextual understanding of human motion, while our graph-based transformer framework effectively captures both spatial and temporal motion dynamics. As a result, our model consistently outperforms existing generative techniques in accurately predicting long-term motions. Additionally, by leveraging diffusion models’ ability to capture different modes of prediction, we estimate uncertainty, significantly improving spatial awareness in human-robot interactions by incorporating zones of presence with varying confidence levels for each body joint.

[CV-158] Unsupervised Prior Learning: Discovering Categorical Pose Priors from Videos

链接: https://arxiv.org/abs/2410.03858
作者: Ziyu Wang,Shuangpeng Han,Mike Zheng Shou,Mengmi Zhang
关键词-EN: pose, pose estimation, represents a set, set of beliefs, beliefs or assumptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A prior represents a set of beliefs or assumptions about a system, aiding inference and decision-making. In this work, we introduce the challenge of unsupervised prior learning in pose estimation, where AI models learn pose priors of animate objects from videos in a self-supervised manner. These videos present objects performing various actions, providing crucial information about their keypoints and connectivity. While priors are effective in pose estimation, acquiring them can be difficult. We propose a novel method, named Pose Prior Learner (PPL), to learn general pose priors applicable to any object category. PPL uses a hierarchical memory to store compositional parts of prototypical poses, from which we distill a general pose prior. This prior enhances pose estimation accuracy through template transformation and image reconstruction. PPL learns meaningful pose priors without any additional human annotations or interventions, outperforming competitive baselines on both human and animal pose estimation datasets. Notably, our experimental results reveal the effectiveness of PPL using learnt priors for pose estimation on occluded images. Through iterative inference, PPL leverages priors to refine estimated poses, regressing them to any prototypical poses stored in memory. Our code, model, and data will be publicly available.

[CV-159] Using Prompts to Guide Large Language Models in Imitating a Real Person’s Language Style

Link: https://arxiv.org/abs/2410.03848
Authors: Ziyang Chen,Stylios Moscholios
Keywords: demonstrated strong capabilities, natural language processing, GPT series, Large language models, language style
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large language models (LLMs), such as the GPT series and the Llama series, have demonstrated strong capabilities in natural language processing, contextual understanding, and text generation. In recent years, researchers have been trying to enhance the abilities of LLMs in performing various tasks, and numerous studies have shown that well-designed prompts can significantly improve the performance of LLMs on these tasks. This study compares the language style imitation ability of three different large language models under the guidance of the same zero-shot prompt. It also compares the imitation ability of the same large language model when guided by three different prompts individually. Additionally, by applying a Tree-of-Thoughts (ToT) prompting method to Llama 3, a conversational AI with the language style of a real person was created. In this study, three evaluation methods were used to evaluate LLMs and prompts. The results show that Llama 3 performs best at imitating language styles, and that the ToT prompting method is the most effective at guiding it in imitating language styles. Using a ToT framework, Llama 3 was guided to interact with users in the language style of a specific individual without altering its core parameters, thereby creating a text-based conversational AI that reflects the language style of the individual.

[CV-160] MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Link: https://arxiv.org/abs/2410.03825
Authors: Junyi Zhang,Charles Herrmann,Junhwa Hur,Varun Jampani,Trevor Darrell,Forrester Cole,Deqing Sun,Ming-Hsuan Yang
Keywords: deform over time, remains a core, computer vision, dynamic scenes, objects move
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R’s representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction.

[CV-161] Modeling and Analysis of Spatial and Temporal Land Clutter Statistics in SAR Imaging Based on MSTAR Data

Link: https://arxiv.org/abs/2410.03816
Authors: Shahrokh Hamidi
Keywords: Synthetic Aperture Radar, Synthetic Aperture, increasingly important subject, land clutter, Aperture Radar
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applications (stat.AP)
Comments: arXiv admin note: substantial text overlap with arXiv:2409.02155

Abstract:The statistical analysis of land clutter for Synthetic Aperture Radar (SAR) imaging has become an increasingly important subject for research and investigation. It is also necessary for designing robust algorithms capable of detecting targets in background clutter. Any attempt to extract the energy of the desired targets from the land clutter requires complete knowledge of the statistical properties of the background clutter. In this paper, the spatial as well as the temporal characteristics of the land clutter are studied. Since the data for each image were collected at a different aspect angle, the temporal analysis contains variation in the aspect angle. Consequently, the temporal analysis includes the characteristics of the radar cross section with respect to the aspect angle at which the data were collected. In order to perform the statistical analysis, several well-known and relevant distributions, namely Weibull, Log-normal, Gamma, and Rayleigh, are considered as prime candidates to model the land clutter. The goodness-of-fit test is based on the Kullback-Leibler (KL) divergence metric. The detailed analysis presented in this paper demonstrates that the Weibull distribution is a more accurate fit for the temporal-aspect-angle statistical analysis, while the Rayleigh distribution models the spatial characteristics of the background clutter with higher accuracy. Finally, based on the aforementioned statistical analyses and by utilizing the Constant False Alarm Rate (CFAR) algorithm, we perform target detection in land clutter. The overall verification of the analysis is performed using the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, which was collected in spotlight mode at X-band, and the results are presented.
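
The KL-divergence goodness-of-fit comparison mentioned in the abstract above can be illustrated with a small sketch. This is not the paper's pipeline: the synthetic samples, the bin edges, and the two candidate distributions (a Rayleigh and a Weibull with shape 1) are assumptions chosen only to show how a lower KL divergence identifies the better-fitting model.

```python
import math

def rayleigh_cdf(x, sigma=1.0):
    """CDF of the Rayleigh distribution."""
    return 1.0 - math.exp(-x * x / (2.0 * sigma * sigma))

def weibull_cdf(x, shape, scale=1.0):
    """CDF of the Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def histogram(samples, edges):
    """Empirical bin probabilities over the half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for s in samples:
        for i in range(len(counts)):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def bin_probs(cdf, edges):
    """Model bin probabilities from CDF differences, renormalised over the binned range."""
    raw = [cdf(edges[i + 1]) - cdf(edges[i]) for i in range(len(edges) - 1)]
    total = sum(raw)
    return [r / total for r in raw]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; empty empirical bins contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical "clutter amplitudes": deterministic Rayleigh(sigma=1) quantiles.
samples = [math.sqrt(-2.0 * math.log(1.0 - (i + 0.5) / 1000.0)) for i in range(1000)]
edges = [0.5 * i for i in range(9)]  # bins covering [0, 4)

empirical = histogram(samples, edges)
kl_rayleigh = kl_divergence(empirical, bin_probs(rayleigh_cdf, edges))
kl_weibull_k1 = kl_divergence(empirical, bin_probs(lambda x: weibull_cdf(x, 1.0), edges))
```

Here `kl_rayleigh` comes out smaller than `kl_weibull_k1`, so the Rayleigh candidate would be selected for this synthetic data. With real clutter data the candidate parameters would first have to be estimated, which this sketch skips.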

[CV-162] EvenNICER-SLAM: Event-based Neural Implicit Encoding SLAM

Link: https://arxiv.org/abs/2410.03812
Authors: Shi Chen,Danda Pani Paudel,Luc Van Gool
Keywords: visual simultaneous localization, neural implicit representations, dense visual simultaneous, implicit encoding SLAM, neural implicit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advancement of dense visual simultaneous localization and mapping (SLAM) has been greatly facilitated by the emergence of neural implicit representations. Neural implicit encoding SLAM, a typical example of which is NICE-SLAM, has recently demonstrated promising results in large-scale indoor scenes. However, these methods typically rely on temporally dense RGB-D image streams as input in order to function properly. When the input source does not support high frame rates or the camera movement is too fast, these methods often experience crashes or significant degradation in tracking and mapping accuracy. In this paper, we propose EvenNICER-SLAM, a novel approach that addresses this issue through the incorporation of event cameras. Event cameras are bio-inspired cameras that respond to intensity changes instead of absolute brightness. Specifically, we integrated an event loss backpropagation stream into the NICE-SLAM pipeline to enhance camera tracking with insufficient RGB-D input. We found through quantitative evaluation that EvenNICER-SLAM, with an inclusion of higher-frequency event image input, significantly outperforms NICE-SLAM with reduced RGB-D input frequency. Our results suggest the potential for event cameras to improve the robustness of dense SLAM systems against fast camera motion in real-world scenarios.

[CV-163] Accelerating Deep Learning with Fixed Time Budget

Link: https://arxiv.org/abs/2410.03790
Authors: Muhammad Asif Khan,Ridha Hamila,Hamid Menouar
Keywords: large model sizes, key elements, success of modern, huge amounts, modern deep learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The success of modern deep learning is attributed to two key elements: huge amounts of training data and large model sizes. Where a vast amount of data allows the model to learn more features, the large model architecture boosts the learning capability of the model. However, both these factors result in prolonged training time. In some practical applications such as edge-based learning and federated learning, limited-time budgets necessitate more efficient training methods. This paper proposes an effective technique for training arbitrary deep learning models within fixed time constraints utilizing sample importance and dynamic ranking. The proposed method is extensively evaluated in both classification and regression tasks in computer vision. The results consistently show clear gains achieved by the proposed method in improving the learning performance of various state-of-the-art deep learning models in both regression and classification tasks.

[CV-164] CalliffusionV2: Personalized Natural Calligraphy Generation with Flexible Multi-modal Control

Link: https://arxiv.org/abs/2410.03787
Authors: Qisheng Liao,Liang Li,Yulang Fei,Gus Xia
Keywords: flexible multi-modal control, produce natural Chinese, natural Chinese calligraphy, natural Chinese, flexible multi-modal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 7 figures

Abstract:In this paper, we introduce CalliffusionV2, a novel system designed to produce natural Chinese calligraphy with flexible multi-modal control. Unlike previous approaches that rely solely on image or text inputs and lack fine-grained control, our system leverages both images to guide generations at fine-grained levels and natural language texts to describe the features of generations. CalliffusionV2 excels at creating a broad range of characters and can quickly learn new styles through a few-shot learning approach. It is also capable of generating non-Chinese characters without prior training. Comprehensive tests confirm that our system produces calligraphy that is both stylistically accurate and recognizable by neural network classifiers and human evaluators.

[CV-165] Improving Neural Optimal Transport via Displacement Interpolation

Link: https://arxiv.org/abs/2410.03783
Authors: Jaemoo Choi,Yongxin Chen,Jaewoong Choi
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages

[CV-166] DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation

Link: https://arxiv.org/abs/2410.03782
Authors: Changdae Oh,Yixuan Li,Kyungwoo Song,Sangdoo Yun,Dongyoon Han
Keywords: Adapting a pre-trained, pre-trained foundation model, pre-trained foundation, ensure robustness, Adapting
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adapting a pre-trained foundation model on downstream tasks should ensure robustness against distribution shifts without the need to retrain the whole model. Although existing weight interpolation methods are simple yet effective, we argue their static nature limits downstream performance while achieving efficiency. In this work, we propose DaWin, a training-free dynamic weight interpolation method that leverages the entropy of individual models over each unlabeled test sample to assess model expertise, and computes per-sample interpolation coefficients dynamically. Unlike previous works that typically rely on additional training to learn such coefficients, our approach requires no training. Then, we propose a mixture modeling approach that greatly reduces the inference overhead raised by dynamic interpolation. We validate DaWin on large-scale visual recognition benchmarks, spanning 14 tasks across robust fine-tuning (ImageNet and five derived distribution shift benchmarks) and multi-task learning with eight classification tasks. Results demonstrate that DaWin achieves significant performance gains in the considered settings, with minimal computational overhead. We further discuss DaWin’s analytic behavior to explain its empirical success.
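
As a rough illustration of entropy-driven, per-sample weight interpolation as described in the abstract above, the sketch below weights two models by how confident each is on a single test sample. The exact coefficient formula (a softmax over negative entropies) and the function names are assumptions made for illustration; the abstract does not specify them, and the mixture-modeling speed-up is not covered.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def interpolation_coefficient(probs_a, probs_b):
    """Weight given to model B for this sample.

    Assumed form: a softmax over negative entropies, so the model that is
    more confident (lower entropy) on the sample receives the larger weight.
    """
    weight_a = math.exp(-entropy(probs_a))
    weight_b = math.exp(-entropy(probs_b))
    return weight_b / (weight_a + weight_b)

def interpolate_weights(weights_a, weights_b, lam):
    """Per-sample merged parameters: (1 - lam) * A + lam * B, element-wise."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(weights_a, weights_b)]

# Hypothetical predictions of two models on one unlabeled test sample.
uncertain = [0.5, 0.5]    # model A is unsure
confident = [0.99, 0.01]  # model B is confident
lam = interpolation_coefficient(uncertain, confident)
merged = interpolate_weights([0.0, 2.0], [2.0, 4.0], lam)
```

Because model B has the lower entropy here, `lam` exceeds 0.5 and the merged parameters sit closer to model B. No training step is involved, which matches the training-free framing of the abstract.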

[CV-167] SGW-based Multi-Task Learning in Vision Tasks

Link: https://arxiv.org/abs/2410.03778
Authors: Ruiyuan Zhang,Yuyao Chen,Yuchi Huo,Jiaxiang Liu,Dianbing Xi,Jie Liu,Chao Wu
Keywords: multi-target optimization task, multi-target optimization, MTL, optimization task, previous cross-attention MTL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multi-task learning (MTL) is a multi-target optimization task. Neural networks try to realize each target using a shared interpretative space within MTL. However, as the scale of datasets expands and the complexity of tasks increases, knowledge sharing becomes increasingly challenging. In this paper, we first re-examine previous cross-attention MTL methods from the perspective of noise. We theoretically analyze this issue and identify it as a flaw in the cross-attention mechanism. To address this issue, we propose an information bottleneck knowledge extraction module (KEM). This module aims to reduce inter-task interference by constraining the flow of information, thereby reducing computational complexity. Furthermore, we employ neural collapse to stabilize the knowledge-selection process. That is, before the features are input to KEM, we project them into ETF space. This mapping makes our method more robust. We implemented this method and conducted comparative experiments on multiple datasets. The results demonstrate that our approach significantly outperforms existing methods in multi-task learning.

[CV-168] Denoising with a Joint-Embedding Predictive Architecture

Link: https://arxiv.org/abs/2410.03755
Authors: Dengsheng Chen,Jie Hu,Xiaoming Wei,Enhua Wu
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages

[CV-169] Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis

Link: https://arxiv.org/abs/2410.03728
Authors: Barak Gahtan,Robert J. Sahala,Alex M. Bronstein,Reuven Cohen
Keywords:
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The dataset and the supplementary material can be provided upon request

[CV-170] LCM: Log Conformal Maps for Robust Representation Learning to Mitigate Perspective Distortion ACCV2024

Link: https://arxiv.org/abs/2410.03686
Authors: Meenakshi Subhash Chippa,Prakash Chandra Chhipa,Kanjar De,Marcus Liwicki,Rajkumar Saini
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Asian Conference on Computer Vision (ACCV2024)

[CV-171] Controllable Shape Modeling with Neural Generalized Cylinder SIGGRAPH

Link: https://arxiv.org/abs/2410.03675
Authors: Xiangyu Zhu,Zhiqin Chen,Ruizhen Hu,Xiaoguang Han
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by Siggraph Asia 2024 (Conference track)

[CV-172] AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation

Link: https://arxiv.org/abs/2409.20398
Authors: Boyu Han,Qianqian Xu,Zhiyong Yang,Shilong Bao,Peisong Wen,Yangbangyan Jiang,Qingming Huang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-173] Causal Context Adjustment Loss for Learned Image Compression NEURIPS2024

Link: https://arxiv.org/abs/2410.04847
Authors: Minghao Han,Shiyin Jiang,Shengxi Li,Xin Deng,Mai Xu,Ce Zhu,Shuhang Gu
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024

[CV-174] Multi-Tiered Self-Contrastive Learning for Medical Microwave Radiometry (MWR) Breast Cancer Detection

Link: https://arxiv.org/abs/2410.04636
Authors: Christoforos Galazis,Huiyi Wu,Igor Goryanin
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-175] SITCOM: Step-wise Triple-Consistent Diffusion Sampling for Inverse Problems

Link: https://arxiv.org/abs/2410.04479
Authors: Ismail Alkhouri,Shijun Liang,Cheng-Han Huang,Jimmy Dai,Qing Qu,Saiprasad Ravishankar,Rongrong Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-176] U-net based prediction of cerebrospinal fluid distribution and ventricular reflux grading

Link: https://arxiv.org/abs/2410.04460
Authors: Melanie Rieff,Fabian Holzberger,Oksana Lapina,Geir Ringstad,Lars Magnus Valnes,Bogna Warsza,Kent-Andre Mardal,Per Kristian Eide,Barbara Wohlmuth
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures

[CV-177] AIM 2024 Challenge on Video Super-Resolution Quality Assessment: Methods and Results

Link: https://arxiv.org/abs/2410.04225
Authors: Ivan Molodetskikh,Artem Borisov,Dmitriy Vatolin,Radu Timofte,Jianzhao Liu,Tianwu Zhi,Yabin Zhang,Yang Li,Jingwen Xu,Yiting Liao,Qing Luo,Ao-Xiang Zhang,Peng Zhang,Haibo Lei,Linyan Jiang,Yaqing Li,Yuqin Cao,Wei Sun,Weixia Zhang,Yinan Sun,Ziheng Jia,Yuxin Zhu,Xiongkuo Min,Guangtao Zhai,Weihua Luo,Yupeng Z.,Hong Y
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 18 pages, 7 figures

[CV-178] DB-SAM: Delving into High Quality Universal Medical Image Segmentation MICCAI2024

Link: https://arxiv.org/abs/2410.04172
Authors: Chao Qin,Jiale Cao,Huazhu Fu,Fahad Shahbaz Khan,Rao Muhammad Anwer
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024 Oral

[CV-179] IceCloudNet: 3D reconstruction of cloud ice from Meteosat SEVIRI

Link: https://arxiv.org/abs/2410.04135
Authors: Kai Jeggle,Mikolaj Czerkawski,Federico Serva,Bertrand Le Saux,David Neubauer,Ulrike Lohmann
Keywords:
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was submitted to Artificial Intelligence for the Earth Systems

[CV-180] Optimizing Medical Image Segmentation with Advanced Decoder Design

Link: https://arxiv.org/abs/2410.04128
Authors: Weibin Yang,Zhiqi Dong,Mingyuan Xu,Longwei Xu,Dehua Geng,Yusong Li,Pengwei Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-181] WAVE-UNET: Wavelength based Image Reconstruction method using attention UNET for OCT images

Link: https://arxiv.org/abs/2410.04123
Authors: Maryam Viqar,Erdem Sahin,Violeta Madjarova,Elena Stoykova,Keehoon Hong
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Optics (physics.optics)
Comments:

[CV-182] TV-based Deep 3D Self Super-Resolution for fMRI

Link: https://arxiv.org/abs/2410.04097
Authors: Fernando Pérez-Bueno,Hongwei Bran Li,Shahin Nasr,Cesar Caballero-Gaudes,Juan Eugenio Iglesias
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint Submitted to ISBI 2025

[CV-183] Hybrid NeRF-Stereo Vision: Pioneering Depth Estimation and 3D Reconstruction in Endoscopy

Link: https://arxiv.org/abs/2410.04041
Authors: Pengcheng Chen,Wenhao Li,Nicole Gunderson,Jeremy Ruthberg,Randall Bly,Waleed M. Abuzeid,Zhenglong Sun,Eric J. Seibel
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-184] Multiscale Latent Diffusion Model for Enhanced Feature Extraction from Medical Images

Link: https://arxiv.org/abs/2410.04000
Authors: Rabeya Tus Sadia,Jie Zhang,Jin Chen
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-185] SpecSAR-Former: A Lightweight Transformer-based Network for Global LULC Mapping Using Integrated Sentinel-1 and Sentinel-2

Link: https://arxiv.org/abs/2410.03962
Authors: Hao Yu,Gen Li,Haoyu Liu,Songyan Zhu,Wenquan Dong,Changjian Li
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-186] Radio-opaque artefacts in digital mammography: automatic detection and analysis of downstream effects

Link: https://arxiv.org/abs/2410.03809
Authors: Amelia Schueppert,Ben Glocker,Mélanie Roschewitz
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this https URL

Machine Learning

[LG-0] Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.05269
Authors: Fei Wang,Ninareh Mehrabi,Palash Goyal,Rahul Gupta,Kai-Wei Chang,Aram Galstyan
Keywords: Data Advisor, Data, large language model, crucial element, element in large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference. Project website: this https URL

Abstract:Data is a crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.

[LG-1] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Link: https://arxiv.org/abs/2410.05265
Authors: Mengzhao Chen,Yi Liu,Jiahao Wang,Yi Bin,Wenqi Shao,Ping Luo
Keywords: deploying Large Language, Large Language Models, Large Language, enhancing memory efficiency, deploying Large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: A PTQ method to significantly boost the performance of static activation quantization

Abstract:Quantization is essential for deploying Large Language Models (LLMs) by enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, leading to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first to enable efficient per-tensor static quantization to outperform expensive per-token dynamic quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5 common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods like QuaRot with 0.98 perplexity improvement and +5.98 points accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at this https URL.
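
To make the contrast between per-tensor static and per-token dynamic quantization concrete, here is a toy sketch. It is not PrefixQuant: the activation values are invented, and "isolating" the outlier token is modelled simply by leaving it out of the calibration set, which stands in for the KV-cache prefixing described in the abstract.

```python
def absmax_scale(values, bits=4):
    """Symmetric scale so the largest magnitude maps to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax

def fake_quantize(values, scale, bits=4):
    """Round to the integer grid, clip to the signed range, and map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(tokens, scale_for_token):
    """Average reconstruction error; scale_for_token picks the scale used per token."""
    errors = []
    for token in tokens:
        restored = fake_quantize(token, scale_for_token(token))
        errors.extend(abs(a - b) for a, b in zip(token, restored))
    return sum(errors) / len(errors)

# Hypothetical activations: two ordinary tokens and one outlier token.
normal_tokens = [[0.5, -0.3, 0.2], [0.4, 0.1, -0.6]]
outlier_token = [60.0, -45.0, 30.0]

flat_all = [v for t in normal_tokens + [outlier_token] for v in t]
flat_normal = [v for t in normal_tokens for v in t]

# Static: one scale fixed offline. Dynamic: a fresh scale computed for every token.
static_with_outlier = absmax_scale(flat_all)
static_without_outlier = absmax_scale(flat_normal)

err_static_with = mean_abs_error(normal_tokens, lambda t: static_with_outlier)
err_static_without = mean_abs_error(normal_tokens, lambda t: static_without_outlier)
err_dynamic = mean_abs_error(normal_tokens, absmax_scale)
```

When the outlier token sets the shared scale, every ordinary activation rounds to zero and `err_static_with` is large. Once the outlier is kept out of the calibrated range, the single static scale gives a far smaller error without the per-token scale computation that dynamic quantization needs at inference time.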

[LG-2] Differential Transformer

Link: https://arxiv.org/abs/2410.05258
Authors: Tianzhu Ye,Li Dong,Yuqing Xia,Yutao Sun,Yi Zhu,Gao Huang,Furu Wei
Keywords: Diff Transformer, Transformer, Diff, introduce Diff Transformer, attention
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.
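
The differential attention idea in the abstract above (scores as the difference between two softmax attention maps) can be sketched for a single query as follows. The scalar `lam` that weights the second map and the example logits are assumptions for illustration; the abstract only states that the two maps are subtracted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention(logits_1, logits_2, lam=1.0):
    """Difference of two softmax attention maps for one query.

    lam is an assumed scalar weight on the second map; with lam = 1 the
    result is a plain subtraction of the two maps.
    """
    map_1 = softmax(logits_1)
    map_2 = softmax(logits_2)
    return [a - lam * b for a, b in zip(map_1, map_2)]

# Hypothetical logits over four context positions. Both maps carry the same
# "noise" on positions 1-3; only the first map attends strongly to position 0.
logits_1 = [2.0, 0.1, 0.1, 0.1]
logits_2 = [0.0, 0.1, 0.1, 0.1]
scores = differential_attention(logits_1, logits_2)
```

In this toy case the score at position 0 stays clearly positive while the shared mass on the other positions is pushed down, which is the noise-cancelling effect the abstract describes. Two identical maps would cancel to all zeros.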

[LG-3] SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

Link: https://arxiv.org/abs/2410.05255
Authors: Daoan Zhang,Guangchen Lan,Dong-Jun Han,Wenlin Yao,Xiaoman Pan,Hongming Zhang,Mingxiao Li,Pengcheng Chen,Yu Dong,Christopher Brinton,Jiebo Luo
Keywords: fine-tune diffusion models, Reinforcement learning, paired human-annotated data, visual generation, human feedback
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace “losing images” in preference pairs. This approach allows us to optimize using only off-policy “winning images.” Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released in this https URL.

[LG-4] GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Link: https://arxiv.org/abs/2410.05254
Authors: Eilam Shapira,Omer Madmon,Itamar Reinman,Samuel Joseph Amouyal,Roi Reichart,Moshe Tennenholtz
Keywords: Large Language Models, Large Language, Language Models, show significant potential, show significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents’ performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

[LG-5] Causal Micro-Narratives EMNLP2024

Link: https://arxiv.org/abs/2410.05252
Authors: Mourad Heddaya,Qingcheng Zeng,Chenhao Tan,Rob Voigt,Alexander Zentefis
Keywords: classify causal micro-narratives, micro-narratives from text, classify causal, Abstract, causal micro-narratives
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Workshop on Narrative Understanding

Abstract:We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model–a fine-tuned Llama 3.1 8B–achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.

[LG-6] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Link: https://arxiv.org/abs/2410.05248
Authors: Yuxin Xiao,Shujian Zhang,Wenxuan Zhou,Marzyeh Ghassemi,Sanqiang Zhao
Keywords: induce desired behaviors, stage typically trains, large language models, instruction-tuning stage typically, typically trains LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets’ intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix’s design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.

[LG-7] SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning

Link: https://arxiv.org/abs/2410.05233
Authors: Taha Bouhsine, Imad El Aaroussi, Atik Faysal, Wang Huaxia
Keywords: anchor-free contrastive learning, proposed Similarity-Orthogonality, fine-grained contrastive learning, contrastive learning, anchor-free contrastive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce a novel anchor-free contrastive learning (AFCL) method leveraging our proposed Similarity-Orthogonality (SimO) loss. Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives: reducing the distance and orthogonality between embeddings of similar inputs while maximizing these metrics for dissimilar inputs, facilitating more fine-grained contrastive learning. The AFCL method, powered by SimO loss, creates a fiber bundle topological structure in the embedding space, forming class-specific, internally cohesive yet orthogonal neighborhoods. We validate the efficacy of our method on the CIFAR-10 dataset, providing visualizations that demonstrate the impact of SimO loss on the embedding space. Our results illustrate the formation of distinct, orthogonal class neighborhoods, showcasing the method’s ability to create well-structured embeddings that balance class separation with intra-class variability. This work opens new avenues for understanding and leveraging the geometric properties of learned representations in various machine learning tasks.

[LG-8] SymmetryLens: A new candidate paradigm for unsupervised symmetry learning via locality and equivariance

Link: https://arxiv.org/abs/2410.05232
Authors: Onur Efe, Arkadas Ozakin
Keywords: unsupervised symmetry learning, symmetry learning method, underlying Lie group, raw data, symmetry equivariant representation
Categories: Machine Learning (cs.LG)
Comments: 27 pages

Abstract:We develop a new, unsupervised symmetry learning method that starts with raw data, and gives the minimal (discrete) generator of an underlying Lie group of symmetries, together with a symmetry equivariant representation of the data. The method is able to learn the pixel translation operator from a dataset with only an approximate translation symmetry, and can learn quite different types of symmetries which are not apparent to the naked eye, equally well. The method is based on the formulation of an information-theoretic loss function that measures both the degree to which the dataset is symmetric under a given candidate symmetry and the degree of locality of the samples in the dataset with respect to this symmetry. We demonstrate that this coupling between symmetry and locality, together with a special optimization technique developed for entropy estimation, results in a highly stable system that gives reproducible results. The symmetry actions we consider are group representations; however, we believe the approach has the potential to be generalized to more general, nonlinear actions of non-commutative Lie groups.

[LG-9] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Link: https://arxiv.org/abs/2410.05229
Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
Keywords: Large Language Models, Large Language, advancements in Large, Language Models, formal reasoning capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn’t contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs’ capabilities and limitations in mathematical reasoning.
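
To make the idea of a symbolic template concrete, here is a minimal sketch of how one question can be instantiated many times with different names and numbers. The template text, the value ranges, and the helper names `instantiate` and `accuracy_over_instances` are hypothetical and are not taken from the GSM-Symbolic benchmark itself.

```python
import random

# Hypothetical template: placeholders are filled with sampled names and values.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)


def instantiate(seed):
    """Return one (question, answer) pair drawn from the template."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Noor"])
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def accuracy_over_instances(model, seeds):
    """Score a model (a callable: question -> int) on several instantiations."""
    results = [model(question) == answer for question, answer in map(instantiate, seeds)]
    return sum(results) / len(results)
```

Running `accuracy_over_instances` over several disjoint seed sets gives a distribution of scores for the same underlying question, which is the kind of variance the abstract refers to.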

[LG-10] ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

Link: https://arxiv.org/abs/2410.05225
Authors: Ehsan Futuhi, Shayan Karimi, Chao Gao, Martin Müller
Keywords: deterministic policy gradient, deep deterministic policy, policy gradient, sparse rewards
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)

Abstract:We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, $\epsilon t$-greedy, which generates exploratory options for exploring less-visited states. We prove that search using $\epsilon t$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, GDRB, and implement longest $n$-step returns. The resulting algorithm, ETGL-DDPG, integrates all three techniques: $\epsilon t$-greedy, GDRB, and Longest $n$-step, into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
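
One plausible reading of "longest n-step returns" is that each transition uses the discounted return all the way to the end of its episode, so a sparse terminal reward reaches every earlier step at once. The sketch below implements that reading; the function name and the optional `bootstrap_value` argument are assumptions, and the paper may define the target differently.

```python
def longest_n_step_returns(rewards, gamma, bootstrap_value=0.0):
    """Discounted return from every step to the end of the episode.

    bootstrap_value is a hypothetical value estimate for the state after the
    last reward (0.0 for a terminal state).
    """
    returns = []
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g  # accumulate backwards through the episode
        returns.append(g)
    returns.reverse()
    return returns
```

For an episode with rewards `[0, 0, 1]` and `gamma = 0.5`, the first transition already receives a non-zero target, whereas a one-step target would see only the zero reward.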

[LG-11] Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

Link: https://arxiv.org/abs/2410.05224
Authors: Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré
Keywords: Fine-tuning large language, LLM generative capabilities, large language models, instruction datasets, improve LLM generative
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: COLM 2024

Abstract:Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or terms of service of LLM providers. Therefore, we seek a way of constructing instruction datasets with samples that are not generated by humans or LLMs but still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template – a data generating Python function – to produce training data that encourages the model to learn an explicit pattern-based rule that corresponds to a desired task. We find that fine-tuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points. Second, since instruction datasets improve performance on multiple downstream tasks simultaneously, Cookbook algorithmically learns how to mix data from various templates to optimize performance on multiple tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned using a Cookbook-generated dataset attains the best accuracy on average compared to other 7B parameter instruction-tuned models and is the best performing model on 3 out of 8 tasks. Finally, we analyze when and why Cookbook improves performance and present a metric that allows us to verify that the improvement is largely explained by the model’s generations adhering better to template rules.
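
The abstract describes a template as a data-generating Python function over random tokens. A minimal sketch of what such functions might look like follows. The two templates, their field names, and the fixed mixing weights are hypothetical; in particular, Cookbook learns its mixture, whereas the weights here are simply passed in.

```python
import random


def copy_span_template(rng, vocab_size=1000, context_len=12, span_len=3):
    """Hypothetical template: the target is a span copied from a random context."""
    context = [rng.randrange(vocab_size) for _ in range(context_len)]
    start = rng.randrange(context_len - span_len + 1)
    return {"context": context, "start": start, "target": context[start:start + span_len]}


def reverse_template(rng, vocab_size=1000, length=6):
    """Hypothetical template: the target is the input token list reversed."""
    tokens = [rng.randrange(vocab_size) for _ in range(length)]
    return {"context": tokens, "target": tokens[::-1]}


def mix_templates(templates, weights, n, seed=0):
    """Draw n samples, choosing a template for each according to the given weights."""
    rng = random.Random(seed)
    chosen = rng.choices(templates, weights=weights, k=n)
    return [template(rng) for template in chosen]
```

Because every sample is built from random token IDs, no human-written or LLM-generated text is involved, which is the property the abstract emphasizes.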

[LG-12] Precise Model Benchmarking with Only a Few Observations EMNLP2024

Link: https://arxiv.org/abs/2410.05222
Authors: Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, Mathew Monfort
Keywords: larger question-answering dataset, large language model, model accuracy, larger question-answering, accuracy on questions
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: To appear at EMNLP 2024

Abstract:How can we precisely estimate a large language model’s (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model’s accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model’s accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches, achieving substantial reductions in the mean squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower compared to those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
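
The balancing of direct and regression estimates can be sketched with the textbook shrinkage form used in small-area estimation. This is an assumption about the general shape of such an estimator, not the paper's exact formula; `eb_estimate`, `var_direct`, and `tau2` (the assumed variance of true subgroup accuracy around the regression prediction) are hypothetical names.

```python
def eb_estimate(direct, var_direct, regression, tau2):
    """Shrink a subgroup's direct estimate toward its regression estimate.

    direct:     average accuracy observed on the subgroup
    var_direct: sampling variance of that average (large for small subgroups)
    regression: prediction borrowed from other subgroups
    tau2:       assumed between-subgroup variance, must be positive
    """
    weight = tau2 / (tau2 + var_direct)  # close to 1 when the direct estimate is precise
    return weight * direct + (1.0 - weight) * regression
```

A subgroup with few questions has a large `var_direct`, so its estimate leans on the regression prediction, while a large subgroup keeps essentially its own average.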

[LG-13] Density estimation with LLMs: a geometric investigation of in-context learning trajectories ICLR2025

Link: https://arxiv.org/abs/2410.05218
Authors: Toni J.B. Liu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls
Keywords: Large language models, demonstrate remarkable emergent, including time series, time series forecasting, remarkable emergent abilities
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Under review as a conference paper at ICLR 2025

Abstract:Large language models (LLMs) demonstrate remarkable emergent abilities to perform in-context learning across various tasks, including time series forecasting. This work investigates LLMs’ ability to estimate probability density functions (PDFs) from data observed in-context; such density estimation (DE) is a fundamental task underlying many probabilistic modeling problems. We leverage the Intensive Principal Component Analysis (InPCA) to visualize and analyze the in-context learning dynamics of LLaMA-2 models. Our main finding is that these LLMs all follow similar learning trajectories in a low-dimensional InPCA space, which are distinct from those of traditional density estimation methods like histograms and Gaussian kernel density estimation (KDE). We interpret the LLaMA in-context DE process as a KDE with an adaptive kernel width and shape. This custom kernel model captures a significant portion of LLaMA’s behavior despite having only two parameters. We further speculate on why LLaMA’s kernel width and shape differs from classical algorithms, providing insights into the mechanism of in-context probabilistic reasoning in LLMs.
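
For readers unfamiliar with kernel density estimation, the sketch below shows a KDE whose kernel has two parameters, a width and a shape. The kernel family `exp(-|u|**shape)` is a hypothetical choice made for illustration; the abstract says only that the fitted kernel has an adaptive width and shape, not which family is used.

```python
import math


def make_kde(samples, width, shape=2.0):
    """Return a density function built from a two-parameter kernel.

    shape=2 gives a Gaussian-type kernel (standard deviation width / sqrt(2));
    shape=1 gives a Laplace-type kernel. Both parameters are assumptions.
    """
    norm = 2.0 * math.gamma(1.0 + 1.0 / shape)  # integral of exp(-|u|**shape) over the real line
    scale = 1.0 / (len(samples) * width * norm)

    def pdf(x):
        return scale * sum(math.exp(-abs((x - s) / width) ** shape) for s in samples)

    return pdf
```

Varying `width` and `shape` as more samples arrive would mimic an adaptive kernel, in contrast to classical KDE where the shape is fixed and only the bandwidth changes.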

[LG-14] Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Link: https://arxiv.org/abs/2410.05203
Authors: Ge Ya (Olga) Luo, Gian Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
Keywords: Fréchet Video Distance, generation distribution quality, Fréchet Video, evaluating video generation, video generation distribution
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD’s reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
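
The distance underlying JEDi, Maximum Mean Discrepancy with a polynomial kernel, can be sketched on precomputed feature vectors. The kernel hyperparameters below (degree 3, `gamma = 1 / dim`, `coef0 = 1`) are assumptions, and the JEPA feature extractor is not included; the inputs are just lists of floats.

```python
def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree with gamma = 1 / dim."""
    gamma = 1.0 / len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def mean_kernel(A, B):
    """Average kernel value over all pairs drawn from A and B."""
    return sum(poly_kernel(a, b) for a in A for b in B) / (len(A) * len(B))


def mmd2(X, Y):
    """Biased (V-statistic) estimate of squared MMD between two feature sets."""
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)
```

Unlike the Fréchet distance, this estimate makes no Gaussian assumption about the feature distribution, which relates to the first limitation listed in the abstract.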

[LG-15] Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Link: https://arxiv.org/abs/2410.05192
Authors: Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
Keywords: typical cosine learning, learning rate, Training language models, fixed compute budget, rate schedule depends
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 45 pages, 13 figures

Abstract:Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate’s oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints’ decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
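
The three phases of a Warmup-Stable-Decay schedule can be written as a small function of the step index. This is a sketch under assumptions: the linear warmup, the linear decay shape, and all parameter names are illustrative choices, since the abstract does not fix them.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Learning rate at a given step under a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak rate; this phase can run indefinitely.
        return peak_lr
    # Decay: drop quickly from the peak rate to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```

Note that `decay_start` can be chosen late, once the compute budget is known, because nothing in the warmup or stable phases depends on the total number of steps. A cosine schedule, by contrast, needs the total up front.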

[LG-16] Matrix-weighted networks for modeling multidimensional dynamics

Link: https://arxiv.org/abs/2410.05188
Authors: Yu Tian, Sadamori Kojaku, Hiroki Sayama, Renaud Lambiotte
Keywords: powerful tools, complex systems, Networks, tools for modeling
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Physics and Society (physics.soc-ph)
Comments: 14 pages, 8 figures

Abstract:Networks are powerful tools for modeling interactions in complex systems. While traditional networks use scalar edge weights, many real-world systems involve multidimensional interactions. For example, in social networks, individuals often have multiple interconnected opinions that can affect different opinions of other individuals, which can be better characterized by matrices. We propose a novel, general framework for modeling such multidimensional interacting dynamics: matrix-weighted networks (MWNs). We present the mathematical foundations of MWNs and examine consensus dynamics and random walks within this context. Our results reveal that the coherence of MWNs gives rise to non-trivial steady states that generalize the notions of communities and structural balance in traditional networks.

[LG-17] MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain ECCV2024

Link: https://arxiv.org/abs/2410.05182
Authors: Timothy Chase Jr, Karthik Dantu
Keywords: celestial objects, tracking of surface, surface terrain, terrain is required, required for spacecraft
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ECCV 2024. Project page available at this https URL

Abstract:The visual detection and tracking of surface terrain is required for spacecraft to safely land on or navigate within close proximity to celestial objects. Current approaches rely on template matching with pre-gathered patch-based features, which are expensive to obtain and a limiting factor in perceptual capability. While recent literature has focused on in-situ detection methods to enhance navigation and operational autonomy, robust description is still needed. In this work, we explore metric learning as the lightweight feature description mechanism and find that current solutions fail to address inter-class similarity and multi-view observational geometry. We attribute this to the view-unaware attention mechanism and introduce Multi-view Attention Regularizations (MARs) to constrain the channel and spatial attention across multiple feature views, regularizing the what and where of attention focus. We thoroughly analyze many modern metric learning losses with and without MARs and demonstrate improved terrain-feature recognition performance by upwards of 85%. We additionally introduce the Luna-1 dataset, consisting of Moon crater landmarks and reference navigation frames from NASA mission data to support future research in this difficult task. Luna-1 and source code are publicly available at this https URL.

[LG-18] Presto! Distilling Steps and Layers for Accelerating Music Generation

Link: https://arxiv.org/abs/2410.05167
Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
Keywords: high-quality generation remains, advances in diffusion-based, remains a challenge, generation remains, layer distillation methods
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA), making it the fastest high-quality TTM to our knowledge. Sound examples can be found at this https URL.

[LG-19] A Simulation-Free Deep Learning Approach to Stochastic Optimal Control

Link: https://arxiv.org/abs/2410.05163
Authors: Mengjian Hua, Matthieu Laurière, Eric Vanden-Eijnden
Keywords: stochastic optimal control, propose a simulation-free, simulation-free algorithm, SOC objective on-policy, solution of generic
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We propose a simulation-free algorithm for the solution of generic problems in stochastic optimal control (SOC). Unlike existing methods, our approach does not require the solution of an adjoint problem, but rather leverages Girsanov theorem to directly calculate the gradient of the SOC objective on-policy. This allows us to speed up the optimization of control policies parameterized by neural networks since it completely avoids the expensive back-propagation step through stochastic differential equations (SDEs) used in the Neural SDE framework. In particular, it enables us to solve SOC problems in high dimension and on long time horizons. We demonstrate the efficiency of our approach in various domains of applications, including standard stochastic optimal control problems, sampling from unnormalized distributions via construction of a Schrödinger-Föllmer process, and fine-tuning of pre-trained diffusion models. In all cases our method is shown to outperform the existing methods in both the computing time and memory efficiency.

[LG-20] PAMLR: A Passive-Active Multi-Armed Bandit-Based Solution for LoRa Channel Allocation

Link: https://arxiv.org/abs/2410.05147
Authors: Jihoon Yun, Chengzhang Li, Anish Arora
Keywords: duty cycle operation, low-power wireless networks, Achieving low duty, low duty cycle, Achieving low
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Achieving low duty cycle operation in low-power wireless networks in urban environments is complicated by the complex and variable dynamics of external interference and fading. We explore the use of reinforcement learning for achieving low power consumption for the task of optimal selection of channels. The learning relies on a hybrid of passive channel sampling for dealing with external interference and active channel sampling for dealing with fading. Our solution, Passive-Active Multi-armed bandit for LoRa (PAMLR, pronounced “Pamela”), balances the two types of samples to achieve energy-efficient channel selection: active channel measurements are tuned to an appropriately low level to update noise thresholds, and, to compensate, passive channel measurements are tuned to an appropriately high level for selecting the top-most channels from channel exploration using the noise thresholds. The rates of both types of samples are adapted in response to channel dynamics. Based on extensive testing in multiple environments in different cities, we validate that PAMLR can maintain excellent communication quality, as demonstrated by a low SNR regret compared to the optimal channel allocation policy, while substantially minimizing the energy cost associated with channel measurements.

[LG-21] Tuning-Free Bilevel Optimization: New Algorithms and Convergence Analysis

Link: https://arxiv.org/abs/2410.05140
Authors: Yifan Yang, Hao Ban, Minhui Huang, Shiqian Ma, Kaiyi Ji
Keywords: recently attracted considerable, attracted considerable attention, considerable attention due, Bilevel optimization, machine learning problems
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bilevel optimization has recently attracted considerable attention due to its abundant applications in machine learning problems. However, existing methods rely on prior knowledge of problem parameters to determine stepsizes, resulting in significant effort in tuning stepsizes when these parameters are unknown. In this paper, we propose two novel tuning-free algorithms, D-TFBO and S-TFBO. D-TFBO employs a double-loop structure with stepsizes adaptively adjusted by the “inverse of cumulative gradient norms” strategy. S-TFBO features a simpler fully single-loop structure that updates three variables simultaneously with a theory-motivated joint design of adaptive stepsizes for all variables. We provide a comprehensive convergence analysis for both algorithms and show that D-TFBO and S-TFBO respectively require $O(\frac{1}{\epsilon})$ and $O(\frac{1}{\epsilon}\log^4(\frac{1}{\epsilon}))$ iterations to find an $\epsilon$-accurate stationary point, (nearly) matching their well-tuned counterparts using the information of problem parameters. Experiments on various problems show that our methods achieve performance comparable to existing well-tuned approaches, while being more robust to the selection of initial stepsizes. To the best of our knowledge, our methods are the first to completely eliminate the need for stepsize tuning, while achieving theoretical guarantees.
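
The "inverse of cumulative gradient norms" idea can be illustrated on a single-level, one-dimensional problem using the well-known AdaGrad-Norm rule. This is only an analogy: it is not D-TFBO or S-TFBO, and the function name and the initial accumulator `acc0` are assumptions.

```python
import math


def adagrad_norm_minimize(grad, x0, steps=200, acc0=1.0):
    """Gradient descent whose stepsize is 1 / sqrt(cumulative squared gradient norm)."""
    x, acc = x0, acc0
    for _ in range(steps):
        g = grad(x)
        acc += g * g              # accumulate squared gradient norms
        x -= g / math.sqrt(acc)   # stepsize shrinks automatically, no tuning of a learning rate
    return x
```

For example, `adagrad_norm_minimize(lambda x: 2 * (x - 3), 0.0)` approaches the minimizer 3 of `(x - 3) ** 2` without any knowledge of the function's smoothness constant, and it stays stable across very different values of `acc0`.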

[LG-22] LOTOS: Layer-wise Orthogonalization for Training Robust Ensembles

Link: https://arxiv.org/abs/2410.05136
Authors: Ali Ebrahimpour-Boroojeny, Hari Sundaram, Varun Chandrasekaran
Keywords: well-known property, property that endangers, endangers all classification, models, Transferability
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Transferability of adversarial examples is a well-known property that endangers all classification models, even those that are only accessible through black-box queries. Prior work has shown that an ensemble of models is more resilient to transferability: the probability that an adversarial example is effective against most models of the ensemble is low. Thus, most ongoing research focuses on improving ensemble diversity. Another line of prior work has shown that Lipschitz continuity of the models can make models more robust since it limits how a model’s output changes with small input perturbations. In this paper, we study the effect of Lipschitz continuity on transferability rates. We show that although a lower Lipschitz constant increases the robustness of a single model, it is not as beneficial in training robust ensembles as it increases the transferability rate of adversarial examples across models in the ensemble. Therefore, we introduce LOTOS, a new training paradigm for ensembles, which counteracts this adverse effect. It does so by promoting orthogonality among the top-$k$ sub-spaces of the transformations of the corresponding affine layers of any pair of models in the ensemble. We theoretically show that $k$ does not need to be large for convolutional layers, which makes the computational overhead negligible. Through various experiments, we show LOTOS increases the robust accuracy of ensembles of ResNet-18 models by 6 percentage points (p.p.) against black-box attacks on CIFAR-10. It is also capable of combining with the robustness of prior state-of-the-art methods for training robust ensembles to enhance their robust accuracy by 10.7 p.p.

[LG-23] A Digital Twin Framework for Liquid-cooled Supercomputers as Demonstrated at Exascale

Link: https://arxiv.org/abs/2410.05133
Authors: Wesley Brewer, Matthias Maiterth, Vineet Kumar, Rafal Wojda, Sedrick Bouknight, Jesse Hines, Woong Shin, Scott Greenwood, David Grant, Wesley Williams, Feiyi Wang
Keywords: open-source framework, framework, augmented reality model, digital twins
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 14 pages, 9 figures, To be published in the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2024

Abstract:We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of “what-if” scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework’s capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.

[LG-24] Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability

Link: https://arxiv.org/abs/2410.05117
Authors: Fan Chen, Dylan J. Foster, Yanjun Han, Jian Qian, Alexander Rakhlin, Yunbei Xu
Keywords: interactive decision making, statistical estimation, lower bound, interactive decision, decision making
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:In this paper, we develop a unified framework for lower bound methods in statistical estimation and interactive decision making. Classical lower bound techniques – such as Fano’s inequality, Le Cam’s method, and Assouad’s lemma – have been central to the study of minimax risk in statistical estimation, yet they are insufficient for the analysis of methods that collect data in an interactive manner. The recent minimax lower bounds for interactive decision making via the Decision-Estimation Coefficient (DEC) appear to be genuinely different from the classical methods. We propose a unified view of these distinct methodologies through a general algorithmic lower bound method. We further introduce a novel complexity measure, decision dimension, which facilitates the derivation of new lower bounds for interactive decision making. In particular, decision dimension provides a characterization of bandit learnability for any structured bandit model class. Further, we characterize the sample complexity of learning convex model class up to a polynomial gap with the decision dimension, addressing the remaining gap between upper and lower bounds in Foster et al. (2021, 2023).

[LG-25] Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

Link: https://arxiv.org/abs/2410.05116
Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
Keywords: Stable Diffusion, Controllable generation, improve fidelity, aims to improve, human feedback
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD’s refined initialization samples, enabling faster convergence towards the evaluator’s intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.

[LG-26] Hyper-Representations: Learning from Populations of Neural Networks

Link: https://arxiv.org/abs/2410.05107
Authors: Konstantin Schürholt
Keywords: Neural Networks, Neural Network models, thesis, Neural, understanding Neural Networks
Categories: Machine Learning (cs.LG)
Comments: PhD Dissertation accepted at University of St. Gallen

Abstract:This thesis addresses the challenge of understanding Neural Networks through the lens of their most fundamental component: the weights, which encapsulate the learned information and determine the model behavior. At the core of this thesis is a fundamental question: Can we learn general, task-agnostic representations from populations of Neural Network models? The key contribution of this thesis toward answering that question is hyper-representations, a self-supervised method to learn representations of NN weights. Work in this thesis finds that trained NN models indeed occupy meaningful structures in the weight space that can be learned and used. Through extensive experiments, this thesis demonstrates that hyper-representations uncover model properties, such as their performance, state of training, or hyperparameters. Moreover, the identification of regions with specific properties in hyper-representation space makes it possible to sample and generate model weights with targeted properties. This thesis demonstrates successful applications to fine-tuning and transfer learning. Lastly, it presents methods that allow hyper-representations to generalize beyond model sizes, architectures, and tasks. The practical implications of that are profound, as it opens the door to foundation models of Neural Networks, which aggregate and instantiate their knowledge across models and architectures. Ultimately, this thesis contributes to the deeper understanding of Neural Networks by investigating structures in their weights, which leads to more interpretable, efficient, and adaptable models. By laying the groundwork for representation learning of NN weights, this research demonstrates the potential to change the way Neural Networks are developed, analyzed, and used.

[LG-27] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Link: https://arxiv.org/abs/2410.05102
Authors: Fenia Christopoulou,Ronald Cardenas,Gerasimos Lampouras,Haitham Bou-Ammar,Jun Wang
Keywords: Direct Preference Optimization, Preference Optimization objective, Preference Optimization, aligning language models, offline Direct Preference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 9 figures, 5 tables. Under Review

Abstract:Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.

[LG-28] DreamSat: Towards a General 3D Model for Novel View Synthesis of Space Objects

Link: https://arxiv.org/abs/2410.05097
Authors: Nidhi Mathihalli,Audrey Wei,Giovanni Lavezzi,Peng Mun Siew,Victor Rodriguez-Fernandez,Hodei Urrutxua,Richard Linares
Keywords: Space Domain Awareness, view synthesis, enables to generate, Domain Awareness, Structural Similarity Index
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Presented at the 75th International Astronautical Congress, October 2024, Milan, Italy

Abstract:Novel view synthesis (NVS) makes it possible to generate new images of a scene or convert a set of 2D images into a comprehensive 3D model. In the context of Space Domain Awareness, since space is becoming increasingly congested, NVS can accurately map space objects and debris, improving the safety and efficiency of space operations. Similarly, in Rendezvous and Proximity Operations missions, 3D models can provide details about a target object’s shape, size, and orientation, allowing for better planning and prediction of the target’s behavior. In this work, we explore the generalization abilities of these reconstruction techniques, aiming to avoid the necessity of retraining for each new scene. We present DreamSat, a novel approach to 3D spacecraft reconstruction from single-view images, which fine-tunes Zero123 XL, a state-of-the-art single-view reconstruction model, on a dataset of 190 high-quality spacecraft models and integrates it into the DreamGaussian framework. We demonstrate consistent improvements in reconstruction quality across multiple metrics, including Contrastive Language-Image Pretraining (CLIP) score (+0.33%), Peak Signal-to-Noise Ratio (PSNR) (+2.53%), Structural Similarity Index (SSIM) (+2.38%), and Learned Perceptual Image Patch Similarity (LPIPS) (+0.16%) on a test set of 30 previously unseen spacecraft images. Our method addresses the lack of domain-specific 3D reconstruction tools in the space industry by leveraging state-of-the-art diffusion models and 3D Gaussian splatting techniques. This approach maintains the efficiency of the DreamGaussian framework while enhancing the accuracy and detail of spacecraft reconstructions. The code for this work can be accessed on GitHub (this https URL).

[LG-29] HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation

Link: https://arxiv.org/abs/2410.05090
Authors: Xinyu Zhou,Simin Fan,Martin Jaggi
Keywords: individual training samples, Influence functions provide, influence function approximation, specific target, provide a principled
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Influence functions provide a principled method to assess the contribution of individual training samples to a specific target. Yet, their high computational costs limit their applications on large-scale models and datasets. Existing methods proposed for influence function approximation have significantly reduced the computational overheads. However, they mostly suffer from inaccurate estimation due to the lack of strong convergence guarantees from the algorithm. The family of hyperpower methods is well known for its rigorous convergence guarantees on matrix inverse approximation, while the matrix multiplication operation can involve intractable memory and computation costs on large-scale models. We propose HyperINF, an efficient and accurate influence function approximation method which leverages the hyperpower method, specifically Schulz’s iterative algorithm. To deal with the computation-intensive matrix multiplication, we incorporate the generalized Fisher information (GFIM) as a low-rank approximation of the Hessian matrix, which reduces the memory and computation overheads to constant costs independent of ranks on LoRA-tuned models. We first demonstrate the superior accuracy and stability of HyperINF compared to other baselines through a synthetic convergence simulation for matrix inversion. We further validate the efficacy of HyperINF through extensive real-world data attribution tasks, including mislabeled data detection and data selection for LLM and VLM fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead, while other baselines suffer from significant degradation. Our codebase is available at this https URL.
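
The abstract above names Schulz's iterative algorithm as the hyperpower method used for matrix inverse approximation. As a minimal sketch of that iteration alone (not the authors' HyperINF implementation, and without the GFIM approximation), the update is X_{k+1} = X_k (2I - A X_k). The helper names `matmul` and `schulz_inverse`, the iteration count, and the starting guess X_0 = A^T / (||A||_1 ||A||_inf) are assumptions chosen for illustration; that starting guess is a standard choice that makes the iteration converge for any invertible matrix.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def schulz_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix with Schulz's iteration.

    Hypothetical helper for illustration: X_{k+1} = X_k (2I - A X_k).
    """
    n = len(a)
    # Scaled transpose as the starting guess (assumed initialization).
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = norm_1 * norm_inf
    x = [[a[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        # Build 2I - A X_k, then right-multiply the current estimate by it.
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
            for i in range(n)
        ]
        x = matmul(x, correction)
    return x


A = [[4.0, 1.0], [2.0, 3.0]]
A_inv = schulz_inverse(A)
```

Each step uses only matrix multiplications, which is why the abstract singles out matrix multiplication as the cost to control on large models.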

[LG-30] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Link: https://arxiv.org/abs/2410.05080
Authors: Ziru Chen,Shijie Chen,Yuting Ning,Qianheng Zhang,Boshi Wang,Botao Yu,Yifei Li,Zeyi Liao,Chen Wei,Zitong Lu,Vishal Dey,Mingyi Xue,Frazier N. Baker,Benjamin Burns,Daniel Adu-Ampratwum,Xuhui Huang,Xia Ning,Song Gao,Yu Su,Huan Sun
Keywords: piqued growing interest, automate scientific discovery, developing LLM-based language, scientific discovery, piqued growing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 55 pages

Abstract:The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

[LG-31] Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data

Link: https://arxiv.org/abs/2410.05078
Authors: David Heurtel-Depeiges,Anian Ruoss,Joel Veness,Tim Genewein
Keywords: recently been shown, strong data compressors, compression, parameter count, compression algorithms
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:Foundation models have recently been shown to be strong data compressors. However, when accounting for their excessive parameter count, their compression ratios are actually inferior to standard compression algorithms. Moreover, naively reducing the number of parameters may not necessarily help as it leads to worse predictions and thus weaker compression. In this paper, we conduct a large-scale empirical study to investigate whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible. To this end, we train families of models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG 2000, FLAC) - even when factoring in parameter count. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). To study the impact of model and dataset scale, we conduct extensive ablations and hyperparameter sweeps, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but, in contrast to previously reported results with large-scale foundation models, transfer to unseen modalities is generally weak.
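
The abstract above reports compression ratios where lower is better (0.49 vs. 0.54) and says the model's parameter count is factored in. A minimal sketch of how such a ratio can be computed, assuming the ratio is compressed size divided by raw size and that accounting for parameters means adding the model's size in bytes to the compressed size. The function name `compression_ratio`, the sample data, and the byte counts in the last call are hypothetical. The standard-library `zlib` (DEFLATE, the algorithm behind gzip) and `lzma` (xz container, LZMA2) modules stand in for the general-purpose baselines.

```python
import lzma
import zlib


def compression_ratio(raw, compressed_size, model_size_bytes=0):
    """Return (compressed size + model size) / raw size; lower is better.

    Hypothetical helper: model_size_bytes is 0 for classical compressors and
    the size of the stored parameters for a learned compressor (assumption).
    """
    return (compressed_size + model_size_bytes) / len(raw)


# Highly repetitive sample data, so both baselines compress it well.
data = b"abcabcabc" * 1000

gzip_ratio = compression_ratio(data, len(zlib.compress(data, 9)))
lzma_ratio = compression_ratio(data, len(lzma.compress(data)))

# A learned compressor that emits 100 bytes but needs 900 bytes of parameters.
adjusted_ratio = compression_ratio(data, 100, model_size_bytes=900)
```

Under this accounting, a large model has to compress a correspondingly large amount of data before its parameter cost stops dominating the ratio, which matches the trade-off the abstract describes.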

[LG-32] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Link: https://arxiv.org/abs/2410.05076
Authors: Lijie Yang,Zhihao Zhang,Zhuofu Chen,Zikun Li,Zhihao Jia
Keywords: Large language models, long-context models gaining, models gaining prominence, handling extended inputs, driven significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
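
The idea described above, a few layers running full attention to pick the highest-scoring tokens while the remaining layers reuse those positions, can be illustrated with a toy sketch. This is not the TidalDecode system: the function names, the single-head dot-product scoring, the omission of value vectors and the KV cache, and the rule that the first layer always selects are all assumptions made for illustration.

```python
import math


def attention_scores(query, keys):
    """Softmax over dot products between one query and a list of keys."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(scores, k):
    """Return the positions of the k largest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def positions_attended(queries, keys_per_layer, selection_layers, k):
    """List, for each layer, the token positions it attends to.

    Selection layers score every position (full attention) and refresh the
    shared top-k set; all other layers reuse the most recent set.
    """
    selected = None
    attended = []
    for layer, (query, keys) in enumerate(zip(queries, keys_per_layer)):
        if selected is None or layer in selection_layers:
            scores = attention_scores(query, keys)
            selected = select_top_k(scores, k)
            attended.append(list(range(len(keys))))
        else:
            # Sparse attention restricted to the pre-selected positions.
            attention_scores(query, [keys[i] for i in selected])
            attended.append(list(selected))
    return attended


keys = [[0.0, 0.0], [5.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
layers = positions_attended(
    queries=[[1.0, 0.0]] * 3,
    keys_per_layer=[keys] * 3,
    selection_layers={0},
    k=2,
)
```

In this toy run only layer 0 scores all four positions; layers 1 and 2 score just the two positions it selected, which is where the reduction in token-selection overhead would come from.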

[LG-33] Function Gradient Approximation with Random Shallow ReLU Networks with Control Applications

Link: https://arxiv.org/abs/2410.05071
Authors: Andrew Lamperski,Siddharth Salapaka
Keywords: output parameters, common neural network, neural network architecture, parameters, input parameters
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: Under Review for American Control Conference, 2025

Abstract:Neural networks are widely used to approximate unknown functions in control. A common neural network architecture uses a single hidden layer (i.e. a shallow network), in which the input parameters are fixed in advance and only the output parameters are trained. The typical formal analysis asserts that if output parameters exist to approximate the unknown function with sufficient accuracy, then desired control performance can be achieved. A long-standing theoretical gap was that no conditions existed to guarantee that, for the fixed input parameters, required accuracy could be obtained by training the output parameters. Our recent work has partially closed this gap by demonstrating that if input parameters are chosen randomly, then for any sufficiently smooth function, with high probability there are output parameters resulting in $O((1/m)^{1/2})$ approximation errors, where $m$ is the number of neurons. However, some applications, notably continuous-time value function approximation, require that the network approximates both the unknown function and its gradient with sufficient accuracy. In this paper, we show that randomly generated input parameters and trained output parameters result in gradient errors of $O((\log(m)/m)^{1/2})$, and additionally, improve the constants from our prior work. We show how to apply the result to policy evaluation problems.

[LG-34] Control-oriented Clustering of Visual Latent Representation

Link: https://arxiv.org/abs/2410.05063
Authors: Han Qi(1),Haocheng Yin(1 and 2),Heng Yang(2) ((1) Harvard University, (2) ETH Zürich)
Keywords: visual representation space, visual representation, control pipeline learned, control-oriented visual representation, representation space
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:We initiate a study of the geometry of the visual representation space – the information channel from the vision encoder to the action decoder – in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification, we investigate whether a similar law of clustering emerges in the visual representation space. Since image-based control is a regression task without explicitly defined classes, the central piece of the puzzle lies in determining according to what implicit classes the visual features cluster, if such a law exists. Focusing on image-based planar pushing, we posit the most important role of the visual representation in a control task is to convey a goal to the action decoder. We then classify training samples of expert demonstrations into eight “control-oriented” classes based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output, where one class corresponds to one relative pose orthant (REPO). Across four different instantiations of architecture, we report the prevalent emergence of control-oriented clustering in the visual representation space according to the eight REPOs. Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35% in the low-data regime. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.

[LG-35] SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification NEURIPS2024

Link: https://arxiv.org/abs/2410.05057
Authors: Benjamin Feuer,Jiawei Xu,Niv Cohen,Patrick Yubeaton,Govind Mittal,Chinmay Hegde
Keywords: supports efficient learning, Data curation, collect and organize, organize samples, supports efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2024, Datasets and Benchmarks Track

Abstract:Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted towards a large-scale, systematic comparison of various curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. In order to generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch (ii) using the data itself to fit a pretrained self-supervised representation. Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at this https URL.

[LG-36] FreSh: Frequency Shifting for Accelerated Neural Representation Learning

Link: https://arxiv.org/abs/2410.05050
Authors: Adam Kania,Marko Mihajlovic,Sergey Prokudin,Jacek Tabor,Przemysław Spurek
Keywords: recently gained attention, Implicit Neural Representations, Implicit Neural, continuously representing signals, shapes using multilayer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Implicit Neural Representations (INRs) have recently gained attention as a powerful approach for continuously representing signals such as images, videos, and 3D shapes using multilayer perceptrons (MLPs). However, MLPs are known to exhibit a low-frequency bias, limiting their ability to capture high-frequency details accurately. This limitation is typically addressed by incorporating high-frequency input embeddings or specialized activation layers. In this work, we demonstrate that these embeddings and activations are often configured with hyperparameters that perform well on average but are suboptimal for specific input signals under consideration, necessitating a costly grid search to identify optimal settings. Our key observation is that the initial frequency spectrum of an untrained model’s output correlates strongly with the model’s eventual performance on a given target signal. Leveraging this insight, we propose frequency shifting (or FreSh), a method that selects embedding hyperparameters to align the frequency spectrum of the model’s initial output with that of the target signal. We show that this simple initialization technique improves performance across various neural representation methods and tasks, achieving results comparable to extensive hyperparameter sweeps but with only marginal computational overhead compared to training a single model with default hyperparameters.

[LG-37] PhotoReg: Photometrically Registering 3D Gaussian Splatting Models

Link: https://arxiv.org/abs/2410.05044
Authors: Ziwen Yuan,Tianyi Zhang,Matthew Johnson-Roberson,Weiming Zhi
Keywords: Building accurate representations, Building accurate, decisions during deployment, accurate representations, make decisions
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Building accurate representations of the environment is critical for intelligent robots to make decisions during deployment. Advances in photorealistic environment models have enabled robots to develop hyper-realistic reconstructions, which can be used to generate images that are intuitive for human inspection. In particular, the recently introduced 3D Gaussian Splatting (3DGS), which describes the scene with up to millions of primitive ellipsoids, can be rendered in real time. 3DGS has rapidly gained prominence. However, a critical unsolved problem persists: how can we fuse multiple 3DGS into a single coherent model? Solving this problem will enable robot teams to jointly build 3DGS models of their surroundings. A key insight of this work is to leverage the duality between photorealistic reconstructions, which render realistic 2D images from 3D structure, and 3D foundation models, which predict 3D structure from image pairs. To this end, we develop PhotoReg, a framework to register multiple photorealistic 3DGS models with 3D foundation models. As 3DGS models are generally built from monocular camera images, they have arbitrary scale. To resolve this, PhotoReg actively enforces scale consistency among the different 3DGS models by considering depth estimates within these models. Then, the alignment is iteratively refined with fine-grained photometric losses to produce high-quality fused 3DGS models. We rigorously evaluate PhotoReg on both standard benchmark datasets and our custom-collected datasets, including with two quadruped robots. The code is released at this http URL.

[LG-38] Systematic Literature Review of Vision-Based Approaches to Outdoor Livestock Monitoring with Lessons from Wildlife Studies

Link: https://arxiv.org/abs/2410.05041
Authors: Stacey D. Scott,Zayn J. Abbas,Feerass Ellid,Eli-Henry Dykhne,Muhammad Muhaiminul Islam,Weam Ayad,Kristina Kacmorova,Dan Tulpan,Minglun Gong
Keywords: Precision livestock farming, Precision livestock, farming outcomes, health and welfare, aims to improve
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 5 figures, 2 tables

Abstract:Precision livestock farming (PLF) aims to improve the health and welfare of livestock animals and farming outcomes through the use of advanced technologies. Computer vision, combined with recent advances in machine learning and deep learning artificial intelligence approaches, offers a possible solution to the PLF ideal of 24/7 livestock monitoring that helps facilitate early detection of animal health and welfare issues. However, a significant number of livestock species are raised in large outdoor habitats that pose technological challenges for computer vision approaches. This review provides a comprehensive overview of computer vision methods and open challenges in outdoor animal monitoring. We include research from both the livestock and wildlife fields in the review because of the similarities in appearance, behaviour, and habitat for many livestock and wildlife. We focus on large terrestrial mammals, such as cattle, horses, deer, goats, sheep, koalas, giraffes, and elephants. We use an image processing pipeline to frame our discussion and highlight the current capabilities and open technical challenges at each stage of the pipeline. The review found a clear trend towards the use of deep learning approaches for animal detection, counting, and multi-species classification. We discuss in detail the applicability of current vision-based methods to PLF contexts and promising directions for future research.

[LG-39] Active Fine-Tuning of Generalist Policies

Link: https://arxiv.org/abs/2410.05026
Authors: Marco Bagatella,Jonas Hübotter,Georg Martius,Andreas Krause
Keywords: Pre-trained generalist policies, rapidly gaining relevance, Pre-trained generalist, robot learning due, rapidly gaining
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide which tasks should be demonstrated and how often. We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.

[LG-40] DEPT: Decoupled Embeddings for Pre-training Language Models

Link: https://arxiv.org/abs/2410.05021
Authors: Alex Iacob,Lorenzo Sani,Meghdad Kurmanji,William F. Shen,Xinchi Qiu,Dongqi Cai,Yan Gao,Nicholas D. Lane
Keywords: broader data mixture, Model pre-training benefits, DEPT, broader data, data mixture
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Language Model pre-training benefits from a broader data mixture to enhance performance across domains and languages. However, training on such heterogeneous text corpora is complex, requiring extensive and cost-intensive efforts. Since these data sources vary in lexical, syntactic, and semantic aspects, they cause negative interference or the “curse of multilinguality”. We propose a novel pre-training framework to alleviate this curse. Our method, DEPT, decouples the embedding layers from the transformer body while simultaneously training the latter in multiple contexts. DEPT enables the model to train without being bound to a shared global vocabulary. DEPT: (1) can train robustly and effectively under significant data heterogeneity, (2) reduces the parameter count of the token embeddings by up to 80% and the communication costs by 675x for billion-scale models, (3) enhances model generalization and plasticity in adapting to new languages and domains, and (4) allows training with custom optimized vocabulary per data source. We prove DEPT’s potential by performing the first vocabulary-agnostic federated multilingual pre-training of a 1.3 billion-parameter model across high and low-resource languages, reducing its parameter count by 409 million.

[LG-41] FRIDA: Free-Rider Detection using Privacy Attacks

Link: https://arxiv.org/abs/2410.05020
Authors: Pol G. Recasens,Ádám Horváth,Alberto Gutierrez-Torre,Jordi Torres,Josep Ll.Berral,Balázs Pejó
Keywords: enables multiple parties, learning model collaboratively, Federated learning, high-performing machine learning, increasingly popular
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Federated learning is increasingly popular as it enables multiple parties with limited datasets and resources to train a high-performing machine learning model collaboratively. However, similarly to other collaborative systems, federated learning is vulnerable to free-riders – participants who do not contribute to the training but still benefit from the shared model. Free-riders not only compromise the integrity of the learning process but also slow down the convergence of the global model, resulting in increased costs for the honest participants. To address this challenge, we propose FRIDA: free-rider detection using privacy attacks, a framework that leverages inference attacks to detect free-riders. Unlike traditional methods that only capture the implicit effects of free-riding, FRIDA directly infers details of the underlying training datasets, revealing characteristics that indicate free-rider behaviour. Through extensive experiments, we demonstrate that membership and property inference attacks are effective for this purpose. Our evaluation shows that FRIDA outperforms state-of-the-art methods, especially in non-IID settings.

[LG-42] RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

Link: https://arxiv.org/abs/2410.05019
Authors: Ibrahim Aldarmaki, Thamar Solorio, Bhiksha Raj, Hanan Aldarmaki
Keywords: Neural multi-channel speech, Neural multi-channel, generalization potential, multi-channel speech enhancement, demonstrate promising performance
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently, and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models by incorporating relative information from the outset, where each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing the overall performance. The experiments conducted on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.
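
The input strategy described in this abstract, pairing each channel with a reference channel through stacking, can be illustrated with a few lines of NumPy. This is only a minimal sketch of the stacking step: the function name `stack_with_reference`, the choice of channel 0 as reference, and the array layout are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np

def stack_with_reference(channels: np.ndarray, ref_index: int = 0) -> np.ndarray:
    """Pair every channel with a reference channel (hypothetical helper).

    channels: array of shape (num_channels, num_samples)
    returns:  array of shape (num_channels, 2, num_samples), where
              [:, 0] holds the original channel and [:, 1] the reference.
    """
    # Repeat the reference channel so it lines up with every input channel.
    reference = np.broadcast_to(channels[ref_index], channels.shape)
    return np.stack([channels, reference], axis=1)

# Toy input: 3 microphones, 4 samples each.
mics = np.arange(12, dtype=np.float32).reshape(3, 4)
pairs = stack_with_reference(mics, ref_index=0)
print(pairs.shape)  # (3, 2, 4)
```

Each `pairs[i]` could then be fed to an encoder that sees the channel and the reference side by side, which is one way to expose relative information from the first layer onward.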

[LG-43] T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data

Link: https://arxiv.org/abs/2410.05016
Authors: Hugo Thimonier, José Lucas De Melo Costa, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
Keywords: constructing meaningful representations, constructing meaningful, Embedding Predictive Architecture, Joint Embedding Predictive, data
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Self-supervision is often used for pre-training to foster performance on a downstream task by constructing meaningful representations of samples. Self-supervised learning (SSL) generally involves generating different views of the same sample and thus requires data augmentations that are challenging to construct for tabular data. This constitutes one of the main challenges of self-supervision for structured data. In the present work, we propose a novel augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on a Joint Embedding Predictive Architecture (JEPA) and is akin to mask reconstruction in the latent space. It involves predicting the latent representation of one subset of features from the latent representation of a different subset within the same sample, thereby learning rich representations without augmentations. We use our method as a pre-training technique and train several deep classifiers on the obtained representation. Our experimental results demonstrate a substantial improvement in both classification and regression tasks, outperforming models trained directly on samples in their original data space. Moreover, T-JEPA enables some methods to consistently outperform or match the performance of traditional methods like Gradient Boosted Decision Trees. To understand why, we extensively characterize the obtained representations and show that T-JEPA effectively identifies relevant features for downstream tasks without access to the labels. Additionally, we introduce regularization tokens, a novel regularization method critical for training of JEPA-based models on structured data.

[LG-44] MC-QDSNN: Quantized Deep evolutionary SNN with Multi-Dendritic Compartment Neurons for Stress Detection using Physiological Signals

Link: https://arxiv.org/abs/2410.04992
Authors: Ajay B.S., Phani Pavan K, Madhav Rao
Keywords: Long short-term memory, Long short-term, analyzing and inferring, inferring time series, Long
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Notes: 13 pages, 15 figures. Applied to IEEE Transactions on Computer Aided Design Journal. Awaiting a verdict

Abstract:Long short-term memory (LSTM) has emerged as a definitive network for analyzing and inferring time series data. LSTM has the capability to extract spectral features and a mixture of temporal features. Due to this benefit, a similar feature extraction method is explored for the spiking counterparts targeting time-series data. Though LSTMs perform well in their spiking form, they tend to be compute- and power-intensive. Addressing this issue, this work proposes the Multi-Compartment Leaky (MCLeaky) neuron as a viable alternative for efficient processing of time series data. The MCLeaky neuron, derived from the Leaky Integrate and Fire (LIF) neuron model, contains multiple memristive synapses interlinked to form a memory component, which emulates the human brain’s hippocampus region. The proposed MCLeaky neuron based Spiking Neural Network model and its quantized variant were benchmarked against state-of-the-art (SOTA) Spiking LSTMs to perform human stress detection, by comparing compute requirements, latency and real-world performances on unseen data with models derived through Neural Architecture Search (NAS). Results show that networks with the MCLeaky activation neuron achieved a superior accuracy of 98.8% in detecting stress based on Electrodermal Activity (EDA) signals, better than any other investigated model, while using 20% fewer parameters on average. The MCLeaky neuron was also tested on various signals including EDA Wrist and Chest, Temperature, ECG, and combinations of them. A quantized MCLeaky model was also derived and validated to forecast its performance on hardware architectures, which resulted in 91.84% accuracy. The neurons were evaluated for multiple modalities of data towards stress detection, which resulted in energy savings of 25.12x to 39.20x and EDP gains of 52.37x to 81.9x over ANNs, while offering a best accuracy of 98.8% when compared with the rest of the SOTA implementations.

[LG-45] Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling

Link: https://arxiv.org/abs/2410.04988
Authors: Jasmine Bayrooti, Carl Henrik Ek, Amanda Prorok
Keywords: complex robot behavior, Learning complex robot, necessitates principled exploration, environment necessitates principled, complex robot
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this idea and enabling sample-efficient reinforcement learning. However, existing methods overlook a crucial aspect: the need for optimism to be informed by a belief connecting the reward and state. To address this, we propose a practical, theoretically grounded approach to optimistic exploration based on Thompson sampling. Our model structure is the first that allows for reasoning about joint uncertainty over transitions and rewards. We apply our method on a set of MuJoCo and VMAS continuous control tasks. Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration.

[LG-46] Safe Learning-Based Optimization of Model Predictive Control: Application to Battery Fast-Charging

Link: https://arxiv.org/abs/2410.04982
Authors: Sebastian Hirt, Andreas Höhl, Johannes Pohlodek, Joachim Schaeffer, Maik Pfefferkorn, Richard D. Braatz, Rolf Findeisen
Keywords: controlling complex nonlinear, Model predictive control, complex nonlinear systems, predictive control, suitable cost functions
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Notes: 7 pages, 4 figures, submitted to ACC 2025

Abstract:Model predictive control (MPC) is a powerful tool for controlling complex nonlinear systems under constraints, but often struggles with model uncertainties and the design of suitable cost functions. To address these challenges, we discuss an approach that integrates MPC with safe Bayesian optimization to optimize long-term closed-loop performance despite significant model-plant mismatches. By parameterizing the MPC stage cost function using a radial basis function network, we employ Bayesian optimization as a multi-episode learning strategy to tune the controller without relying on precise system models. This method mitigates conservativeness introduced by overly cautious soft constraints in the MPC cost function and provides probabilistic safety guarantees during learning, ensuring that safety-critical constraints are met with high probability. As a practical application, we apply our approach to fast charging of lithium-ion batteries, a challenging task due to the complicated battery dynamics and strict safety requirements, subject to the requirement to be implementable in real time. Simulation results demonstrate that, in the context of model-plant mismatch, our method reduces charging times compared to traditional MPC methods while maintaining safety. This work extends previous research by emphasizing closed-loop constraint satisfaction and offers a promising solution for enhancing performance in systems where model uncertainties and safety are critical concerns.

[LG-47] Collaboration! Towards Robust Neural Methods for Routing Problems NEURIPS2024

Link: https://arxiv.org/abs/2410.04968
Authors: Jianan Zhou, Yaoxin Wu, Zhiguang Cao, Wen Song, Jie Zhang, Zhiqi Shen
Keywords: vehicle routing problems, enjoying desirable efficiency, performance significantly deteriorates, neural VRP methods, severe robustness issues
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at NeurIPS 2024

Abstract:Despite enjoying desirable efficiency and reduced reliance on domain expertise, existing neural methods for vehicle routing problems (VRPs) suffer from severe robustness issues – their performance significantly deteriorates on clean instances with crafted perturbations. To enhance robustness, we propose an ensemble-based Collaborative Neural Framework (CNF) w.r.t. the defense of neural VRP methods, which is crucial yet underexplored in the literature. Given a neural VRP method, we adversarially train multiple models in a collaborative manner to synergistically promote robustness against attacks, while boosting standard generalization on clean instances. A neural router is designed to adeptly distribute training instances among models, enhancing overall load balancing and collaborative efficacy. Extensive experiments verify the effectiveness and versatility of CNF in defending against various attacks across different neural VRP methods. Notably, our approach also achieves impressive out-of-distribution generalization on benchmark instances.

[LG-48] Failure-Proof Non-Contrastive Self-Supervised Learning

Link: https://arxiv.org/abs/2410.04959
Authors: Emanuele Sansone, Tim Lebailly, Tinne Tuytelaars
Keywords: identify sufficient conditions, cluster and intracluster, intracluster collapses, occurring in non-contrastive, non-contrastive self-supervised learning
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We identify sufficient conditions to avoid known failure modes, including representation, dimensional, cluster and intracluster collapses, occurring in non-contrastive self-supervised learning. Based on these findings, we propose a principled design for the projector and loss function. We theoretically demonstrate that this design introduces an inductive bias that promotes learning representations that are both decorrelated and clustered without explicitly enforcing these properties, leading to improved generalization. To the best of our knowledge, this is the first solution that achieves robust training with respect to these failure modes while guaranteeing enhanced generalization performance in downstream tasks. We validate our theoretical findings on image datasets including SVHN, CIFAR10, CIFAR100 and ImageNet-100, and show that our solution, dubbed FALCON, outperforms existing feature decorrelation and cluster-based self-supervised learning methods in terms of generalization to clustering and linear classification tasks.

[LG-49] Detecting and Approximating Redundant Computational Blocks in Neural Networks

Link: https://arxiv.org/abs/2410.04941
Authors: Irene Cannistraci, Emanuele Rodolà, Bastian Rieck
Keywords: Deep neural networks, learn similar internal, Deep neural, learn similar, Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 pages, 10 figures, 7 tables

Abstract:Deep neural networks often learn similar internal representations, both across different models and within their own layers. While inter-network similarities have enabled techniques such as model stitching and merging, intra-network similarities present new opportunities for designing more efficient architectures. In this paper, we investigate the emergence of these internal similarities across different layers in diverse neural architectures, showing that similarity patterns emerge independently of the dataset used. We introduce a simple metric, Block Redundancy, to detect redundant blocks, providing a foundation for future architectural optimization methods. Building on this, we propose Redundant Blocks Approximation (RBA), a general framework that identifies and approximates one or more redundant computational blocks using simpler transformations. We show that the transformation \mathcal{T} between two representations can be efficiently computed in closed form, and that it suffices to replace the redundant blocks in the network. RBA reduces model parameters and time complexity while maintaining good performance. We validate our method on classification tasks in the vision domain using a variety of pretrained foundational models and datasets.
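
To make the idea of a closed-form transformation between two representations concrete, here is a minimal sketch that fits a linear map by ordinary least squares. The abstract does not say which family of transformations the authors use, so the linear least-squares choice, the function name `fit_linear_map`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_map(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the matrix T minimizing ||x @ T - y||_F (least squares).

    x: representations entering a block, shape (num_samples, dim_in)
    y: representations leaving the block, shape (num_samples, dim_out)
    """
    t, *_ = np.linalg.lstsq(x, y, rcond=None)
    return t

# Synthetic stand-in for a redundant block that acts like a linear map.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
true_t = rng.normal(size=(8, 8))
y = x @ true_t

t_hat = fit_linear_map(x, y)
# If the block is close to linear, x @ t_hat can stand in for it.
print(np.allclose(x @ t_hat, y))
```

In a real network, `x` and `y` would be activations collected before and after the candidate block, and the quality of the fit would indicate how safely the block can be replaced.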

[LG-50] Next state prediction gives rise to entangled yet compositional representations of objects

Link: https://arxiv.org/abs/2410.04940
Authors: Tankred Saanum, Luca M. Schulze Buschoff, Peter Dayan, Eric Schulz
Keywords: vast state spaces, combinatorially vast state, state spaces, humans to generalize, generalize across combinatorially
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Compositional representations are thought to enable humans to generalize across combinatorially vast state spaces. Models with learnable object slots, which encode information about objects in separate latent codes, have shown promise for this type of generalization but rely on strong architectural priors. Models with distributed representations, on the other hand, use overlapping, potentially entangled neural codes, and their ability to support compositional generalization remains underexplored. In this paper we examine whether distributed models can develop linearly separable representations of objects, like slotted models, through unsupervised training on videos of object interactions. We show that, surprisingly, models with distributed representations often match or outperform models with object slots in downstream prediction tasks. Furthermore, we find that linearly separable object representations can emerge without object-centric priors, with auxiliary objectives like next-state prediction playing a key role. Finally, we observe that distributed models’ object representations are never fully disentangled, even if they are linearly separable: Multiple objects can be encoded through partially overlapping neural populations while still being highly separable with a linear classifier. We hypothesize that maintaining partially shared codes enables distributed models to better compress object dynamics, potentially enhancing generalization.

[LG-51] Goal-Conditioned Terminal Value Estimation for Real-time and Multi-task Model Predictive Control

Link: https://arxiv.org/abs/2410.04929
Authors: Mitsuki Morita, Satoshi Yamamori, Satoshi Yagi, Norikazu Sugimoto, Jun Morimoto
Keywords: enables nonlinear feedback, nonlinear feedback control, optimal control problem, MPC enables nonlinear, significantly large
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes: 16 pages, 9 figures

Abstract:While MPC enables nonlinear feedback control by solving an optimal control problem at each timestep, the computational burden tends to be significantly large, making it difficult to optimize a policy within the control period. To address this issue, one possible approach is to utilize terminal value learning to reduce computational costs. However, the learned value cannot be used for other tasks in situations where the task dynamically changes in the original MPC setup. In this study, we develop an MPC framework with goal-conditioned terminal value learning to achieve multitask policy optimization while reducing computational time. Furthermore, by using a hierarchical control structure that allows the upper-level trajectory planner to output appropriate goal-conditioned trajectories, we demonstrate that a robot model is able to generate diverse motions. We evaluate the proposed method on a bipedal inverted pendulum robot model and confirm that combining goal-conditioned terminal value learning with an upper-level trajectory planner enables real-time control; thus, the robot successfully tracks a target trajectory on sloped terrain.

[LG-52] Defense-as-a-Service: Black-box Shielding against Backdoored Graph Models

Link: https://arxiv.org/abs/2410.04916
Authors: Xiao Yang, Kai Zhou, Yuni Lai, Gaolei Li
Keywords: deliver business services, large graph learning, business owners tend, trend of large, tend to employ
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:With the trend of large graph learning models, business owners tend to employ a model provided by a third party to deliver business services to users. However, these models might be backdoored, and malicious users can submit trigger-embedded inputs to manipulate the model predictions. Current graph backdoor defenses have several limitations: 1) depending on model-related details, 2) requiring additional model fine-tuning, and 3) relying upon extra explainability tools, all of which are infeasible under stringent privacy policies. To address those limitations, we propose GraphProt, which allows resource-constrained business owners to rely on third parties to avoid backdoor attacks on GNN-based graph classifiers. Our GraphProt is model-agnostic and only relies on the input graph. The key insight is to leverage subgraph information for prediction, thereby mitigating backdoor effects induced by triggers. GraphProt comprises two components: clustering-based trigger elimination and robust subgraph ensemble. Specifically, we first propose feature-topology clustering that aims to remove most of the anomalous subgraphs (triggers). Moreover, we design subgraph sampling strategies based on feature-topology clustering to build a robust classifier via majority vote. Experimental results across three backdoor attacks and six benchmark datasets demonstrate that GraphProt significantly reduces the backdoor attack success rate while preserving the model accuracy on regular graph classification tasks.

[LG-53] Low-Rank Continual Personalization of Diffusion Models

Link: https://arxiv.org/abs/2410.04891
Authors: Łukasz Staniszewski, Katarzyna Zaleska, Kamil Deja
Keywords: generate new concepts, fine-tuning pre-trained models, Recent personalization methods, Dreambooth, diffusion models
Categories: Machine Learning (cs.LG)

Abstract:Recent personalization methods for diffusion models, such as Dreambooth, allow fine-tuning pre-trained models to generate new concepts. However, applying these techniques across multiple tasks in order to include, e.g., several new objects or styles, leads to mutual interference between their adapters. While recent studies attempt to mitigate this issue by combining trained adapters across tasks after fine-tuning, we adopt a more rigorous regime and investigate the personalization of large diffusion models under a continual learning scenario, where such interference leads to catastrophic forgetting of previous knowledge. To that end, we evaluate the naïve continual fine-tuning of customized models and compare this approach with three methods for consecutive adapters’ training: sequentially merging new adapters, merging orthogonally initialized adapters, and updating only relevant parameters according to the task. In our experiments, we show that the proposed approaches mitigate forgetting when compared to the naïve approach.

[LG-54] Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

Link: https://arxiv.org/abs/2410.04887
Authors: Arthur Jacot, Peter Súkeník, Zihan Wang, Marco Mondelli
Keywords: convergence consistently represent, highly symmetric geometric, symmetric geometric structure, geometric structure referred, Deep neural networks
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes: 29 pages, 5 figures

Abstract:Deep neural networks (DNNs) at convergence consistently represent the training data in the last layer via a highly symmetric geometric structure referred to as neural collapse. This empirical evidence has spurred a line of theoretical research aimed at proving the emergence of neural collapse, mostly focusing on the unconstrained features model. Here, the features of the penultimate layer are free variables, which makes the model data-agnostic and, hence, puts into question its ability to capture DNN training. Our work addresses the issue, moving away from unconstrained features and studying DNNs that end with at least two linear layers. We first prove generic guarantees on neural collapse that assume (i) low training error and balancedness of the linear layers (for within-class variability collapse), and (ii) bounded conditioning of the features before the linear part (for orthogonality of class-means, as well as their alignment with weight matrices). We then show that such assumptions hold for gradient descent training with weight decay: (i) for networks with a wide first layer, we prove low training error and balancedness, and (ii) for solutions that are either nearly optimal or stable under large learning rates, we additionally prove the bounded conditioning. Taken together, our results are the first to show neural collapse in the end-to-end training of DNNs.

[LG-55] Improving the Sampling Strategy in KernelSHAP

Link: https://arxiv.org/abs/2410.04883
Authors: Lars Henry Berge Olsen, Martin Jullum
Keywords: explaining predictions made, complex machine learning, machine learning models, popular model-agnostic explanation, Shapley
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Shapley values are a popular model-agnostic explanation framework for explaining predictions made by complex machine learning models. The framework provides feature contribution scores that sum to the predicted response and represent each feature’s importance. The computation of exact Shapley values is computationally expensive due to estimating an exponential amount of non-trivial conditional expectations. The KernelSHAP framework enables us to approximate the Shapley values using a sampled subset of weighted conditional expectations. We propose three main novel contributions: a stabilizing technique to reduce the variance of the weights in the current state-of-the-art strategy, a novel weighing scheme that corrects the Shapley kernel weights based on sampled subsets, and a straightforward strategy that includes the important subsets and integrates them with the corrected Shapley kernel weights. We compare these new approximation strategies against existing ones by evaluating their Shapley value accuracy as a function of the number of subsets. The results demonstrate that our sampling strategies significantly enhance the accuracy of the approximated Shapley value explanations, making them more reliable in practical applications. This work provides valuable insights and practical recommendations for researchers and practitioners seeking to implement Shapley value-based explainability of their models.
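
For readers unfamiliar with the Shapley kernel weights mentioned in this abstract, the sketch below computes the standard KernelSHAP weight for a coalition of size s out of M features, (M - 1) / (C(M, s) · s · (M - s)), together with the implied probability of sampling each coalition size. It shows only the baseline weights that the paper sets out to improve; the paper's stabilized and corrected weighting schemes are not reproduced, and the function names are hypothetical.

```python
from math import comb

def shapley_kernel_weight(num_features: int, coalition_size: int) -> float:
    """Standard Shapley kernel weight for one coalition of the given size."""
    m, s = num_features, coalition_size
    if s <= 0 or s >= m:
        # The empty and full coalitions have infinite weight and are
        # usually handled as hard constraints instead.
        raise ValueError("coalition_size must satisfy 0 < size < num_features")
    return (m - 1) / (comb(m, s) * s * (m - s))

def size_sampling_probs(num_features: int) -> dict:
    """Probability of drawing each coalition size when sampling by kernel weight."""
    m = num_features
    # Total weight of all coalitions of size s is C(m, s) * weight(s).
    mass = {s: comb(m, s) * shapley_kernel_weight(m, s) for s in range(1, m)}
    total = sum(mass.values())
    return {s: v / total for s, v in mass.items()}

weights = {s: shapley_kernel_weight(5, s) for s in range(1, 5)}
probs = size_sampling_probs(5)
print(weights)
print(probs)
```

The weights are symmetric in s and M - s and largest for very small and very large coalitions, which is why sampling strategies tend to prioritize those subsets.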

[LG-56] On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Link: https://arxiv.org/abs/2410.04870
Authors: Bingrui Li, Wei Huang, Andi Han, Zhanpeng Zhou, Taiji Suzuki, Jun Zhu, Jianfei Chen
Keywords: Sign Gradient Descent, underlying optimization mechanisms, important problem, Adam, optimizer is widely
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: preprint

Abstract:The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to Adam’s complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer – consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer – on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.
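
As a reminder of what Sign Gradient Descent does, the sketch below applies the update w ← w − η · sign(∇L) to a toy quadratic loss. It is not the paper's two-layer transformer setting; the loss, the learning rate, and the helper name `sign_gd_step` are assumptions chosen only to show the update rule.

```python
import numpy as np

def sign_gd_step(w: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One SignGD update: move each coordinate by lr against the gradient sign."""
    return w - lr * np.sign(grad)

# Toy loss L(w) = ||w - target||^2 with gradient 2 * (w - target).
target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(100):
    grad = 2.0 * (w - target)
    w = sign_gd_step(w, grad, lr=0.05)

print(w)  # each coordinate ends within one step size of the target
```

Because only the sign of the gradient is used, every coordinate moves by exactly the learning rate regardless of gradient magnitude. Adam with both momentum coefficients set to zero and a negligible epsilon reduces to this update, which is why SignGD is commonly used as a simplified proxy for it.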

[LG-57] Mastering Chinese Chess AI (Xiangqi) Without Search

Link: https://arxiv.org/abs/2410.04865
Authors: Yu Chen, Juntong Lin, Zhichao Shu
Keywords: high-performance Chinese Chess, Carlo Tree Search, Monte Carlo Tree, Chinese Chess, developed a high-performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We have developed a high-performance Chinese Chess AI that operates without reliance on search algorithms. This AI has demonstrated the capability to compete at a level commensurate with the top 0.1% of human players. By eliminating the search process typically associated with such systems, this AI achieves a Queries Per Second (QPS) rate that exceeds those of systems based on the Monte Carlo Tree Search (MCTS) algorithm by over a thousandfold and surpasses those based on the AlphaBeta pruning algorithm by more than a hundredfold. The AI training system consists of two parts: supervised learning and reinforcement learning. Supervised learning provides an initial human-like Chinese chess AI, while reinforcement learning, building on supervised learning, elevates the strength of the entire AI to a new level. Based on this training system, we carried out ablation experiments and found that: (1) at the same parameter count, the Transformer architecture outperforms a CNN on Chinese chess; (2) using the possible moves of both sides as features greatly improves the training process; (3) a selective opponent pool, compared to pure self-play training, results in a faster improvement curve and a higher strength limit; and (4) Value Estimation with Cutoff (VECT) improves the training process of the original PPO algorithm, for which we give an explanation.

[LG-58] Unsupervised Skill Discovery for Robotic Manipulation through Automatic Task Generation

Link: https://arxiv.org/abs/2410.04855
Authors: Paul Jansonnie, Bingbing Wu, Julien Perez, Jan Peters
Keywords: manipulation tasks, major importance, Hierarchical Reinforcement Learning, unseen manipulation tasks, Skill Learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at the 2024 IEEE-RAS International Conference on Humanoid Robots

Abstract:Learning skills that interact with objects is of major importance for robotic manipulation. These skills can indeed serve as an efficient prior for solving various manipulation tasks. We propose a novel Skill Learning approach that discovers composable behaviors by solving a large and diverse number of autonomously generated tasks. Our method learns skills allowing the robot to consistently and robustly interact with objects in its environment. The discovered behaviors are embedded in primitives which can be composed with Hierarchical Reinforcement Learning to solve unseen manipulation tasks. In particular, we leverage Asymmetric Self-Play to discover behaviors and Multiplicative Compositional Policies to embed them. We compare our method to Skill Learning baselines and find that our skills are more interactive. Furthermore, the learned skills can be used to solve a set of unseen manipulation tasks, in simulation as well as on a real robotic platform.

[LG-59] TimeCNN: Refining Cross-Variable Interaction on Time Point for Time Series Forecasting

Link: https://arxiv.org/abs/2410.04853
Authors: Ao Hu, Dongkai Wang, Yong Dai, Shiyi Qi, Liangjian Wen, Jun Wang, Zhi Chen, Xun Zhou, Zenglin Xu, Jiang Duan
Keywords: diverse domains, Time series forecasting, extensively applied, applied across diverse, Time
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Time series forecasting is extensively applied across diverse domains. Transformer-based models demonstrate significant potential in modeling cross-time and cross-variable interaction. However, we notice that the cross-variable correlation of multivariate time series is multifaceted (with both positive and negative correlations) and evolves dynamically over time, which is not well captured by existing Transformer-based models. To address this issue, we propose a TimeCNN model to refine cross-variable interactions to enhance time series forecasting. Its key innovation is timepoint independence: each time point has an independent convolution kernel, allowing each time point to have its own model for capturing relationships among variables. This approach effectively handles both positive and negative correlations and adapts to the evolving nature of variable relationships over time. Extensive experiments conducted on 12 real-world datasets demonstrate that TimeCNN consistently outperforms state-of-the-art models. Notably, our model achieves significant reductions in computational requirements (approximately 60.46%) and parameter count (about 57.50%), while delivering inference speeds 3 to 4 times faster than the benchmark iTransformer model.

[LG-60] Strong Model Collapse

Link: https://arxiv.org/abs/2410.04840
Authors: Elvis Dohmatob, Yunzhen Feng, Julia Kempe
Keywords: scaling laws paradigm, supervised regression setting, ChatGPT and Llama, critical performance degradation, performance degradation due
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.

[LG-61] Multimodal Fusion Strategies for Mapping Biophysical Landscape Features ECCV2024

Link: https://arxiv.org/abs/2410.04833
Authors: Lucia Gordon,Nico Lang,Catherine Ressijac,Andrew Davies
Keywords: Multimodal aerial data, monitor natural systems, Multimodal aerial, natural systems, ecology and conservation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 9 pages, 4 figures, ECCV 2024 Workshop in CV for Ecology

Abstract:Multimodal aerial data are used to monitor natural systems, and machine learning can significantly accelerate the classification of landscape features within such imagery to benefit ecology and conservation. It remains under-explored, however, how these multiple modalities ought to be fused in a deep learning model. As a step towards filling this gap, we study three strategies (Early fusion, Late fusion, and Mixture of Experts) for fusing thermal, RGB, and LiDAR imagery using a dataset of spatially-aligned orthomosaics in these three modalities. In particular, we aim to map three ecologically-relevant biophysical landscape features in African savanna ecosystems: rhino middens, termite mounds, and water. The three fusion strategies differ in whether the modalities are fused early or late, and if late, whether the model learns fixed weights per modality for each class or generates weights for each class adaptively, based on the input. Overall, the three methods have similar macro-averaged performance with Late fusion achieving an AUC of 0.698, but their per-class performance varies strongly, with Early fusion achieving the best recall for middens and water and Mixture of Experts achieving the best recall for mounds.
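The difference between the two late-fusion variants described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the score layout (`scores[m][c]` for modality `m` and class `c`), and the idea that the gating logits come from some input-dependent network are all assumptions made for illustration. Early fusion would instead stack the thermal, RGB, and LiDAR channels before a single model sees them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(scores, weights):
    """Late fusion with fixed weights.

    scores[m][c]  : class-c score predicted from modality m alone
    weights[m][c] : weight per modality and class (hypothetical; learned once,
                    identical for every input)
    """
    n_modalities, n_classes = len(scores), len(scores[0])
    return [
        sum(weights[m][c] * scores[m][c] for m in range(n_modalities))
        for c in range(n_classes)
    ]


def mixture_of_experts(scores, gate_logits):
    """Mixture-of-Experts style fusion with input-dependent weights.

    gate_logits[c][m] : logit for modality m and class c; assumed to be produced
                        by a gating network from the current input
    """
    n_modalities, n_classes = len(scores), len(scores[0])
    fused = []
    for c in range(n_classes):
        w = softmax(gate_logits[c])  # weights over modalities sum to 1 per class
        fused.append(sum(w[m] * scores[m][c] for m in range(n_modalities)))
    return fused
```

With uniform fixed weights and all-zero gating logits the two functions agree; they diverge once the gate starts favoring one modality for a particular input.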

[LG-62] Taming Gradient Oversmoothing and Expansion in Graph Neural Networks

Link: https://arxiv.org/abs/2410.04824
Authors: MoonJeong Park,Dongwoo Kim
Keywords: graph neural networks, multi-layered graph neural, neural networks, primary bottleneck, bottleneck for multi-layered
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Oversmoothing has been claimed as a primary bottleneck for multi-layered graph neural networks (GNNs). Multiple analyses have examined how and why oversmoothing occurs. However, none of the prior work addressed how optimization is performed under the oversmoothing regime. In this work, we show the presence of gradient oversmoothing, which prevents optimization during training. We further show that GNNs with residual connections, a well-known solution to help gradient flow in deep architectures, introduce gradient expansion, a phenomenon in which the gradient explodes in diverse directions. Therefore, adding residual connections cannot be a solution for making a GNN deep. Our analysis reveals that constraining the Lipschitz bound of each layer can neutralize the gradient expansion. To this end, we provide a simple yet effective normalization method to prevent the gradient expansion. An empirical study shows that residual GNNs with hundreds of layers can be efficiently trained with the proposed normalization without compromising performance. Additional studies show that the empirical observations corroborate our theoretical analysis.
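The abstract does not spell out the proposed normalization, so the following is only a generic sketch of what constraining a layer's Lipschitz bound can look like, not the paper's method. It relies on the fact that the spectral norm of a matrix never exceeds its Frobenius norm, so rescaling a weight matrix to Frobenius norm at most `bound` makes the linear map `x -> W x` at most `bound`-Lipschitz in the L2 sense. The function names and the default bound are assumptions.

```python
import math


def frobenius_norm(W):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))


def clip_lipschitz(W, bound=1.0):
    """Rescale W so that its Frobenius norm is at most `bound`.

    Because spectral_norm(W) <= frobenius_norm(W), the rescaled linear layer
    is guaranteed to be `bound`-Lipschitz. Matrices already within the bound
    are returned unchanged (as a copy).
    """
    norm = frobenius_norm(W)
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [[w * scale for w in row] for row in W]
```

The Frobenius bound is conservative; a tighter variant would estimate the spectral norm directly (for example by power iteration) at extra cost.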

[LG-63] Physics-Informed GNN for non-linear constrained optimization: PINCO a solver for the AC-optimal power flow

Link: https://arxiv.org/abs/2410.04818
Authors: Anna Varbella,Damien Briens,Blazhe Gjorgiev,Giuseppe Alessio D’Inverno,Giovanni Sansavini
Keywords: electric power grid, intermittent power sources, driving the integration, shares of intermittent, power
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:The energy transition is driving the integration of large shares of intermittent power sources in the electric power grid. Therefore, addressing the AC optimal power flow (AC-OPF) effectively becomes increasingly essential. The AC-OPF, which is a fundamental optimization problem in power systems, must be solved more frequently to ensure the safe and cost-effective operation of power systems. Due to its non-linear nature, AC-OPF is often solved in its linearized form, despite inherent inaccuracies. Non-linear solvers, such as the interior point method, are typically employed to solve the full OPF problem. However, these iterative methods may not converge for large systems and do not guarantee global optimality. This work explores a physics-informed graph neural network, PINCO, to solve the AC-OPF. We demonstrate that this method provides accurate solutions in a fraction of the computational time when compared to the established non-linear programming solvers. Remarkably, PINCO generalizes effectively across a diverse set of loading conditions in the power system. We show that our method can solve the AC-OPF without violating inequality constraints. Furthermore, it can function both as a solver and as a hybrid universal function approximator. Moreover, the approach can be easily adapted to different power systems with minimal adjustments to the hyperparameters, including systems with multiple generators at each bus. Overall, this work demonstrates an advancement in the field of power system optimization to tackle the challenges of the energy transition. The code and data utilized in this paper are available at this https URL.

[LG-64] Learning Interpretable Hierarchical Dynamical Systems Models from Time Series Data

Link: https://arxiv.org/abs/2410.04814
Authors: Manuel Brenner,Elias Weber,Georgia Koppe,Daniel Durstewitz
Keywords: observed time series, interested in obtaining, obtaining a generative, generative model, time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
Notes: Preprint

Abstract:In science, we are often interested in obtaining a generative model of the underlying system dynamics from observed time series. While powerful methods for dynamical systems reconstruction (DSR) exist when data come from a single domain, how to best integrate data from multiple dynamical regimes and leverage it for generalization is still an open question. This becomes particularly important when individual time series are short, and group-level information may help to fill in gaps in single-domain data. At the same time, averaging is not an option in DSR, as it will wipe out crucial dynamical properties (e.g., limit cycles in one domain vs. chaos in another). Hence, a framework is needed that efficiently harvests group-level (multi-domain) information while retaining all single-domain dynamical characteristics. Here we provide such a hierarchical approach and showcase it on popular DSR benchmarks, as well as on neuroscientific and medical time series. In addition to faithful reconstruction of all individual dynamical regimes, our unsupervised methodology discovers common low-dimensional feature spaces in which datasets with similar dynamics cluster. The features spanning these spaces were, moreover, highly interpretable dynamically and, surprisingly, often in linear relation to the control parameters that govern the dynamics of the underlying system. Finally, we illustrate transfer learning and generalization to new parameter regimes.

[LG-65] FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

Link: https://arxiv.org/abs/2410.04810
Authors: Haokun Chen,Hang Li,Yao Zhang,Gengyuan Zhang,Jinhe Bi,Philip Torr,Jindong Gu,Denis Krompass,Volker Tresp
Keywords: machine learning paradigm, decentralized machine learning, One-Shot Federated Learning, special decentralized machine, learning paradigm
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Multimedia (cs.MM)

Abstract:One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM’s pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. Hereby, FedBiP synthesizes images following the client’s local data distribution without compromising the privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with feature space heterogeneity, as well as on challenging medical and satellite image datasets with label heterogeneity. The results demonstrate the effectiveness of FedBiP, which substantially outperforms other OSFL methods.

[LG-66] Timer-XL: Long-Context Transformers for Unified Time Series Forecasting

Link: https://arxiv.org/abs/2410.04803
Authors: Yong Liu,Guo Qin,Xiangdong Huang,Jianmin Wang,Mingsheng Long
Keywords: time series, series, generative Transformer, time, token prediction
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We present Timer-XL, a generative Transformer for unified time series forecasting. To uniformly predict 1D and 2D time series, we generalize next token prediction, predominantly adopted for causal generation of 1D sequences, to multivariate next token prediction. The proposed paradigm uniformly formulates various forecasting scenarios as a long-context generation problem. We opt for the generative Transformer, which can capture global-range and causal dependencies while providing contextual flexibility, to implement unified forecasting on univariate series characterized by non-stationarity, multivariate time series with complicated dynamics and correlations, and covariate-informed contexts that include both endogenous and exogenous variables. Technically, we propose a universal TimeAttention to facilitate generative Transformers on time series, which can effectively capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches) and is further strengthened by position embeddings in both temporal and variable dimensions. Timer-XL achieves state-of-the-art performance across challenging forecasting benchmarks through a unified approach. As a large time series model, it demonstrates notable model transferability by large-scale pre-training, as well as contextual flexibility in token lengths, positioning it as a one-for-all forecaster.

[LG-67] Building Damage Assessment in Conflict Zones: A Deep Learning Approach Using Geospatial Sub-Meter Resolution Data

Link: https://arxiv.org/abs/2410.04802
Authors: Matteo Risso,Alessia Goffi,Beatrice Alessandra Motetti,Alessio Burrello,Jean Baptiste Bove,Enrico Macii,Massimo Poncino,Daniele Jahier Pagliari,Giuseppe Maffeis
Keywords: geospatial image analysis, Deep Neural Networks, Convolutional Neural Networks, anthropogenic crises, High Resolution
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: This paper has been accepted for publication in the Sixth IEEE International Conference on Image Processing Applications and Systems 2024 copyright IEEE

Abstract:Very High Resolution (VHR) geospatial image analysis is crucial for humanitarian assistance in both natural and anthropogenic crises, as it allows rapid identification of the most critical areas that need support. Nonetheless, manually inspecting large areas is time-consuming and requires domain expertise. Thanks to their accuracy, generalization capabilities, and highly parallelizable workload, Deep Neural Networks (DNNs) provide an excellent way to automate this task. Nevertheless, there is a scarcity of VHR data pertaining to conflict situations, and consequently, of studies on the effectiveness of DNNs in those scenarios. Motivated by this, our work extensively studies the applicability to a war scenario of a collection of state-of-the-art Convolutional Neural Networks (CNNs) originally developed for natural-disaster damage assessment. To this end, we build an annotated dataset with pre- and post-conflict images of the Ukrainian city of Mariupol. We then explore the transferability of the CNN models in both zero-shot and learning scenarios, demonstrating their potential and limitations. To the best of our knowledge, this is the first study to use sub-meter resolution imagery to assess building damage in combat zones.

[LG-68] Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering ACCV2024

Link: https://arxiv.org/abs/2410.04801
Authors: Kazumoto Nakamura,Yuji Nozawa,Yu-Chieh Lin,Kengo Nakata,Youyang Ng
Keywords: pretrained Vision Transformer, Vision Transformer, pretrained Vision, Transformer, attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted to ACCV 2024

Abstract:The goal of this paper is to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, on the image clustering task without requiring re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears in the patches of multi-head attention. We observe that this anomaly leads to reduced accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address these artifacts, we propose an approach called Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by investigating one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate their corresponding attention values inside the pretrained models. ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space. Our findings highlight the potential of ITAE as a practical solution for reducing artifacts in pretrained ViT models and improving model performance in clustering tasks without the need for re-training or fine-tuning.

[LG-69] Fast Training of Sinusoidal Neural Fields via Scaling Initialization

Link: https://arxiv.org/abs/2410.04779
Authors: Taesun Yeom,Sangyoon Lee,Jaeho Lee
Keywords: continuous functions parameterized, Neural fields, emerging paradigm, paradigm that represent, continuous functions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Neural fields are an emerging paradigm that represents data as continuous functions parameterized by neural networks. Despite many advantages, neural fields often have a high training cost, which prevents broader adoption. In this paper, we focus on a popular family of neural fields, called sinusoidal neural fields (SNFs), and study how they should be initialized to maximize the training speed. We find that the standard initialization scheme for SNFs (designed based on the signal propagation principle) is suboptimal. In particular, we show that by simply multiplying each weight (except for the last layer) by a constant, we can accelerate SNF training by 10×. This method, coined weight scaling, consistently provides a significant speedup over various data domains, allowing the SNFs to train faster than more recently proposed architectures. To understand why the weight scaling works well, we conduct extensive theoretical and empirical analyses which reveal that the weight scaling not only resolves the spectral bias quite effectively but also enjoys a well-conditioned optimization trajectory.
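The weight-scaling idea stated in the abstract (multiply every weight except those of the last layer by a constant) is simple enough to sketch. The version below is an illustration under assumptions, not the authors' code: `siren_init` uses the commonly cited SIREN-style uniform bounds as a stand-in for "the standard initialization scheme", and the layer sizes, `omega0`, and the scaling constant `alpha` are hypothetical values. The abstract does not say which constant yields the reported speedup.

```python
import math
import random


def siren_init(layer_sizes, omega0=30.0, seed=0):
    """SIREN-style uniform initialization (assumed baseline, for illustration).

    First layer: U(-1/fan_in, 1/fan_in)
    Other layers: U(-sqrt(6/fan_in)/omega0, sqrt(6/fan_in)/omega0)
    Returns a list of weight matrices, each of shape [fan_out][fan_in].
    """
    rng = random.Random(seed)
    weights = []
    for i, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        bound = 1.0 / fan_in if i == 0 else math.sqrt(6.0 / fan_in) / omega0
        weights.append(
            [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]
        )
    return weights


def scale_weights(weights, alpha):
    """Multiply every weight by `alpha`, leaving the last layer untouched."""
    last = len(weights) - 1
    scaled = []
    for i, W in enumerate(weights):
        factor = 1.0 if i == last else alpha
        scaled.append([[w * factor for w in row] for row in W])
    return scaled


# Hypothetical usage: a 2 -> 64 -> 64 -> 1 field with an arbitrary alpha.
base = siren_init([2, 64, 64, 1])
init = scale_weights(base, alpha=2.0)
```

Biases are omitted here for brevity; whether and how they are scaled is not stated in the abstract.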

[LG-70] OmniBuds: A Sensory Earable Platform for Advanced Bio-Sensing and On-Device Machine Learning

Link: https://arxiv.org/abs/2410.04775
Authors: Alessandro Montanari,Ashok Thangarajan,Khaldoon Al-Naimi,Andrea Ferlini,Yang Liu,Ananta Narayanan Balaji,Fahim Kawsar
Keywords: basic audio enhancement, clinical-grade health monitoring, audio enhancement devices, sensory earable platform, wellbeing management
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Sensory earables have evolved from basic audio enhancement devices into sophisticated platforms for clinical-grade health monitoring and wellbeing management. This paper introduces OmniBuds, an advanced sensory earable platform integrating multiple biosensors and onboard computation powered by a machine learning accelerator, all within a real-time operating system (RTOS). The platform’s dual-ear symmetric design, equipped with precisely positioned kinetic, acoustic, optical, and thermal sensors, enables highly accurate and real-time physiological assessments. Unlike conventional earables that rely on external data processing, OmniBuds leverage real-time onboard computation to significantly enhance system efficiency, reduce latency, and safeguard privacy by processing data locally. This capability includes executing complex machine learning models directly on the device. We provide a comprehensive analysis of OmniBuds’ design, hardware and software architecture demonstrating its capacity for multi-functional applications, accurate and robust tracking of physiological parameters, and advanced human-computer interaction.

[LG-71] Granular Ball Twin Support Vector Machine

Link: https://arxiv.org/abs/2410.04774
Authors: A. Quadir,M. Sajid,M. Tanveer
Keywords: Nonparametric Maximum Likelihood, Maximum Likelihood Estimator, Mixture ModelsTwin support, Efficient and Scalable, Scalable Computation
Categories: Machine Learning (cs.LG)
Notes: Manuscript submitted to IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS: 19 September 2023; revised 13 February 2024 and 14 July 2024; accepted 05 October 2024

Abstract:Twin support vector machine (TSVM) is an emerging machine learning model with versatile applicability in classification and regression endeavors. Nevertheless, TSVM confronts noteworthy challenges: (i) the imperative demand for matrix inversions presents formidable obstacles to its efficiency and applicability on large-scale datasets; (ii) the omission of the structural risk minimization (SRM) principle in its primal formulation heightens the vulnerability to overfitting risks; and (iii) the TSVM exhibits a high susceptibility to noise and outliers, and also demonstrates instability when subjected to resampling. In view of the aforementioned challenges, we propose the granular ball twin support vector machine (GBTSVM). GBTSVM takes granular balls, rather than individual data points, as inputs to construct a classifier. These granular balls, characterized by their coarser granularity, exhibit robustness to resampling and reduced susceptibility to the impact of noise and outliers. We further propose a novel large-scale granular ball twin support vector machine (LS-GBTSVM). LS-GBTSVM’s optimization formulation ensures two critical facets: (i) it eliminates the need for matrix inversions, streamlining the LS-GBTSVM’s computational efficiency, and (ii) it incorporates the SRM principle through the incorporation of regularization terms, effectively addressing the issue of overfitting. The proposed LS-GBTSVM exemplifies efficiency, scalability for large datasets, and robustness against noise and outliers. We conduct a comprehensive evaluation of the GBTSVM and LS-GBTSVM models on benchmark datasets from UCI, KEEL, and NDC. Our experimental findings and statistical analyses affirm the superior generalization prowess of the proposed GBTSVM and LS-GBTSVM models.

[LG-72] From Transparency to Accountability and Back: A Discussion of Access and Evidence in AI Auditing

Link: https://arxiv.org/abs/2410.04772
Authors: Sarah H. Cen,Rohan Alur
Keywords: undeclared side effects, Artificial intelligence, raising widespread concern, raising widespread, side effects
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 23 pages, 1 table

Abstract:Artificial intelligence (AI) is increasingly intervening in our lives, raising widespread concern about its unintended and undeclared side effects. These developments have brought attention to the problem of AI auditing: the systematic evaluation and analysis of an AI system, its development, and its behavior relative to a set of predetermined criteria. Auditing can take many forms, including pre-deployment risk assessments, ongoing monitoring, and compliance testing. It plays a critical role in providing assurances to various AI stakeholders, from developers to end users. Audits may, for instance, be used to verify that an algorithm complies with the law, is consistent with industry standards, and meets the developer’s claimed specifications. However, there are many operational challenges to AI auditing that complicate its implementation. In this work, we examine a key operational issue in AI auditing: what type of access to an AI system is needed to perform a meaningful audit? Addressing this question has direct policy relevance, as it can inform AI audit guidelines and requirements. We begin by discussing the factors that auditors balance when determining the appropriate type of access, and unpack the benefits and drawbacks of four types of access. We conclude that, at minimum, black-box access (providing query access to a model without exposing its internal implementation) should be granted to auditors, as it balances concerns related to trade secrets, data privacy, audit standardization, and audit efficiency. We then suggest a framework for determining how much further access (in addition to black-box access) to grant auditors. We show that auditing can be cast as a natural hypothesis test, draw parallels between hypothesis testing and legal procedure, and argue that this framing provides clear and interpretable guidance on audit implementation.
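To make the "audit as hypothesis test" framing concrete, here is a minimal sketch of a black-box audit. It is one possible instantiation chosen for illustration, not the framework from the paper: the null hypothesis is that the model's violation rate is at most a claimed threshold `p0`, the auditor only queries the model, and a one-sided exact binomial test decides whether the observed violations are inconsistent with the claim. The names `audit`, `is_violation`, `p0`, and `alpha` are hypothetical.

```python
from math import comb


def binomial_tail(n, k, p0):
    """P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def audit(model, queries, is_violation, p0, alpha=0.05):
    """Black-box audit: only model outputs on chosen queries are observed.

    Null hypothesis: the true violation rate is at most p0 (the claim holds).
    The null is rejected when seeing this many violations would be unlikely
    (p-value below alpha) if the claim were true.
    """
    violations = sum(1 for q in queries if is_violation(q, model(q)))
    p_value = binomial_tail(len(queries), violations, p0)
    return {
        "violations": violations,
        "p_value": p_value,
        "reject_null": p_value < alpha,
    }
```

As in a legal setting, the burden of proof sits with the party trying to overturn the default assumption: the audit flags the system only when the evidence against the claim is strong enough at level `alpha`.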

[LG-73] Double Oracle Neural Architecture Search for Game Theoretic Deep Learning Models

Link: https://arxiv.org/abs/2410.04764
Authors: Aye Phyu Phyu Aung,Xinrun Wang,Ruiyu Wang,Hau Chan,Bo An,Xiaoli Li,J. Senthilnath
Keywords: including Generative Adversarial, Generative Adversarial Networks, concepts including Generative, train deep learning, Generative Adversarial
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:In this paper, we propose a new approach to train deep learning models using game theory concepts including Generative Adversarial Networks (GANs) and Adversarial Training (AT) where we deploy a double-oracle framework using best response oracles. GAN is essentially a two-player zero-sum game between the generator and the discriminator. The same concept can be applied to AT with attacker and classifier as players. Training these models is challenging as a pure Nash equilibrium may not exist and even finding the mixed Nash equilibrium is difficult as training algorithms for both GAN and AT have a large-scale strategy space. Extending our preliminary model DO-GAN, we propose the methods to apply the double oracle framework concept to Adversarial Neural Architecture Search (NAS for GAN) and Adversarial Training (NAS for AT) algorithms. We first generalize the players’ strategies as the trained models of generator and discriminator from the best response oracles. We then compute the meta-strategies using a linear program. For scalability of the framework where multiple network models of best responses are stored in the memory, we prune the weakly-dominated players’ strategies to keep the oracles from becoming intractable. Finally, we conduct experiments on MNIST, CIFAR-10 and TinyImageNet for DONAS-GAN. We also evaluate the robustness under FGSM and PGD attacks on CIFAR-10, SVHN and TinyImageNet for DONAS-AT. We show that all our variants have significant improvements in both subjective qualitative evaluation and quantitative metrics, compared with their respective base architectures.

[LG-74] Item Cluster-aware Prompt Learning for Session-based Recommendation

Link: https://arxiv.org/abs/2410.04756
Authors: Wooseong Yang,Chen Wang,Zihe Song,Weizhi Zhang,Philip S. Yu
Keywords: dynamic user preferences, capture dynamic user, analyzing item sequences, dynamic user, user preferences
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 9 pages

Abstract:Session-based recommendation (SBR) aims to capture dynamic user preferences by analyzing item sequences within individual sessions. However, most existing approaches focus mainly on intra-session item relationships, neglecting the connections between items across different sessions (inter-session relationships), which limits their ability to fully capture complex item interactions. While some methods incorporate inter-session information, they often suffer from high computational costs, leading to longer training times and reduced efficiency. To address these challenges, we propose the CLIP-SBR (Cluster-aware Item Prompt learning for Session-Based Recommendation) framework. CLIP-SBR is composed of two modules: 1) an item relationship mining module that builds a global graph to effectively model both intra- and inter-session relationships, and 2) an item cluster-aware prompt learning module that uses soft prompts to integrate these relationships into SBR models efficiently. We evaluate CLIP-SBR across eight SBR models and three benchmark datasets, consistently demonstrating improved recommendation performance and establishing CLIP-SBR as a robust solution for session-based recommendation tasks.

[LG-75] ImProver: Agent-Based Automated Proof Optimization

Link: https://arxiv.org/abs/2410.04753
Authors: Riyaz Ahuja,Jeremy Avigad,Prasad Tetali,Sean Welleck
Keywords: Large language models, Large language, generate formal proofs, language models, generate formal
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Notes: 19 pages, 21 figures

Abstract:Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

[LG-76] Smart energy management: process structure-based hybrid neural networks for optimal scheduling and economic predictive control in integrated systems

Link: https://arxiv.org/abs/2410.04743
Authors: Long Wu,Xunyuan Yin,Lei Pan,Jinfeng Liu (University of Alberta)
Keywords: Integrated energy systems, spanning multiple domains, Integrated energy, complex systems consisting, units spanning multiple
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Integrated energy systems (IESs) are complex systems consisting of diverse operating units spanning multiple domains. To address their operational challenges, we propose a physics-informed hybrid time-series neural network (NN) surrogate to predict the dynamic performance of IESs across multiple time scales. This neural network-based modeling approach develops time-series multi-layer perceptrons (MLPs) for the operating units and integrates them with prior process knowledge about system structure and fundamental dynamics. This integration forms three hybrid NNs (long-term, slow, and fast MLPs) that predict the entire system dynamics across multiple time scales. Leveraging these MLPs, we design an NN-based scheduler and an NN-based economic model predictive control (NEMPC) framework to meet global operational requirements: rapid electrical power responsiveness to operators' requests, adequate cooling supply to customers, and increased system profitability, while addressing the dynamic time-scale multiplicity present in IESs. The proposed day-ahead scheduler is formulated using the ReLU network-based MLP, which effectively represents IES performance under a broad range of conditions from a long-term perspective. The scheduler is then exactly recast into a mixed-integer linear programming problem for efficient evaluation. The real-time NEMPC, based on slow and fast MLPs, comprises two sequential distributed control agents: a slow NEMPC for the cooling-dominant subsystem with slower transient responses and a fast NEMPC for the power-dominant subsystem with faster responses. Extensive simulations demonstrate that the developed scheduler and NEMPC schemes outperform their respective benchmark scheduler and controller by about 25% and 40%. Together, they enhance overall system performance by over 70% compared to benchmark approaches.

[LG-77] Evaluating the Generalization Ability of Spatiotemporal Model in Urban Scenario

Link: https://arxiv.org/abs/2410.04740
Authors: Hongjun Wang,Jiyuan Chen,Tong Pan,Zheng Dong,Lingyu Zhang,Renhe Jiang,Xuan Song
Keywords: shown great promise, effectively capturing temporal, Spatiotemporal neural networks, spatial correlations, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)

Abstract:Spatiotemporal neural networks have shown great promise in urban scenarios by effectively capturing temporal and spatial correlations. However, urban environments are constantly evolving, and current model evaluations are often limited to traffic scenarios and mainly use data collected only a few weeks after the training period to evaluate model performance. The generalization ability of these models remains largely unexplored. To address this, we propose a Spatiotemporal Out-of-Distribution (ST-OOD) benchmark, which comprises six urban scenarios: bike-sharing, 311 services, pedestrian counts, traffic speed, traffic flow, and ride-hailing demand, each with in-distribution (same year) and out-of-distribution (next years) settings. We extensively evaluate state-of-the-art spatiotemporal models and find that their performance degrades significantly in out-of-distribution settings, with most models performing even worse than a simple Multi-Layer Perceptron (MLP). Our findings suggest that current leading methods tend to over-rely on parameters to overfit training data, which may lead to good performance on in-distribution data but often results in poor generalization. We also investigated whether dropout could mitigate the negative effects of overfitting. Our results showed that a slight dropout rate could significantly improve generalization performance on most datasets, with minimal impact on in-distribution performance. However, balancing in-distribution and out-of-distribution performance remains a challenging problem. We hope that the proposed benchmark will encourage further research on this critical issue.

[LG-78] TableRAG: Million-Token Table Understanding with Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.04739
Authors: Si-An Chen, Lesly Miculicich, Julian Martin Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister
Keywords: Recent advancements, language models, primarily through program-aided, advancements in language, notably enhanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG’s effectiveness at scale. Our results demonstrate that TableRAG’s retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.

[LG-79] TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Link: https://arxiv.org/abs/2410.04734
Authors: Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
Keywords: improving multimodal large, TLDR models, reward models, models, minimal information
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work done at Meta

Abstract: Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning only one binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations, and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can significantly speed up human annotation by 3 times to acquire a broader range of high-quality vision language data.

[LG-80] ProtoNAM: Prototypical Neural Additive Models for Interpretable Deep Tabular Learning

Link: https://arxiv.org/abs/2410.04723
Authors: Guangzhi Xiong, Sanchit Sinha, Aidong Zhang
Keywords: Generalized additive models, powerful white-box tool, Generalized additive, Prototypical Neural Additive, Neural Additive Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract: Generalized additive models (GAMs) have long been a powerful white-box tool for the intelligible analysis of tabular data, revealing the influence of each feature on the model predictions. Despite the success of neural networks (NNs) in various domains, their application as NN-based GAMs in tabular data analysis remains suboptimal compared to tree-based ones, and the opacity of encoders in NN-GAMs also prevents users from understanding how networks learn the functions. In this work, we propose a new deep tabular learning method, termed Prototypical Neural Additive Model (ProtoNAM), which introduces prototypes into neural networks in the framework of GAMs. With the introduced prototype-based feature activation, ProtoNAM can flexibly model the irregular mapping from tabular features to the outputs while maintaining the explainability of the final prediction. We also propose a gradient-boosting inspired hierarchical shape function modeling method, facilitating the discovery of complex feature patterns and bringing transparency into the learning process of each network layer. Our empirical evaluations demonstrate that ProtoNAM outperforms all existing NN-based GAMs, while providing additional insights into the shape function learned for each feature. The source code of ProtoNAM is available at this https URL.

[LG-81] A Strategy for Label Alignment in Deep Neural Networks

Link: https://arxiv.org/abs/2410.04722
Authors: Xuanrui Zeng
Keywords: demonstrated successful application, recent research demonstrated, research demonstrated successful, linear regression settings, label alignment property
Categories: Machine Learning (cs.LG)

Abstract: A recent study demonstrated a successful application of the label alignment property to unsupervised domain adaptation in a linear regression setting. Instead of regularizing representation learning to be domain invariant, that study proposed regularizing the linear regression model to align with the top singular vectors of the data matrix from the target domain. In this work we expand upon this idea and generalize it to the case of deep learning, where we derive an alternative formulation of the original adaptation algorithm that exploits label alignment and is suitable for deep neural networks. We also perform experiments to demonstrate that our approach achieves comparable performance to mainstream unsupervised domain adaptation methods while converging more stably. All experiments and implementations in our work can be found at the following codebase: this https URL.

[LG-82] ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction

Link: https://arxiv.org/abs/2410.04721
Authors: Hyungjin Chung, Dohun Lee, Jong Chul Ye
Keywords: global context modeling, generative modeling, distinct areas, generating high-quality local, paradigms in generative
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 10 figures. Project page: this https URL

Abstract:Autoregressive models (ARMs) and diffusion models (DMs) represent two leading paradigms in generative modeling, each excelling in distinct areas: ARMs in global context modeling and long-sequence generation, and DMs in generating high-quality local contexts, especially for continuous data such as images and short videos. However, ARMs often suffer from exponential error accumulation over long sequences, leading to physically implausible results, while DMs are limited by their local context generation capabilities. In this work, we introduce Autoregressive Coherent multimodal generation with Diffusion Correction (ACDC), a zero-shot approach that combines the strengths of both ARMs and DMs at the inference stage without the need for additional fine-tuning. ACDC leverages ARMs for global context generation and memory-conditioned DMs for local correction, ensuring high-quality outputs by correcting artifacts in generated multimodal tokens. In particular, we propose a memory module based on large language models (LLMs) that dynamically adjusts the conditioning texts for the DMs, preserving crucial global context information. Our experiments on multimodal tasks, including coherent multi-frame story generation and autoregressive video generation, demonstrate that ACDC effectively mitigates the accumulation of errors and significantly enhances the quality of generated outputs, achieving superior performance while remaining agnostic to specific ARM and DM architectures. Project page: this https URL

[LG-83] Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

Link: https://arxiv.org/abs/2410.04717
Authors: Dylan Zhang, Justin Wang, Francois Charton
Keywords: Understanding and accurately, large language models, data, large language, Turing-complete Markov algorithm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract: Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model’s adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.

[LG-84] Rule-based Data Selection for Large Language Models

Link: https://arxiv.org/abs/2410.04715
Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
Keywords: data significantly impacts, large language models, significantly impacts, large language, rules
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The quality of training data significantly impacts the performance of large language models (LLMs). There are increasing studies using LLMs to rate and select data based on several human-crafted metrics (rules). However, these conventional rule-based approaches often depend too heavily on human heuristics, lack effective metrics for assessing rules, and exhibit limited adaptability to new tasks. In our study, we introduce an innovative rule-based framework that utilizes the orthogonality of score vectors associated with rules as a novel metric for rule evaluations. Our approach includes an automated pipeline that first uses LLMs to generate a diverse set of rules, encompassing various rating dimensions to evaluate data quality. Then it rates a batch of data based on these rules and uses the determinantal point process (DPP) from random matrix theory to select the most orthogonal score vectors, thereby identifying a set of independent rules. These rules are subsequently used to evaluate all data, selecting samples with the highest average scores for downstream tasks such as LLM training. We verify the effectiveness of our method through two experimental setups: 1) comparisons with ground truth ratings and 2) benchmarking LLMs trained with the chosen data. Our comprehensive experiments cover a range of scenarios, including general pre-training and domain-specific fine-tuning in areas such as IMDB, Medical, Math, and Code. The outcomes demonstrate that our DPP-based rule rating method consistently outperforms other approaches, including rule-free rating, uniform sampling, importance resampling, and QuRating, in terms of both rating precision and model performance.
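
The rule-selection step described in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: it assumes each rule has already been turned into a score vector (one score per rated sample), and it uses a greedy Gram-Schmidt procedure as a simple stand-in for DPP-based selection. The function name `greedy_orthogonal_rules` and the example vectors are hypothetical.

```python
import math


def greedy_orthogonal_rules(score_vectors, k):
    """Greedily pick up to k rule indices whose score vectors are most orthogonal.

    Each step picks the vector with the largest norm after removing its
    components along the already-selected vectors (Gram-Schmidt). For
    unit-normalized vectors this is the usual greedy approximation to
    maximizing the Gram determinant, which is the quantity a DPP favors.
    """
    # Normalize so that selection depends on direction, not on score scale.
    residuals = []
    for v in score_vectors:
        norm = math.sqrt(sum(x * x for x in v))
        residuals.append([x / norm for x in v] if norm > 0 else list(v))

    tol = 1e-9  # tolerance for ties and for "numerically zero" residuals
    selected = []
    for _ in range(min(k, len(residuals))):
        best, best_norm2 = None, tol
        for i, r in enumerate(residuals):
            if i in selected:
                continue
            n2 = sum(x * x for x in r)
            if n2 > best_norm2 + tol:
                best, best_norm2 = i, n2
        if best is None:  # remaining vectors are linearly dependent on the chosen ones
            break
        selected.append(best)
        # Project the chosen direction out of every residual.
        scale = math.sqrt(best_norm2)
        q = [x / scale for x in residuals[best]]
        for i, r in enumerate(residuals):
            dot = sum(a * b for a, b in zip(r, q))
            residuals[i] = [a - dot * b for a, b in zip(r, q)]
    return selected


# Hypothetical score vectors: rules 0 and 1 are redundant (same direction).
rules = [[1, 0, 0], [2, 0, 0], [3, 4, 0], [0, 0, 5]]
chosen = greedy_orthogonal_rules(rules, 2)
```

In this toy input the second rule is a rescaled copy of the first, so it is never chosen; a true DPP sampler would treat it similarly, because duplicated directions drive the Gram determinant to zero.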

[LG-85] Tight Stability, Convergence, and Robustness Bounds for Predictive Coding Networks

Link: https://arxiv.org/abs/2410.04708
Authors: Ankur Mali, Tommaso Salvatori, Alexander Ororbia
Keywords: garnered significant attention, biologically plausible mechanisms, Energy-based learning algorithms, machine learning community, predictive coding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 29 pages, 9 theorems

Abstract:Energy-based learning algorithms, such as predictive coding (PC), have garnered significant attention in the machine learning community due to their theoretical properties, such as local operations and biologically plausible mechanisms for error correction. In this work, we rigorously analyze the stability, robustness, and convergence of PC through the lens of dynamical systems theory. We show that, first, PC is Lyapunov stable under mild assumptions on its loss and residual energy functions, which implies intrinsic robustness to small random perturbations due to its well-defined energy-minimizing dynamics. Second, we formally establish that the PC updates approximate quasi-Newton methods by incorporating higher-order curvature information, which makes them more stable and able to converge with fewer iterations compared to models trained via backpropagation (BP). Furthermore, using this dynamical framework, we provide new theoretical bounds on the similarity between PC and other algorithms, i.e., BP and target propagation (TP), by precisely characterizing the role of higher-order derivatives. These bounds, derived through detailed analysis of the Hessian structures, show that PC is significantly closer to quasi-Newton updates than TP, providing a deeper understanding of the stability and efficiency of PC compared to conventional learning methods.

[LG-86] Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

Link: https://arxiv.org/abs/2410.04707
Authors: Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas
Keywords: Computationally intensive decoding, spanning code generation, problems spanning code, Computationally intensive, intensive decoding procedures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Computationally intensive decoding procedures–including search, reranking, and self-critique–can improve the quality of language model (LM) outputs in problems spanning code generation, numerical reasoning, and dialog. Existing work typically applies the same decoding procedure for every input to an LM. But not all inputs require the same amount of computation to process. Can we allocate decoding computation adaptively, using more resources to answer questions whose answers will be harder to compute? We present an approach that predicts the distribution of rewards given an input and computation budget, then allocates additional computation to inputs for which it is predicted to be most useful. We apply this approach in two decoding procedures: first, an adaptive best-of-k procedure that dynamically selects the number of samples to generate as input to a reranker; second, a routing procedure that dynamically responds to a query using a decoding procedure that is expensive but accurate, or one that is cheaper but less capable. Across a suite of programming, mathematics, and dialog tasks, we show that accurate computation-allocation procedures can be learned, and reduce computation by up to 50% at no cost to response quality, or improve quality by up to 10% at a fixed computational budget.
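
The adaptive best-of-k idea can be made concrete with a toy allocation rule. The sketch below is an illustration under simplifying assumptions, not the paper's learned predictor: it assumes we are handed a per-query probability `p` that a single sample is correct, so the chance that at least one of `k` samples is correct is `1 - (1 - p)**k` and the marginal gain of one more sample is `p * (1 - p)**k`. The function name `allocate_samples` and the example probabilities are hypothetical.

```python
import heapq


def allocate_samples(success_probs, total_budget, min_samples=1):
    """Split a fixed sample budget across queries by greedy marginal gain.

    Because the marginal gain p * (1 - p)**k shrinks as k grows, always
    giving the next sample to the query with the largest current gain
    maximizes the expected number of solved queries under this model.
    """
    n = len(success_probs)
    remaining = total_budget - min_samples * n
    if remaining < 0:
        raise ValueError("budget is smaller than min_samples per query")

    alloc = [min_samples] * n
    # heapq is a min-heap, so gains are stored negated.
    heap = [(-p * (1 - p) ** min_samples, i) for i, p in enumerate(success_probs)]
    heapq.heapify(heap)
    for _ in range(remaining):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        p = success_probs[i]
        heapq.heappush(heap, (-p * (1 - p) ** alloc[i], i))
    return alloc


# Hypothetical predicted single-sample success rates for three queries.
allocation = allocate_samples([0.95, 0.5, 0.2], total_budget=6)
```

Under this model the easy query (0.95) keeps the minimum of one sample while the harder ones absorb the rest of the budget; with a uniform best-of-2 policy the same six samples would be spread evenly regardless of difficulty.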

[LG-87] Neural Fourier Modelling: A Highly Compact Approach to Time-Series Analysis

Link: https://arxiv.org/abs/2410.04703
Authors: Minjung Kim, Yusuke Hioka, Michael Witbrock
Keywords: Neural Fourier Modelling, Fourier domain, Neural Fourier Filters, equivalent Fourier domain, Implicit Neural Fourier
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Submitted to conference (currently under review)

Abstract:Neural time-series analysis has traditionally focused on modeling data in the time domain, often with some approaches incorporating equivalent Fourier domain representations as auxiliary spectral features. In this work, we shift the main focus to frequency representations, modeling time-series data fully and directly in the Fourier domain. We introduce Neural Fourier Modelling (NFM), a compact yet powerful solution for time-series analysis. NFM is grounded in two key properties of the Fourier transform (FT): (i) the ability to model finite-length time series as functions in the Fourier domain, treating them as continuous-time elements in function space, and (ii) the capacity for data manipulation (such as resampling and timespan extension) within the Fourier domain. We reinterpret Fourier-domain data manipulation as frequency extrapolation and interpolation, incorporating this as a core learning mechanism in NFM, applicable across various tasks. To support flexible frequency extension with spectral priors and effective modulation of frequency representations, we propose two learning modules: Learnable Frequency Tokens (LFT) and Implicit Neural Fourier Filters (INFF). These modules enable compact and expressive modeling in the Fourier domain. Extensive experiments demonstrate that NFM achieves state-of-the-art performance on a wide range of tasks (forecasting, anomaly detection, and classification), including challenging time-series scenarios with previously unseen sampling rates at test time. Moreover, NFM is highly compact, requiring fewer than 40K parameters in each task, with time-series lengths ranging from 100 to 16K.

[LG-88] A Clifford Algebraic Approach to E(n)-Equivariant High-order Graph Neural Networks

Link: https://arxiv.org/abs/2410.04692
Authors: Hoang-Viet Tran, Thieu N. Vo, Tho Tran Huu, Tan Minh Nguyen
Keywords: Designing neural network, handle data symmetry, graph neural networks, Designing neural, equivariant graph neural
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Designing neural network architectures that can handle data symmetry is crucial. This is especially important for geometric graphs whose properties are equivariant under Euclidean transformations. Current equivariant graph neural networks (EGNNs), particularly those using message passing, have a limitation in expressive power. Recent high-order graph neural networks can overcome this limitation, yet they lack equivariance properties, representing a notable drawback in certain applications in chemistry and physical sciences. In this paper, we introduce the Clifford Group Equivariant Graph Neural Networks (CG-EGNNs), a novel EGNN that enhances high-order message passing by integrating high-order local structures in the context of Clifford algebras. As a key benefit of using Clifford algebras, CG-EGNN can learn functions that capture equivariance from positional features. By adopting the high-order message passing mechanism, CG-EGNN gains richer information from neighbors, thus improving model performance. Furthermore, we establish the universality property of the k-hop message passing framework, showcasing the greater expressive power of CG-EGNNs with an additional k-hop message passing mechanism. We empirically validate that CG-EGNNs outperform previous methods on various benchmarks including n-body, CMU motion capture, and MD17, highlighting their effectiveness in geometric deep learning.

[LG-89] Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning EMNLP’24

Link: https://arxiv.org/abs/2410.04691
Authors: Qingyu Yin, Xuzheng He, Luoao Deng, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, Qiang Zhang
Keywords: imbuing large language, large language models, in-context learning, task-specific knowledge, ICL
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP’24 Findings

Abstract: Fine-tuning and in-context learning (ICL) are two prevalent methods for imbuing large language models with task-specific knowledge. It is commonly believed that fine-tuning can surpass ICL given sufficient training samples, as it allows the model to adjust its internal parameters based on the data. However, this paper presents a counterintuitive finding: for tasks with implicit patterns, ICL captures these patterns significantly better than fine-tuning. We developed several datasets featuring implicit patterns, such as sequences determining answers through parity or identifying reducible terms in calculations. We then evaluated the models’ understanding of these patterns under both fine-tuning and ICL across models ranging from 0.5B to 7B parameters. The results indicate that models employing ICL can quickly grasp deep patterns and significantly improve accuracy. In contrast, fine-tuning, despite utilizing thousands of times more training samples than ICL, achieved only limited improvements. We also propose a circuit shift theory, from a mechanistic interpretability perspective, to explain why ICL wins.

[LG-90] Towards Measuring Goal-Directedness in AI Systems

Link: https://arxiv.org/abs/2410.04683
Authors: Dylan Xu, Juan-Pablo Rivera
Keywords: Recent advances, creating advanced, advances in deep, brought attention, possibility of creating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in deep learning have brought attention to the possibility of creating advanced, general AI systems that outperform humans across many tasks. However, if these systems pursue unintended goals, there could be catastrophic consequences. A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner in the first place, optimizing for some unknown goal; there exists significant research trying to evaluate systems for said behaviors. However, the most rigorous definitions of goal-directedness we currently have are difficult to compute in real-world settings. Drawing upon this previous literature, we explore policy goal-directedness within reinforcement learning (RL) environments. In our findings, we propose a different family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many (sparse) reward functions. We operationalize this preliminary definition of goal-directedness and test it in toy Markov decision process (MDP) environments. Furthermore, we explore how goal-directedness could be measured in frontier large-language models (LLMs). Our contribution is a definition of goal-directedness that is simpler and more easily computable in order to approach the question of whether AI systems could pursue dangerous goals. We recommend further exploration of measuring coherence and goal-directedness, based on our findings.

[LG-91] On the Adversarial Risk of Test Time Adaptation: An Investigation into Realistic Test-Time Data Poisoning

Link: https://arxiv.org/abs/2410.04682
Authors: Yongyi Su, Yushu Li, Nanqing Liu, Kui Jia, Xulei Yang, Chuan-Sheng Foo, Xun Xu
Keywords: updates the model, enhance generalization, TTA, model weights, inference stage
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures, 8 tables

Abstract:Test-time adaptation (TTA) updates the model weights during the inference stage using testing data to enhance generalization. However, this practice exposes TTA to adversarial risks. Existing studies have shown that when TTA is updated with crafted adversarial test samples, also known as test-time poisoned data, the performance on benign samples can deteriorate. Nonetheless, the perceived adversarial risk may be overstated if the poisoned data is generated under overly strong assumptions. In this work, we first review realistic assumptions for test-time data poisoning, including white-box versus grey-box attacks, access to benign data, attack budget, and more. We then propose an effective and realistic attack method that better produces poisoned samples without access to benign samples, and derive an effective in-distribution attack objective. We also design two TTA-aware attack objectives. Our benchmarks of existing attack methods reveal that the TTA methods are more robust than previously believed. In addition, we analyze effective defense strategies to help develop adversarially robust TTA methods.

[LG-92] The role of interface boundary conditions and sampling strategies for Schwarz-based coupling of projection-based reduced order models

Link: https://arxiv.org/abs/2410.04668
Authors: Christopher R. Wentland, Francesco Rizzi, Joshua Barnett, Irina Tezaur
Keywords: Schwarz alternating method, subdomain-local projection-based reduced, projection-based reduced order, Schwarz alternating, interest is posed
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Physics (math-ph)

Abstract:This paper presents and evaluates a framework for the coupling of subdomain-local projection-based reduced order models (PROMs) using the Schwarz alternating method following a domain decomposition (DD) of the spatial domain on which a given problem of interest is posed. In this approach, the solution on the full domain is obtained via an iterative process in which a sequence of subdomain-local problems are solved, with information propagating between subdomains through transmission boundary conditions (BCs). We explore several new directions involving the Schwarz alternating method aimed at maximizing the method’s efficiency and flexibility, and demonstrate it on three challenging two-dimensional nonlinear hyperbolic problems: the shallow water equations, Burgers’ equation, and the compressible Euler equations. We demonstrate that, for a cell-centered finite volume discretization and a non-overlapping DD, it is possible to obtain a stable and accurate coupled model utilizing Dirichlet-Dirichlet (rather than Robin-Robin or alternating Dirichlet-Neumann) transmission BCs on the subdomain boundaries. We additionally explore the impact of boundary sampling when utilizing the Schwarz alternating method to couple subdomain-local hyper-reduced PROMs. Our numerical results suggest that the proposed methodology has the potential to improve PROM accuracy by enabling the spatial localization of these models via domain decomposition, and achieve up to two orders of magnitude speedup over equivalent coupled full order model solutions and moderate speedups over analogous monolithic solutions.

[LG-93] Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

Link: https://arxiv.org/abs/2410.04663
Authors: Chaithanya Bandi, Hari Bandi, Abir Harrasse
Keywords: paper explores optimal, large language models, explores optimal architectures, paper explores, explores optimal
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:This paper explores optimal architectures for evaluating the outputs of large language models (LLMs) using LLMs themselves. We propose a novel framework that interprets LLMs as advocates within an ensemble of interacting agents, allowing them to defend their answers and reach conclusions through a judge and jury system. This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics. We discuss the motivation behind this framework, its key components, and comparative advantages. We also present a probabilistic model to evaluate the error reduction achieved by iterative advocate systems. Finally, we outline experiments to validate the effectiveness of multi-advocate architectures and discuss future research directions.

[LG-94] Federated Learning Nodes Can Reconstruct Peers' Image Data

Link: https://arxiv.org/abs/2410.04661
Authors: Ethan Wilson, Kai Yue, Chau-Wai Wong, Huaiyu Dai
Keywords: periodically average weight, machine learning framework, enables multiple nodes, privacy-preserving machine learning, average weight updates
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 12 pages including references, 12 figures

Abstract:Federated learning (FL) is a privacy-preserving machine learning framework that enables multiple nodes to train models on their local data and periodically average weight updates to benefit from other nodes’ training. Each node’s goal is to collaborate with other nodes to improve the model’s performance while keeping its training data private. However, this framework does not guarantee data privacy. Prior work has shown that the gradient-sharing steps in FL can be vulnerable to data reconstruction attacks from an honest-but-curious central server. In this work, we show that an honest-but-curious node/client can also launch attacks to reconstruct peers’ image data in a centralized system, presenting a severe privacy risk. We demonstrate that a single client can silently reconstruct other clients’ private images using diluted information available within consecutive updates. We leverage state-of-the-art diffusion models to enhance the perceptual quality and recognizability of the reconstructed images, further demonstrating the risk of information leakage at a semantic level. This highlights the need for more robust privacy-preserving mechanisms that protect against silent client-side attacks during federated training.
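
To see why averaged updates still carry "diluted" information about peers, consider the arithmetic of plain federated averaging. The sketch below is only an illustration, assuming equal-weight averaging and a client that sees the aggregated update along with its own; it does not reproduce the paper's image-reconstruction attack or its diffusion-model post-processing. The function names and numbers are hypothetical.

```python
def federated_average(client_updates):
    """Equal-weight average of per-client update vectors (one list per client)."""
    n = len(client_updates)
    return [sum(values) / n for values in zip(*client_updates)]


def mean_of_peer_updates(global_update, own_update, num_clients):
    """What one client can infer about the others from the aggregate.

    Since global = (own + sum of peers) / n, the mean of the peers' updates
    is (n * global - own) / (n - 1). Individual peers stay mixed together,
    which is the sense in which the leaked information is "diluted".
    """
    if num_clients < 2:
        raise ValueError("need at least two clients")
    return [
        (num_clients * g - o) / (num_clients - 1)
        for g, o in zip(global_update, own_update)
    ]


# Hypothetical two-parameter updates from three clients.
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
aggregate = federated_average(updates)
peers_seen_by_client_0 = mean_of_peer_updates(aggregate, updates[0], len(updates))
```

With only two clients the recovered quantity is exactly the other client's update, which is why small federations are the most exposed under this simple model.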

[LG-95] Contrastive Learning to Improve Retrieval for Real-world Fact Checking EMNLP2024

Link: https://arxiv.org/abs/2410.04657
Authors: Aniruddh Sriram, Fangyuan Xu, Eunsol Choi, Greg Durrett
Keywords: Recent work, incorporate evidence retrieved, addresses a realistic, web to decide, models incorporate evidence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 FEVER Workshop

Abstract:Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is in retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires more inferences. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show our gains can be transferred to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.

[LG-96] Graph Fourier Neural Kernels (G-FuNK): Learning Solutions of Nonlinear Diffusive Parametric PDEs on Multiple Domains

Link: https://arxiv.org/abs/2410.04655
Authors: Shane E. Loeffler, Zan Ahmad, Syed Yusuf Ali, Carolyna Yamamoto, Dan M. Popescu, Alana Yee, Yash Lal, Natalia Trayanova, Mauro Maggioni
Keywords: Predicting time-dependent dynamics, non-linear partial differential, challenging task motivated, Predicting time-dependent, Fourier Neural Kernels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Spectral Theory (math.SP); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Predicting time-dependent dynamics of complex systems governed by non-linear partial differential equations (PDEs) with varying parameters and domains is a challenging task motivated by applications across various fields. We introduce a novel family of neural operators based on our Graph Fourier Neural Kernels, designed to learn solution generators for nonlinear PDEs in which the highest-order term is diffusive, across multiple domains and parameters. G-FuNK combines components that are parameter- and domain-adapted with others that are not. The domain-adapted components are constructed using a weighted graph on the discretized domain, where the graph Laplacian approximates the highest-order diffusive term, ensuring boundary condition compliance and capturing the parameter and domain-specific behavior. Meanwhile, the learned components transfer across domains and parameters via Fourier Neural Operators. This approach naturally embeds geometric and directional information, improving generalization to new test domains without need for retraining the network. To handle temporal dynamics, our method incorporates an integrated ODE solver to predict the evolution of the system. Experiments show G-FuNK’s capability to accurately approximate heat, reaction diffusion, and cardiac electrophysiology equations across various geometries and anisotropic diffusivity fields. G-FuNK achieves low relative errors on unseen domains and fiber fields, significantly accelerating predictions compared to traditional finite-element solvers.
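
The idea that a graph Laplacian on the discretized domain stands in for the diffusive term can be sketched with a classical, non-learned example. The code below is not G-FuNK. It assumes an unweighted path graph as the "domain" and uses a plain explicit Euler step for du/dt = -L u; the names `path_graph_laplacian` and `euler_heat_step` are illustrative.

```python
def path_graph_laplacian(n):
    """Unweighted graph Laplacian L = D - A of a path with n nodes."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def euler_heat_step(u, lap, dt):
    """One explicit Euler step of the graph heat equation du/dt = -L u."""
    n = len(u)
    return [u[i] - dt * sum(lap[i][j] * u[j] for j in range(n)) for i in range(n)]


lap = path_graph_laplacian(5)
u = [0.0, 0.0, 1.0, 0.0, 0.0]      # one unit of heat at the middle node
for _ in range(200):               # dt is well below the stability limit 2 / lambda_max
    u = euler_heat_step(u, lap, dt=0.1)
```

Because every column of the Laplacian sums to zero, the total heat is conserved while the profile flattens toward the uniform state, mirroring a zero-flux boundary condition on this toy domain.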

[LG-97] The Optimization Landscape of SGD Across the Feature Learning Strength

Link: https://arxiv.org/abs/2410.04642
Authors: Alexander Atanasov,Alexandru Meterez,James B. Simon,Cengiz Pehlevan
Keywords: gamma, eta, final layer, layer is down-scaled, learning
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 33 Pages, 38 figures

Abstract:We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.

[LG-98] Radial Basis Operator Networks

Link: https://arxiv.org/abs/2410.04639
Authors: Jason Kurz,Sean Oughton,Shitao Liu
Keywords: approximate nonlinear operators, infinite-dimensional spaces, designed to approximate, approximate nonlinear, provide mappings
Categories: Machine Learning (cs.LG)

Abstract:Operator networks are designed to approximate nonlinear operators, which provide mappings between infinite-dimensional spaces such as function spaces. These networks are playing an increasingly important role in machine learning, with their most notable contributions in the field of scientific computing. Their significance stems from their ability to handle the type of data often encountered in scientific applications. For instance, in climate modeling or fluid dynamics, input data typically consists of discretized continuous fields (like temperature distributions or velocity fields). We introduce the radial basis operator network (RBON), which represents a significant advancement as the first operator network capable of learning an operator in both the time domain and frequency domain when adjusted to accept complex-valued inputs. Despite the small, single hidden-layer structure, the RBON boasts small $L^2$ relative test error for both in- and out-of-distribution (OOD) data of less than $1\times 10^{-7}$ in some benchmark cases. Moreover, the RBON maintains small error on OOD data from entirely different function classes from the training data.

[LG-99] Provable Weak-to-Strong Generalization via Benign Overfitting

Link: https://arxiv.org/abs/2410.04638
Authors: David X. Wu,Anant Sahai
Keywords: machine learning posits, classic teacher-student model, strong teacher supervises, teacher supervises, weak teacher supervises
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 40 pages, 5 figures

Abstract:The classic teacher-student model in machine learning posits that a strong teacher supervises a weak student to improve the student’s capabilities. We instead consider the inverted situation, where a weak teacher supervises a strong student with imperfect pseudolabels. This paradigm was recently brought forth by Burns et al. '23 and termed weak-to-strong generalization. We theoretically investigate weak-to-strong generalization for binary and multilabel classification in a stylized overparameterized spiked covariance model with Gaussian covariates where the weak teacher’s pseudolabels are asymptotically like random guessing. Under these assumptions, we provably identify two asymptotic phases of the strong student’s generalization after weak supervision: (1) successful generalization and (2) random guessing. Our techniques should eventually extend to weak-to-strong multiclass classification. Towards doing so, we prove a tight lower tail inequality for the maximum of correlated Gaussians, which may be of independent interest. Understanding the multilabel setting reinforces the value of using logits for weak supervision when they are available.

[LG-100] DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications

Link: https://arxiv.org/abs/2410.04631
Authors: Mathias Jackermeier,Alessandro Abate
Keywords: Linear temporal logic, temporally extended tasks, Linear temporal, temporal logic, temporally extended
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Linear temporal logic (LTL) has recently been adopted as a powerful formalism for specifying complex, temporally extended tasks in reinforcement learning (RL). However, learning policies that efficiently satisfy arbitrary specifications not observed during training remains a challenging problem. Existing approaches suffer from several shortcomings: they are often only applicable to finite-horizon fragments of LTL, are restricted to suboptimal solutions, and do not adequately handle safety constraints. In this work, we propose a novel learning approach to address these concerns. Our method leverages the structure of Büchi automata, which explicitly represent the semantics of LTL specifications, to learn policies conditioned on sequences of truth assignments that lead to satisfying the desired formulae. Experiments in a variety of discrete and continuous domains demonstrate that our approach is able to zero-shot satisfy a wide range of finite- and infinite-horizon specifications, and outperforms existing methods in terms of both satisfaction probability and efficiency.

[LG-101] Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Link: https://arxiv.org/abs/2410.04612
Authors: Zhaolin Gao,Wenhao Zhan,Jonathan D. Chang,Gokul Swamy,Kianté Brantley,Jason D. Lee,Wen Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, REFUEL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at this https URL, and models trained by REFUEL can be found at this https URL.

[LG-102] Hammer: Robust Function-Calling for On-Device Language Models via Function Masking

Link: https://arxiv.org/abs/2410.04587
Authors: Qiqiang Lin,Muning Wen,Qiuying Peng,Guanyu Nie,Junwei Liao,Jun Wang,Xiaoyun Mo,Jiamu Zhou,Cheng Cheng,Yin Zhao,Jun Wang,Weinan Zhang
Keywords: Large language models, API calls, Large language, tools and API, demonstrated impressive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Large language models have demonstrated impressive value in performing as autonomous agents when equipped with external tools and API calls. Nonetheless, effectively harnessing their potential for executing complex tasks crucially relies on enhancements in their function calling capabilities. This paper identifies a critical gap in existing function calling models, where performance varies significantly across benchmarks, often due to being misled by specific naming conventions. To address such an issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances models’ sensitivity to irrelevant functions and incorporates function masking techniques to minimize misleading. Our empirical evaluations reveal that Hammer not only outperforms larger models but also demonstrates robust generalization across diverse benchmarks, achieving state-of-the-art results. Our open source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function calling performance.

[LG-103] Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Link: https://arxiv.org/abs/2410.04579
Authors: Tianjian Li,Haoran Xu,Weiting Tan,Dongwei Jiang,Kenton Murray,Daniel Khashabi
Keywords: face data scarcity, long-tail distribution, data scarcity, Data, low-resource languages
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages

Abstract:Data availability across domains often follows a long-tail distribution: a few domains have abundant data, while most face data scarcity. This imbalance poses challenges in training language models uniformly across all domains. In our study, we focus on multilingual settings, where data sizes vary significantly between high- and low-resource languages. Common strategies to address this include upsampling low-resource languages (Temperature Sampling) or upweighting their loss (Scalarization). Although often considered equivalent, this assumption has not been proven, which motivates our study. Through both theoretical and empirical analysis, we identify the conditions under which these approaches are equivalent and when they diverge. Specifically, we demonstrate that these two methods are equivalent under full gradient descent, but this equivalence breaks down with stochastic gradient descent. Empirically, we observe that Temperature Sampling converges more quickly but is prone to overfitting. We argue that this faster convergence is likely due to the lower variance in gradient estimations, as shown theoretically. Based on these insights, we propose Cooldown, a strategy that reduces sampling temperature during training, accelerating convergence without overfitting to low-resource languages. Our method is competitive with existing data re-weighting and offers computational efficiency.
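
The two strategies compared in this abstract can be sketched in a few lines. The code below assumes the common definition of temperature sampling, where language i is drawn with probability proportional to n_i^(1/tau), and derives the loss weights that give the same expected gradient under size-proportional sampling. The linear schedule in `cooldown_tau` is purely hypothetical: the abstract says only that the temperature is reduced during training, not how.

```python
def temperature_probs(sizes, tau):
    """Sampling probabilities p_i proportional to n_i ** (1 / tau)."""
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def matching_loss_weights(sizes, tau):
    """Loss weights w_i = p_i / q_i, where q_i is size-proportional sampling.

    Sampling with q and weighting by w has the same expected gradient as
    sampling with p and no weighting: sum_i q_i * w_i * g_i = sum_i p_i * g_i.
    """
    total = sum(sizes)
    q = [n / total for n in sizes]
    p = temperature_probs(sizes, tau)
    return [pi / qi for pi, qi in zip(p, q)]


def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    """Hypothetical linear decay of the sampling temperature over training."""
    return tau_start + (tau_end - tau_start) * step / total_steps


sizes = [900, 100]                      # one high-resource, one low-resource language
probs = temperature_probs(sizes, tau=2.0)
weights = matching_loss_weights(sizes, tau=2.0)
```

The match holds only in expectation. Per-batch gradients differ between the two schemes, which is consistent with the abstract's claim that the equivalence breaks down under stochastic gradient descent.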

[LG-104] Robustness Reprogramming for Representation Learning

Link: https://arxiv.org/abs/2410.04577
Authors: Zhichao Hou,MohamadAli Torkamani,Hamid Krim,Xiaorui Liu
Keywords: noisy input perturbations, fundamental open challenge, altering its parameters, tackles an intriguing, intriguing and fundamental
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This work tackles an intriguing and fundamental open challenge in representation learning: Given a well-trained deep learning model, can it be reprogrammed to enhance its robustness against adversarial or noisy input perturbations without altering its parameters? To explore this, we revisit the core feature transformation mechanism in representation learning and propose a novel non-linear robust pattern matching technique as a robust alternative. Furthermore, we introduce three model reprogramming paradigms to offer flexible control of robustness under different efficiency requirements. Comprehensive experiments and ablation studies across diverse learning models ranging from basic linear model and MLPs to shallow and modern deep ConvNets demonstrate the effectiveness of our approaches. This work not only opens a promising and orthogonal direction for improving adversarial defenses in deep learning beyond existing methods but also provides new insights into designing more resilient AI systems with robust statistics.

[LG-105] Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion

Link: https://arxiv.org/abs/2410.04574
Authors: Mehwish Ghafoor,Arif Mahmood,Muhammad Bilal
Keywords: Human Pose Estimation, Pose Estimation, Human Pose, diverse occlusion types, occlusion types presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In the field of 3D Human Pose Estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation, even in the presence of severe occlusions. Confronting the issue of occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views. Each intermediate-view undergoes spatial refinement through a self-refinement schema. Subsequently, these intermediate-views are fused to yield the final 3D human pose estimation. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets, our method’s performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements. The code is available here: this https URL.

[LG-106] EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?

Link: https://arxiv.org/abs/2410.04571
Authors: Aakriti Agrawal,Mucong Ding,Zora Che,Chenghao Deng,Anirudh Satheesh,John Langford,Furong Huang
Keywords: multiple Large Language, Large Language Models, Large Language, multiple Large, Language Models
Categories: Machine Learning (cs.LG)

Abstract:How can we harness the collective capabilities of multiple Large Language Models (LLMs) to create an even more powerful model? This question forms the foundation of our research, where we propose an innovative approach to weak-to-strong (w2s) generalization, a critical problem in AI alignment. Our work introduces an easy-to-hard (e2h) framework for studying the feasibility of w2s generalization, where weak models trained on simpler tasks collaboratively supervise stronger models on more complex tasks. This setup mirrors real-world challenges, where direct human supervision is limited. To achieve this, we develop a novel AdaBoost-inspired ensemble method, demonstrating that an ensemble of weak supervisors can enhance the performance of stronger LLMs across classification and generative tasks on difficult QA datasets. In several cases, our ensemble approach matches the performance of models trained on ground-truth data, establishing a new benchmark for w2s generalization. We observe an improvement of up to 14% over existing baselines and average improvements of 5% and 4% for binary classification and generative tasks, respectively. This research points to a promising direction for enhancing AI through collective supervision, especially in scenarios where labeled data is sparse or insufficient.
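
For readers unfamiliar with the AdaBoost idea that the method takes inspiration from, here is a minimal sketch of the classical scheme: each weak model receives a vote weight based on its error rate, and labels are combined by weighted vote. This is textbook binary AdaBoost voting, not the EnsemW2S algorithm. The error rates and the function names `adaboost_alpha` and `weighted_vote` are illustrative assumptions.

```python
import math


def adaboost_alpha(error_rate):
    """Classical AdaBoost vote weight for a weak model with the given error rate."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


def weighted_vote(predictions, alphas):
    """Combine binary predictions in {-1, +1} by a weighted majority vote."""
    score = sum(alpha * pred for alpha, pred in zip(alphas, predictions))
    return 1 if score >= 0 else -1


# Hypothetical error rates for three weak supervisors.
alphas = [adaboost_alpha(e) for e in (0.1, 0.4, 0.4)]
# The most accurate supervisor says +1, the two weaker ones say -1.
pseudo_label = weighted_vote([1, -1, -1], alphas)
```

In this toy case the single accurate supervisor outvotes the two weaker ones, so the resulting pseudo-label differs from a plain majority vote.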

[LG-107] Watermarking Decision Tree Ensembles

Link: https://arxiv.org/abs/2410.04570
Authors: Stefano Calzavara,Lorenzo Cazzaro,Donald Gera,Salvatore Orlando
Keywords: deep neural networks, Protecting the intellectual, machine learning models, decision tree ensembles, intellectual property
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: 7 pages, 5 figures, 2 tables

Abstract:Protecting the intellectual property of machine learning models is a hot topic and many watermarking schemes for deep neural networks have been proposed in the literature. Unfortunately, prior work largely neglected the investigation of watermarking techniques for other types of models, including decision tree ensembles, which are a state-of-the-art model for classification tasks on non-perceptual data. In this paper, we present the first watermarking scheme designed for decision tree ensembles, focusing in particular on random forest models. We discuss watermark creation and verification, presenting a thorough security analysis with respect to possible attacks. We finally perform an experimental evaluation of the proposed scheme, showing excellent results in terms of accuracy and security against the most relevant threats.

[LG-108] Ranking Policy Learning via Marketplace Expected Value Estimation From Observational Data

Link: https://arxiv.org/abs/2410.04568
Authors: Ehsan Ebrahimzadeh,Nikhil Monga,Hang Gao,Alex Cozzi,Abraham Bagherjeiran
Keywords: decision making framework, reward optimization problem, expected reward optimization, expected reward, ranking policy
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 9 pages

Abstract:We develop a decision-making framework to cast the problem of learning a ranking policy for search or recommendation engines in a two-sided e-commerce marketplace as an expected reward optimization problem using observational data. As a value allocation mechanism, the ranking policy allocates retrieved items to the designated slots so as to maximize the user utility from the slotted items, at any given stage of the shopping journey. The objective of this allocation can in turn be defined with respect to the underlying probabilistic user browsing model as the expected number of interaction events on presented items matching the user intent, given the ranking context. Through recognizing the effect of ranking as an intervention action to inform users’ interactions with slotted items and the corresponding economic value of the interaction events for the marketplace, we formulate the expected reward of the marketplace as the collective value from all presented ranking actions. The key element in this formulation is a notion of context value distribution, which signifies not only the attribution of value to ranking interventions within a session but also the distribution of marketplace reward across user sessions. We build empirical estimates for the expected reward of the marketplace from observational data that account for the heterogeneity of economic value across session contexts as well as the distribution shifts in learning from observational user activity data. The ranking policy can then be trained by optimizing the empirical expected reward estimates via standard Bayesian inference techniques. We report empirical results for a product search ranking task in a major e-commerce platform demonstrating the fundamental trade-offs governed by ranking policies trained on empirical reward estimates with respect to extreme choices of the context value distribution.

[LG-109] GAMformer: In-Context Learning for Generalized Additive Models

Link: https://arxiv.org/abs/2410.04560
Authors: Andreas Mueller,Julien Siems,Harsha Nori,David Salinas,Arber Zela,Rich Caruana,Frank Hutter
Keywords: Generalized Additive Models, machine learning models, Generalized Additive, fully interpretable machine, create fully interpretable
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 20 pages, 12 figures

Abstract:Generalized Additive Models (GAMs) are widely recognized for their ability to create fully interpretable machine learning models for tabular data. Traditionally, training GAMs involves iterative learning algorithms, such as splines, boosted trees, or neural networks, which refine the additive components through repeated error reduction. In this paper, we introduce GAMformer, the first method to leverage in-context learning to estimate shape functions of a GAM in a single forward pass, representing a significant departure from the conventional iterative approaches to GAM fitting. Building on previous research applying in-context learning to tabular data, we exclusively use complex, synthetic data to train GAMformer, yet find it extrapolates well to real-world data. Our experiments show that GAMformer performs on par with other leading GAMs across various classification benchmarks while generating highly interpretable shape functions.
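
The term "shape functions" may be easier to picture with a tiny example. A GAM predicts with an intercept plus one univariate function per feature, so each feature's contribution can be plotted and inspected on its own. The sketch below shows only that additive structure. The two shape functions are hand-written and hypothetical, whereas GAMformer would estimate them from data in a forward pass.

```python
import math


def gam_predict(intercept, shape_functions, features):
    """Additive prediction: intercept + sum over j of f_j(x_j)."""
    return intercept + sum(f(x) for f, x in zip(shape_functions, features))


# Hypothetical shape functions, one per feature (illustrative only).
shape_functions = [
    lambda age: 0.02 * (age - 40.0),    # linear contribution of the first feature
    lambda dose: math.log1p(dose),      # saturating contribution of the second
]

score = gam_predict(0.5, shape_functions, [50.0, 0.0])
```

For classification, the additive score would typically be passed through a link function such as the logistic sigmoid; that step is omitted here.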

[LG-110] dattri: A Library for Efficient Data Attribution

Link: https://arxiv.org/abs/2410.04555
Authors: Junwei Deng,Ting-Wei Li,Shiyuan Zhang,Shixuan Liu,Yijun Pan,Hao Huang,Xinhe Wang,Pingbang Hu,Xingjian Zhang,Jiaqi W. Ma
Keywords: Data attribution methods, Data attribution, attribution methods, attribution, individual training samples
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Data attribution methods aim to quantify the influence of individual training samples on the prediction of artificial intelligence (AI) models. As training data plays an increasingly crucial role in the modern development of large-scale AI models, data attribution has found broad applications in improving AI performance and safety. However, despite a surge of new data attribution methods being developed recently, the field still lacks a comprehensive library that facilitates the development, benchmarking, and deployment of different data attribution methods. In this work, we introduce dattri, an open-source data attribution library that addresses the above needs. Specifically, dattri highlights three novel design features. Firstly, dattri proposes a unified and easy-to-use API, allowing users to integrate different data attribution methods into their PyTorch-based machine learning pipeline with a few lines of code changed. Secondly, dattri modularizes low-level utility functions that are commonly used in data attribution methods, such as Hessian-vector product, inverse-Hessian-vector product or random projection, making it easier for researchers to develop new data attribution methods. Thirdly, dattri provides a comprehensive benchmark framework with pre-trained models and ground truth annotations for a variety of benchmark settings, including generative AI settings. We have implemented a variety of state-of-the-art efficient data attribution methods that can be applied to large-scale neural network models, and will continuously update the library in the future. Using the developed dattri library, we are able to perform a comprehensive and fair benchmark analysis across a wide range of data attribution methods. The source code of dattri is available at this https URL.

[LG-111] Bisimulation metric for Model Predictive Control

Link: https://arxiv.org/abs/2410.04553
Authors: Yutaka Shimizu,Masayoshi Tomizuka
Keywords: Model-based reinforcement learning, improving sample efficiency, Model-based reinforcement, complex environments, reinforcement learning
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Model-based reinforcement learning has shown promise for improving sample efficiency and decision-making in complex environments. However, existing methods face challenges in training stability, robustness to noise, and computational efficiency. In this paper, we propose Bisimulation Metric for Model Predictive Control (BS-MPC), a novel approach that incorporates bisimulation metric loss in its objective function to directly optimize the encoder. This time-step-wise direct optimization enables the learned encoder to extract intrinsic information from the original state space while discarding irrelevant details and preventing the gradients and errors from diverging. BS-MPC improves training stability, robustness against input noise, and computational efficiency by reducing training time. We evaluate BS-MPC on both continuous control and image-based tasks from the DeepMind Control Suite, demonstrating superior performance and robustness compared to state-of-the-art baseline methods.

[LG-112] Modeling Social Media Recommendation Impacts Using Academic Networks: A Graph Neural Network Approach

Link: https://arxiv.org/abs/2410.04552
Authors: Sabrina Guidotti,Gregor Donabauer,Simone Somazzi,Udo Kruschwitz,Davide Taibi,Dimitri Ognibene
Keywords: highlighted potential negative, potential negative impacts, shape user behavior, society and individuals, largely driven
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The widespread use of social media has highlighted potential negative impacts on society and individuals, largely driven by recommendation algorithms that shape user behavior and social dynamics. Understanding these algorithms is essential but challenging due to the complex, distributed nature of social media networks as well as limited access to real-world data. This study proposes to use academic social networks as a proxy for investigating recommendation systems in social media. By employing Graph Neural Networks (GNNs), we develop a model that separates the prediction of academic infosphere from behavior prediction, allowing us to simulate recommender-generated infospheres and assess the model’s performance in predicting future co-authorships. Our approach aims to improve our understanding of recommendation systems’ roles and social networks modeling. To support the reproducibility of our work we publicly make available our implementations: this https URL

[LG-113] Social Choice for Heterogeneous Fairness in Recommendation

Link: https://arxiv.org/abs/2410.04551
Authors: Amanda Aird,Elena Štefancová,Cassidy All,Amy Voida,Martin Homola,Nicholas Mattei,Robin Burke
Keywords: recommender systems requires, systems requires close, requires close attention, Algorithmic fairness, competing interests
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Algorithmic fairness in recommender systems requires close attention to the needs of a diverse set of stakeholders that may have competing interests. Previous work in this area has often been limited by fixed, single-objective definitions of fairness, built into algorithms or optimization criteria that are applied to a single fairness dimension or, at most, applied identically across dimensions. These narrow conceptualizations limit the ability to adapt fairness-aware solutions to the wide range of stakeholder needs and fairness definitions that arise in practice. Our work approaches recommendation fairness from the standpoint of computational social choice, using a multi-agent framework. In this paper, we explore the properties of different social choice mechanisms and demonstrate the successful integration of multiple, heterogeneous fairness definitions across multiple data sets.

[LG-114] Pullback Flow Matching on Data Manifolds

Link: https://arxiv.org/abs/2410.04543
Authors: Friso de Kruiff,Erik Bekkers,Ozan Öktem,Carola-Bibiane Schönlieb,Willem Diepeveen
Keywords: Pullback Flow Matching, Riemannian Flow Matching, propose Pullback Flow, Flow Matching, training Riemannian Flow
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Biomolecules (q-bio.BM)

Abstract:We propose Pullback Flow Matching (PFM), a novel framework for generative modeling on data manifolds. Unlike existing methods that assume or learn restrictive closed-form manifold mappings for training Riemannian Flow Matching (RFM) models, PFM leverages pullback geometry and isometric learning to preserve the underlying manifold’s geometry while enabling efficient generation and precise interpolation in latent space. This approach not only facilitates closed-form mappings on the data manifold but also allows for designable latent spaces, using assumed metrics on both data and latent manifolds. By enhancing isometric learning through Neural ODEs and proposing a scalable training objective, we achieve a latent space more suitable for interpolation, leading to improved manifold learning and generative performance. We demonstrate PFM’s effectiveness through applications in synthetic data, protein dynamics and protein sequence data, generating novel proteins with specific properties. This method shows strong potential for drug discovery and materials science, where generating novel samples with specific properties is of great interest.

[LG-115] On Evaluating LLMs' Capabilities as Functional Approximators: A Bayesian Perspective

Link: https://arxiv.org/abs/2410.04541
Authors: Shoaib Ahmed Siddiqui,Yanzhi Chen,Juyeon Heo,Menglin Xia,Adrian Weller
Keywords-EN: Large Language Models, applied Large Language, successfully applied Large, Language Models, Large Language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recent works have successfully applied Large Language Models (LLMs) to function modeling tasks. However, the reasons behind this success remain unclear. In this work, we propose a new evaluation framework to comprehensively assess LLMs’ function modeling abilities. By adopting a Bayesian perspective of function modeling, we discover that LLMs are relatively weak in understanding patterns in raw data, but excel at utilizing prior knowledge about the domain to develop a strong understanding of the underlying function. Our findings offer new insights about the strengths and limitations of LLMs in the context of function modeling.

[LG-116] UniMuMo: Unified Text Music and Motion Generation

Link: https://arxiv.org/abs/2410.04534
Authors: Han Yang,Kun Su,Yutong Zhang,Jiaben Chen,Kaizhi Qian,Gaowen Liu,Chuang Gan
Keywords-EN: taking arbitrary text, multimodal model capable, capable of taking, taking arbitrary, input conditions
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*Comments:

Abstract:We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page (this https URL).

[LG-117] Look Around and Find Out: OOD Detection with Relative Angles

Link: https://arxiv.org/abs/2410.04525
Authors: Berker Demirel,Marco Fumero,Francesco Locatello
Keywords-EN: Deep learning systems, Deep learning, learning systems deployed, deployed in real-world, real-world applications
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Deep learning systems deployed in real-world applications often encounter data that is different from their in-distribution (ID). A reliable system should ideally abstain from making decisions in this out-of-distribution (OOD) setting. Existing state-of-the-art methods primarily focus on feature distances, such as k-th nearest neighbors and distances to decision boundaries, either overlooking or ineffectively using in-distribution statistics. In this work, we propose a novel angle-based metric for OOD detection that is computed relative to the in-distribution structure. We demonstrate that the angles between feature representations and decision boundaries, viewed from the mean of in-distribution features, serve as an effective discriminative factor between ID and OOD data. Our method achieves state-of-the-art performance on CIFAR-10 and ImageNet benchmarks, reducing FPR95 by 0.88% and 7.74% respectively. Our score function is compatible with existing feature space regularization techniques, enhancing performance. Additionally, its scale-invariance property enables creating an ensemble of models for OOD detection via simple score summation.

[LG-118] Dynamic Post-Hoc Neural Ensemblers

Link: https://arxiv.org/abs/2410.04520
Authors: Sebastian Pineda Arango,Maciej Janowski,Lennart Purucker,Arber Zela,Frank Hutter,Josif Grabocka
Keywords-EN: multiple base learners, combining multiple base, enhancing the accuracy, accuracy and robustness, robustness of machine
Categories: Machine Learning (cs.LG)
*Comments: Preprint under review, 10 pages

Abstract:Ensemble methods are known for enhancing the accuracy and robustness of machine learning models by combining multiple base learners. However, standard approaches like greedy or random ensembles often fall short, as they assume a constant weight across samples for the ensemble members. This can limit expressiveness and hinder performance when aggregating the ensemble predictions. In this study, we explore employing neural networks as ensemble methods, emphasizing the significance of dynamic ensembling to leverage diverse model predictions adaptively. Motivated by the risk of learning low-diversity ensembles, we propose regularizing the model by randomly dropping base model predictions during training. We demonstrate that this approach lower-bounds the diversity within the ensemble, reducing overfitting and improving generalization capabilities. Our experiments show that the dynamic neural ensemblers yield competitive results compared to strong baselines in computer vision, natural language processing, and tabular data.

[LG-119] Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels

Link: https://arxiv.org/abs/2410.04501
Authors: Vy Nguyen,Chau Pham
Keywords-EN: suicidal thoughts highlights, Social media, increasing frequency, thoughts highlights, highlights the importance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach involving prompting with Qwen2-72B-Instruct, and using fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves the detection accuracy, by 5 percentage points compared with the individual models. It achieves a weighted F1 score of 0.770 on the public test set, and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLMs affects the prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at this https URL.

[LG-120] Adjusting Pretrained Backbones for Performativity

Link: https://arxiv.org/abs/2410.04499
Authors: Berker Demirel,Lingjing Kong,Kun Zhang,Theofanis Karaletsos,Celestine Mendler-Dünner,Francesco Locatello
Keywords-EN: widespread deployment, influence their environment, deep learning models, deep learning, models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:With the widespread deployment of deep learning models, they influence their environment in various ways. The induced distribution shifts can lead to unexpected performance degradation in deployed models. Existing methods to anticipate performativity typically incorporate information about the deployed model into the feature vector when predicting future outcomes. While enjoying appealing theoretical properties, modifying the input dimension of the prediction task is often not practical. To address this, we propose a novel technique to adjust pretrained backbones for performativity in a modular way, achieving better sample efficiency and enabling the reuse of existing deep learning assets. Focusing on performative label shift, the key idea is to train a shallow adapter module to perform a Bayes-optimal label shift correction to the backbone’s logits given a sufficient statistic of the model to be deployed. As such, our framework decouples the construction of input-specific feature embeddings from the mechanism governing performativity. Motivated by dynamic benchmarking as a use-case, we evaluate our approach under adversarial sampling, for vision and language tasks. We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations. More broadly, our work provides a first baseline for addressing performativity in deep learning.
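The label shift correction mentioned in this abstract has a well-known closed form when both class priors are known: multiply the classifier's posterior by the ratio of target to source priors, which in logit space is an additive term. The sketch below shows only that closed form. It is not the paper's adapter, which is trained to produce the correction from a statistic of the deployed model; the function names and the example priors are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_shift_correct(logits, source_prior, target_prior):
    """Add log(target_prior / source_prior) to each class logit.

    Assumes only the label marginal changes between source and target,
    and that both priors are known and strictly positive.
    """
    return [
        z + math.log(q) - math.log(p)
        for z, p, q in zip(logits, source_prior, target_prior)
    ]


# Hypothetical two-class example: an uninformative input under a skewed target prior.
corrected = label_shift_correct([0.0, 0.0], [0.5, 0.5], [0.9, 0.1])
probs = softmax(corrected)
```

With uninformative logits, the corrected posterior simply reproduces the target prior, which is a quick sanity check on the formula.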

[LG-121] AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

Link: https://arxiv.org/abs/2410.04498
Authors: Renye Yan,Yaozhong Gan,You Wu,Junliang Xing,Ling Liangn,Yeshang Zhu,Yimao Cai
Keywords-EN: sparse reward scenarios, past experiences, sparse reward, reward scenarios, scenarios of reinforcement
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:In sparse reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences like humans. However, current memory-based RL methods simply store and reuse high-value policies, lacking a deeper refining and filtering of diverse past experiences and hence limiting the capability of memory. In this paper, we propose AdaMemento, an adaptive memory-enhanced RL framework. Instead of just memorizing positive past experiences, we design a memory-reflection module that exploits both positive and negative experiences by learning to predict known local optimal policies based on real-time states. To effectively gather informative trajectories for the memory, we further introduce a fine-grained intrinsic motivation paradigm, where nuances in similar states can be precisely distinguished to guide exploration. The exploitation of past experiences and exploration of new policies are then adaptively coordinated by ensemble learning to approach the global optimum. Furthermore, we theoretically prove the superiority of our new intrinsic motivation and ensemble mechanism. From 59 quantitative and visualization experiments, we confirm that AdaMemento can distinguish subtle states for better exploration and effectively exploit past experiences in memory, achieving significant improvement over previous methods.

[LG-122] Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification NEURIPS2024

Link: https://arxiv.org/abs/2410.04492
Authors: Zhaorui Tan,Xi Yang,Qiufeng Wang,Anh Nguyen,Kaizhu Huang
Keywords-EN: Vision models excel, Vision models, discovering novel categories, struggle to generalize, Vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted by NeurIPS2024 as Spotlight

Abstract:Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. A logical regularization termed L-Reg is derived which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract the salient features, such as faces to persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.

[LG-123] A Large-Scale Exploit Instrumentation Study of AI/ML Supply Chain Attacks in Hugging Face Models

Link: https://arxiv.org/abs/2410.04490
Authors: Beatrice Casey,Joanna C. S. Santos,Mehdi Mirakhorli
Keywords-EN: Hugging Face, unsafe serialization methods, machine learning, Hugging Face serves, led to ample
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:

Abstract:The development of machine learning (ML) techniques has led to ample opportunities for developers to develop and deploy their own models. Hugging Face serves as an open source platform where developers can share and download other models in an effort to make ML development more collaborative. In order for models to be shared, they first need to be serialized. Certain Python serialization methods are considered unsafe, as they are vulnerable to object injection. This paper investigates the pervasiveness of these unsafe serialization methods across Hugging Face, and demonstrates through an exploitation approach, that models using unsafe serialization methods can be exploited and shared, creating an unsafe environment for ML developers. We investigate to what extent Hugging Face is able to flag repositories and files using unsafe serialization methods, and develop a technique to detect malicious models. Our results show that Hugging Face is home to a wide range of potentially vulnerable models.

[LG-124] Revisiting In-context Learning Inference Circuit in Large Language Models ICLR2025

Link: https://arxiv.org/abs/2410.04468
Authors: Hakaze Cho,Mariko Kato,Yoshihiro Sakai,Naoya Inoue
Keywords-EN: emerging few-shot learning, few-shot learning paradigm, In-context Learning, ICL, few-shot learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 31 pages, 37 figures, 6 tables, ICLR 2025 under review

Abstract:In-context Learning (ICL) is an emerging few-shot learning paradigm on Language Models (LMs) whose inner mechanisms remain unexplored. Existing works describe the inner processing of ICL, but they struggle to capture all the inference phenomena in large language models. Therefore, this paper proposes a comprehensive circuit to model the inference dynamics and tries to explain the observed phenomena of ICL. In detail, we divide ICL inference into 3 major operations: (1) Summarize: LMs encode every input text (demonstrations and queries) into linear representation in the hidden states with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge the encoded representations of demonstrations with their corresponding label tokens to produce joint representations of labels and demonstrations. (3) Feature Retrieval and Copy: LMs search the joint representations similar to the query representation on a task subspace, and copy the searched representations into the query. Then, language model heads capture these copied label representations to a certain extent and decode them into predicted labels. The proposed inference circuit successfully captures many phenomena observed during the ICL process, making it a comprehensive and practical explanation of the ICL inference process. Moreover, ablation analysis by disabling the proposed steps seriously damages the ICL performance, suggesting the proposed inference circuit is a dominating mechanism. Additionally, we confirm and list some bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.

[LG-125] Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Link: https://arxiv.org/abs/2410.04466
Authors: Jinhao Li,Jiaming Xu,Shan Huang,Yonghua Chen,Wen Li,Jun Liu,Yaoxiu Lian,Jiayi Pan,Li Ding,Hao Zhou,Guohao Dai
Keywords-EN: Large Language Models, natural language understanding, Large Language, Language Models, natural language
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*Comments: 43 pages, 15 figures

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms, which can point to the future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios.
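The two metrics named in this abstract are related by a unit identity: since one watt is one joule per second, energy efficiency in tokens/J equals throughput in tokens/s divided by average power in watts. A minimal sketch follows; the function name and the example figures are hypothetical and are not taken from the survey.

```python
def tokens_per_joule(tokens_per_second, power_watts):
    """Energy efficiency (tokens/J) from throughput (tokens/s) and average power (W)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


# Hypothetical device: 50 tokens/s at an average draw of 25 W.
efficiency = tokens_per_joule(50.0, 25.0)
```

For these made-up numbers the result is 2 tokens per joule, which shows why a slower but low-power platform can still rank higher on energy efficiency than a faster one.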

[LG-126] Tensor-Train Point Cloud Compression and Efficient Approximate Nearest-Neighbor Search

Link: https://arxiv.org/abs/2410.04462
Authors: Georgii Novikov,Alexander Gneushev,Alexey Kadeishvili,Ivan Oseledets
Keywords-EN: machine learning applications, large vector databases, learning applications, approximate nearest-neighbor searches, large vector
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Nearest-neighbor search in large vector databases is crucial for various machine learning applications. This paper introduces a novel method using tensor-train (TT) low-rank tensor decomposition to efficiently represent point clouds and enable fast approximate nearest-neighbor searches. We propose a probabilistic interpretation and utilize density estimation losses like Sliced Wasserstein to train TT decompositions, resulting in robust point cloud compression. We reveal an inherent hierarchical structure within TT point clouds, facilitating efficient approximate nearest-neighbor searches. In our paper, we provide detailed insights into the methodology and conduct comprehensive comparisons with existing methods. We demonstrate its effectiveness in various scenarios, including out-of-distribution (OOD) detection problems and approximate nearest-neighbor (ANN) search tasks.

[LG-127] Improved Off-policy Reinforcement Learning in Biological Sequence Design

Link: https://arxiv.org/abs/2410.04461
Authors: Hyeonah Kim,Minsu Kim,Taeyoung Yun,Sanghyeok Choi,Emmanuel Bengio,Alex Hernández-García,Jinkyoo Park
Keywords-EN: Designing biological sequences, significant challenge due, Designing biological, vast search space, combinatorially vast search
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments: 11 pages

Abstract:Designing biological sequences with desired properties is a significant challenge due to the combinatorially vast search space and the high cost of evaluating each candidate sequence. To address these challenges, reinforcement learning (RL) methods, such as GFlowNets, utilize proxy models for rapid reward evaluation and annotated data for policy training. Although these approaches have shown promise in generating diverse and novel sequences, the limited training data relative to the vast search space often leads to the misspecification of proxy for out-of-distribution inputs. We introduce $\delta$-Conservative Search, a novel off-policy search method for training GFlowNets designed to improve robustness against proxy misspecification. The key idea is to incorporate conservativeness, controlled by parameter $\delta$, to constrain the search to reliable regions. Specifically, we inject noise into high-score offline sequences by randomly masking tokens with a Bernoulli distribution of parameter $\delta$ and then denoise masked tokens using the GFlowNet policy. Additionally, $\delta$ is adaptively adjusted based on the uncertainty of the proxy model for each data point. This enables the reflection of proxy uncertainty to determine the level of conservativeness. Experimental results demonstrate that our method consistently outperforms existing machine learning methods in discovering high-score sequences across diverse tasks (including DNA, RNA, protein, and peptide design), especially in large-scale scenarios.
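The mask-then-denoise step described in this abstract can be sketched in a few lines. This is an illustrative sketch only: the mask symbol, the function names, and the uniform stand-in policy are assumptions, and a real implementation would sample each masked token from the trained GFlowNet policy and adapt the masking probability to the proxy's uncertainty.

```python
import random

MASK = "?"  # hypothetical mask symbol, not from the paper


def mask_sequence(seq, delta, rng):
    """Mask each token independently with probability delta (Bernoulli masking)."""
    return [MASK if rng.random() < delta else tok for tok in seq]


def denoise(masked, policy, rng):
    """Fill masked positions left to right using a caller-supplied policy.

    `policy(prefix, rng)` is a stand-in for sampling from a GFlowNet policy.
    """
    out = []
    for tok in masked:
        out.append(policy(out, rng) if tok == MASK else tok)
    return out


def uniform_policy(prefix, rng):
    """Placeholder policy: picks a DNA base uniformly at random."""
    return rng.choice("ACGT")


rng = random.Random(0)
offline = list("ACGTACGTAC")           # a high-score offline sequence (made up)
noisy = mask_sequence(offline, 0.3, rng)
candidate = "".join(denoise(noisy, uniform_policy, rng))
```

A small masking probability keeps candidates close to known high-score sequences, while a value of 1 masks everything and reduces to unconstrained sampling from the policy.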

[LG-128] A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD

Link: https://arxiv.org/abs/2410.04458
Authors: Ruinan Jin,Xiao Li,Yaoliang Yu,Baoxiang Wang
Keywords-EN: Adaptive Moment Estimation, Moment Estimation, handling large-scale data, Adaptive Moment, adaptive learning rates
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:Adaptive Moment Estimation (Adam) is a cornerstone optimization algorithm in deep learning, widely recognized for its flexibility with adaptive learning rates and efficiency in handling large-scale data. However, despite its practical success, the theoretical understanding of Adam’s convergence has been constrained by stringent assumptions, such as almost surely bounded stochastic gradients or uniformly bounded gradients, which are more restrictive than those typically required for analyzing stochastic gradient descent (SGD). In this paper, we introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. This framework offers a versatile approach to establishing Adam’s convergence. Specifically, we prove that Adam achieves asymptotic (last iterate sense) convergence in both the almost sure sense and the $L_1$ sense under the relaxed assumptions typically used for SGD, namely $L$-smoothness and the ABC inequality. Meanwhile, under the same assumptions, we show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.

[LG-129] An Attention-Based Algorithm for Gravity Adaptation Zone Calibration

Link: https://arxiv.org/abs/2410.04457
Authors: Chen Yu
Keywords-EN: gravity adaptation zone, adaptation zone calibration, gravity field, Accurate calibration, gravity adaptation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
*Comments: 15 pages

Abstract:Accurate calibration of gravity adaptation zones is of great significance in fields such as underwater navigation, geophysical exploration, and marine engineering. With the increasing application of gravity field data in these areas, traditional calibration methods based on single features are becoming inadequate for capturing the complex characteristics of gravity fields and addressing the intricate interrelationships among multidimensional data. This paper proposes an attention-enhanced algorithm for gravity adaptation zone calibration. By introducing an attention mechanism, the algorithm adaptively fuses multidimensional gravity field features and dynamically assigns feature weights, effectively solving the problems of multicollinearity and redundancy inherent in traditional feature selection methods, significantly improving calibration accuracy and this http URL addition, a large-scale gravity field dataset with over 10,000 sampling points was constructed, and Kriging interpolation was used to enhance the spatial resolution of the data, providing a reliable data foundation for model training and evaluation. We conducted both qualitative and quantitative experiments on several classical machine learning models (such as SVM, GBDT, and RF), and the results demonstrate that the proposed algorithm significantly improves performance across these models, outperforming other traditional feature selection methods. The method proposed in this paper provides a new solution for gravity adaptation zone calibration, showing strong generalization ability and potential for application in complex environments. The code is available at this https URL.

[LG-130] Attention Shift: Steering AI Away from Unsafe Content

Link: https://arxiv.org/abs/2410.04447
Authors: Shivank Garg,Manyana Tiwari
Keywords-EN: generative models, study investigates, investigates the generation, restricting such generations, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach using attention reweighing to remove unsafe concepts without additional training during inference. We compare our method against existing ablation methods, evaluating the performance on both direct and adversarial jailbreak prompts, using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.

[LG-131] TimeBridge: Non-Stationarity Matters for Long-term Time Series Forecasting

Link: https://arxiv.org/abs/2410.04442
Authors: Peiyuan Liu,Beiliang Wu,Yifan Hu,Naiqi Li,Tao Dai,Jigang Bao,Shu-tao Xia
Keywords-EN: poses significant challenges, Non-stationarity poses significant, inherent short-term fluctuations, essential long-term relationships, poses significant
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Non-stationarity poses significant challenges for multivariate time series forecasting due to the inherent short-term fluctuations and long-term trends that can lead to spurious regressions or obscure essential long-term relationships. Most existing methods either eliminate or retain non-stationarity without adequately addressing its distinct impacts on short-term and long-term modeling. Eliminating non-stationarity is essential for avoiding spurious regressions and capturing local dependencies in short-term modeling, while preserving it is crucial for revealing long-term cointegration across variates. In this paper, we propose TimeBridge, a novel framework designed to bridge the gap between non-stationarity and dependency modeling in long-term time series forecasting. By segmenting input series into smaller patches, TimeBridge applies Integrated Attention to mitigate short-term non-stationarity and capture stable dependencies within each variate, while Cointegrated Attention preserves non-stationarity to model long-term cointegration across variates. Extensive experiments show that TimeBridge consistently achieves state-of-the-art performance in both short-term and long-term forecasting. Additionally, TimeBridge demonstrates exceptional performance in financial forecasting on the CSI 500 and S&P 500 indices, further validating its robustness and effectiveness. Code is available at this https URL.

[LG-132] Disentangling Regional Primitives for Image Generation

Link: https://arxiv.org/abs/2410.04421
Authors: Zhengting Chen,Lei Cheng,Lianghui Ding,Quanshi Zhang
Keywords-EN: internal representation structure, feature component, neural network, image regions, paper presents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:This paper presents a method to explain the internal representation structure of a neural network for image generation. Specifically, our method disentangles primitive feature components from the intermediate-layer feature of the neural network, which ensures that each feature component is exclusively used to generate a specific set of image regions. In this way, the generation of the entire image can be considered as the superposition of different pre-encoded primitive regional patterns, each being generated by a feature component. We find that the feature component can be represented as an OR relationship between the demands for generating different image regions, which is encoded by the neural network. Therefore, we extend the Harsanyi interaction to represent such an OR interaction to disentangle the feature component. Experiments show a clear correspondence between each feature component and the generation of specific image regions.

[LG-133] Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering

Link: https://arxiv.org/abs/2410.04415
Authors: Javier Marin
Keywords-EN: Hamiltonian mechanics, paper introduces, introduces an innovative, innovative approach, approach to analyzing
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:This paper introduces an innovative approach to analyzing and improving multi-hop reasoning in AI systems by drawing inspiration from Hamiltonian mechanics. We propose a novel framework that maps reasoning chains in embedding spaces to Hamiltonian systems, allowing us to leverage powerful analytical tools from classical physics. Our method defines a Hamiltonian function that balances the progression of reasoning (kinetic energy) against the relevance to the question at hand (potential energy). Using this framework, we analyze a large dataset of reasoning chains from a multi-hop question-answering task, revealing intriguing patterns that distinguish valid from invalid reasoning. We show that valid reasoning chains have lower Hamiltonian energy and move in ways that make the best trade-off between getting more information and answering the right question. Furthermore, we demonstrate the application of this framework to steer the creation of more efficient reasoning algorithms within AI systems. Our results not only provide new insights into the nature of valid reasoning but also open up exciting possibilities for physics-inspired approaches to understanding and improving artificial intelligence.

[LG-134] Data Distribution Valuation NEURIPS2024

Link: https://arxiv.org/abs/2410.04386
Authors: Xinyi Xu,Shuaiqi Wang,Chuan-Sheng Foo,Bryan Kian Hsiang Low,Giulia Fanti
Keywords-EN: Data, class of techniques, techniques for quantitatively, quantitatively assessing, data distributions
Categories: Machine Learning (cs.LG)
*Comments: Accepted to NeurIPS 2024 as a poster. Main paper with appendix (38 pages in total). Code will be released soon at this https URL

Abstract:Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor, to decide which vendor’s data distribution is most useful to the buyer and purchase. The core question is how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
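To make the idea of comparing vendors from small preview samples concrete, the sketch below computes the standard biased estimate of squared maximum mean discrepancy (MMD) with a Gaussian kernel between a reference sample and each vendor's sample. It does not reproduce the paper's valuation function or its Huber model; the reference sample, the kernel bandwidth, and the synthetic vendors are all assumptions made for illustration.

```python
import math
import random


def rbf(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def gaussian_sample(mean, n, rng):
    """Draw n two-dimensional points from an isotropic unit-variance Gaussian."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


rng = random.Random(0)
reference = gaussian_sample(0.0, 100, rng)   # stand-in for the buyer's target data
vendor_a = gaussian_sample(0.0, 100, rng)    # same distribution as the reference
vendor_b = gaussian_sample(3.0, 100, rng)    # shifted distribution
score_a = mmd2_biased(reference, vendor_a)
score_b = mmd2_biased(reference, vendor_b)
```

Under this toy setup a smaller discrepancy indicates a vendor whose preview sample looks more like the reference, so vendor A would be preferred over vendor B.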

[LG-135] Suspiciousness of Adversarial Texts to Human

Link: https://arxiv.org/abs/2410.04377
Authors: Shakila Mahjabin Tonni, Pedro Faustini, Mark Dras
Keywords: deep neural networks, meticulously altered inputs, degrade model performance, Adversarial, neural networks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under review

Abstract:Adversarial examples pose a significant challenge to deep neural networks (DNNs) across both image and text domains, with the intent to degrade model performance through meticulously altered inputs. Adversarial texts, however, are distinct from adversarial images due to their requirement for semantic similarity and the discrete nature of the textual contents. This study delves into the concept of human suspiciousness, a quality distinct from the traditional focus on imperceptibility found in image-based adversarial examples. Unlike images, where adversarial changes are meant to be indistinguishable to the human eye, textual adversarial content must often remain undetected or non-suspicious to human readers, even when the text’s purpose is to deceive NLP systems or bypass filters. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive adversarial texts. We gather and publish a novel dataset of Likert-scale human evaluations on the suspiciousness of adversarial sentences, crafted by four widely used adversarial attack methods and assess their correlation with the human ability to detect machine-generated alterations. Additionally, we develop a regression-based model to quantify suspiciousness and establish a baseline for future research in reducing the suspiciousness in adversarial text generation. We also demonstrate how the regressor-generated suspicious scores can be incorporated into adversarial generation methods to produce texts that are less likely to be perceived as computer-generated. We make our human suspiciousness annotated data and our code available.

[LG-136] Putting Gale & Shapley to Work: Guaranteeing Stability Through Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.04376
Authors: Hadi Hosseini, Sanjukta Roy, Duohan Zhang
Keywords: Two-sided matching markets, matching markets describe, Two-sided matching, describe a large, large class
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Two-sided matching markets describe a large class of problems wherein participants from one side of the market must be matched to those from the other side according to their preferences. In many real-world applications (e.g. content matching or online labor markets), the knowledge about preferences may not be readily available and must be learned, i.e., one side of the market (aka agents) may not know their preferences over the other side (aka arms). Recent research on online settings has focused primarily on welfare optimization aspects (i.e. minimizing the overall regret) while paying little attention to the game-theoretic properties such as the stability of the final matching. In this paper, we exploit the structure of stable solutions to devise algorithms that improve the likelihood of finding stable solutions. We initiate the study of the sample complexity of finding a stable matching, and provide theoretical bounds on the number of samples needed to reach a stable matching with high probability. Finally, our empirical results demonstrate intriguing tradeoffs between stability and optimality of the proposed algorithms, further complementing our theoretical findings.
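For readers unfamiliar with the stability notion the abstract relies on, here is a minimal sketch of the classic agent-proposing Gale-Shapley (deferred acceptance) procedure together with a stability check. It assumes preferences are fully known, complete, and that there are equally many agents and arms. The paper's setting, where preferences must be learned from samples, is not modeled here, and the names `agent_prefs` and `arm_prefs` are illustrative.

```python
def gale_shapley(agent_prefs, arm_prefs):
    """Agent-proposing deferred acceptance; returns a dict {agent: arm}."""
    # rank[arm][agent] is the position of agent in arm's list (lower is better).
    rank = {arm: {agent: i for i, agent in enumerate(prefs)}
            for arm, prefs in arm_prefs.items()}
    next_choice = {agent: 0 for agent in agent_prefs}
    holder = {}  # arm -> agent it currently holds
    free = list(agent_prefs)
    while free:
        agent = free.pop()
        arm = agent_prefs[agent][next_choice[agent]]
        next_choice[agent] += 1
        current = holder.get(arm)
        if current is None:
            holder[arm] = agent
        elif rank[arm][agent] < rank[arm][current]:
            holder[arm] = agent      # arm trades up
            free.append(current)     # previous partner becomes free again
        else:
            free.append(agent)       # proposal rejected
    return {agent: arm for arm, agent in holder.items()}


def is_stable(matching, agent_prefs, arm_prefs):
    """True if no agent and arm both prefer each other to their partners."""
    partner_of_arm = {arm: agent for agent, arm in matching.items()}
    for agent, prefs in agent_prefs.items():
        for arm in prefs:
            if arm == matching[agent]:
                break  # every later arm is less preferred than the current one
            rival = partner_of_arm[arm]
            if arm_prefs[arm].index(agent) < arm_prefs[arm].index(rival):
                return False
    return True


agent_prefs = {"a1": ["x", "y", "z"], "a2": ["y", "x", "z"], "a3": ["x", "y", "z"]}
arm_prefs = {"x": ["a2", "a1", "a3"], "y": ["a1", "a2", "a3"], "z": ["a1", "a2", "a3"]}
matching = gale_shapley(agent_prefs, arm_prefs)
print(matching, is_stable(matching, agent_prefs, arm_prefs))
```

The `is_stable` helper is the part most relevant to the abstract: with noisy, learned preference estimates, the matching returned by `gale_shapley` may fail this check against the true preferences.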

[LG-137] Algorithmic Capabilities of Random Transformers NEURIPS2024

Link: https://arxiv.org/abs/2410.04368
Authors: Ziqian Zhong, Jacob Andreas
Keywords: implement interpretable procedures, implement interpretable, interpretable procedures, procedures originate, associative recall
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024

Abstract:Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input–output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at this https URL.

[LG-138] VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Link: https://arxiv.org/abs/2410.04364
Authors: Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
Keywords: visual content creation, revolutionized visual content, preserving temporal consistency, content creation, generation remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 14 figures, Project Page: this http URL

Abstract:Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model’s denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: this http URL

[LG-139] Latent Feature Mining for Predictive Model Enhancement with Large Language Models

Link: https://arxiv.org/abs/2410.04347
Authors: Bingxuan Li, Pengyi Shi, Amy Ward
Keywords: faces challenges due, latent feature mining, practical difficulties, modeling often faces, weakly correlated
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where collected features are weakly correlated with outcomes and where additional feature collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce an effective approach to formulate latent feature mining as text-to-text propositional logical reasoning. We propose FLAME (Faithful Latent Feature Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features and enhance the predictive power of ML models in downstream tasks. Our framework is generalizable across various domains with necessary domain-specific adaptation, as it is designed to incorporate contextual information unique to each area, ensuring effective transfer to different areas facing similar data availability challenges. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection; (2) the healthcare domain, where patient privacy concerns and the complexity of medical data limit comprehensive feature collection. Our results show that inferred latent features align well with ground truth labels and significantly enhance the downstream classifier.

[LG-140] DeepONet for Solving PDEs: Generalization Analysis in Sobolev Training

Link: https://arxiv.org/abs/2410.04344
Authors: Yahong Yang
Keywords: partial differential equations, solve partial differential, differential equations, solve partial, partial differential
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:In this paper, we investigate the application of operator learning, specifically DeepONet, to solve partial differential equations (PDEs). Unlike function learning methods that require training separate neural networks for each PDE, operator learning generalizes across different PDEs without retraining. We focus on the performance of DeepONet in Sobolev training, addressing two key questions: the approximation ability of deep branch and trunk networks, and the generalization error in Sobolev norms. Our findings highlight that deep branch networks offer significant performance benefits, while trunk networks are best kept simple. Moreover, standard sampling methods without adding derivative information in the encoding part are sufficient for minimizing generalization error in Sobolev training, based on generalization analysis. This paper fills a theoretical gap by providing error estimations for a wide range of physics-informed machine learning models and applications.

[LG-141] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Link: https://arxiv.org/abs/2410.04332
Authors: Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner
Keywords: trained primarily based, inputs and outputs, trained primarily, primarily based, gradient routing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a neural network. Gradient routing applies data-dependent, weighted masks to gradients during backpropagation. These masks are supplied by the user in order to configure which parameters are updated by which data points. We show that gradient routing can be used to (1) learn representations which are partitioned in an interpretable way; (2) enable robust unlearning via ablation of a pre-specified network subregion; and (3) achieve scalable oversight of a reinforcement learner by localizing modules responsible for different behaviors. Throughout, we find that gradient routing localizes capabilities even when applied to a limited, ad-hoc subset of the data. We conclude that the approach holds promise for challenging, real-world applications where quality data are scarce.
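The core idea stated in the abstract, that user-supplied, data-dependent masks decide which parameters each data point may update, can be sketched on a toy linear model. This is one simplified reading that masks the parameter gradient directly. The paper applies masks during backpropagation in neural networks and may differ in detail. The `ROUTES` table, the tag names, and the helper functions are hypothetical.

```python
def squared_error_grads(params, features, target):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w for a linear model."""
    error = sum(w * x for w, x in zip(params, features)) - target
    return [error * x for x in features]


def routed_sgd_step(params, grads, mask, lr=0.1):
    """One SGD step in which each gradient entry is scaled by its mask weight."""
    return [p - lr * m * g for p, g, m in zip(params, grads, mask)]


# Hypothetical routing table: which parameters each kind of data point may update.
ROUTES = {
    "topic_a": [1.0, 1.0, 0.0, 0.0],
    "topic_b": [0.0, 0.0, 1.0, 1.0],
}

params = [0.0, 0.0, 0.0, 0.0]
batch = [
    ("topic_a", [1.0, 2.0, 1.0, 0.5], 1.0),
    ("topic_b", [0.5, 1.0, 2.0, 1.0], -1.0),
]
for tag, features, target in batch:
    grads = squared_error_grads(params, features, target)
    params = routed_sgd_step(params, grads, ROUTES[tag])

print(params)
```

In this sketch the forward pass still uses all parameters, and only the update is restricted. That is what would let the first two weights be ablated later to remove whatever was learned from `topic_a` data.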

[LG-142] Leveraging Hierarchical Taxonomies in Prompt-based Continual Learning

Link: https://arxiv.org/abs/2410.04327
Authors: Quyen Tran, Minh Le, Tuan Truong, Dinh Phung, Linh Ngo, Thien Nguyen, Nhat Ho, Trung Le
Keywords: Prompt-based Continual Learning, Prompt-based Continual, mitigate catastrophic forgetting, continuously emerging class, Continual Learning models
Categories: Machine Learning (cs.LG)

Abstract:Drawing inspiration from human learning behaviors, this work proposes a novel approach to mitigate catastrophic forgetting in Prompt-based Continual Learning models by exploiting the relationships between continuously emerging class data. We find that applying human habits of organizing and connecting information can serve as an efficient strategy when training deep learning models. Specifically, by building a hierarchical tree structure based on the expanding set of labels, we gain fresh insights into the data, identifying groups of similar classes that could easily cause confusion. Additionally, we delve deeper into the hidden connections between classes by exploring the original pretrained model’s behavior through an optimal transport-based approach. From these insights, we propose a novel regularization loss function that encourages models to focus more on challenging knowledge areas, thereby enhancing overall performance. Experimentally, our method demonstrated significant superiority over the most robust state-of-the-art models on various benchmarks.

[LG-143] Toward Debugging Deep Reinforcement Learning Programs with RLExplorer

Link: https://arxiv.org/abs/2410.04322
Authors: Rached Bouchoucha, Ahmed Haj Yahmed, Darshan Patil, Janarthanan Rajendran, Amin Nikanjam, Sarath Chandar, Foutse Khomh
Keywords: Deep reinforcement learning, Deep reinforcement, computer games, shown success, success in diverse
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in The International Conference on Software Maintenance and Evolution (ICSME 2024)

Abstract:Deep reinforcement learning (DRL) has shown success in diverse domains such as robotics, computer games, and recommendation systems. However, like any other software system, DRL-based software systems are susceptible to faults that pose unique challenges for debugging and diagnosing. These faults often result in unexpected behavior without explicit failures and error messages, making debugging difficult and time-consuming. Therefore, automating the monitoring and diagnosis of DRL systems is crucial to alleviate the burden on developers. In this paper, we propose RLExplorer, the first fault diagnosis approach for DRL-based software systems. RLExplorer automatically monitors training traces and runs diagnosis routines based on properties of the DRL learning dynamics to detect the occurrence of DRL-specific faults. It then logs the results of these diagnoses as warnings that cover theoretical concepts, recommended practices, and potential solutions to the identified faults. We conducted two sets of evaluations to assess RLExplorer. Our first evaluation of faulty DRL samples from Stack Overflow revealed that our approach can effectively diagnose real faults in 83% of the cases. Our second evaluation of RLExplorer with 15 DRL experts/developers showed that (1) RLExplorer could identify 3.6 times more defects than manual debugging and (2) RLExplorer is easily integrated into DRL applications.

[LG-144] Calibrating Expressions of Certainty

Link: https://arxiv.org/abs/2410.04315
Authors: Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M. Wells, Tina Kapur, Polina Golland
Keywords: calibrating linguistic expressions, approach to calibrating, calibrating linguistic, linguistic expressions, Abstract
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We present a novel approach to calibrating linguistic expressions of certainty, e.g., “Maybe” and “Likely”. Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.

[LG-145] Discovering Hidden Pollution Hotspots Using Sparse Sensor Measurements

Link: https://arxiv.org/abs/2410.04309
Authors: Ankit Bhardwaj, Ananth Balashankar, Shiva Iyer, Nita Soans, Anant Sudarshan, Rohini Pande, Lakshminarayanan Subramanian
Keywords: air pollution management, urban areas relies, mitigation strategies, management in urban, high costs
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Effective air pollution management in urban areas relies on both monitoring and mitigation strategies, yet high costs often limit sensor networks to a few key pollution hotspots. In this paper, we show that New Delhi’s public sensor network is insufficient for identifying all pollution hotspots. To address this, we augmented the city’s network with 28 low-cost sensors, monitoring PM2.5 concentrations over 30 months (May 2018 to November 2020). Our analysis uncovered 189 additional hotspots, supplementing the 660 already detected by the government network. We observed that Space-Time Kriging with limited but accurate sensor data provides a more robust and generalizable approach for identifying these hotspots, as compared to deep learning models that require large amounts of fine-grained multi-modal data (emissions inventory, meteorology, etc.) which was not reliably, frequently and accurately available in the New Delhi context. Using Space-Time Kriging, we achieved 98% precision and 95.4% recall in detecting hotspots with 50% sensor failure. Furthermore, this method proved effective in predicting hotspots in areas without sensors, achieving 95.3% precision and 88.5% recall in the case of 50% missing sensors. Our findings revealed that a significant portion of New Delhi’s population, around 23 million people, was exposed to pollution hotspots for at least half of the study period. We also identified areas beyond the reach of the public sensor network that should be prioritized for pollution control. These results highlight the need for more comprehensive monitoring networks and suggest Space-Time Kriging as a viable solution for cities facing similar resource constraints.

[LG-146] Integrating Physics-Informed Deep Learning and Numerical Methods for Robust Dynamics Discovery and Parameter Estimation

Link: https://arxiv.org/abs/2410.04299
Authors: Caitlin Ho, Andrea Arnold
Keywords: priori physics knowledge, machine learning leads, Incorporating a priori, interpretable algorithms, priori physics
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments: 30 pages, 11 figures

Abstract:Incorporating a priori physics knowledge into machine learning leads to more robust and interpretable algorithms. In this work, we combine deep learning techniques and classic numerical methods for differential equations to solve two challenging problems in dynamical systems theory: dynamics discovery and parameter estimation. Results demonstrate the effectiveness of the proposed approaches on a suite of test problems exhibiting oscillatory and chaotic dynamics. When comparing the performance of various numerical schemes, such as the Runge-Kutta and linear multistep families of methods, we observe promising results in predicting the system dynamics and estimating physical parameters, given appropriate choices of spatial and temporal discretization schemes and numerical method orders.

[LG-147] Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance

Link: https://arxiv.org/abs/2410.04297
Authors: Stanisław Kaźmierczak, Jacek Mańdziuk
Keywords: individual training set, utilize bootstrap sampling, component tree, original training set, training set
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set (N). Research literature indicates that drawing fewer than N observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than N observations (BR > 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR \leq 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is \leq 1 or > 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration.
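The bootstrap rate defined in the abstract is easy to make concrete. The sketch below draws a bootstrap sample of size round(BR * N) with replacement and compares the observed share of distinct training instances with its expectation, 1 - (1 - 1/N)^(BR * N). It illustrates the sampling step only. The function names are illustrative, and nothing here reproduces the paper's experiments.

```python
import random


def bootstrap_indices(n_instances, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_instances) row indices with replacement."""
    sample_size = round(bootstrap_rate * n_instances)
    return [rng.randrange(n_instances) for _ in range(sample_size)]


def expected_unique_fraction(n_instances, bootstrap_rate):
    """Expected share of distinct training instances in one bootstrap sample."""
    sample_size = round(bootstrap_rate * n_instances)
    return 1.0 - (1.0 - 1.0 / n_instances) ** sample_size


rng = random.Random(0)
n = 1000
for rate in (0.5, 1.0, 2.0, 5.0):
    idx = bootstrap_indices(n, rate, rng)
    observed = len(set(idx)) / n
    print(rate, len(idx), round(observed, 3), round(expected_unique_fraction(n, rate), 3))
```

With BR = 1 roughly 63% of the instances appear in each tree's sample, and larger rates push that share toward 100% while also repeating instances more often. That is the trade-off a rate above 1 changes.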

[LG-148] Self-Supervised Anomaly Detection in the Wild: Favor Joint Embeddings Methods

Link: https://arxiv.org/abs/2410.04289
Authors: Daniel Otero, Rafael Mateus, Randall Balestriero
Keywords: prevent costly failures, Accurate anomaly detection, vision-based infrastructure inspection, Accurate anomaly, SSL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Accurate anomaly detection is critical in vision-based infrastructure inspection, where it helps prevent costly failures and enhances safety. Self-Supervised Learning (SSL) offers a promising approach by learning robust representations from unlabeled data. However, its application in anomaly detection remains underexplored. This paper addresses this gap by providing a comprehensive evaluation of SSL methods for real-world anomaly detection, focusing on sewer infrastructure. Using the Sewer-ML dataset, we evaluate lightweight models such as ViT-Tiny and ResNet-18 across SSL frameworks, including BYOL, Barlow Twins, SimCLR, DINO, and MAE, under varying class imbalance levels. Through 250 experiments, we rigorously assess the performance of these SSL methods to ensure a robust and comprehensive evaluation. Our findings highlight the superiority of joint-embedding methods like SimCLR and Barlow Twins over reconstruction-based approaches such as MAE, which struggle to maintain performance under class imbalance. Furthermore, we find that the SSL model choice is more critical than the backbone architecture. Additionally, we emphasize the need for better label-free assessments of SSL representations, as current methods like RankMe fail to adequately evaluate representation quality, making cross-validation without labels infeasible. Despite the remaining performance gap between SSL and supervised models, these findings highlight the potential of SSL to enhance anomaly detection, paving the way for further research in this underexplored area of SSL applications.

[LG-149] Enhancing Carbon Emission Reduction Strategies using OCO and ICOS data

Link: https://arxiv.org/abs/2410.04288
Authors: Oskar Åström, Carina Geldhauser, Markus Grillitsch, Ola Hall, Alexandros Sopasakis
Keywords: Orbiting Carbon Observatories, Carbon Observation System, Integrated Carbon Observation, ECMWF Reanalysis, Observation System
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 7 figures, 1 table, 1 algorithm

Abstract:We propose a methodology to enhance local CO2 monitoring by integrating satellite data from the Orbiting Carbon Observatories (OCO-2 and OCO-3) with ground level observations from the Integrated Carbon Observation System (ICOS) and weather data from the ECMWF Reanalysis v5 (ERA5). Unlike traditional methods that downsample national data, our approach uses multimodal data fusion for high-resolution CO2 estimations. We employ weighted K-nearest neighbor (KNN) interpolation with machine learning models to predict ground level CO2 from satellite measurements, achieving a Root Mean Squared Error of 3.92 ppm. Our results show the effectiveness of integrating diverse data sources in capturing local emission patterns, highlighting the value of high-resolution atmospheric transport models. The developed model improves the granularity of CO2 monitoring, providing precise insights for targeted carbon mitigation strategies, and represents a novel application of neural networks and KNN in environmental monitoring, adaptable to various regions and temporal scales.
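The abstract mentions weighted K-nearest-neighbor interpolation. A minimal sketch of one common variant, inverse-distance weighting over the k nearest observations, is shown below. The weighting scheme, the value of `k`, the planar coordinates, and the CO2 readings are all made up for illustration and are not taken from the paper.

```python
import math


def weighted_knn_predict(query, points, values, k=3, eps=1e-12):
    """Inverse-distance-weighted average of the k observations nearest to query."""
    by_distance = sorted((math.dist(query, p), v) for p, v in zip(points, values))
    nearest = by_distance[:k]
    # eps avoids division by zero when the query coincides with an observation.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Hypothetical sounding locations (x, y) and made-up CO2 readings in ppm.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
values = [410.0, 412.0, 414.0, 430.0]

print(weighted_knn_predict((0.5, 0.5), points, values))
print(weighted_knn_predict((4.0, 4.0), points, values, k=2))
```

A real pipeline would use geodesic distances and would feed the interpolated values into a learned model, as the abstract describes. This sketch stops at the interpolation step.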

[LG-150] Unveiling the Impact of Local Homophily on GNN Fairness: In-Depth Analysis and New Benchmarks

Link: https://arxiv.org/abs/2410.04287
Authors: Donald Loveland, Danai Koutra
Keywords: Graph Neural Networks, Neural Networks, local homophily levels, homophily levels, homophily
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)

Abstract:Graph Neural Networks (GNNs) often struggle to generalize when graphs exhibit both homophily (same-class connections) and heterophily (different-class connections). Specifically, GNNs tend to underperform for nodes with local homophily levels that differ significantly from the global homophily level. This issue poses a risk in user-centric applications where underrepresented homophily levels are present. Concurrently, fairness within GNNs has received substantial attention due to the potential amplification of biases via message passing. However, the connection between local homophily and fairness in GNNs remains underexplored. In this work, we move beyond global homophily and explore how local homophily levels can lead to unfair predictions. We begin by formalizing the challenge of fair predictions for underrepresented homophily levels as an out-of-distribution (OOD) problem. We then conduct a theoretical analysis that demonstrates how local homophily levels can alter predictions for differing sensitive attributes. We additionally introduce three new GNN fairness benchmarks, as well as a novel semi-synthetic graph generator, to empirically study the OOD problem. Across extensive analysis we find that two factors can promote unfairness: (a) OOD distance, and (b) heterophilous nodes situated in homophilous graphs. In cases where these two conditions are met, fairness drops by up to 24% on real world datasets, and 30% in semi-synthetic datasets. Together, our theoretical insights, empirical analysis, and algorithmic contributions unveil a previously overlooked source of unfairness rooted in the graph’s homophily information.
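To make the local versus global distinction concrete, the sketch below computes a per-node homophily ratio (the share of a node's neighbors with the same label) and an edge-level global ratio on a toy graph. These are common definitions assumed here for illustration. The paper's exact formulation may differ, and the graph and labels are invented.

```python
def local_homophily(adjacency, labels):
    """Per-node fraction of neighbors sharing the node's label (None if isolated)."""
    result = {}
    for node, neighbors in adjacency.items():
        if not neighbors:
            result[node] = None
            continue
        same = sum(1 for nb in neighbors if labels[nb] == labels[node])
        result[node] = same / len(neighbors)
    return result


def global_homophily(adjacency, labels):
    """Fraction of edges (counted once per direction) joining same-label nodes."""
    same = total = 0
    for node, neighbors in adjacency.items():
        for nb in neighbors:
            total += 1
            same += labels[nb] == labels[node]
    return same / total if total else None


# Toy undirected graph stored as symmetric adjacency lists.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: "A", 1: "A", 2: "A", 3: "B"}

print(local_homophily(adjacency, labels))
print(global_homophily(adjacency, labels))
```

Node 3 is the kind of case the abstract highlights: its local homophily is 0 while the graph as a whole is mostly homophilous.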

[LG-151] Applying Hybrid Graph Neural Networks to Strengthen Credit Risk Analysis

Link: https://arxiv.org/abs/2410.04283
Authors: Mengfang Sun, Wenying Sun, Ying Sun, Shaobo Liu, Mohan Jiang, Zhen Xu
Keywords: Convolutional Neural Networks, Graph Convolutional Neural, employing Graph Convolutional, Neural Networks, employing Graph
Categories: Machine Learning (cs.LG)

Abstract:This paper presents a novel approach to credit risk prediction by employing Graph Convolutional Neural Networks (GCNNs) to assess the creditworthiness of borrowers. Leveraging the power of big data and artificial intelligence, the proposed method addresses the challenges faced by traditional credit risk assessment models, particularly in handling imbalanced datasets and extracting meaningful features from complex relationships. The paper begins by transforming raw borrower data into graph-structured data, where borrowers and their relationships are represented as nodes and edges, respectively. A classic subgraph convolutional model is then applied to extract local features, followed by the introduction of a hybrid GCNN model that integrates both local and global convolutional operators to capture a comprehensive representation of node features. The hybrid model incorporates an attention mechanism to adaptively select features, mitigating issues of over-smoothing and insufficient feature consideration. The study demonstrates the potential of GCNNs in improving the accuracy of credit risk prediction, offering a robust solution for financial institutions seeking to enhance their lending decision-making processes.

[LG-152] Black Boxes and Looking Glasses: Multilevel Symmetries, Reflection Planes, and Convex Optimization in Deep Networks

Link: https://arxiv.org/abs/2410.04279
Authors: Emi Zeger, Mert Pilanci
Keywords: arbitrary input dimension, convex Lasso problems, equivalent convex Lasso, absolute value activation, activation and arbitrary
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We show that training deep neural networks (DNNs) with absolute value activation and arbitrary input dimension can be formulated as equivalent convex Lasso problems with novel features expressed using geometric algebra. This formulation reveals geometric structures encoding symmetry in neural networks. Using the equivalent Lasso form of DNNs, we formally prove a fundamental distinction between deep and shallow networks: deep networks inherently favor symmetric structures in their fitted functions, with greater depth enabling multilevel symmetries, i.e., symmetries within symmetries. Moreover, Lasso features represent distances to hyperplanes that are reflected across training points. These reflection hyperplanes are spanned by training data and are orthogonal to optimal weight vectors. Numerical experiments support theory and demonstrate theoretically predicted features when training networks using embeddings generated by Large Language Models.

[LG-153] Language Model-Driven Data Pruning Enables Efficient Active Learning

Link: https://arxiv.org/abs/2410.04275
Authors: Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
Keywords: unlabeled pool, optimizes data labeling, unlabeled data pools, instances for annotation, unlabeled
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 4 figures

Abstract:Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. A key component in this procedure is an acquisition function that guides the selection process and identifies the suitable instances for labeling from the unlabeled pool. However, these acquisition methods suffer from high computational costs with large unlabeled data pools, posing a roadblock to their applicability on large datasets. To address this challenge and bridge this gap, we introduce a novel plug-and-play unlabeled data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using metrics for data quality computed through a quantized LLM. Additionally, to enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks on four diverse datasets and four active learning strategies demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the selection quality vs. efficiency tradeoff of the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, and provides up to 74% reduction in the end-to-end time required for active learning.

[LG-154] Fundamental Limitations on Subquadratic Alternatives to Transformers

Link: https://arxiv.org/abs/2410.04271
Authors: Josh Alman,Hantao Yu
Keywords: impactful Large Language, Large Language Models, impactful Large, Large Language, architecture is widely
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL)

Abstract:The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.

[LG-155] DeFoG: Discrete Flow Matching for Graph Generation

Link: https://arxiv.org/abs/2410.04263
Authors: Yiming Qin,Manuel Madeira,Dorina Thanou,Pascal Frossard
Keywords: diverse scientific applications, realistic data points, scientific applications, fundamental in diverse, diverse scientific
Categories: Machine Learning (cs.LG)

Abstract:Graph generation is fundamental in diverse scientific applications, due to its ability to reveal the underlying distribution of complex data, and eventually generate new, realistic data points. Despite the success of diffusion models in this domain, those face limitations in sampling efficiency and flexibility, stemming from the tight coupling between the training and sampling stages. To address this, we propose DeFoG, a novel framework using discrete flow matching for graph generation. DeFoG employs a flow-based approach that features an efficient linear interpolation noising process and a flexible denoising process based on a continuous-time Markov chain formulation. We leverage an expressive graph transformer and ensure desirable node permutation properties to respect graph symmetry. Crucially, our framework enables a disentangled design of the training and sampling stages, enabling more effective and efficient optimization of model performance. We navigate this design space by introducing several algorithmic improvements that boost the model performance, consistently surpassing existing diffusion models. We also theoretically demonstrate that, for general discrete data, discrete flow models can faithfully replicate the ground truth distribution - a result that naturally extends to graph data and reinforces DeFoG’s foundations. Extensive experiments show that DeFoG achieves state-of-the-art results on synthetic and molecular datasets, improving both training and sampling efficiency over diffusion models, and excels in conditional generation on a digital pathology dataset.

[LG-156] Compositional Diffusion Models for Powered Descent Trajectory Generation with Flexible Constraints

Link: https://arxiv.org/abs/2410.04261
Authors: Julia Briden,Yilun Du,Enrico M. Zucchelli,Richard Linares
Keywords: powered descent guidance, compositional diffusion-based flexible, freedom powered descent, work introduces TrajDiffuser, concurrent trajectory generator
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Full manuscript submitted to IEEE Aerospace 2025 on 4-Oct-2024

Abstract:This work introduces TrajDiffuser, a compositional diffusion-based flexible and concurrent trajectory generator for 6 degrees of freedom powered descent guidance. TrajDiffuser is a statistical model that learns the multi-modal distributions of a dataset of simulated optimal trajectories, each subject to only one or few constraints that may vary for different trajectories. During inference, the trajectory is generated simultaneously over time, providing stable long-horizon planning, and constraints can be composed together, increasing the model’s generalizability and decreasing the training data required. The generated trajectory is then used to initialize an optimizer, increasing its robustness and speed.

[LG-157] Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia EMNLP2024

Link: https://arxiv.org/abs/2410.04254
Authors: Tomás Feith,Akhil Arora,Martin Gerlach,Debjit Paul,Robert West
Keywords: turning isolated pieces, fundamental part, entity insertion, turning isolated, isolated pieces
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: EMNLP 2024; 24 pages; 62 figures

Abstract:Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is, both, relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

[LG-158] Enhancing Future Link Prediction in Quantum Computing Semantic Networks through LLM-Initiated Node Features

Link: https://arxiv.org/abs/2410.04251
Authors: Gilchan Park,Paul Baity,Byung-Jun Yoon,Adolfy Hoisie
Keywords: accelerate computational processes, solve complex problems, computer science, offering the potential, computational processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Quantum Physics (quant-ph)

Abstract:Quantum computing is rapidly evolving in both physics and computer science, offering the potential to solve complex problems and accelerate computational processes. The development of quantum chips necessitates understanding the correlations among diverse experimental conditions. Semantic networks built on scientific literature, representing meaningful relationships between concepts, have been used across various domains to identify knowledge gaps and novel concept combinations. Neural network-based approaches have shown promise in link prediction within these networks. This study proposes initializing node features using LLMs to enhance node representations for link prediction tasks in graph neural networks. LLMs can provide rich descriptions, reducing the need for manual feature creation and lowering costs. Our method, evaluated using various link prediction models on a quantum computing semantic network, demonstrated efficacy compared to traditional node embedding techniques.

[LG-159] Towards the Best Solution for Complex System Reliability: Can Statistics Outperform Machine Learning?

Link: https://arxiv.org/abs/2410.04238
Authors: Maria Luz Gamiz,Fernando Navas-Gomez,Rafael Nozal-Cañadas,Rocio Raya-Miranda
Keywords: effectively deploying models, techniques involves facing, learning techniques involves, machine learning, machine learning methods
Categories: Machine Learning (cs.LG)
Comments: 33 pages; 5 figures

Abstract:Studying the reliability of complex systems using machine learning techniques involves facing a series of technical and practical challenges, ranging from the intrinsic nature of the system and data to the difficulties in modeling and effectively deploying models in real-world scenarios. This study compares the effectiveness of classical statistical techniques and machine learning methods for improving complex system analysis in reliability assessments. We aim to demonstrate that classical statistical algorithms often yield more precise and interpretable results than black-box machine learning approaches in many practical applications. The evaluation is conducted using both real-world data and simulated scenarios. We report the results obtained from statistical modeling algorithms, as well as from machine learning methods including neural networks, K-nearest neighbors, and random forests.

[LG-160] Overview of Factify5WQA: Fact Verification through 5W Question-Answering AAAI2024

Link: https://arxiv.org/abs/2410.04236
Authors: Suryavardan Suresh,Anku Rani,Parth Patwa,Aishwarya Reganti,Vinija Jain,Aman Chadha,Amitava Das,Amit Sheth,Asif Ekbal
Keywords: Researchers have found, spreads much times, times faster, faster than real, Fact verification
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at defactify3@aaai2024

Abstract:Researchers have found that fake news spreads many times faster than real news. This is a major problem, especially in today’s world where social media is the key source of news for many among the younger population. Fact verification, thus, becomes an important task and many media sites contribute to the cause. Manual fact verification is a tedious task, given the volume of fake news online. The Factify5WQA shared task aims to increase research towards automated fake news detection by providing a dataset with an aspect-based question answering based fact verification method. Each claim and its supporting document are associated with 5W questions that help compare the two information sources. The objective performance measure in the task is done by comparing answers using BLEU score to measure the accuracy of the answers, followed by an accuracy measure of the classification. The task had submissions using custom training setups and pre-trained language models, among others. The best performing team posted an accuracy of 69.56%, which is a nearly 35% improvement over the baseline.

[LG-161] Improving Distribution Alignment with Diversity-based Sampling

Link: https://arxiv.org/abs/2410.04235
Authors: Andrea Napoli,Paul White
Keywords: machine learning, ubiquitous in machine, substantially degrade, performance when deployed, real-world domain shift
Categories: Machine Learning (cs.LG)
Comments: DCASE 2024

Abstract:Domain shifts are ubiquitous in machine learning, and can substantially degrade a model’s performance when deployed to real-world data. To address this, distribution alignment methods aim to learn feature representations which are invariant across domains, by minimising the discrepancy between the distributions. However, the discrepancy estimates can be extremely noisy when training via stochastic gradient descent (SGD), and shifts in the relative proportions of different subgroups can lead to domain misalignments; these can both stifle the benefits of the method. This paper proposes to improve these estimates by inducing diversity in each sampled minibatch. This simultaneously balances the data and reduces the variance of the gradients, thereby enhancing the model’s generalisation ability. We describe two options for diversity-based data samplers, based on the k-determinantal point process (k-DPP) and the k-means++ algorithm, which can function as drop-in replacements for a standard random sampler. On a real-world domain shift task of bioacoustic event detection, we show that both options 1) yield minibatches which are more representative of the full dataset; 2) reduce the distance estimation error between distributions, for a given sample size; and 3) improve out-of-distribution accuracy for two distribution alignment algorithms, as well as standard ERM.
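The k-means++ option mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' sampler: it only applies the standard k-means++ seeding rule (pick each new point with probability proportional to its squared distance from the points already chosen) to select the indices of one minibatch. The function name `kmeanspp_minibatch`, the toy features, and the uniform fallback for duplicate points are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_minibatch(features, k, rng):
    """Select k indices with k-means++ seeding (D^2 sampling).

    Hypothetical helper: a diverse minibatch is drawn by repeatedly
    sampling a point with probability proportional to its squared
    distance from the closest already-selected point.
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(features)")
    chosen = [rng.randrange(n)]  # first point is uniform at random
    d2 = [sq_dist(f, features[chosen[0]]) for f in features]
    while len(chosen) < k:
        if sum(d2) > 0.0:
            # Points already chosen have weight 0 and cannot be re-drawn.
            idx = rng.choices(range(n), weights=d2, k=1)[0]
        else:
            # All remaining points duplicate chosen ones: fall back to uniform.
            idx = rng.choice([i for i in range(n) if i not in chosen])
        chosen.append(idx)
        # Keep, for every point, the distance to its nearest chosen point.
        d2 = [min(old, sq_dist(f, features[idx])) for old, f in zip(d2, features)]
    return chosen

# Toy usage: six distinct 2-D feature vectors, minibatch of four.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
batch = kmeanspp_minibatch(toy_features, 4, random.Random(0))
```

Because far-away points receive larger sampling weight, a batch drawn this way tends to cover distinct regions of feature space, which is the intuition behind using it as a drop-in replacement for a uniform random sampler.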

[LG-162] Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

Link: https://arxiv.org/abs/2410.04234
Authors: Zi Wang,Divyam Anshumaan,Ashish Hooda,Yudong Chen,Somesh Jha
Keywords: undesired model responses, mitigate undesired model, widely employed, employed in deep, deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the functional homotopy method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20%-30% improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.

[LG-163] SGD with memory: fundamental properties and stochastic acceleration

Link: https://arxiv.org/abs/2410.04228
Authors: Dmitry Yarotsky,Maksim Velikanov
Keywords: important open problem, theoretically feasible acceleration, mini-batch SGD-type algorithms, open problem, quadratic problems
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t \sim C_L t^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise. We address this challenge by considering first-order methods with an arbitrary fixed number $M$ of auxiliary velocity vectors (memory-$M$ algorithms). We first prove an equivalence between two forms of such algorithms and describe them in terms of suitable characteristic polynomials. Then we develop a general expansion of the loss in terms of signal and noise propagators. Using it, we show that losses of stationary stable memory-$M$ algorithms always retain the exponent $\xi$ of plain GD, but can have different constants $C_L$ depending on their effective learning rate that generalizes that of HB. We prove that in memory-1 algorithms we can make $C_L$ arbitrarily small while maintaining stability. As a consequence, we propose a memory-1 algorithm with a time-dependent schedule that we show heuristically and experimentally to improve the exponent $\xi$ of plain SGD.
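As a rough illustration of the setting, the sketch below compares plain GD with constant-parameter Heavy Ball on a noiseless diagonal quadratic whose eigenvalues follow a power law. It is not the paper's algorithm: there is no mini-batch noise and no time-dependent schedule, and the spectrum exponent, step size, momentum, and function names are all assumptions chosen for the example. It only shows that one velocity vector (memory 1) changes the loss reached after a fixed number of steps, in line with the usual Heavy Ball effective learning rate $\alpha/(1-\beta)$.

```python
def power_law_spectrum(n, nu=2.0):
    """Eigenvalues lambda_k = k^(-nu), k = 1..n (assumed toy spectrum)."""
    return [k ** (-nu) for k in range(1, n + 1)]

def loss(w, lam):
    """Quadratic loss 0.5 * sum_k lambda_k * w_k^2 in the eigenbasis."""
    return 0.5 * sum(l * x * x for l, x in zip(lam, w))

def run_gd(lam, steps, lr):
    """Plain gradient descent: w <- w - lr * lambda * w per coordinate."""
    w = [1.0] * len(lam)
    for _ in range(steps):
        w = [x - lr * l * x for x, l in zip(w, lam)]
    return loss(w, lam)

def run_heavy_ball(lam, steps, lr, beta):
    """Heavy Ball with one velocity vector and constant momentum beta."""
    w = [1.0] * len(lam)
    v = [0.0] * len(lam)
    for _ in range(steps):
        v = [beta * vi - lr * l * x for vi, l, x in zip(v, lam, w)]
        w = [x + vi for x, vi in zip(w, v)]
    return loss(w, lam)

lam = power_law_spectrum(50)
initial = loss([1.0] * len(lam), lam)
gd_loss = run_gd(lam, steps=200, lr=1.0)
hb_loss = run_heavy_ball(lam, steps=200, lr=1.0, beta=0.9)
print(f"initial={initial:.4f}  GD={gd_loss:.6f}  HB={hb_loss:.6f}")
```

With these settings both runs are stable (the largest eigenvalue is 1, and Heavy Ball is stable for $\alpha\lambda < 2(1+\beta)$), and Heavy Ball reaches a lower loss than GD after the same number of steps because the slow, small-eigenvalue directions see a larger effective step.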

[LG-164] Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Link: https://arxiv.org/abs/2410.04223
Authors: Gang Liu,Michael Sun,Wojciech Matusik,Meng Jiang,Jie Chen
Keywords: large language models, graphs remains challenging, language models, integrated images, remains challenging
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Comments: 27 pages, 11 figures, 4 tables

Abstract:While large language models (LLMs) have integrated images, adapting them to graphs remains challenging, limiting their applications in materials and drug design. This difficulty stems from the need for coherent autoregressive generation across texts and graphs. To address this, we introduce Llamole, the first multimodal LLM capable of interleaved text and graph generation, enabling molecular inverse design with retrosynthetic planning. Llamole integrates a base LLM with the Graph Diffusion Transformer and Graph Neural Networks for multi-conditional molecular generation and reaction inference within texts, while the LLM, with enhanced molecular understanding, flexibly controls activation among the different graph modules. Additionally, Llamole integrates A* search with LLM-based cost functions for efficient retrosynthetic planning. We create benchmarking datasets and conduct extensive experiments to evaluate Llamole against in-context learning and supervised fine-tuning. Llamole significantly outperforms 14 adapted LLMs across 12 metrics for controllable molecular design and retrosynthetic planning.

[LG-165] Equivariant Polynomial Functional Networks

Link: https://arxiv.org/abs/2410.04213
Authors: Thieu N. Vo,Viet-Hoang Tran,Tho Tran Huu,An Nguyen The,Thanh Tran,Minh-Khoi Nguyen-Nhat,Duy-Tung Pham,Tan Minh Nguyen
Keywords: Neural Functional Networks, including extracting information, gained increasing interest, input neural networks, Neural Functional
Categories: Machine Learning (cs.LG)

Abstract:Neural Functional Networks (NFNs) have gained increasing interest due to their wide range of applications, including extracting information from implicit representations of data, editing network weights, and evaluating policies. A key design principle of NFNs is their adherence to the permutation and scaling symmetries inherent in the connectionist structure of the input neural networks. Recent NFNs have been proposed with permutation and scaling equivariance based on either graph-based message-passing mechanisms or parameter-sharing mechanisms. However, graph-based equivariant NFNs suffer from high memory consumption and long running times. On the other hand, parameter-sharing-based NFNs built upon equivariant linear layers exhibit lower memory consumption and faster running time, yet their expressivity is limited due to the large size of the symmetric group of the input neural networks. The challenge of designing a permutation and scaling equivariant NFN that maintains low memory consumption and running time while preserving expressivity remains unresolved. In this paper, we propose a novel solution with the development of MAGEP-NFN (Monomial mAtrix Group Equivariant Polynomial NFN). Our approach follows the parameter-sharing mechanism but differs from previous works by constructing a nonlinear equivariant layer represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, enhancing the model’s expressivity while keeping memory consumption and running time low, thereby addressing the aforementioned challenge. We provide empirical evidence demonstrating that MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines.

[LG-166] Equivariant Neural Functional Networks for Transformers

Link: https://arxiv.org/abs/2410.04209
Authors: Viet-Hoang Tran,Thieu N. Vo,An Nguyen The,Tho Tran Huu,Minh-Khoi Nguyen-Nhat,Thanh Tran,Duy-Tung Pham,Tan Minh Nguyen
Keywords: systematically explores neural, explores neural functional, neural functional networks, paper systematically explores, NFN
Categories: Machine Learning (cs.LG)

Abstract:This paper systematically explores neural functional networks (NFN) for transformer architectures. NFN are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data and have proven valuable for tasks such as learnable optimizers, implicit data representations, and weight editing. While NFN have been extensively developed for MLP and CNN, no prior work has addressed their design for transformers, despite the importance of transformers in modern deep learning. This paper aims to address this gap by providing a systematic study of NFN for transformers. We first determine the maximal symmetric group of the weights in a multi-head attention module as well as a necessary and sufficient condition under which two sets of hyperparameters of the multi-head attention module define the same function. We then define the weight space of transformer architectures and its associated group action, which leads to the design principles for NFN in transformers. Based on these, we introduce Transformer-NFN, an NFN that is equivariant under this group action. Additionally, we release a dataset of more than 125,000 Transformers model checkpoints trained on two datasets with two different tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance.

[LG-167] Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models

Link: https://arxiv.org/abs/2410.04207
Authors: Theo (Moe) Putterman,Derek Lim,Yoav Gelberg,Stefanie Jegelka,Haggai Maron
Keywords: enabling efficient adaptation, limited computational resources, large foundation models, Low-rank adaptations, efficient adaptation
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages

Abstract:Low-rank adaptations (LoRAs) have revolutionized the finetuning of large foundation models, enabling efficient adaptation even with limited computational resources. The resulting proliferation of LoRAs presents exciting opportunities for applying machine learning techniques that take these low-rank weights themselves as inputs. In this paper, we investigate the potential of Learning on LoRAs (LoL), a paradigm where LoRA weights serve as input to machine learning models. For instance, an LoL model that takes in LoRA weights as inputs could predict the performance of the finetuned model on downstream tasks, detect potentially harmful finetunes, or even generate novel model edits without traditional training methods. We first identify the inherent parameter symmetries of low rank decompositions of weights, which differ significantly from the parameter symmetries of standard neural networks. To efficiently process LoRA weights, we develop several symmetry-aware invariant or equivariant LoL models, using tools such as canonicalization, invariant featurization, and equivariant layers. We finetune thousands of text-to-image diffusion models and language models to collect datasets of LoRAs. In numerical experiments on these datasets, we show that our LoL architectures are capable of processing low rank weight decompositions to predict CLIP score, finetuning data attributes, finetuning data membership, and accuracy on downstream tasks.
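The parameter symmetry of low-rank decompositions mentioned in the abstract can be checked numerically. The sketch below assumes the common LoRA convention that the weight update is the product $\Delta W = BA$ of two rank-$r$ factors; under that assumption, replacing $(B, A)$ with $(BG, G^{-1}A)$ for any invertible $r \times r$ matrix $G$ leaves $\Delta W$ unchanged. The matrices and helper names are toy values invented for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def max_abs_diff(x, y):
    """Largest entrywise absolute difference between two equal-shape matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

# Toy rank-2 factors: B is 3x2, A is 2x3, so delta_w = B @ A is 3x3.
B = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
A = [[2.0, 0.0, 1.0], [1.0, 4.0, -2.0]]

# An invertible 2x2 matrix G (determinant 1) and its exact inverse.
G = [[2.0, 1.0], [1.0, 1.0]]
G_inv = [[1.0, -1.0], [-1.0, 2.0]]

delta_w = matmul(B, A)
B_new = matmul(B, G)        # transformed left factor
A_new = matmul(G_inv, A)    # transformed right factor
delta_w_new = matmul(B_new, A_new)

print(max_abs_diff(delta_w, delta_w_new))  # the two updates coincide
```

Since many different factor pairs encode the same finetune, a model that consumes LoRA weights benefits from being invariant or equivariant to this change of basis, which is the motivation for the symmetry-aware architectures the abstract describes.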

[LG-168] Deep Transfer Learning Based Peer Review Aggregation and Meta-review Generation for Scientific Articles

Link: https://arxiv.org/abs/2410.04202
Authors: Md. Tarek Hasan,Mohammad Nazmush Shamael,H. M. Mutasim Billah,Arifa Akter,Md Al Emran Hossain,Sumayra Islam,Salekul Islam,Swakkhar Shatabda
Keywords: Peer, acceptance decision prediction, Peer review, peer experts, meta-review
Categories: Machine Learning (cs.LG)

Abstract:Peer review is the quality assessment of a manuscript by one or more peer experts. Papers are submitted by the authors to scientific venues, and these papers must be reviewed by peers or other authors. The meta-reviewers then gather the peer reviews, assess them, and create a meta-review and decision for each manuscript. As the number of papers submitted to these venues has grown in recent years, it becomes increasingly challenging for meta-reviewers to collect these peer evaluations on time while still maintaining the quality that is the primary goal of meta-review creation. In this paper, we address two peer review aggregation challenges a meta-reviewer faces: paper acceptance decision-making and meta-review generation. Firstly, we propose to automate the process of acceptance decision prediction by applying traditional machine learning algorithms. We use the pre-trained word embedding technique BERT to process the reviews written in natural language text. For the meta-review generation, we propose a transfer learning model based on the T5 model. Experimental results show that BERT is more effective than the other word embedding techniques, and the recommendation score is an important feature for the acceptance decision prediction. In addition, we find that fine-tuned T5 outperforms other inference models. Our proposed system takes peer reviews and other relevant features as input to produce a meta-review and make a judgment on whether or not the paper should be accepted. In addition, experimental results show that the acceptance decision prediction system of our task outperforms the existing models, and the meta-review generation task shows significantly improved scores compared to the existing models. For the statistical test, we utilize the Wilcoxon signed-rank test to assess whether there is a statistically significant improvement between paired observations.

[LG-169] Improving Generalization with Flat Hilbert Bayesian Inference

Link: https://arxiv.org/abs/2410.04196
Authors: Tuan Truong,Quyen Tran,Quan Pham-Ngoc,Nhat Ho,Dinh Phung,Trung Le
Keywords: Hilbert Bayesian Inference, Flat Hilbert Bayesian, Bayesian Inference, introduce Flat Hilbert, Hilbert Bayesian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against seven baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy.

[LG-170] Parametric Taylor series based latent dynamics identification neural networks

Link: https://arxiv.org/abs/2410.04193
Authors: Xinlei Lin,Dunhui Xiao
Keywords: Numerical solving parameterised, Numerical solving, partial differential equations, solving parameterised partial, parameterised partial differential
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Dynamical Systems (math.DS)

Abstract:Numerically solving parameterised partial differential equations (P-PDEs) is highly practical yet computationally expensive, driving the development of reduced-order models (ROMs). Recently, methods that combine latent space identification techniques with deep learning algorithms (e.g., autoencoders) have shown great potential in describing the dynamical system in the lower dimensional latent space, for example, LaSDI, gLaSDI and GPLaSDI. In this paper, a new parametric latent identification of nonlinear dynamics neural networks, P-TLDINets, is introduced, which relies on a novel neural network structure based on Taylor series expansion and ResNets to learn the ODEs that govern the reduced space dynamics. During the training process, Taylor series-based Latent Dynamic Neural Networks (TLDNets) and identified equations are trained simultaneously to generate a smoother latent space. In order to facilitate the parameterised study, a $k$-nearest neighbours (KNN) method based on an inverse distance weighting (IDW) interpolation scheme is introduced to predict the identified ODE coefficients using local information. Compared to other latent dynamics identification methods based on autoencoders, P-TLDINets retain the interpretability of the model. Additionally, it circumvents the building of explicit autoencoders, avoids dependency on specific grids, and features a more lightweight structure, which is easy to train with high generalisation capability and accuracy. Also, it is capable of using different scales of meshes. P-TLDINets improve training speeds nearly a hundred times compared to GPLaSDI and gLaSDI, maintaining an $L_2$ error below 2% compared to high-fidelity models.
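The KNN-with-IDW step described in the abstract can be sketched generically. The code below is not taken from the paper: it shows plain inverse distance weighting over the $k$ nearest training parameters, where each neighbour's coefficient vector is weighted by $1/d^p$. The function name `knn_idw`, the power $p = 2$, and the toy data are assumptions made for illustration.

```python
import math

def knn_idw(train_params, train_coeffs, query, k=3, power=2.0, eps=1e-12):
    """Interpolate a coefficient vector at `query` from its k nearest neighbours.

    train_params: list of parameter points (tuples of floats)
    train_coeffs: list of coefficient vectors, one per parameter point
    Weights are 1 / distance**power (inverse distance weighting).
    """
    # Sort training points by Euclidean distance to the query.
    nearest = sorted(
        (math.dist(p, query), i) for i, p in enumerate(train_params)
    )[:k]
    # An exact hit returns the stored coefficients and avoids division by zero.
    if nearest[0][0] < eps:
        return list(train_coeffs[nearest[0][1]])
    weights = [1.0 / d ** power for d, _ in nearest]
    total = sum(weights)
    dim = len(train_coeffs[0])
    return [
        sum(w * train_coeffs[i][j] for w, (_, i) in zip(weights, nearest)) / total
        for j in range(dim)
    ]

# Toy usage: one-dimensional parameter, one identified coefficient per point.
params = [(0.0,), (1.0,), (10.0,)]
coeffs = [[0.0], [2.0], [100.0]]
print(knn_idw(params, coeffs, (0.5,), k=2))  # midway between the two nearest points
```

Using only the nearest neighbours keeps the prediction local, so a distant training parameter (here the point at 10.0) does not influence the interpolated coefficients.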

[LG-171] Unsupervised Assessment of Landscape Shifts Based on Persistent Entropy and Topological Preservation KDD’2024

Link: https://arxiv.org/abs/2410.04183
Authors: Sebastian Basterrech
Keywords: drift typically refers, Concept drift typically, Concept drift, typically refers, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: KDD’2024. Workshop on Drift Detection and Landscape Shifts

Abstract:Concept drift typically refers to the analysis of changes in data distribution. A drift in the input data can have negative consequences on a learning predictor and the system’s stability. The majority of concept drift methods emphasize the analysis of statistical changes in non-stationary data over time. In this context, we consider another perspective, where the concept drift also integrates substantial changes in the topological characteristics of the data stream. In this article, we introduce a novel framework for monitoring changes in multi-dimensional data streams. We explore a generalization of the standard concept drift focusing on the changes in the topological characteristics of the data. Our developed approach is based on persistent entropy and topology-preserving projections in a continual learning scenario. The framework operates in both unsupervised and supervised environments. To demonstrate the utility of the proposed framework, we analyze the model across three scenarios using data streams generated with MNIST samples. The obtained results reveal the potential of applying topological data analysis for shift detection and encourage further research in this area.
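For readers unfamiliar with persistent entropy, the quantity is the Shannon entropy of the normalised bar lengths of a persistence barcode: with lengths $\ell_i = d_i - b_i$ and $L = \sum_i \ell_i$, it equals $-\sum_i (\ell_i/L)\log(\ell_i/L)$. The sketch below computes it from a list of finite (birth, death) pairs; how the paper obtains the barcode and uses the value for drift detection is not shown, and the function name and toy barcodes are assumptions.

```python
import math

def persistent_entropy(bars):
    """Shannon entropy of normalised bar lengths in a persistence barcode.

    bars: iterable of (birth, death) pairs with finite death values.
    Bars of zero or negative length are ignored.
    """
    lengths = [death - birth for birth, death in bars if death > birth]
    total = sum(lengths)
    if total == 0:
        return 0.0  # empty barcode: define the entropy as zero
    return -sum((l / total) * math.log(l / total) for l in lengths)

# Toy usage: a barcode with equal bars versus one dominated by a single bar.
uniform_bars = [(0.0, 1.0), (0.5, 1.5), (2.0, 3.0), (1.0, 2.0)]
skewed_bars = [(0.0, 10.0), (0.0, 0.1), (0.2, 0.3), (0.4, 0.5)]
print(persistent_entropy(uniform_bars), persistent_entropy(skewed_bars))
```

The value is largest when all bars have the same length and drops as a few long-lived features dominate, so tracking it over successive windows of a stream gives a scalar summary of topological change.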

[LG-172] Beyond Language: Applying MLX Transformers to Engineering Physics

Link: https://arxiv.org/abs/2410.04167
Authors: Stavros Kassinos,Alessio Alexiadis
Keywords: Transformer Neural Networks, Large Language Models, Neural Networks, Large Language, Transformer Neural
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 63 pages, 31 figures, research paper, code shared under an MIT license on GitHub

Abstract:Transformer Neural Networks are driving an explosion of activity and discovery in the field of Large Language Models (LLMs). In contrast, there have been only a few attempts to apply Transformers in engineering physics. Aiming to offer an easy entry point to physics-centric Transformers, we introduce a physics-informed Transformer model for solving the heat conduction problem in a 2D plate with Dirichlet boundary conditions. The model is implemented in the machine learning framework MLX and leverages the unified memory of Apple M-series processors. The use of MLX means that the models can be trained and perform predictions efficiently on personal machines with only modest memory requirements. To train, validate and test the Transformer model we solve the 2D heat conduction problem using central finite differences. Each finite difference solution in these sets is initialized with four random Dirichlet boundary conditions, a uniform but random internal temperature distribution and a randomly selected thermal diffusivity. Validation is performed in-line during training to monitor against over-fitting. The excellent performance of the trained model is demonstrated by predicting the evolution of the temperature field to steady state for the unseen test set of conditions.
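
The training data described above comes from a central finite difference solution of 2D heat conduction with fixed (Dirichlet) edge temperatures. The following is a minimal sketch of such a solver. The explicit forward-Euler time stepping, the grid size, and the names `make_plate` and `step_heat_2d` are assumptions made for illustration; the abstract does not state which time integration the authors use.

```python
def make_plate(n, top, bottom, left, right, interior):
    """Square n x n plate with fixed edge temperatures and a uniform interior."""
    u = [[interior] * n for _ in range(n)]
    for k in range(n):
        u[k][0], u[k][n - 1] = left, right
    for k in range(n):
        u[0][k], u[n - 1][k] = top, bottom
    return u


def step_heat_2d(u, alpha, dx, dt):
    """One explicit step of u_t = alpha * (u_xx + u_yy) with central differences.

    Edge values are left untouched, which enforces the Dirichlet conditions.
    """
    n, m = len(u), len(u[0])
    r = alpha * dt / dx ** 2  # this explicit scheme is stable for r <= 0.25
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = u[i][j] + r * (
                u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1] - 4 * u[i][j]
            )
    return new


# All four edges held at 1.0, interior starts at 0.0 and relaxes to steady state.
plate = make_plate(8, 1.0, 1.0, 1.0, 1.0, 0.0)
for _ in range(500):
    plate = step_heat_2d(plate, alpha=1.0, dx=1.0, dt=0.2)
```

With identical edge temperatures the steady state is uniform, which makes this configuration a convenient sanity check before randomizing the boundary values and diffusivity.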

[LG-173] Preference Optimization as Probabilistic Inference

Link: https://arxiv.org/abs/2410.04166
Authors: Abbas Abdolmaleki,Bilal Piot,Bobak Shahriari,Jost Tobias Springenberg,Tim Hertweck,Rishabh Joshi,Junhyuk Oh,Michael Bloesch,Thomas Lampe,Nicolas Heess,Jonas Buchli,Martin Riedmiller
Keywords: Existing preference optimization, Existing preference, assumption that paired, human feedback, Existing
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Existing preference optimization methods are mainly designed for directly learning from human feedback with the assumption that paired examples (preferred vs. dis-preferred) are available. In contrast, we propose a method that can leverage unpaired preferred or dis-preferred examples, and works even when only one type of feedback (positive or negative) is available. This flexibility allows us to apply it in scenarios with varying forms of feedback and models, including training generative language models based on human feedback as well as training policies for sequential decision-making problems, where learned (value) functions are available. Our approach builds upon the probabilistic framework introduced in (Dayan and Hinton, 1997), which proposes to use expectation-maximization (EM) to directly optimize the probability of preferred outcomes (as opposed to classic expected reward maximization). To obtain a practical algorithm, we identify and address a key limitation in current EM-based methods: when applied to preference optimization, they solely maximize the likelihood of preferred examples, while neglecting dis-preferred samples. We show how one can extend EM algorithms to explicitly incorporate dis-preferred outcomes, leading to a novel, theoretically grounded, preference optimization algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback.

[LG-174] Applying Quantum Autoencoders for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2410.04154
Authors: Robin Frehner,Kurt Stockinger
Keywords: Anomaly detection, pattern recognition, medical diagnosis, quantum, recognition or medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments: 22 pages, 16 figures

Abstract:Anomaly detection is an important problem with applications in various domains such as fraud detection, pattern recognition or medical diagnosis. Several algorithms have been introduced using classical computing approaches. However, using quantum computing for solving anomaly detection problems in time series data is a largely unexplored research field. This paper explores the application of quantum autoencoders to time series anomaly detection. We investigate two primary techniques for classifying anomalies: (1) analyzing the reconstruction error generated by the quantum autoencoder and (2) latent representation analysis. Our simulated experimental results, conducted across various ansaetze, demonstrate that quantum autoencoders consistently outperform classical deep learning-based autoencoders across multiple datasets. Specifically, quantum autoencoders achieve superior anomaly detection performance while utilizing 60-230 times fewer parameters and requiring five times fewer training iterations. In addition, we implement our quantum encoder on real quantum hardware. Our experimental results demonstrate that quantum autoencoders achieve anomaly detection performance on par with their simulated counterparts.

[LG-175] ConDa: Fast Federated Unlearning with Contribution Dampening

Link: https://arxiv.org/abs/2410.04144
Authors: Vikram S Chundawat,Pushkar Niroula,Prasanna Dhungana,Stefan Schoepf,Murari Mandal,Alexandra Brintrup
Keywords: enabled collaborative model, collaborative model training, decentralized data sources, federated unlearning, global model
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Federated learning (FL) has enabled collaborative model training across decentralized data sources or clients. While adding new participants to a shared model does not pose great technical hurdles, the removal of a participant and their related information contained in the shared model remains a challenge. To address this problem, federated unlearning has emerged as a critical research direction, seeking to remove information from globally trained models without harming the model performance on the remaining data. Most modern federated unlearning methods rely on costly approaches, such as retraining the global model on the remaining clients' data, or on methods that require heavy computation on the client or server side. We introduce Contribution Dampening (ConDa), a framework that performs efficient unlearning by tracking down the parameters which affect the global model for each client and performing synaptic dampening on the parameters of the global model that have privacy-infringing contributions from the forgetting client. Our technique does not require clients' data or any kind of retraining and it does not put any computational overhead on either the client or server side. We perform experiments on multiple datasets and demonstrate that ConDa is effective at forgetting a client’s data. In experiments conducted on the MNIST, CIFAR10, and CIFAR100 datasets, ConDa proves to be the fastest federated unlearning method, outperforming the nearest state-of-the-art approach by at least 100x. Our emphasis is on the non-IID Federated Learning setting, which presents the greatest challenge for unlearning. Additionally, we validate ConDa’s robustness through backdoor and membership inference attacks. We envision this work as a crucial component for FL in adhering to legal and ethical requirements.

[LG-176] From Hospital to Portables: A Universal ECG Foundation Model Built on 10 Million Diverse Recordings

Link: https://arxiv.org/abs/2410.04133
Authors: Jun Li,Aaron Aguirre,Junior Moura,Che Liu,Lanhai Zhong,Chenxi Sun,Gari Clifford,Brandon Westover,Shenda Hong
Keywords: Artificial Intelligence, shown great promise, ECG, promise in electrocardiogram, shown great
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: work in progress

Abstract:Artificial Intelligence (AI) has shown great promise in electrocardiogram (ECG) analysis and cardiovascular disease detection. However, developing a general AI-ECG model has been challenging due to inter-individual variability and the diversity of ECG diagnoses, limiting existing models to specific diagnostic tasks and datasets. Moreover, current AI-ECG models struggle to achieve comparable performance between single-lead and 12-lead ECGs, limiting the application of AI-ECG to portable and wearable ECG devices. To address these limitations, we introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder is trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis through ECG analysis. The model is designed to be both effective out-of-the-box and fine-tunable for downstream tasks, maximizing usability. More importantly, we extend its application to single-lead ECGs, enabling complex condition diagnoses and supporting various downstream tasks in mobile and remote monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets for both 12-lead and single-lead ECGs, while also exhibiting strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographics detection, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through this http URL. Our code is available at this https URL.

[LG-177] Rethinking Fair Representation Learning for Performance-Sensitive Tasks

Link: https://arxiv.org/abs/2410.04120
Authors: Charles Jones,Fabio de Sousa Ribeiro,Mélanie Roschewitz,Daniel C. Castro,Ben Glocker
Keywords: fair representation learning, fair representation, representation learning methods, representation learning, investigate the prominent
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)

Abstract:We investigate the prominent class of fair representation learning methods for bias mitigation. Using causal reasoning to define and formalise different sources of dataset bias, we reveal important implicit assumptions inherent to these methods. We prove fundamental limitations on fair representation learning when evaluation data is drawn from the same distribution as training data and run experiments across a range of medical modalities to examine the performance of fair representation learning under distribution shifts. Our results explain apparent contradictions in the existing literature and reveal how rarely considered causal and statistical aspects of the underlying data affect the validity of fair representation learning. We raise doubts about current evaluation practices and the applicability of fair representation learning methods in performance-sensitive settings. We argue that fine-grained analysis of dataset biases should play a key role in the field moving forward.

[LG-178] Riemann Sum Optimization for Accurate Integrated Gradients Computation

Link: https://arxiv.org/abs/2410.04118
Authors: Swadesh Swain,Shree Singhi
Keywords: Integrated Gradients, deep neural network, inaccurate Riemann Sum, input features, Riemann Sum approximations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:Integrated Gradients (IG) is a widely used algorithm for attributing the outputs of a deep neural network to its input features. Due to the absence of closed-form integrals for deep learning models, inaccurate Riemann Sum approximations are used to calculate IG. This often introduces undesirable errors in the form of high levels of noise, leading to false insights into the model’s decision-making process. We introduce a framework, RiemannOpt, that minimizes these errors by optimizing the sample point selection for the Riemann Sum. Our algorithm is highly versatile and applicable to IG as well as its derivatives like Blur IG and Guided IG. RiemannOpt achieves up to 20% improvement in Insertion Scores. Additionally, it enables its users to curtail computational costs by up to fourfold, thereby making it highly functional for constrained environments.
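
To make the role of sample point selection concrete, here is a minimal sketch of IG computed with two fixed Riemann rules on a toy function whose gradient is known in closed form. The toy function `f`, the helper names, and the left-endpoint versus midpoint comparison are illustrative assumptions. This is not RiemannOpt, which optimizes the sample points rather than fixing them in advance.

```python
def f(x):
    """Toy model: f(x0, x1) = x0^2 * x1."""
    return x[0] ** 2 * x[1]


def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2 * x[0] * x[1], x[0] ** 2]


def integrated_gradients(x, baseline, alphas):
    """Riemann-sum IG with equal weights on the interpolation points `alphas`."""
    n = len(alphas)
    avg_grad = [0.0] * len(x)
    for a in alphas:
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            avg_grad[i] += g / n
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def left_points(n):
    return [k / n for k in range(n)]


def midpoints(n):
    return [(k + 0.5) / n for k in range(n)]


x, baseline = [1.0, 2.0], [0.0, 0.0]
target = f(x) - f(baseline)  # completeness: attributions should sum to this

gap_left = abs(sum(integrated_gradients(x, baseline, left_points(5))) - target)
gap_mid = abs(sum(integrated_gradients(x, baseline, midpoints(5))) - target)
```

With the same five gradient evaluations, the midpoint rule leaves a much smaller completeness gap than the left-endpoint rule on this toy function, which illustrates why the choice of sample points matters.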

[LG-179] On the Sample Complexity of a Policy Gradient Algorithm with Occupancy Approximation for General Utility Reinforcement Learning

Link: https://arxiv.org/abs/2410.04108
Authors: Anas Barakat,Souradip Chakraborty,Peihong Yu,Pratap Tokekar,Amrit Singh Bedi
Keywords: including imitation learning, recently gained attention, Reinforcement learning, pure exploration, imitation learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures

Abstract:Reinforcement learning with general utilities has recently gained attention thanks to its ability to unify several problems, including imitation learning, pure exploration, and safe RL. However, prior work for solving this general problem in a unified way has mainly focused on the tabular setting. This is restrictive when considering larger state-action spaces because of the need to estimate occupancy measures during policy optimization. In this work, we address this issue and propose to approximate occupancy measures within a function approximation class using maximum likelihood estimation (MLE). We propose a simple policy gradient algorithm (PG-OMA) where an actor updates the policy parameters to maximize the general utility objective whereas a critic approximates the occupancy measure using MLE. We provide a sample complexity analysis of PG-OMA showing that our occupancy measure estimation error only scales with the dimension of our function approximation class rather than the size of the state action space. Under suitable assumptions, we establish first order stationarity and global optimality performance bounds for the proposed PG-OMA algorithm for nonconcave and concave general utilities respectively. We complement our methodological and theoretical findings with promising empirical results showing the scalability potential of our approach compared to existing tabular count-based approaches.

[LG-180] Sinc Kolmogorov-Arnold Network and Its Applications on Physics-informed Neural Networks

Link: https://arxiv.org/abs/2410.04096
Authors: Tianchi Yu,Jingwei Qiu,Jiang Yang,Ivan Oseledets
Keywords: recently gained attention, Sinc interpolation proposes, Sinc interpolation, learnable activation functions, multilayer perceptron
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)

Abstract:In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation functions, which recently gained attention as alternatives to multilayer perceptrons. Many different function representations have already been tried, but we show that Sinc interpolation offers a viable alternative, since it is known in numerical analysis to represent well both smooth functions and functions with singularities. This is important not only for function approximation but also for the solutions of partial differential equations with physics-informed neural networks. Through a series of experiments, we show that SincKANs provide better results in almost all of the examples we have considered.
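
For readers unfamiliar with Sinc interpolation, the sketch below shows the classical form on a uniform grid, f(x) ≈ Σ_k f(kh) · sinc((x − kh)/h), applied to a Gaussian. The step size, the truncation range, and the helper names are assumptions chosen for illustration. How SincKAN turns the sample values into learnable coefficients inside a network is not described in the abstract and is not shown here.

```python
import math


def sinc(t):
    """Normalized sinc: sin(pi * t) / (pi * t), with sinc(0) = 1."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)


def sinc_interpolate(samples, h, k_min, x):
    """Evaluate sum_k f(k*h) * sinc((x - k*h) / h); samples[i] = f((k_min + i) * h)."""
    return sum(
        fk * sinc((x - (k_min + i) * h) / h) for i, fk in enumerate(samples)
    )


# Sample a Gaussian on a uniform grid with step h (truncated to |k| <= 20).
h, k_min, k_max = 0.5, -20, 20
samples = [math.exp(-((k * h) ** 2)) for k in range(k_min, k_max + 1)]

approx = sinc_interpolate(samples, h, k_min, 0.3)  # off-grid evaluation point
exact = math.exp(-(0.3 ** 2))
```

At grid nodes every sinc term except one vanishes, so the interpolant reproduces the samples; between nodes the accuracy depends on how quickly the function's spectrum decays relative to the step size.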

[LG-181] Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Link: https://arxiv.org/abs/2410.04091
Authors: Allahdadi Fatemeh,Mahdian Toroghi Rahil,Zareian Hassan
Keywords: transcribed data scarcity, typically constrained, constrained by transcribed, transcribed data, data scarcity
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Query-by-example spoken term detection (QbE-STD) is typically constrained by transcribed data scarcity and language specificity. This paper introduces a novel, language-agnostic QbE-STD model leveraging image processing techniques and transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, our model effectively searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. While processing time is improved compared to DTW, accuracy remains inferior. Notably, our model offers the advantage of accurately counting query term repetitions within the target audio.

[LG-182] Taming the Tail: Leveraging Asymmetric Loss and Pade Approximation to Overcome Medical Image Long-Tailed Class Imbalance BMVC24

Link: https://arxiv.org/abs/2410.04084
Authors: Pankhi Kashyap,Pavni Tandon,Sunny Gupta,Abhishek Tiwari,Ritwik Kulkarni,Kshitij Sharad Jadhav
Keywords: dependable classification methods, data imbalance due, warranting the requirement, problems in healthcare, healthcare emerge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure. Accepted at the 35th British Machine Vision Conference (BMVC24)

Abstract:Long-tailed problems in healthcare emerge from data imbalance due to variability in the prevalence and representation of different medical conditions, which calls for precise and dependable classification methods. Traditional loss functions such as cross-entropy and binary cross-entropy are often inadequate due to their inability to address the imbalances between the classes with high representation and the classes with low representation found in medical image datasets. We introduce a novel polynomial loss function based on Pade approximation, designed specifically to overcome the challenges associated with long-tailed classification. This approach incorporates asymmetric sampling techniques to better classify under-represented classes. We conducted extensive evaluations on three publicly available medical datasets and a proprietary medical dataset. Our implementation of the proposed loss function is open-sourced in the public repository: this https URL.

[LG-183] High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions

Link: https://arxiv.org/abs/2410.04080
Authors: Ruiyuan Huang,Zengfeng Huang
Keywords: Motivated by applications, current round context, sleeping bandits, contextual bandits, cross learning
Categories: Machine Learning (cs.LG)

Abstract:Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider and Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider and Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.

[LG-184] On Eliciting Syntax from Language Models via Hashing EMNLP-2024

Link: https://arxiv.org/abs/2410.04074
Authors: Yiran Wang,Masao Utiyama
Keywords: infer syntactic structure, aims to infer, infer syntactic, syntactic structure, raw text
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP-2024

Abstract:Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of leveraging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this, we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet balanced alignment signals. Our model shows competitive performance on various datasets; we therefore claim that our method is effective and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.

[LG-185] Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback EMNLP2024

Link: https://arxiv.org/abs/2410.04064
Authors: Fatemeh Pesaran Zadeh,Juyeon Kim,Jin-Hwa Kim,Gunhee Kim
Keywords: Large language models, demonstrated strong capabilities, Large language, notably through instruction-tuning, demonstrated strong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Main. Code and dataset are released at this https URL

Abstract:Large language models (LLMs) have demonstrated strong capabilities across various language tasks, notably through instruction-tuning methods. However, LLMs face challenges in visualizing complex, real-world data through charts and plots. Firstly, existing datasets rarely cover a full range of chart types, such as 3D, volumetric, and gridded charts. Secondly, supervised fine-tuning methods do not fully leverage the intricate relationships within rich datasets, including text, code, and figures. To address these challenges, we propose a hierarchical pipeline and a new dataset for chart generation. Our dataset, Text2Chart31, includes 31 unique plot types referring to the Matplotlib library, with 11.1K tuples of descriptions, code, data tables, and plots. Moreover, we introduce a reinforcement learning-based instruction tuning technique for chart generation tasks without requiring human feedback. Our experiments show that this approach significantly enhances the model performance, enabling smaller models to outperform larger open-source models and be comparable to state-of-the-art proprietary models in data visualization tasks. We make the code and dataset available at this https URL.

[LG-186] Enhancing Graph Self-Supervised Learning with Graph Interplay

Link: https://arxiv.org/abs/2410.04061
Authors: Xinjian Zhao,Wei Pang,Xiangru Jian,Yaoyao Xu,Chaolong Ying,Tianshu Yu
Keywords: extracting informative representations, introduce Graph Interplay, Graph self-supervised learning, labeled inputs, compelling framework
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 27 pages, 12 figures

Abstract:Graph self-supervised learning (GSSL) has emerged as a compelling framework for extracting informative representations from graph-structured data without extensive reliance on labeled inputs. In this study, we introduce Graph Interplay (GIP), an innovative and versatile approach that significantly enhances the performance of various existing GSSL methods. To this end, GIP advocates direct graph-level communications by introducing random inter-graph edges within standard batches. Despite GIP’s simplicity, we further show theoretically that GIP essentially performs a principled manifold separation via combining inter-graph message passing and GSSL, bringing about more structured embedding manifolds and thus benefiting a series of downstream tasks. Our empirical study demonstrates that GIP surpasses the performance of prevailing GSSL methods across multiple benchmarks by significant margins, highlighting its potential as a breakthrough approach. Besides, GIP can be readily integrated into a series of GSSL methods and consistently offers additional performance gain. This advancement not only amplifies the capability of GSSL but also potentially sets the stage for a novel graph learning paradigm in a broader sense.
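
The core mechanism named in the abstract, adding random edges between different graphs in a batch, can be sketched as follows. The batch representation (node counts plus an edge list on batch-wide node indices), the fixed number of added edges, and the function name `add_inter_graph_edges` are all hypothetical choices for illustration; the abstract does not say how GIP samples or weights these edges.

```python
import random


def add_inter_graph_edges(graph_sizes, edges, num_new_edges, seed=0):
    """Return `edges` plus `num_new_edges` random edges that cross graph boundaries.

    graph_sizes: number of nodes of each graph in the batch (each >= 1).
    edges: intra-graph edges as (u, v) pairs on batch-wide node indices.
    The caller must not request more edges than there are cross-graph pairs.
    """
    if len(graph_sizes) < 2:
        return list(edges)
    rng = random.Random(seed)
    # owner[i] is the index of the graph that node i belongs to.
    owner = [g for g, size in enumerate(graph_sizes) for _ in range(size)]
    existing = set(edges)
    added = []
    while len(added) < num_new_edges:
        u, v = rng.randrange(len(owner)), rng.randrange(len(owner))
        if owner[u] != owner[v] and (u, v) not in existing:
            existing.add((u, v))
            added.append((u, v))
    return list(edges) + added


# Batch of two graphs: nodes 0-2 form a path, nodes 3-4 form a single edge.
sizes = [3, 2]
intra = [(0, 1), (1, 2), (3, 4)]
batched = add_inter_graph_edges(sizes, intra, num_new_edges=3)
cross = batched[len(intra):]
```

Any message-passing encoder run on `batched` would then exchange information across graphs through the added edges, which is the inter-graph communication the abstract refers to.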

[LG-187] Beyond Forecasting: Compositional Time Series Reasoning for End-to-End Task Execution

Link: https://arxiv.org/abs/2410.04047
Authors: Wen Ye,Yizhou Zhang,Wei Yang,Lumingyuan Tang,Defu Cao,Jie Cai,Yan Liu
Keywords: time series, time series data, time series forecasting, Time Series Reasoning, time series models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent decades, there have been substantial advances in time series models and benchmarks across various individual tasks, such as time series forecasting, classification, and anomaly detection. Meanwhile, compositional reasoning in time series, prevalent in real-world applications (e.g., decision-making and compositional question answering), is in great demand. Unlike simple tasks that primarily focus on predictive accuracy, compositional reasoning emphasizes the synthesis of diverse information from both time series data and various domain knowledge, making it distinct and considerably more challenging. In this paper, we introduce Compositional Time Series Reasoning, a new task of handling intricate multistep reasoning tasks from time series data. Specifically, this new task focuses on various question instances requiring structural and compositional reasoning abilities on time series data, such as decision-making and compositional question answering. As an initial attempt to tackle this novel task, we developed TS-Reasoner, a program-aided approach that utilizes a large language model (LLM) to decompose a complex task into steps of programs that leverage existing time series models and numerical subroutines. Unlike existing reasoning work which only calls off-the-shelf modules, TS-Reasoner allows for the creation of custom modules and provides greater flexibility to incorporate domain knowledge as well as user-specified constraints. We demonstrate the effectiveness of our method through a comprehensive set of experiments. These promising results indicate potential opportunities in the new task of time series reasoning and highlight the need for further research.

[LG-188] Efficient Large-Scale Urban Parking Prediction: Graph Coarsening Based on Real-Time Parking Service Capability

Link: https://arxiv.org/abs/2410.04022
Authors: Yixuan Wang,Zhenwu Chen,Kangshuai Zhang,Yunduan Cui,Lei Peng
Keywords: large-scale urban parking, parking, urban parking, predicting large-scale urban, number of vehicles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the sharp increase in the number of vehicles, the issue of parking difficulties has emerged as an urgent challenge that many cities need to address promptly. In the task of predicting large-scale urban parking data, existing research often lacks effective deep learning models and strategies. To tackle this challenge, this paper proposes an innovative framework for predicting large-scale urban parking graphs leveraging real-time service capabilities, aimed at improving the accuracy and efficiency of parking predictions. Specifically, we introduce a graph attention mechanism that assesses the real-time service capabilities of parking lots to construct a dynamic parking graph that accurately reflects real preferences in parking behavior. To effectively handle large-scale parking data, this study combines graph coarsening techniques with temporal convolutional autoencoders to achieve unified dimension reduction of the complex urban parking graph structure and features. Subsequently, we use a spatio-temporal graph convolutional model to make predictions based on the coarsened graph, and a pre-trained autoencoder-decoder module restores the predicted results to their original data dimensions, completing the task. Our methodology has been rigorously tested on a real dataset from parking lots in Shenzhen. The experimental results indicate that compared to traditional parking prediction models, our framework achieves improvements of 46.8% and 30.5% in accuracy and efficiency, respectively. Remarkably, with the expansion of the graph’s scale, our framework’s advantages become even more apparent, showcasing its substantial potential for solving complex urban parking dilemmas in practical scenarios.

[LG-189] Improving Temporal Link Prediction via Temporal Walk Matrix Projection NEURIPS2024

Link: https://arxiv.org/abs/2410.04013
Authors: Xiaodong Lu,Leilei Sun,Tongyu Zhu,Weifeng Lv
Keywords: predicting future interactions, Temporal link prediction, link prediction, relative encodings, temporal walk matrices
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2024 Paper

Abstract:Temporal link prediction, aiming at predicting future interactions among entities based on historical interactions, is crucial for a series of real-world applications. Although previous methods have demonstrated the importance of relative encodings for effective temporal link prediction, computational efficiency remains a major concern in constructing these encodings. Moreover, existing relative encodings are usually constructed based on structural connectivity, where temporal information is seldom considered. To address the aforementioned issues, we first analyze existing relative encodings and unify them as a function of temporal walk matrices. This unification establishes a connection between relative encodings and temporal walk matrices, providing a more principled way for analyzing and designing relative encodings. Based on this analysis, we propose a new temporal graph neural network called TPNet, which introduces a temporal walk matrix that incorporates the time decay effect to simultaneously consider both temporal and structural information. Moreover, TPNet designs a random feature propagation mechanism with theoretical guarantees to implicitly maintain the temporal walk matrices, which improves the computation and storage efficiency. Experimental results on 13 benchmark datasets verify the effectiveness and efficiency of TPNet, where TPNet outperforms other baselines on most datasets and achieves a maximum speedup of 33.3× compared to the SOTA baseline. Our code can be found at this https URL.

[LG-190] Hyperbolic Fine-tuning for Large Language Models ICML2024

Link: https://arxiv.org/abs/2410.04010
Authors: Menglin Yang,Aosong Feng,Bo Xiong,Jihong Liu,Irwin King,Rex Ying
Keywords: Large language models, Large language, demonstrated remarkable performance, language models, demonstrated remarkable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: The preliminary work was accepted for the ICML 2024 LLM Cognition Workshop, and this version includes new investigations, analyses, experiments, and results

Abstract:Large language models (LLMs) have demonstrated remarkable performance on various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for embedding tokens in LLMs. In this study, we first investigate the non-Euclidean characteristics of LLMs. Our findings reveal that token frequency follows a power-law distribution, with high-frequency tokens clustering near the origin and low-frequency tokens positioned farther away. Additionally, token embeddings exhibit a high degree of hyperbolicity, indicating a latent tree-like structure in the embedding space. Building on this observation, we propose to efficiently fine-tune LLMs in hyperbolic space to better exploit the underlying complex structures. However, we found that this fine-tuning in hyperbolic space cannot be achieved with naive application of exponential and logarithmic maps when the embedding and weight matrices both reside in Euclidean space. To address this technical issue, we introduce a new method called hyperbolic low-rank efficient fine-tuning, HypLoRA, that performs low-rank adaptation directly on the hyperbolic manifold, avoiding the cancellation effect caused by the exponential and logarithmic maps, thus preserving the hyperbolic modeling capabilities. Through extensive experiments, we demonstrate that HypLoRA significantly enhances the performance of LLMs on reasoning tasks, particularly for complex reasoning problems. In particular, HypLoRA improves the performance on the complex AQuA dataset by up to 13.0%, showcasing its effectiveness in handling complex reasoning challenges.
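
One way to read the "cancellation effect" mentioned above: if a Euclidean vector is lifted to hyperbolic space with the exponential map and then pulled back with the logarithmic map before a Euclidean weight matrix is applied, the two maps undo each other and the layer reduces to an ordinary linear layer. The sketch below illustrates this on the Poincaré ball with maps taken at the origin. The choice of the Poincaré ball model, the curvature parameter `c`, and all function names are assumptions for illustration; the abstract does not state which hyperbolic model HypLoRA uses, and this is not the HypLoRA method itself.

```python
import math


def exp_map_origin(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def log_map_origin(y, c=1.0):
    """Logarithmic map at the origin, the inverse of exp_map_origin."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]


def linear(weight, x):
    """Plain Euclidean matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


x = [0.3, -0.4]
weight = [[1.0, 2.0], [0.5, -1.0]]  # hypothetical Euclidean weight matrix

# Naive pipeline: lift to the ball, pull back, then apply the Euclidean weights.
naive = linear(weight, log_map_origin(exp_map_origin(x)))
euclidean = linear(weight, x)
```

Because `naive` and `euclidean` coincide up to rounding, the round trip through the manifold adds nothing in this setup, which is consistent with the abstract's motivation for adapting directly on the hyperbolic manifold instead.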

[LG-191] FastLRNR and Sparse Physics Informed Backpropagation

Link: https://arxiv.org/abs/2410.04001
Authors: Woojin Cho,Kookjin Lee,Noseong Park,Donsub Rim,Gerrit Welper
Keywords (EN): Rank Neural Representation, introduce Sparse Physics, called Low Rank, architecture called Low, Sparse Physics Informed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*Notes: 10 pages, 3 figures

Abstract:We introduce Sparse Physics Informed Backpropagation (SPInProp), a new class of methods for accelerating backpropagation for a specialized neural network architecture called Low Rank Neural Representation (LRNR). The approach exploits the low rank structure within LRNR and constructs a reduced neural network approximation that is much smaller in size. We call the smaller network FastLRNR. We show that backpropagation of FastLRNR can be substituted for that of LRNR, enabling a significant reduction in complexity. We apply SPInProp to a physics informed neural networks framework and demonstrate how the solution of parametrized partial differential equations is accelerated.

[LG-192] Symmetry From Scratch: Group Equivariance as a Supervised Learning Task

Link: https://arxiv.org/abs/2410.03989
Authors: Haozhe Huang,Leo Kaixuan Cheng,Kaiwen Chen,Alán Aspuru-Guzik
Keywords (EN): engineering extra weights, equivariant architectural constraints, relax equivariant architectural, architectural constraints, engineering extra
Categories: Machine Learning (cs.LG)
*Notes:

Abstract:In machine learning datasets with symmetries, the paradigm for backward compatibility with symmetry-breaking has been to relax equivariant architectural constraints, engineering extra weights to differentiate symmetries of interest. However, this process becomes increasingly over-engineered as models are geared towards specific symmetries/asymmetries hardwired of a particular set of equivariant basis functions. In this work, we introduce symmetry-cloning, a method for inducing equivariance in machine learning models. We show that general machine learning architectures (i.e., MLPs) can learn symmetries directly as a supervised learning task from group equivariant architectures and retain/break the learned symmetry for downstream tasks. This simple formulation enables machine learning models with group-agnostic architectures to capture the inductive bias of group-equivariant architectures.

[LG-193] Survey on Code Generation for Low resource and Domain Specific Programming Languages

Link: https://arxiv.org/abs/2410.03981
Authors: Sathvik Joel,Jie JW Wu,Fatemeh H. Fard
Keywords (EN): Large Language Models, popular programming languages, shown impressive capabilities, Large Language, Language Models
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Notes:

Abstract:Large Language Models (LLMs) have shown impressive capabilities in code generation for popular programming languages. However, their performance on Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a significant challenge, affecting millions of developers (3.5 million users in Rust alone) who cannot fully utilize LLM capabilities. LRPLs and DSLs encounter unique obstacles, including data scarcity and, for DSLs, specialized syntax that is poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs enhance development efficiency in specialized domains, such as finance and science. While several surveys discuss LLMs in software engineering, none focus specifically on the challenges and opportunities associated with LRPLs and DSLs. Our survey fills this gap by systematically reviewing the current state, methodologies, and challenges in leveraging LLMs for code generation in these languages. We filtered 111 papers from over 27,000 published studies between 2020 and 2024 to evaluate the capabilities and limitations of LLMs in LRPLs and DSLs. We report the LLMs used, benchmarks, and metrics for evaluation, strategies for enhancing performance, and methods for dataset collection and curation. We identified four main evaluation techniques and several metrics for assessing code generation in LRPLs and DSLs. Our analysis categorizes improvement methods into six groups and summarizes novel architectures proposed by researchers. Despite various techniques and metrics, a standard approach and benchmark dataset for evaluating code generation in LRPLs and DSLs are lacking. This survey serves as a resource for researchers and practitioners at the intersection of LLMs, software engineering, and specialized programming languages, laying the groundwork for future advancements in code generation for LRPLs and DSLs.

[LG-194] Optimizing Sparse Generalized Singular Vectors for Feature Selection in Proximal Support Vector Machines with Application to Breast and Ovarian Cancer Detection

Link: https://arxiv.org/abs/2410.03978
Authors: Ugochukwu O. Ugwu,Michael Kirby
Keywords (EN): Generalized Singular, paper presents approaches, ell, compute sparse solutions, GSVP
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*Notes:

Abstract:This paper presents approaches to compute sparse solutions of the Generalized Singular Value Problem (GSVP). The GSVP is regularized by the $\ell_1$-norm and the $\ell_q$-penalty for $0 < q < 1$, resulting in the $\ell_1$-GSVP and $\ell_q$-GSVP formulations. The solutions of these problems are determined by applying the proximal gradient descent algorithm with a fixed step size. The inherent sparsity levels within the computed solutions are exploited for feature selection, and subsequently, binary classification with non-parallel Support Vector Machines (SVM). For our feature selection task, SVM is integrated into the $\ell_1$-GSVP and $\ell_q$-GSVP frameworks to derive the $\ell_1$-GSVPSVM and $\ell_q$-GSVPSVM variants. Machine learning applications to cancer detection are considered. We remarkably report near-to-perfect balanced accuracy across breast and ovarian cancer datasets using a few selected features.
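
As a rough illustration of proximal gradient descent with a fixed step size and an $\ell_1$ penalty, here is a minimal sketch. It uses a toy quadratic smooth term as a stand-in and is not the GSVP objective from the paper; the function names and numbers are hypothetical. Each iteration takes a gradient step on the smooth term and then applies soft-thresholding, the proximal operator of the $\ell_1$ norm, which is what drives small coefficients to exactly zero.

```python
import math

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1: shrink each entry toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]

def proximal_gradient(grad, x0, step, lam, iters=200):
    """Minimize smooth(x) + lam * ||x||_1 with a fixed step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        moved = [xi - step * gi for xi, gi in zip(x, g)]
        x = soft_threshold(moved, step * lam)
    return x

# Toy smooth term (an assumption for illustration): 0.5 * ||x - b||^2,
# whose gradient is simply x - b.
b = [3.0, 0.2, -1.5]

def toy_gradient(x):
    return [xi - bi for xi, bi in zip(x, b)]

solution = proximal_gradient(toy_gradient, [0.0, 0.0, 0.0], step=0.5, lam=0.5)
```

For this toy problem the minimizer is the soft-thresholded vector `b`, so the middle coefficient, whose magnitude is below `lam`, ends up exactly zero. That zeroing is the sparsity the abstract exploits for feature selection.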

[LG-195] Learning to Balance: Diverse Normalization for Cloth-Changing Person Re-Identification

Link: https://arxiv.org/abs/2410.03977
Authors: Hongjun Wang,Jiyuan Chen,Zhengwei Yin,Xuan Song,Yinqiang Zheng
Keywords (EN): Cloth-Changing Person Re-Identification, involves recognizing individuals, Cloth-Changing Person, Person Re-Identification, involves recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Notes:

Abstract:Cloth-Changing Person Re-Identification (CC-ReID) involves recognizing individuals in images regardless of clothing status. In this paper, we empirically and experimentally demonstrate that completely eliminating or fully retaining clothing features is detrimental to the task. Existing work, whether relying on clothing labels, silhouettes, or other auxiliary data, fundamentally aims to balance the learning of clothing and identity features. However, we practically find that achieving this balance is challenging and nuanced. In this study, we introduce a novel module called Diverse Norm, which expands personal features into orthogonal spaces and employs channel attention to separate clothing and identity features. A sample re-weighting optimization strategy is also introduced to guarantee the opposite optimization direction. Diverse Norm presents a simple yet effective approach that does not require additional data. Furthermore, Diverse Norm can be seamlessly integrated into ResNet50 and significantly outperforms the state-of-the-art methods.

[LG-196] Efficient Training of Neural Stochastic Differential Equations by Matching Finite Dimensional Distributions

Link: https://arxiv.org/abs/2410.03973
Authors: Jianxin Zhang,Josh Viktorov,Doosan Jung,Emily Pitler
Keywords (EN): Stochastic Differential Equations, Neural Stochastic Differential, Differential Equations, Stochastic Differential, continuous stochastic processes
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Notes:

Abstract:Neural Stochastic Differential Equations (Neural SDEs) have emerged as powerful mesh-free generative models for continuous stochastic processes, with critical applications in fields such as finance, physics, and biology. Previous state-of-the-art methods have relied on adversarial training, such as GANs, or on minimizing distance measures between processes using signature kernels. However, GANs suffer from issues like instability, mode collapse, and the need for specialized training techniques, while signature kernel-based methods require solving linear PDEs and backpropagating gradients through the solver, whose computational complexity scales quadratically with the discretization steps. In this paper, we identify a novel class of strictly proper scoring rules for comparing continuous Markov processes. This theoretical finding naturally leads to a novel approach called Finite Dimensional Matching (FDM) for training Neural SDEs. Our method leverages the Markov property of SDEs to provide a computationally efficient training objective. This scoring rule allows us to bypass the computational overhead associated with signature kernels and reduces the training complexity from O(D^2) to O(D) per epoch, where D represents the number of discretization steps of the process. We demonstrate that FDM achieves superior performance, consistently outperforming existing methods in terms of both computational efficiency and generative quality.

[LG-197] Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Link: https://arxiv.org/abs/2410.03972
Authors: Ann Huang,Satpreet H. Singh,Kanaka Rajan
Keywords (EN): dynamical processes widely, dynamical processes, processes widely, Task-trained recurrent neural, recurrent neural networks
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*Notes:

Abstract:Task-trained recurrent neural networks (RNNs) are versatile models of dynamical processes widely used in machine learning and neuroscience. While RNNs are easily trained to perform a wide range of tasks, the nature and extent of the degeneracy in the resultant solutions (i.e., the variability across trained RNNs) remain poorly understood. Here, we provide a unified framework for analyzing degeneracy across three levels: behavior, neural dynamics, and weight space. We analyzed RNNs trained on diverse tasks across machine learning and neuroscience domains, including N-bit flip-flop, sine wave generation, delayed discrimination, and path integration. Our key finding is that the variability across RNN solutions, quantified on the basis of neural dynamics and trained weights, depends primarily on network capacity and task characteristics such as complexity. We introduce information-theoretic measures to quantify task complexity and demonstrate that increasing task complexity consistently reduces degeneracy in neural dynamics and generalization behavior while increasing degeneracy in weight space. These relationships hold across diverse tasks and can be used to control the degeneracy of the solution space of task-trained RNNs. Furthermore, we provide several strategies to control solution degeneracy, enabling task-trained RNNs to learn more consistent or diverse solutions as needed. We envision that these insights will lead to more reliable machine learning models and could inspire strategies to better understand and control degeneracy observed in neuroscience experiments.

[LG-198] Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Link: https://arxiv.org/abs/2410.03968
Authors: Sijin Chen,Omar Hagrass,Jason M. Klusowski
Keywords (EN): modern language models, puzzling gap divides, gap divides theory, Decoding strategies play, Decoding Game
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
*Notes: 17 pages

Abstract:Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-k and Nucleus sampling, which employ truncation and normalization of the conditional next-token probabilities, have achieved great empirical success but lack theoretical justifications. In this paper, we propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist, who seeks to produce text credible in the true distribution, and Nature, who distorts the true distribution adversarially. After discussing the decomposability of multi-step generation, we derive the optimal strategy in closed form for one-step Decoding Game. It is shown that the adversarial Nature imposes an implicit regularization on likelihood maximization, and truncation-normalization methods are first-order approximations to the optimal strategy under this regularization. Additionally, by generalizing the objective and parameters of Decoding Game, near-optimal strategies encompass diverse methods such as greedy search, temperature scaling, and hybrids thereof. Numerical experiments are conducted to complement our theoretical analysis.
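
The "truncation and normalization" step shared by Top-k and Nucleus sampling can be sketched as follows. This shows only the standard heuristics, not the paper's game-theoretic derivation; the helper names and the example distribution are made up for illustration. Both methods keep a subset of high-probability tokens, zero out the rest, and rescale the survivors so they sum to one.

```python
def renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def nucleus_filter(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)

# Hypothetical next-token distribution over a four-token vocabulary.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(next_token_probs, k=2)
nucleus = nucleus_filter(next_token_probs, p=0.9)
```

With these numbers, `top2` keeps the first two tokens, and `nucleus` keeps the first three because their cumulative mass is the first to reach 0.9. A sampler would then draw the next token from the filtered distribution.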

[LG-199] Variational Language Concepts for Interpreting Foundation Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.03964
Authors: Hengyi Wang,Shiwei Tan,Zhiqing Hong,Desheng Zhang,Hao Wang
Keywords (EN): Foundation Language Models, achieved remarkable success, natural language processing, Foundation Language, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*Notes: Accepted at EMNLP 2024 findings

Abstract:Foundation Language Models (FLMs) such as BERT and its variants have achieved remarkable success in natural language processing. To date, the interpretability of FLMs has primarily relied on the attention weights in their self-attention layers. However, these attention weights only provide word-level interpretations, failing to capture higher-level structures, and are therefore lacking in readability and intuitiveness. To address this challenge, we first provide a formal definition of conceptual interpretation and then propose a variational Bayesian framework, dubbed VAriational Language Concept (VALC), to go beyond word-level interpretations and provide concept-level interpretations. Our theoretical analysis shows that our VALC finds the optimal language concepts to interpret FLM predictions. Empirical results on several real-world datasets show that our method can successfully provide conceptual interpretation for FLMs.

[LG-200] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Link: https://arxiv.org/abs/2410.03960
Authors: Aurick Qiao,Zhewei Yao,Samyam Rajbhandari,Yuxiong He
Keywords (EN): typically observes orders, longer prompt lengths, magnitude longer prompt, generation lengths, enterprise use cases
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Notes:

Abstract:LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers’ KV cache using a much earlier layer’s output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs.

[LG-201] Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models

Link: https://arxiv.org/abs/2410.03955
Authors: Gang Li,Wendi Yu,Yao Yao,Wei Tong,Yingbin Liang,Qihang Lin,Tianbao Yang
Keywords (EN): undergoes multiple cycles, model developmental safety, model development, model, model development process
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Notes: 41 pages, 8 figures

Abstract:In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system’s ability to handle difficult or emerging tasks. This continual model development process raises a significant issue that the model development for acquiring new or improving existing capabilities may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications especially in safety-critical domains, as failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in the re-improving and re-validation of existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that in the model development process the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure the model developmental safety, we present a safety-centric framework by formulating the model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (aka the CLIP model) for acquiring new capabilities or improving existing capabilities of image classification. We propose an efficient constrained optimization algorithm with theoretical guarantee and use its insights to finetune a CLIP model with task-dependent heads for promoting the model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.

[LG-202] SDA-GRIN for Adaptive Spatial-Temporal Multivariate Time Series Imputation

Link: https://arxiv.org/abs/2410.03954
Authors: Amir Eskandari,Aman Anand,Drishti Sharma,Farhana Zulkernine
Keywords (EN): missing data, multivariate time series, Spatial Dynamic Aware, Spatial, Recurrent Imputation Network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Notes:

Abstract:In various applications, multivariate time series often suffer from missing data. This issue can significantly disrupt systems that rely on the data. Spatial and temporal dependencies can be leveraged to impute the missing samples. Existing imputation methods often ignore dynamic changes in spatial dependencies. We propose a Spatial Dynamic Aware Graph Recurrent Imputation Network (SDA-GRIN) which is capable of capturing dynamic changes in spatial dependencies. SDA-GRIN leverages a multi-head attention mechanism to adapt graph structures with time. SDA-GRIN models multivariate time series as a sequence of temporal graphs and uses a recurrent message-passing architecture for imputation. We evaluate SDA-GRIN on four real-world datasets: SDA-GRIN improves MSE by 9.51% for the AQI and 9.40% for AQI-36. On the PEMS-BAY dataset, it achieves a 1.94% improvement in MSE. A detailed ablation study demonstrates the effect of window sizes and missing data on the performance of the method. Project page: this https URL

[LG-203] LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

Link: https://arxiv.org/abs/2410.03953
Authors: Selim Furkan Tekin,Fatih Ilhan,Tiansheng Huang,Sihao Hu,Ling Liu
Keywords (EN): Combining large language, large language models, Combining large, shown substantial performance, component LLMs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Notes:

Abstract:Combining large language models during training or at inference time has shown substantial performance gain over component LLMs. This paper presents LLM-TOPLA, a diversity-optimized LLM ensemble method with three unique properties: (i) We introduce the focal diversity metric to capture the diversity-performance correlation among component LLMs of an ensemble. (ii) We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of N base LLMs. Our pruning method recommends top-performing LLM subensembles of size S, often much smaller than N. (iii) We generate new output for each prompt query by utilizing a learn-to-ensemble approach, which learns to detect and resolve the output inconsistency among all component LLMs of an ensemble. Extensive evaluation on four different benchmarks shows good performance gain over the best LLM ensemble methods: (i) In constrained solution set problems, LLM-TOPLA outperforms the best-performing ensemble (Mixtral) by 2.2% in accuracy on MMLU and the best-performing LLM ensemble (MoreAgent) on GSM8k by 2.1%. (ii) In generative tasks, LLM-TOPLA outperforms the top-2 performers (Llama70b/Mixtral) on SearchQA by 3.9x in F1, and on XSum by more than 38 in ROUGE-1. Our code and dataset, which contains outputs of 8 modern LLMs on 4 benchmarks, are available at this https URL

[LG-204] A Brain-Inspired Regularizer for Adversarial Robustness

Link: https://arxiv.org/abs/2410.03952
Authors: Elie Attias,Cengiz Pehlevan,Dina Obeid
Keywords (EN): Convolutional Neural Networks, slight input perturbations, Convolutional Neural, task failures, visual tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*Notes: 10 pages plus appendix, 10 figures (main text), 15 figures (appendix), 3 tables (appendix)

Abstract:Convolutional Neural Networks (CNNs) excel in many visual tasks, but they tend to be sensitive to slight input perturbations that are imperceptible to the human eye, often resulting in task failures. Recent studies indicate that training CNNs with regularizers that promote brain-like representations, using neural recordings, can improve model robustness. However, the requirement to use neural data severely restricts the utility of these methods. Is it possible to develop regularizers that mimic the computational function of neural regularizers without the need for neural recordings, thereby expanding the usability and effectiveness of these techniques? In this work, we inspect a neural regularizer introduced in Li et al. (2019) to extract its underlying strength. The regularizer uses neural representational similarities, which we find also correlate with pixel similarities. Motivated by this finding, we introduce a new regularizer that retains the essence of the original but is computed using image pixel similarities, eliminating the need for neural recordings. We show that our regularization method 1) significantly increases model robustness to a range of black box attacks on various datasets and 2) is computationally inexpensive and relies only on original datasets. Our work explores how biologically motivated loss functions can be used to drive the performance of artificial neural networks.

[LG-205] UFLUX v2.0: A Process-Informed Machine Learning Framework for Efficient and Explainable Modelling of Terrestrial Carbon Uptake

Link: https://arxiv.org/abs/2410.03951
Authors: Wenquan Dong,Songyan Zhu,Jian Xu,Casey M. Ryan,Man Chen,Jingya Zeng,Hao Yu,Congfeng Cao,Jiancheng Shi
Keywords (EN): Gross Primary Productivity, Gross Primary, Primary Productivity, carbon plants fixed, fixed by photosynthesis
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Quantitative Methods (q-bio.QM)
*Notes:

Abstract:Gross Primary Productivity (GPP), the amount of carbon fixed by plants through photosynthesis, is pivotal for understanding the global carbon cycle and ecosystem functioning. Process-based models built on the knowledge of ecological processes are susceptible to biases stemming from their assumptions and approximations. These limitations potentially result in considerable uncertainties in global GPP estimation, which may pose significant challenges to our Net Zero goals. This study presents UFLUX v2.0, a process-informed model that integrates state-of-the-art ecological knowledge and advanced machine learning techniques to reduce uncertainties in GPP estimation by learning the biases between process-based models and eddy covariance (EC) measurements. In our findings, UFLUX v2.0 demonstrated a substantial improvement in model accuracy, achieving an R^2 of 0.79 with a reduced RMSE of 1.60 g C m^-2 d^-1, compared to the process-based model’s R^2 of 0.51 and RMSE of 3.09 g C m^-2 d^-1. Our global GPP distribution analysis indicates that while UFLUX v2.0 and the process-based model achieved similar global total GPP (137.47 Pg C and 132.23 Pg C, respectively), they exhibited large differences in spatial distribution, particularly in latitudinal gradients. These differences are very likely due to systematic biases in the process-based model and differing sensitivities to climate and environmental conditions. This study offers improved adaptability for GPP modelling across diverse ecosystems, and further enhances our understanding of global carbon cycles and their responses to environmental changes.

[LG-206] Interpolation-Free Deep Learning for Meteorological Downscaling on Unaligned Grids Across Multiple Domains with Application to Wind Power

Link: https://arxiv.org/abs/2410.03945
Authors: Jean-Sébastien Giroux,Simon-Philippe Breton,Julie Carreau
Keywords (EN): cleaner energy sources, climate change intensifies, change intensifies, increasingly urgent, shift to cleaner
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Notes:

Abstract:As climate change intensifies, the shift to cleaner energy sources becomes increasingly urgent. With wind energy production set to accelerate, reliable wind probabilistic forecasts are essential to ensure its efficient use. However, since numerical weather prediction models are computationally expensive, probabilistic forecasts are produced at resolutions too coarse to capture all mesoscale wind behaviors. Statistical downscaling, typically applied to enhance the resolution of climate model simulations, presents a viable solution with lower computational costs by learning a mapping from low-resolution (LR) variables to high-resolution (HR) meteorological variables. Leveraging deep learning, we evaluate a downscaling model based on a state-of-the-art U-Net architecture, applied to an ensemble member from a coarse-scale probabilistic forecast of wind velocity. The architecture is modified to incorporate (1) a learned grid alignment strategy to resolve LR-HR grid mismatches and (2) a processing module for multi-level atmospheric predictors. To extend the downscaling model’s applicability from fixed spatial domains to the entire Canadian region, we assess a transfer learning approach. Our results show that the learned grid alignment strategy performs as well as conventional pre-processing interpolation steps and that LR wind speed at multiple levels is sufficient as a predictor, enabling a more compact architecture. Additionally, they suggest that extending to new spatial domains using transfer learning is promising, and that downscaled wind velocities demonstrate potential in improving the detection of wind power ramps, a critical phenomenon for wind energy.

[LG-207] Oscillatory State-Space Models

Link: https://arxiv.org/abs/2410.03943
Authors: T. Konstantin Rusch,Daniela Rus
Keywords (EN): propose Linear Oscillatory, Linear Oscillatory State-Space, Linear Oscillatory, Oscillatory State-Space models, propose Linear
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Notes:

Abstract:We propose Linear Oscillatory State-Space models (LinOSS) for efficiently learning on long sequences. Inspired by cortical dynamics of biological neural networks, we base our proposed LinOSS model on a system of forced harmonic oscillators. A stable discretization, integrated over time using fast associative parallel scans, yields the proposed state-space model. We prove that LinOSS produces stable dynamics while requiring only a nonnegative diagonal state matrix. This is in stark contrast to many previous state-space models relying heavily on restrictive parameterizations. Moreover, we rigorously show that LinOSS is universal, i.e., it can approximate any continuous and causal operator mapping between time-varying functions, to desired accuracy. In addition, we show that an implicit-explicit discretization of LinOSS perfectly conserves the symmetry of time reversibility of the underlying dynamics. Together, these properties enable efficient modeling of long-range interactions, while ensuring stable and accurate long-horizon forecasting. Finally, our empirical results, spanning a wide range of time-series tasks from mid-range to very long-range classification and regression, as well as long-horizon forecasting, demonstrate that our proposed LinOSS model consistently outperforms state-of-the-art sequence models. Notably, LinOSS outperforms Mamba by nearly 2x and LRU by 2.5x on a sequence modeling task with sequences of length 50k.
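
To give a feel for why an oscillator-based state update can stay stable over long horizons, here is a minimal sketch of a single forced harmonic oscillator, y'' = -a·y + u(t), with a nonnegative parameter a. The abstract does not spell out the paper's discretization, so the update below is a generic implicit-explicit (symplectic Euler) step chosen for illustration, and all names and values are hypothetical. It is sequential and scalar, with no parallel scan and no learned parameters.

```python
def simulate_oscillator(a, dt, steps, forcing=lambda n: 0.0, y0=1.0, z0=0.0):
    """Integrate y'' = -a * y + u(t) with a symplectic Euler step.

    z is the velocity y'. The velocity update uses the old position and the
    position update uses the freshly computed velocity.
    """
    y, z = y0, z0
    trajectory = [y]
    for n in range(steps):
        z = z + dt * (-a * y + forcing(n))  # velocity from old position
        y = y + dt * z                      # position from new velocity
        trajectory.append(y)
    return trajectory

# Unforced oscillator with a = 1 (nonnegative), integrated for 10,000 steps.
positions = simulate_oscillator(a=1.0, dt=0.1, steps=10_000)
peak = max(abs(y) for y in positions)
```

In this toy setting `peak` stays close to the initial amplitude of 1 over the whole run, instead of growing or decaying. An input signal can be supplied through the `forcing` argument.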

[LG-208] Clustering Alzheimer’s Disease Subtypes via Similarity Learning and Graph Diffusion

Link: https://arxiv.org/abs/2410.03937
Authors: Tianyi Wei,Shu Yang,Davoud Ataee Tarzanagh,Jingxuan Bao,Jia Xu,Patryk Orzechowski,Joost B. Wagenaar,Qi Long,Li Shen
Keywords (EN): complex neurodegenerative disorder, Alzheimer disease, people worldwide, complex neurodegenerative, neurodegenerative disorder
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*Notes: ICIBM’23: International Conference on Intelligent Biology and Medicine, Tampa, FL, USA, July 16-19, 2023

Abstract:Alzheimer’s disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Due to the heterogeneous nature of AD, its diagnosis and treatment pose critical challenges. Consequently, recent years have seen growing research interest in identifying homogeneous AD subtypes that can assist in addressing these challenges. In this study, we aim to identify subtypes of AD that represent distinctive clinical features and underlying pathology by utilizing unsupervised clustering with graph diffusion and similarity learning. We adopted SIMLR, a multi-kernel similarity learning framework, and graph diffusion to perform clustering on a group of 829 patients with AD and mild cognitive impairment (MCI, a prodromal stage of AD) based on their cortical thickness measurements extracted from magnetic resonance imaging (MRI) scans. Although the clustering approach we utilized has not been explored for the task of AD subtyping before, it demonstrated significantly better performance than several commonly used clustering methods. Specifically, we showed the power of graph diffusion in reducing the effects of noise in the subtype detection. Our results revealed five subtypes that differed remarkably in their biomarkers, cognitive status, and some other clinical features. To evaluate the resultant subtypes further, a genetic association study was carried out and successfully identified potential genetic underpinnings of different AD subtypes. Our source code is available at: this https URL.

[LG-209] Learning Truncated Causal History Model for Video Restoration NEURIPS2024

Link: https://arxiv.org/abs/2410.03936
Authors: Amirhosein Ghasemabadi,Muhammad Kamran Janjua,Mohammad Salameh,Di Niu
Keywords: video frames governed, key challenge, transition dynamics, video, video restoration
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024. 24 pages

Abstract:One key challenge to video restoration is to model the transition dynamics of video frames governed by motion. In this work, we propose TURTLE to learn the truncated causal history model for efficient and high-performing video restoration. Unlike traditional methods that process a range of contextual frames in parallel, TURTLE enhances efficiency by storing and summarizing a truncated history of the input frame latent representation into an evolving historical state. This is achieved through a sophisticated similarity-based retrieval mechanism that implicitly accounts for inter-frame motion and alignment. The causal design in TURTLE enables recurrence in inference through state-memorized historical features while allowing parallel training by sampling truncated video clips. We report new state-of-the-art results on a multitude of video restoration benchmark tasks, including video desnowing, nighttime video deraining, video raindrops and rain streak removal, video super-resolution, real-world and synthetic video deblurring, and blind video denoising while reducing the computational cost compared to existing best contextual methods on all these tasks.

[LG-210] GAS-Norm: Score-Driven Adaptive Normalization for Non-Stationary Time Series Forecasting in Deep Learning CIKM’24

Link: https://arxiv.org/abs/2410.03935
Authors: Edoardo Urettini,Daniele Atzeni,Reshawn J. Ramjattan,Antonio Carta
Keywords: DNN forecasting models, DNN forecasting, beat simpler statistical, deep neural networks, DNN
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at CIKM '24

Abstract:Despite their popularity, deep neural networks (DNNs) applied to time series forecasting often fail to beat simpler statistical models. One of the main causes of this suboptimal performance is the data non-stationarity present in many processes. In particular, changes in the mean and variance of the input data can disrupt the predictive capability of a DNN. In this paper, we first show how DNN forecasting models fail in simple non-stationary settings. We then introduce GAS-Norm, a novel methodology for adaptive time series normalization and forecasting based on the combination of a Generalized Autoregressive Score (GAS) model and a Deep Neural Network. The GAS approach encompasses a score-driven family of models that estimate the mean and variance at each new observation, providing updated statistics to normalize the input data of the deep model. The output of the DNN is eventually denormalized using the statistics forecasted by the GAS model, resulting in a hybrid approach that leverages the strengths of both statistical modeling and deep learning. The adaptive normalization improves the performance of the model in non-stationary settings. The proposed approach is model-agnostic and can be applied to any DNN forecasting model. To empirically validate our proposal, we first compare GAS-Norm with other state-of-the-art normalization methods. We then combine it with state-of-the-art DNN forecasting models and test them on real-world datasets from the Monash open-access forecasting repository. Results show that deep forecasting models improve their performance in 21 out of 25 settings when combined with GAS-Norm compared to other normalization methods.
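
The normalize, forecast, denormalize loop described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a Gaussian observation density, scales the score by the inverse Fisher information (which reduces the updates to simple error corrections), and uses hypothetical learning rates `alpha` and `beta`. The function names are illustrative only.

```python
import math


def gas_normalize(series, alpha=0.1, beta=0.1, eps=1e-8):
    """Normalize each point with mean/variance estimated from past points only.

    Simplified Gaussian score-driven update (an assumption of this sketch):
    with inverse-Fisher scaling, the score for the mean is (y - mu) and the
    score for the variance is ((y - mu)^2 - var).
    """
    mu, var = series[0], 1.0
    normalized, stats = [], []
    for y in series:
        # Use the statistics available before observing y.
        sigma = math.sqrt(var + eps)
        normalized.append((y - mu) / sigma)
        stats.append((mu, sigma))
        # Score-driven update of the time-varying mean and variance.
        err = y - mu
        mu = mu + alpha * err
        var = var + beta * (err * err - var)
    return normalized, stats


def denormalize(values, stats):
    """Map model outputs back to the original scale with the stored statistics."""
    return [v * sigma + mu for v, (mu, sigma) in zip(values, stats)]


if __name__ == "__main__":
    # A series with a level shift, the kind of non-stationarity discussed above.
    series = [0.0] * 50 + [10.0] * 50
    normalized, stats = gas_normalize(series)
    restored = denormalize(normalized, stats)
    print(max(abs(a - b) for a, b in zip(series, restored)))
```

In a full pipeline, a forecasting model would sit between the two calls, and denormalization would use statistics forecast for the prediction horizon rather than the stored in-sample ones used here.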

[LG-211] Online Posterior Sampling with a Diffusion Prior

Link: https://arxiv.org/abs/2410.03919
Authors: Branislav Kveton,Boris Oreshkin,Youngsuk Park,Aniket Deshmukh,Rui Song
Keywords: Gaussian prior, Gaussian, Laplace approximation, Posterior sampling, contextual bandits
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Proceedings of the 38th Conference on Neural Information Processing Systems

Abstract:Posterior sampling in contextual bandits with a Gaussian prior can be implemented exactly or approximately using the Laplace approximation. The Gaussian prior is computationally efficient but it cannot describe complex distributions. In this work, we propose approximate posterior sampling algorithms for contextual bandits with a diffusion model prior. The key idea is to sample from a chain of approximate conditional posteriors, one for each stage of the reverse process, which are estimated in a closed form using the Laplace approximation. Our approximations are motivated by posterior sampling with a Gaussian prior, and inherit its simplicity and efficiency. They are asymptotically consistent and perform well empirically on a variety of contextual bandit problems.
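
For readers unfamiliar with the Gaussian-prior baseline that this abstract starts from, the sketch below shows exact posterior sampling (Thompson sampling) in a deliberately simplified setting. It assumes a non-contextual K-armed bandit with known noise variance, so it illustrates neither the contextual case nor the diffusion prior proposed in the paper. Class and function names are hypothetical.

```python
import random


class GaussianArm:
    """Conjugate Gaussian posterior over one arm's mean reward."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.mean = prior_mean
        self.var = prior_var
        self.noise_var = noise_var

    def sample(self, rng):
        # Draw one plausible mean reward from the current posterior.
        return rng.gauss(self.mean, self.var ** 0.5)

    def update(self, reward):
        # Exact Gaussian-Gaussian update: precisions add up.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + reward / self.noise_var) / precision
        self.var = 1.0 / precision


def thompson_sampling(true_means, rounds=2000, seed=0):
    """Pull the arm whose posterior sample is largest, then update that arm."""
    rng = random.Random(seed)
    arms = [GaussianArm() for _ in true_means]
    pulls = [0] * len(true_means)
    for _ in range(rounds):
        samples = [arm.sample(rng) for arm in arms]
        chosen = max(range(len(arms)), key=lambda i: samples[i])
        reward = rng.gauss(true_means[chosen], 1.0)  # unit-variance reward noise
        arms[chosen].update(reward)
        pulls[chosen] += 1
    return arms, pulls


if __name__ == "__main__":
    arms, pulls = thompson_sampling([0.0, 1.0, 2.0])
    print(pulls)
```

The update is in closed form because the prior is Gaussian. The abstract's point is that this convenience is lost with richer priors, which is why its method approximates each conditional posterior of the reverse process with a Laplace approximation.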

[LG-212] Distribution Guided Active Feature Acquisition

Link: https://arxiv.org/abs/2410.03915
Authors: Yang Li,Junier Oliva
Keywords: Human agents routinely, agents routinely reason, Human agents, routinely reason, weigh the cost
Categories: Machine Learning (cs.LG)

Abstract:Human agents routinely reason on instances with incomplete and muddied data (and weigh the cost of obtaining further features). In contrast, much of ML is devoted to the unrealistic, sterile environment where all features are observed and further information on an instance is obviated. Here we extend past static ML and develop an active feature acquisition (AFA) framework that interacts with the environment to obtain new information on-the-fly and can: 1) make inferences on an instance in the face of incomplete features, 2) determine a plan for feature acquisitions to obtain additional information on the instance at hand. We build our AFA framework on a backbone of understanding the information and conditional dependencies that are present in the data. First, we show how to build generative models that can capture dependencies over arbitrary subsets of features and employ these models for acquisitions in a greedy scheme. We then show that it is possible to guide the training of RL agents for AFA via side-information and auxiliary rewards stemming from our generative models. We also examine two important factors for deploying AFA models in real-world scenarios, namely interpretability and robustness. Extensive experiments demonstrate the state-of-the-art performance of our AFA framework.

[LG-213] Improving Node Representation by Boosting Target-Aware Contrastive Loss

Link: https://arxiv.org/abs/2410.03901
Authors: Ying-Chun Lin,Jennifer Neville
Keywords: capturing intricate connections, edges capturing intricate, model complex relationships, relationships between entities, intricate connections
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graphs model complex relationships between entities, with nodes and edges capturing intricate connections. Node representation learning involves transforming nodes into low-dimensional embeddings. These embeddings are typically used as features for downstream tasks. Therefore, their quality has a significant impact on task performance. Existing approaches for node representation learning span (semi-)supervised, unsupervised, and self-supervised paradigms. In graph domains, (semi-)supervised learning often only optimizes models based on class labels, neglecting other abundant graph signals, which limits generalization. While self-supervised or unsupervised learning produces representations that better capture underlying graph signals, the usefulness of these captured signals for downstream target tasks can vary. To bridge this gap, we introduce Target-Aware Contrastive Learning (Target-aware CL) which aims to enhance target task performance by maximizing the mutual information between the target task and node representations with a self-supervised learning process. This is achieved through a sampling function, XGBoost Sampler (XGSampler), to sample proper positive examples for the proposed Target-Aware Contrastive Loss (XTCL). By minimizing XTCL, Target-aware CL increases the mutual information between the target task and node representations, such that model generalization is improved. Additionally, XGSampler enhances the interpretability of each signal by showing the weights for sampling the proper positive examples. We show experimentally that XTCL significantly improves the performance on two target tasks: node classification and link prediction tasks, compared to state-of-the-art models.
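
As background for the contrastive objective mentioned in this abstract, the following sketch computes a generic InfoNCE-style loss for one anchor node. It is not the paper's XTCL loss, and no XGSampler is involved: the positive example is simply passed in, and the temperature value is an arbitrary assumption. Minimizing a loss of this form is commonly understood to maximize a lower bound on mutual information, which is the link to the abstract's claim.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Generic InfoNCE loss: -log softmax score of the positive example."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    negatives = [[-1.0, 0.0], [0.0, -1.0]]
    # A positive aligned with the anchor gives a lower loss than an unrelated one.
    print(info_nce(anchor, [0.9, 0.1], negatives))
    print(info_nce(anchor, [0.0, 1.0], negatives))
```

The choice of positive example drives the loss, which is why a sampler that picks positives relevant to the target task can matter as much as the loss itself.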

[LG-214] Human-aligned Chess with a Bit of Search

Link: https://arxiv.org/abs/2410.03893
Authors: Yiming Zhang,Athul Paul Jacob,Vivian Lai,Daniel Fried,Daphne Ippolito
Keywords: recent years, surpassed the strongest, match human intelligence, quest to match, Chess
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Chess has long been a testbed for AI’s quest to match human intelligence, and in recent years, chess AI systems have surpassed the strongest humans at the game. However, these systems are not human-aligned; they are unable to match the skill levels of all human partners or model human-like behaviors beyond piece movement. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game. Allie is trained on log sequences of real chess games to model the behaviors of human chess players across the skill spectrum, including non-move behaviors such as pondering times and resignations. In offline evaluations, we find that Allie exhibits humanlike behavior: it outperforms the existing state-of-the-art in human chess move prediction and “ponders” at critical positions. The model learns to reliably assign reward at each game state, which can be used at inference as a reward function in a novel time-adaptive Monte-Carlo tree search (MCTS) procedure, where the amount of search depends on how long humans would think in the same positions. Adaptive search enables remarkable skill calibration; in a large-scale online evaluation against players with ratings from 1000 to 2600 Elo, our adaptive search method leads to a skill gap of only 49 Elo on average, substantially outperforming search-free and standard MCTS baselines. Against grandmaster-level (2500 Elo) opponents, Allie with adaptive search exhibits the strength of a fellow grandmaster, all while learning exclusively from humans.

[LG-215] Towards Cost Sensitive Decision Making

Link: https://arxiv.org/abs/2410.03892
Authors: Yang Li,Junier Oliva
Keywords: additional relevant information, real-world situations, additional relevant, relevant information, information when making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many real-world situations allow for the acquisition of additional relevant information when making decisions with limited or uncertain data. However, traditional RL approaches either require all features to be acquired beforehand (e.g. in a MDP) or regard part of them as missing data that cannot be acquired (e.g. in a POMDP). In this work, we consider RL models that may actively acquire features from the environment to improve the decision quality and certainty, while automatically balancing the cost of feature acquisition process and the reward of task decision process. We propose the Active-Acquisition POMDP and identify two types of the acquisition process for different application domains. In order to assist the agent in the actively-acquired partially-observed environment and alleviate the exploration-exploitation dilemma, we develop a model-based approach, where a deep generative model is utilized to capture the dependencies of the features and impute the unobserved features. The imputations essentially represent the beliefs of the agent. Equipped with the dynamics model, we develop hierarchical RL algorithms to resolve both types of the AA-POMDPs. Empirical results demonstrate that our approach achieves considerably better performance than existing POMDP-RL solutions.

[LG-216] Solving Dual Sourcing Problems with Supply Mode Dependent Failure Rates

Link: https://arxiv.org/abs/2410.03887
Authors: Fabian Akkerman,Nils Knofius,Matthieu van der Heijden,Martijn Mes
Keywords: supply mode dependent, mode dependent failure, paper investigates dual, managing spare parts, investigates dual sourcing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper investigates dual sourcing problems with supply mode dependent failure rates, particularly relevant in managing spare parts for downtime-critical assets. To enhance resilience, businesses increasingly adopt dual sourcing strategies using both conventional and additive manufacturing techniques. This paper explores how these strategies can optimise sourcing by addressing variations in part properties and failure rates. A significant challenge is the distinct failure characteristics of parts produced by these methods, which influence future demand. To tackle this, we propose a new iterative heuristic and several reinforcement learning techniques combined with an endogenous parameterised learning (EPL) approach. This EPL approach - compatible with any learning method - allows a single policy to handle various input parameters for multiple items. In a stylised setting, our best policy achieves an average optimality gap of 0.4%. In a case study within the energy sector, our policies outperform the baseline in 91.1% of instances, yielding average cost savings up to 22.6%.

[LG-217] DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction

Link: https://arxiv.org/abs/2410.03883
Authors: Xinwei Zhang,Zhiqi Bu,Borja Balle,Mingyi Hong,Meisam Razaviyayn,Vahab Mirrokni
Keywords: safeguarding individual data, offers a robust, Differential privacy, individual data privacy, Differential
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Abstract:Differential privacy (DP) offers a robust framework for safeguarding individual data privacy. To utilize DP in training modern machine learning models, differentially private optimizers have been widely used in recent years. A popular approach to privatize an optimizer is to clip the individual gradients and add sufficiently large noise to the clipped gradient. This approach led to the development of DP optimizers that have comparable performance with their non-private counterparts in fine-tuning tasks or in tasks with a small number of training parameters. However, a significant performance drop is observed when these optimizers are applied to large-scale training. This degradation stems from the substantial noise injection required to maintain DP, which disrupts the optimizer’s dynamics. This paper introduces DiSK, a novel framework designed to significantly enhance the performance of DP optimizers. DiSK employs Kalman filtering, a technique drawn from control and signal processing, to effectively denoise privatized gradients and generate progressively refined gradient estimations. To ensure practicality for large-scale training, we simplify the Kalman filtering process, minimizing its memory and computational demands. We establish theoretical privacy-utility trade-off guarantees for DiSK, and demonstrate provable improvements over standard DP optimizers like DPSGD in terms of iteration complexity upper-bound. Extensive experiments across diverse tasks, including vision tasks such as CIFAR-100 and ImageNet-1k and language fine-tuning tasks such as GLUE, E2E, and DART, validate the effectiveness of DiSK. The results showcase its ability to significantly improve the performance of DP optimizers, surpassing state-of-the-art results under the same privacy constraints on several benchmarks.
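
The clip-then-add-noise step that this abstract describes as the standard way to privatize an optimizer can be sketched as follows. This shows only the baseline DP-SGD gradient computation, not DiSK's Kalman filtering. The parameter names `max_norm` and `noise_multiplier` are assumptions of this sketch, and no privacy accounting is performed.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_example_grads, max_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random(0)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(private_gradient(grads, max_norm=1.0, noise_multiplier=1.0))
```

Because the noise scale is tied to the clipping threshold rather than to the true gradient magnitude, the noise can dominate the signal in large-scale training, which is the degradation the abstract sets out to reduce by denoising the privatized gradients.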

[LG-218] A Federated Distributionally Robust Support Vector Machine with Mixture of Wasserstein Balls Ambiguity Set for Distributed Fault Diagnosis

Link: https://arxiv.org/abs/2410.03877
Authors: Michael Ibrahim,Heraldo Rozas,Nagi Gebraeel,Weijun Xie
Keywords: fault diagnosis tasks, original parts manufacturers, long-term service contracts, geographically dispersed data, provide long-term service
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 3 figures

Abstract:The training of classification models for fault diagnosis tasks using geographically dispersed data is a crucial task for original parts manufacturers (OEMs) seeking to provide long-term service contracts (LTSCs) to their customers. Due to privacy and bandwidth constraints, such models must be trained in a federated fashion. Moreover, due to harsh industrial settings the data often suffers from feature and label uncertainty. Therefore, we study the problem of training a distributionally robust (DR) support vector machine (SVM) in a federated fashion over a network comprised of a central server and G clients without sharing data. We consider the setting where the local data of each client g is sampled from a unique true distribution $\mathbb{P}_g$, and the clients can only communicate with the central server. We propose a novel Mixture of Wasserstein Balls (MoWB) ambiguity set that relies on local Wasserstein balls centered at the empirical distribution of the data at each client. We study theoretical aspects of the proposed ambiguity set, deriving its out-of-sample performance guarantees and demonstrating that it naturally allows for the separability of the DR problem. Subsequently, we propose two distributed optimization algorithms for training the global FDR-SVM: i) a subgradient method-based algorithm, and ii) an alternating direction method of multipliers (ADMM)-based algorithm. We derive the optimization problems to be solved by each client and provide closed-form expressions for the computations performed by the central server during each iteration for both algorithms. Finally, we thoroughly examine the performance of the proposed algorithms in a series of numerical experiments utilizing both simulation data and popular real-world datasets.

[LG-219] Empowering Domain-Specific Language Models with Graph-Oriented Databases: A Paradigm Shift in Performance and Model Maintenance

Link: https://arxiv.org/abs/2410.03867
Authors: Ricardo Di Pasquale,Soledad Represa
Keywords: domain-specific language models, domain-specific language, application domains, specific application domains, industry-specific requirements
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:In an era dominated by data, the management and utilization of domain-specific language have emerged as critical challenges in various application domains, particularly those with industry-specific requirements. Our work is driven by the need to effectively manage and process large volumes of short text documents inherent in specific application domains. By leveraging domain-specific knowledge and expertise, our approach aims to shape factual data within these domains, thereby facilitating enhanced utilization and understanding by end-users. Central to our methodology is the integration of domain-specific language models with graph-oriented databases, facilitating seamless processing, analysis, and utilization of textual data within targeted domains. Our work underscores the transformative potential of the partnership of domain-specific language models and graph-oriented databases. This cooperation aims to assist researchers and engineers in metric usage, mitigation of latency issues, boosting explainability, enhancing debugging, and improving overall model performance. Moving forward, we envision our work as a guide for AI engineers, providing valuable insights for the implementation of domain-specific language models in conjunction with graph-oriented databases, as well as valuable experience in the full life-cycle maintenance of this kind of product.

[LG-220] DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Link: https://arxiv.org/abs/2410.03864
Authors: Murong Yue,Wenlin Yao,Haitao Mi,Dian Yu,Ziyu Yao,Dong Yu
Keywords: large language models, gained significant attention, task-solving LLM, LLM, specific task-solving LLM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

[LG-221] Improving Mapper's Robustness by Varying Resolution According to Lens-Space Density

Link: https://arxiv.org/abs/2410.03862
Authors: Kaleb D. Ruscitti,Leland McInnes
Keywords: single resolution scale, semantic space, propose an improvement, algorithm that removes, removes the assumption
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
Comments: 29 pages, 8 figures

Abstract:We propose an improvement to the Mapper algorithm that removes the assumption of a single resolution scale across semantic space, and improves the robustness of the results under change of parameters. This eases parameter selection, especially for datasets with highly variable local density in the Morse function f used for Mapper. This is achieved by incorporating this density into the choice of cover for Mapper. Furthermore, we prove that for covers with some natural hypotheses, the graph output by Mapper still converges in bottleneck distance to the Reeb graph of the Rips complex of the data, but captures more topological features than when using the usual Mapper cover. Finally, we discuss implementation details, and include the results of computational experiments. We also provide an accompanying reference implementation.

[LG-222] Detecting Machine-Generated Long-Form Content with Latent-Space Variables

Link: https://arxiv.org/abs/2410.03856
Authors: Yufei Tian,Zeyu Pan,Nanyun Peng
Keywords: large language models, distinguishing machine-generated outputs, generate fluent long-form, fluent long-form texts, trustworthiness of expressions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The increasing capability of large language models (LLMs) to generate fluent long-form texts is presenting new challenges in distinguishing machine-generated outputs from human-written ones, which is crucial for ensuring authenticity and trustworthiness of expressions. Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts, including different prompting and decoding strategies, and adversarial attacks. We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts by training a latent-space model on sequences of events or topics derived from human-written texts. In three different domains, machine-generated texts, which are originally inseparable from human texts on the token level, can be better distinguished with our latent-space model, leading to a 31% improvement over strong baselines such as DetectGPT. Our analysis further reveals that, unlike humans, modern LLMs like GPT-4 generate event triggers and their transitions differently, an inherent disparity that helps our method to robustly detect machine-generated texts.

[LG-223] A Survey on Group Fairness in Federated Learning: Challenges, Taxonomy of Solutions and Directions for Future Research

Link: https://arxiv.org/abs/2410.03855
Authors: Teresa Salazar,Helder Araújo,Alberto Cano,Pedro Henriques Abreu
Keywords: Group fairness, achieving equitable outcomes, race or gender, equitable outcomes, Federated learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Group fairness in machine learning is a critical area of research focused on achieving equitable outcomes across different groups defined by sensitive attributes such as race or gender. Federated learning, a decentralized approach to training machine learning models across multiple devices or organizations without sharing raw data, amplifies the need for fairness due to the heterogeneous data distributions across clients, which can exacerbate biases. The intersection of federated learning and group fairness has attracted significant interest, with 47 research works specifically dedicated to addressing this issue. However, no dedicated survey has focused comprehensively on group fairness in federated learning. In this work, we present an in-depth survey on this topic, addressing the critical challenges and reviewing related works in the field. We create a novel taxonomy of these approaches based on key criteria such as data partitioning, location, and applied strategies. Additionally, we explore broader concerns related to this problem and investigate how different approaches handle the complexities of various sensitive groups and their intersections. Finally, we review the datasets and applications commonly used in current research. We conclude by highlighting key areas for future research, emphasizing the need for more methods to address the complexities of achieving group fairness in federated systems.

[LG-224] Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood NEURIPS2024

Link: https://arxiv.org/abs/2410.03849
Authors: Ziyi Liu,Idan Attias,Daniel M. Roy
Keywords: possibly nonparametric hypothesis, nonparametric hypothesis class, contextual Shtarkov sum, sequential probability assignment, Shtarkov sum
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To appear in NeurIPS 2024

Abstract:We study the fundamental problem of sequential probability assignment, also known as online learning with logarithmic loss, with respect to an arbitrary, possibly nonparametric hypothesis class. Our goal is to obtain a complexity measure for the hypothesis class that characterizes the minimax regret and to determine a general, minimax optimal algorithm. Notably, the sequential $\ell_\infty$ entropy, extensively studied in the literature (Rakhlin and Sridharan, 2015, Bilodeau et al., 2020, Wu et al., 2023), was shown to not characterize minimax risk in general. Inspired by the seminal work of Shtarkov (1987) and Rakhlin, Sridharan, and Tewari (2010), we introduce a novel complexity measure, the contextual Shtarkov sum, corresponding to the Shtarkov sum after projection onto a multiary context tree, and show that the worst case log contextual Shtarkov sum equals the minimax regret. Using the contextual Shtarkov sum, we derive the minimax optimal strategy, dubbed contextual Normalized Maximum Likelihood (cNML). Our results hold for sequential experts, beyond binary labels, which are settings rarely considered in prior work. To illustrate the utility of this characterization, we provide a short proof of a new regret upper bound in terms of sequential $\ell_\infty$ entropy, unifying and sharpening state-of-the-art bounds by Bilodeau et al. (2020) and Wu et al. (2023).
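
To make the Shtarkov sum and normalized maximum likelihood concrete, the sketch below works through the classical context-free case only. It assumes the hypothesis class is all i.i.d. Bernoulli models on binary sequences of length n, which is not the contextual setting of the paper. In this special case the log of the Shtarkov sum is the minimax regret under log loss, and NML assigns each sequence its maximized likelihood divided by that sum.

```python
import itertools
import math


def max_likelihood(n, k):
    """Largest probability any Bernoulli(p) model gives a sequence with k ones."""
    p = k / n
    # Python evaluates 0.0 ** 0 as 1.0, which handles the k = 0 and k = n cases.
    return (p ** k) * ((1 - p) ** (n - k))


def shtarkov_sum(n):
    """Sum of maximized likelihoods over all binary sequences of length n >= 1."""
    return sum(math.comb(n, k) * max_likelihood(n, k) for k in range(n + 1))


def nml_probability(sequence):
    """Normalized maximum likelihood probability of a binary sequence."""
    n, k = len(sequence), sum(sequence)
    return max_likelihood(n, k) / shtarkov_sum(n)


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        # Minimax regret in nats for the Bernoulli class at horizon n.
        print(n, math.log(shtarkov_sum(n)))
    total = sum(nml_probability(s) for s in itertools.product([0, 1], repeat=3))
    print(total)
```

For n = 2 the sum is 1 + 2(1/2)(1/2) + 1 = 2.5, so the minimax regret is log 2.5 nats. The contextual Shtarkov sum in the abstract plays the same role after projecting the hypothesis class onto a context tree.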

[LG-225] Model-Based Reward Shaping for Adversarial Inverse Reinforcement Learning in Stochastic Environments

Link: https://arxiv.org/abs/2410.03847
Authors: Simon Sinong Zhan,Qingyuan Wu,Philip Wang,Yixuan Wang,Ruochen Jiao,Chao Huang,Qi Zhu
Keywords: Inverse Reinforcement Learning, Adversarial Inverse Reinforcement, Reinforcement Learning, Adversarial Inverse, Inverse Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we aim to tackle the limitation of the Adversarial Inverse Reinforcement Learning (AIRL) method in stochastic environments where theoretical results cannot hold and performance is degraded. To address this issue, we propose a novel method which infuses the dynamics information into the reward shaping with the theoretical guarantee for the induced optimal policy in the stochastic environments. Incorporating our novel model-enhanced rewards, we present a novel Model-Enhanced AIRL framework, which integrates transition model estimation directly into reward shaping. Furthermore, we provide a comprehensive theoretical analysis of the reward error bound and performance difference bound for our method. The experimental results in MuJoCo benchmarks show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines.

[LG-226] Explaining the (Not So) Obvious: Simple and Fast Explanation of STAN, a Next Point of Interest Recommendation System

Link: https://arxiv.org/abs/2410.03841
Authors: Fajrian Yunus,Talel Abdessalem
Keywords: explain machine learning, machine learning systems, machine learning, machine learning methods, lot of effort
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A lot of effort in recent years has been expended to explain machine learning systems. However, some machine learning methods are inherently explainable, and thus are not completely black box. This enables the developers to make sense of the output without developing a complex and expensive explainability technique. Besides that, explainability should be tailored to suit the context of the problem. In a recommendation system which relies on collaborative filtering, the recommendation is based on the behaviors of similar users, therefore the explanation should tell which other users are similar to the current user. Similarly, if the recommendation system is based on sequence prediction, the explanation should also tell which input timesteps are the most influential. We demonstrate this philosophy/paradigm in STAN (Spatio-Temporal Attention Network for Next Location Recommendation), a next Point of Interest recommendation system based on collaborative filtering and sequence prediction. We also show that the explanation helps to “debug” the output.

[LG-227] Learning Code Preference via Synthetic Evolution

Link: https://arxiv.org/abs/2410.03837
Authors: Jiawei Liu,Thanh Nguyen,Mingyue Shang,Hantian Ding,Xiaopeng Li,Yu Yu,Varun Kumar,Zijian Wang
Keywords: Large Language Models, Large Language, remarkable coding capabilities, recently demonstrated remarkable, demonstrated remarkable coding
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)

Abstract: Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks that cover three verifiable properties (correctness, efficiency, and security) along with human preference. Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%. Meanwhile, CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective. We also rigorously validate the design choices in CodeFavor via a comprehensive set of controlled experiments. Furthermore, we discover the prohibitive costs and limitations of human-based code preference: despite spending 23.4 person-minutes on each task, 15.1-40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.

[LG-228] Why Fine-Tuning Struggles with Forgetting in Machine Unlearning? Theoretical Insights and a Remedial Approach

Link: https://arxiv.org/abs/2410.03833
Authors: Meng Ding, Jinhui Xu, Kaiyi Ji
Keywords (EN): forgetting data, Machine Unlearning, area of research, specific subsets, significant area
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 25 pages, 5 figures

Abstract: Machine Unlearning has emerged as a significant area of research, focusing on ‘removing’ specific subsets of data from a trained model. Fine-tuning (FT) methods have become one of the fundamental approaches for approximating unlearning, as they effectively retain model performance. However, it is consistently observed that naive FT methods struggle to forget the targeted data. In this paper, we present the first theoretical analysis of FT methods for machine unlearning within a linear regression framework, providing a deeper exploration of this phenomenon. We investigate two scenarios: distinct features and overlapping features. Our findings reveal that FT models can achieve zero remaining loss yet fail to forget the forgetting data, unlike golden models (trained from scratch without the forgetting data). This analysis reveals that naive FT methods struggle with forgetting because the pretrained model retains information about the forgetting data, and the fine-tuning process has no impact on this retained information. To address this issue, we first propose a theoretical approach to mitigate the retention of forgetting data in the pretrained model. Our analysis shows that removing the forgetting data’s influence allows FT models to match the performance of the golden model. Building on this insight, we introduce a discriminative regularization term to practically reduce the unlearning loss gap between the fine-tuned model and the golden model. Our experiments on both synthetic and real-world datasets validate these theoretical insights and demonstrate the effectiveness of the proposed regularization method.
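
The "zero remaining loss yet fails to forget" observation can be reproduced in a toy overparameterized linear regression. The sketch below is not the paper's setup: it assumes NumPy, noise-free labels, minimum-norm interpolation for the pretrained and golden models, and plain gradient descent for fine-tuning. All sizes, the seed, and the learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_retain, n_forget, dim = 20, 5, 100  # dim > n_retain + n_forget (overparameterized)

X_retain = rng.normal(size=(n_retain, dim))
X_forget = rng.normal(size=(n_forget, dim))
w_true = rng.normal(size=dim)
y_retain = X_retain @ w_true
y_forget = X_forget @ w_true


def min_norm_fit(X, y):
    """Minimum-norm solution of X w = y (interpolates when dim > rows)."""
    return np.linalg.pinv(X) @ y


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


# Pretrained model: fit on retained + forgetting data together.
w_pre = min_norm_fit(np.vstack([X_retain, X_forget]),
                     np.concatenate([y_retain, y_forget]))

# Naive fine-tuning: gradient descent on the retained data only, from w_pre.
w_ft = w_pre.copy()
for _ in range(200):
    grad = 2.0 * X_retain.T @ (X_retain @ w_ft - y_retain) / n_retain
    w_ft -= 0.01 * grad

# Golden model: trained from scratch without the forgetting data.
w_gold = min_norm_fit(X_retain, y_retain)

ft_retain_loss = mse(w_ft, X_retain, y_retain)      # stays near zero
ft_forget_loss = mse(w_ft, X_forget, y_forget)      # also near zero: not forgotten
gold_forget_loss = mse(w_gold, X_forget, y_forget)  # clearly non-zero
print(ft_retain_loss, ft_forget_loss, gold_forget_loss)
```

Because the pretrained model already fits the retained data exactly, the fine-tuning gradient is essentially zero and the weights never move, so the fit to the forgetting data survives untouched.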

[LG-229] Large Language Models can be Strong Self-Detoxifiers

Link: https://arxiv.org/abs/2410.03818
Authors: Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel
Keywords (EN): aligning large language, large language models, likelihood of generating, generating harmful, essential task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 20 pages

Abstract: Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy. Evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level by only using the LLM’s internal representations.
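
To make the idea of margin-guided sampling concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the SASA algorithm itself: the linear classifier (`w`, `b`), the per-candidate embeddings, the additive form `logit + beta * margin`, and the strength `beta` are all hypothetical stand-ins for whatever the paper derives.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin(embedding: List[float], w: List[float], b: float) -> float:
    """Linear score; positive means the non-toxic side of the boundary (assumed)."""
    return sum(wi * xi for wi, xi in zip(w, embedding)) + b


def steer(logits: List[float], candidate_embeddings: List[List[float]],
          w: List[float], b: float, beta: float) -> List[float]:
    """Re-weight next-token probabilities by the margin each candidate would yield."""
    adjusted = [logit + beta * margin(emb, w, b)
                for logit, emb in zip(logits, candidate_embeddings)]
    return softmax(adjusted)


# Toy example: token 0 is the most likely, but lands on the "toxic" side.
logits = [2.0, 1.0, 0.5]
embeddings = [[-1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]  # hypothetical context+token vectors
w, b = [1.0, 0.0], 0.0

base = softmax(logits)
steered = steer(logits, embeddings, w, b, beta=2.0)
print(base, steered)
```

Setting `beta` to zero recovers the original sampling distribution, which is one way to see that the adjustment only touches the decoding step and leaves the model weights alone.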

[LG-230] SOI: Scaling Down Computational Complexity by Estimating Partial States of the Model NEURIPS2024

Link: https://arxiv.org/abs/2410.03813
Authors: Grzegorz Stefański, Paweł Daniluk, Artur Szumaczuk, Jakub Tkaczuk
Keywords (EN):
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: NeurIPS 2024

[LG-231] Can Mamba Always Enjoy the “Free Lunch”?

Link: https://arxiv.org/abs/2410.03810
Authors: Ruifeng Ren, Zhicong Li, Yong Liu
Keywords (EN): Large Language Models, current Large Language, Language Models, Large Language, current Large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract: Transformers have been the cornerstone of current Large Language Models (LLMs); however, their linear growth in overhead during inference with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-level size during inference, and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask: can Mamba always enjoy the “free lunch”? In this paper, we focus on analyzing the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, while it can achieve perfect performance when the size scales linearly with sequence length. Based on this observation, we analyze Mamba’s ability to tackle DP problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.

[LG-232] Metadata Matters for Time Series: Informative Forecasting with Transformers

Link: https://arxiv.org/abs/2410.03806
Authors: Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Li Zhang, Jianmin Wang, Mingsheng Long
Keywords (EN): Time series, extensive real-world applications, Time series forecasting, series, Time
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract: Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on the time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g., dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer (MetaTST), which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes it into natural language using pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to let series and metadata tokens communicate, which extends series representations with metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art performance compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.

[LG-233] Local Attention Mechanism: Boosting the Transformer Architecture for Long-Sequence Time Series Forecasting

Link: https://arxiv.org/abs/2410.03805
Authors: Ignacio Aguilera-Martos, Andrés Herrera-Poyatos, Julián Luengo, Francisco Herrera
Keywords (EN): natural language processing, time series, time series analysis, leading choice, choice in natural
Categories: Machine Learning (cs.LG)
Notes:

Abstract: Transformers have become the leading choice in natural language processing over other deep learning architectures. This trend has also permeated the field of time series analysis, especially for long-horizon forecasting, showcasing promising results both in performance and running time. In this paper, we introduce Local Attention Mechanism (LAM), an efficient attention mechanism tailored for time series analysis. This mechanism exploits the continuity properties of time series to reduce the number of attention scores computed. We present an algorithm for implementing LAM in tensor algebra that runs in O(n log n) time and memory, significantly improving upon the O(n^2) time and memory complexity of traditional attention mechanisms. We also note the lack of proper datasets to evaluate long-horizon forecast models. Thus, we propose a novel set of datasets to improve the evaluation of models addressing long-horizon forecasting challenges. Our experimental analysis demonstrates that the vanilla transformer architecture magnified with LAM surpasses state-of-the-art models, including the vanilla attention mechanism. These results confirm the effectiveness of our approach and highlight a range of future challenges in long-sequence time series forecasting.

[LG-234] Mixture of Attentions For Speculative Decoding

Link: https://arxiv.org/abs/2410.03804
Authors: Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang
Keywords (EN): Large Language Models, Large Language, parameters of Large, Language Models, computational requirements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract: The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
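
The propose-then-verify loop behind speculative decoding can be sketched with toy models. This is the simple greedy, exact-match variant, not the paper's Mixture of Attentions architecture and not the probabilistic acceptance rule used in practice; `draft_model`, `target_model`, and the `NextToken` interface are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (hypothetical)


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens accepted in it."""
    # 1) The small model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The large model checks every proposed position. A real system does
    #    this in one batched forward pass; a plain loop stands in for it here.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # bonus token when all k match
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10  # toy "large" model: count upward modulo 10


def draft_model(tokens: List[int]) -> int:
    # Toy "small" model: agrees with the target except right after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


output, rounds = generate([0], draft_model, target_model, k=3, max_new=12)
print(output, rounds)
```

The output always matches what the target model would have produced on its own; the saving comes from needing fewer verification rounds than generated tokens.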

[LG-235] Text-guided Diffusion Model for 3D Molecule Generation

Link: https://arxiv.org/abs/2410.03803
Authors: Yanchen Luo, Junfeng Fang, Sihang Li, Zhiyuan Liu, Jiancan Wu, An Zhang, Wenjie Du, Xiang Wang
Keywords (EN): Text-guided Small Molecule, Small Molecule Generation, crucial in biology, drug discovery, targeted properties
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Notes:

Abstract: The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions, struggling with complex customizations described in detailed human language. To address this, we propose text guidance instead, and introduce TextSMOG, a new Text-guided Small Molecule Generation Approach via 3D Diffusion Model, which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG’s proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.

[LG-236] P1-KAN an effective Kolmogorov Arnold Network for function approximation

Link: https://arxiv.org/abs/2410.03801
Authors: Xavier Warin
Keywords (EN): approximate potentially irregular, potentially irregular functions, high dimension, approximate potentially, potentially irregular
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Notes:

Abstract: A new Kolmogorov-Arnold network (KAN) is proposed to approximate potentially irregular functions in high dimension. We show that it outperforms multilayer perceptrons in terms of accuracy and converges faster. We also compare it with ReLU-KAN, a recently proposed network: it is more time consuming than ReLU-KAN, but more accurate.

[LG-237] Dynamic Evidence Decoupling for Trusted Multi-view Learning

Link: https://arxiv.org/abs/2410.03796
Authors: Ying Liu, Lihong Liu, Cai Xu, Xiangyu Song, Ziyu Guan, Wei Zhao
Keywords (EN): Multi-view learning methods, trusted multi-view learning, Multi-view learning, improving decision accuracy, improving decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract: Multi-view learning methods often focus on improving decision accuracy while neglecting decision uncertainty, limiting their suitability for safety-critical applications. To mitigate this, researchers propose trusted multi-view learning methods that estimate classification probabilities and uncertainty by learning the class distributions for each instance. However, these methods assume that the data from each view can effectively differentiate all categories, ignoring the semantic vagueness phenomenon in real-world multi-view data. Our findings demonstrate that this phenomenon significantly suppresses the learning of view-specific evidence in existing methods. We propose a Consistent and Complementary-aware trusted Multi-view Learning (CCML) method to solve this problem. We first construct view opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Next, we dynamically decouple the consistent and complementary evidence. The consistent evidence is derived from the shared portions across all views, while the complementary evidence is obtained by averaging the differing portions across all views. We ensure that the opinion constructed from the consistent evidence strictly aligns with the ground-truth category. For the opinion constructed from the complementary evidence, we allow for potential vagueness in the evidence. We compare CCML with state-of-the-art baselines on one synthetic and six real-world datasets. The results validate the effectiveness of the dynamic evidence decoupling strategy and show that CCML significantly outperforms baselines on accuracy and reliability. The code is released at this https URL.
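
A small sketch may help picture how evidence turns into an opinion and how it could be split into consistent and complementary parts. The opinion formula is the usual Dirichlet-based one from evidential deep learning (alpha_k = e_k + 1). The decoupling rule is an assumption made for illustration: the "shared portion" is taken as the element-wise minimum across views, which the abstract does not confirm.

```python
from typing import List, Tuple


def opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative class evidence to (belief masses, uncertainty)."""
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength, alpha_k = e_k + 1
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def decouple(view_evidence: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Split per-view evidence into consistent and complementary parts."""
    num_views = len(view_evidence)
    per_class = list(zip(*view_evidence))  # one tuple of view values per class
    # Assumption: "shared portion" = element-wise minimum across views.
    consistent = [min(values) for values in per_class]
    # Complementary: average of what each view holds beyond the shared part.
    complementary = [sum(v - c for v in values) / num_views
                     for values, c in zip(per_class, consistent)]
    return consistent, complementary


# Two views, three classes (toy numbers).
views = [[8.0, 1.0, 0.0], [6.0, 0.0, 3.0]]
consistent, complementary = decouple(views)
belief, uncertainty = opinion(consistent)
print(consistent, complementary, belief, uncertainty)
```

In this toy case the views agree only on class 0, so the consistent opinion commits to that class while the complementary evidence keeps the disagreement spread over all three.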

[LG-238] Deep Learning and Machine Learning: Advancing Big Data Analytics and Management with Design Patterns

Link: https://arxiv.org/abs/2410.03795
Authors: Keyu Chen, Ziqian Bi, Tianyang Wang, Yizhu Wen, Pohsun Feng, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Li, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Ming Liu
Keywords (EN): Advancing Big Data, Big Data Analytics, deep learning applications, Deep Learning, Data Analytics Management
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Notes: 138 pages

Abstract: This book, Design Patterns in Machine Learning and Deep Learning: Advancing Big Data Analytics and Management, presents a comprehensive study of essential design patterns tailored for large-scale machine learning and deep learning applications. The book explores the application of classical software engineering patterns (Creational, Structural, Behavioral, and Concurrency Patterns) to optimize the development, maintenance, and scalability of big data analytics systems. Through practical examples and detailed Python implementations, it bridges the gap between traditional object-oriented design patterns and the unique demands of modern data analytics environments. Key design patterns such as Singleton, Factory, Observer, and Strategy are analyzed for their impact on model management, deployment strategies, and team collaboration, providing invaluable insights into the engineering of efficient, reusable, and flexible systems. This volume is an essential resource for developers, researchers, and engineers aiming to enhance their technical expertise in both machine learning and software design.

[LG-239] Repurposing Foundation Model for Generalizable Medical Time Series Classification

Link: https://arxiv.org/abs/2410.03794
Authors: Nan Huang, Haishuai Wang, Zihuai He, Marinka Zitnik, Xiang Zhang
Keywords (EN): Alzheimer Disease diagnosis, Alzheimer Disease, Disease diagnosis, time series, healthcare applications
Categories: Machine Learning (cs.LG)
Notes:

Abstract: Medical time series (MedTS) classification is critical for a wide range of healthcare applications such as Alzheimer’s Disease diagnosis. However, its real-world deployment is severely challenged by poor generalizability due to inter- and intra-dataset heterogeneity in MedTS, including variations in channel configurations, time series lengths, and diagnostic tasks. Here, we propose FORMED, a foundation classification model that leverages a pre-trained backbone and tackles these challenges through re-purposing. FORMED integrates the general representation learning enabled by the backbone foundation model and the medical domain knowledge gained on a curated cohort of MedTS datasets. FORMED can adapt seamlessly to unseen MedTS datasets, regardless of the number of channels, sample lengths, or medical tasks. Experimental results show that, without any task-specific adaptation, the repurposed FORMED achieves performance that is competitive with, and often superior to, 11 baseline models trained specifically for each dataset. Furthermore, FORMED can effectively adapt to entirely new, unseen datasets, with lightweight parameter updates, consistently outperforming baselines. Our results highlight FORMED as a versatile and scalable model for a wide range of MedTS classification tasks, positioning it as a strong foundation model for future research in MedTS analysis.

[LG-240] Accelerating Deep Learning with Fixed Time Budget

Link: https://arxiv.org/abs/2410.03790
Authors: Muhammad Asif Khan, Ridha Hamila, Hamid Menouar
Keywords (EN): large model sizes, key elements, success of modern, huge amounts, modern deep learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract: The success of modern deep learning is attributed to two key elements: huge amounts of training data and large model sizes. While a vast amount of data allows the model to learn more features, a large model architecture boosts the learning capability of the model. However, both of these factors result in prolonged training time. In some practical applications such as edge-based learning and federated learning, limited time budgets necessitate more efficient training methods. This paper proposes an effective technique for training arbitrary deep learning models within fixed time constraints, utilizing sample importance and dynamic ranking. The proposed method is extensively evaluated on both classification and regression tasks in computer vision. The results consistently show clear gains achieved by the proposed method in improving the learning performance of various state-of-the-art deep learning models in both regression and classification tasks.

[LG-241] Reconstructing Human Mobility Pattern: A Semi-Supervised Approach for Cross-Dataset Transfer Learning

Link: https://arxiv.org/abs/2410.03788
Authors: Xishun Liao, Yifan Liu, Chenchen Kuai, Haoxuan Ma, Yueshuai He, Shangqing Cao, Chris Stanford, Jiaqi Ma
Keywords (EN): Understanding human mobility, Understanding human, urban planning, public health, crucial for urban
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 23 pages, 10 figures, 3 tables

Abstract: Understanding human mobility patterns is crucial for urban planning, transportation management, and public health. This study tackles two primary challenges in the field: the reliance on trajectory data, which often fails to capture the semantic interdependencies of activities, and the inherent incompleteness of real-world trajectory data. We have developed a model that reconstructs and learns human mobility patterns by focusing on semantic activity chains. We introduce a semi-supervised iterative transfer learning algorithm to adapt models to diverse geographical contexts and address data scarcity. Our model is validated using comprehensive datasets from the United States, where it effectively reconstructs activity chains and generates high-quality synthetic mobility data, achieving a low Jensen-Shannon Divergence (JSD) value of 0.001, indicating a close similarity between synthetic and real data. Additionally, sparse GPS data from Egypt is used to evaluate the transfer learning algorithm, demonstrating successful adaptation of US mobility patterns to Egyptian contexts, achieving a 64% increase in similarity, i.e., a JSD reduction from 0.09 to 0.03. This mobility reconstruction model and the associated transfer learning algorithm show significant potential for global human mobility modeling studies, enabling policymakers and researchers to design more effective and culturally tailored transportation solutions.
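
For readers unfamiliar with the metric, Jensen-Shannon Divergence compares two discrete distributions through their average. The sketch below uses base-2 logarithms, which bounds the value between 0 and 1; the abstract does not state which base the authors use, and the activity shares are made-up numbers for illustration only.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical shares of three activity types in real vs. synthetic data.
real_shares = [0.50, 0.30, 0.20]
synthetic_shares = [0.48, 0.32, 0.20]
jsd = js_divergence(real_shares, synthetic_shares)
print(jsd)
```

Values close to zero, like the ones reported in the abstract, mean the synthetic activity distribution is nearly indistinguishable from the real one under this measure.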

[LG-242] Improving Neural Optimal Transport via Displacement Interpolation

Link: https://arxiv.org/abs/2410.03783
Authors: Jaemoo Choi, Yongxin Chen, Jaewoong Choi
Keywords (EN):
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: 20 pages

[LG-243] DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation

Link: https://arxiv.org/abs/2410.03782
Authors: Changdae Oh, Yixuan Li, Kyungwoo Song, Sangdoo Yun, Dongyoon Han
Keywords (EN): Adapting a pre-trained, pre-trained foundation model, pre-trained foundation, ensure robustness, Adapting
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract: Adapting a pre-trained foundation model on downstream tasks should ensure robustness against distribution shifts without the need to retrain the whole model. Although existing weight interpolation methods are simple yet effective, we argue that their static nature limits downstream performance while achieving efficiency. In this work, we propose DaWin, a training-free dynamic weight interpolation method that leverages the entropy of individual models over each unlabeled test sample to assess model expertise, and computes per-sample interpolation coefficients dynamically. Unlike previous works that typically rely on additional training to learn such coefficients, our approach requires no training. Then, we propose a mixture modeling approach that greatly reduces the inference overhead raised by dynamic interpolation. We validate DaWin on large-scale visual recognition benchmarks, spanning 14 tasks across robust fine-tuning (ImageNet and five derived distribution shift benchmarks) and multi-task learning with eight classification tasks. Results demonstrate that DaWin achieves significant performance gains in the considered settings, with minimal computational overhead. We further discuss DaWin’s analytic behavior to explain its empirical success.
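
One way to picture entropy-driven, per-sample weight interpolation is sketched below. It is only a guess at the general shape of the idea: the abstract says entropy is used to assess model expertise, but the specific mapping from entropies to a coefficient (a softmax over negative entropies here) and the flat weight lists are assumptions made for this illustration.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def interpolation_coefficient(probs_a: List[float], probs_b: List[float]) -> float:
    """Weight on model B: larger when B is more confident (lower entropy) than A."""
    score_a = math.exp(-entropy(probs_a))
    score_b = math.exp(-entropy(probs_b))
    return score_b / (score_a + score_b)


def interpolate_weights(weights_a: List[float], weights_b: List[float],
                        lam: float) -> List[float]:
    """Linear interpolation of two parameter vectors of equal length."""
    return [(1 - lam) * wa + lam * wb for wa, wb in zip(weights_a, weights_b)]


# Toy sample: model A is unsure, model B is confident.
probs_a = [0.5, 0.5]
probs_b = [0.99, 0.01]
lam = interpolation_coefficient(probs_a, probs_b)
merged = interpolate_weights([1.0, 2.0], [3.0, 4.0], lam)
print(lam, merged)
```

Computing a fresh coefficient for every test sample is what makes the interpolation dynamic; it also explains why the abstract mentions a separate step to reduce the resulting inference overhead.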

[LG-244] Reward-RAG: Enhancing RAG with Reward Driven Supervision

Link: https://arxiv.org/abs/2410.03780
Authors: Thang Nguyen, Peter Chin, Yu-Wing Tai
Keywords (EN):
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

[LG-245] Discovering Message Passing Hierarchies for Mesh-Based Physics Simulation

Link: https://arxiv.org/abs/2410.03779
Authors: Huayu Deng, Xiangming Zhu, Yunbo Wang, Xiaokang Yang
Keywords (EN): large-scale mesh-based physics, message passing, powerful tool, tool for large-scale, large-scale mesh-based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Notes:

Abstract: Graph neural networks have emerged as a powerful tool for large-scale mesh-based physics simulation. Existing approaches primarily employ hierarchical, multi-scale message passing to capture long-range dependencies within the graph. However, these graph hierarchies are typically fixed and manually designed, and do not adapt to the evolving dynamics present in complex physical systems. In this paper, we introduce a novel neural network named DHMP, which learns Dynamic Hierarchies for Message Passing networks through a differentiable node selection method. The key component is the anisotropic message passing mechanism, which operates at both intra-level and inter-level interactions. Unlike existing methods, it first supports directionally non-uniform aggregation of dynamic features between adjacent nodes within each graph hierarchy. Second, it determines node selection probabilities for the next hierarchy according to different physical contexts, thereby creating more flexible message shortcuts for learning remote node relations. Our experiments demonstrate the effectiveness of DHMP, achieving a 22.7% improvement on average compared to recent fixed-hierarchy message passing networks across five classic physics simulation datasets.

[LG-246] SGW-based Multi-Task Learning in Vision Tasks

Link: https://arxiv.org/abs/2410.03778
Authors: Ruiyuan Zhang, Yuyao Chen, Yuchi Huo, Jiaxiang Liu, Dianbing Xi, Jie Liu, Chao Wu
Keywords (EN): multi-target optimization task, multi-target optimization, MTL, optimization task, previous cross-attention MTL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract: Multi-task learning (MTL) is a multi-target optimization task. Neural networks try to realize each target using a shared interpretative space within MTL. However, as the scale of datasets expands and the complexity of tasks increases, knowledge sharing becomes increasingly challenging. In this paper, we first re-examine previous cross-attention MTL methods from the perspective of noise. We theoretically analyze this issue and identify it as a flaw in the cross-attention mechanism. To address this issue, we propose an information bottleneck knowledge extraction module (KEM). This module aims to reduce inter-task interference by constraining the flow of information, thereby reducing computational complexity. Furthermore, we employ neural collapse to stabilize the knowledge-selection process: before the features are input to KEM, we project them into ETF space. This mapping makes our method more robust. We implemented this method and conducted comparative experiments on multiple datasets. The results demonstrate that our approach significantly outperforms existing methods in multi-task learning.

[LG-247] Parameter Estimation of Long Memory Stochastic Processes with Deep Neural Networks

Link: https://arxiv.org/abs/2410.03776
Authors: Bálint Csanády, Lóránt Nagy, Dániel Boros, Iván Ivkovic, Dávid Kovács, Dalma Tóth-Lakits, László Márkus, András Lukács
Keywords (EN):
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 14 pages, 16 figures, this https URL

[LG-248] Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Link: https://arxiv.org/abs/2410.03768
Authors: Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots
Keywords (EN):
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes:

[LG-249] Reasoning Elicitation in Language Models via Counterfactual Feedback

Link: https://arxiv.org/abs/2410.03767
Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González
Keywords (EN): capabilities remain underdeveloped, remain underdeveloped, increasing effectiveness, language models, reasoning capabilities remain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract: Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

[LG-250] FutureFill: Fast Generation from Convolutional Sequence Models

链接: https://arxiv.org/abs/2410.03766
作者: Naman Agarwal,Xinyi Chen,Evan Dogariu,Vlad Feinberg,Daniel Suo,Peter Bartlett,Elad Hazan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

[LG-251] Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

链接: https://arxiv.org/abs/2410.03765
作者: Jingcun Wang,Yu-Guang Chen,Ing-Chao Lin,Bing Li,Grace Li Zhang
关键词-EN: Large Language Models, Language Models, achieved remarkable breakthroughs, Large Language, remarkable breakthroughs
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs requires a significant amount of memory storage in inference, which prevents their practical deployment in many applications. To reduce memory storage of LLMs, singular value decomposition (SVD) provides a promising solution to approximate weight matrices for compressing LLMs. In this paper, we take a step further to explore parameter sharing across different layers with SVD to achieve more effective compression for LLMs. Specifically, weight matrices in different layers are decomposed and represented as a linear combination of a set of shared basis vectors and unique coefficients. The types of weight matrices and the layer selection for basis sharing are examined when compressing LLMs to maintain the performance. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: this https URL
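
The representation described in the abstract above (one shared basis plus per-layer coefficients) can be illustrated with a small storage calculation. This is a minimal sketch, not the authors' implementation: the function names, the matrix orientation (basis of shape `d_in x rank`, coefficients of shape `rank x d_out`), and the toy sizes are all assumptions made for illustration, and no SVD is computed here.

```python
def dense_params(num_layers, d_in, d_out):
    """Parameters stored when every layer keeps its own d_in x d_out matrix."""
    return num_layers * d_in * d_out


def shared_basis_params(num_layers, d_in, d_out, rank):
    """Parameters stored with one shared basis and per-layer coefficients.

    Assumed layout (hypothetical): basis is d_in x rank and is stored once;
    each layer stores only its own rank x d_out coefficient matrix.
    """
    return d_in * rank + num_layers * rank * d_out


def reconstruct(basis, coeffs):
    """Rebuild one layer's matrix as basis @ coeffs using plain lists.

    Each column of the result is a linear combination of the basis columns,
    weighted by the matching column of the coefficient matrix.
    """
    rows, rank, cols = len(basis), len(coeffs), len(coeffs[0])
    return [
        [sum(basis[i][r] * coeffs[r][j] for r in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]
```

Under these assumptions, four layers of size 8 x 6 need 192 parameters when stored densely, while a rank-3 shared basis needs 8*3 + 4*3*6 = 96. The saving grows with the number of layers that share the basis and shrinks as the rank increases.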

[LG-252] Words that Represent Peace

链接: https://arxiv.org/abs/2410.03764
作者: T. Prasad(1),L. S. Liebovitch(1),M. Wild(1),H. West(1),P. T. Coleman(1) ((1) Columbia University)
关键词-EN: data from LexisNexis, LexisNexis to determine, classifies countries, lower peace, characterized by themes
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 6 pages, 6 figures

Abstract:We used data from LexisNexis to determine the words in news media that best classify countries as higher or lower peace. We found that higher peace news is characterized by themes of finance, daily activities, and health and that lower peace news is characterized by themes of politics, government, and legal issues. This work provides a starting point to measure levels of peace and identify the social processes that underlie those words.

[LG-253] HiReview: Hierarchical Taxonomy-Driven Automatic Literature Review Generation

链接: https://arxiv.org/abs/2410.03761
作者: Yuntong Hu,Zhuofeng Li,Zheng Zhang,Chen Ling,Raasikh Kanjiani,Boxin Zhao,Liang Zhao
关键词-EN: literature review generation, automatic literature review, literature review, taxonomy-driven automatic literature, literature
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:In this work, we present HiReview, a novel framework for hierarchical taxonomy-driven automatic literature review generation. With the exponential growth of academic documents, manual literature reviews have become increasingly labor-intensive and time-consuming, while traditional summarization models struggle to generate comprehensive document reviews effectively. Large language models (LLMs), with their powerful text processing capabilities, offer a potential solution; however, research on incorporating LLMs for automatic document generation remains limited. To address key challenges in large-scale automatic literature review generation (LRG), we propose a two-stage taxonomy-then-generation approach that combines graph-based hierarchical clustering with retrieval-augmented LLMs. First, we retrieve the most relevant sub-community within the citation network, then generate a hierarchical taxonomy tree by clustering papers based on both textual content and citation relationships. In the second stage, an LLM generates coherent and contextually accurate summaries for clusters or topics at each hierarchical level, ensuring comprehensive coverage and logical organization of the literature. Extensive experiments demonstrate that HiReview significantly outperforms state-of-the-art methods, achieving superior hierarchical organization, content relevance, and factual accuracy in automatic literature review generation tasks.

[LG-254] Real-World Data and Calibrated Simulation Suite for Offline Training of Reinforcement Learning Agents to Optimize Energy and Emission in Buildings for Environmental Sustainability

链接: https://arxiv.org/abs/2410.03756
作者: Judah Goldfeder,John Sipple
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

[LG-255] Denoising with a Joint-Embedding Predictive Architecture

链接: https://arxiv.org/abs/2410.03755
作者: Dengsheng Chen,Jie Hu,Xiaoming Wei,Enhua Wu
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages

[LG-256] SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models EMNLP-24

链接: https://arxiv.org/abs/2410.03750
作者: Juan Pablo Muñoz,Jinjie Yuan,Nilesh Jain
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: To be published in EMNLP-24 Findings

[LG-257] Machine Learning Classification of Peaceful Countries: A Comparative Analysis and Dataset Optimization

链接: https://arxiv.org/abs/2410.03749
作者: K. Lian(1),L. S. Liebovitch(1),M. Wild(1),H. West(1),P. T. Coleman(1),F. Chen(2),E. Kimani(2),K. Sieck(2) ((1) Columbia University, (2) Toyota Research Institute)
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages, 5 figures

[LG-258] Khattat: Enhancing Readability and Concept Representation of Semantic Typography

链接: https://arxiv.org/abs/2410.03748
作者: Ahmed Hussein,Alaa Elsetohy,Sama Hadhoud,Tameem Bakr,Yasser Rohaim,Badr AlKhamissi
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

[LG-259] Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging EMNLP2024

链接: https://arxiv.org/abs/2410.03743
作者: Yiming Ju,Ziyi Ni,Xingrun Xing,Zhixiong Zeng,Hanyu Zhao,Siqi Fan,Zheng Zhang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024

[LG-260] Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

链接: https://arxiv.org/abs/2410.03742
作者: Ziyi Ye,Xiangsheng Li,Qiuchi Li,Qingyao Ai,Yujia Zhou,Wei Shen,Dong Yan,Yiqun Liu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

[LG-261] Meta Reinforcement Learning Approach for Adaptive Resource Optimization in O-RAN

链接: https://arxiv.org/abs/2410.03737
作者: Fatemeh Lotfi,Fatemeh Afghah
关键词-EN: RAN Intelligent Controller, Open Radio Access, smart RAN Intelligent, Radio Access Network, Intelligent Controller
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注:

Abstract:As wireless networks grow to support more complex applications, the Open Radio Access Network (O-RAN) architecture, with its smart RAN Intelligent Controller (RIC) modules, becomes a crucial solution for real-time network data collection, analysis, and dynamic management of network resources including radio resource blocks and downlink power allocation. Utilizing artificial intelligence (AI) and machine learning (ML), O-RAN addresses the variable demands of modern networks with unprecedented efficiency and adaptability. Despite progress in using ML-based strategies for network optimization, challenges remain, particularly in the dynamic allocation of resources in unpredictable environments. This paper proposes a novel Meta Deep Reinforcement Learning (Meta-DRL) strategy, inspired by Model-Agnostic Meta-Learning (MAML), to advance resource block and downlink power allocation in O-RAN. Our approach leverages O-RAN’s disaggregated architecture with virtual distributed units (DUs) and meta-DRL strategies, enabling adaptive and localized decision-making that significantly enhances network efficiency. By integrating meta-learning, our system quickly adapts to new network conditions, optimizing resource allocation in real-time. This results in a 19.8% improvement in network management performance over traditional methods, advancing the capabilities of next-generation wireless networks.

[LG-262] CliMB: An AI-enabled Partner for Clinical Predictive Modeling

链接: https://arxiv.org/abs/2410.03736
作者: Evgeny Saveliev,Tim Schubert,Thomas Pouplin,Vasilis Kosmoliaptsis,Mihaela van der Schaar
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: * Evgeny Saveliev and Tim Schubert contributed equally to this work

[LG-263] Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

链接: https://arxiv.org/abs/2410.03735
作者: David Grangier,Simin Fan,Skyler Seto,Pierre Ablin
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

[LG-264] Multi-Scale Convolutional LSTM with Transfer Learning for Anomaly Detection in Cellular Networks

链接: https://arxiv.org/abs/2410.03732
作者: Nooruddin Noonari,Daniel Corujo,Rui L. Aguiar,Francisco J. Ferrao
关键词-EN:
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注:

[LG-265] Progress Report: Towards European LLMs

链接: https://arxiv.org/abs/2410.03730
作者: Mehdi Ali,Michael Fromm,Klaudia Thellmann,Jan Ebert,Alexander Arno Weber,Richard Rutmann,Charvi Jain,Max Lübbering,Daniel Steinigen,Johannes Leveling,Katrin Klug,Jasper Schulze Buschhoff,Lena Jurkschat,Hammam Abdelwahab,Benny Jörg Stein,Karl-Heinz Sylla,Pavel Denisov,Nicolo Brandizzi,Qasid Saleem,Bhowmick Anirban,Chelsea John,Pedro Ortiz Suarez,Malte Ostendorff,Alex Jude,Lalith Manjunath,Samuel Weinbach,Carolin Penke,Shima Asaadi,Fabio Barth,Rafet Sifa,Fabian Küch,René Jäkel,Georg Rehm,Stefan Kesselheim,Joachim Köhler,Nicolas Flores-Herr
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

[LG-266] Certifying Guidance & Control Networks: Uncertainty Propagation to an Event Manifold

链接: https://arxiv.org/abs/2410.03729
作者: Sebastien Origer,Dario Izzo,Giacomo Acciarini,Francesco Biscani,Rita Mastroianni,Max Bannach,Harry Holt
关键词-EN:
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注:

[LG-267] Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis

链接: https://arxiv.org/abs/2410.03728
作者: Barak Gahtan,Robert J. Sahala,Alex M. Bronstein,Reuven Cohen
关键词-EN:
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The dataset and the supplementary material can be provided upon request

[LG-268] FaithEval: Can Your Language Model Stay Faithful to Context Even If “The Moon is Made of Marshmallows”

链接: https://arxiv.org/abs/2410.03727
作者: Yifei Ming,Senthil Purushwalkam,Shrey Pandit,Zixuan Ke,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

[LG-269] Revisiting the Superficial Alignment Hypothesis

链接: https://arxiv.org/abs/2410.03717
作者: Mohit Raghavendra,Vaskar Nath,Sean Hendryx
关键词-EN: Superficial Alignment Hypothesis, Alignment Hypothesis posits, style and format, Superficial Alignment, Alignment Hypothesis
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:The Superficial Alignment Hypothesis posits that almost all of a language model’s abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model’s ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
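
The power-law relationship mentioned in the abstract above has the general form y = a * n^b, which becomes a straight line on log-log axes. The following is a minimal sketch of how such a fit could be done with ordinary least squares. It is not the authors' procedure: the function name, the choice of fitting in log space, and the synthetic data in the usage note are assumptions for illustration only.

```python
import math


def fit_power_law(ns, ys):
    """Fit y = a * n**b by least squares on (log n, log y); returns (a, b).

    Assumes every n and y is strictly positive so that the logarithms exist.
    """
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    # Slope of the regression line in log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_l - b * mean_x)
    return a, b
```

For example, synthetic points generated from y = 2 * n^0.5 at n = 10, 100, 1000, 10000 are recovered with a close to 2 and b close to 0.5. With real benchmark scores, whether the fitted quantity is an accuracy or an error rate determines the sign of b.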

[LG-270] Topological Foundations of Reinforcement Learning

链接: https://arxiv.org/abs/2410.03706
作者: David Krame Kadurha
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
备注: Supervisor: Yae Ulrich Gaba, Mentor: Domini Jocema Leko

[LG-271] Gradient Boosting Decision Trees on Medical Diagnosis over Tabular Data

链接: https://arxiv.org/abs/2410.03705
作者: A. Yarkın Yıldız,Asli Kalayci
关键词-EN:
类目: Machine Learning (cs.LG)
备注:

[LG-272] Combining Open-box Simulation and Importance Sampling for Tuning Large-Scale Recommenders RECSYS’24

链接: https://arxiv.org/abs/2410.03697
作者: Kaushal Paneri,Michael Munje,Kailash Singh Maurya,Adith Swaminathan,Yifan Shi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

[LG-273] Improving Emotion Recognition Accuracy with Personalized Clustering

链接: https://arxiv.org/abs/2410.03696
作者: Laura Gutierrez-Martin(1),Celia Lopez Ongil(1 and 2),Jose M. Lanza-Gutierrez(3),Jose A. Miranda Calero(4) ((1) Department of Electronics, Universidad Carlos III de Madrid, Spain, (2) Gender Studies Institute, Universidad Carlos III de Madrid, Spain, (3) Department of Computer Science, Universidad de Alcala, Spain, (4) Embedded Systems Laboratory, Ecole Polytechnique Federale de Lausanne, Switzerland)
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 11 pages, 2 figures

[LG-274] Improving the Accessibility of Dating Websites for Individuals with Visual Impairments

链接: https://arxiv.org/abs/2410.03695
作者: Gyanendra Shrestha,Soumya Tejaswi Vadlamani
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

[LG-275] Linear Independence of Generalized Neurons and Related Functions

链接: https://arxiv.org/abs/2410.03693
作者: Leyang Zhang
关键词-EN:
类目: Machine Learning (cs.LG)
备注: 51 pages

[LG-276] Floating-floating point: a highly accurate number representation with flexible Counting ranges

链接: https://arxiv.org/abs/2410.03692
作者: Itamar Cohen,Gil Einziger
关键词-EN:
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注:

[LG-277] A quest through interconnected datasets: lessons from highly-cited ICASSP papers

链接: https://arxiv.org/abs/2410.03676
作者: Cynthia C. S. Liem,Doğa Taşcılar,Andrew M. Demetriou
关键词-EN:
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: in Proceedings of the 21st International Conference on Content-based Multimedia Indexing, September 18-20 2024, Reykjavik, Iceland

[LG-278] Trends, Advancements and Challenges in Intelligent Optimization in Satellite Communication

链接: https://arxiv.org/abs/2410.03674
作者: Philippe Krajsic,Viola Suess,Zehong Cao,Ryszard Kowalczyk,Bogdan Franczyk
关键词-EN:
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, 3 tables

[LG-279] AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation

链接: https://arxiv.org/abs/2409.20398
作者: Boyu Han,Qianqian Xu,Zhiyong Yang,Shilong Bao,Peisong Wen,Yangbangyan Jiang,Qingming Huang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

[LG-280] Regression Conformal Prediction under Bias

链接: https://arxiv.org/abs/2410.05263
作者: Matt Y. Cheung,Tucker J. Netherton,Laurence E. Court,Ashok Veeraraghavan,Guha Balakrishnan
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
备注: 17 pages, 6 figures, code available at: this https URL

[LG-281] Are causal effect estimations enough for optimal recommendations under multitreatment scenarios?

链接: https://arxiv.org/abs/2410.05177
作者: Sherly Alfonso-Sánchez,Kristina P. Sendova,Cristián Bravo
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注: 34 pages, 4 figures

[LG-282] Agnostic Smoothed Online Learning

链接: https://arxiv.org/abs/2410.05124
作者: Moïse Blanchard
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

[LG-283] Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation

链接: https://arxiv.org/abs/2410.05106
作者: Marina Sheshukova,Denis Belomestny,Alain Durmus,Eric Moulines,Alexey Naumov,Sergey Samsonov
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

[LG-284] CR-CTC: Consistency regularization on CTC for improved speech recognition

链接: https://arxiv.org/abs/2410.05101
作者: Zengwei Yao,Wei Kang,Xiaoyu Yang,Fangjun Kuang,Liyong Guo,Han Zhu,Zengrui Jin,Zhaoqing Li,Long Lin,Daniel Povey
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
备注:

[LG-285] Assumption-Lean Post-Integrated Inference with Negative Control Outcomes

链接: https://arxiv.org/abs/2410.04996
作者: Jin-Hong Du,Kathryn Roeder,Larry Wasserman
关键词-EN:
类目: Methodology (stat.ME); Machine Learning (cs.LG); Genomics (q-bio.GN); Applications (stat.AP); Machine Learning (stat.ML)
备注: 29 pages for main text, and 18 pages for appendix, 9 figures for main text, 4 figures for appendix

[LG-286] Decomposition Polyhedra of Piecewise Linear Functions

链接: https://arxiv.org/abs/2410.04907
作者: Marie-Charlotte Brandenburg,Moritz Grillo,Christoph Hertrich
关键词-EN:
类目: Combinatorics (math.CO); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
备注:

[LG-287] Molecular topological deep learning for polymer property prediction

链接: https://arxiv.org/abs/2410.04765
作者: Cong Shen,Yipeng Zhang,Fei Han,Kelin Xia
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

[LG-288] Stochastic Runge-Kutta Methods: Provable Acceleration of Diffusion Models

链接: https://arxiv.org/abs/2410.04760
作者: Yuchen Wu,Yuxin Chen,Yuting Wei
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注: 45 pages, 3 figures

[LG-289] SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

链接: https://arxiv.org/abs/2410.04690
作者: Minchan Kim,Myeonghun Jeong,Joun Yeop Lee,Nam Soo Kim
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

[LG-290] Combining Structural and Unstructured Data: A Topic-based Finite Mixture Model for Insurance Claim Prediction

链接: https://arxiv.org/abs/2410.04684
作者: Yanxi Hou,Xiaolan Xia,Guangyuan Gao
关键词-EN:
类目: Applications (stat.AP); Machine Learning (cs.LG)
备注:

[LG-291] Generative Flows on Synthetic Pathway for Drug Design

链接: https://arxiv.org/abs/2410.04542
作者: Seonghwan Seo,Minsu Kim,Tony Shen,Martin Ester,Jinkyoo Park,Sungsoo Ahn,Woo Youn Kim
关键词-EN:
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
备注: 25 pages, 10 figures

[LG-292] YanTian: An Application Platform for AI Global Weather Forecasting Models

链接: https://arxiv.org/abs/2410.04539
作者: Wencong Cheng,Jiangjiang Xia,Chang Qu,Zhigang Wang,Xinyi Zeng,Fang Huang,Tianye Li
关键词-EN:
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
备注:

[LG-293] Grokking at the Edge of Linear Separability

链接: https://arxiv.org/abs/2410.04489
作者: Alon Beck,Noam Levi,Yohai Bar-Sinai
关键词-EN:
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Mathematical Physics (math-ph)
备注: 24 pages, 13 figures

[LG-294] SITCOM: Step-wise Triple-Consistent Diffusion Sampling for Inverse Problems

链接: https://arxiv.org/abs/2410.04479
作者: Ismail Alkhouri,Shijun Liang,Cheng-Han Huang,Jimmy Dai,Qing Qu,Saiprasad Ravishankar,Rongrong Wang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

[LG-295] U-net based prediction of cerebrospinal fluid distribution and ventricular reflux grading

链接: https://arxiv.org/abs/2410.04460
作者: Melanie Rieff,Fabian Holzberger,Oksana Lapina,Geir Ringstad,Lars Magnus Valnes,Bogna Warsza,Kent-Andre Mardal,Per Kristian Eide,Barbara Wohlmuth
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 7 figures

[LG-296] Spectral Densities, Structured Noise and Ensemble Averaging within Open Quantum Dynamics

链接: https://arxiv.org/abs/2410.04294
作者: Yannick Marcel Holtkamp,Emiliano Godinez-Ramirez,Ulrich Kleinekathöfer
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
备注: 48 pages, 13 figures. This article may be downloaded for personal use only. Any other use requires prior permission of the author and AIP Publishing. This article appeared in J. Chem. Phys. 161, 134101 (2024) and may be found at this https URL

[LG-297] MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times

链接: https://arxiv.org/abs/2410.04285
作者: Artavazd Maranjyan,Omar Shaikh Omar,Peter Richtárik
关键词-EN:
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

[LG-298] Visualising Feature Learning in Deep Neural Networks by Diagonalizing the Forward Feature Map

链接: https://arxiv.org/abs/2410.04264
作者: Yoonsoo Nam,Chris Mingard,Seok Hyeong Lee,Soufiane Hayou,Ard Louis
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

[LG-299] Quantum Kolmogorov-Arnold networks by combining quantum signal processing circuits

链接: https://arxiv.org/abs/2410.04218
作者: Ammar Daskin
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
备注: short version: 5 pages

[LG-300] WAVE-UNET: Wavelength based Image Reconstruction method using attention UNET for OCT images

链接: https://arxiv.org/abs/2410.04123
作者: Maryam Viqar,Erdem Sahin,Violeta Madjarova,Elena Stoykova,Keehoon Hong
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Optics (physics.optics)
备注:

[LG-301] pFedGame – Decentralized Federated Learning using Game Theory in Dynamic Topology

链接: https://arxiv.org/abs/2410.04058
作者: Monik Raj Behera,Suchetana Chakraborty
关键词-EN:
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

[LG-302] Is Score Matching Suitable for Estimating Point Processes?

链接: https://arxiv.org/abs/2410.04037
作者: Haoqun Cao,Zizhuo Meng,Tianjun Ke,Feng Zhou
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

[LG-303] Implicit Bias of Mirror Descent for Shallow Neural Networks in Univariate Regression

链接: https://arxiv.org/abs/2410.03988
作者: Shuang Liang,Guido Montúfar
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

[LG-304] Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport

链接: https://arxiv.org/abs/2410.03974
作者: Milena Gazdieva,Jaemoo Choi,Alexander Kolesov,Jaewoong Choi,Petr Mokrov,Alexander Korotin
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 4 figures

[LG-305] End-to-End Reaction Field Energy Modeling via Deep Learning based Voxel-to-voxel Transform

链接: https://arxiv.org/abs/2410.03927
作者: Yongxian Wu,Qiang Zhu,Ray Luo
关键词-EN:
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

[LG-306] Online Control-Informed Learning

链接: https://arxiv.org/abs/2410.03924
作者: Zihao Liang,Tianyu Zhou,Zehui Lu,Shaoshuai Mou
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

[LG-307] Leveraging Fundamental Analysis for Stock Trend Prediction for Profit

链接: https://arxiv.org/abs/2410.03913
作者: John Phan,Hung-Fu Chang
关键词-EN:
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages

[LG-308] Harnessing Generative AI for Economic Insights

链接: https://arxiv.org/abs/2410.03897
作者: Manish Jha,Jialin Qian,Michael Weber,Baozhong Yang
关键词-EN:
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); General Economics (econ.GN)
备注: 26 Pages, 3 Figures, 11 Tables

[LG-309] TrustEMG-Net: Using Representation-Masking Transformer with U-Net for Surface Electromyography Enhancement

链接: https://arxiv.org/abs/2410.03843
作者: Kuan-Chen Wang,Kai-Chun Liu,Ping-Cheng Yeh,Sheng-Yu Peng,Yu Tsao
关键词-EN:
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注: 18 pages, 7 figures, to be published in IEEE Journal of Biomedical and Health Informatics

[LG-310] Mesh-Informed Reduced Order Models for Aneurysm Rupture Risk Prediction

链接: https://arxiv.org/abs/2410.03802
作者: Giuseppe Alessio D’Inverno,Saeid Moradizadeh,Sajad Salavatidezfouli,Pasquale Claudio Africa,Gianluigi Rozza
关键词-EN:
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

[LG-311] On the SAGA algorithm with decreasing step

链接: https://arxiv.org/abs/2410.03760
作者: Luis Fredes(IMB),Bernard Bercu(IMB),Eméric Gbaguidi(IMB)
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
备注:

[LG-312] NeuralQP: A General Hypergraph-based Optimization Framework for Large-scale QCQPs

链接: https://arxiv.org/abs/2410.03720
作者: Zhixiao Xiong,Fangyu Zong,Huigen Ye,Hua Xu
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
备注:

[LG-313] Mamba Meets Financial Markets: A Graph-Mamba Approach for Stock Price Prediction

链接: https://arxiv.org/abs/2410.03707
作者: Ali Mehrabian,Ehsan Hoseinzade,Mahdi Mazloum,Xiaohong Chen
关键词-EN:
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
备注:

信息检索

[IR-0] Causal Micro-Narratives EMNLP2024

链接: https://arxiv.org/abs/2410.05252
作者: Mourad Heddaya,Qingcheng Zeng,Chenhao Tan,Rob Voigt,Alexander Zentefis
关键词-EN: classify causal micro-narratives, micro-narratives from text, classify causal, Abstract, causal micro-narratives
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024 Workshop on Narrative Understanding

Abstract:We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model, a fine-tuned Llama 3.1 8B, achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.

[IR-1] Efficient Inference for Large Language Model-based Generative Recommendation

链接: https://arxiv.org/abs/2410.05165
作者: Xinyu Lin,Chaoqun Yang,Wenjie Wang,Yongqi Li,Cunxiao Du,Fuli Feng,See-Kiong Ng,Tat-Seng Chua
关键词-EN: Large Language Model, Large Language, achieved notable success, excessive inference latency, inference latency caused
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

Abstract:Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The codes and datasets will be released in the near future.
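
The strict top-K condition stated in the abstract above (every top-K sequence of the target model must already be among the drafted sequences at a decoding step) amounts to a subset check. The sketch below illustrates only that condition. The function name and the representation of a sequence as a list of token ids are assumptions, and it does not model the relaxed sampling verification or any other part of AtSpeed.

```python
def strict_topk_accept(drafted, target_topk):
    """Return True when every target top-K sequence was drafted.

    drafted:      token-id sequences proposed by the draft model at this step
    target_topk:  the K sequences the target model ranks highest at this step
    Order within either collection does not matter for this check.
    """
    drafted_set = {tuple(seq) for seq in drafted}
    return all(tuple(seq) in drafted_set for seq in target_topk)
```

A draft step passes only if no target sequence is missing. A single miss rejects the step, which is why the abstract describes this verification as more stringent than in ordinary single-sequence speculative decoding.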

[IR-2] On the Biased Assessment of Expert Finding Systems RECSYS2024

Link: https://arxiv.org/abs/2410.05018
Authors: Jens-Joris Decorte,Jeroen Van Hautte,Chris Develder,Thomas Demeester
Keywords: internal knowledge spread, large organisations, teams and departments, topic is crucial, crucial in leveraging
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to the 4th Workshop on Recommender Systems for Human Resources (RecSys in HR 2024) as part of RecSys 2024

Abstract:In large organisations, identifying experts on a given topic is crucial in leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees’ expertise based on the vast amount of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study provides an analysis of how these recommendations can impact the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints to the annotation process to prevent these biased evaluations, and show that this still allows annotation suggestions of high utility. These findings should inform benchmark creation or selection for expert finding, to guarantee meaningful comparison of methods.

[IR-3] Leverage Knowledge Graph and Large Language Model for Law Article Recommendation: A Case Study of Chinese Criminal Law

Link: https://arxiv.org/abs/2410.04949
Authors: Yongming Chen,Miner Chen,Ye Zhu,Juan Pei,Siyu Chen,Yu Zhou,Yi Wang,Yifan Zhou,Hao Li,Songan Zhang
Keywords: Article Knowledge Graph, Large Language Model, Knowledge Graph, social stability, law article
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Court efficiency is vital for social stability. However, in most countries around the world, the grassroots courts face case backlogs, with decisions relying heavily on judicial personnel’s cognitive labor, lacking intelligent tools to improve efficiency. To address this issue, we propose an efficient law article recommendation approach utilizing a Knowledge Graph (KG) and a Large Language Model (LLM). Firstly, we propose a Case-Enhanced Law Article Knowledge Graph (CLAKG) as a database to store current law statutes, historical case information, and correspondence between law articles and historical cases. Additionally, we introduce an automated CLAKG construction method based on LLM. On this basis, we propose a closed-loop law article recommendation method. Finally, through a series of experiments using judgment documents from the website “China Judgements Online”, we have improved the accuracy of law article recommendation in cases from 0.549 to 0.694, demonstrating that our proposed method significantly outperforms baseline approaches.

[IR-4] FELLAS: Enhancing Federated Sequential Recommendation with LLM as External Services

Link: https://arxiv.org/abs/2410.04927
Authors: Wei Yuan,Chaoqun Yang,Guanhua Ye,Tong Chen,Quoc Viet Hung Nguyen
Keywords: gained growing attention, growing attention due, Federated sequential recommendation, Federated sequential, gained growing
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Federated sequential recommendation (FedSeqRec) has gained growing attention due to its ability to protect user privacy. Unfortunately, the performance of FedSeqRec is still unsatisfactory because the models used in FedSeqRec have to be lightweight to accommodate communication bandwidth and clients’ on-device computational resource constraints. Recently, large language models (LLMs) have exhibited strong transferable and generalized language understanding abilities and therefore, in the NLP area, many downstream tasks now utilize LLMs as a service to achieve superior performance without constructing complex models. Inspired by this successful practice, we propose a generic FedSeqRec framework, FELLAS, which aims to enhance FedSeqRec by utilizing LLMs as an external service. Specifically, FELLAS employs an LLM server to provide both item-level and sequence-level representation assistance. The item-level representation service is queried by the central server to enrich the original ID-based item embedding with textual information, while the sequence-level representation service is accessed by each client. However, invoking the sequence-level representation service requires clients to send sequences to the external LLM server. To safeguard privacy, we implement dx-privacy satisfied sequence perturbation, which protects clients’ sensitive data with guarantees. Additionally, a contrastive learning-based method is designed to transfer knowledge from the noisy sequence representation to clients’ sequential recommendation models. Furthermore, to empirically validate the privacy protection capability of FELLAS, we propose two interacted item inference attacks. Extensive experiments conducted on three datasets with two widely used sequential recommendation models demonstrate the effectiveness and privacy-preserving capability of FELLAS.

[IR-5] Correcting for Popularity Bias in Recommender Systems via Item Loss Equalization

Link: https://arxiv.org/abs/2410.04830
Authors: Juno Prent,Masoud Mansoury
Keywords: Recommender Systems, high interaction rates, popular items overlooked, popular items dominate, popular items
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Recommender Systems (RS) often suffer from popularity bias, where a small set of popular items dominate the recommendation results due to their high interaction rates, leaving many less popular items overlooked. This phenomenon disproportionately benefits users with mainstream tastes while neglecting those with niche interests, leading to unfairness among users and exacerbating disparities in recommendation quality across different user groups. In this paper, we propose an in-processing approach to address this issue by intervening in the training process of recommendation models. Drawing inspiration from fair empirical risk minimization in machine learning, we augment the objective function of the recommendation model with an additional term aimed at minimizing the disparity in loss values across different item groups during the training process. Our approach is evaluated through extensive experiments on two real-world datasets and compared against state-of-the-art baselines. The results demonstrate the superior efficacy of our method in mitigating the unfairness of popularity bias while incurring only negligible loss in recommendation accuracy.
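
The idea of adding a loss-disparity term to the training objective can be sketched as follows. The abstract does not give the exact form of the term, so this sketch assumes one simple choice: the gap between the largest and smallest mean loss across item groups, weighted by a hypothetical coefficient `lam`. The function names and group labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Sequence


def group_mean_losses(losses: Sequence[float],
                      groups: Sequence[str]) -> Dict[str, float]:
    """Average the per-interaction losses within each item group."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for loss, group in zip(losses, groups):
        sums[group] += loss
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}


def equalized_objective(losses: Sequence[float],
                        groups: Sequence[str],
                        lam: float = 1.0) -> float:
    """Base loss plus a penalty on the loss gap between item groups.

    Assumption: disparity is measured as max minus min group mean loss.
    """
    base = sum(losses) / len(losses)
    means = group_mean_losses(losses, groups)
    disparity = max(means.values()) - min(means.values())
    return base + lam * disparity


# Popular ("head") items are fit well, niche ("tail") items poorly.
losses = [0.5, 0.5, 2.0, 3.0]
groups = ["head", "head", "tail", "tail"]
print(equalized_objective(losses, groups, lam=0.5))  # 2.5
```

With `lam = 0` the objective reduces to the ordinary mean loss. Larger values push training toward similar loss levels for popular and less popular items.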

[IR-6] Item Cluster-aware Prompt Learning for Session-based Recommendation

Link: https://arxiv.org/abs/2410.04756
Authors: Wooseong Yang,Chen Wang,Zihe Song,Weizhi Zhang,Philip S. Yu
Keywords: dynamic user preferences, capture dynamic user, analyzing item sequences, dynamic user, user preferences
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Session-based recommendation (SBR) aims to capture dynamic user preferences by analyzing item sequences within individual sessions. However, most existing approaches focus mainly on intra-session item relationships, neglecting the connections between items across different sessions (inter-session relationships), which limits their ability to fully capture complex item interactions. While some methods incorporate inter-session information, they often suffer from high computational costs, leading to longer training times and reduced efficiency. To address these challenges, we propose the CLIP-SBR (Cluster-aware Item Prompt learning for Session-Based Recommendation) framework. CLIP-SBR is composed of two modules: 1) an item relationship mining module that builds a global graph to effectively model both intra- and inter-session relationships, and 2) an item cluster-aware prompt learning module that uses soft prompts to integrate these relationships into SBR models efficiently. We evaluate CLIP-SBR across eight SBR models and three benchmark datasets, consistently demonstrating improved recommendation performance and establishing CLIP-SBR as a robust solution for session-based recommendation tasks.

[IR-7] TableRAG: Million-Token Table Understanding with Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.04739
Authors: Si-An Chen,Lesly Miculicich,Julian Martin Eisenschlos,Zifeng Wang,Zilong Wang,Yanfei Chen,Yasuhisa Fujii,Hsuan-Tien Lin,Chen-Yu Lee,Tomas Pfister
Keywords: Recent advancements, language models, primarily through program-aided, advancements in language, notably enhanced
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024

Abstract:Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG’s effectiveness at scale. Our results demonstrate that TableRAG’s retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.

[IR-8] Decoding MIE: A Novel Dataset Approach Using Topic Extraction and Affiliation Parsing

Link: https://arxiv.org/abs/2410.04602
Authors: Ehsan Bitaraf,Maryam Jafarpour
Keywords: informatics literature presents, medical informatics literature, Medical Informatics Europe, medical informatics, literature presents significant
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:The rapid expansion of medical informatics literature presents significant challenges in synthesizing and analyzing research trends. This study introduces a novel dataset derived from the Medical Informatics Europe (MIE) Conference proceedings, addressing the need for sophisticated analytical tools in the field. Utilizing the Triple-A software, we extracted and processed metadata and abstracts from 4,606 articles published in the “Studies in Health Technology and Informatics” journal series, focusing on MIE conferences from 1996 onwards. Our methodology incorporated advanced techniques such as affiliation parsing using the TextRank algorithm. The resulting dataset, available in JSON format, offers a comprehensive view of bibliometric details, extracted topics, and standardized affiliation information. Analysis of this data revealed interesting patterns in Digital Object Identifier usage, citation trends, and authorship attribution across the years. Notably, we observed inconsistencies in author data and a brief period of linguistic diversity in publications. This dataset represents a significant contribution to the medical informatics community, enabling longitudinal studies of research trends, collaboration network analyses, and in-depth bibliometric investigations. By providing this enriched, structured resource spanning nearly three decades of conference proceedings, we aim to facilitate novel insights and advancements in the rapidly evolving field of medical informatics.

[IR-9] Ranking Policy Learning via Marketplace Expected Value Estimation From Observational Data

Link: https://arxiv.org/abs/2410.04568
Authors: Ehsan Ebrahimzadeh,Nikhil Monga,Hang Gao,Alex Cozzi,Abraham Bagherjeiran
Keywords: decision making framework, reward optimization problem, expected reward optimization, expected reward, ranking policy
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 9 pages

Abstract:We develop a decision-making framework to cast the problem of learning a ranking policy for search or recommendation engines in a two-sided e-commerce marketplace as an expected reward optimization problem using observational data. As a value allocation mechanism, the ranking policy allocates retrieved items to the designated slots so as to maximize the user utility from the slotted items, at any given stage of the shopping journey. The objective of this allocation can in turn be defined with respect to the underlying probabilistic user browsing model as the expected number of interaction events on presented items matching the user intent, given the ranking context. Through recognizing the effect of ranking as an intervention action to inform users’ interactions with slotted items and the corresponding economic value of the interaction events for the marketplace, we formulate the expected reward of the marketplace as the collective value from all presented ranking actions. The key element in this formulation is a notion of context value distribution, which signifies not only the attribution of value to ranking interventions within a session but also the distribution of marketplace reward across user sessions. We build empirical estimates for the expected reward of the marketplace from observational data that account for the heterogeneity of economic value across session contexts as well as the distribution shifts in learning from observational user activity data. The ranking policy can then be trained by optimizing the empirical expected reward estimates via standard Bayesian inference techniques. We report empirical results for a product search ranking task in a major e-commerce platform demonstrating the fundamental trade-offs governed by ranking policies trained on empirical reward estimates with respect to extreme choices of the context value distribution.

[IR-10] Modeling Social Media Recommendation Impacts Using Academic Networks: A Graph Neural Network Approach

Link: https://arxiv.org/abs/2410.04552
Authors: Sabrina Guidotti,Gregor Donabauer,Simone Somazzi,Udo Kruschwitz,Davide Taibi,Dimitri Ognibene
Keywords: highlighted potential negative, potential negative impacts, shape user behavior, society and individuals, largely driven
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The widespread use of social media has highlighted potential negative impacts on society and individuals, largely driven by recommendation algorithms that shape user behavior and social dynamics. Understanding these algorithms is essential but challenging due to the complex, distributed nature of social media networks as well as limited access to real-world data. This study proposes to use academic social networks as a proxy for investigating recommendation systems in social media. By employing Graph Neural Networks (GNNs), we develop a model that separates the prediction of academic infosphere from behavior prediction, allowing us to simulate recommender-generated infospheres and assess the model’s performance in predicting future co-authorships. Our approach aims to improve our understanding of recommendation systems’ roles and social networks modeling. To support the reproducibility of our work we publicly make available our implementations: this https URL

[IR-11] Social Choice for Heterogeneous Fairness in Recommendation

Link: https://arxiv.org/abs/2410.04551
Authors: Amanda Aird,Elena Štefancová,Cassidy All,Amy Voida,Martin Homola,Nicholas Mattei,Robin Burke
Keywords: recommender systems requires, systems requires close, requires close attention, Algorithmic fairness, competing interests
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Algorithmic fairness in recommender systems requires close attention to the needs of a diverse set of stakeholders that may have competing interests. Previous work in this area has often been limited by fixed, single-objective definitions of fairness, built into algorithms or optimization criteria that are applied to a single fairness dimension or, at most, applied identically across dimensions. These narrow conceptualizations limit the ability to adapt fairness-aware solutions to the wide range of stakeholder needs and fairness definitions that arise in practice. Our work approaches recommendation fairness from the standpoint of computational social choice, using a multi-agent framework. In this paper, we explore the properties of different social choice mechanisms and demonstrate the successful integration of multiple, heterogeneous fairness definitions across multiple data sets.

[IR-12] Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia EMNLP2024

Link: https://arxiv.org/abs/2410.04254
Authors: Tomás Feith,Akhil Arora,Martin Gerlach,Debjit Paul,Robert West
Keywords: turning isolated pieces, fundamental part, entity insertion, turning isolated, isolated pieces
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: EMNLP 2024; 24 pages; 62 figures

Abstract:Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.

[IR-13] Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models

Link: https://arxiv.org/abs/2410.04231
Authors: Teruaki Hayashi,Hiroki Sakaji,Jiayi Dai,Randy Goebel
Keywords: Developing the capacity, assist data users, capacity to effectively, effectively search, search for requisite
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Developing the capacity to effectively search for requisite datasets is an urgent requirement to assist data users in identifying relevant datasets considering the very limited available metadata. For this challenge, the utilization of third-party data is emerging as a valuable source for improvement. Our research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The system integrates large language models (LLMs) with external vector databases to identify semantic relationships among diverse types of datasets. The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources and for improving data exploration. Our study includes experimental results on four critical tasks: 1) recommending similar datasets, 2) suggesting combinable datasets, 3) estimating tags, and 4) predicting variables. Our results demonstrate that RAG can enhance the selection of relevant datasets, particularly from different categories, when compared to conventional metadata approaches. However, performance varied across tasks and models, which confirms the significance of selecting appropriate techniques based on specific use cases. The findings suggest that this approach holds promise for addressing challenges in data exploration and discovery, although further refinement is necessary for estimation tasks.

[IR-14] LLMTemporalComparator: A Tool for Analysing Differences in Temporal Adaptations of Large Language Models

Link: https://arxiv.org/abs/2410.04195
Authors: Reinhard Friedrich Fritsch,Adam Jatowt
Keywords: analyzing temporal discrepancies, large language models, trained on data, time periods, study addresses
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:This study addresses the challenges of analyzing temporal discrepancies in large language models (LLMs) trained on data from different time periods. To facilitate the automatic exploration of these differences, we propose a novel system that compares in a systematic way the outputs of two LLM versions based on user-defined queries. The system first generates a hierarchical topic structure rooted in a user-specified keyword, allowing for an organized comparison of topical categories. Subsequently, it evaluates the generated text by both LLMs to identify differences in vocabulary, information presentation, and underlying themes. This fully automated approach not only streamlines the identification of shifts in public opinion and cultural norms but also enhances our understanding of the adaptability and robustness of machine learning applications in response to temporal changes. By fostering research in continual model adaptation and comparative summarization, this work contributes to the development of more transparent machine learning models capable of capturing the nuances of evolving societal contexts.

[IR-15] C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits EMNLP2024

Link: https://arxiv.org/abs/2410.03925
Authors: Maaz Bin Musa,Steven M. Winston,Garrison Allen,Jacob Schiller,Kevin Moore,Sean Quick,Johnathan Melvin,Padmini Srinivasan,Mihailis E. Diamantis,Rishab Nithyanand
Keywords: scalable regulatory compliance, extract organizations data, organizations data habits, techniques to analyze, analyze and extract
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, EMNLP 2024

Abstract:The development of tools and techniques to analyze and extract organizations’ data habits from privacy policies is critical for scalable regulatory compliance audits. Unfortunately, these tools are becoming increasingly limited in their ability to identify compliance issues and fixes. After all, most were developed using regulation-agnostic datasets of annotated privacy policies obtained from a time before the introduction of landmark privacy regulations such as the EU’s GDPR and California’s CCPA. In this paper, we describe the first open regulation-aware dataset of expert-annotated privacy policies, C3PA (CCPA Privacy Policy Provision Annotations), aimed to address this challenge. C3PA contains over 48K expert-labeled privacy policy text segments associated with responses to CCPA-specific disclosure mandates from 411 unique organizations. We demonstrate that the C3PA dataset is uniquely suited for aiding automated audits of compliance with CCPA-related disclosure mandates.

[IR-16] Explaining the (Not So) Obvious: Simple and Fast Explanation of STAN, a Next Point of Interest Recommendation System

Link: https://arxiv.org/abs/2410.03841
Authors: Fajrian Yunus,Talel Abdessalem
Keywords: explain machine learning, machine learning systems, machine learning, machine learning methods, lot of effort
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:A lot of effort in recent years has been expended to explain machine learning systems. However, some machine learning methods are inherently explainable, and thus are not completely black box. This enables the developers to make sense of the output without developing a complex and expensive explainability technique. Besides that, explainability should be tailored to suit the context of the problem. In a recommendation system which relies on collaborative filtering, the recommendation is based on the behaviors of similar users, therefore the explanation should tell which other users are similar to the current user. Similarly, if the recommendation system is based on sequence prediction, the explanation should also tell which input timesteps are the most influential. We demonstrate this philosophy/paradigm in STAN (Spatio-Temporal Attention Network for Next Location Recommendation), a next Point of Interest recommendation system based on collaborative filtering and sequence prediction. We also show that the explanation helps to “debug” the output.

[IR-17] Enhancing Retrieval in QA Systems with Derived Feature Association

Link: https://arxiv.org/abs/2410.03754
Authors: Keyush Shah,Abhishek Goyal,Isaac Wasserman
Keywords:
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

[IR-18] Combining Open-box Simulation and Importance Sampling for Tuning Large-Scale Recommenders RECSYS’24

Link: https://arxiv.org/abs/2410.03697
Authors: Kaushal Paneri,Michael Munje,Kailash Singh Maurya,Adith Swaminathan,Yifan Shi
Keywords:
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

[IR-19] Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering

Link: https://arxiv.org/abs/2210.02627
Authors: Shamane Siriwardhana,Rivindu Weerasekera,Elliott Wen,Tharindu Kaluarachchi,Rajib Rana,Suranga Nanayakkara
Keywords:
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This paper is awaiting publication at Transactions of the Association for Computational Linguistics. This is a pre-MIT Press publication version. For associated huggingface transformers code, see this https URL

[IR-20] Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering

Link: https://arxiv.org/abs/2106.11517
Authors: Shamane Siriwardhana,Rivindu Weerasekera,Elliott Wen,Suranga Nanayakkara
Keywords:
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: for associated code, see this https URL
