This post presents the latest paper list retrieved from Arxiv.org on 2024-08-19. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email, please leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically every morning at around 10:30.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically every day at around 10:30.

Table of Contents

Overview (2024-08-19)

A total of 299 papers were updated today, including:

  • Natural Language Processing: 45 (Computation and Language, cs.CL)
  • Artificial Intelligence: 71 (cs.AI)
  • Computer Vision and Pattern Recognition: 75 (cs.CV)
  • Machine Learning: 92 (cs.LG)

Natural Language Processing

[NLP-0] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Link: https://arxiv.org/abs/2408.08872
Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Keywords: Large Multimodal Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

[NLP-1] PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars

Link: https://arxiv.org/abs/2408.08869
Authors: Sumanth Prabhu
Keywords: Large Language Models, diverse reasoning paths
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Self-ensembling techniques with diverse reasoning paths such as Self-Consistency have demonstrated remarkable gains in accuracy for Large Language Models (LLMs). However, such techniques depend on the availability of an accurate answer extraction process to aggregate across multiple outputs. Moreover, they incur a higher inference cost than Greedy Decoding due to the generation of a relatively larger number of output tokens. Research has shown that the free-form text outputs from Self-Consistency can be aggregated reliably using LLMs to produce the final output. Additionally, recent advancements in LLM inference have demonstrated that the usage of diverse exemplars in prompts has the ability to induce diversity in the LLM outputs. Such proven techniques can be easily extended to self-ensembling based approaches to achieve enhanced results in text generation. In this paper, we introduce PEDAL (Prompts based on Exemplar Diversity Aggregated using LLMs), a hybrid self-ensembling approach that combines the strengths of diverse exemplar based prompts and LLM based aggregation to achieve improvement in overall performance. On the publicly available SVAMP and ARC datasets, our experiments reveal that PEDAL can achieve better accuracy than Greedy Decoding based strategies with lower inference cost compared to Self-Consistency based approaches.
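
The pipeline the abstract describes (greedy generations from several prompts built with different exemplar subsets, followed by LLM-based aggregation) can be sketched roughly as follows. `toy_generate` and `toy_aggregate` are hypothetical stand-ins for real LLM calls, and the majority vote merely imitates the paper's LLM aggregator:

```python
import random
from collections import Counter

def pedal(question, exemplar_pool, generate, aggregate, k=3, seed=0):
    """PEDAL-style self-ensembling (sketch): build k prompts with
    different exemplar subsets, decode each greedily, then aggregate."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(k):
        exemplars = rng.sample(exemplar_pool, 2)   # diverse exemplar subset
        prompt = "\n".join(exemplars) + "\nQ: " + question + "\nA:"
        candidates.append(generate(prompt))        # greedy decoding per prompt
    return aggregate(question, candidates)         # the paper uses an LLM here

# Toy stand-ins for the two LLM calls (hypothetical, deterministic).
def toy_generate(prompt):
    return "42" if "meaning" in prompt else "unknown"

def toy_aggregate(question, candidates):
    return Counter(candidates).most_common(1)[0][0]  # majority as a placeholder

pool = ["Q: 1+1?\nA: 2", "Q: 2+2?\nA: 4", "Q: 3+3?\nA: 6"]
answer = pedal("What is the meaning of life?", pool, toy_generate, toy_aggregate)
```

In a real system, both `generate` and `aggregate` would call an actual model; only the ensembling skeleton is shown here.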

[NLP-2] PsychoLex: Unveiling the Psychological Mind of Large Language Models

Link: https://arxiv.org/abs/2408.08848
Authors: Mohammad Amin Abbasi, Farnaz Sadat Mirnezami, Hassan Naderi
Keywords: Large Language Models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper explores the intersection of psychology and artificial intelligence through the development and evaluation of specialized Large Language Models (LLMs). We introduce PsychoLex, a suite of resources designed to enhance LLMs’ proficiency in psychological tasks in both Persian and English. Key contributions include the PsychoLexQA dataset for instructional content and the PsychoLexEval dataset for rigorous evaluation of LLMs in complex psychological scenarios. Additionally, we present the PsychoLexLLaMA model, optimized specifically for psychological applications, demonstrating superior performance compared to general-purpose models. The findings underscore the potential of tailored LLMs for advancing psychological research and applications, while also highlighting areas for further refinement. This research offers a foundational step towards integrating LLMs into specialized psychological domains, with implications for future advancements in AI-driven psychological practice.

[NLP-3] FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats

Link: https://arxiv.org/abs/2408.08841
Authors: Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Baoxin Wang, Dayong Wu, Qingfu Zhu, Wanxiang Che
Keywords: table reasoning, Large Language Models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The table reasoning task aims to answer the question according to the given table. Currently, using Large Language Models (LLMs) is the predominant method for table reasoning. Most existing methods employ a fixed tabular format to represent the table, which could limit the performance. Given that each instance requires different capabilities and models possess varying abilities, we assert that different instances and models suit different tabular formats. We prove the aforementioned claim through quantitative analysis of experimental results, where different instances and models achieve different performances using various tabular formats. Building on this discussion, we propose FLEXTAF-Single and FLEXTAF-Vote to enhance table reasoning performance by employing flexible tabular formats. Specifically, (i) FLEXTAF-Single trains a classifier to predict the most suitable tabular format based on the instance and the LLM. (ii) FLEXTAF-Vote integrates the results across different formats. Our experiments on WikiTableQuestions and TabFact reveal significant improvements, with average gains of 2.3% and 4.8% compared to the best performance achieved using a fixed tabular format with greedy decoding and self-consistency decoding, thereby validating the effectiveness of our methods.
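
The core idea (render the same table in different formats and pick one per instance) can be sketched as below; the format classifier is stubbed with a hypothetical rule, whereas FLEXTAF-Single trains it on instance/LLM features:

```python
def to_markdown(header, rows):
    """Render a table as markdown text."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(lines)

def to_records(header, rows):
    """Render the same table as a JSON-like list of records."""
    return [dict(zip(header, r)) for r in rows]

FORMATS = {"markdown": to_markdown, "records": to_records}

def flextaf_single(header, rows, instance, predict_format):
    """FLEXTAF-Single sketch: a classifier picks the format judged most
    suitable for this instance (and, in the paper, for the target LLM)."""
    return FORMATS[predict_format(instance)](header, rows)

# Hypothetical stub classifier standing in for the trained one.
stub = lambda q: "records" if "value of" in q else "markdown"
table = flextaf_single(["name", "age"], [["Ann", 30]], "value of age for Ann?", stub)
```

FLEXTAF-Vote would instead run the question under every format and merge the answers; only the single-format path is sketched.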

[NLP-4] CIKMar: A Dual-Encoder Approach to Prompt-Based Reranking in Educational Dialogue Systems

Link: https://arxiv.org/abs/2408.08805
Authors: Joanito Agili Lopo, Marina Indah Prasasti, Alma Permatasari
Keywords: educational dialogue systems, Gemma language model, smaller language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper is the result of the final project of the Natural Language Processing course, Master of Artificial Intelligence, Universitas Gadjah Mada

Abstract:In this study, we introduce CIKMar, an efficient approach to educational dialogue systems powered by the Gemma Language model. By leveraging a Dual-Encoder ranking system that incorporates both BERT and SBERT model, we have designed CIKMar to deliver highly relevant and accurate responses, even with the constraints of a smaller language model size. Our evaluation reveals that CIKMar achieves a robust recall and F1-score of 0.70 using BERTScore metrics. However, we have identified a significant challenge: the Dual-Encoder tends to prioritize theoretical responses over practical ones. These findings underscore the potential of compact and efficient models like Gemma in democratizing access to advanced educational AI systems, ensuring effective and contextually appropriate responses.
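
A dual-encoder rerank step of this general shape might look as follows; the toy character-frequency encoder stands in for the BERT and SBERT embeddings used in the paper, and the blending weight `alpha` is an illustrative assumption:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def rerank(query, candidates, enc_a, enc_b, alpha=0.5):
    """Dual-encoder reranking sketch: blend similarity scores from two
    encoders (BERT and SBERT in the paper; toy encoders here) and sort
    candidate responses from most to least relevant."""
    q_a, q_b = enc_a(query), enc_b(query)
    scored = [(alpha * cosine(q_a, enc_a(c)) + (1 - alpha) * cosine(q_b, enc_b(c)), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

# Toy character-frequency "encoder" standing in for real embeddings.
def toy_enc(text):
    return [text.lower().count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]

ranked = rerank("photosynthesis", ["plants convert light", "jazz quartet buzz"],
                toy_enc, toy_enc)
```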

[NLP-5] Leveraging FourierKAN Classification Head for Pre-Trained Transformer-based Text Classification

Link: https://arxiv.org/abs/2408.08803
Authors: Abdullah Al Imran, Md Farhan Ishmam
Keywords: Multi-layer Perceptron, text classification, transformer-based pre-trained models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:For many years, transformer-based pre-trained models with Multi-layer Perceptron (MLP) heads have been the standard for text classification tasks. However, the fixed non-linear functions employed by MLPs often fall short of capturing the intricacies of the contextualized embeddings produced by pre-trained encoders. Furthermore, MLPs usually require a significant number of training parameters, which can be computationally expensive. In this work, we introduce FourierKAN (FR-KAN), a variant of the promising MLP alternative called Kolmogorov-Arnold Networks (KANs), as classification heads for transformer-based encoders. Our studies reveal an average increase of 10% in accuracy and 11% in F1-score when incorporating FR-KAN heads instead of traditional MLP heads for several transformer-based pre-trained models across multiple text classification tasks. Beyond improving model accuracy, FR-KAN heads train faster and require fewer parameters. Our research opens new grounds for broader applications of KAN across several Natural Language Processing (NLP) tasks.
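
A FourierKAN-style head replaces the MLP's fixed nonlinearity with learnable 1-D Fourier series applied per input feature. A minimal pure-Python forward pass could look like this; the coefficient layout is our own illustration, not the authors' implementation:

```python
import math

def fr_kan_forward(x, coeffs):
    """Forward pass of a tiny FourierKAN-style head (sketch): each output
    unit sums a learnable 1-D Fourier series over every input feature,
    instead of applying a fixed activation to a linear combination.
    coeffs[o][i] is a list of (a_k, b_k) pairs for frequencies k = 1, 2, ..."""
    out = []
    for unit in coeffs:                            # one list of series per output unit
        total = 0.0
        for xi, series in zip(x, unit):
            for k, (a, b) in enumerate(series, start=1):
                total += a * math.cos(k * xi) + b * math.sin(k * xi)
        out.append(total)
    return out

# One output unit, one input feature, two frequencies, hand-set coefficients.
coeffs = [[[(1.0, 0.0), (0.0, 0.5)]]]              # a1=1, b1=0, a2=0, b2=0.5
y = fr_kan_forward([0.0], coeffs)
```

In training, the `(a_k, b_k)` coefficients would be learned by gradient descent, just like MLP weights.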

[NLP-6] EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics

Link: https://arxiv.org/abs/2408.08782
Authors: Chenwei Wan, Matthieu Labeau, Chloé Clavel
Keywords: emotionally intelligent conversational systems
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Designing emotionally intelligent conversational systems to provide comfort and advice to people experiencing distress is a compelling area of research. Previous efforts have focused on developing modular dialogue systems that treat socio-emotional strategy prediction as an auxiliary task and generate strategy-conditioned responses with customized decoders. Recently, with advancements in large language models (LLMs), end-to-end dialogue agents without explicit socio-emotional strategy prediction steps have become prevalent. However, despite their excellence in language generation, recent studies show that LLMs’ inherent preference bias towards certain socio-emotional strategies hinders the delivery of high-quality emotional support. To address this challenge, we propose decoupling strategy prediction from language generation, and introduce a novel dialogue strategy predictor, EmoDynamiX, which models the discourse dynamics between user emotions and system strategies using a heterogeneous graph. Additionally, we make use of the Emotion Recognition in Conversations (ERC) task and design a flexible mixed-emotion module to capture fine-grained emotional states of the user. Experimental results on two ESC datasets show EmoDynamiX outperforms previous state-of-the-art methods with a significant margin.

[NLP-7] Evaluating the Evaluator: Measuring LLMs Adherence to Task Evaluation Instructions

Link: https://arxiv.org/abs/2408.08781
Authors: Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar
Keywords: human judgements, Reinforcement Learning, task evaluation
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.
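
The prompt-free baseline scores text by model perplexity. Given per-token log-probabilities from any language model, it is computed as the exponential of the negative mean log-likelihood:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp of the
    negative mean log-likelihood. Lower means the model finds the text
    more predictable, used here as a reference-free quality proxy."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning probability 0.25 to every token yields perplexity 4.
ppl = perplexity([math.log(0.25)] * 10)
```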

[NLP-8] Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Link: https://arxiv.org/abs/2408.08780
Authors: Chenming Tang, Zhixiang Wang, Yunfang Wu
Keywords: large language models, in-context learning
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 6 figures, 3 tables

Abstract: With the help of in-context learning (ICL), large language models (LLMs) have achieved impressive performance across various tasks. However, the function of descriptive instructions during ICL remains under-explored. In this work, we propose an ensemble prompt framework to describe the selection criteria of multiple in-context examples, and preliminary experiments on machine translation (MT) across six translation directions confirm that this framework boosts ICL performance. But to our surprise, LLMs might not necessarily care what the descriptions actually say, and the performance gain is primarily caused by the ensemble format, since the framework could lead to improvement even with random descriptive nouns. We further apply this new ensemble prompt on a range of commonsense, math, logical reasoning and hallucination tasks with three LLMs and achieve promising results, suggesting again that designing a proper prompt format would be much more effective and efficient than paying effort into specific descriptions. Our code will be publicly available once this paper is published.

[NLP-9] DAC: Decomposed Automation Correction for Text-to-SQL

Link: https://arxiv.org/abs/2408.08779
Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che
Keywords: Large Language Models, generating SQL queries
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Text-to-SQL is an important task that helps people obtain information from databases by automatically generating SQL queries. Considering the brilliant performance, approaches based on Large Language Models (LLMs) become the mainstream for text-to-SQL. Among these approaches, automated correction is an effective approach that further enhances performance by correcting the mistakes in the generated results. The existing correction methods require LLMs to directly correct with generated SQL, while previous research shows that LLMs do not know how to detect mistakes, leading to poor performance. Therefore, in this paper, we propose to employ the decomposed correction to enhance text-to-SQL performance. We first demonstrate that decomposed correction outperforms direct correction since detecting and fixing mistakes with the results of the decomposed sub-tasks is easier than with SQL. Based on this analysis, we introduce Decomposed Automation Correction (DAC), which corrects SQL by decomposing text-to-SQL into entity linking and skeleton parsing. DAC first generates the entity and skeleton corresponding to the question and then compares the differences between the initial SQL and the generated entities and skeleton as feedback for correction. Experimental results show that our method improves performance by 3.7% on average of Spider, Bird, and KaggleDBQA compared with the baseline method, demonstrating the effectiveness of DAC.
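
The decomposition into entities and skeleton can be illustrated with a rough heuristic; the regex-based masking below is our own simplification, not the paper's parser:

```python
import re

KEYWORDS = {"select", "from", "where", "and", "or", "="}

def skeleton(sql):
    """Mask literals and identifiers so only the structural skeleton
    remains (a rough heuristic for illustration)."""
    s = re.sub(r"'[^']*'|\b\d+\b", "_", sql)       # literals -> _
    tokens = s.replace("=", " = ").split()
    return " ".join(t.lower() if t.lower() in KEYWORDS | {"_"} else "_"
                    for t in tokens)

def entities(sql):
    """Collect the non-keyword, non-literal tokens (tables, columns)."""
    return {t.lower() for t in sql.replace("=", " = ").split()
            if t.lower() not in KEYWORDS and not t.startswith("'") and not t.isdigit()}

def dac_feedback(initial_sql, gold_entities, gold_skeleton):
    """Compare the initial SQL against separately predicted entities and
    skeleton; the differences become correction feedback for the LLM."""
    return {"missing_entities": gold_entities - entities(initial_sql),
            "skeleton_matches": skeleton(initial_sql) == gold_skeleton}

fb = dac_feedback("SELECT name FROM users WHERE age = 30",
                  {"name", "users", "age", "city"},
                  "select _ from _ where _ = _")
```

Here the feedback flags that the initial query never mentions `city`, which a correction step could then fix.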

[NLP-10] Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused

Link: https://arxiv.org/abs/2408.08769
Authors: Dingwei Chen, Feiteng Fang, Shiwen Ni, Feng Liang, Ruifeng Xu, Min Yang, Chengming Li
Keywords: Large Language Models, natural language processing
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 5 tables

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across various natural language processing tasks, yet they occasionally tend to yield content that factually inaccurate or discordant with the expected output, a phenomenon empirically referred to as “hallucination”. To tackle this issue, recent works have investigated contrastive decoding between the original model and an amateur model with induced hallucination, which has shown promising results. Nonetheless, this method may undermine the output distribution of the original LLM caused by its coarse contrast and simplistic subtraction operation, potentially leading to errors in certain cases. In this paper, we introduce a novel contrastive decoding framework termed LOL (LOwer Layer Matters). Our approach involves concatenating the contrastive decoding of both the final and lower layers between the original model and the amateur model, thereby achieving multi-layer fusion to aid in the mitigation of hallucination. Additionally, we incorporate a truthfulness refocused module that leverages contextual guidance to enhance factual encoding, further capturing truthfulness during contrastive decoding. Extensive experiments conducted on two publicly available datasets illustrate that our proposed LOL framework can substantially alleviate hallucination while surpassing existing baselines in most cases. Compared with the best baseline, we improve by average 4.5 points on all metrics of TruthfulQA. The source code is coming soon.
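
A multi-layer contrastive decoding step of this general shape (not the paper's exact formulation, and without the truthfulness-refocused module) can be sketched as:

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def multi_layer_contrast(expert_final, expert_lower,
                         amateur_final, amateur_lower, beta=0.5):
    """Multi-layer contrastive decoding sketch: contrast expert vs.
    amateur log-probs at both the final and a lower layer, fuse the two
    contrasts, and pick the highest-scoring next token. `beta` weighting
    is an illustrative assumption."""
    ef, el = log_softmax(expert_final), log_softmax(expert_lower)
    af, al = log_softmax(amateur_final), log_softmax(amateur_lower)
    scores = [(ef[i] - af[i]) + beta * (el[i] - al[i]) for i in range(len(ef))]
    return max(range(len(scores)), key=scores.__getitem__)

# Token 1 is favored by the expert but not by the hallucination-prone amateur.
best = multi_layer_contrast([1.0, 2.0, 0.0], [1.0, 1.5, 0.0],
                            [1.0, 0.5, 2.0], [1.0, 0.5, 1.5])
```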

[NLP-11] ChatZero: Zero-shot Cross-Lingual Dialogue Generation via Pseudo-Target Language (ECAI 2024)

Link: https://arxiv.org/abs/2408.08724
Authors: Yongkang Liu, Feng Shi, Daling Wang, Yifei Zhang, Hinrich Schütze
Keywords: LLMs, low-resource languages
Subjects: Computation and Language (cs.CL)
Comments: ECAI 2024

Abstract:Although large language models(LLMs) show amazing capabilities, among various exciting applications discovered for LLMs fall short in other low-resource languages. Besides, most existing methods depend on large-scale dialogue corpora and thus building systems for dialogue generation in a zero-shot scenario remains a considerable challenge. To address this challenge, we propose a novel end-to-end zero-shot dialogue generation model ChatZero based on cross-lingual code-switching method. First, we construct code-switching language and pseudo-target language with placeholders. Then for cross-lingual semantic transfer, we employ unsupervised contrastive learning to minimize the semantics gap of the source language, code-switching language, and pseudo-target language that are mutually positive examples in the high dimensional semantic space. Experiments on the multilingual DailyDialog and DSTC7-AVSD datasets demonstrate that ChatZero can achieve more than 90% of the original performance under the zero-shot case compared to supervised learning, and achieve state-of-the-art performance compared with other baselines.
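
The construction of code-switching and pseudo-target inputs with placeholders might be sketched as follows; the deterministic every-other-token swap, the `[MASK]` placeholder string, and the toy lexicon are illustrative assumptions rather than the paper's procedure:

```python
def code_switch(tokens, lexicon, placeholder="[MASK]", ratio=0.5):
    """Build a code-switching / pseudo-target sequence (sketch): swap a
    fraction of source tokens for their target-language translations when
    the bilingual lexicon has one, otherwise insert a placeholder. The
    every-other-token rule is a deterministic stand-in for random sampling."""
    stride = max(1, int(1 / ratio))
    out = []
    for i, tok in enumerate(tokens):
        if i % stride == 0:
            out.append(lexicon.get(tok.lower(), placeholder))
        else:
            out.append(tok)
    return out

lexicon = {"hello": "bonjour", "world": "monde"}   # toy bilingual lexicon
mixed = code_switch(["hello", "dear", "friends", "world"], lexicon)
```

Contrastive learning would then pull the source sentence, this code-switched variant, and the pseudo-target variant together in embedding space.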

[NLP-12] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Link: https://arxiv.org/abs/2408.08696
Authors: Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che
Keywords: inference latency, speculative decoding
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review

Abstract: The rapid growth in the parameters of large language models (LLMs) has made inference latency a fundamental bottleneck, limiting broader application of LLMs. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm, leveraging the parallel capabilities of modern hardware. Some speculative decoding methods rely on additional structures to guess draft tokens, such as small models or parameter-efficient architectures, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first search (BFS)-like algorithm on the matrix to construct a draft tree. The tree is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires less than 2 MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%. It can be directly applied to any existing LLMs and tasks without the need for adaptation.
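
The adjacency-matrix-plus-BFS draft construction can be sketched as follows; strings stand in for token ids, and the tree-attention verification pass is omitted:

```python
from collections import deque

def build_draft_tree(adj, root, depth=2):
    """Token Recycling sketch: `adj` maps a token to its recently observed
    top candidate continuations (one row of the adjacency matrix). A BFS
    from the last accepted token builds a draft tree; the paper then
    verifies all paths at once with tree attention (not shown here)."""
    tree = {root: []}
    queue = deque([(root, 0)])
    while queue:
        tok, d = queue.popleft()
        if d == depth:
            continue
        for nxt in adj.get(tok, []):
            if nxt not in tree:                    # keep the sketch acyclic
                tree[tok].append(nxt)
                tree[nxt] = []
                queue.append((nxt, d + 1))
    return tree

# Toy adjacency: entries are recycled top-2 candidates from earlier steps.
adj = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"], "sat": ["down"]}
tree = build_draft_tree(adj, "the", depth=2)
```

After verification, the candidate tokens produced at each accepted position would be written back into `adj`, which is what makes the "trash" from earlier steps reusable.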

[NLP-13] Quantifying the Effectiveness of Student Organization Activities using Natural Language Processing

Link: https://arxiv.org/abs/2408.08694
Authors: Lyberius Ennio F. Taruc, Arvin R. De La Cruz
Keywords: student extracurricular activities, Natural Language Processing, Large Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 11 pages, 4 figures, presented at the International Conference on Generative AI and its Applications (ICGAIA-24), 22-23 July 2024, Jakarta, Indonesia

Abstract:Student extracurricular activities play an important role in enriching the students’ educational experiences. With the increasing popularity of Machine Learning and Natural Language Processing, it becomes a logical step that incorporating ML-NLP in improving extracurricular activities is a potential focus of study in Artificial Intelligence (AI). This research study aims to develop a machine learning workflow that will quantify the effectiveness of student-organized activities based on student emotional responses using sentiment analysis. The study uses the Bidirectional Encoder Representations from Transformers (BERT) Large Language Model (LLM) called via the pysentimiento toolkit, as a Transformer pipeline in Hugging Face. A sample data set from Organization C, a Recognized Student Organization (RSO) of a higher educational institute in the Philippines, College X, was used to develop the workflow. The workflow consisted of data preprocessing, key feature selection, LLM feature processing, and score aggregation, resulting in an Event Score for each data set. The results show that the BERT LLM can also be used effectively in analyzing sentiment beyond product reviews and post comments. For the student affairs offices of educational institutions, this study can provide a practical example of how NLP can be applied to real-world scenarios, showcasing the potential impact of data-driven decision making.
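
The final aggregation step (per-comment sentiment labels into an Event Score) might look like the sketch below. The POS/NEU/NEG labels follow pysentimiento's convention, but the weighting scheme and the name `event_score` are our own illustration, not the paper's exact formula:

```python
def event_score(labels, weights=None):
    """Hypothetical Event Score aggregation: map each comment's predicted
    sentiment label to a weight and average over the event. In the paper,
    labels would come from a BERT sentiment pipeline (pysentimiento)."""
    weights = weights or {"POS": 1.0, "NEU": 0.5, "NEG": 0.0}
    if not labels:
        return 0.0
    return sum(weights[l] for l in labels) / len(labels)

# Labels as a sentiment analyzer might return them for four comments.
score = event_score(["POS", "POS", "NEU", "NEG"])
```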

[NLP-14] Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm

Link: https://arxiv.org/abs/2408.08693
Authors: Hongcheng Liu, Yusheng Liao, Siqv Ou, Yuhao Wang, Heyang Liu, Yanfeng Wang, Yu Wang
Keywords: Multi-modal Large Language Models
Subjects: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures

Abstract:The application of the Multi-modal Large Language Models (MLLMs) in medical clinical scenarios remains underexplored. Previous benchmarks only focus on the capacity of the MLLMs in medical visual question-answering (VQA) or report generation and fail to assess the performance of the MLLMs on complex clinical multi-modal tasks. In this paper, we propose a novel Medical Personalized Multi-modal Consultation (Med-PMC) paradigm to evaluate the clinical capacity of the MLLMs. Med-PMC builds a simulated clinical environment where the MLLMs are required to interact with a patient simulator to complete the multi-modal information-gathering and decision-making task. Specifically, the patient simulator is decorated with personalized actors to simulate diverse patients in real scenarios. We conduct extensive experiments to access 12 types of MLLMs, providing a comprehensive view of the MLLMs’ clinical performance. We found that current MLLMs fail to gather multimodal information and show potential bias in the decision-making task when consulted with the personalized patient simulators. Further analysis demonstrates the effectiveness of Med-PMC, showing the potential to guide the development of robust and reliable clinical MLLMs. Code and data are available at this https URL.
摘要:多模式大语言模型(MLLM)在医学临床场景中的应用仍未得到充分探索。以往的基准只关注MLLM在医学视觉问答(VQA)或报告生成方面的能力,而没有评估MLLM在复杂临床多模式任务中的表现。本文提出了一种新的医疗个性化多模式会诊(Med-PMC)范式来评估MLLM的临床能力。Med-PMC构建了一个模拟临床环境,要求MLLM与患者模拟器交互,以完成多模式信息收集和决策任务。具体地说,患者模拟器配备了个性化的角色设定,以模拟真实场景中的不同患者。我们进行了广泛的实验来评估12种MLLM,全面呈现了MLLM的临床表现。我们发现,在与个性化患者模拟器会诊时,现有MLLM不能有效收集多模式信息,并且在决策任务中表现出潜在的偏差。进一步的分析证明了Med-PMC的有效性,显示了其指导开发稳健、可靠的临床MLLM的潜力。代码和数据可通过文中的 https 链接获取。

[NLP-15] The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
[NLP-15] LLM同盟:用于合成偏好优化数据集生成的多智能体工作流

链接: https://arxiv.org/abs/2408.08688
作者: Samee Arif,Sualeha Farid,Abdul Hameed Azeemi,Awais Athar,Agha Ali Raza
关键词-EN: synthetic Preference Optimization, Preference Optimization, synthetic Preference, LLM Feedback Loop, response evaluation module
关键词-ZH: 合成偏好优化、偏好优化、合成偏好、LLM反馈环、响应评估模块
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents and evaluates multi-agent workflows for synthetic Preference Optimization (PO) dataset generation. PO dataset generation requires two modules: (1) response evaluation, and (2) response generation. In the response evaluation module, the responses from Large Language Models (LLMs) are evaluated and ranked - a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a 2 step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. In each step, we use inter-rater agreement using Cohen’s Kappa between human annotators and LLMs. For the response generation module, we compare different configurations for the LLM Feedback Loop using the identified LLM evaluator configuration. We use the win rate (the fraction of times a generation framework is selected as the best by an LLM evaluator) to determine the best multi-agent configuration for generation. After identifying the best configurations for both modules, we use models from the GPT, Gemma, and Llama families to generate our PO datasets using the above pipeline. We generate two types of PO datasets, one to improve the generation capabilities of individual LLM and the other to improve the multi-agent workflow. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across datasets when the candidate responses do not include responses from the GPT family. Additionally, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-agent Llama and Gemma, respectively.
摘要:本文提出并评估了用于合成偏好优化(PO)数据集生成的多智能体工作流。PO数据集的生成需要两个模块:(1)响应评估,(2)响应生成。在响应评估模块中,对来自大型语言模型(LLM)的响应进行评估和排序,这项任务通常由人工标注者完成,我们使用LLM将其自动化。我们分两步评估响应评估模块。第一步,我们使用三种不同的提示策略评估作为评估者的LLM。第二步,我们应用胜出的提示策略,比较LLM-as-a-Judge、LLMs-as-a-Jury和LLM辩论三种方式的性能。在每一步中,我们都使用Cohen's Kappa衡量人工标注者与LLM之间的评分者间一致性。对于响应生成模块,我们使用选定的LLM评估器配置来比较LLM反馈环的不同配置。我们使用胜率(某生成框架被LLM评估器选为最佳的比例)来确定生成任务的最佳多智能体配置。在确定两个模块的最佳配置后,我们使用GPT、Gemma和Llama系列的模型,按上述流程生成PO数据集。我们生成了两种类型的PO数据集:一种用于提高单个LLM的生成能力,另一种用于改进多智能体工作流。评估表明,当候选响应不包含GPT家族的响应时,GPT-4o-as-a-Judge在各数据集上更为一致。此外,我们发现以Llama为生成器、Gemma为评审者的LLM反馈环,相对于单智能体的Llama和Gemma分别取得了71.8%和73.8%的显著胜率。
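
上文用 Cohen's Kappa 衡量人工标注者与 LLM 之间的评分者间一致性。下面是该指标的一个纯 Python 草图(示例数据为假设):kappa = (p_o - p_e) / (1 - p_e),其中 p_o 是观察到的一致比例,p_e 是随机一致的期望比例。

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """计算两位标注者(如人工标注 vs LLM)之间的 Cohen's Kappa 一致性。"""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # 观察到的一致比例 p_o
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # 随机一致的期望比例 p_e(按各自的边际分布计算)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b.get(k, 0) for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # 双方各自只给出同一类别的退化情形
    return (p_o - p_e) / (1 - p_e)

# 用法:人工与 LLM 对 6 个回复对的偏好标注("A" 表示偏好回复 A)
human = ["A", "A", "B", "B", "A", "B"]
llm = ["A", "A", "B", "A", "A", "B"]
print(round(cohen_kappa(human, llm), 3))  # 0.667
```

实践中也可直接使用 scikit-learn 的 `cohen_kappa_score`;这里的手写版本只是为了说明指标本身。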

[NLP-16] LLM-PCGC: Large Language Model-based Point Cloud Geometry Compression
[NLP-16] LLM-PCGC:基于大语言模型的点云几何压缩

链接: https://arxiv.org/abs/2408.08682
作者: Yuqi Ye,Wei Gao
关键词-EN: point cloud, point cloud geometry, point cloud compression, robust context model, context model consistent
关键词-ZH: 点云、点云几何、点云压缩、稳健的上下文模型、上下文模型一致
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The key to effective point cloud compression is to obtain a robust context model consistent with complex 3D data structures. Recently, the advancement of large language models (LLMs) has highlighted their capabilities not only as powerful generators for in-context learning and generation but also as effective compressors. These dual attributes of LLMs make them particularly well-suited to meet the demands of data compression. Therefore, this paper explores the potential of using LLM for compression tasks, focusing on lossless point cloud geometry compression (PCGC) experiments. However, applying LLM directly to PCGC tasks presents some significant challenges, i.e., LLM does not understand the structure of the point cloud well, and it is a difficult task to fill the gap between text and point cloud through text description, especially for large complicated and small shapeless point clouds. To address these problems, we introduce a novel architecture, namely the Large Language Model-based Point Cloud Geometry Compression (LLM-PCGC) method, using LLM to compress point cloud geometry information without any text description or aligning operation. By utilizing different adaptation techniques for cross-modality representation alignment and semantic consistency, including clustering, K-tree, token mapping invariance, and Low Rank Adaptation (LoRA), the proposed method can translate LLM to a compressor/generator for point cloud. To the best of our knowledge, this is the first structure to employ LLM as a compressor for point cloud data. Experiments demonstrate that the LLM-PCGC outperforms the other existing methods significantly, by achieving -40.213% bit rate reduction compared to the reference software of MPEG Geometry-based Point Cloud Compression (G-PCC) standard, and by achieving -2.267% bit rate reduction compared to the state-of-the-art learning-based method.
摘要:有效点云压缩的关键是获得与复杂3D数据结构相一致的稳健上下文模型。最近,大型语言模型(LLM)的发展表明,它们不仅是上下文学习和生成的强大生成器,也是有效的压缩器。LLM的这种双重属性使其特别适合满足数据压缩的需求。因此,本文以无损点云几何压缩(PCGC)实验为重点,探索LLM在压缩任务中的应用潜力。然而,将LLM直接应用于PCGC任务面临一些重大挑战:LLM不能很好地理解点云的结构,而通过文本描述来弥合文本与点云之间的差距也是一项困难的任务,特别是对于大型复杂点云和小型无定形点云。为了解决这些问题,我们提出了一种新的架构,即基于大语言模型的点云几何压缩(LLM-PCGC)方法,使用LLM压缩点云几何信息,而无需任何文本描述或对齐操作。通过使用不同的跨模态表示对齐和语义一致性自适应技术,包括聚类、K-树、token映射不变性和低秩自适应(LoRA),该方法可以将LLM转换为点云的压缩器/生成器。据我们所知,这是第一个将LLM用作点云数据压缩器的结构。实验表明,与MPEG基于几何的点云压缩(G-PCC)标准参考软件相比,LLM-PCGC实现了40.213%的码率节省;与最先进的基于学习的方法相比,实现了2.267%的码率节省。

[NLP-17] MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector
[NLP-17] MIA-Tuner:调整大型语言模型作为预训练文本检测器

链接: https://arxiv.org/abs/2408.08661
作者: Wenjie Fu,Huandong Wang,Chen Gao,Guanghua Liu,Yong Li,Tao Jiang
关键词-EN: large language models, language models, highlight the urgent, increasing parameters, parameters and expansive
关键词-ZH: 大型语言模型、语言模型、凸显紧迫性、不断增加的参数、参数和扩展性
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: code and dataset: this https URL

点击查看摘要

Abstract:The increasing parameters and expansive dataset of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance in pre-trained LLMs, how to achieve high-confidence detection and how to perform MIA on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than design an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly high level of 0.9.
摘要:大型语言模型(LLM)不断增加的参数和不断扩大的数据集,凸显了对一种技术方案的迫切需求,以审计LLM相关的潜在隐私风险和版权问题。现有研究已经通过探索预训练数据检测问题部分地满足了这一需求:该问题是成员推理攻击(MIA)的一个实例,涉及判断一段给定文本是否在目标LLM的预训练阶段被使用过。虽然现有方法设计了各种复杂的MIA评分函数,在预训练LLM上取得了可观的检测性能,但如何实现高置信度检测,以及如何在对齐后的LLM上执行MIA,仍然具有挑战性。本文提出了一种新的基于指令的MIA方法MIA-Tuner,它指示LLM自身在内部充当更精确的预训练数据检测器,而不是在外部设计MIA评分函数。此外,我们设计了两种基于指令的防护机制,分别缓解现有方法和MIA-Tuner带来的隐私风险。为了全面评估最新的LLM,我们收集了一个更新的MIA基准数据集WIKIMIA-24,以取代被广泛采用的基准WIKIMIA。我们在两个基准数据集上对各种对齐和未对齐的LLM进行了广泛实验。结果表明,MIA-Tuner将MIA的AUC从0.7显著提高到0.9。
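
上文用 AUC(从0.7提高到0.9)衡量成员推理攻击的检测性能。AUC 的一个直观定义是:随机抽取一个成员样本和一个非成员样本,攻击得分将成员排在前面的概率。下面是这个定义的纯 Python 草图,示例得分均为假设数据。

```python
def auc(scores_members, scores_nonmembers):
    """成员推理攻击的 AUC:随机抽取一个成员样本和一个非成员样本,
    成员样本得分更高的概率(得分相同计 0.5)。"""
    pairs = len(scores_members) * len(scores_nonmembers)
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in scores_members
        for n in scores_nonmembers
    )
    return wins / pairs

# 用法:假设的 MIA 得分,成员样本(预训练见过的文本)应得到更高分
members = [0.9, 0.8, 0.75, 0.4]
nonmembers = [0.7, 0.3, 0.2]
print(round(auc(members, nonmembers), 3))  # 0.917
```

这一逐对比较的定义与按阈值扫描绘制 ROC 曲线再求面积是等价的,只是更便于演示。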

[NLP-18] LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
[NLP-18] LLM对输出格式存在偏好!系统评估和缓解LLM的输出格式偏差

链接: https://arxiv.org/abs/2408.08656
作者: Do Xuan Long,Hai Nguyen Ngoc,Tiviatis Sim,Hieu Dao,Shafiq Joty,Kenji Kawaguchi,Nancy F. Chen,Min-Yen Kan
关键词-EN: large language models, systematic evaluation examining, format bias, language models, format
关键词-ZH: 大型语言模型、系统评价检查、格式偏差、语言模型、格式
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present the first systematic evaluation examining format bias in performance of large language models (LLMs). Our approach distinguishes between two categories of an evaluation metric under format constraints to reliably and accurately assess performance: one measures performance when format constraints are adhered to, while the other evaluates performance regardless of constraint adherence. We then define a metric for measuring the format bias of LLMs and establish effective strategies to reduce it. Subsequently, we present our empirical format bias evaluation spanning four commonly used categories – multiple-choice question-answer, wrapping, list, and mapping – covering 15 widely-used formats. Our evaluation on eight generation tasks uncovers significant format bias across state-of-the-art LLMs. We further discover that improving the format-instruction following capabilities of LLMs across formats potentially reduces format bias. Based on our evaluation findings, we study prompting and fine-tuning with synthesized format data techniques to mitigate format bias. Our methods successfully reduce the variance in ChatGPT’s performance among wrapping formats from 235.33 to 0.71 (%^2).
摘要:我们首次对大型语言模型(LLM)性能中的格式偏差进行了系统评估。我们的方法区分了格式约束下的两类评估指标,以可靠、准确地评估性能:一类在遵守格式约束的前提下测量性能,另一类在不考虑约束遵守情况下评估性能。然后,我们定义了一个度量LLM格式偏差的指标,并制定了有效的缓解策略。随后,我们给出了涵盖四个常用类别(多项选择问答、包装、列表和映射)、共15种广泛使用格式的经验性格式偏差评估。我们对八个生成任务的评估发现,最先进的LLM普遍存在显著的格式偏差。我们进一步发现,提高LLM跨格式遵循格式指令的能力可能会减少格式偏差。基于评估结果,我们研究了使用合成格式数据进行提示和微调以缓解格式偏差的方法。我们的方法成功地将ChatGPT在不同包装格式之间的性能方差从235.33降低到0.71(单位:%^2)。

[NLP-19] Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning
[NLP-19] 超越偏差的推理:反事实提示与思维链推理研究

链接: https://arxiv.org/abs/2408.08651
作者: Kyle Moore,Jesse Roberts,Thao Pham,Douglas Fisher
关键词-EN: Counterfactual Prompting, Multi-Task Language Understanding, Massive Multi-Task Language, training data, leading to predictions
关键词-ZH: 反事实预算、多任务语言理解、大规模多任务语言、训练数据、导致预测
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are known to absorb biases from their training data, leading to predictions driven by statistical regularities rather than semantic relevance. We investigate the impact of these biases on answer choice preferences in the Massive Multi-Task Language Understanding (MMLU) task. Our findings reveal that differences in learned regularities across answer options are predictive of model preferences and mirror human test-taking strategies. To address this issue, we introduce two novel methods: Counterfactual Prompting with Chain of Thought (CoT) and Counterfactual Prompting with Agnostically Primed CoT (APriCoT). We demonstrate that while Counterfactual Prompting with CoT alone is insufficient to mitigate bias, our novel Primed Counterfactual Prompting with CoT approach effectively reduces the influence of base-rate probabilities while improving overall accuracy. Our results suggest that mitigating bias requires a “System-2” like process and that CoT reasoning is susceptible to confirmation bias under some prompting methodologies. Our contributions offer practical solutions for developing more robust and fair language models.
摘要:众所周知,语言模型会从训练数据中吸收偏差,导致预测由统计规律而非语义相关性驱动。我们考察了这些偏差对大规模多任务语言理解(MMLU)任务中答案选择偏好的影响。我们的发现表明,不同答案选项之间习得规律的差异能够预测模型偏好,并与人类的应试策略相呼应。为了解决这个问题,我们引入了两种新方法:结合思维链(CoT)的反事实提示,以及结合不可知启动思维链的反事实提示(APriCoT)。我们证明,仅使用CoT的反事实提示不足以减轻偏差,而我们新提出的启动式反事实提示CoT方法在提高整体准确率的同时,有效降低了基率概率的影响。我们的结果表明,缓解偏差需要一个类似"系统2"的过程,并且在某些提示方法下,CoT推理容易受到确认偏差的影响。我们的贡献为开发更稳健、更公平的语言模型提供了实用的解决方案。
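
上文提到模型可能利用答案选项的基率(base rate)而非语义来作答。检查这类偏差的第一步,通常是统计基准中各选项作为正确答案出现的比例,下面是一个小草图(标准答案为假设数据)。

```python
from collections import Counter

def answer_base_rates(gold_answers):
    """统计各选项作为正确答案出现的比例,用于检查类似 MMLU 的基准
    是否存在可被模型利用的答案分布偏差。"""
    counts = Counter(gold_answers)
    total = len(gold_answers)
    return {option: counts[option] / total for option in sorted(counts)}

# 假设的一小批题目的标准答案
gold = ["A", "C", "C", "B", "C", "D", "C", "A"]
print(answer_base_rates(gold))  # {'A': 0.25, 'B': 0.125, 'C': 0.5, 'D': 0.125}
```

若某一选项的基率显著偏离均匀分布(此例中 "C" 占一半),单纯偏向该选项的模型也能取得虚高的准确率,这正是反事实提示想要剥离的信号。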

[NLP-20] An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation
[NLP-20] 照片共享多模式对话生成的端到端模型

链接: https://arxiv.org/abs/2408.08650
作者: Peiming Guo,Sinuo Liu,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Min Zhang
关键词-EN: Multi-modal dialogue generation, Photo-Sharing Multi-modal dialogue, generate text responses, model, Multi-modal dialogue
关键词-ZH: 多模式对话生成,照片共享多模式对话,生成文本响应,模型,多模式对话
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Photo-Sharing Multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image text caption as the bridge, a pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing the images with text captions may loss important visual details and information and cause error propagation in the complex dialogue system. Besides, the pipeline model isolates the three models separately because discrete image text captions hinder end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images in the input end. For image generation in the output end, we propose a dynamic vocabulary transformation matrix and use straight-through and gumbel-softmax techniques to align the large language model and stable diffusion model and achieve end-to-end gradient propagation. We perform experiments on PhotoChat and DialogCC datasets to evaluate our end-to-end model. Compared with pipeline models, the end-to-end model gains state-of-the-art performances on various metrics of text and image generation. More analysis experiments also verify the effectiveness of the end-to-end model for photo-sharing multi-modal dialogue generation.
摘要:照片共享多模式对话的生成不仅需要对话代理生成文本响应,还需要在适当的时刻分享照片。以图像文本描述为桥梁,管道模型集成了图像描述模型、文本生成模型和图像生成模型来处理这一复杂的多模式任务。然而,用文本描述表示图像可能会丢失重要的视觉细节和信息,并在复杂的对话系统中导致错误传播。此外,由于离散的图像文本描述阻碍了端到端的梯度传播,管道模型只能将三个模型彼此隔离。我们提出了第一个端到端的照片共享多模式对话生成模型,该模型将图像感知器和图像生成器与大型语言模型集成在一起。大语言模型使用Q-Former在输入端感知视觉图像。对于输出端的图像生成,我们提出了动态词表变换矩阵,并使用直通(straight-through)和Gumbel-Softmax技术对齐大语言模型与Stable Diffusion模型,实现端到端的梯度传播。我们在PhotoChat和DialogCC数据集上进行实验来评估我们的端到端模型。与管道模型相比,端到端模型在文本和图像生成的各种指标上均取得了最先进的性能。更多分析实验也验证了端到端模型在照片共享多模式对话生成中的有效性。
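
上文提到的 Gumbel-Softmax 技术,是对离散采样的一种可微松弛:给 logits 加上 Gumbel 噪声后做带温度的 softmax。下面是这一核心运算的纯 Python 草图(logits 与温度均为假设值,与论文的具体实现无关)。

```python
import math
import random

def gumbel_noise(rng):
    """从标准 Gumbel 分布采样:-log(-log(U)),U ~ Uniform(0,1)。"""
    u = rng.random()
    return -math.log(-math.log(u + 1e-20) + 1e-20)

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Gumbel-Softmax 松弛:对 logits 加 Gumbel 噪声后做温度为 tau 的 softmax。
    tau 越小,输出越接近 one-hot,同时整个运算保持可微。"""
    noisy = [(l + gumbel_noise(rng)) / tau for l in logits]
    m = max(noisy)  # 数值稳定的 softmax
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]

# 用法:对一个 3 类词表的 logits 采样一个近似 one-hot 的分布
rng = random.Random(0)
probs = gumbel_softmax([2.0, 1.0, 0.1], tau=0.5, rng=rng)
print([round(p, 3) for p in probs])
```

直通(straight-through)技巧则在前向传播时取 one-hot 的 argmax、反向传播时使用上述连续分布的梯度,两者配合即可让离散的图像 token 选择参与端到端训练。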

[NLP-21] Understanding Enthymemes in Argument Maps: Bridging Argument Mining and Logic-based Argumentation
[NLP-21] 理解论点地图中的推理模元:弥合论点挖掘和基于逻辑的论证

链接: https://arxiv.org/abs/2408.08648
作者: Jonathan Ben-Naim,Victor David,Anthony Hunter
关键词-EN: argument map, processing technology aimed, Argument, arguments, language processing technology
关键词-ZH: 论点地图、针对的处理技术、论点、论点、语言处理技术
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Research note

点击查看摘要

Abstract:Argument mining is natural language processing technology aimed at identifying arguments in text. Furthermore, the approach is being developed to identify the premises and claims of those arguments, and to identify the relationships between arguments including support and attack relationships. In this paper, we assume that an argument map contains the premises and claims of arguments, and support and attack relationships between them, that have been identified by argument mining. So from a piece of text, we assume an argument map is obtained automatically by natural language processing. However, to understand and to automatically analyse that argument map, it would be desirable to instantiate that argument map with logical arguments. Once we have the logical representation of the arguments in an argument map, we can use automated reasoning to analyze the argumentation (e.g. check consistency of premises, check validity of claims, and check the labelling on each arc corresponds with thw logical arguments). We address this need by using classical logic for representing the explicit information in the text, and using default logic for representing the implicit information in the text. In order to investigate our proposal, we consider some specific options for instantiation.
摘要:论辩挖掘是一种旨在识别文本中论证的自然语言处理技术。此外,该方法还在发展,以识别这些论证的前提和主张,并识别论证之间的关系,包括支持和攻击关系。在本文中,我们假设论证图包含通过论辩挖掘识别出的论证前提和主张,以及它们之间的支持和攻击关系。也就是说,我们假设可以通过自然语言处理从一段文本中自动获得论证图。然而,为了理解并自动分析该论证图,最好用逻辑论证将其实例化。一旦我们得到了论证图中各论证的逻辑表示,就可以使用自动推理来分析论辩过程(例如,检查前提的一致性、检查主张的有效性,以及检查每条边上的标签是否与相应的逻辑论证一致)。为满足这一需求,我们使用经典逻辑表示文本中的显性信息,使用缺省逻辑表示文本中的隐含信息。为了验证我们的提议,我们考虑了若干具体的实例化方案。

[NLP-22] Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
[NLP-22] Math-PUMA:渐进式向上多模式对齐以增强数学推理

链接: https://arxiv.org/abs/2408.08640
作者: Wenwen Zhuang,Xin Huang,Xiantao Zhang,Jin Zeng
关键词-EN: Large Language Models, Multimodal Large Language, Large Language, natural scene images, Multimodal Large
关键词-ZH: 大型语言模型、多模式大型语言、大型语言、自然场景图像、多模式大型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from textual to visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle aforementioned challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model’s mathematical reasoning capabilities with extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, consistent problem-solving abilities are ensured. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities.
摘要:多模式大语言模型(MLLM)在解决基于文本的数学问题方面表现出色,但由于它们主要在自然场景图像上训练,因此难以处理数学图表。对于人类来说,视觉辅助通常有助于解决问题,但随着信息从文本模态转移到视觉模态,MLLM的表现反而更差。这一下降主要源于它们在图像与文本对齐方面的缺陷。为了应对上述挑战,我们提出了Math-PUMA,一种专注于渐进式向上多模式对齐的方法。该方法旨在通过三阶段训练过程提高MLLM的数学推理能力,其中第二阶段是关键的对齐阶段。我们首先通过大量文本数学问题增强语言模型的数学推理能力。然后,我们构建一个文本与视觉信息比例各异的多模式数据集,将每个问题以至少两种形式呈现来构造数据对。通过利用下一token预测分布的Kullback-Leibler(KL)散度来对齐视觉和文本模态,确保了一致的问题解决能力。最后,我们利用高质量的多模式数据对MLLM进行多模式指令微调。在多个数学推理基准上的实验结果表明,用Math-PUMA训练的MLLM优于大多数开源MLLM。我们的方法有效缩小了同一问题以不同模态呈现时的性能差距。
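
上文用 KL 散度衡量两个 next-token 预测分布的差异,以此对齐视觉与文本模态。下面是离散分布 KL(P‖Q) 的一个纯 Python 草图,示例分布为假设数据,词表大小也远小于真实模型。

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P ‖ Q):衡量两个离散分布(如两种模态下的 next-token 预测)的差异。
    eps 用于避免 log(0) 的数值问题。"""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# 假设的两个词表大小为 4 的 next-token 预测分布
p_text = [0.7, 0.2, 0.05, 0.05]
p_visual = [0.5, 0.3, 0.1, 0.1]
print(round(kl_divergence(p_text, p_visual), 4))  # 0.0851
```

训练中可把这样的散度作为蒸馏式损失项,使视觉输入下的预测分布逐步向文本输入下的分布靠拢(两个分布相同时 KL 为 0)。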

[NLP-23] A Survey on Benchmarks of Multimodal Large Language Models
[NLP-23] 多模式大型语言模型基准调查

链接: https://arxiv.org/abs/2408.08632
作者: Jian Li,Weiheng Lu
关键词-EN: Multimodal Large Language, Large Language Models, visual question answering, Multimodal Large, Language Models
关键词-ZH: 多模式大型语言、大型语言模型、视觉问答、多模式大型、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 180 benchmarks and evaluation for MLLMs, focusing on (1)perception and understanding, (2)cognition and reasoning, (3)specific domains, (4)key capabilities, and (5)other modalities. Finally, we discuss the limitations of the current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs. For more details, please visit our GitHub repository: this https URL.
摘要:多模式大型语言模型(MLLM)因其在视觉问答、视觉感知、理解和推理等各种应用中的出色表现而在学术界和工业界越来越受欢迎。在过去的几年里,人们做出了巨大努力从多个角度审查MLLM。本文全面回顾了180个MLLM基准和评估,重点关注(1)感知和理解,(2)认知和推理,(3)特定领域,(4)关键能力,和(5)其他模式。最后,我们讨论了MLLM当前评估方法的局限性,并探索有前途的未来方向。我们的关键论点是,评估应被视为更好地支持MLLM发展的重要学科。欲了解更多详细信息,请访问我们的GitHub存储库:此https URL。

[NLP-24] Persona is a Double-edged Sword: Enhancing the Zero-shot Reasoning by Ensembling the Role-playing and Neutral Prompts
[NLP-24] 人物角色是一把双刃剑:通过集成角色扮演提示和中性提示来增强零样本推理

链接: https://arxiv.org/abs/2408.08631
作者: Junseok Kim,Nakyeong Yang,Kyomin Jung
关键词-EN: Recent studies demonstrate, Recent studies, LLM, LLM evaluator, Recent
关键词-ZH: 最近的研究表明,最近的研究,LLM,LLM评估者,最近
类目: Computation and Language (cs.CL)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Recent studies demonstrate that prompting an appropriate role-playing persona to an LLM improves its reasoning capability. However, assigning a proper persona is difficult since an LLM’s performance is extremely sensitive to assigned prompts; therefore, personas sometimes hinder LLMs and degrade their reasoning capabilities. In this paper, we propose a novel framework, Jekyll & Hyde, which ensembles the results of role-playing and neutral prompts to eradicate performance degradation via unilateral use of role-playing prompted LLM and enhance the robustness of an LLM’s reasoning ability. Specifically, Jekyll & Hyde collects two potential solutions from both role-playing and neutral prompts and selects a better solution after cross-checking via an LLM evaluator. However, LLM-based evaluators tend to be affected by the order of those potential solutions within the prompt when selecting the proper solution; thus, we also propose a robust LLM evaluator to mitigate the position bias. The experimental analysis demonstrates that role-playing prompts distract LLMs and degrade their reasoning abilities in 4 out of 12 datasets, even when using GPT-4. In addition, we reveal that Jekyll & Hyde improves reasoning capabilities by selecting better choices among the potential solutions on twelve widely-used reasoning datasets. We further show that our proposed LLM evaluator outperforms other baselines, proving the LLMs’ position bias is successfully mitigated.
摘要:最近的研究表明,给LLM指定合适的角色扮演人物角色可以提高其推理能力。然而,由于LLM的表现对所分配的提示极其敏感,指定合适的人物角色十分困难;因此,人物角色有时反而会妨碍LLM并削弱其推理能力。本文提出了一种新的框架Jekyll & Hyde,它集成角色扮演提示和中性提示的结果,以消除单方面使用角色扮演提示导致的性能退化,并增强LLM推理能力的稳健性。具体地说,Jekyll & Hyde分别从角色扮演提示和中性提示收集两个候选解,并通过LLM评估器交叉检查后选择更好的解。然而,基于LLM的评估器在选择时往往会受到候选解在提示中出现顺序的影响;因此,我们还提出了一个稳健的LLM评估器来缓解位置偏差。实验分析表明,即使使用GPT-4,角色扮演提示也会在12个数据集中的4个上分散LLM的注意力并降低其推理能力。此外,我们发现Jekyll & Hyde通过在12个广泛使用的推理数据集上从候选解中做出更好的选择来提高推理能力。我们进一步表明,我们提出的LLM评估器优于其他基线,证明LLM的位置偏差得到了成功缓解。

[NLP-25] RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions
[NLP-25] RealMedQA:包含现实临床问题的试点生物医学问答数据集

链接: https://arxiv.org/abs/2408.08624
作者: Gregory Kell,Angus Roberts,Serge Umansky,Yuti Khare,Najma Ahmed,Nikhil Patel,Chloe Simela,Jack Coumbe,Julian Rozario,Ryan-Rhys Griffiths,Iain J. Marshall
关键词-EN: Clinical question answering, question answering systems, clinicians with relevant, relevant and timely, answering systems
关键词-ZH: 临床问答、问答系统、具有相关、相关且及时的解答系统的临床医生
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at AMIA Annual Symposium 2024

点击查看摘要

Abstract:Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating “ideal” QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. We release our code and our dataset publicly to encourage further research.
摘要:临床问答系统有潜力为临床医生提供相关且及时的问题答案。尽管如此,尽管取得了进步,但这些系统在临床环境中的采用进展缓慢。一个问题是缺乏反映卫生专业人员现实需求的问答数据集。在这项工作中,我们展示了RealMedQA,这是一个由人类和LLM生成的现实临床问题的数据集。我们描述了生成和验证QA对的过程,并评估BioASQ和RealMedQA上的几个QA模型,以评估将答案与问题匹配的相对难度。我们表明,LLM对于生成“理想”QA对来说更具成本效益。此外,根据结果,我们的问题和答案之间的词汇相似性比BioASQ更低,这对前两个QA模型提出了额外的挑战。我们公开发布我们的代码和数据集,以鼓励进一步研究。

[NLP-26] A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models
[NLP-26] 自回归语言模型中系统推理的机械解释

链接: https://arxiv.org/abs/2408.08590
作者: Geonhee Kim,Marco Valentino,André Freitas
关键词-EN: auto-regressive Language Models, exploit superficial patterns, Recent studies, auto-regressive Language, systematic reasoning principles
关键词-ZH: 自回归语言模型,利用表面模式,最近的研究,自回归语言,系统推理原则
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies on logical reasoning in auto-regressive Language Models (LMs) have sparked a debate on whether such models can learn systematic reasoning principles during pre-training or merely exploit superficial patterns in the training data. This paper presents a mechanistic interpretation of syllogistic reasoning in LMs to further enhance our understanding of internal dynamics. Specifically, we present a methodology for circuit discovery aimed at disentangling content-independent reasoning mechanisms from world knowledge acquired during pre-training. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic reasoning, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes and model sizes, finding that the identified circuit is sufficient and necessary for all the schemes on which the model achieves high downstream accuracy (≥60%). Overall, our findings suggest that LMs indeed learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalisable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.
摘要:最近关于自回归语言模型(LM)中逻辑推理的研究引发了一场争论:这类模型究竟是在预训练期间学到了系统性的推理原则,还是仅仅利用了训练数据中的表面模式。本文对LM中的三段论推理给出了一种机制性解释,以加深我们对其内部动态的理解。具体地说,我们提出了一种电路(circuit)发现方法,旨在将与内容无关的推理机制从预训练期间获得的世界知识中分离出来。通过两种不同的干预方法,我们发现了一个涉及中项抑制的充分且必要的电路,该电路阐明了LM如何传递信息以从前提推出有效结论。此外,我们还考察了信念偏差如何在三段论推理中表现,发现了来自负责编码常识和上下文知识的额外注意力头的部分污染的证据。最后,我们探索了所发现机制在不同三段论形式和模型规模上的泛化情况,发现在模型取得较高下游准确率(≥60%)的所有形式上,所识别的电路都是充分且必要的。总体而言,我们的发现表明,LM确实学习了可迁移的、与内容无关的推理机制,但与此同时,这种机制并不涉及可泛化的抽象逻辑原语,容易受到预训练期间获得的世界知识的污染。

[NLP-27] Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles
[NLP-27] BioLaySumm 2024生物医学研究文章通俗摘要共享任务概述

链接: https://arxiv.org/abs/2408.08566
作者: Tomas Goldsack,Carolina Scarton,Matthew Shardlow,Chenghua Lin
关键词-EN: Biomedical Research Articles, Workshop at ACL, Lay Summarisation, Summarisation of Biomedical, BioLaySumm shared task
关键词-ZH: 生物医学研究文章,ACL研讨会,Lay总结,生物医学总结,BioLaySumm共享任务
类目: Computation and Language (cs.CL)
备注: Published in: Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

点击查看摘要

Abstract:This paper presents the setup and results of the second edition of the BioLaySumm shared task on the Lay Summarisation of Biomedical Research Articles, hosted at the BioNLP Workshop at ACL 2024. In this task edition, we aim to build on the first edition’s success by further increasing research interest in this important task and encouraging participants to explore novel approaches that will help advance the state-of-the-art. Encouragingly, we found research interest in the task to be high, with this edition of the task attracting a total of 53 participating teams, a significant increase in engagement from the previous edition. Overall, our results show that a broad range of innovative approaches were adopted by task participants, with a predictable shift towards the use of Large Language Models (LLMs).
摘要:本文介绍了BioLaySumm共享任务第二版的设置和结果,该任务在ACL 2024年BioNLP研讨会上主办。在这个任务版本中,我们的目标是通过进一步提高对这项重要任务的研究兴趣,并鼓励参与者探索有助于推进最新技术水平的新颖方法,以巩固第一版的成功。令人鼓舞的是,我们发现人们对该任务的研究兴趣很高,这个版本的任务总共吸引了53个参与团队,参与度比上一版本显着增加。总体而言,我们的结果表明,任务参与者采用了广泛的创新方法,并且可以预见地转向使用大型语言模型(LLM)。

[NLP-28] Collaborative Cross-modal Fusion with Large Language Model for Recommendation CIKM2024
[NLP-28] 面向推荐的大语言模型协作跨模态融合

链接: https://arxiv.org/abs/2408.08564
作者: Zhongzhou Liu,Hao Zhang,Kuicai Dong,Yuan Fang
关键词-EN: conventional collaborative filtering, large language models, leveraging semantic knowledge, success of conventional, exhibit limitations
关键词-ZH: 传统的协作过滤、大型语言模型、利用语义知识、传统的成功、表现出局限性
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, accepted by CIKM 2024

点击查看摘要

Abstract:Despite the success of conventional collaborative filtering (CF) approaches for recommendation systems, they exhibit limitations in leveraging semantic knowledge within the textual attributes of users and items. Recent focus on the application of large language models for recommendation (LLM4Rec) has highlighted their capability for effective semantic knowledge capture. However, these methods often overlook the collaborative signals in user behaviors. Some simply instruct-tune a language model, while others directly inject the embeddings of a CF-based model, lacking a synergistic fusion of different modalities. To address these issues, we propose a framework of Collaborative Cross-modal Fusion with Large Language Models, termed CCF-LLM, for recommendation. In this framework, we translate the user-item interactions into a hybrid prompt to encode both semantic knowledge and collaborative signals, and then employ an attentive cross-modal fusion strategy to effectively fuse latent embeddings of both modalities. Extensive experiments demonstrate that CCF-LLM outperforms existing methods by effectively utilizing semantic and collaborative signals in the LLM4Rec context.
摘要:尽管传统的协同过滤(CF)方法在推荐系统中取得了成功,但它们在利用用户和物品文本属性中的语义知识方面存在局限。最近对大语言模型推荐应用(LLM4Rec)的关注突出了其有效捕获语义知识的能力。然而,这些方法往往忽略了用户行为中的协作信号:有些只是对语言模型进行指令微调,另一些则直接注入基于CF的模型的嵌入,缺乏不同模态的协同融合。为了解决这些问题,我们提出了一个基于大语言模型的协作跨模态融合框架,称为CCF-LLM,用于推荐。在该框架中,我们将用户-物品交互转换为混合提示,以同时编码语义知识和协作信号,然后采用注意力跨模态融合策略,有效融合两种模态的潜在嵌入。大量实验表明,CCF-LLM通过在LLM4Rec场景中有效利用语义和协作信号,性能优于现有方法。

[NLP-29] Integrating Multi-view Analysis: Multi-view Mixture-of-Expert for Textual Personality Detection NLPCC2024
[NLP-29] 集成多视图分析:用于文本个性检测的多视图专家混合

链接: https://arxiv.org/abs/2408.08551
作者: Haohao Zhu,Xiaokun Zhang,Junyu Lu,Liang Yang,Hongfei Lin
关键词-EN: Textual personality detection, identify personality traits, personality detection, Textual personality, personality detection aims
关键词-ZH: 文本个性检测,识别个性特征,个性检测,文本个性,个性检测目标
类目: Computation and Language (cs.CL)
备注: Accepted by NLPCC 2024

点击查看摘要

Abstract:Textual personality detection aims to identify personality traits by analyzing user-generated content. To achieve this effectively, it is essential to thoroughly examine user-generated content from various perspectives. However, previous studies have struggled with automatically extracting and effectively integrating information from multiple perspectives, thereby limiting their performance on personality detection. To address these challenges, we propose the Multi-view Mixture-of-Experts Model for Textual Personality Detection (MvP). MvP introduces a Multi-view Mixture-of-Experts (MoE) network to automatically analyze user posts from various perspectives. Additionally, it employs User Consistency Regularization to mitigate conflicts among different perspectives and learn a multi-view generic user representation. The model’s training is optimized via a multi-task joint learning strategy that balances supervised personality detection with self-supervised user consistency constraints. Experimental results on two widely-used personality detection datasets demonstrate the effectiveness of the MvP model and the benefits of automatically analyzing user posts from diverse perspectives for textual personality detection.
摘要:文本人格检测旨在通过分析用户生成的内容来识别人格特征。为了有效实现这一目标,必须从不同角度全面考察用户生成的内容。然而,以往的研究难以从多个角度自动提取并有效整合信息,从而限制了其在人格检测上的表现。为了应对这些挑战,我们提出了用于文本人格检测的多视图专家混合模型(MvP)。MvP引入多视图专家混合(MoE)网络,从不同角度自动分析用户帖子。此外,它使用用户一致性正则化来缓解不同视角之间的冲突,并学习多视图通用用户表示。该模型的训练通过多任务联合学习策略进行优化,该策略在有监督的人格检测与自监督的用户一致性约束之间取得平衡。在两个广泛使用的人格检测数据集上的实验结果证明了MvP模型的有效性,以及从不同角度自动分析用户帖子对文本人格检测的好处。

[NLP-30] SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models
[NLP-30] SelectLLM:大型语言模型的查询感知高效选择算法

链接: https://arxiv.org/abs/2408.08545
作者: Kaushal Kumar Maurya,KV Aditya Srivatsa,Ekaterina Kochmar
关键词-EN: gained increased popularity, increased popularity due, Large language models, gained increased, increased popularity
关键词-ZH: 受欢迎程度增加,受欢迎程度增加,由于大语言模型,受欢迎程度增加,受欢迎程度增加
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have gained increased popularity due to their remarkable success across various tasks, which has led to the active development of a large set of diverse LLMs. However, individual LLMs have limitations when applied to complex tasks because of such factors as training biases, model sizes, and the datasets used. A promising approach is to efficiently harness the diverse capabilities of LLMs to overcome these individual limitations. Towards this goal, we introduce a novel LLM selection algorithm called SelectLLM. This algorithm directs input queries to the most suitable subset of LLMs from a large pool, ensuring they collectively provide the correct response efficiently. SelectLLM uses a multi-label classifier, utilizing the classifier’s predictions and confidence scores to design optimal policies for selecting an optimal, query-aware, and lightweight subset of LLMs. Our findings show that the proposed model outperforms individual LLMs and achieves competitive performance compared to similarly sized, computationally expensive top-performing LLM subsets. Specifically, with a similarly sized top-performing LLM subset, we achieve a significant reduction in latency on two standard reasoning benchmarks: 13% lower latency for GSM8K and 70% lower latency for MMLU. Additionally, we conduct comprehensive analyses and ablation studies, which validate the robustness of the proposed model.
摘要:大型语言模型因其在各种任务中的显著成功而越来越受欢迎,这导致了大量多样化LLM的活跃发展。然而,由于训练偏差、模型大小和所用数据集等因素,单个LLM在应用于复杂任务时具有局限性。一种有希望的方法是有效利用各LLM的多样化能力来克服这些个体局限。为实现这一目标,我们引入了一种新的LLM选择算法SelectLLM。该算法将输入查询定向到大型池中最合适的LLM子集,确保它们共同高效地提供正确的响应。SelectLLM使用多标签分类器,利用分类器的预测和置信度分数来设计最优策略,以选择一个最优的、查询感知的、轻量级的LLM子集。我们的研究结果表明,该模型的性能优于单个LLM,并且与规模相近、计算代价较高的最优LLM子集相比具有相当的性能。具体而言,在规模相近的最优LLM子集上,我们在两个标准推理基准上实现了显著的延迟降低:GSM8K的延迟降低13%,MMLU的延迟降低70%。此外,我们进行了全面的分析和消融研究,验证了所提模型的稳健性。
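下面用一个极简的Python示意说明"查询感知的LLM子集选择"这一思路:假设一个多标签分类器已为每个候选LLM给出置信度,按贪心策略挑选最小子集。其中的模型名、噪声或(noisy-OR)组合方式与贪心策略均为示意性假设,并非论文的实际算法。

```python
def select_llm_subset(scores: dict[str, float], threshold: float = 0.9,
                      max_size: int = 3) -> list[str]:
    """Greedily add the highest-confidence LLMs until the combined
    (noisy-OR, assuming independence) confidence clears the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    chosen: list[str] = []
    p_fail = 1.0
    for name, p in ranked:
        if len(chosen) >= max_size or 1.0 - p_fail >= threshold:
            break
        chosen.append(name)
        p_fail *= 1.0 - p  # assume独立失败概率(示意性假设)
    return chosen

# 假设的分类器置信度(示意数据)
scores = {"llm_a": 0.85, "llm_b": 0.6, "llm_c": 0.2}
subset = select_llm_subset(scores)
```

在这组虚构分数下,前两个模型的组合置信度已超过阈值,第三个模型被省去,从而体现"用更小子集降低延迟"的动机。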

[NLP-31] Where is the signal in tokenization space?
[NLP-31] 标记化空间中的信号在哪里?

链接: https://arxiv.org/abs/2408.08541
作者: Renato Lui Geh,Honghua Zhang,Kareem Ahmed,Benjie Wang,Guy Van den Broeck
关键词-EN: Large Language Models, Large Language, so-called canonical token, canonical token sequences, Language Models
关键词-ZH: 大型语言模型,大型语言,所谓的规范标记,规范标记序列,语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes Tokens as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
摘要:大型语言模型(LLM)通常带有分词器,它将文本确定性地编码为所谓的规范标记序列,LLM为其分配概率值。一种常见的假设是,一段文本的概率就是其规范标记序列的概率。然而,字符串的标记化并不唯一:例如,Llama2分词器将"Tokens"编码为[Tok,ens],但[Tok,en,s]也表示相同的文本。在本文中,我们研究非规范标记化。我们证明,给定一个字符串,为自回归LLM找到最可能的标记化在计算上是困难的,计算所有可能标记化上的边际概率亦然。随后我们展示,在大多数情况下,边际概率与规范概率难以区分。令人惊讶的是,我们接着通过实验证明标记化空间中隐藏着大量信号。值得注意的是,仅通过聚合非规范标记化的概率,我们就在多种体系结构(包括Transformer和状态空间模型)的一系列LLM评估基准上取得了改进。
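为直观说明"同一字符串存在多种标记化"以及边际概率与规范概率的区别,下面用一个玩具词表枚举全部切分并求和。词表与一元概率均为虚构,仅作演示;真实LLM的概率依赖上下文而非一元分布。

```python
def tokenizations(s: str, vocab: set[str]) -> list[list[str]]:
    """Enumerate every way to segment s into vocabulary pieces."""
    if not s:
        return [[]]
    out: list[list[str]] = []
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            out.extend([piece] + rest for rest in tokenizations(s[i:], vocab))
    return out

vocab = {"Tok", "en", "ens", "s"}
segs = tokenizations("Tokens", vocab)  # 两种切分:[Tok,en,s] 与 [Tok,ens]

# 虚构的一元概率,仅作演示
unigram = {"Tok": 0.5, "en": 0.2, "ens": 0.25, "s": 0.6}

def seq_prob(seq: list[str]) -> float:
    p = 1.0
    for t in seq:
        p *= unigram[t]
    return p

canonical = seq_prob(["Tok", "ens"])           # 规范切分的概率
marginal = sum(seq_prob(seq) for seq in segs)  # 全部切分上的边际概率
```

即使在这个玩具设定中,边际概率也严格大于规范概率,这正是论文所说"隐藏在标记化空间中的信号"的来源。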

[NLP-32] CommunityKG-RAG: Leveraging Community Structures in Knowledge Graphs for Advanced Retrieval-Augmented Generation in Fact-Checking
[NLP-32] CommunityKG-RAG:利用知识图中的社区结构进行事实核查中的高级检索增强生成

链接: https://arxiv.org/abs/2408.08535
作者: Rong-Ching Chang,Jiawei Zhang
关键词-EN: Large Language Models, Language Models, Large Language, provide contextually rich, Graph-Retrieval Augmented Generation
关键词-ZH: 大型语言模型,语言模型,大型语言,提供上下文丰富的图形检索增强生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite advancements in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, their effectiveness is often hindered by a lack of integration with entity relationships and community structures, limiting their ability to provide contextually rich and accurate information retrieval for fact-checking. We introduce CommunityKG-RAG (Community Knowledge Graph-Retrieval Augmented Generation), a novel zero-shot framework that integrates community structures within Knowledge Graphs (KGs) with RAG systems to enhance the fact-checking process. Capable of adapting to new domains and queries without additional training, CommunityKG-RAG utilizes the multi-hop nature of community structures within KGs to significantly improve the accuracy and relevance of information retrieval. Our experimental results demonstrate that CommunityKG-RAG outperforms traditional methods, representing a significant advancement in fact-checking by offering a robust, scalable, and efficient solution.
摘要:尽管大型语言模型(LLM)和检索增强生成(RAG)系统取得了进展,但其有效性往往因缺乏与实体关系和社区结构的整合而受限,从而限制了其为事实核查提供上下文丰富且准确的信息检索的能力。我们提出了CommunityKG-RAG(社区知识图检索增强生成),这是一个新颖的零样本框架,它将知识图(KG)中的社区结构与RAG系统相结合,以增强事实核查过程。CommunityKG-RAG无需额外训练即可适应新的领域和查询,并利用KG中社区结构的多跳特性,显著提高信息检索的准确性和相关性。实验结果表明,CommunityKG-RAG优于传统方法,通过提供稳健、可扩展且高效的解决方案,在事实核查方面取得了显著进步。
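下面是对"利用KG社区结构检索多跳上下文"这一直觉的示意性草图:用连通分量充当"社区检测",再取查询实体所在社区内的全部三元组作为上下文。玩具图谱与该社区划分方式均为示意性假设,与论文的实际方法和数据不同。

```python
from collections import defaultdict

# 玩具知识图谱(示意数据)
EDGES = [("Paris", "capital_of", "France"), ("France", "in", "Europe"),
         ("Tokyo", "capital_of", "Japan")]

def communities(edges):
    """Union-find over entities; connected components stand in for communities."""
    parent: dict[str, str] = {}
    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for h, _, t in edges:
        parent[find(h)] = find(t)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

def retrieve_context(entity, edges):
    """Return all triples inside the community containing the query entity."""
    comm = next(c for c in communities(edges) if entity in c)
    return [e for e in edges if e[0] in comm and e[2] in comm]

ctx = retrieve_context("Paris", EDGES)
```

查询"Paris"会带出同一社区内的两跳事实(巴黎→法国→欧洲),而与另一社区的东京无关,这大致对应论文所述"社区结构的多跳性质"。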

[NLP-33] MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
[NLP-33] MuRAR:用于多模式问题解答的简单有效的多模式检索和答案细化框架

链接: https://arxiv.org/abs/2408.08521
作者: Zhengyuan Zhu,Daniel Lee,Hong Zhang,Sai Sree Harsha,Loic Feujio,Akash Maharaj,Yunyao Li
关键词-EN: demonstrated impressive performance, Recent advancements, retrieval-augmented generation, advancements in retrieval-augmented, demonstrated impressive
关键词-ZH: 表现出令人印象深刻的性能,最近的进步、检索增强一代、检索增强方面的进步,表现出令人印象深刻的
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots and settings such as customer service and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This framework can be easily extended to support multimodal answers in enterprise chatbots with minimal modifications. Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable compared to plain text answers.
摘要:检索增强生成(RAG)的最新进展在问答(QA)任务中表现出令人印象深刻的性能。然而,大多数以前的工作主要集中在基于文本的答案上。虽然一些研究涉及多模式数据,但它们在生成全面的多模式答案方面仍然不足,特别是在解释概念或提供如何实现特定目标的循序渐进的教程方面。此功能对于企业聊天机器人等应用程序以及客户服务和教育系统等环境特别有价值,在这些应用程序中,答案来自多模式数据。在本文中,我们介绍了一个简单而有效的框架–MURAR(多模式检索和答案求精)。Murar通过检索相关的多模式数据并改进响应来创建连贯的多模式答案,从而增强了基于文本的答案。这个框架可以很容易地扩展到支持企业聊天机器人中的多模式答案,只需最少的修改。人类评价结果表明,与纯文本答案相比,Murar生成的多模式答案更有用、更具可读性。

[NLP-34] Ex3: Automatic Novel Writing by Extracting Excelsior and Expanding
[NLP-34] Ex3:通过提取、精益求精与扩展实现自动小说写作

链接: https://arxiv.org/abs/2408.08506
作者: Huang Lei,Jiaming Guo,Guanhua He,Xishan Zhang,Rui Zhang,Shaohui Peng,Shaoli Liu,Tianshi Chen
关键词-EN: Generating long-term texts, Generating long-term, long-term texts, artificial intelligence, Generating
关键词-ZH: 生成长期文本,生成长期、长期文本,人工智能,生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating long-term texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Despite the fact that the generated novels reach a sufficient length, they exhibit poor logical coherence and appeal in their plots and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Extracting Excelsior and Expanding. Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3’s ability to produce higher-quality long-form novels.
摘要:使用人工智能生成小说等长篇文本一直是一个挑战。一种常见的方法是使用大型语言模型(LLM)构建一个先规划后写作的分层框架。尽管生成的小说篇幅足够长,但它们在情节上逻辑连贯性和吸引力较差,在人物和事件刻画上存在不足,最终损害了整体叙事质量。在本文中,我们提出了一种名为Extracting Excelsior and Expanding(Ex3)的方法。Ex3首先从原始小说数据中提取结构信息。通过将这种结构信息与小说数据相结合,精心构建了一个指令遵循数据集。然后利用该数据集对LLM进行微调,以获得精益求精的生成性能。在最后阶段,采用树状扩展方法,以便生成任意长度的小说。与以往方法的对比评估表明,Ex3能够创作出更高质量的长篇小说。

[NLP-35] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
[NLP-35] JPEG-LM:LLM作为具有规范编解码器表示的图像生成器

链接: https://arxiv.org/abs/2408.08459
作者: Xiaochuang Han,Marjan Ghazvininejad,Pang Wei Koh,Yulia Tsvetkov
关键词-EN: potentially easy integration, Recent work, generality and potentially, potentially easy, easy integration
关键词-ZH: 潜在容易的集成、最近的工作、一般性以及潜在的、容易的集成
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization – representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
摘要:最近在图像和视频生成方面的工作一直采用自回归LLM架构,因为它具有通用性,并且可能容易集成到多模态系统中。将语言生成中的自回归训练应用于视觉生成的关键是离散化,即把图像和视频等连续数据表示为离散标记。对图像和视频进行离散化的常见方法包括对原始像素值建模(序列过于冗长),或矢量量化(需要复杂的预先训练)。在这项工作中,我们建议将图像和视频直接建模为通过规范编解码器(如JPEG、AVC/H.264)保存在计算机上的压缩文件。使用默认的Llama架构且不做任何视觉特定修改,我们从头开始预训练JPEG-LM以生成图像(并训练AVC-LM生成视频作为概念验证),方法是直接输出JPEG和AVC格式的压缩文件字节。对图像生成的评估表明,这种简单直接的方法比基于像素的建模和复杂的矢量量化基线更有效(相对后者,我们的方法使FID降低了31%)。分析表明,与矢量量化模型相比,JPEG-LM在生成长尾视觉元素方面具有特别的优势。总体而言,我们表明使用规范编解码器表示有助于降低语言生成与视觉生成之间的壁垒,促进未来对多模态语言/图像/视频LLM的研究。
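JPEG-LM的核心观察可以用几行代码直观说明:压缩文件就是字节流,因此可以把每个字节(取值0-255)当作词表中的一个标记,交给标准自回归LM建模。下例中的前缀是真实的JPEG SOI/APP0标记字节,其余为对论文设定的简化示意。

```python
# 真实的JPEG文件以SOI标记(0xFF 0xD8)起始,随后常跟APP0标记(0xFF 0xE0);
# 把字节流当作标记序列即为论文思路的简化示意。
JPEG_PREFIX = bytes([0xFF, 0xD8, 0xFF, 0xE0])

def bytes_to_tokens(data: bytes) -> list[int]:
    """Treat each byte value (0..255) as one item in a 256-word vocabulary."""
    return list(data)

toks = bytes_to_tokens(JPEG_PREFIX)
```

这样无需为视觉数据训练专门的离散化器,词表天然固定为256个字节值(实际系统中再加上少量特殊标记)。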

[NLP-36] W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering
[NLP-36] W-RAG:RAG中用于开放领域问题解答的弱监督密集检索

链接: https://arxiv.org/abs/2408.08444
作者: Jinming Nian,Zhiyuan Peng,Qifan Wang,Yi Fang
关键词-EN: Large Language Models, Large Language, open-domain question answering, factual answers relying, answers relying solely
关键词-ZH: 大型语言模型、大型语言、开放领域问答、依赖事实答案、仅依赖答案
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In knowledge-intensive tasks such as open-domain question answering (OpenQA), Large Language Models (LLMs) often struggle to generate factual answers relying solely on their internal (parametric) knowledge. To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources, thereby positioning the retriever as a pivotal component. Although dense retrieval demonstrates state-of-the-art performance, its training poses challenges due to the scarcity of ground-truth evidence, largely attributed to the high costs of human annotation. In this paper, we propose W-RAG by utilizing the ranking capabilities of LLMs to create weakly labeled data for training dense retrievers. Specifically, we rerank the top- K passages retrieved via BM25 by assessing the probability that LLMs will generate the correct answer based on the question and each passage. The highest-ranking passages are then used as positive training examples for dense retrieval. Our comprehensive experiments across four publicly available OpenQA datasets demonstrate that our approach enhances both retrieval and OpenQA performance compared to baseline models.
摘要:在开放领域问答(OpenQA)等知识密集型任务中,大型语言模型(LLM)往往难以仅依靠其内部(参数化)知识生成符合事实的答案。为了解决这一局限,检索增强生成(RAG)系统通过从外部来源检索相关信息来增强LLM,从而使检索器成为关键组件。尽管密集检索展示了最先进的性能,但由于真实标注证据(ground-truth evidence)的稀缺(这在很大程度上归因于人工标注的高成本),其训练面临挑战。在本文中,我们提出了W-RAG,利用LLM的排序能力来创建弱标注数据,用于训练密集检索器。具体而言,我们通过评估LLM基于问题和每段文本生成正确答案的概率,对通过BM25检索到的前K段文本进行重排序。然后,将排序最高的段落用作密集检索的正向训练样本。我们在四个公开可用的OpenQA数据集上的综合实验表明,与基准模型相比,我们的方法同时提高了检索和OpenQA的性能。
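下面给出W-RAG式弱标注流程的示意草图:用一个(桩函数)LLM在给定段落条件下生成标准答案的对数概率,对BM25检索到的段落重排序,取最高者作为训练密集检索器的正样本。其中answer_logprob仅是基于词重叠的玩具代理,论文使用真实LLM打分。

```python
import math

def answer_logprob(question: str, passage: str, answer: str) -> float:
    """Toy proxy for an LLM score: log-probability grows with
    answer-token overlap in the passage (stand-in, not the paper's scorer)."""
    overlap = sum(tok in passage.lower() for tok in answer.lower().split())
    return math.log((overlap + 1) / (len(answer.split()) + 1))

def weak_positive(question: str, answer: str, bm25_passages: list[str]) -> str:
    """Rerank BM25 candidates by the (stub) answer likelihood;
    the top passage becomes the weak positive for dense-retriever training."""
    scored = sorted(bm25_passages,
                    key=lambda p: answer_logprob(question, p, answer),
                    reverse=True)
    return scored[0]

passages = ["Paris is the capital of France.", "Berlin has many museums."]
best = weak_positive("What is the capital of France?", "Paris", passages)
```

把answer_logprob换成真实LLM对答案的条件对数概率,即可得到与摘要描述一致的弱监督信号。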

[NLP-37] Rater Cohesion and Quality from a Vicarious Perspective
[NLP-37] 从替代角度看评分者的凝聚力和质量

链接: https://arxiv.org/abs/2408.08411
作者: Deepak Pandita,Tharindu Cyril Weerasooriya,Sujan Dutta,Sarah K. Luger,Tharindu Ranasinghe,Ashiqur R. KhudaBukhsh,Marcos Zampieri,Christopher M. Homan
关键词-EN: Human feedback, content moderation, sentiment analysis, feedback is essential, essential for building
关键词-ZH: 人类反馈、内容审核、情感分析、反馈至关重要,对于建设至关重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human feedback is essential for building human-centered AI systems across domains where disagreement is prevalent, such as AI safety, content moderation, or sentiment analysis. Many disagreements, particularly in politically charged settings, arise because raters have opposing values or beliefs. Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would annotate the data. In this paper, we explore the use of vicarious annotation with analytical methods for moderating rater disagreement. We employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters’ perceptions of offense. Additionally, we utilize CrowdTruth’s rater quality metrics, which consider the demographics of the raters, to score the raters and their annotations. We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and vicarious levels.
摘要:人类反馈对于在分歧普遍存在的领域(例如人工智能安全、内容审核或情感分析)构建以人为本的人工智能系统至关重要。许多分歧,尤其是在政治紧张的环境中,之所以出现,是因为评级者有相反的价值观或信仰。替代注释是一种通过询问评级者认为其他人会如何注释数据来消除分歧的方法。在本文中,我们探讨了使用替代注释和分析方法来缓和评级者的分歧。我们采用评分者凝聚力指标来研究政治派别和人口背景对评分者对冒犯行为的看法的潜在影响。此外,我们利用CrowdTruth的评分者质量指标(考虑了评分者的人口统计数据)对评分者及其注释进行评分。我们研究评级者质量指标如何影响个人和替代层面的组内和跨组评级者凝聚力。

[NLP-38] Zero-Shot Learning and Key Points Are All You Need for Automated Fact-Checking
[NLP-38] 零样本学习和关键点就是自动事实核查所需的一切

链接: https://arxiv.org/abs/2408.08400
作者: Mohammad Ghiasvand Mohammadkhani,Ali Ghiasvand Mohammadkhani,Hamid Beigy
关键词-EN: determining the accurate, accurate status, proposed claim, vast amount, Natural Language Processing
关键词-ZH: 确定准确、准确的状态、提出的索赔、大量、自然语言处理
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated fact-checking is an important task because determining the accurate status of a proposed claim within the vast amount of information available online is a critical challenge. This challenge requires robust evaluation to prevent the spread of false information. Modern large language models (LLMs) have demonstrated high capability in performing a diverse range of Natural Language Processing (NLP) tasks. By utilizing proper prompting strategies, their versatility due to their understanding of large context sizes and zero-shot learning ability enables them to simulate human problem-solving intuition and move towards being an alternative to humans for solving problems. In this work, we introduce a straightforward framework based on Zero-Shot Learning and Key Points (ZSL-KeP) for automated fact-checking, which despite its simplicity, performed well on the AVeriTeC shared task dataset by robustly improving the baseline and achieving 10th place.
摘要:自动化事实核查是一项重要任务,因为在海量在线信息中确定一项主张的准确状态是一项严峻的挑战。这一挑战需要稳健的评估以防止虚假信息的传播。现代大型语言模型(LLM)已表现出执行各种自然语言处理(NLP)任务的强大能力。通过利用适当的提示策略,它们对长上下文的理解和零样本学习能力所带来的多功能性,使其能够模拟人类解决问题的直觉,并朝着替代人类解决问题的方向发展。在这项工作中,我们引入了一个基于零样本学习和关键点的简单框架(ZSL-KeP),用于自动化事实核查。尽管该框架很简单,但它在AVeriTeC共享任务数据集上表现良好,稳健地改善了基线并获得第十名。

[NLP-39] Level Up Your Tutorials: VLMs for Game Tutorials Quality Assessment ECCV2024
[NLP-39] 升级您的教程:用于游戏教程质量评估的VLM

链接: https://arxiv.org/abs/2408.08396
作者: Daniele Rege Cambrin,Gabriele Scaffidi Militone,Luca Colomba,Giovanni Malnati,Daniele Apiletti,Paolo Garza
关键词-EN: complex core mechanics, Designing effective game, smooth learning curve, Designing effective, core mechanics
关键词-ZH: 复杂的核心机制,设计有效的游戏,平滑的学习曲线,设计有效的核心机制
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at ECCV 2024 CV2 Workshop

点击查看摘要

Abstract:Designing effective game tutorials is crucial for a smooth learning curve for new players, especially in games with many rules and complex core mechanics. Evaluating the effectiveness of these tutorials usually requires multiple iterations with testers who have no prior knowledge of the game. Recent Vision-Language Models (VLMs) have demonstrated significant capabilities in understanding and interpreting visual content. VLMs can analyze images, provide detailed insights, and answer questions about their content. They can recognize objects, actions, and contexts in visual data, making them valuable tools for various applications, including automated game testing. In this work, we propose an automated game-testing solution to evaluate the quality of game tutorials. Our approach leverages VLMs to analyze frames from video game tutorials, answer relevant questions to simulate human perception, and provide feedback. This feedback is compared with expected results to identify confusing or problematic scenes and highlight potential errors for developers. In addition, we publish complete tutorial videos and annotated frames from different game versions used in our tests. This solution reduces the need for extensive manual testing, especially by speeding up and simplifying the initial development stages of the tutorial to improve the final game experience.
摘要:设计有效的游戏教程对于新玩家获得平滑的学习曲线至关重要,尤其是在规则众多、核心机制复杂的游戏中。评估这些教程的有效性通常需要与事先不了解游戏的测试人员进行多次迭代。最近的视觉语言模型(VLM)在理解和解释视觉内容方面表现出了显著的能力。VLM可以分析图像,提供详细的见解,并回答有关其内容的问题。它们可以识别视觉数据中的对象、动作和上下文,使其成为各种应用(包括自动游戏测试)的宝贵工具。在这项工作中,我们提出了一个自动化的游戏测试解决方案来评估游戏教程的质量。我们的方法利用VLM分析视频游戏教程中的帧,回答相关问题以模拟人类感知,并提供反馈。将此反馈与预期结果进行比较,以识别令人困惑或有问题的场景,并为开发人员突出潜在的错误。此外,我们还发布了测试中使用的不同游戏版本的完整教程视频和带注释的帧。该解决方案减少了大量手动测试的需要,特别是通过加快和简化教程的初始开发阶段来改善最终的游戏体验。

[NLP-40] Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions
[NLP-40] 迈向逼真的合成用户生成内容:一种生成在线讨论的支架方法

链接: https://arxiv.org/abs/2408.08379
作者: Krisztian Balog,John Palowitch,Barbara Ikica,Filip Radlinski,Hamidreza Alvari,Mehdi Manshadi
关键词-EN: modern machine learning, synthetic data represents, highly private, machine learning, offering a solution
关键词-ZH: 现代机器学习、合成数据代表高度私有的机器学习,提供解决方案
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework.
摘要:合成数据的出现代表了现代机器学习的一个关键转变,它提供了一种解决方案,可以满足真实数据稀缺、高度保密或难以获得的领域中对大量数据的需求。我们调查了为用户生成的内容创建现实的、大规模的合成数据集的可行性,注意到这种内容越来越普遍,也是经常搜索的信息的来源。大型语言模型(LLM)提供了生成合成社交媒体讨论线索的起点,因为它们能够产生具有代表性的在线互动的不同响应。然而,正如我们所证明的那样,直接应用LLMS在捕捉在线讨论的复杂结构方面的成功有限,而且标准的提示机制缺乏足够的控制。因此,我们提出了一个多步骤生成过程,其前提是创建讨论线索的紧凑表示,称为脚手架。我们的框架是通用的,但可以适应特定社交媒体平台的独特特征。我们使用来自两个不同的在线讨论平台的数据来论证其可行性。为了解决确保合成数据的代表性和现实性这一根本挑战,我们提出了一系列评估措施,以比较我们框架的各种实例。
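下面按假设的形式演示"支架(scaffold)"的思路:用话题加立场回复树来紧凑表示讨论串,再展开为占位帖子。字段名与展开方式均为示意性虚构,论文的支架要丰富得多。

```python
from dataclasses import dataclass, field

@dataclass
class Scaffold:
    """Compact thread representation: a topic plus (depth, stance) reply slots.
    示意性结构,并非论文定义。"""
    topic: str
    replies: list = field(default_factory=list)

def expand(scaffold: Scaffold) -> list[str]:
    """Expand the scaffold into placeholder posts; in the paper's multi-step
    process an LLM would fill in realistic text at this stage."""
    thread = [f"[post] topic: {scaffold.topic}"]
    for depth, stance in scaffold.replies:
        thread.append("  " * depth + f"[reply] stance: {stance}")
    return thread

thread = expand(Scaffold("new framework release",
                         [(1, "supportive"), (2, "skeptical")]))
```

先生成紧凑支架、再逐步展开,使得讨论串的整体结构可控,这正是摘要所述"多步生成过程"的动机。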

[NLP-41] Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples
[NLP-41] 评估文本分类对词性对抗示例的鲁棒性

链接: https://arxiv.org/abs/2408.08374
作者: Anahita Samadi,Allison Sullivan
关键词-EN: machine learning systems, safety critical applications, machine learning, text-based adversarial, learning systems
关键词-ZH: 机器学习系统、安全关键应用、机器学习、基于文本的对抗、学习系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As machine learning systems become more widely used, especially for safety critical applications, there is a growing need to ensure that these systems behave as intended, even in the face of adversarial examples. Adversarial examples are inputs that are designed to trick the decision making process, and are intended to be imperceptible to humans. However, for text-based classification systems, changes to the input, a string of text, are always perceptible. Therefore, text-based adversarial examples instead focus on trying to preserve semantics. Unfortunately, recent work has shown this goal is often not met. To improve the quality of text-based adversarial examples, we need to know what elements of the input text are worth focusing on. To address this, in this paper, we explore what parts of speech have the highest impact of text-based classifiers. Our experiments highlight a distinct bias in CNN algorithms against certain parts of speech tokens within review datasets. This finding underscores a critical vulnerability in the linguistic processing capabilities of CNNs.
摘要:随着机器学习系统得到越来越广泛的应用,尤其是在安全关键的应用中,越来越需要确保这些系统按预期运行,即使面对敌意的例子也是如此。对抗性的例子是被设计来欺骗决策过程的输入,目的是让人类察觉不到。然而,对于基于文本的分类系统,输入文本字符串的变化总是可以察觉到的。因此,基于文本的对抗性例子反而侧重于试图保留语义。不幸的是,最近的研究表明,这一目标往往无法实现。为了提高基于文本的对抗性例子的质量,我们需要知道输入文本的哪些元素值得关注。为了解决这一问题,本文探讨了哪些词性对基于文本的分类器的影响最大。我们的实验突出了CNN算法对评论数据集中的某些词性标记的明显偏见。这一发现突显了CNN语言处理能力的一个严重弱点。
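下面是一个示意性的评测框架草图(并非论文的流程):只替换指定词性的词,观察玩具分类器的判定是否翻转。其中的词性表、同义词表与情感分类器都是刻意简化的替身。

```python
# 玩具词性表、同义替换表与情感词表——均为简化替身,非真实标注器
POS = {"great": "ADJ", "terrible": "ADJ", "movie": "NOUN", "plot": "NOUN"}
SYNONYM = {"great": "fine", "terrible": "awful"}
POSITIVE_WORDS = {"great", "fine"}

def classify(text: str) -> str:
    """Toy sentiment classifier: positive iff any positive word appears."""
    return "pos" if any(w in POSITIVE_WORDS for w in text.split()) else "neg"

def perturb_pos(text: str, target_pos: str) -> str:
    """Substitute synonyms only for tokens whose POS matches target_pos."""
    return " ".join(SYNONYM.get(w, w) if POS.get(w) == target_pos else w
                    for w in text.split())

orig = "great movie plot"
adv = perturb_pos(orig, "ADJ")            # 只扰动形容词
robust = classify(orig) == classify(adv)  # 判定是否保持不变
```

按词性分组统计判定翻转率,即可得到摘要所述"不同词性对基于文本的分类器影响"的对比。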

[NLP-42] Plan with Code: Comparing approaches for robust NL to DSL generation
[NLP-42] 带代码的计划:比较稳健的NL到DSL生成方法

链接: https://arxiv.org/abs/2408.08335
作者: Nastaran Bassamzadeh,Chhaya Methani
关键词-EN: code, Natural Language, Domain Specific Languages, function, generated via Natural
关键词-ZH: 代码、自然语言、领域特定语言、函数、通过自然生成
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 1 figure, 5 tables. arXiv admin note: substantial text overlap with arXiv:2407.02742

点击查看摘要

Abstract:Planning in code is considered a more reliable approach for many orchestration tasks. This is because code is more tractable than steps generated via Natural Language and make it easy to support more complex sequences by abstracting deterministic logic into functions. It also allows spotting issues with incorrect function names with the help of parsing checks that can be run on code. Progress in Code Generation methodologies, however, remains limited to general-purpose languages like C, C++, and Python. LLMs continue to face challenges with custom function names in Domain Specific Languages or DSLs, leading to higher hallucination rates and syntax errors. This is more common for custom function names, that are typically part of the plan. Moreover, keeping LLMs up-to-date with newer function names is an issue. This poses a challenge for scenarios like task planning over a large number of APIs, since the plan is represented as a DSL having custom API names. In this paper, we focus on workflow automation in RPA (Robotic Process Automation) domain as a special case of task planning. We present optimizations for using Retrieval Augmented Generation (or RAG) with LLMs for DSL generation along with an ablation study comparing these strategies with a fine-tuned model. Our results showed that the fine-tuned model scored the best on code similarity metric. However, with our optimizations, RAG approach is able to match the quality for in-domain API names in the test set. Additionally, it offers significant advantage for out-of-domain or unseen API names, outperforming Fine-Tuned model on similarity metric by 7 pts.
摘要:对于许多编排任务,在代码中进行规划被认为是一种更可靠的方法。这是因为代码比通过自然语言生成的步骤更容易处理,并通过将确定性逻辑抽象为函数使其更容易支持更复杂的序列。它还允许通过可以在代码上运行的解析检查来发现函数名称不正确的问题。然而,代码生成方法方面的进展仍然局限于C、C++和Python等通用语言。LLM继续面临使用域特定语言或DSL的自定义函数名称的挑战,导致更高的幻觉率和语法错误。这对于通常是计划一部分的自定义函数名称来说更为常见。此外,使用较新的函数名称使LLM保持最新也是一个问题。这给大量API上的任务规划之类的场景带来了挑战,因为计划被表示为具有定制API名称的DSL。本文以机器人过程自动化领域的工作流自动化为研究对象,将其作为任务规划的特例。我们提出了将检索增强生成(RAG)与LLMS一起用于DSL生成的优化方案,并进行了消融研究,将这些策略与微调模型进行了比较。结果表明,微调后的模型在代码相似性度量上得分最高。然而,通过我们的优化,RAG方法能够匹配测试集中的域内API名称的质量。此外,它为域外或看不见的API名称提供了显著的优势,在相似性度量方面比Fine-Tuned模型高出7个百分点。
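RAG用于DSL生成的核心思路可以用一个极简检索器示意:在提示LLM之前,先按与用户请求的词重叠检索候选API名,使自定义函数名有据可依而非凭空幻觉。其中基于重叠的检索器与API目录均为示意性假设。

```python
import re

API_CATALOG = ["SendEmail", "CreateCalendarEvent", "PostSlackMessage"]  # 示意性API目录

def camel_tokens(name: str) -> set[str]:
    """Split a CamelCase API name into lowercase word tokens."""
    return {t.lower() for t in re.findall(r"[A-Z][a-z]+|[a-z]+", name)}

def retrieve_apis(query: str, k: int = 2) -> list[str]:
    """Rank catalog APIs by word overlap with the query; the top-k names
    would be inserted into the prompt to ground DSL generation."""
    q = {w.lower() for w in query.split()}
    return sorted(API_CATALOG, key=lambda a: len(camel_tokens(a) & q),
                  reverse=True)[:k]

hits = retrieve_apis("send an email to the team")
```

这种检索式做法也解释了摘要中的优势来源:新增API只需更新目录即可被检索到,而无需重新微调模型。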

[NLP-43] CodeMirage: Hallucinations in Code Generated by Large Language Models IJCAI2024
[NLP-43] CodeMirage:大型语言模型生成的代码中的幻觉

链接: https://arxiv.org/abs/2408.08333
作者: Vibhor Agarwal,Yulong Pei,Salwa Alamir,Xiaomo Liu
关键词-EN: Large Language Models, Large Language, shown promising potentials, Language Models, code
关键词-ZH: 大型语言模型,大型语言,显示出有前途的潜力,语言模型,代码
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at AutoMates @ IJCAI 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising potentials in program generation and no-code automation. However, LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, similar hallucination phenomenon can happen in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adaptation of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5 generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose the methodology for code hallucination detection and experiment with open source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using one-shot prompt. We find that GPT-4 performs the best on HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.
摘要:大型语言模型(LLM)在程序生成和无代码自动化方面显示出巨大的潜力。然而,LLM容易产生幻觉,即生成听起来合理但不正确的文本。虽然最近关于文本生成中LLM幻觉的研究激增,但类似的幻觉现象也可能发生在代码生成中。有时,生成的代码可能存在语法或逻辑错误,以及更高级的问题,如安全漏洞、内存泄漏等。鉴于LLM被广泛用于提高代码生成和开发的效率,调查代码生成中的幻觉变得势在必行。据我们所知,这是研究LLM生成的代码中幻觉的首次尝试。我们首先介绍代码幻觉的定义和代码幻觉类型的全面分类。我们提出了首个代码幻觉基准数据集CodeMirage。该基准包含1,137个由GPT-3.5生成的带幻觉代码片段,针对来自HumanEval和MBPP两个基础数据集的Python编程问题。然后,我们提出了代码幻觉检测的方法,并使用开源LLM(如CodeLLaMA)以及OpenAI的GPT-3.5和GPT-4模型进行了一次性提示实验。我们发现GPT-4在HumanEval数据集上表现最好,并且在MBPP数据集上的结果与经过微调的CodeBERT基线相当。最后,我们讨论了各种缓解代码幻觉的策略,并总结了我们的工作。

[NLP-44] ConcateNet: Dialogue Separation Using Local And Global Feature Concatenation
[NLP-44] ConcateNet:基于局部与全局特征连接的对话分离

链接: https://arxiv.org/abs/2408.08729
作者: Mhd Modar Halimeh,Matteo Torcoli,Emanuël Habets
关键词-EN: separation involves isolating, Dialogue separation involves, involves isolating, Dialogue separation, separation involves
关键词-ZH: 分离涉及隔离,对话分离涉及,涉及隔离,对话分离,分离涉及
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Dialogue separation involves isolating a dialogue signal from a mixture, such as a movie or a TV program. This can be a necessary step to enable dialogue enhancement for broadcast-related applications. In this paper, ConcateNet for dialogue separation is proposed, which is based on a novel approach for processing local and global features aimed at better generalization for out-of-domain signals. ConcateNet is trained using a noise reduction-focused, publicly available dataset and evaluated using three datasets: two noise reduction-focused datasets (in-domain), which show competitive performance for ConcateNet, and a broadcast-focused dataset (out-of-domain), which verifies the better generalization performance for the proposed architecture compared to considered state-of-the-art noise-reduction methods.
摘要:对话分离是指从混合信号(例如电影或电视节目)中分离出对话信号,这可能是为广播相关应用实现对话增强的必要步骤。本文提出了用于对话分离的ConcateNet,它基于一种处理局部和全局特征的新方法,旨在对域外信号实现更好的泛化。ConcateNet在一个以降噪为重点的公开数据集上训练,并在三个数据集上评估:在两个以降噪为重点的数据集(域内)上,ConcateNet表现出有竞争力的性能;在一个以广播为重点的数据集(域外)上,验证了所提架构相比现有最先进的降噪方法具有更好的泛化性能。
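The core idea of combining frame-level (local) features with a signal-level (global) feature by concatenation can be sketched in NumPy; the shapes and the broadcast-then-concatenate scheme are assumptions for illustration, not ConcateNet's actual architecture:

```python
import numpy as np

def concat_local_global(local_feats, global_feat):
    """Concatenate per-frame local features (T, C_local) with a single
    global feature (C_global,) broadcast over time -> (T, C_local + C_global)."""
    T = local_feats.shape[0]
    tiled = np.tile(global_feat, (T, 1))  # repeat the global vector for every frame
    return np.concatenate([local_feats, tiled], axis=1)

local = np.random.randn(100, 32)   # 100 frames, 32 local channels (toy shapes)
glob = np.random.randn(8)          # 8 global channels
fused = concat_local_global(local, glob)
print(fused.shape)  # (100, 40)
```

In a real separation network the fused tensor would feed subsequent layers that predict the dialogue mask or signal.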

人工智能

[AI-0] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

链接: https://arxiv.org/abs/2408.08872
作者: Le Xue,Manli Shu,Anas Awadalla,Jun Wang,An Yan,Senthil Purushwalkam,Honglu Zhou,Viraj Prabhu,Yutong Dai,Michael S Ryoo,Shrikant Kendre,Jieyu Zhang,Can Qin,Shu Zhang,Chia-Chih Chen,Ning Yu,Juntao Tan,Tulika Manoj Awalgaonkar,Shelby Heinecke,Huan Wang,Yejin Choi,Ludwig Schmidt,Zeyuan Chen,Silvio Savarese,Juan Carlos Niebles,Caiming Xiong,Ran Xu
关键词-EN: developing Large Multimodal, Large Multimodal Models, Large Multimodal, developing Large, Multimodal Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

[AI-1] GeoTransformer: Enhancing Urban Forecasting with Geospatial Attention Mechanisms

链接: https://arxiv.org/abs/2408.08852
作者: Yuhao Jia,Zile Wu,Shengao Yi,Yifei Sun
关键词-EN: integrating sociodemographic data, Recent advancements, notable efforts dedicated, high-dimensional spaces, satellite imagery
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements have focused on encoding urban spatial information into high-dimensional spaces, with notable efforts dedicated to integrating sociodemographic data and satellite imagery. These efforts have established foundational models in this field. However, the effective utilization of these spatial representations for urban forecasting applications remains under-explored. To address this gap, we introduce GeoTransformer, a novel structure that synergizes the Transformer architecture with geospatial statistics prior. GeoTransformer employs an innovative geospatial attention mechanism to incorporate extensive urban information and spatial dependencies into a unified predictive model. Specifically, we compute geospatial weighted attention scores between the target region and surrounding regions and leverage the integrated urban information for predictions. Extensive experiments on GDP and ride-share demand prediction tasks demonstrate that GeoTransformer significantly outperforms existing baseline models, showcasing its potential to enhance urban forecasting tasks.
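A minimal sketch of distance-aware attention from a target region to its surrounding regions follows; the additive distance-decay bias and all names here are illustrative, and the paper's actual geospatial weighting may differ:

```python
import numpy as np

def geospatial_attention(q, K, V, dists, decay=0.1):
    """Dot-product attention from one target region (query q) to
    surrounding regions (keys K, values V), biased so that farther
    regions (larger dists) receive lower weight."""
    scores = K @ q / np.sqrt(q.size) - decay * dists  # penalise far regions
    w = np.exp(scores - scores.max())                 # stable softmax
    w /= w.sum()
    return w @ V, w

q = np.ones(4)                      # target-region embedding (toy)
K = np.random.randn(5, 4)           # 5 surrounding regions
V = np.random.randn(5, 3)
dists = np.array([0.0, 1.0, 2.0, 5.0, 10.0])  # km to each neighbour (hypothetical)
out, w = geospatial_attention(q, K, V, dists)
print(out.shape, round(float(w.sum()), 6))  # (3,) 1.0
```

The aggregated vector `out` would then feed a prediction head for the target quantity (e.g. GDP or ride-share demand).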

[AI-2] Optimal Symmetries in Binary Classification

链接: https://arxiv.org/abs/2408.08823
作者: Vishal S. Ngairangbam,Michael Spannowsky
关键词-EN: binary classification tasks, Neyman-Pearson optimality, explore the role, framework that leverages, leverages the principles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
备注: 13 pages, 1 figure, 2 tables

点击查看摘要

Abstract:We explore the role of group symmetries in binary classification tasks, presenting a novel framework that leverages the principles of Neyman-Pearson optimality. Contrary to the common intuition that larger symmetry groups lead to improved classification performance, our findings show that selecting the appropriate group symmetries is crucial for optimising generalisation and sample efficiency. We develop a theoretical foundation for designing group equivariant neural networks that align the choice of symmetries with the underlying probability distributions of the data. Our approach provides a unified methodology for improving classification accuracy across a broad range of applications by carefully tailoring the symmetry group to the specific characteristics of the problem. Theoretical analysis and experimental results demonstrate that optimal classification performance is not always associated with the largest equivariant groups possible in the domain, even when the likelihood ratio is invariant under one of its proper subgroups, but rather with those subgroups themselves. This work offers insights and practical guidelines for constructing more effective group equivariant architectures in diverse machine-learning contexts.

[AI-3] EasyRec: Simple yet Effective Language Models for Recommendation

链接: https://arxiv.org/abs/2408.08821
作者: Xubin Ren,Chao Huang
关键词-EN: Deep neural networks, user-item interaction data, Deep neural, neural networks, powerful technique
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep neural networks have become a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which limits their ability to perform well in practical zero-shot learning scenarios where sufficient training data may be unavailable. Inspired by the success of language models (LMs) and their strong generalization capabilities, a crucial question arises: How can we harness the potential of language models to empower recommender systems and elevate its generalization capabilities to new heights? In this study, we propose EasyRec - an effective and easy-to-use approach that seamlessly integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework, which combines contrastive learning with collaborative language model tuning, to ensure a strong alignment between the text-enhanced semantic space and the collaborative behavior information. Extensive empirical evaluations across diverse real-world datasets demonstrate the superior performance of EasyRec compared to state-of-the-art alternative models, particularly in the challenging text-based zero-shot recommendation scenarios. Furthermore, the study highlights the potential of seamlessly integrating EasyRec as a plug-and-play component into text-enhanced collaborative filtering frameworks, thereby empowering existing recommender systems to elevate their recommendation performance and adapt to the evolving user preferences in dynamic environments. For better result reproducibility of our EasyRec framework, the model implementation details, source code, and datasets are available at the link: this https URL.
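The text-behavior alignment via contrastive learning can be sketched as a symmetric InfoNCE loss over row-matched text and collaborative embeddings; this is a generic formulation under assumed shapes, not EasyRec's exact objective:

```python
import numpy as np

def info_nce(text_emb, cf_emb, temp=0.1):
    """Symmetric InfoNCE loss aligning row-matched text and collaborative
    embeddings of shape (N, D); lower means better aligned."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    logits = norm(text_emb) @ norm(cf_emb).T / temp  # (N, N) scaled cosine sims
    def ce(lg):  # cross-entropy with the diagonal as the positive targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
aligned = info_nce(x, x)                         # perfectly matched pairs
mismatched = info_nce(x, rng.normal(size=(4, 8)))
print(aligned < mismatched)  # True
```

Minimizing such a loss pulls each item's text embedding toward its own collaborative embedding and away from the other items in the batch.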

[AI-4] Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

链接: https://arxiv.org/abs/2408.08808
作者: Ravi Raju,Swayambhoo Jain,Bo Li,Jonathan Li,Urmish Thakkar
关键词-EN: Large Language Models, Large Language, real-world applications, revolutionized the landscape, landscape of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark’s usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC and Arena-Hard v0.1 are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, agreement (84%) with Chatbot Arena, and a Spearman correlation of 0.915. The agreement values are 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
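The Spearman correlation reported between benchmark rankings and Chatbot Arena can be computed with the standard rank-difference formula (this sketch assumes no tied scores):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no ties),
    e.g. a benchmark's model scores vs. Chatbot Arena scores."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 (same ordering)
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 (reversed ordering)
```

A value near 0.915, as reported, indicates the two leaderboards order models almost identically.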

[AI-5] CIKMar: A Dual-Encoder Approach to Prompt-Based Reranking in Educational Dialogue Systems

链接: https://arxiv.org/abs/2408.08805
作者: Joanito Agili Lopo,Marina Indah Prasasti,Alma Permatasari
关键词-EN: dialogue systems powered, Gemma Language model, language model size, Language model, smaller language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper is the result of the final project of the Natural Language Processing course, Master of Artificial Intelligence, Universitas Gadjah Mada

点击查看摘要

Abstract:In this study, we introduce CIKMar, an efficient approach to educational dialogue systems powered by the Gemma Language model. By leveraging a Dual-Encoder ranking system that incorporates both BERT and SBERT model, we have designed CIKMar to deliver highly relevant and accurate responses, even with the constraints of a smaller language model size. Our evaluation reveals that CIKMar achieves a robust recall and F1-score of 0.70 using BERTScore metrics. However, we have identified a significant challenge: the Dual-Encoder tends to prioritize theoretical responses over practical ones. These findings underscore the potential of compact and efficient models like Gemma in democratizing access to advanced educational AI systems, ensuring effective and contextually appropriate responses.
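A dual-encoder reranker of the kind described can be sketched as a weighted combination of two relevance scores; the scores and the weighting scheme below are hypothetical stand-ins for the BERT and SBERT components:

```python
import numpy as np

def dual_encoder_rerank(scores_a, scores_b, alpha=0.5):
    """Rerank candidate responses by a weighted sum of two encoders'
    relevance scores; returns candidate indices, best first."""
    combined = alpha * np.asarray(scores_a) + (1 - alpha) * np.asarray(scores_b)
    return [int(i) for i in np.argsort(-combined)]

bert_like  = [0.2, 0.9, 0.5]   # hypothetical relevance scores from encoder A
sbert_like = [0.3, 0.4, 0.9]   # hypothetical relevance scores from encoder B
print(dual_encoder_rerank(bert_like, sbert_like))  # [2, 1, 0]
```

The top-ranked candidate would then be returned as the system's response.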

[AI-6] A Transparency Paradox? Investigating the Impact of Explanation Specificity and Autonomous Vehicle Perceptual Inaccuracies on Passengers

链接: https://arxiv.org/abs/2408.08785
作者: Daniel Omeiza,Raunak Bhattacharyya,Marina Jirotka,Nick Hawes,Lars Kunze
关键词-EN: provision of intelligible, Transparency, explanations, perception system, perception
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Submitted to Transportation Research Part F: Traffic Psychology and Behaviour. arXiv admin note: text overlap with arXiv:2307.00633

点击查看摘要

Abstract:Transparency in automated systems could be afforded through the provision of intelligible explanations. While transparency is desirable, might it lead to catastrophic outcomes (such as anxiety), that could outweigh its benefits? It’s quite unclear how the specificity of explanations (level of transparency) influences recipients, especially in autonomous driving (AD). In this work, we examined the effects of transparency mediated through varying levels of explanation specificity in AD. We first extended a data-driven explainer model by adding a rule-based option for explanation generation in AD, and then conducted a within-subject lab study with 39 participants in an immersive driving simulator to study the effect of the resulting explanations. Specifically, our investigation focused on: (1) how different types of explanations (specific vs. abstract) affect passengers’ perceived safety, anxiety, and willingness to take control of the vehicle when the vehicle perception system makes erroneous predictions; and (2) the relationship between passengers’ behavioural cues and their feelings during the autonomous drives. Our findings showed that passengers felt safer with specific explanations when the vehicle’s perception system had minimal errors, while abstract explanations that hid perception errors led to lower feelings of safety. Anxiety levels increased when specific explanations revealed perception system errors (high transparency). We found no significant link between passengers’ visual patterns and their anxiety levels. Our study suggests that passengers prefer clear and specific explanations (high transparency) when they originate from autonomous vehicles (AVs) with optimal perceptual accuracy.

[AI-7] Evaluating the Evaluator: Measuring LLMs Adherence to Task Evaluation Instructions

链接: https://arxiv.org/abs/2408.08781
作者: Bhuvanashree Murugadoss,Christian Poelitz,Ian Drosos,Vu Le,Nick McKenna,Carina Suzana Negreanu,Chris Parnin,Advait Sarkar
关键词-EN: recently popularized method, replaces human judgements, Reinforcement Learning, recently popularized, human judgements
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.
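Using perplexity as a prompt-free quality measure, as the paper compares against, reduces to exponentiating the mean negative token log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a text from its per-token log-probabilities
    (natural log): exp of the mean negative log-likelihood."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

confident = [-0.1, -0.2, -0.1, -0.1]   # model finds this text likely
uncertain = [-2.0, -3.0, -2.5, -2.8]   # model finds this text unlikely
print(perplexity(confident) < perplexity(uncertain))  # True
```

Lower perplexity is taken as higher textual quality; the per-token log-probabilities would come from the scoring language model.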

[AI-8] Pessimistic Iterative Planning for Robust POMDPs

链接: https://arxiv.org/abs/2408.08770
作者: Maris F. L. Galesloot,Marnix Suilen,Thiago D. Simão,Steven Carr,Matthijs T. J. Spaan,Ufuk Topcu,Nils Jansen
关键词-EN: Markov decision processes, partially observable Markov, observable Markov decision, extend classical POMDPs, observable Markov
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs to handle additional uncertainty on the transition and observation probabilities via so-called uncertainty sets. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network trained using supervision policies optimized for the adversarial POMDP. The empirical evaluation in four benchmark environments showcases improved robustness against a baseline method in an ablation study and competitive performance compared to a state-of-the-art robust POMDP solver.

[AI-9] Symbolic Parameter Learning in Probabilistic Answer Set Programming

链接: https://arxiv.org/abs/2408.08732
作者: Damiano Azzolini,Elisabetta Gentili,Fabrizio Riguzzi
关键词-EN: Relational Artificial Intelligence, Statistical Relational Artificial, Artificial Intelligence, probabilistic logic program, Statistical Relational
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: The paper has been accepted at the ICLP2024 conference and is under consideration in Theory and Practice of Logic Programming (TPLP)

点击查看摘要

Abstract:Parameter learning is a crucial task in the field of Statistical Relational Artificial Intelligence: given a probabilistic logic program and a set of observations in the form of interpretations, the goal is to learn the probabilities of the facts in the program such that the probabilities of the interpretations are maximized. In this paper, we propose two algorithms to solve such a task within the formalism of Probabilistic Answer Set Programming, both based on the extraction of symbolic equations representing the probabilities of the interpretations. The first solves the task using an off-the-shelf constrained optimization solver while the second is based on an implementation of the Expectation Maximization algorithm. Empirical results show that our proposals often outperform existing approaches based on projected answer set enumeration in terms of quality of the solution and in terms of execution time. The paper has been accepted at the ICLP2024 conference and is under consideration in Theory and Practice of Logic Programming (TPLP).

[AI-10] Correspondence-Guided SfM-Free 3D Gaussian Splatting for NVS

链接: https://arxiv.org/abs/2408.08723
作者: Wei Sun,Xiaosong Zhang,Fang Wan,Yanzhao Zhou,Yuan Li,Qixiang Ye,Jianbin Jiao
关键词-EN: View Synthesis, variable operating conditions, promoting rapid response, rapid response capabilities, pre-processed camera poses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2312.07504 by other authors

点击查看摘要

Abstract:Novel View Synthesis (NVS) without Structure-from-Motion (SfM) pre-processed camera poses–referred to as SfM-free methods–is crucial for promoting rapid response capabilities and enhancing robustness against variable operating conditions. Recent SfM-free methods have integrated pose optimization, designing end-to-end frameworks for joint camera pose estimation and NVS. However, most existing works rely on per-pixel image loss functions, such as L2 loss. In SfM-free methods, inaccurate initial poses lead to misalignment issue, which, under the constraints of per-pixel image loss functions, results in excessive gradients, causing unstable optimization and poor convergence for NVS. In this study, we propose a correspondence-guided SfM-free 3D Gaussian splatting for NVS. We use correspondences between the target and the rendered result to achieve better pixel alignment, facilitating the optimization of relative poses between frames. We then apply the learned poses to optimize the entire scene. Each 2D screen-space pixel is associated with its corresponding 3D Gaussians through approximated surface rendering to facilitate gradient back propagation. Experimental results underline the superior performance and time efficiency of the proposed approach compared to the state-of-the-art baselines.

[AI-11] Beyond KAN: Introducing KarSein for Adaptive High-Order Feature Interaction Modeling in CTR Prediction

链接: https://arxiv.org/abs/2408.08713
作者: Yunxiao Shi,Wujiang Wu,Mingyu Jin,Haimin Zhang,Qiang Wu,Yongfeng Zhang,Min Xu
关键词-EN: click-through rate, crucial for click-through, high-order explicit interactions, modeling high-order, interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: KarSein for CTR

点击查看摘要

Abstract:Modeling feature interactions is crucial for click-through rate (CTR) prediction, particularly when it comes to high-order explicit interactions. Traditional methods struggle with this task because they often predefine a maximum interaction order, which relies heavily on prior knowledge and can limit the model’s effectiveness. Additionally, modeling high-order interactions typically leads to increased computational costs. Therefore, the challenge lies in adaptively modeling high-order feature interactions while maintaining efficiency. To address this issue, we introduce Kolmogorov-Arnold Represented Sparse Efficient Interaction Network (KarSein), designed to optimize both predictive accuracy and computational efficiency. We firstly identify limitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and then introduce KarSein to overcome these issues. It features a novel architecture that reduces the computational costs of KAN and supports embedding vectors as feature inputs. Additionally, KarSein employs guided symbolic regression to address the challenge of KAN in spontaneously learning multiplicative relationships. Extensive experiments demonstrate KarSein’s superior performance, achieving significant predictive accuracy with minimal computational overhead. Furthermore, KarSein maintains strong global explainability while enabling the removal of redundant features, resulting in a sparse network structure. These advantages also position KarSein as a promising method for efficient inference.

[AI-12] Beam Prediction based on Large Language Models

链接: https://arxiv.org/abs/2408.08707
作者: Yucheng Sheng,Kai Huang,Le Liang,Peng Liu,Shi Jin,Geoffrey Ye Li
关键词-EN: significant path loss, requiring extensive antenna, extensive antenna arrays, frequent beam training, next-generation wireless networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) communication is promising for next-generation wireless networks but suffers from significant path loss, requiring extensive antenna arrays and frequent beam training. Traditional deep learning models, such as long short-term memory (LSTM), enhance beam tracking accuracy however are limited by poor robustness and generalization. In this letter, we use large language models (LLMs) to improve the robustness of beam prediction. By converting time series data into text-based representations and employing the Prompt-as-Prefix (PaP) technique for contextual enrichment, our approach unleashes the strength of LLMs for time series forecasting. Simulation results demonstrate that our LLM-based method offers superior robustness and generalization compared to LSTM-based models, showcasing the potential of LLMs in wireless communications.
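Converting a beam-index time series into text with a task description prepended, in the spirit of the Prompt-as-Prefix technique mentioned above, can be sketched as follows; the template is an illustrative assumption, not the paper's:

```python
def series_to_prompt(history, prefix="Predict the next beam index."):
    """Serialise a beam-index time series into a text prompt, with the
    task description prepended as a prefix (template is illustrative)."""
    body = ", ".join(str(b) for b in history)
    return f"{prefix}\nRecent beam indices: {body}\nNext:"

print(series_to_prompt([12, 12, 13, 14]))
```

The LLM's continuation after "Next:" would be parsed back into a beam index prediction.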

[AI-13] Beyond the Hype: A dispassionate look at vision-language models in medical scenario

链接: https://arxiv.org/abs/2408.08704
作者: Yang Nan,Huichi Zhou,Xiaodan Xing,Guang Yang
关键词-EN: Large Vision-Language Models, garnering significant attention, Recent advancements, Vision-Language Models, demonstrated remarkable capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments over-concentrate in evaluating VLMs based on simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristic of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models’ ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models’ spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models’ capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models’ capabilities against unharmonised and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be available after the acceptance of this paper.

[AI-14] NFDI4DSO: Towards a BFO Compliant Ontology for Data Science

链接: https://arxiv.org/abs/2408.08698
作者: Genet Asefa Gesese,Jörg Waitelonis,Zongxiong Chen,Sonja Schimmler,Harald Sack
关键词-EN: connecting digital artifacts, Artificial Intelligence, Data Science, adhere to FAIR, research data
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The NFDI4DataScience (NFDI4DS) project aims to enhance the accessibility and interoperability of research data within Data Science (DS) and Artificial Intelligence (AI) by connecting digital artifacts and ensuring they adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles. To this end, this poster introduces the NFDI4DS Ontology, which describes resources in DS and AI and models the structure of the NFDI4DS consortium. Built upon the NFDICore ontology and mapped to the Basic Formal Ontology (BFO), this ontology serves as the foundation for the NFDI4DS knowledge graph currently under development.

[AI-15] Quantifying the Effectiveness of Student Organization Activities using Natural Language Processing

链接: https://arxiv.org/abs/2408.08694
作者: Lyberius Ennio F. Taruc,Arvin R. De La Cruz
关键词-EN: extracurricular activities play, students’ educational experiences, Natural Language Processing, Student extracurricular activities, Large Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 11 pages, 4 figures, presented at the International Conference on Generative AI and its Applications (ICGAIA-24), 22nd-23rd July 2024, Jakarta, Indonesia

点击查看摘要

Abstract:Student extracurricular activities play an important role in enriching the students’ educational experiences. With the increasing popularity of Machine Learning and Natural Language Processing, it becomes a logical step that incorporating ML-NLP in improving extracurricular activities is a potential focus of study in Artificial Intelligence (AI). This research study aims to develop a machine learning workflow that will quantify the effectiveness of student-organized activities based on student emotional responses using sentiment analysis. The study uses the Bidirectional Encoder Representations from Transformers (BERT) Large Language Model (LLM) called via the pysentimiento toolkit, as a Transformer pipeline in Hugging Face. A sample data set from Organization C, a Recognized Student Organization (RSO) of a higher educational institute in the Philippines, College X, was used to develop the workflow. The workflow consisted of data preprocessing, key feature selection, LLM feature processing, and score aggregation, resulting in an Event Score for each data set. The results show that the BERT LLM can also be used effectively in analyzing sentiment beyond product reviews and post comments. For the student affairs offices of educational institutions, this study can provide a practical example of how NLP can be applied to real-world scenarios, showcasing the potential impact of data-driven decision making.
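The final score-aggregation step, mapping per-comment sentiment labels to a single Event Score, might look like the following; the label names follow pysentimiento's POS/NEU/NEG convention, but the aggregation rule is an assumption for illustration, not the paper's formula:

```python
def event_score(sentiments):
    """Aggregate per-comment sentiment labels into one Event Score
    in [-1, 1] (aggregation rule is illustrative)."""
    weight = {"POS": 1.0, "NEU": 0.0, "NEG": -1.0}  # pysentimiento-style labels
    return sum(weight[s] for s in sentiments) / len(sentiments)

print(event_score(["POS", "POS", "NEU", "NEG"]))  # 0.25
```

A score near 1 would indicate a strongly positive emotional response to the activity, near -1 a strongly negative one.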

[AI-16] he Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation

链接: https://arxiv.org/abs/2408.08688
作者: Samee Arif,Sualeha Farid,Abdul Hameed Azeemi,Awais Athar,Agha Ali Raza
关键词-EN: synthetic Preference Optimization, Preference Optimization, synthetic Preference, LLM Feedback Loop, response evaluation module
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents and evaluates multi-agent workflows for synthetic Preference Optimization (PO) dataset generation. PO dataset generation requires two modules: (1) response evaluation, and (2) response generation. In the response evaluation module, the responses from Large Language Models (LLMs) are evaluated and ranked - a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a 2 step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. In each step, we use inter-rater agreement using Cohen’s Kappa between human annotators and LLMs. For the response generation module, we compare different configurations for the LLM Feedback Loop using the identified LLM evaluator configuration. We use the win rate (the fraction of times a generation framework is selected as the best by an LLM evaluator) to determine the best multi-agent configuration for generation. After identifying the best configurations for both modules, we use models from the GPT, Gemma, and Llama families to generate our PO datasets using the above pipeline. We generate two types of PO datasets, one to improve the generation capabilities of individual LLM and the other to improve the multi-agent workflow. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across datasets when the candidate responses do not include responses from the GPT family. Additionally, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-agent Llama and Gemma, respectively.
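Inter-rater agreement via Cohen's Kappa, as used above between human annotators and LLM judges, can be computed directly from two label sequences:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n                     # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (po - pe) / (1 - pe)

human = ["good", "bad", "good", "good"]   # toy annotations
llm   = ["good", "bad", "bad", "good"]
print(cohens_kappa(human, llm))  # 0.5
```

Kappa corrects raw agreement for what two raters would match on by chance, so it is stricter than simple accuracy.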

[AI-17] SC-Rec: Enhancing Generative Retrieval with Self-Consistent Reranking for Sequential Recommendation

链接: https://arxiv.org/abs/2408.08686
作者: Tongyoung Kim,Soojin Yoon,Seongku Kang,Jinyoung Yeo,Dongha Lee
关键词-EN: advanced language understanding, advanced language, language understanding, generation capabilities, recommendation systems due
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language Models (LMs) are increasingly employed in recommendation systems due to their advanced language understanding and generation capabilities. Recent recommender systems based on generative retrieval have leveraged the inferential abilities of LMs to directly generate the index tokens of the next item, based on item sequences within the user’s interaction history. Previous studies have mostly focused on item indices based solely on textual semantic or collaborative information. However, although the standalone effectiveness of these aspects has been demonstrated, the integration of this information has remained unexplored. Our in-depth analysis finds that there is a significant difference in the knowledge captured by the model from heterogeneous item indices and diverse input prompts, which can have a high potential for complementarity. In this paper, we propose SC-Rec, a unified recommender system that learns diverse preference knowledge from two distinct item indices and multiple prompt templates. Furthermore, SC-Rec adopts a novel reranking strategy that aggregates a set of ranking results, inferred based on different indices and prompts, to achieve the self-consistency of the model. Our empirical evaluation on three real-world datasets demonstrates that SC-Rec considerably outperforms the state-of-the-art methods for sequential recommendation, effectively incorporating complementary knowledge from varied outputs of the model.
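
SC-Rec’s reranking aggregates ranking results inferred from different indices and prompts. As a generic illustration (not the paper’s exact strategy), a Borda-count aggregation over several ranked lists looks like this; the item names are hypothetical:

```python
def aggregate_rankings(rankings):
    """Aggregate several ranked item lists with a simple Borda count.

    Each ranking is a list of item ids, best first. An item earns
    (len(ranking) - position) points from each ranking; ties break
    alphabetically for determinism.
    """
    scores = {}
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (len(ranking) - pos)
    return sorted(scores, key=lambda item: (-scores[item], item))

# Three rankings, e.g. from different item indices / prompt templates.
rankings = [
    ["item3", "item1", "item2"],
    ["item1", "item3", "item2"],
    ["item3", "item2", "item1"],
]
print(aggregate_rankings(rankings))
```

An item ranked highly by most configurations dominates the aggregate, which is the intuition behind self-consistency across heterogeneous indices.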

[AI-18] Can Large Language Models Improve the Adversarial Robustness of Graph Neural Networks?

链接: https://arxiv.org/abs/2408.08685
作者: Zhongjian Zhang,Xiao Wang,Huichi Zhou,Yue Yu,Mengmei Zhang,Cheng Yang,Chuan Shi
关键词-EN: received considerable attention, Graph neural networks, GNNs, neural networks, considerable attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are vulnerable to adversarial perturbations, especially topology attacks, and many methods that improve the robustness of GNNs have received considerable attention. Recently, we have witnessed the significant success of large language models (LLMs), leading many to explore their great potential on GNNs. However, existing work mainly focuses on improving the performance of GNNs by utilizing LLMs to enhance node features. Therefore, we ask: will the robustness of GNNs also be enhanced by the powerful understanding and inference capabilities of LLMs? Our empirical results show that although LLMs can improve the robustness of GNNs, there is still an average decrease of 23.1% in accuracy, implying that GNNs remain extremely vulnerable to topology attacks. A further question is therefore how to extend the capabilities of LLMs to graph adversarial robustness. In this paper, we propose an LLM-based robust graph structure inference framework, LLM4RGNN, which distills the inference capabilities of GPT-4 into a local LLM for identifying malicious edges and an LM-based edge predictor for finding missing important edges, so as to recover a robust graph structure. Extensive experiments demonstrate that LLM4RGNN consistently improves robustness across various GNNs. Even in some cases where the perturbation ratio increases to 40%, the accuracy of GNNs is still better than that on the clean graph.

[AI-19] LLM-PCGC: Large Language Model-based Point Cloud Geometry Compression

链接: https://arxiv.org/abs/2408.08682
作者: Yuqi Ye,Wei Gao
关键词-EN: point cloud, point cloud geometry, point cloud compression, robust context model, context model consistent
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The key to effective point cloud compression is to obtain a robust context model consistent with complex 3D data structures. Recently, the advancement of large language models (LLMs) has highlighted their capabilities not only as powerful generators for in-context learning and generation but also as effective compressors. These dual attributes of LLMs make them particularly well-suited to meet the demands of data compression. Therefore, this paper explores the potential of using LLM for compression tasks, focusing on lossless point cloud geometry compression (PCGC) experiments. However, applying LLM directly to PCGC tasks presents some significant challenges, i.e., LLM does not understand the structure of the point cloud well, and it is a difficult task to fill the gap between text and point cloud through text description, especially for large complicated and small shapeless point clouds. To address these problems, we introduce a novel architecture, namely the Large Language Model-based Point Cloud Geometry Compression (LLM-PCGC) method, using LLM to compress point cloud geometry information without any text description or aligning operation. By utilizing different adaptation techniques for cross-modality representation alignment and semantic consistency, including clustering, K-tree, token mapping invariance, and Low Rank Adaptation (LoRA), the proposed method can translate LLM to a compressor/generator for point cloud. To the best of our knowledge, this is the first structure to employ LLM as a compressor for point cloud data. Experiments demonstrate that the LLM-PCGC outperforms the other existing methods significantly, by achieving -40.213% bit rate reduction compared to the reference software of MPEG Geometry-based Point Cloud Compression (G-PCC) standard, and by achieving -2.267% bit rate reduction compared to the state-of-the-art learning-based method.

[AI-20] Neural Reward Machines

链接: https://arxiv.org/abs/2408.08677
作者: Elena Umili,Francesco Argenziano,Roberto Capobianco
关键词-EN: Non-markovian Reinforcement Learning, Linear Temporal Logic, Non-markovian Reinforcement, Reinforcement Learning, hard to solve
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-markovian Reinforcement Learning (RL) tasks are very hard to solve, because agents must consider the entire history of state-action pairs to act rationally in the environment. Most works use symbolic formalisms (such as Linear Temporal Logic or automata) to specify the temporally-extended task. These approaches only work in finite and discrete state environments, or in continuous problems for which a mapping between the raw state and a symbolic interpretation, known as a symbol grounding (SG) function, is available. Here, we define Neural Reward Machines (NRM), an automata-based neurosymbolic framework that can be used for both reasoning and learning in non-symbolic non-markovian RL domains, which is based on the probabilistic relaxation of Moore Machines. We combine RL with semisupervised symbol grounding (SSSG) and we show that NRMs can exploit high-level symbolic knowledge in non-symbolic environments without any knowledge of the SG function, outperforming Deep RL methods which cannot incorporate prior knowledge. Moreover, we advance the research in SSSG by proposing an algorithm for analysing the groundability of temporal specifications, which is more efficient than baseline techniques by a factor of 10^3.
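
The core of a probabilistic relaxation of a Moore Machine is a belief over automaton states, updated by stochastic transitions under an uncertain (soft) symbol grounding, with the machine output taken in expectation. A minimal stdlib sketch under assumed toy dynamics (all matrices and outputs here are hypothetical, not from the paper):

```python
def step(belief, transition, symbol_probs):
    """One soft Moore-machine step.

    belief: probability distribution over automaton states.
    transition[s][q][q2]: next-state distribution given symbol s in state q.
    symbol_probs: soft symbol grounding for the current raw observation.
    """
    n = len(belief)
    new_belief = [0.0] * n
    for s, p_s in enumerate(symbol_probs):
        for q in range(n):
            for q2 in range(n):
                new_belief[q2] += p_s * belief[q] * transition[s][q][q2]
    return new_belief

# Two states, two symbols; symbol 1 deterministically moves q0 -> q1.
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # symbol 0: stay
    [[0.0, 1.0], [0.0, 1.0]],  # symbol 1: go to (or stay in) q1
]
outputs = [0.0, 1.0]           # Moore output (reward) attached to each state
belief = [1.0, 0.0]
belief = step(belief, T, [0.3, 0.7])  # uncertain grounding of the observation
reward = sum(b * o for b, o in zip(belief, outputs))
print(round(reward, 2))
```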

[AI-21] Fine-tuning LLMs for Autonomous Spacecraft Control: A Case Study Using Kerbal Space Program

链接: https://arxiv.org/abs/2408.08676
作者: Alejandro Carrasco,Victor Rodriguez-Fernandez,Richard Linares
关键词-EN: Large Language Models, fine-tuned Large Language, Large Language, Program Differential Games, Language Models
类目: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: ESA SPAICE Conference 2024. arXiv admin note: text overlap with arXiv:2404.00413

点击查看摘要

Abstract:Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompt. This study explores the use of fine-tuned Large Language Models (LLMs) for autonomous spacecraft control, using the Kerbal Space Program Differential Games suite (KSPDG) as a testing environment. Traditional Reinforcement Learning (RL) approaches face limitations in this domain due to insufficient simulation capabilities and data. By leveraging LLMs, specifically fine-tuning models like GPT-3.5 and LLaMA, we demonstrate how these models can effectively control spacecraft using language-based inputs and outputs. Our approach integrates real-time mission telemetry into textual prompts processed by the LLM, which then generates control actions via an agent. The results open a discussion about the potential of LLMs for space operations beyond their nominal use for text-related tasks. Future work aims to expand this methodology to other space control tasks and evaluate the performance of different LLM families. The code is available at this https URL.

[AI-22] MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection INTERSPEECH2024

链接: https://arxiv.org/abs/2408.08673
作者: Pengfei Cai,Yan Song,Kang Li,Haoyu Song,Ian McLoughlin
关键词-EN: Sound event detection, recent DCASE challenges, DCASE challenges, Sound event, recent DCASE
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Received by Interspeech 2024

点击查看摘要

Abstract:Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively.

[AI-23] Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

链接: https://arxiv.org/abs/2408.08670
作者: Alessio Devoto,Federico Alvetreti,Jary Pomponi,Paolo Di Lorenzo,Pasquale Minervini,Simone Scardapane
关键词-EN: Vision Transformers, foundation models based, foundation models, Recently, fine-tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this end, in this paper we introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers) to speed up the fine-tuning process while reducing computational cost, memory load, and training time. Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch. Therefore, at each fine-tuning step, we adaptively estimate the importance of all layers and we assign what we call “compute budgets” accordingly. Layers that were allocated lower budgets are either trained with a reduced number of input tokens or kept frozen. Freezing a layer reduces the computational cost and memory usage by preventing updates to its weights, while discarding tokens removes redundant data, speeding up processing and reducing memory requirements. We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources across layers, resulting in substantial reductions in training time (up to 1.5x), FLOPs (up to 2x), and memory load (up to 2x) compared to traditional full fine-tuning approaches. Additionally, it can be successfully combined with other parameter-efficient fine-tuning methods, such as LoRA.
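
The budget-assignment idea above (rank layers by estimated importance, then freeze the lowest tier and reduce tokens for the next) can be sketched as follows. This is a toy allocation policy under assumed tier fractions, not ALaST’s actual algorithm, and the importance scores are hypothetical:

```python
def allocate_budgets(importances, freeze_frac=0.25, reduce_frac=0.25):
    """Toy compute-budget assignment: freeze the least important layers,
    give reduced token budgets to the next tier, full compute to the rest."""
    order = sorted(range(len(importances)), key=lambda i: importances[i])
    n = len(importances)
    n_freeze = int(n * freeze_frac)
    n_reduce = int(n * reduce_frac)
    plan = {}
    for rank, layer in enumerate(order):
        if rank < n_freeze:
            plan[layer] = "frozen"
        elif rank < n_freeze + n_reduce:
            plan[layer] = "reduced-tokens"
        else:
            plan[layer] = "full"
    return plan

# Per-layer importance estimates for the current mini-batch (hypothetical).
importances = [0.9, 0.1, 0.5, 0.7, 0.2, 0.6, 0.8, 0.3]
print(allocate_budgets(importances))
```

Re-running the estimate each step lets the plan adapt per mini-batch, which is the observation the abstract builds on.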

[AI-24] Robust Stochastic Shortest-Path Planning via Risk-Sensitive Incremental Sampling

链接: https://arxiv.org/abs/2408.08668
作者: Clinton Enwerem,Erfaun Noorani,John S. Baras,Brian M. Sadler
关键词-EN: supply chain management, mitigating hazardous outcomes, last-mile autonomous delivery, ensuring successful task, successful task completion
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Accepted for presentation at the 2024 IEEE Conference on Decision and Control (CDC)

点击查看摘要

Abstract:With the pervasiveness of Stochastic Shortest-Path (SSP) problems in high-risk industries, such as last-mile autonomous delivery and supply chain management, robust planning algorithms are crucial for ensuring successful task completion while mitigating hazardous outcomes. Mainstream chance-constrained incremental sampling techniques for solving SSP problems tend to be overly conservative and typically do not consider the likelihood of undesirable tail events. We propose an alternative risk-aware approach inspired by the asymptotically-optimal Rapidly-Exploring Random Trees (RRT*) planning algorithm, which selects nodes along path segments with minimal Conditional Value-at-Risk (CVaR). Our motivation rests on the step-wise coherence of the CVaR risk measure and the optimal substructure of the SSP problem. Thus, optimizing with respect to the CVaR at each sampling iteration necessarily leads to an optimal path in the limit of the sample size. We validate our approach via numerical path planning experiments in a two-dimensional grid world with obstacles and stochastic path-segment lengths. Our simulation results show that incorporating risk into the tree growth process yields paths with lengths that are significantly less sensitive to variations in the noise parameter, or equivalently, paths that are more robust to environmental uncertainty. Algorithmic analyses reveal similar query time and memory space complexity to the baseline RRT* procedure, with only a marginal increase in processing time. This increase is offset by significantly lower noise sensitivity and reduced planner failure rates.
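
The node-selection criterion above scores path segments by Conditional Value-at-Risk: the expected cost within the worst alpha-fraction of outcomes. An empirical CVaR over sampled costs (toy numbers, not the paper’s experiments) is a few lines:

```python
def cvar(costs, alpha=0.1):
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of costs."""
    worst = sorted(costs, reverse=True)
    k = max(1, int(len(costs) * alpha))
    return sum(worst[:k]) / k

# Sampled path-segment lengths under stochastic edge noise (hypothetical).
samples = [10.0, 10.5, 9.8, 11.2, 10.1, 25.0, 10.3, 9.9, 10.4, 10.0]
print(cvar(samples, alpha=0.2))  # mean of the worst 20% of outcomes
```

Note how the single tail event (25.0) dominates the CVaR while barely moving the plain mean, which is exactly why a CVaR-minimizing planner avoids segments with heavy-tailed length distributions.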

[AI-25] A Multivocal Literature Review on Privacy and Fairness in Federated Learning

链接: https://arxiv.org/abs/2408.08666
作者: Beatrice Balbierer,Lukas Heinlein,Domenique Zipperling,Niklas Kühl
关键词-EN: Federated Learning presents, Federated Learning, data sharing, eliminating the necessity, necessity for data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the Internationale Tagung Wirtschaftsinformatik 2024

点击查看摘要

Abstract:Federated Learning presents a way to revolutionize AI applications by eliminating the necessity for data sharing. Yet, research has shown that information can still be extracted during training, making additional privacy-preserving measures such as differential privacy imperative. To implement real-world federated learning applications, fairness, ranging from a fair distribution of performance to non-discriminative behaviour, must be considered. Particularly in high-risk applications (e.g. healthcare), avoiding the repetition of past discriminatory errors is paramount. As recent research has demonstrated an inherent tension between privacy and fairness, we conduct a multivocal literature review to examine the current methods to integrate privacy and fairness in federated learning. Our analyses illustrate that the relationship between privacy and fairness has been neglected, posing a critical risk for real-world applications. We highlight the need to explore the relationship between privacy, fairness, and performance, advocating for the creation of integrated federated learning frameworks.

[AI-26] Mitigating Backdoor Attacks in Federated Learning via Flipping Weight Updates of Low-Activation Input Neurons

链接: https://arxiv.org/abs/2408.08655
作者: Binbin Ding,Penghui Yang,Zeqing Ge,Shengjun Huang
关键词-EN: collaboratively train machine, enables multiple clients, train machine learning, learning enables multiple, Federated learning enables
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning enables multiple clients to collaboratively train machine learning models under the overall planning of the server while adhering to privacy requirements. However, the server cannot directly oversee the local training process, creating an opportunity for malicious clients to introduce backdoors. Existing research shows that backdoor attacks activate specific neurons in the compromised model, which remain dormant when processing clean data. Leveraging this insight, we propose a method called Flipping Weight Updates of Low-Activation Input Neurons (FLAIN) to defend against backdoor attacks in federated learning. Specifically, after completing global training, we employ an auxiliary dataset to identify low-activation input neurons and flip the associated weight updates. We incrementally raise the threshold for low-activation inputs and flip the weight updates iteratively, until the performance degradation on the auxiliary data becomes unacceptable. Extensive experiments validate that our method can effectively reduce the success rate of backdoor attacks to a low level in various attack scenarios including those with non-IID data distribution or high MCRs, causing only minimal performance degradation on clean data.
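
The core FLAIN operation, as described above, is to find input neurons with low activation on an auxiliary clean dataset and flip (negate) their associated weight updates. A minimal sketch with hypothetical per-neuron activations and updates (the paper’s iterative threshold raising is omitted):

```python
def flain_flip(weight_updates, activations, threshold):
    """Flip (negate) the aggregated weight updates of input neurons whose
    mean activation on the auxiliary dataset falls below `threshold`."""
    return [
        [-u for u in row] if act < threshold else list(row)
        for row, act in zip(weight_updates, activations)
    ]

# One row of updates per input neuron; low-activation neurons are suspect
# because backdoor neurons tend to stay dormant on clean data.
updates     = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]
activations = [0.9, 0.02, 0.7]   # neuron 1 is nearly dormant on clean data
print(flain_flip(updates, activations, threshold=0.1))
```

In the full method, the threshold is raised incrementally and the flip repeated until performance on the auxiliary data degrades unacceptably.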

[AI-27] TextCAVs: Debugging vision models using text MICCAI2024

链接: https://arxiv.org/abs/2408.08652
作者: Angus Nicolson,Yarin Gal,J. Alison Noble
关键词-EN: Concept-based interpretability methods, high-level human interpretable, Concept-based interpretability, human interpretable concepts, popular form
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 11 pages, 2 figures. Accepted at iMIMIC Workshop at MICCAI 2024

点击查看摘要

Abstract:Concept-based interpretability methods are a popular form of explanation for deep learning models which provide explanations in the form of high-level human interpretable concepts. These methods typically find concept activation vectors (CAVs) using a probe dataset of concept examples. This requires labelled data for these concepts – an expensive task in the medical domain. We introduce TextCAVs: a novel method which creates CAVs using vision-language models such as CLIP, allowing for explanations to be created solely using text descriptions of the concept, as opposed to image exemplars. This reduced cost in testing concepts allows for many concepts to be tested and for users to interact with the model, testing new ideas as they are thought of, rather than a delay caused by image collection and annotation. In early experimental results, we demonstrate that TextCAVs produces reasonable explanations for a chest x-ray dataset (MIMIC-CXR) and natural images (ImageNet), and that these explanations can be used to debug deep learning-based models.
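
One way to build a CAV from text alone, in the spirit of the abstract, is to take the direction from the mean embedding of neutral captions to the mean embedding of concept captions, then score a model’s sensitivity as the directional derivative along that vector. The 3-d embeddings and gradient below are hypothetical stand-ins for CLIP text embeddings and a real model gradient:

```python
def text_cav(concept_embs, baseline_embs):
    """Concept activation vector from text embeddings alone: the direction
    from the mean baseline-caption embedding to the mean concept embedding."""
    dim = len(concept_embs[0])
    mean_c = [sum(e[i] for e in concept_embs) / len(concept_embs) for i in range(dim)]
    mean_b = [sum(e[i] for e in baseline_embs) / len(baseline_embs) for i in range(dim)]
    return [c - b for c, b in zip(mean_c, mean_b)]

def sensitivity(gradient, cav):
    """Directional derivative of the model output along the CAV."""
    return sum(g * v for g, v in zip(gradient, cav))

# Hypothetical 3-d embeddings: concept captions vs. neutral captions.
concept  = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.2]]
baseline = [[0.1, 0.1, 0.1], [0.1, 0.3, 0.1]]
cav = text_cav(concept, baseline)
print(cav, round(sensitivity([0.5, -0.2, 0.1], cav), 3))
```

A positive sensitivity suggests the concept pushes the model’s output up, which is the kind of signal used to debug class predictions.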

[AI-28] Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning

链接: https://arxiv.org/abs/2408.08651
作者: Kyle Moore,Jesse Roberts,Thao Pham,Douglas Fisher
关键词-EN: Counterfactual Prompting, Multi-Task Language Understanding, Massive Multi-Task Language, training data, leading to predictions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language models are known to absorb biases from their training data, leading to predictions driven by statistical regularities rather than semantic relevance. We investigate the impact of these biases on answer choice preferences in the Massive Multi-Task Language Understanding (MMLU) task. Our findings reveal that differences in learned regularities across answer options are predictive of model preferences and mirror human test-taking strategies. To address this issue, we introduce two novel methods: Counterfactual Prompting with Chain of Thought (CoT) and Counterfactual Prompting with Agnostically Primed CoT (APriCoT). We demonstrate that while Counterfactual Prompting with CoT alone is insufficient to mitigate bias, our novel Primed Counterfactual Prompting with CoT approach effectively reduces the influence of base-rate probabilities while improving overall accuracy. Our results suggest that mitigating bias requires a “System-2” like process and that CoT reasoning is susceptible to confirmation bias under some prompting methodologies. Our contributions offer practical solutions for developing more robust and fair language models.

[AI-29] Understanding Enthymemes in Argument Maps: Bridging Argument Mining and Logic-based Argumentation

链接: https://arxiv.org/abs/2408.08648
作者: Jonathan Ben-Naim,Victor David,Anthony Hunter
关键词-EN: argument map, processing technology aimed, Argument, arguments, language processing technology
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Research note

点击查看摘要

Abstract:Argument mining is a natural language processing technology aimed at identifying arguments in text. Furthermore, the approach is being developed to identify the premises and claims of those arguments, and to identify the relationships between arguments including support and attack relationships. In this paper, we assume that an argument map contains the premises and claims of arguments, and support and attack relationships between them, that have been identified by argument mining. So from a piece of text, we assume an argument map is obtained automatically by natural language processing. However, to understand and to automatically analyse that argument map, it would be desirable to instantiate that argument map with logical arguments. Once we have the logical representation of the arguments in an argument map, we can use automated reasoning to analyze the argumentation (e.g. check consistency of premises, check validity of claims, and check that the labelling on each arc corresponds with the logical arguments). We address this need by using classical logic for representing the explicit information in the text, and using default logic for representing the implicit information in the text. In order to investigate our proposal, we consider some specific options for instantiation.

[AI-30] Magazine Supply Optimization: a Case-study

链接: https://arxiv.org/abs/2408.08637
作者: Duong Nguyen,Ana Ulianovici,Sami Achour,Soline Aubry,Nicolas Chesneau
关键词-EN: fixed inventory assumption, irregular sales patterns, magazine retail industry, Supply optimization, magazine supply optimization
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Supply optimization is a complex and challenging task in the magazine retail industry because of the fixed inventory assumption, irregular sales patterns, and varying product and point-of-sale characteristics. We introduce AthenIA, an industrialized magazine supply optimization solution that plans the supply for over 20,000 points of sale in France. We modularize the supply planning process into a four-step pipeline: demand sensing, optimization, business rules, and operating. The core of the solution is a novel group conformalized quantile regression method that integrates domain expert insights, coupled with a supply optimization technique that balances the costs of out-of-stock against the costs of over-supply. AthenIA has proven to be a valuable tool for magazine publishers, particularly in the context of evolving economic and ecological challenges.
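
Balancing the cost of out-of-stock against the cost of over-supply, as AthenIA does, is classically solved by supplying at the demand quantile given by the critical ratio cost_under / (cost_under + cost_over). The sketch below is this textbook newsvendor rule over an empirical demand sample, not the paper’s group conformalized quantile regression; all numbers are hypothetical:

```python
def optimal_supply(demand_samples, cost_under, cost_over):
    """Newsvendor-style supply level: the demand quantile at the critical
    ratio cost_under / (cost_under + cost_over)."""
    q = cost_under / (cost_under + cost_over)
    ordered = sorted(demand_samples)
    # Empirical quantile: smallest sample covering fraction q of the mass.
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

# Demand forecast samples for one point of sale (hypothetical copies sold).
demand = [4, 6, 5, 8, 7, 5, 6, 9, 5, 6]
# A lost sale (under-supply) hurts more than printing an unsold copy.
print(optimal_supply(demand, cost_under=2.0, cost_over=1.0))
```

Raising the out-of-stock cost pushes the chosen quantile, and hence the supply level, upward.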

[AI-31] A Survey on Benchmarks of Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.08632
作者: Jian Li,Weiheng Lu
关键词-EN: Multimodal Large Language, Large Language Models, visual question answering, Multimodal Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 180 benchmarks and evaluations for MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of the current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs. For more details, please visit our GitHub repository: this https URL.

[AI-32] RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions

链接: https://arxiv.org/abs/2408.08624
作者: Gregory Kell,Angus Roberts,Serge Umansky,Yuti Khare,Najma Ahmed,Nikhil Patel,Chloe Simela,Jack Coumbe,Julian Rozario,Ryan-Rhys Griffiths,Iain J. Marshall
关键词-EN: Clinical question answering, question answering systems, clinicians with relevant, relevant and timely, answering systems
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at AMIA Annual Symposium 2024

点击查看摘要

Abstract:Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating “ideal” QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. We release our code and our dataset publicly to encourage further research.
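
The lexical similarity between questions and answers mentioned above can be illustrated with a simple token-set Jaccard overlap (a common choice; the paper does not specify its exact measure here). The clinical QA pair is invented for illustration:

```python
def lexical_similarity(question, answer):
    """Jaccard overlap between the token sets of a question and an answer."""
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    if not q_tokens and not a_tokens:
        return 0.0
    return len(q_tokens & a_tokens) / len(q_tokens | a_tokens)

# Hypothetical QA pair; lower overlap makes retrieval-style matching harder.
q = "what dose of aspirin is recommended after stroke"
a = "a daily dose of 75 mg aspirin is recommended"
print(round(lexical_similarity(q, a), 3))
```

Datasets with lower question-answer overlap penalize models that rely on surface word matching, which is the extra challenge RealMedQA poses relative to BioASQ.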

[AI-33] SketchRef: A Benchmark Dataset and Evaluation Metrics for Automated Sketch Synthesis

链接: https://arxiv.org/abs/2408.08623
作者: Xingyue Lin,Xingjian Hu,Shuai Peng,Jianhua Zhu,Liangcai Gao
关键词-EN: powerful artistic technique, capture essential visual, essential visual information, increasingly gaining attention, image synthesis field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sketch, a powerful artistic technique to capture essential visual information about real-world objects, is increasingly gaining attention in the image synthesis field. However, evaluating the quality of synthesized sketches presents unique unsolved challenges. Current evaluation methods for sketch synthesis are inadequate due to the lack of a unified benchmark dataset, over-reliance on classification accuracy for recognizability, and unfair evaluation of sketches with different levels of simplification. To address these issues, we introduce SketchRef, a benchmark dataset comprising 4 categories of reference photos (animals, human faces, human bodies, and common objects) alongside novel evaluation metrics. Considering that classification accuracy is insufficient to measure the structural consistency between a sketch and its reference photo, we propose the mean Object Keypoint Similarity (mOKS) metric, utilizing pose estimation to assess structure-level recognizability. To ensure fair evaluation of sketches with different simplification levels, we propose a recognizability calculation method constrained by simplicity. We also collect 8K responses from art enthusiasts, validating the effectiveness of our proposed evaluation methods. We hope this work can provide a comprehensive evaluation of sketch synthesis algorithms, thereby aligning their performance more closely with human understanding.
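
The mOKS metric builds on Object Keypoint Similarity, which averages a Gaussian falloff of keypoint distances scaled by object size. A minimal stdlib sketch of OKS with hypothetical keypoints and falloff constants (the per-keypoint constants used by SketchRef are not given here):

```python
import math

def oks(pred, gt, scale, kappas):
    """Object Keypoint Similarity between predicted and ground-truth keypoints.

    pred, gt: lists of (x, y); scale: object scale; kappas: per-keypoint
    falloff constants. OKS averages exp(-d^2 / (2 * s^2 * k^2)) over keypoints.
    """
    total = 0.0
    for (px, py), (gx, gy), k in zip(pred, gt, kappas):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2 * scale ** 2 * k ** 2))
    return total / len(gt)

# Hypothetical two-keypoint pose; the second keypoint is off by 1 unit.
gt   = [(0.0, 0.0), (1.0, 1.0)]
pred = [(0.0, 0.0), (1.0, 2.0)]
print(round(oks(pred, gt, scale=1.0, kappas=[0.5, 0.5]), 3))
```

Averaging OKS over a set of sketches (with keypoints from a pose estimator) gives the mean OKS the abstract refers to.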

[AI-34] DeepDFA: Automata Learning through Neural Probabilistic Relaxations

链接: https://arxiv.org/abs/2408.08622
作者: Elena Umili,Roberto Capobianco
关键词-EN: Deterministic Finite Automata, identifying Deterministic Finite, Finite Automata, Deterministic Finite, Recurrent Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we introduce DeepDFA, a novel approach to identifying Deterministic Finite Automata (DFAs) from traces, harnessing a differentiable yet discrete model. Inspired by both the probabilistic relaxation of DFAs and Recurrent Neural Networks (RNNs), our model offers interpretability post-training, alongside reduced complexity and enhanced training efficiency compared to traditional RNNs. Moreover, by leveraging gradient-based optimization, our method surpasses combinatorial approaches in both scalability and noise resilience. Validation experiments conducted on target regular languages of varying size and complexity demonstrate that our approach is accurate, fast, and robust to noise in both the input symbols and the output labels of training data, integrating the strengths of both logical grammar induction and deep learning.
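
A probabilistic relaxation of a DFA replaces hard transitions with row-stochastic matrices, so a trace induces a belief over states and a differentiable acceptance probability. The toy automaton below (parity of 1s, deterministic matrices for clarity) is an assumed example, not DeepDFA’s learned model:

```python
def trace_acceptance(trace, transition, final_probs):
    """Probability that a probabilistic DFA relaxation accepts a trace.

    transition[s][q][q2]: row-stochastic next-state distributions per symbol;
    final_probs[q]: probability that state q is accepting. Start state is 0.
    """
    belief = [1.0] + [0.0] * (len(final_probs) - 1)
    for s in trace:
        belief = [
            sum(belief[q] * transition[s][q][q2] for q in range(len(belief)))
            for q2 in range(len(belief))
        ]
    return sum(b * f for b, f in zip(belief, final_probs))

# Two states; symbol 1 flips the state; state 1 is accepting.
# This encodes the regular language "odd number of 1s".
T = [
    [[1.0, 0.0], [0.0, 1.0]],   # symbol 0: identity
    [[0.0, 1.0], [1.0, 0.0]],   # symbol 1: swap
]
print(trace_acceptance([1, 0, 1, 1], T, final_probs=[0.0, 1.0]))
```

Because the forward pass is a chain of matrix products, the transition entries can be trained by gradient descent and later discretized back into an interpretable DFA.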

[AI-35] PatUntrack: Automated Generating Patch Examples for Issue Reports without Tracked Insecure Code

链接: https://arxiv.org/abs/2408.08619
作者: Ziyou Jiang,Lin Shi,Guowei Yang,Qing Wang
关键词-EN: insecure code, relevant insecure code, insecure, code, software community
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Accepted by ASE’24

点击查看摘要

Abstract:Security patches are essential for enhancing the stability and robustness of projects in the software community. While vulnerabilities are officially expected to be patched before being disclosed, patching vulnerabilities is complicated and remains a struggle for many organizations. To patch vulnerabilities, security practitioners typically track vulnerable issue reports (IRs), and analyze their relevant insecure code to generate potential patches. However, the relevant insecure code may not be explicitly specified and practitioners cannot track the insecure code in the repositories, thus limiting their ability to generate patches. In such cases, providing examples of insecure code and the corresponding patches would help security developers better locate and fix the insecure code. In this paper, we propose PatUntrack to automatically generate patch examples from IRs without tracked insecure code. It auto-prompts Large Language Models (LLMs) to make them applicable to analyzing the vulnerabilities. It first generates the completed description of the Vulnerability-Triggering Path (VTP) from vulnerable IRs. Then, it corrects hallucinations in the VTP description with external golden knowledge. Finally, it generates Top-K pairs of Insecure Code and Patch Example based on the corrected VTP description. To evaluate the performance, we conducted experiments on 5,465 vulnerable IRs. The experimental results show that PatUntrack achieves the highest performance and improves the traditional LLM baselines by +14.6% (Fix@10) on average in patch example generation. Furthermore, PatUntrack was applied to generate patch examples for 76 newly disclosed vulnerable IRs. 27 out of 37 replies from the authors of these IRs confirmed the usefulness of the patch examples generated by PatUntrack, indicating that they can benefit from these examples for patching the vulnerabilities.

[AI-36] Generative Dataset Distillation Based on Diffusion Model ECCV2024

链接: https://arxiv.org/abs/2408.08610
作者: Duo Su,Junjie Hou,Guang Li,Ren Togo,Rui Song,Takahiro Ogawa,Miki Haseyama
关键词-EN: Dataset Distillation Challenge, generative dataset distillation, dataset distillation method, Dataset Distillation, generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The Third Place Winner in Generative Track of the ECCV 2024 DD Challenge

点击查看摘要

Abstract:This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Since the diffusion model has become the mainstay of generative models because of its high-quality generative effects, we focus on distillation methods based on the diffusion model. Considering that the track can only generate a fixed number of images in 10 minutes using a generative model for CIFAR-100 and Tiny-ImageNet datasets, we need to use a generative model that can generate images at high speed. In this study, we propose a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model, which can generate images at high speed and quality. Compared to other diffusion models that can only generate one image per class (IPC = 1), our method can achieve an IPC of 10 for Tiny-ImageNet and an IPC of 20 for CIFAR-100. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use the class information as text prompts and apply post-generation data augmentation for the SDXL-Turbo model. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Codes are available at this https URL.
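The generation loop described above (class names as text prompts, a fixed image budget per class, under a hard 10-minute limit) can be sketched as follows. This is a minimal illustration, not the authors' code: `generate_image` is a stub standing in for a fast text-to-image pipeline such as SDXL-Turbo, and the prompt template is an assumption.

```python
import time

# Stub standing in for a fast text-to-image model such as SDXL-Turbo;
# in the paper's setting this would be a diffusion pipeline call.
def generate_image(prompt):
    return {"prompt": prompt, "pixels": [0] * (32 * 32)}  # placeholder image

def distill_dataset(class_names, ipc, time_budget_s=600.0):
    """Generate `ipc` images per class, stopping if the time budget runs out."""
    start = time.monotonic()
    dataset = []
    for name in class_names:
        for _ in range(ipc):
            if time.monotonic() - start > time_budget_s:
                return dataset
            # Class information is used directly as the text prompt.
            dataset.append((name, generate_image(f"a photo of a {name}")))
    return dataset

distilled = distill_dataset(["apple", "bicycle"], ipc=20)
print(len(distilled))  # 2 classes x IPC 20 = 40 samples
```

With a real pipeline, the time-budget check is what forces the trade-off the abstract describes: a slower model simply cannot reach IPC = 10 or 20 within 10 minutes.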

[AI-37] MM-UNet: A Mixed MLP Architecture for Improved Ophthalmic Image Segmentation

链接: https://arxiv.org/abs/2408.08600
作者: Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu
关键词-EN: ocular disease diagnosis, disease diagnosis, critical foundation, foundation for ocular, ocular disease
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: OMIA2024

点击查看摘要

Abstract:Ophthalmic image segmentation serves as a critical foundation for ocular disease diagnosis. Although fully convolutional neural networks (CNNs) are commonly employed for segmentation, they are constrained by inductive biases and face challenges in establishing long-range dependencies. Transformer-based models address these limitations but introduce substantial computational overhead. Recently, a simple yet efficient Multilayer Perceptron (MLP) architecture was proposed for image classification, achieving competitive performance relative to advanced transformers. However, its effectiveness for ophthalmic image segmentation remains unexplored. In this paper, we introduce MM-UNet, an efficient Mixed MLP model tailored for ophthalmic image segmentation. Within MM-UNet, we propose a multi-scale MLP (MMLP) module that facilitates the interaction of features at various depths through a grouping strategy, enabling simultaneous capture of global and local information. We conducted extensive experiments on both a private anterior segment optical coherence tomography (AS-OCT) image dataset and a public fundus image dataset. The results demonstrated the superiority of our MM-UNet model in comparison to state-of-the-art deep segmentation networks.

[AI-38] A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models

链接: https://arxiv.org/abs/2408.08590
作者: Geonhee Kim,Marco Valentino,André Freitas
关键词-EN: auto-regressive Language Models, exploit superficial patterns, Recent studies, auto-regressive Language, systematic reasoning principles
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies on logical reasoning in auto-regressive Language Models (LMs) have sparked a debate on whether such models can learn systematic reasoning principles during pre-training or merely exploit superficial patterns in the training data. This paper presents a mechanistic interpretation of syllogistic reasoning in LMs to further enhance our understanding of internal dynamics. Specifically, we present a methodology for circuit discovery aimed at disentangling content-independent reasoning mechanisms from world knowledge acquired during pre-training. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic reasoning, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes and model sizes, finding that the identified circuit is sufficient and necessary for all the schemes on which the model achieves high downstream accuracy (≥ 60%). Overall, our findings suggest that LMs indeed learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalisable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.

[AI-39] S-RAF: A Simulation-Based Robustness Assessment Framework for Responsible Autonomous Driving

链接: https://arxiv.org/abs/2408.08584
作者: Daniel Omeiza,Pratik Somaiya,Jo-Ann Pattinson,Carolyn Ten-Holter,Jack Stilgoe,Marina Jirotka,Lars Kunze
关键词-EN: technology advances, artificial intelligence, Robustness Assessment Framework, AI-driven systems, Simulation-Based Robustness Assessment
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As artificial intelligence (AI) technology advances, ensuring the robustness and safety of AI-driven systems has become paramount. However, varying perceptions of robustness among AI developers create misaligned evaluation metrics, complicating the assessment and certification of safety-critical and complex AI systems such as autonomous driving (AD) agents. To address this challenge, we introduce Simulation-Based Robustness Assessment Framework (S-RAF) for autonomous driving. S-RAF leverages the CARLA Driving simulator to rigorously assess AD agents across diverse conditions, including faulty sensors, environmental changes, and complex traffic situations. By quantifying robustness and its relationship with other safety-critical factors, such as carbon emissions, S-RAF aids developers and stakeholders in building safe and responsible driving agents, and streamlining safety certification processes. Furthermore, S-RAF offers significant advantages, such as reduced testing costs, and the ability to explore edge cases that may be unsafe to test in the real world. The code for this framework is available here: this https URL

[AI-40] AgentSimulator: An Agent-based Approach for Data-driven Business Process Simulation

链接: https://arxiv.org/abs/2408.08571
作者: Lukas Kirchdorfer,Robert Blümel,Timotheus Kampik,Han van der Aa,Heiner Stuckenschmidt
关键词-EN: Business process simulation, estimating process performance, Business process, versatile technique, technique for estimating
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Business process simulation (BPS) is a versatile technique for estimating process performance across various scenarios. Traditionally, BPS approaches employ a control-flow-first perspective by enriching a process model with simulation parameters. Although such approaches can mimic the behavior of centrally orchestrated processes, such as those supported by workflow systems, current control-flow-first approaches cannot faithfully capture the dynamics of real-world processes that involve distinct resource behavior and decentralized decision-making. Recognizing this issue, this paper introduces AgentSimulator, a resource-first BPS approach that discovers a multi-agent system from an event log, modeling distinct resource behaviors and interaction patterns to simulate the underlying process. Our experiments show that AgentSimulator achieves state-of-the-art simulation accuracy with significantly lower computation times than existing approaches while providing high interpretability and adaptability to different types of process-execution scenarios.

[AI-41] String Diagram of Optimal Transports

链接: https://arxiv.org/abs/2408.08550
作者: Kazuki Watanabe,Noboru Isobe
关键词-EN: string diagrams, diagrams of OTs, optimal transports, safety problem, propose a hierarchical
类目: Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: Preprint, under review, 14 pages, 2 figures, 1 table

点击查看摘要

Abstract:We propose a hierarchical framework of optimal transports (OTs), namely string diagrams of OTs. Our target problem is a safety problem on string diagrams of OTs, which requires proving or disproving that the minimum transportation cost in a given string diagram of OTs is above a given threshold. We reduce the safety problem on a string diagram of OTs to that on a monolithic OT by composing cost matrices. Our novel reduction exploits an algebraic structure of cost matrices equipped with two compositions: a sequential composition and a parallel composition. We provide a novel algorithm for the safety problem on string diagrams of OTs by our reduction, and we demonstrate its efficiency and performance advantage through experiments.
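The abstract does not spell out the two compositions of cost matrices, so the following is a hedged reading of what such an algebra could look like: sequential composition as a min-plus matrix product (the cheapest route through an intermediate space) and parallel composition as addition of costs over independent coordinate pairs. The paper's concrete definitions may differ.

```python
def seq_compose(A, B):
    """Sequential composition: C[i][k] = min_j (A[i][j] + B[j][k]) (min-plus product)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[min(A[i][j] + B[j][k] for j in range(m)) for k in range(p)]
            for i in range(n)]

def par_compose(A, B):
    """Parallel composition: costs add over independent coordinate pairs (i,p)->(j,q)."""
    return [[A[i][j] + B[p][q]
             for j in range(len(A[0])) for q in range(len(B[0]))]
            for i in range(len(A)) for p in range(len(B))]

A = [[1, 4], [2, 0]]
B = [[0, 3], [1, 1]]
print(seq_compose(A, B))  # [[1, 4], [1, 1]]
```

Under this reading, a safety check (is the minimum transport cost above a threshold?) on a whole string diagram reduces to composing the cost matrices first and then inspecting a single monolithic matrix, which matches the reduction the abstract describes.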

[AI-42] Detecting Unsuccessful Students in Cybersecurity Exercises in Two Different Learning Environments

链接: https://arxiv.org/abs/2408.08531
作者: Valdemar Švábenský,Kristián Tkáčik,Aubrey Birdwell,Richard Weiss,Ryan S. Baker,Pavel Čeleda,Jan Vykopal,Jens Mache,Ankur Chattopadhyay
关键词-EN: research track evaluates, performing poorly, research track, track evaluates, evaluates the usage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注: To appear for publication in the FIE 2024 conference proceedings

点击查看摘要

Abstract:This full paper in the research track evaluates the usage of data logged from cybersecurity exercises in order to predict students who are potentially at risk of performing poorly. Hands-on exercises are essential for learning since they enable students to practice their skills. In cybersecurity, hands-on exercises are often complex and require knowledge of many topics. Therefore, students may miss solutions due to gaps in their knowledge and become frustrated, which impedes their learning. Targeted aid by the instructor helps, but since the instructor’s time is limited, efficient ways to detect struggling students are needed. This paper develops automated tools to predict when a student is having difficulty. We formed a dataset with the actions of 313 students from two countries and two learning environments: KYPO CRP and EDURange. These data are used in machine learning algorithms to predict the success of students in exercises deployed in these environments. After extracting features from the data, we trained and cross-validated eight classifiers for predicting the exercise outcome and evaluated their predictive power. The contribution of this paper is comparing two approaches to feature engineering, modeling, and classification performance on data from two learning environments. Using the features from either learning environment, we were able to detect and distinguish between successful and struggling students. A decision tree classifier achieved the highest balanced accuracy and sensitivity with data from both learning environments. The results show that activity data from cybersecurity exercises are suitable for predicting student success. In a potential application, such models can aid instructors in detecting struggling students and providing targeted help. We publish data and code for building these models so that others can adopt or adapt them.

[AI-43] Focus on Focus: Focus-oriented Representation Learning and Multi-view Cross-modal Alignment for Glioma Grading

链接: https://arxiv.org/abs/2408.08527
作者: Li Pan,Yupei Zhang,Qiushi Yang,Tan Li,Xiaohan Xing,Maximus C. F. Yeung,Zhen Chen
关键词-EN: achieved a promising, multimodal deep learning, Recently, molecular biomarkers, integrates histopathology slides
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, multimodal deep learning, which integrates histopathology slides and molecular biomarkers, has achieved a promising performance in glioma grading. Despite great progress, due to the intra-modality complexity and inter-modality heterogeneity, existing studies suffer from inadequate histopathology representation learning and inefficient molecular-pathology knowledge alignment. These two issues hinder existing methods to precisely interpret diagnostic molecular-pathology features, thereby limiting their grading performance. Moreover, the real-world applicability of existing multimodal approaches is significantly restricted as molecular biomarkers are not always available during clinical deployment. To address these problems, we introduce a novel Focus on Focus (FoF) framework with paired pathology-genomic training and applicable pathology-only inference, enhancing molecular-pathology representation effectively. Specifically, we propose a Focus-oriented Representation Learning (FRL) module to encourage the model to identify regions positively or negatively related to glioma grading and guide it to focus on the diagnostic areas with a consistency constraint. To effectively link the molecular biomarkers to morphological features, we propose a Multi-view Cross-modal Alignment (MCA) module that projects histopathology representations into molecular subspaces, aligning morphological features with corresponding molecular biomarker status by supervised contrastive learning. Experiments on the TCGA GBM-LGG dataset demonstrate that our FoF framework significantly improves the glioma grading. Remarkably, our FoF achieves superior performance using only histopathology slides compared to existing multimodal methods. The source code is available at this https URL.

[AI-44] GS-ID: Illumination Decomposition on Gaussian Splatting via Diffusion Prior and Parametric Light Source Optimization

链接: https://arxiv.org/abs/2408.08524
作者: Kang Du,Zhihao Liang,Zeyu Wang
关键词-EN: intuitive light editing, Gaussian Splatting, view synthesis, synthesis and intuitive, illumination decomposition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 13 figures

点击查看摘要

Abstract:We present GS-ID, a novel framework for illumination decomposition on Gaussian Splatting, achieving photorealistic novel view synthesis and intuitive light editing. Illumination decomposition is an ill-posed problem facing three main challenges: 1) priors for geometry and material are often lacking; 2) complex illumination conditions involve multiple unknown light sources; and 3) calculating surface shading with numerous light sources is computationally expensive. To address these challenges, we first introduce intrinsic diffusion priors to estimate the attributes for physically based rendering. Then we divide the illumination into environmental and direct components for joint optimization. Last, we employ deferred rendering to reduce the computational load. Our framework uses a learnable environment map and Spherical Gaussians (SGs) to represent light sources parametrically, therefore enabling controllable and photorealistic relighting on Gaussian Splatting. Extensive experiments and applications demonstrate that GS-ID produces state-of-the-art illumination decomposition results while achieving better geometry reconstruction and rendering performance.

[AI-45] Ex3: Automatic Novel Writing by Extracting Excelsior and Expanding

链接: https://arxiv.org/abs/2408.08506
作者: Huang Lei,Jiaming Guo,Guanhua He,Xishan Zhang,Rui Zhang,Shaohui Peng,Shaoli Liu,Tianshi Chen
关键词-EN: Generating long-term texts, Generating long-term, long-term texts, artificial intelligence, Generating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generating long-term texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Despite the fact that the generated novels reach a sufficient length, they exhibit poor logical coherence and appeal in their plots and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Extracting Excelsior and Expanding (Ex3). Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.
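A minimal sketch of the tree-like expansion stage: each outline node is recursively expanded into children until a target depth, and the leaves are concatenated depth-first into the final text. `toy_expander` is a hypothetical stand-in for the fine-tuned LLM that writes the real expansions.

```python
def expand(node, expander, depth=0, max_depth=2):
    """Tree-like expansion: each outline node is expanded into child passages
    until the target depth, then concatenated depth-first into the novel."""
    if depth == max_depth:
        return node                      # leaf: final prose
    children = expander(node, depth)     # e.g. chapter -> scenes -> paragraphs
    return " ".join(expand(c, expander, depth + 1, max_depth) for c in children)

# Stub standing in for the fine-tuned LLM that writes the actual expansions.
def toy_expander(node, depth):
    return [f"{node}.{i}" for i in range(2)]

novel = expand("outline", toy_expander)
print(novel)  # "outline.0.0 outline.0.1 outline.1.0 outline.1.1"
```

Because each level multiplies the number of nodes, the output length grows geometrically with `max_depth`, which is what makes "arbitrarily long" generation possible from a fixed-length outline.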

[AI-46] Adversarial Contrastive Learning Based Physics-Informed Temporal Networks for Cuffless Blood Pressure Estimation

链接: https://arxiv.org/abs/2408.08488
作者: Rui Wang,Mengshi Qi,Yingxia Shao,Anfu Zhou,Huadong Ma
关键词-EN: extensive applications, mining is immensely, immensely important, important in extensive, Time series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Time series data mining is immensely important in extensive applications, such as traffic, medical, and e-commerce. In this paper, we focus on medical temporal variation modeling, i.e., cuffless blood pressure (BP) monitoring, which has great value in cardiovascular healthcare. Although providing a comfortable user experience, such methods suffer from the demand for a significant amount of realistic data to train an individual model for each subject, especially considering the invasive or obtrusive BP ground-truth measurements. To tackle this challenge, we introduce a novel physics-informed temporal network (PITN) with adversarial contrastive learning to enable precise BP estimation with very limited data. Specifically, we first enhance the physics-informed neural network (PINN) with the temporal block for investigating BP dynamics' multi-periodicity for personal cardiovascular cycle modeling and temporal variation. We then employ adversarial training to generate extra physiological time series data, improving PITN's robustness in the face of sparse subject-specific training data. Furthermore, we utilize contrastive learning to capture the discriminative variations of cardiovascular physiologic phenomena. This approach aggregates physiological signals with similar blood pressure values in latent space while separating clusters of samples with dissimilar blood pressure values. Experiments on three widely-adopted datasets with different modalities (i.e., bioimpedance, PPG, millimeter-wave) demonstrate the superiority and effectiveness of the proposed methods over previous state-of-the-art approaches. The code is available at this https URL.

[AI-47] An Unsupervised Learning Framework Combined with Heuristics for the Maximum Minimal Cut Problem

链接: https://arxiv.org/abs/2408.08484
作者: Huaiyuan Liu,Xianzhang Liu,Donghua Yang,Hongzhi Wang,Yingchi Long,Mengtong Ji,Dongjing Miao,Zhiyu Liang
关键词-EN: Maximum Minimal Cut, Minimal Cut Problem, Maximum Minimal, Minimal Cut, NP-hard combinatorial optimization
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Maximum Minimal Cut Problem (MMCP), an NP-hard combinatorial optimization (CO) problem, has not received much attention due to the demanding and challenging bi-connectivity constraint. Moreover, as a CO problem, it is also a daunting task for machine learning, especially without labeled instances. To deal with these problems, this work proposes an unsupervised learning framework combined with heuristics for MMCP that can provide valid and high-quality solutions. As far as we know, this is the first work that explores machine learning and heuristics to solve MMCP. The unsupervised solver is inspired by a relaxation-plus-rounding approach; the relaxed solution is parameterized by graph neural networks, and the cost and penalty of MMCP are explicitly written out, which allows training the model end-to-end. A crucial observation is that each solution corresponds to at least one spanning tree. Based on this finding, a heuristic solver that implements tree transformations by adding vertices is utilized to repair and improve the solution quality of the unsupervised solver. Alternatively, the graph is simplified while guaranteeing solution consistency, which reduces the running time. We conduct extensive experiments to evaluate our framework and give a specific application. The results demonstrate the superiority of our method over two specially designed techniques.
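The relaxation-plus-rounding idea can be illustrated with a toy rounding step: threshold a relaxed node assignment (which, in the full method, a graph neural network would produce) and evaluate the resulting cut. The bi-connectivity requirement and the spanning-tree repair heuristic are deliberately omitted here; this only shows the rounding and cut-evaluation skeleton.

```python
def round_and_evaluate(graph_edges, relaxed, threshold=0.5):
    """Round a relaxed 0..1 node assignment and compute the resulting cut size.

    `relaxed` maps node -> probability of being on side 1 (in the full method
    this would come from a trained graph neural network; here it is given).
    """
    side = {v: int(p >= threshold) for v, p in relaxed.items()}
    cut = [(u, v) for u, v in graph_edges if side[u] != side[v]]
    return side, len(cut)

# A 4-cycle with a relaxed solution that clearly separates {0,1} from {2,3}.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
relaxed = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.2}
side, cut_size = round_and_evaluate(edges, relaxed)
print(cut_size)  # edges (1,2) and (3,0) cross the partition -> 2
```

A repair step in the spirit of the paper would then check the feasibility constraint and move vertices (guided by a spanning tree) until the rounded solution becomes valid.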

[AI-48] Fairness Issues and Mitigations in (Differentially Private) Socio-demographic Data Processes

链接: https://arxiv.org/abs/2408.08471
作者: Joonhyuk Ko,Juba Ziani,Saswat Das,Matt Williams,Ferdinando Fioretto
关键词-EN: Statistical agencies rely, collect socio-demographic data, socio-demographic data crucial, Statistical agencies, resource allocation
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Statistical agencies rely on sampling techniques to collect socio-demographic data crucial for policy-making and resource allocation. This paper shows that surveys of important societal relevance introduce sampling errors that unevenly impact group-level estimates, thereby compromising fairness in downstream decisions. To address these issues, this paper introduces an optimization approach modeled on real-world survey design processes, ensuring sampling costs are optimized while maintaining error margins within prescribed tolerances. Additionally, privacy-preserving methods used to determine sampling rates can further impact these fairness issues. The paper explores the impact of differential privacy on the statistics informing the sampling process, revealing a surprising effect: not only is the expected negative effect from the addition of noise for differential privacy negligible, but this privacy noise can in fact reduce unfairness, as it positively biases smaller counts. These findings are validated through an extensive analysis using datasets commonly applied in census statistics.
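The counterintuitive positive bias can be reproduced numerically: Laplace noise alone is unbiased, but the common non-negativity post-processing clamps the negative tail and pushes the expectation of small counts upward. The ε and count values below are illustrative choices, not taken from the paper.

```python
import random

random.seed(0)

def laplace(scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count, epsilon):
    """epsilon-DP count release with the usual non-negativity post-processing."""
    return max(0.0, true_count + laplace(1.0 / epsilon))

# Average many noisy releases of a small count: clamping at zero removes
# much of the negative noise tail, so the expectation exceeds the true value.
true_count = 2
mean = sum(dp_count(true_count, epsilon=0.1) for _ in range(20000)) / 20000
print(mean > true_count)  # True: small counts are biased upward
```

For large counts the clamp almost never fires, so the bias vanishes; it is precisely the smallest groups, the ones most hurt by sampling error, that get nudged upward, which is the fairness-improving effect the abstract reports.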

[AI-49] Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

链接: https://arxiv.org/abs/2408.08470
作者: Jerry Huang,Prasanna Parthasarathi,Mehdi Rezagholizadeh,Sarath Chandar
关键词-EN: large language models, widespread adoption, resource constraints, growing sizes, sizes only increasing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages (9 pages main content + references + appendix)

点击查看摘要

Abstract:Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier for use. One noted issue is the high latency associated with auto-regressive generation, rendering the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but remains dependent on alignment between the two models. Thus if the draft model is insufficiently capable on some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision-making problem, we observe it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains provided the candidates are effective. Further results show this to hold in various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.
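A toy version of the contextual-bandit view: an ε-greedy policy learns, per context (domain), which draft model yields the higher acceptance reward. The domains, model names, and reward probabilities below are hypothetical, and the paper trains an offline policy rather than this online loop; the sketch only shows why context-dependent selection matters.

```python
import random

random.seed(7)

# Hypothetical per-domain acceptance rates of two draft models; the policy
# does not see these and must learn them from observed 0/1 rewards.
TRUE_REWARD = {("code", "draft_a"): 0.9, ("code", "draft_b"): 0.4,
               ("chat", "draft_a"): 0.3, ("chat", "draft_b"): 0.8}

def run_bandit(contexts, drafts, steps=2000, eps=0.1):
    counts = {k: 0 for k in TRUE_REWARD}
    totals = {k: 0.0 for k in TRUE_REWARD}
    est = lambda ctx, d: totals[(ctx, d)] / max(counts[(ctx, d)], 1)
    for _ in range(steps):
        ctx = random.choice(contexts)
        if random.random() < eps:
            arm = random.choice(drafts)                 # explore
        else:
            arm = max(drafts, key=lambda d: est(ctx, d))  # exploit
        reward = 1.0 if random.random() < TRUE_REWARD[(ctx, arm)] else 0.0
        counts[(ctx, arm)] += 1
        totals[(ctx, arm)] += reward
    return {ctx: max(drafts, key=lambda d: est(ctx, d)) for ctx in contexts}

policy = run_bandit(["code", "chat"], ["draft_a", "draft_b"])
print(policy)  # expected: draft_a for code, draft_b for chat
```

A single fixed draft model would be the wrong choice in one of the two domains; conditioning the choice on context recovers the better assistant in both.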

[AI-50] A theory of understanding for artificial intelligence: composability catalysts and learning

链接: https://arxiv.org/abs/2408.08463
作者: Zijian Zhang,Sara Aronowitz,Alán Aspuru-Guzik
关键词-EN: crucial yet elusive, elusive concept, concept in artificial, Understanding, artificial intelligence
类目: Artificial Intelligence (cs.AI)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Understanding is a crucial yet elusive concept in artificial intelligence (AI). This work proposes a framework for analyzing understanding based on the notion of composability. Given any subject (e.g., a person or an AI), we suggest characterizing its understanding of an object in terms of its ability to process (compose) relevant inputs into satisfactory outputs from the perspective of a verifier. This highly universal framework can readily apply to non-human subjects, such as AIs, non-human animals, and institutions. Further, we propose methods for analyzing the inputs that enhance output quality in compositions, which we call catalysts. We show how the structure of a subject can be revealed by analyzing its components that act as catalysts and argue that a subject’s learning ability can be regarded as its ability to compose inputs into its inner catalysts. Finally we examine the importance of learning ability for AIs to attain general intelligence. Our analysis indicates that models capable of generating outputs that can function as their own catalysts, such as language models, establish a foundation for potentially overcoming existing limitations in AI understanding.

[AI-51] SpectralEarth: Training Hyperspectral Foundation Models at Scale

链接: https://arxiv.org/abs/2408.08447
作者: Nassim Ait Ali Braham,Conrad M Albrecht,Julien Mairal,Jocelyn Chanussot,Yi Wang,Xiao Xiang Zhu
关键词-EN: remote sensing, multispectral imagery, triggered a paradigm, paradigm shift, shift in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, we introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models leveraging data from the Environmental Mapping and Analysis Program (EnMAP). SpectralEarth comprises 538,974 image patches covering 415,153 unique locations from more than 11,636 globally distributed EnMAP scenes spanning two years of archive. Additionally, 17.5% of these locations include multiple timestamps, enabling multi-temporal HSI analysis. Utilizing state-of-the-art self-supervised learning (SSL) algorithms, we pretrain a series of foundation models on SpectralEarth. We integrate a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation. Experimental results support the versatility of our models, showcasing their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning. The dataset, models, and source code will be made publicly available.

[AI-52] W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering

链接: https://arxiv.org/abs/2408.08444
作者: Jinming Nian,Zhiyuan Peng,Qifan Wang,Yi Fang
关键词-EN: Large Language Models, Large Language, open-domain question answering, factual answers relying, answers relying solely
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In knowledge-intensive tasks such as open-domain question answering (OpenQA), Large Language Models (LLMs) often struggle to generate factual answers relying solely on their internal (parametric) knowledge. To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources, thereby positioning the retriever as a pivotal component. Although dense retrieval demonstrates state-of-the-art performance, its training poses challenges due to the scarcity of ground-truth evidence, largely attributed to the high costs of human annotation. In this paper, we propose W-RAG by utilizing the ranking capabilities of LLMs to create weakly labeled data for training dense retrievers. Specifically, we rerank the top-K passages retrieved via BM25 by assessing the probability that LLMs will generate the correct answer based on the question and each passage. The highest-ranking passages are then used as positive training examples for dense retrieval. Our comprehensive experiments across four publicly available OpenQA datasets demonstrate that our approach enhances both retrieval and OpenQA performance compared to baseline models.
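The weak-labeling step can be sketched as follows: score each BM25-retrieved passage by the probability that an LLM generates the correct answer given the question and that passage, and keep the top-ranked passages as positives for dense-retriever training. `toy_prob` is a placeholder for the real LLM scoring function, which the sketch does not reproduce.

```python
def weak_label(question, answer, passages, llm_answer_prob, top_n=1):
    """Rank BM25-retrieved passages by P(answer | question, passage) and
    return the top-scoring ones as positive training examples."""
    scored = sorted(passages,
                    key=lambda p: llm_answer_prob(question, p, answer),
                    reverse=True)
    return scored[:top_n]

# Stub for the LLM: a real system would compute the generation probability
# of the gold answer conditioned on the question and the passage.
def toy_prob(question, passage, answer):
    return 1.0 if answer in passage else 0.1

passages = ["Paris is the capital of France.", "France exports wine."]
positives = weak_label("What is the capital of France?", "Paris",
                       passages, toy_prob)
print(positives[0])  # the passage that actually supports the answer
```

The appeal of this setup is that no human relevance labels are needed: the LLM's answer likelihood acts as the (weak) supervision signal for the retriever.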

[AI-53] PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications

链接: https://arxiv.org/abs/2408.08437
作者: Kshitij Bhardwaj
关键词-EN: replacing convolutional neural, convolutional neural networks, computer vision tasks, optimize vision transformers, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state-of-the-art, they are complex and memory-intensive models. In order to effectively run these models on resource-constrained mobile/edge systems, there is a need not only to compress these models but also to optimize them and convert them into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool is able to support different types of structured pruning based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends. We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that even after pruning a DeiT model by 9.375%, quantizing it from FP32 to int8, and optimizing it for mobile applications, we observe a 7.18X latency reduction with a small accuracy loss of only 2.24%. The tool is open source.
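A pure-Python illustration of the two operations the tool combines: magnitude-based pruning and symmetric FP32→int8 quantization. PQV-Mobile itself operates on PyTorch models with several importance criteria (magnitude, Taylor, Hessian); this sketch only shows the underlying arithmetic on a flat weight list.

```python
def magnitude_prune(weights, ratio):
    """Zero out (at least) the `ratio` fraction of smallest-magnitude weights."""
    k = int(len(weights) * ratio)
    cutoff = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= cutoff else w for w in weights]

def quantize_int8(weights):
    """Symmetric int8 quantization: q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

w = [0.02, -0.5, 1.27, -0.01, 0.8]
pruned = magnitude_prune(w, ratio=0.4)   # drops the two smallest magnitudes
q, scale = quantize_int8(pruned)
restored = [qi * scale for qi in q]       # dequantize to check the error
print(pruned, q)  # [0.0, -0.5, 1.27, 0.0, 0.8] [0, -50, 127, 0, 80]
```

The dequantization error per weight is bounded by roughly half the scale, which is where the small accuracy loss (2.24% in the paper's DeiT experiment) comes from; pruning contributes the memory and latency savings.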

[AI-54] Automated Design of Agentic Systems

链接: https://arxiv.org/abs/2408.08435
作者: Shengran Hu,Cong Lu,Jeff Clune
关键词-EN: investing substantial effort, Researchers are investing, developing powerful general-purpose, Foundation Models, Meta Agent Search
类目: Artificial Intelligence (cs.AI)
*备注: Website: this https URL

点击查看摘要

Abstract:Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We formulate a new research area, Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways. We further demonstrate that there is an unexplored yet promising approach within ADAS where agents can be defined in code and new agents can be automatically discovered by a meta agent programming ever better ones in code. Given that programming languages are Turing Complete, this approach theoretically enables the learning of any possible agentic system: including novel prompts, tool use, control flows, and combinations thereof. We present a simple yet effective algorithm named Meta Agent Search to demonstrate this idea, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries. Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. Importantly, we consistently observe the surprising result that agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality. Provided we develop it safely, our work illustrates the potential of an exciting new research direction toward automatically designing ever-more powerful agentic systems to benefit humanity.
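The archive-driven loop at the heart of Meta Agent Search can be outlined as follows. In the paper, `propose` is a meta agent (an LLM) that writes new agents as code; here both `propose` and `evaluate` are toy stand-ins, so this is a skeleton of the control flow only.

```python
def meta_agent_search(propose, evaluate, rounds=5):
    """Skeleton of the archive-driven loop: a meta agent (`propose`) writes a
    new candidate agent conditioned on the archive of previous discoveries,
    the candidate is evaluated, and the archive grows, best-first."""
    archive = []  # list of (score, agent), best first
    for _ in range(rounds):
        candidate = propose(archive)
        archive.append((evaluate(candidate), candidate))
        archive.sort(key=lambda t: t[0], reverse=True)
    return archive[0]  # best (score, agent) found

# Toy stand-ins: each "agent" is just an integer that improves every round.
best = meta_agent_search(propose=lambda archive: len(archive),
                         evaluate=lambda agent: agent)
```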

[AI-55] Multi-Modal Dialogue State Tracking for Playing GuessWhich Game

链接: https://arxiv.org/abs/2408.08431
作者: Wei Pang,Ruixue Duan,Jinfu Yang,Ning Li
关键词-EN: Questioner Bot, Answer Bot, visually related reasoning, Bot, visually related
类目: Artificial Intelligence (cs.AI)
*备注: Published at CICAI 2023 (CAAI-A), codes at this https URL

点击查看摘要

Abstract:GuessWhich is an engaging visual dialogue game that involves interaction between a Questioner Bot (QBot) and an Answer Bot (ABot) in the context of image-guessing. In this game, QBot’s objective is to locate a concealed image solely through a series of visually related questions posed to ABot. However, effectively modeling visually related reasoning in QBot’s decision-making process poses a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, both of which are inadequate for visual reasoning. To address this limitation, we propose a novel approach that focuses on visually related reasoning through the use of a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state comprises a collection of representations of mental imagery, as well as representations of the entities involved in the conversation. At each round, QBot engages in visually related reasoning using the dialogue state to construct an internal representation, generate relevant questions, and update both the dialogue state and internal representation upon receiving an answer. Our experimental results on the VisDial datasets (v0.5, 0.9, and 1.0) demonstrate the effectiveness of our proposed model, as it achieves new state-of-the-art performance across all metrics and datasets, surpassing previous state-of-the-art models. Codes and datasets from our experiments are freely available at \hrefthis https URL.

[AI-56] Assessing and Enhancing Large Language Models in Rare Disease Question-answering

链接: https://arxiv.org/abs/2408.08422
作者: Guanchu Wang,Junhao Ran,Ruixiang Tang,Chia-Yuan Chang,Yu-Neng Chuang,Zirui Liu,Vladimir Braverman,Zhandong Liu,Xia Hu
关键词-EN: Large Language Models, Large Language, general medical domains, diagnosing rare diseases, capabilities of Large
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the impressive capabilities of Large Language Models (LLMs) in general medical domains, questions remain about their performance in diagnosing rare diseases. To answer this question, we aim to assess the diagnostic performance of LLMs in rare diseases, and explore methods to enhance their effectiveness in this area. In this work, we introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of LLMs in diagnosing rare diseases. Specifically, we collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. Additionally, we annotated meta-data for each question, facilitating the extraction of subsets specific to any given disease and its property. Based on the ReDis-QA dataset, we benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. To facilitate retrieval augmentation generation for rare disease diagnosis, we collect the first rare diseases corpus (ReCOP), sourced from the National Organization for Rare Disorders (NORD) database. Specifically, we split the report of each rare disease into multiple chunks, each representing a different property of the disease, including their overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies. This structure ensures that the information within each chunk aligns consistently with a question. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%. Moreover, it significantly guides LLMs to generate trustworthy answers and explanations that can be traced back to existing literature. 
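The per-property chunking that ReCOP applies to each NORD report can be sketched directly. The section names come from the abstract; the assumption that each section starts with `<Property>:` on its own line is a format guess for illustration.

```python
PROPERTIES = ["overview", "symptoms", "causes", "effects",
              "related disorders", "diagnosis", "standard therapies"]

def chunk_report(report):
    """Split a NORD-style rare-disease report into per-property chunks,
    mirroring the ReCOP structure described in the abstract."""
    chunks, current = {}, None
    for line in report.splitlines():
        head = line.rstrip(":").strip().lower()
        if head in PROPERTIES:
            current = head
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {k: "\n".join(v).strip() for k, v in chunks.items()}

report = "Overview:\nA rare disease of the retina.\nSymptoms:\nFever and fatigue."
chunks = chunk_report(report)
```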

[AI-57] Understanding Help-Seeking Behavior of Students Using LLMs vs. Web Search for Writing SQL Queries

链接: https://arxiv.org/abs/2408.08401
作者: Harsh Kumar,Mohi Reza,Jeb Mitchell,Ilya Musabirov,Lisa Zhang,Michael Liut
关键词-EN: large language models, students write SQL, write SQL queries, SQL queries, language models
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Growth in the use of large language models (LLMs) in programming education is altering how students write SQL queries. Traditionally, students relied heavily on web search for coding assistance, but this has shifted with the adoption of LLMs like ChatGPT. However, the comparative process and outcomes of using web search versus LLMs for coding help remain underexplored. To address this, we conducted a randomized interview study in a database classroom to compare web search and LLMs, including a publicly available LLM (ChatGPT) and an instructor-tuned LLM, for writing SQL queries. Our findings indicate that using an instructor-tuned LLM required significantly more interactions than both ChatGPT and web search, but resulted in a similar number of edits to the final SQL query. No significant differences were found in the quality of the final SQL queries between conditions, although the LLM conditions directionally showed higher query quality. Furthermore, students using instructor-tuned LLM reported a lower mental demand. These results have implications for learning and productivity in programming education.

[AI-58] API-guided Dataset Synthesis to Finetune Large Code Models

链接: https://arxiv.org/abs/2408.08343
作者: Zongjie Li,Daoyuan Wu,Shuai Wang,Zhendong Su
关键词-EN: demonstrated remarkable performance, Large code models, vast code corpora, Large code, pre-trained on vast
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific requirements and enhancing their performance in particular domains. However, synthesizing high-quality SFT datasets poses a significant challenge due to the uneven quality of datasets and the scarcity of domain-specific datasets. Inspired by APIs as high-level abstractions of code that encapsulate rich semantic information in a concise structure, we propose DataScope, an API-guided dataset synthesis framework designed to enhance the SFT process for LCMs in both general and domain-specific scenarios. DataScope comprises two main components: Dsel and Dgen. On one hand, Dsel employs API coverage as a core metric, enabling efficient dataset synthesis in general scenarios by selecting subsets of existing (uneven-quality) datasets with higher API coverage. On the other hand, Dgen recasts domain dataset synthesis as a process of using API-specified high-level functionality and deliberately-constituted code skeletons to synthesize concrete code. Extensive experiments demonstrate DataScope’s effectiveness, with models fine-tuned on its synthesized datasets outperforming those tuned on unoptimized datasets five times larger. Furthermore, a series of analyses on model internals, relevant hyperparameters, and case studies provide additional evidence for the efficacy of our proposed methods. These findings underscore the significance of dataset quality in SFT and advance the field of LCMs by providing an efficient, cost-effective framework for constructing high-quality datasets. This contribution enhances performance across both general and domain-specific scenarios, paving the way for more powerful and tailored LCMs. 
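Dsel's idea of selecting a subset with maximal API coverage maps naturally onto a greedy set-cover heuristic. The abstract names API coverage as the core metric but not the exact algorithm, so the greedy routine below is one plausible reading, not the paper's implementation.

```python
def select_by_api_coverage(samples, budget):
    """Greedy subset selection: repeatedly pick the sample that adds the most
    not-yet-covered APIs (a standard greedy set-cover heuristic)."""
    covered, chosen = set(), []
    pool = list(samples.items())
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda kv: len(set(kv[1]) - covered))
        if not set(best[1]) - covered:
            break  # no remaining sample adds new APIs
        chosen.append(best[0])
        covered |= set(best[1])
        pool.remove(best)
    return chosen, covered

samples = {
    "a": ["os.path.join", "json.load"],
    "b": ["json.load"],
    "c": ["re.sub"],
}
chosen, covered = select_by_api_coverage(samples, budget=2)
```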

[AI-59] Graph representations of 3D data for machine learning

链接: https://arxiv.org/abs/2408.08336
作者: Tomasz Prytuła
关键词-EN: machine learning algorithms, graphs and meshes, learning algorithms, give an overview, overview of combinatorial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:We give an overview of combinatorial methods to represent 3D data, such as graphs and meshes, from the viewpoint of their amenability to analysis using machine learning algorithms. We highlight pros and cons of various representations and we discuss some methods of generating/switching between the representations. We finally present two concrete applications in life science and industry. Despite its theoretical nature, our discussion is in general motivated by, and biased towards real-world challenges.
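As a concrete instance of the combinatorial representations the survey covers, a triangle mesh converts into a graph by taking vertices as nodes and triangle edges as graph edges. This is a generic construction, not code from the paper.

```python
def mesh_to_graph(faces):
    """Build an adjacency-list graph from the triangle faces of a mesh:
    vertices become nodes, each triangle contributes its three edges."""
    adj = {}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (a, c)):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

faces = [(0, 1, 2), (1, 2, 3)]  # two triangles sharing the edge (1, 2)
adj = mesh_to_graph(faces)
```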

[AI-60] Plan with Code: Comparing approaches for robust NL to DSL generation

链接: https://arxiv.org/abs/2408.08335
作者: Nastaran Bassamzadeh,Chhaya Methani
关键词-EN: code, Natural Language, Domain Specific Languages, function, generated via Natural
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 9 pages, 1 figure, 5 tables. arXiv admin note: substantial text overlap with arXiv:2407.02742

点击查看摘要

Abstract:Planning in code is considered a more reliable approach for many orchestration tasks. This is because code is more tractable than steps generated via Natural Language and makes it easy to support more complex sequences by abstracting deterministic logic into functions. It also allows spotting issues with incorrect function names with the help of parsing checks that can be run on code. Progress in Code Generation methodologies, however, remains limited to general-purpose languages like C, C++, and Python. LLMs continue to face challenges with custom function names in Domain Specific Languages or DSLs, leading to higher hallucination rates and syntax errors. This is more common for custom function names, which are typically part of the plan. Moreover, keeping LLMs up-to-date with newer function names is an issue. This poses a challenge for scenarios like task planning over a large number of APIs, since the plan is represented as a DSL having custom API names. In this paper, we focus on workflow automation in the RPA (Robotic Process Automation) domain as a special case of task planning. We present optimizations for using Retrieval Augmented Generation (or RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies with a fine-tuned model. Our results showed that the fine-tuned model scored the best on the code similarity metric. However, with our optimizations, the RAG approach is able to match that quality for in-domain API names in the test set. Additionally, it offers a significant advantage for out-of-domain or unseen API names, outperforming the fine-tuned model on the similarity metric by 7 points.
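Why grounding helps with hallucinated custom API names can be illustrated with a toy post-processor: map whatever name the LLM emits to the closest entry in the DSL's real API catalog. Here `difflib` stands in for a real retriever, and the catalog names are invented for the example.

```python
import difflib

def ground_api_name(predicted, catalog, cutoff=0.6):
    """Map a possibly-hallucinated function name emitted by an LLM to the
    closest real API in the DSL catalog, or None if nothing is close."""
    matches = difflib.get_close_matches(predicted, catalog, n=1, cutoff=cutoff)
    return matches[0] if matches else None

catalog = ["SendEmail", "CreateTask", "UpdateRecord"]  # hypothetical DSL APIs
```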

[AI-61] CodeMirage: Hallucinations in Code Generated by Large Language Models IJCAI2024

链接: https://arxiv.org/abs/2408.08333
作者: Vibhor Agarwal,Yulong Pei,Salwa Alamir,Xiaomo Liu
关键词-EN: Large Language Models, Large Language, shown promising potentials, Language Models, code
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted at AutoMates @ IJCAI 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. However, LLMs are prone to generating hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, a similar hallucination phenomenon can occur in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adoption of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5 generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose a methodology for code hallucination detection and experiment with open-source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using a one-shot prompt. We find that GPT-4 performs the best on the HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on the MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.

[AI-62] Unleash The Power of Pre-Trained Language Models for Irregularly Sampled Time Series

链接: https://arxiv.org/abs/2408.08328
作者: Weijia Zhang,Chenlong Yin,Hao Liu,Hui Xiong
关键词-EN: Pre-trained Language Models, natural language processing, Pre-trained Language, Sampled Time Series, language processing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Pre-trained Language Models (PLMs), such as ChatGPT, have significantly advanced the field of natural language processing. This progress has inspired a series of innovative studies that explore the adaptation of PLMs to time series analysis, intending to create a unified foundation model that addresses various time series analytical tasks. However, these efforts predominantly focus on Regularly Sampled Time Series (RSTS), neglecting the unique challenges posed by Irregularly Sampled Time Series (ISTS), which are characterized by non-uniform sampling intervals and prevalent missing data. To bridge this gap, this work explores the potential of PLMs for ISTS analysis. We begin by investigating the effect of various methods for representing ISTS, aiming to maximize the efficacy of PLMs in this under-explored area. Furthermore, we present a unified PLM-based framework, ISTS-PLM, which integrates time-aware and variable-aware PLMs tailored for comprehensive intra and inter-time series modeling and includes a learnable input embedding layer and a task-specific output layer to tackle diverse ISTS analytical tasks. Extensive experiments on a comprehensive benchmark demonstrate that the ISTS-PLM, utilizing a simple yet effective series-based representation for ISTS, consistently achieves state-of-the-art performance across various analytical tasks, such as classification, interpolation, and extrapolation, as well as few-shot and zero-shot learning scenarios, spanning scientific domains like healthcare and biomechanics.
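One plausible reading of the "series-based representation" for irregularly sampled time series is grouping raw (time, variable, value) observations into one time-ordered series per variable, so non-uniform intervals and missing variables are handled naturally. The paper's exact encoding may differ; this is only an illustrative sketch.

```python
def to_series_representation(observations):
    """Group irregular (time, variable, value) observations into one
    time-ordered series per variable."""
    series = {}
    for t, var, val in observations:
        series.setdefault(var, []).append((t, val))
    return {var: sorted(pts) for var, pts in series.items()}

# Observations arrive out of order and variables are sampled at different times.
obs = [(2.5, "hr", 80), (0.1, "hr", 75), (1.0, "bp", 120)]
series = to_series_representation(obs)
```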

[AI-63] First Analysis of the EU Artificial Intelligence Act: Towards a Global Standard for Trustworthy AI?

链接: https://arxiv.org/abs/2408.08318
作者: Marion Ho-Dac(UA, CDEP)
关键词-EN: Artificial Intelligence Act, European Union, Artificial Intelligence, Intelligence Act, Act
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: in French language

点击查看摘要

Abstract:The EU Artificial Intelligence Act (AI Act) came into force in the European Union (EU) on 1 August 2024. It is a key piece of legislation both for the citizens at the heart of AI technologies and for the industry active in the internal market. The AI Act imposes progressive compliance on organisations - both private and public - involved in the global value chain of AI systems and models marketed and used in the EU. While the Act is unprecedented on an international scale in terms of its horizontal and binding regulatory scope, its global appeal in support of trustworthy AI is one of its major challenges.

[AI-64] Segment Anything for Videos: A Systematic Survey

链接: https://arxiv.org/abs/2408.08315
作者: Chunhui Zhang,Yawen Cui,Weilin Lin,Guanjie Huang,Yan Rong,Li Liu,Shiguang Shan
关键词-EN: witnessed tremendous success, exploring task-agnostic visual, SAM, computer vision, task-agnostic visual foundation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the latest released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks; a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review on SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances, and innovation opportunities of developing foundation models on broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analysis, are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.

[AI-65] A Disease-Specific Foundation Model Using Over 100K Fundus Images: Release and Validation for Abnormality and Multi-Disease Classification on Downstream Tasks

链接: https://arxiv.org/abs/2408.08790
作者: Boa Jang,Youngbin Ahn,Eun Kyung Choe,Chang Ki Yoon,Hyuk Jin Choi,Young-Gon Kim
关键词-EN: Artificial intelligence applied, Artificial intelligence, offers significant potential, artificial intelligence models, generalized artificial intelligence
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Artificial intelligence applied to retinal images offers significant potential for recognizing signs and symptoms of retinal conditions and expediting the diagnosis of eye diseases and systemic disorders. However, developing generalized artificial intelligence models for medical data often requires a large number of labeled images representing various disease signs, and most models are typically task-specific, focusing on major retinal diseases. In this study, we developed a Fundus-Specific Pretrained Model (Image+Fundus), a supervised artificial intelligence model trained to detect abnormalities in fundus images. A total of 57,803 images were used to develop this pretrained model, which achieved superior performance across various downstream tasks, indicating that our proposed model outperforms other general methods. Our Image+Fundus model offers a generalized approach to improve model performance while reducing the number of labeled datasets required. Additionally, it provides more disease-specific insights into fundus images, with visualizations generated by our model. These disease-specific foundation models are invaluable in enhancing the performance and efficiency of deep learning models in the field of fundus imaging.

[AI-66] ASVspoof 5: Crowdsourced Speech Data Deepfakes and Adversarial Attacks at Scale INTERSPEECH2024

链接: https://arxiv.org/abs/2408.08739
作者: Xin Wang,Hector Delgado,Hemlata Tak,Jee-weon Jung,Hye-jin Shim,Massimiliano Todisco,Ivan Kukanov,Xuechen Liu,Md Sahidullah,Tomi Kinnunen,Nicholas Evans,Kong Aik Lee,Junichi Yamagishi
关键词-EN: promote the study, study of speech, speech spoofing, spoofing and deepfake, detection solutions
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

点击查看摘要

Abstract:ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.

[AI-67] Efficient Data-Sketches and Fine-Tuning for Early Detection of Distributional Drift in Medical Imaging

链接: https://arxiv.org/abs/2408.08456
作者: Yusen Wu,Hao Chen,Alex Pissinou Makki,Phuong Nguyen,Yelena Yesha
关键词-EN: underlying data distribution, treatment decisions, Distributional drift, Distributional drift detection, detect distributional drift
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributional drift detection is important in medical applications as it helps ensure the accuracy and reliability of models by identifying changes in the underlying data distribution that could affect diagnostic or treatment decisions. However, current methods have limitations in detecting drift; for example, the inclusion of abnormal datasets can lead to unfair comparisons. This paper presents an accurate and sensitive approach to detect distributional drift in CT-scan medical images by leveraging data-sketching and fine-tuning techniques. We developed a robust baseline library model for real-time anomaly detection, allowing for efficient comparison of incoming images and identification of anomalies. Additionally, we fine-tuned a vision transformer pre-trained model to extract relevant features using breast cancer images as an example, significantly enhancing model accuracy to 99.11%. Combining data-sketches with fine-tuning, our feature extraction evaluation demonstrated that cosine similarity scores between similar datasets improve substantially, increasing from around 50% to 100%. Finally, the sensitivity evaluation shows that our solution is highly sensitive to even 1% salt-and-pepper and speckle noise, and is not sensitive to lighting noise (e.g., lighting conditions have no impact on data drift). The proposed methods offer a scalable and reliable solution for maintaining the accuracy of diagnostic models in dynamic clinical environments.
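The cosine-similarity comparison the abstract relies on is a standard operation: score an incoming image's feature vector against a baseline sketch. A dependency-free version, purely to make the metric explicit (the paper's features come from a vision transformer, not this toy):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors: dot product divided by
    the product of their Euclidean norms; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```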

[AI-68] Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts

链接: https://arxiv.org/abs/2408.08432
作者: Abdur R. Fayjie,Jutika Borah,Florencia Carbone,Jan Tack,Patrick Vandewalle
关键词-EN: shown tremendous progress, predictive uncertainty, shown tremendous, tremendous progress, wide range
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Deep learning has shown tremendous progress in a wide range of digital pathology and medical image classification tasks. Its integration into safe clinical decision-making support requires robust and reliable models. However, real-world data comes with diversities that often lie outside the intended source distribution. Moreover, when test samples are dramatically different, clinical decision-making is greatly affected. Quantifying predictive uncertainty in models is crucial for well-calibrated predictions and determining when (or not) to trust a model. Unfortunately, many works have overlooked the importance of predictive uncertainty estimation. This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems. We investigate the effect of various carcinoma distribution shift scenarios on predictive performance and calibration. We first systematically investigate three popular methods for improving predictive uncertainty: Monte Carlo dropout, deep ensemble, and few-shot learning on lung adenocarcinoma classification as a primary disease in whole slide images. Secondly, we compare the effectiveness of the methods in terms of performance and calibration under clinically relevant distribution shifts such as in-distribution shifts comprising primary disease sub-types and other characterization analysis data; out-of-distribution shifts comprising well-differentiated cases, different organ origin, and imaging modality shifts. While studies on uncertainty estimation exist, to our best knowledge, no rigorous large-scale benchmark compares predictive uncertainty estimation including these dataset shifts for lung carcinoma classification.
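Of the three uncertainty methods the abstract compares, Monte Carlo dropout is the simplest to outline: keep dropout active at test time, run many stochastic forward passes, and report the spread of predictions as uncertainty. The models in the paper are deep networks; `noisy_model` below is a hypothetical stochastic stand-in.

```python
import random
import statistics

def mc_dropout_predict(stochastic_forward, x, n_samples=100):
    """Monte Carlo dropout sketch: average many stochastic forward passes and
    use their standard deviation as a simple predictive-uncertainty estimate."""
    preds = [stochastic_forward(x) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

random.seed(0)
noisy_model = lambda x: x + random.gauss(0.0, 0.1)  # stand-in for dropout net
mean, spread = mc_dropout_predict(noisy_model, 1.0)
```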

[AI-69] Decoding the human brain tissue response to radiofrequency excitation using a biophysical-model-free deep MRI on a chip framework

链接: https://arxiv.org/abs/2408.08376
作者: Dinor Nagar(1),Moritz Zaiss(2 and 3),Or Perlman(4 and 5) ((1) School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel, (2) Institute of Neuroradiology, Friedrich-Alexander Universitat Erlangen-Nurnberg (FAU), University Hospital Erlangen, Erlangen, Germany, (3) Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander Universitat Erlangen-Nurnberg, Erlangen, Germany, (4) Department of Biomedical Engineering, Tel Aviv University, Tel Aviv, Israel, (5) Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel)
关键词-EN: relies on radiofrequency, proton spin, Magnetic resonance imaging, multiple MRI contrasts, MRI
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: This project was funded by the European Union (ERC, BabyMagnet, project no. 101115639). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) relies on radiofrequency (RF) excitation of proton spin. Clinical diagnosis requires a comprehensive collation of biophysical data via multiple MRI contrasts, acquired using a series of RF sequences that lead to lengthy examinations. Here, we developed a vision transformer-based framework that captures the spatiotemporal magnetic signal evolution and decodes the brain tissue response to RF excitation, constituting an MRI on a chip. Following a per-subject rapid calibration scan (28.2 s), a wide variety of image contrasts including fully quantitative molecular, water relaxation, and magnetic field maps can be generated automatically. The method was validated across healthy subjects and a cancer patient in two different imaging sites, and proved to be 94% faster than alternative protocols. The deep MRI on a chip (DeepMonC) framework may reveal the molecular composition of the human brain tissue in a wide range of pathologies, while offering clinically attractive scan times.

[AI-70] Exploring Latent Space for Generating Peptide Analogs Using Protein Language Models

链接: https://arxiv.org/abs/2408.08341
作者: Po-Yu Liang,Xueting Huang,Tibo Duran,Andrew J. Wiemer,Jun Bai
关键词-EN: discovery and biotechnology, Generating peptides, crucial for drug, drug discovery, Generating
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating peptides with desired properties is crucial for drug discovery and biotechnology. Traditional sequence-based and structure-based methods often require extensive datasets, which limits their effectiveness. In this study, we propose a novel method that utilizes autoencoder-shaped models to explore the protein embedding space and generate novel peptide analogs by leveraging protein language models. The proposed method requires only a single sequence of interest, avoiding the need for large datasets. Our results show significant improvements over baseline models in similarity indicators of peptide structures, descriptors, and bioactivities. The proposed method, validated through Molecular Dynamics simulations on TIGIT inhibitors, demonstrates that it produces peptide analogs with similar yet distinct properties, highlighting its potential to enhance peptide screening processes.

计算机视觉

[CV-0] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

链接: https://arxiv.org/abs/2408.08872
作者: Le Xue,Manli Shu,Anas Awadalla,Jun Wang,An Yan,Senthil Purushwalkam,Honglu Zhou,Viraj Prabhu,Yutong Dai,Michael S Ryoo,Shrikant Kendre,Jieyu Zhang,Can Qin,Shu Zhang,Chia-Chih Chen,Ning Yu,Juntao Tan,Tulika Manoj Awalgaonkar,Shelby Heinecke,Huan Wang,Yejin Choi,Ludwig Schmidt,Zeyuan Chen,Silvio Savarese,Juan Carlos Niebles,Caiming Xiong,Ran Xu
关键词-EN: developing Large Multimodal, Large Multimodal Models, Large Multimodal, developing Large, Multimodal Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

[CV-1] SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

链接: https://arxiv.org/abs/2408.08870
作者: Xinyu Xiong,Zihuang Wu,Shuangyi Tan,Wenxue Li,Feilong Tang,Ying Chen,Siying Li,Jie Ma,Guanbin Li
关键词-EN: Image segmentation plays, plays an important, important role, vision understanding, vision foundation models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:Image segmentation plays an important role in vision understanding. Recently, the emerging vision foundation models continuously achieved superior performance on various tasks. Following such success, in this paper, we prove that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models. We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation. Specifically, SAM2-UNet adopts the Hiera backbone of SAM2 as the encoder, while the decoder uses the classic U-shaped design. Additionally, adapters are inserted into the encoder to allow parameter-efficient fine-tuning. Preliminary experiments on various downstream tasks, such as camouflaged object detection, salient object detection, marine animal segmentation, mirror detection, and polyp segmentation, demonstrate that our SAM2-UNet can simply beat existing specialized state-of-the-art methods without bells and whistles. Project page: this https URL.

[CV-2] DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

链接: https://arxiv.org/abs/2408.08855
作者: Eman Ali,Sathira Silva,Muhammad Haris Khan
关键词-EN: shown remarkable potential, Vision-language models, shown remarkable, remarkable potential, Vision-language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labelled data is unavailable. Recent research has proposed pseudo-labelling approaches to adapt CLIP in an unsupervised manner using unlabelled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP’s visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.
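The pseudo-label construction from dual prototypes can be sketched as follows. The prototype banks are random stand-ins for CLIP-derived text and image prototypes, and the mixing weight and temperature are assumed hyperparameters, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Two prototype banks acting as distinct classifiers (e.g. one seeded
# from text embeddings, one from image features); random stand-ins.
n_cls, dim = 5, 32
text_protos = l2norm(rng.normal(size=(n_cls, dim)))
img_protos = l2norm(rng.normal(size=(n_cls, dim)))
feats = l2norm(rng.normal(size=(100, dim)))  # unlabelled target features

def pseudo_labels(feats, alpha=0.5, temp=0.01):
    """Convex combination of the two prototype classifiers' outputs."""
    p_text = softmax(feats @ text_protos.T / temp)
    p_img = softmax(feats @ img_protos.T / temp)
    p = alpha * p_text + (1 - alpha) * p_img
    conf = p.max(axis=1)  # confidence used to rank pseudo-labels
    return p.argmax(axis=1), conf

labels, conf = pseudo_labels(feats)
# Keep only the most confident half for robust early self-training.
keep = conf >= np.median(conf)
```

The ranking step mirrors the paper's idea of trusting only high-confidence pseudo-labels early in adaptation.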

[CV-3] RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

链接: https://arxiv.org/abs/2408.08827
作者: Andong Lu,Wanyu Wang,Chenglong Li,Jin Tang,Bin Luo
关键词-EN: RGBT tracking, fusion Mamba, robust RGBT tracking, Difference-based Fusion Mamba, Existing RGBT tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion of each layer, but cannot execute the feature interactions among all layers, which play a critical role in robust multimodal representation, due to the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions of all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, it is always challenging to build multimodal interactions in each layer due to the difficulty of balancing interaction capabilities and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a huge number of token sequences (3840 tokens in this work) are involved and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions of all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.

[CV-4] PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future

链接: https://arxiv.org/abs/2408.08822
作者: Guangyi Wang,Yuren Cai,Lijiang Li,Wei Peng,Songzhi Su
关键词-EN: shown remarkable potential, Diffusion Probabilistic Models, ODE solvers, fast ODE solvers, Diffusion Probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Probabilistic Models (DPMs) have shown remarkable potential in image generation, but their sampling efficiency is hindered by the need for numerous denoising steps. Most existing solutions accelerate the sampling process by proposing fast ODE solvers. However, the inevitable discretization errors of the ODE solvers are significantly magnified when the number of function evaluations (NFE) is fewer. In this work, we propose PFDiff, a novel training-free and orthogonal timestep-skipping strategy, which enables existing fast ODE solvers to operate with fewer NFE. PFDiff builds on two key observations: the model’s outputs are highly similar at moderately spaced time steps during the denoising process of existing ODE solvers, and the denoising process closely resembles SGD. By employing gradient replacement from past time steps and foresight updates inspired by Nesterov momentum, PFDiff rapidly updates intermediate states, thereby reducing unnecessary NFE while correcting for discretization errors inherent in first-order ODE solvers. Experimental results demonstrate that PFDiff exhibits flexible applicability across various pre-trained DPMs, particularly excelling in conditional DPMs and surpassing previous state-of-the-art training-free methods. For instance, using DDIM as a baseline, we achieved 16.46 FID (4 NFE) compared to 138.81 FID with DDIM on ImageNet 64x64 with classifier guidance, and 13.06 FID (10 NFE) on Stable Diffusion with 7.5 guidance scale.
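The gradient-replacement idea (reusing a past model output instead of a fresh evaluation on skipped steps) can be sketched on a toy first-order sampler. The Nesterov-style foresight update is omitted, and the "denoiser" here is an arbitrary small linear map, not a trained DPM:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "denoiser": predicts noise as a fixed linear map of the state.
A = 0.1 * rng.normal(size=(4, 4))
def eps_model(x, t):
    return A @ x

def ddim_like(x, steps, skip=False):
    """Toy first-order updates; with skip=True, every other model call
    is replaced by the cached previous output (PFDiff-style reuse)."""
    calls, eps = 0, None
    for i in range(steps):
        if not skip or i % 2 == 0 or eps is None:
            eps = eps_model(x, i)
            calls += 1
        x = x - 0.05 * eps  # toy first-order ODE step
    return x, calls

x0 = rng.normal(size=4)
full, c_full = ddim_like(x0, steps=10, skip=False)
fast, c_fast = ddim_like(x0, steps=10, skip=True)
```

Because successive outputs are nearly identical (the paper's first observation), halving the number of model calls barely changes the final state in this toy setting.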

[CV-5] Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models

链接: https://arxiv.org/abs/2408.08813
作者: Lin Zhao,Xiao Chen,Eric Z. Chen,Yikang Liu,Terrence Chen,Shanhui Sun
关键词-EN: Medical image segmentation, presents significant challenges, Medical image, few-shot medical image, image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image segmentation is crucial for clinical decision-making, but the scarcity of annotated data presents significant challenges. Few-shot segmentation (FSS) methods show promise but often require retraining on the target domain and struggle to generalize across different modalities. Similarly, adapting foundation models like the Segment Anything Model (SAM) for medical imaging has limitations, including the need for finetuning and domain-specific adaptation. To address these issues, we propose a novel method that adapts DINOv2 and Segment Anything Model 2 (SAM 2) for retrieval-augmented few-shot medical image segmentation. Our approach uses DINOv2’s features as queries to retrieve similar samples from limited annotated data, which are then encoded as memories and stored in a memory bank. With the memory attention mechanism of SAM 2, the model leverages these memories as conditions to generate accurate segmentation of the target image. We evaluated our framework on three medical image segmentation tasks, demonstrating superior performance and generalizability across various modalities without the need for any retraining or finetuning. Overall, this method offers a practical and effective solution for few-shot medical image segmentation and holds significant potential as a valuable annotation tool in clinical applications.
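The retrieval step can be sketched as a cosine top-k search over support features. The vectors below are random stand-ins for DINOv2 embeddings of the annotated support set:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for DINOv2 global features of a small annotated support set
# and one query image (real features would come from the DINOv2 encoder).
support = rng.normal(size=(20, 64))
query = support[7] + 0.05 * rng.normal(size=64)  # resembles item 7

def retrieve(query, support, k=3):
    """Return indices of the k most similar support samples (cosine)."""
    q = query / np.linalg.norm(query)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    sims = s @ q
    return np.argsort(-sims)[:k]

memory_ids = retrieve(query, support, k=3)
# The retrieved samples' images and masks would then be encoded into
# the memory bank that conditions SAM 2's memory attention.
```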

[CV-6] PriorMapNet: Enhancing Online Vectorized HD Map Construction with Priors

链接: https://arxiv.org/abs/2408.08802
作者: Rongxuan Wang,Xin Lu,Xiaoyang Liu,Xiaoyi Zou,Tongyi Cao,Ying Li
关键词-EN: Online vectorized High-Definition, autonomous driving, crucial for subsequent, Online vectorized, map construction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Online vectorized High-Definition (HD) map construction is crucial for subsequent prediction and planning tasks in autonomous driving. Following MapTR paradigm, recent works have made noteworthy achievements. However, reference points are randomly initialized in mainstream methods, leading to unstable matching between predictions and ground truth. To address this issue, we introduce PriorMapNet to enhance online vectorized HD map construction with priors. We propose the PPS-Decoder, which provides reference points with position and structure priors. Fitted from the map elements in the dataset, prior reference points lower the learning difficulty and achieve stable matching. Furthermore, we propose the PF-Encoder to enhance the image-to-BEV transformation with BEV feature priors. Besides, we propose the DMD cross-attention, which decouples cross-attention along multi-scale and multi-sample respectively to achieve efficiency. Our proposed PriorMapNet achieves state-of-the-art performance in the online vectorized HD map construction task on nuScenes and Argoverse2 datasets. The code will be released publicly soon.

[CV-7] Backward-Compatible Aligned Representations via an Orthogonal Transformation Layer ECCV2024

链接: https://arxiv.org/abs/2408.08793
作者: Simone Ricci,Niccolò Biondi,Federico Pernici,Alberto Del Bimbo
关键词-EN: Visual retrieval systems, retrieval systems face, systems face significant, face significant challenges, Visual retrieval
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BEW2024 Workshop at ECCV2024

点击查看摘要

Abstract:Visual retrieval systems face significant challenges when updating models with improved representations due to misalignment between the old and new representations. The costly and resource-intensive backfilling process involves recalculating feature vectors for images in the gallery set whenever a new model is introduced. To address this, prior research has explored backward-compatible training methods that enable direct comparisons between new and old representations without backfilling. Despite these advancements, achieving a balance between backward compatibility and the performance of independently trained models remains an open problem. In this paper, we address it by expanding the representation space with additional dimensions and learning an orthogonal transformation to achieve compatibility with old models and, at the same time, integrate new information. This transformation preserves the original feature space’s geometry, ensuring that our model aligns with previous versions while also learning new data. Our Orthogonal Compatible Aligned (OCA) approach eliminates the need for re-indexing during model updates and ensures that features can be compared directly across different model updates without additional mapping functions. Experimental results on CIFAR-100 and ImageNet-1k demonstrate that our method not only maintains compatibility with previous models but also achieves state-of-the-art accuracy, outperforming several existing methods.
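The key property the method relies on, that an orthogonal map on the expanded space preserves the old feature geometry, can be checked numerically. The matrix below is a random orthogonal stand-in for the learned transformation, and the zero-padding of old features is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

d_old, d_extra = 16, 8
d_new = d_old + d_extra

# A random orthogonal matrix stands in for the learned transformation
# (in OCA it is trained; only its orthogonality is sketched here).
Q, _ = np.linalg.qr(rng.normal(size=(d_new, d_new)))

old_feats = rng.normal(size=(10, d_old))
# Old gallery features, lifted into the expanded space by zero-padding.
lifted = np.pad(old_feats, ((0, 0), (0, d_extra)))
new_feats = lifted @ Q.T  # orthogonal map: geometry is preserved

# Pairwise distances are unchanged, so old and new features remain
# directly comparable without re-indexing the gallery.
d_before = np.linalg.norm(lifted[0] - lifted[1])
d_after = np.linalg.norm(new_feats[0] - new_feats[1])
```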

[CV-8] VF-NeRF: Learning Neural Vector Fields for Indoor Scene Reconstruction

链接: https://arxiv.org/abs/2408.08766
作者: Albert Gassol Puigjaner,Edoardo Mello Rella,Erik Sandström,Ajad Chhatkuli,Luc Van Gool
关键词-EN: shown surprising accuracy, neural radiance fields, neural radiance, shown surprising, surprising accuracy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages

点击查看摘要

Abstract:Implicit surfaces via neural radiance fields (NeRF) have shown surprising accuracy in surface reconstruction. Despite their success in reconstructing richly textured surfaces, existing methods struggle with planar regions with weak textures, which account for the majority of indoor scenes. In this paper, we address indoor dense surface reconstruction by revisiting key aspects of NeRF in order to use the recently proposed Vector Field (VF) as the implicit representation. VF is defined by the unit vector directed to the nearest surface point. It therefore flips direction at the surface and equals the explicit surface normals. Except for this flip, VF remains constant along planar surfaces and provides a strong inductive bias in representing planar surfaces. Concretely, we develop a novel density-VF relationship and a training scheme that allows us to learn VF via volume rendering. By doing this, VF-NeRF can model large planar surfaces and sharp corners accurately. We show that, when depth cues are available, our method further improves and achieves state-of-the-art results in reconstructing indoor scenes and rendering novel views. We extensively evaluate VF-NeRF on indoor datasets and run ablations of its components.
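The VF representation is easy to illustrate on a toy scene whose only surface is the plane z = 0: the field is constant away from the plane (the planar inductive bias) and flips sign across it (matching the surface normal up to sign). This sketch is illustrative only; VF-NeRF learns the field with a network:

```python
import numpy as np

def vf_plane(p):
    """VF of the plane z = 0 at points p, shape (n, 3): the unit vector
    toward the nearest surface point. Points exactly on the surface
    are left as zero vectors in this toy."""
    v = np.zeros_like(p)
    v[:, 2] = -np.sign(p[:, 2])  # above the plane look down, below look up
    return v

pts = np.array([[0.0, 0.0, 1.0],
                [5.0, -2.0, 0.3],
                [1.0, 1.0, -0.7]])
v = vf_plane(pts)
```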

[CV-9] PCP-MAE: Learning to Predict Centers for Point Masked Autoencoders

链接: https://arxiv.org/abs/2408.08753
作者: Xiangdong Zhang,Shaofeng Zhang,Junchi Yan
关键词-EN: point cloud self-supervised, cloud self-supervised learning, point cloud, Point Masked AutoEncoders, masked parts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Masked autoencoder has been widely explored in point cloud self-supervised learning, whereby the point cloud is generally divided into visible and masked parts. These methods typically include an encoder accepting visible patches (normalized) and corresponding patch centers (position) as input, with the decoder accepting the output of the encoder and the centers (position) of the masked parts to reconstruct each point in the masked patches. Then, the pre-trained encoders are used for downstream tasks. In this paper, we show a motivating empirical result that when directly feeding the centers of masked patches to the decoder without information from the encoder, it still reconstructs well. In other words, the centers of patches are important and the reconstruction objective does not necessarily rely on representations of the encoder, thus preventing the encoder from learning semantic representations. Based on this key observation, we propose a simple yet effective method, i.e., learning to Predict Centers for Point Masked AutoEncoders (PCP-MAE) which guides the model to learn to predict the significant centers and use the predicted centers to replace the directly provided centers. Specifically, we propose a Predicting Center Module (PCM) that shares parameters with the original encoder with extra cross-attention to predict centers. Our method is of high pre-training efficiency compared to other alternatives and achieves great improvement over Point-MAE, particularly outperforming it by 5.50%, 6.03%, and 5.17% on three variants of ScanObjectNN. The code will be made publicly available.

[CV-10] Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs GANs and Stable Diffusion

链接: https://arxiv.org/abs/2408.08751
作者: Sanchayan Vivekananthan
关键词-EN: Generative Adversarial Networks, Variational Autoencoders, generative modelling frameworks, Adversarial Networks, major generative modelling
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper examines three major generative modelling frameworks: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Stable Diffusion models. VAEs are effective at learning latent representations but frequently yield blurry results. GANs can generate realistic images but face issues such as mode collapse. Stable Diffusion models, while producing high-quality images with strong semantic coherence, are demanding in terms of computational resources. Additionally, the paper explores how incorporating Grounding DINO and Grounded SAM with Stable Diffusion improves image accuracy by utilising sophisticated segmentation and inpainting techniques. The analysis offers guidance on selecting suitable models for various applications and highlights areas for further research.

[CV-11] Task-Aware Dynamic Transformer for Efficient Arbitrary-Scale Image Super-Resolution ECAI2024

链接: https://arxiv.org/abs/2408.08736
作者: Tianyi Xu,Yiji Zhou,Xiaotao Hu,Kai Zhang,Anran Zhang,Xingye Qiu,Jun Xu
关键词-EN: aims to learn, ASSR, learn a single, single model, arbitrary magnifying scales
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECAI 2024

点击查看摘要

Abstract:Arbitrary-scale super-resolution (ASSR) aims to learn a single model for image super-resolution at arbitrary magnifying scales. Existing ASSR networks typically comprise an off-the-shelf scale-agnostic feature extractor and an arbitrary scale upsampler. These feature extractors often use fixed network architectures to address different ASSR inference tasks, each of which is characterized by an input image and an upsampling scale. However, this overlooks the difficulty variance of super-resolution on different inference scenarios, where simple images or small SR scales could be resolved with less computational effort than difficult images or large SR scales. To tackle this difficulty variability, in this paper, we propose a Task-Aware Dynamic Transformer (TADT) as an input-adaptive feature extractor for efficient image ASSR. Our TADT consists of a multi-scale feature extraction backbone built upon groups of Multi-Scale Transformer Blocks (MSTBs) and a Task-Aware Routing Controller (TARC). The TARC predicts the inference paths within feature extraction backbone, specifically selecting MSTBs based on the input images and SR scales. The prediction of inference path is guided by a new loss function to trade-off the SR accuracy and efficiency. Experiments demonstrate that, when working with three popular arbitrary-scale upsamplers, our TADT achieves state-of-the-art ASSR performance when compared with mainstream feature extractors, but with relatively fewer computational costs. The code will be publicly released.
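The routing idea can be sketched with a linear controller that scores blocks from a pooled image feature and the SR scale. The weights and the top-k budget rule below are assumptions for illustration, not the paper's trained TARC:

```python
import numpy as np

rng = np.random.default_rng(4)

n_blocks, feat_dim = 8, 32
# Random router weights stand in for the learned Task-Aware Routing
# Controller; inputs are a pooled image feature plus the SR scale.
W = rng.normal(size=(feat_dim + 1, n_blocks))

def route(image_feat, scale, budget=0.5):
    """Select which transformer blocks to run for this (image, scale) task."""
    x = np.concatenate([image_feat, [scale]])
    logits = x @ W
    k = max(1, int(budget * n_blocks))
    path = np.zeros(n_blocks, dtype=bool)
    path[np.argsort(-logits)[:k]] = True  # keep the top-k scoring blocks
    return path

easy = route(rng.normal(size=feat_dim), scale=2.0, budget=0.25)
hard = route(rng.normal(size=feat_dim), scale=8.0, budget=0.75)
```

Easy inputs or small scales get a shorter inference path, which is the accuracy/efficiency trade-off the routing loss is said to learn.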

[CV-12] Correspondence-Guided SfM-Free 3D Gaussian Splatting for NVS

链接: https://arxiv.org/abs/2408.08723
作者: Wei Sun,Xiaosong Zhang,Fang Wan,Yanzhao Zhou,Yuan Li,Qixiang Ye,Jianbin Jiao
关键词-EN: View Synthesis, variable operating conditions, promoting rapid response, rapid response capabilities, pre-processed camera poses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2312.07504 by other authors

点击查看摘要

Abstract:Novel View Synthesis (NVS) without Structure-from-Motion (SfM) pre-processed camera poses (referred to as SfM-free methods) is crucial for promoting rapid response capabilities and enhancing robustness against variable operating conditions. Recent SfM-free methods have integrated pose optimization, designing end-to-end frameworks for joint camera pose estimation and NVS. However, most existing works rely on per-pixel image loss functions, such as L2 loss. In SfM-free methods, inaccurate initial poses lead to misalignment issues, which, under the constraints of per-pixel image loss functions, result in excessive gradients, causing unstable optimization and poor convergence for NVS. In this study, we propose a correspondence-guided SfM-free 3D Gaussian splatting for NVS. We use correspondences between the target and the rendered result to achieve better pixel alignment, facilitating the optimization of relative poses between frames. We then apply the learned poses to optimize the entire scene. Each 2D screen-space pixel is associated with its corresponding 3D Gaussians through approximated surface rendering to facilitate gradient back propagation. Experimental results underline the superior performance and time efficiency of the proposed approach compared to the state-of-the-art baselines.

[CV-13] Decoupling Feature Representations of Ego and Other Modalities for Incomplete Multi-modal Brain Tumor Segmentation

链接: https://arxiv.org/abs/2408.08708
作者: Kaixiang Yang,Wenqi Shan,Xudong Li,Xuan Wang,Xikai Yang,Xi Wang,Pheng-Ann Heng,Qiang Li,Zhiwei Wang
关键词-EN: magnetic resonance imaging, significantly degrade performance, segmentation typically involves, modalities significantly degrade, resonance imaging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Multi-modal brain tumor segmentation typically involves four magnetic resonance imaging (MRI) modalities, while incomplete modalities significantly degrade performance. Existing solutions employ explicit or implicit modality adaptation, aligning features across modalities or learning a fused feature robust to modality incompleteness. They share a common goal of encouraging each modality to express both itself and the others. However, the two expression abilities are entangled as a whole in a seamless feature space, resulting in prohibitive learning burdens. In this paper, we propose DeMoSeg to enhance the modality adaptation by Decoupling the task of representing the ego and other Modalities for robust incomplete multi-modal Segmentation. The decoupling is super lightweight by simply using two convolutions to map each modality onto four feature sub-spaces. The first sub-space expresses itself (Self-feature), while the remaining sub-spaces substitute for other modalities (Mutual-features). The Self- and Mutual-features interactively guide each other through a carefully-designed Channel-wised Sparse Self-Attention (CSSA). After that, a Radiologist-mimic Cross-modality expression Relationships (RCR) is introduced to have available modalities provide Self-feature and also ‘lend’ their Mutual-features to compensate for the absent ones by exploiting the clinical prior knowledge. The benchmark results on BraTS2020, BraTS2018 and BraTS2015 verify the DeMoSeg’s superiority thanks to the alleviated modality adaptation difficulty. Concretely, for BraTS2020, DeMoSeg increases Dice by at least 0.92%, 2.95% and 4.95% on whole tumor, tumor core and enhanced tumor regions, respectively, compared to other state-of-the-arts. Codes are at this https URL
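The decoupling step, a lightweight per-modality map producing one Self and three Mutual sub-spaces, can be sketched as follows. The paper uses two convolutions per modality; for brevity this toy collapses them into a single linear map, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

n_mod, c_in, c_sub = 4, 16, 8  # four MRI modalities (toy channel sizes)
# One lightweight linear map per modality (standing in for the paper's
# 1x1 convolutions) producing four sub-spaces: Self + three Mutual.
maps = 0.1 * rng.normal(size=(n_mod, c_in, 4 * c_sub))

def decouple(feats):
    """feats: (n_mod, c_in) per-modality features -> self/mutual splits."""
    out = np.einsum('mc,mcd->md', feats, maps)
    sub = out.reshape(n_mod, 4, c_sub)
    self_feat = sub[:, 0]   # expresses the modality itself
    mutual = sub[:, 1:]     # substitutes for the three other modalities
    return self_feat, mutual

feats = rng.normal(size=(n_mod, c_in))
self_feat, mutual = decouple(feats)
# When a modality is missing, the available ones 'lend' their
# Mutual-features to stand in for its absent Self-feature.
```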

[CV-14] Beyond the Hype: A dispassionate look at vision-language models in medical scenario

链接: https://arxiv.org/abs/2408.08704
作者: Yang Nan,Huichi Zhou,Xiaodan Xing,Guang Yang
关键词-EN: Large Vision-Language Models, garnering significant attention, Recent advancements, Vision-Language Models, demonstrated remarkable capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments over-concentrate on evaluating VLMs based on simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristics of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models’ ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models’ spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models’ capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models’ capabilities against unharmonised and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be available after the acceptance of this paper.

[CV-15] TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning

链接: https://arxiv.org/abs/2408.08703
作者: Miaoge Li,Jingcai Guo,Richard Yi Da Xu,Dongsheng Wang,Xiaofeng Cao,Song Guo
关键词-EN: Compositional Zero-Shot Learning, aims to recognize, leveraging the shared, Compositional Zero-Shot, Trisets Consistency Alignment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel state-object compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, effectively calibrating the bias between semantically similar multimodal representations, as well as generalizing pre-trained knowledge to novel compositional contexts, remains an enduring challenge. In this paper, our interest is to revisit the conditional transport (CT) theory and its homology to the visual-semantics interaction in CZSL and, further, to propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that well addresses these issues. Concretely, we utilize three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs to minimize their semantic discrepancies. To further ensure the consistency transfer within these sets, we implement a cycle-consistency constraint that refines the learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to effectively filter out unfeasible pairs, thereby speeding up the inference as well as increasing the accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method.
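A simplified stand-in for a pairwise transport cost between two feature sets is the entropic (Sinkhorn) optimal-transport cost under a cosine ground cost. The paper's conditional transport formulation differs in detail, so this sketch is illustrative only, with random features standing in for patch and primitive embeddings:

```python
import numpy as np

rng = np.random.default_rng(8)

def sinkhorn_cost(X, Y, reg=0.1, iters=200):
    """Entropic transport cost between two L2-normalised feature sets
    (a simplified stand-in for the paper's conditional transport cost)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - X @ Y.T                 # cosine cost matrix
    K = np.exp(-C / reg)
    a = np.ones(len(X)) / len(X)      # uniform source marginal
    b = np.ones(len(Y)) / len(Y)      # uniform target marginal
    u = np.ones_like(a)
    for _ in range(iters):            # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return (P * C).sum()

patches = rng.normal(size=(6, 16))
prims = rng.normal(size=(4, 16))
cost_self = sinkhorn_cost(patches, patches)  # aligned sets: near-zero cost
cost_cross = sinkhorn_cost(patches, prims)   # misaligned sets: larger cost
```

Minimizing such pairwise costs over the patch, primitive, and composition trisets is the alignment objective the abstract describes.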

[CV-16] HyCoT: Hyperspectral Compression Transformer with an Efficient Training Strategy

链接: https://arxiv.org/abs/2408.08700
作者: Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir
关键词-EN: attracted significant interest, recently attracted significant, learning-based hyperspectral image, significant interest, development of learning-based
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The development of learning-based hyperspectral image (HSI) compression models has recently attracted significant interest. Existing models predominantly utilize convolutional filters, which capture only local dependencies. Furthermore, they often incur high training costs and exhibit substantial computational complexity. To address these limitations, in this paper we propose Hyperspectral Compression Transformer (HyCoT) that is a transformer-based autoencoder for pixelwise HSI compression. Additionally, we introduce an efficient training strategy to accelerate the training process. Experimental results on the HySpecNet-11k dataset demonstrate that HyCoT surpasses the state-of-the-art across various compression ratios by over 1 dB with significantly reduced computational requirements. Our code and pre-trained weights are publicly available at this https URL.
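The pixelwise compression setup can be sketched with a linear autoencoder standing in for HyCoT's transformer: each pixel's spectrum is mapped to a short code independently of its neighbours. The projection below is random rather than learned, and the toy pixels are built so the linear round trip is exact; real spectra would incur distortion:

```python
import numpy as np

rng = np.random.default_rng(9)

bands, latent = 200, 16   # ~12.5x compression per pixel
# A random linear projection pair stands in for HyCoT's learned
# transformer encoder/decoder.
enc = rng.normal(size=(bands, latent)) / np.sqrt(bands)
dec = np.linalg.pinv(enc)

# Toy pixels constructed to lie in the code subspace, so the linear
# round trip reconstructs them exactly.
pixels = rng.normal(size=(1000, latent)) @ enc.T

codes = pixels @ enc      # compress: 200 bands -> 16 values per pixel
recon = codes @ dec       # decompress
err = np.abs(recon - pixels).max()
```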

[CV-17] LLM-PCGC: Large Language Model-based Point Cloud Geometry Compression

链接: https://arxiv.org/abs/2408.08682
作者: Yuqi Ye,Wei Gao
关键词-EN: point cloud, point cloud geometry, point cloud compression, robust context model, context model consistent
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The key to effective point cloud compression is to obtain a robust context model consistent with complex 3D data structures. Recently, the advancement of large language models (LLMs) has highlighted their capabilities not only as powerful generators for in-context learning and generation but also as effective compressors. These dual attributes of LLMs make them particularly well-suited to meet the demands of data compression. Therefore, this paper explores the potential of using LLM for compression tasks, focusing on lossless point cloud geometry compression (PCGC) experiments. However, applying LLM directly to PCGC tasks presents some significant challenges, i.e., LLM does not understand the structure of the point cloud well, and it is a difficult task to fill the gap between text and point cloud through text description, especially for large complicated and small shapeless point clouds. To address these problems, we introduce a novel architecture, namely the Large Language Model-based Point Cloud Geometry Compression (LLM-PCGC) method, using LLM to compress point cloud geometry information without any text description or aligning operation. By utilizing different adaptation techniques for cross-modality representation alignment and semantic consistency, including clustering, K-tree, token mapping invariance, and Low Rank Adaptation (LoRA), the proposed method can translate LLM to a compressor/generator for point cloud. To the best of our knowledge, this is the first structure to employ LLM as a compressor for point cloud data. Experiments demonstrate that the LLM-PCGC outperforms the other existing methods significantly, by achieving -40.213% bit rate reduction compared to the reference software of MPEG Geometry-based Point Cloud Compression (G-PCC) standard, and by achieving -2.267% bit rate reduction compared to the state-of-the-art learning-based method.

[CV-18] Towards Physical World Backdoor Attacks against Skeleton Action Recognition ECCV2024

链接: https://arxiv.org/abs/2408.08671
作者: Qichen Zheng,Yi Yu,Siyuan Yang,Jun Liu,Kwok-Yan Lam,Alex Kot
关键词-EN: attracted significant interest, Skeleton Action Recognition, human skeletal structure, Action Recognition, skeletal structure
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Skeleton Action Recognition (SAR) has attracted significant interest for its efficient representation of the human skeletal structure. Despite its advancements, recent studies have raised security concerns in SAR models, particularly their vulnerability to adversarial attacks. However, such strategies are limited to digital scenarios and ineffective in physical attacks, limiting their real-world applicability. To investigate the vulnerabilities of SAR in the physical world, we introduce the Physical Skeleton Backdoor Attacks (PSBA), the first exploration of physical backdoor attacks against SAR. Considering the practicalities of physical execution, we introduce a novel trigger implantation method that integrates infrequent and imperceivable actions as triggers into the original skeleton data. By incorporating a minimal amount of this manipulated data into the training set, PSBA enables the system to misclassify any skeleton sequence into the target class when the trigger action is present. We examine the resilience of PSBA in both poisoned and clean-label scenarios, demonstrating its efficacy across a range of datasets, poisoning ratios, and model architectures. Additionally, we introduce a trigger-enhancing strategy to strengthen attack performance in the clean-label setting. The robustness of PSBA is tested against three distinct backdoor defenses, and the stealthiness of PSBA is evaluated using two quantitative metrics. Furthermore, by employing a Kinect V2 camera, we compile a dataset of human actions from the real world to mimic physical attack situations, with our findings confirming the effectiveness of our proposed attacks. Our project website can be found at this https URL.
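
The trigger-implantation idea (splicing an infrequent action into a small fraction of training sequences and relabeling them to the target class) can be sketched as follows. This is an illustrative toy, not the paper's implementation: the helper name, the poisoned-label setting, and splicing the trigger into the first frames are all assumptions.

```python
import random

def poison_skeleton_dataset(dataset, trigger_frames, target_label,
                            poison_ratio=0.05, seed=0):
    """Return a copy of (sequence, label) pairs where a `poison_ratio`
    fraction has the trigger action spliced into its first frames and
    its label flipped to `target_label` (poisoned-label setting)."""
    rng = random.Random(seed)
    n_poison = max(1, int(len(dataset) * poison_ratio))
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (seq, label) in enumerate(dataset):
        if i in chosen:
            # Overwrite the leading frames with the trigger action.
            seq = trigger_frames + seq[len(trigger_frames):]
            label = target_label
        poisoned.append((seq, label))
    return poisoned
```

At inference time, any sequence containing the trigger action would then be classified as `target_label` by the backdoored model.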

[CV-19] Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

链接: https://arxiv.org/abs/2408.08670
作者: Alessio Devoto,Federico Alvetreti,Jary Pomponi,Paolo Di Lorenzo,Pasquale Minervini,Simone Scardapane
关键词-EN: Vision Transformers, foundation models based, foundation models, Recently, fine-tuning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this end, in this paper we introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers) to speed up the fine-tuning process while reducing computational cost, memory load, and training time. Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch. Therefore, at each fine-tuning step, we adaptively estimate the importance of all layers and we assign what we call "compute budgets" accordingly. Layers that were allocated lower budgets are either trained with a reduced number of input tokens or kept frozen. Freezing a layer reduces the computational cost and memory usage by preventing updates to its weights, while discarding tokens removes redundant data, speeding up processing and reducing memory requirements. We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources across layers, resulting in substantial reductions in training time (up to 1.5x), FLOPs (up to 2x), and memory load (up to 2x) compared to traditional full fine-tuning approaches. Additionally, it can be successfully combined with other parameter-efficient fine-tuning methods, such as LoRA.
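
The per-mini-batch "compute budget" assignment can be illustrated with a toy allocator: rank layers by an importance score, freeze the least important, feed the next tier fewer tokens, and train the rest fully. The function name, the three-way budget split, and the fixed counts are assumptions for illustration, not the authors' scheduling rule.

```python
def allocate_budgets(importances, n_frozen, n_reduced):
    """Given per-layer importance scores for the current mini-batch,
    mark the least important `n_frozen` layers as 'freeze', the next
    `n_reduced` as 'reduce' (fewer input tokens), and the rest 'full'."""
    order = sorted(range(len(importances)), key=lambda i: importances[i])
    budgets = ['full'] * len(importances)
    for i in order[:n_frozen]:
        budgets[i] = 'freeze'
    for i in order[n_frozen:n_frozen + n_reduced]:
        budgets[i] = 'reduce'
    return budgets
```

Re-running this at every fine-tuning step is what makes the schedule adaptive: a layer frozen on one mini-batch may be fully trained on the next.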

[CV-20] QMambaBSR: Burst Image Super-Resolution with Query State Space Model

链接: https://arxiv.org/abs/2408.08665
作者: Xin Di,Long Peng,Peizhe Xia,Wenbo Li,Renjing Pei,Yang Cao,Yang Wang,Zheng-Jun Zha
关键词-EN: reconstruct high-resolution images, burst low-resolution frames, Burst super-resolution aims, multiple burst low-resolution, aims to reconstruct
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Burst super-resolution aims to reconstruct high-resolution images with higher quality and richer details by fusing the sub-pixel information from multiple burst low-resolution frames. In burst SR, the key challenge lies in extracting sub-pixel details complementary to the base frame's content while simultaneously suppressing high-frequency noise disturbance. Existing methods attempt to extract sub-pixels by modeling inter-frame relationships frame by frame while overlooking the mutual correlations among multiple frames and neglecting the intra-frame interactions, leading to inaccurate and noisy sub-pixels for base frame super-resolution. Further, existing methods mainly employ static upsampling with fixed parameters to improve spatial resolution for all scenes, failing to perceive the sub-pixel distribution differences across multiple frames and to balance the fusion weights of different frames, resulting in over-smoothed details and artifacts. To address these limitations, we introduce a novel Query Mamba Burst Super-Resolution (QMambaBSR) network, which incorporates a Query State Space Model (QSSM) and an Adaptive Up-sampling module (AdaUp). Specifically, based on the observation that sub-pixels have a consistent spatial distribution while random noise is inconsistently distributed, a novel QSSM is proposed to efficiently extract sub-pixels through inter-frame querying and intra-frame scanning while mitigating noise interference in a single step. Moreover, AdaUp is designed to dynamically adjust the upsampling kernel based on the spatial distribution of multi-frame sub-pixel information in the different burst scenes, thereby facilitating the reconstruction of the spatial arrangement of high-resolution details. Extensive experiments on four popular synthetic and real-world benchmarks demonstrate that our method achieves a new state-of-the-art performance.

[CV-21] Extracting polygonal footprints in off-nadir images with Segment Anything Model

链接: https://arxiv.org/abs/2408.08645
作者: Kai Li,Jingbo Chen,Yupeng Deng,Yu Meng,Diyou Liu,Junxian Ma,Chenhao Wang
关键词-EN: Building Footprint Extraction, off-nadir aerial images, Footprint Extraction, off-nadir aerial, aerial images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Building Footprint Extraction (BFE) in off-nadir aerial images often relies on roof segmentation and roof-to-footprint offset prediction, then dragging the roof to the footprint via the offset. However, the results from this multi-stage inference are not applicable in data production, because of the low quality of masks given by prediction. To solve this problem, we propose OBMv2 in this paper, which supports both end-to-end and promptable polygonal footprint prediction. Different from OBM, OBMv2 uses a newly proposed Self Offset Attention (SOFA) mechanism to bridge the performance gap between bungalows and skyscrapers, realizing real end-to-end footprint polygon prediction without postprocessing such as Non-Maximum Suppression (NMS) and Distance NMS (DNMS). To fully use the information contained in roof masks, building masks and offsets, we propose a Multi-level Information SyStem (MISS) for footprint prediction, with which OBMv2 can predict footprints even with insufficient predictions. Additionally, to squeeze more information from the same model, we were inspired by Retrieval-Augmented Generation (RAG) in Natural Language Processing and propose the "RAG in BFE" problem. To verify the effectiveness of the proposed method, experiments were conducted on the open datasets BONAI and OmniCity-view3. A generalization test was also conducted on the Huizhou test set. The code will be available at this https URL.
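
The multi-stage baseline that OBMv2 replaces is easy to picture: segment the roof, predict a roof-to-footprint offset, then drag the roof polygon by that offset. A minimal sketch of the dragging step (hypothetical function; real pipelines work on masks and per-building offsets):

```python
def drag_roof_to_footprint(roof_polygon, offset):
    """Translate a roof polygon by the predicted roof-to-footprint
    offset to obtain the building footprint polygon."""
    dx, dy = offset
    return [(x + dx, y + dy) for (x, y) in roof_polygon]
```

The paper's point is that when the roof mask itself is low quality, this dragged result inherits every mask error, which motivates predicting the footprint polygon end-to-end instead.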

[CV-22] Historical Printed Ornaments: Dataset and Tasks

链接: https://arxiv.org/abs/2408.08633
作者: Sayan Kumar Chaki,Zeynep Sonat Baltaci,Elliot Vincent,Remi Emonet,Fabienne Vial-Bonacci,Christelle Bahier-Porte,Mathieu Aubry,Thierry Fournel
关键词-EN: unsupervised computer vision, modern unsupervised computer, historical printed ornaments, computer vision, paper aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper aims to develop the study of historical printed ornaments with modern unsupervised computer vision. We highlight three complex tasks that are of critical interest to book historians: clustering, element discovery, and unsupervised change localization. For each of these tasks, we introduce an evaluation benchmark, and we adapt and evaluate state-of-the-art models. Our Rey's Ornaments dataset is designed to be a representative example of a set of ornaments historians would be interested in. It focuses on an XVIIIth century bookseller, Marc-Michel Rey, providing a consistent set of ornaments with a wide diversity and representative challenges. Our results highlight the limitations of state-of-the-art models when faced with real data and show that simple baselines such as k-means or congealing can outperform more sophisticated approaches on such data. Our dataset and code can be found at this https URL.

[CV-23] A Survey on Benchmarks of Multimodal Large Language Models

链接: https://arxiv.org/abs/2408.08632
作者: Jian Li,Weiheng Lu
关键词-EN: Multimodal Large Language, Large Language Models, visual question answering, Multimodal Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 180 benchmarks and evaluations for MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of the current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs. For more details, please visit our GitHub repository: this https URL.

[CV-24] SketchRef: A Benchmark Dataset and Evaluation Metrics for Automated Sketch Synthesis

链接: https://arxiv.org/abs/2408.08623
作者: Xingyue Lin,Xingjian Hu,Shuai Peng,Jianhua Zhu,Liangcai Gao
关键词-EN: powerful artistic technique, capture essential visual, essential visual information, increasingly gaining attention, image synthesis field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sketch, a powerful artistic technique to capture essential visual information about real-world objects, is increasingly gaining attention in the image synthesis field. However, evaluating the quality of synthesized sketches presents unique unsolved challenges. Current evaluation methods for sketch synthesis are inadequate due to the lack of a unified benchmark dataset, over-reliance on classification accuracy for recognizability, and unfair evaluation of sketches with different levels of simplification. To address these issues, we introduce SketchRef, a benchmark dataset comprising 4 categories of reference photos (animals, human faces, human bodies, and common objects) alongside novel evaluation metrics. Considering that classification accuracy is insufficient to measure the structural consistency between a sketch and its reference photo, we propose the mean Object Keypoint Similarity (mOKS) metric, utilizing pose estimation to assess structure-level recognizability. To ensure fair evaluation of sketches with different simplification levels, we propose a recognizability calculation method constrained by simplicity. We also collect 8K responses from art enthusiasts, validating the effectiveness of our proposed evaluation methods. We hope this work can provide a comprehensive evaluation of sketch synthesis algorithms, thereby aligning their performance more closely with human understanding.
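
The mOKS metric builds on the COCO-style Object Keypoint Similarity, which scores each keypoint with a Gaussian falloff scaled by object size and a per-keypoint constant, then averages. A simplified per-object sketch (all keypoints treated as visible; parameter names are assumptions, not the paper's notation):

```python
import math

def object_keypoint_similarity(pred, gt, scale, kappas):
    """COCO-style OKS between predicted and ground-truth keypoints.
    pred/gt: lists of (x, y); scale: object scale (e.g. sqrt of area);
    kappas: per-keypoint falloff constants."""
    sims = []
    for (px, py), (gx, gy), k in zip(pred, gt, kappas):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        sims.append(math.exp(-d2 / (2 * (scale ** 2) * (k ** 2))))
    return sum(sims) / len(sims)
```

mOKS would then average this score over all evaluated sketch/photo pairs, using pose-estimated keypoints on the sketch against those on the reference photo.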

[CV-25] Generative Dataset Distillation Based on Diffusion Model ECCV2024

链接: https://arxiv.org/abs/2408.08610
作者: Duo Su,Junjie Hou,Guang Li,Ren Togo,Rui Song,Takahiro Ogawa,Miki Haseyama
关键词-EN: Dataset Distillation Challenge, generative dataset distillation, dataset distillation method, Dataset Distillation, generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The Third Place Winner in Generative Track of the ECCV 2024 DD Challenge

点击查看摘要

Abstract:This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Since the diffusion model has become the mainstay of generative models because of its high-quality generative effects, we focus on distillation methods based on the diffusion model. Considering that the track can only generate a fixed number of images in 10 minutes using a generative model for the CIFAR-100 and Tiny-ImageNet datasets, we need to use a generative model that can generate images at high speed. In this study, we propose a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model, which can generate images at high speed and quality. Compared to other diffusion models that can only achieve an images-per-class (IPC) value of 1, our method achieves an IPC of 10 for Tiny-ImageNet and an IPC of 20 for CIFAR-100, respectively. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use the class information as text prompts and apply data augmentation after generation with the SDXL-Turbo model. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Codes are available at this https URL.

[CV-26] Bi-Directional Deep Contextual Video Compression

链接: https://arxiv.org/abs/2408.08604
作者: Xihua Sheng,Li Li,Dong Liu,Shiqi Wang
关键词-EN: made remarkable process, concentrated on P-frame, P-frame coding, deep B-frame coding, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep video compression has made remarkable progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, its compression performance is still far behind that of traditional bi-directional video codecs. In this paper, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme mainly has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model, to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to an effective bit allocation across large groups of pictures (GOP). Experimental results show that our DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration.
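
Hierarchical B-frame coding orders a GOP so that the two boundary frames are coded first and each B-frame is coded only after both of its references, level by level. A sketch of that coding order (illustrative of the standard hierarchical-B pattern, not the paper's exact scheduling or quality assignment):

```python
def bframe_coding_order(gop_size):
    """Hierarchical coding order for one GOP: the key frames at the two
    ends come first, then middle frames level by level, so every
    B-frame is coded after both of its bi-directional references."""
    order = [0, gop_size]
    def recurse(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append(mid)   # B-frame between already-coded lo and hi
        recurse(lo, mid)
        recurse(mid, hi)
    recurse(0, gop_size)
    return order
```

For a GOP of 4 this yields 0, 4, 2, 1, 3: frame 2 predicts from 0 and 4, frame 1 from 0 and 2, and so on, which is the structure the hierarchical-quality training strategy exploits for bit allocation.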

[CV-27] Learning A Low-Level Vision Generalist via Visual Task Prompt

链接: https://arxiv.org/abs/2408.08601
作者: Xiangyu Chen,Yihao Liu,Yuandong Pu,Wenlong Zhang,Jiantao Zhou,Yu Qiao,Chao Dong
关键词-EN: holds significant research, Building a unified, tasks holds significant, holds significant, significant research
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ACMMM24

点击查看摘要

Abstract:Building a unified model for general low-level vision tasks holds significant research and practical value. Current methods encounter several critical issues. Multi-task restoration approaches can address multiple degradation-to-clean restoration tasks, while their applicability to tasks with different target domains (e.g., image stylization) is limited. Methods like PromptGIP can handle multiple input-target domains but rely on the Masked Autoencoder (MAE) paradigm. Consequently, they are tied to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of backbone network suitable for general tasks. Besides, a new prompt cross-attention is introduced to facilitate interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, significantly outperforming existing methods both quantitatively and qualitatively. Codes are available at this https URL.

[CV-28] MM-UNet: A Mixed MLP Architecture for Improved Ophthalmic Image Segmentation

链接: https://arxiv.org/abs/2408.08600
作者: Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu
关键词-EN: ocular disease diagnosis, disease diagnosis, critical foundation, foundation for ocular, ocular disease
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: OMIA2024

点击查看摘要

Abstract:Ophthalmic image segmentation serves as a critical foundation for ocular disease diagnosis. Although fully convolutional neural networks (CNNs) are commonly employed for segmentation, they are constrained by inductive biases and face challenges in establishing long-range dependencies. Transformer-based models address these limitations but introduce substantial computational overhead. Recently, a simple yet efficient Multilayer Perceptron (MLP) architecture was proposed for image classification, achieving competitive performance relative to advanced transformers. However, its effectiveness for ophthalmic image segmentation remains unexplored. In this paper, we introduce MM-UNet, an efficient Mixed MLP model tailored for ophthalmic image segmentation. Within MM-UNet, we propose a multi-scale MLP (MMLP) module that facilitates the interaction of features at various depths through a grouping strategy, enabling simultaneous capture of global and local information. We conducted extensive experiments on both a private anterior segment optical coherence tomography (AS-OCT) image dataset and a public fundus image dataset. The results demonstrated the superiority of our MM-UNet model in comparison to state-of-the-art deep segmentation networks.

[CV-29] Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation CVPR2024

链接: https://arxiv.org/abs/2408.08591
作者: Tri Ton,Ji Woo Hong,SooHwan Eom,Jun Yeop Shim,Junyeong Kim,Chang D. Yoo
关键词-EN: transcends traditional closed-vocabulary, traditional closed-vocabulary methods, segmentation transcends traditional, real-world scenarios, transcends traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: OpenSUN 3D: 2nd Workshop on Open-Vocabulary 3D Scene Understanding (CVPR 2024)

点击查看摘要

Abstract:Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods by enabling the identification of both previously seen and unseen objects in real-world scenarios. It leverages a dual-modality approach, utilizing both 3D point clouds and 2D multi-view images to generate class-agnostic object mask proposals. Previous efforts predominantly focused on enhancing 3D mask proposal models; consequently, the information that could come from 2D association to 3D was not fully exploited. This bias towards 3D data, while effective for familiar indoor objects, limits the system’s adaptability to new and varied object types, where 2D models offer greater utility. Addressing this gap, we introduce Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities. Our framework comprises three components: 3D pathway, 2D pathway, and Dual-Path Integration. 3D pathway generates spatially accurate class-agnostic mask proposals of common indoor objects from 3D point cloud data using a pre-trained 3D model, while 2D pathway utilizes pre-trained open-vocabulary instance segmentation model to identify a diverse array of object proposals from multi-view RGB-D images. In Dual-Path Integration, our Conditional Integration process, which operates in two stages, filters and merges the proposals from both pathways adaptively. This process harmonizes output proposals to enhance segmentation capabilities. Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data, as evidenced by comprehensive evaluations on the ScanNet200 and qualitative results on ARKitScenes datasets.
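
The Conditional Integration stage can be approximated as keeping the 3D-pathway proposals and admitting a 2D-pathway proposal only when it does not substantially overlap anything already kept, i.e. when it covers an object the 3D pathway missed. A toy sketch with masks represented as sets of 3D point ids (the threshold, representation, and single-pass filtering are assumptions, not the paper's two-stage process):

```python
def merge_proposals(masks_3d, masks_2d, iou_thresh=0.5):
    """Keep all 3D-pathway masks; add a 2D-pathway mask only if its
    IoU with every kept mask stays below `iou_thresh`."""
    def iou(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0
    merged = list(masks_3d)
    for m in masks_2d:
        if all(iou(m, kept) < iou_thresh for kept in merged):
            merged.append(m)  # novel object found only by the 2D pathway
    return merged
```

In the real framework the 2D masks are first lifted from multi-view images into 3D before this comparison can happen.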

[CV-30] S-RAF: A Simulation-Based Robustness Assessment Framework for Responsible Autonomous Driving

链接: https://arxiv.org/abs/2408.08584
作者: Daniel Omeiza,Pratik Somaiya,Jo-Ann Pattinson,Carolyn Ten-Holter,Jack Stilgoe,Marina Jirotka,Lars Kunze
关键词-EN: technology advances, artificial intelligence, Robustness Assessment Framework, AI-driven systems, Simulation-Based Robustness Assessment
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As artificial intelligence (AI) technology advances, ensuring the robustness and safety of AI-driven systems has become paramount. However, varying perceptions of robustness among AI developers create misaligned evaluation metrics, complicating the assessment and certification of safety-critical and complex AI systems such as autonomous driving (AD) agents. To address this challenge, we introduce Simulation-Based Robustness Assessment Framework (S-RAF) for autonomous driving. S-RAF leverages the CARLA Driving simulator to rigorously assess AD agents across diverse conditions, including faulty sensors, environmental changes, and complex traffic situations. By quantifying robustness and its relationship with other safety-critical factors, such as carbon emissions, S-RAF aids developers and stakeholders in building safe and responsible driving agents, and streamlining safety certification processes. Furthermore, S-RAF offers significant advantages, such as reduced testing costs, and the ability to explore edge cases that may be unsafe to test in the real world. The code for this framework is available here: this https URL

[CV-31] TAMER: Tree-Aware Transformer for Handwritten Mathematical Expression Recognition

链接: https://arxiv.org/abs/2408.08578
作者: Jianhua Zhu,Wenqi Zhao,Yu Li,Xingjian Hu,Liangcai Gao
关键词-EN: Mathematical Expression Recognition, Handwritten Mathematical Expression, Expression Recognition, Mathematical Expression, office automation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Handwritten Mathematical Expression Recognition (HMER) has extensive applications in automated grading and office automation. However, existing sequence-based decoding methods, which directly predict LaTeX sequences, struggle to understand and model the inherent tree structure of LaTeX and often fail to ensure syntactic correctness in the decoded results. To address these challenges, we propose a novel model named TAMER (Tree-Aware Transformer) for handwritten mathematical expression recognition. TAMER introduces an innovative Tree-aware Module while maintaining the flexibility and efficient training of Transformer. TAMER combines the advantages of both sequence decoding and tree decoding models by jointly optimizing sequence prediction and tree structure prediction tasks, which enhances the model's understanding and generalization of complex mathematical expression structures. During inference, TAMER employs a Tree Structure Prediction Scoring Mechanism to improve the structural validity of the generated LaTeX sequences. Experimental results on CROHME datasets demonstrate that TAMER outperforms traditional sequence decoding and tree decoding models, especially in handling complex mathematical structures, achieving state-of-the-art (SOTA) performance.
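
Why structural validity matters is easy to see: a decoded LaTeX string with unbalanced braces cannot compile. A crude brace-balance proxy for a structural-validity score (illustrative only; the paper's Tree Structure Prediction Scoring Mechanism scores learned tree predictions, not brace counts):

```python
def structure_validity(tokens):
    """Score a decoded LaTeX token sequence by brace balance: 1.0 if
    every '{' is matched, otherwise the fraction of tokens consumed
    before (or left unclosed after) the first structural error."""
    depth = 0
    for i, t in enumerate(tokens):
        if t == '{':
            depth += 1
        elif t == '}':
            depth -= 1
            if depth < 0:
                return i / len(tokens)   # stray closing brace
    if depth == 0:
        return 1.0
    return (len(tokens) - depth) / len(tokens)  # unclosed opening braces
```

A tree-aware decoder avoids such failures by construction, since every subtree it emits closes its own group.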

[CV-32] Tuning a SAM-Based Model with Multi-Cognitive Visual Adapter to Remote Sensing Instance Segmentation

链接: https://arxiv.org/abs/2408.08576
作者: Linghao Zheng,Xinyang Pu,Feng Xu
关键词-EN: natural scene image, scene image segmentation, demonstrates exceptional generalization, remote sensing images, foundational model designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM), a foundational model designed for promptable segmentation tasks, demonstrates exceptional generalization capabilities, making it highly promising for natural scene image segmentation. However, SAM's lack of pretraining on massive remote sensing images and its interactive structure limit its automatic mask prediction capabilities. In this paper, a Multi-Cognitive SAM-Based Instance Segmentation Model (MC-SAM SEG) is introduced to employ SAM in the remote sensing domain. A SAM-Mona encoder utilizing the Multi-cognitive Visual Adapter (Mona) is constructed to facilitate SAM's transfer learning in remote sensing applications. The proposed method, named MC-SAM SEG, extracts high-quality features by fine-tuning the SAM-Mona encoder along with a feature aggregator. Subsequently, a pixel decoder and a transformer decoder are designed for prompt-free mask generation and instance classification. Comprehensive experiments are conducted on the HRSID and WHU datasets for instance segmentation tasks on Synthetic Aperture Radar (SAR) images and optical remote sensing images respectively. The evaluation results indicate that the proposed method surpasses other deep learning algorithms, verifying its effectiveness and generalization.

[CV-33] Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

链接: https://arxiv.org/abs/2408.08575
作者: Jinming Liu,Yuntao Wei,Junyan Lin,Shengyang Zhao,Heming Sun,Zhibo Chen,Wenjun Zeng,Xin Jin
关键词-EN: Large Multimodal Models, Multimodal Models, paradigm to achieve, Large Multimodal, cleverly leveraging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a new image compression paradigm to achieve intelligent "coding for machine" by cleverly leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantics predictors for understanding the real world. Different from traditional image compression typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to better comply with different downstream intelligent analysis tasks. To this end, we employ LMMs to tell the codec what to compress: 1) we first utilize the powerful semantic understanding capability of LMMs w.r.t. object grounding, identification, and importance ranking via prompts, to disentangle image content before compression, 2) and then, based on these semantic priors, we accordingly encode and transmit objects of the image in order with a structured bitstream. In this way, diverse vision benchmarks including image classification, object detection, instance segmentation, etc., can be well supported with such a semantically structured bitstream. We dub our method "SDComp", for "Semantically Disentangled Compression", and compare it with state-of-the-art codecs on a wide variety of different vision tasks. The SDComp codec leads to more flexible reconstruction results, promised decoded visual quality, and a more generic/satisfactory intelligent task-supporting ability.
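
Encoding objects "in order" by LMM-assigned importance can be sketched as sorting per-object substreams before concatenation, so the most task-relevant content sits at the front of the bitstream. The dict fields and byte layout below are invented for illustration; the paper's bitstream syntax is not specified here.

```python
def structure_bitstream(objects):
    """Order image objects by the LMM-assigned importance score and
    concatenate their (hypothetical) per-object bitstreams, so a
    decoder can recover the most important objects first."""
    ranked = sorted(objects, key=lambda o: o['importance'], reverse=True)
    stream = b''.join(o['bits'] for o in ranked)
    return [o['name'] for o in ranked], stream
```

A downstream task could then decode only the prefix of the stream covering the objects it needs.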

[CV-34] EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation

链接: https://arxiv.org/abs/2408.08570
作者: Jun Zhou,Chunsheng Liu,Faliang Chang,Wenqian Wang,Penghui Hao,Yiming Huang,Zhiqiang Yang
关键词-EN: requires comprehensive consideration, Associating driver attention, cross-domain perception problem, driver status tracking, hard cross-domain perception
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:Associating driver attention with the driving scene across two fields of view (FOVs) is a hard cross-domain perception problem, which requires comprehensive consideration of cross-view mapping, dynamic driving scene analysis, and driver status tracking. Previous methods typically focus on a single view or map attention to the scene via estimated gaze, failing to exploit the implicit connection between them. Moreover, simple fusion modules are insufficient for modeling the complex relationships between the two views, making information integration challenging. To address these issues, we propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net. This method enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net. Specifically, a Dynamic Adaptive Filter Module (DAF-Module) is proposed to address the challenges of frequently changing driving environments by extracting vital regions. It suppresses the indiscriminately recorded dynamics and highlights crucial ones by innovative joint frequency-spatial analysis, enhancing the model's ability to parse complex dynamics. Additionally, to track driver states during non-fixed facial poses, we propose a Global Context Sharing Module (GCS-Module) to construct refined feature representations by capturing hierarchical features that adapt to various scales of head and eye movements. Finally, W-Net achieves systematic cross-view information integration through its "Encoding-Independent Partial Decoding-Fusion Decoding" structure, addressing semantic misalignment in heterogeneous data integration. Experiments demonstrate that the proposed method robustly and accurately estimates the mapping of driver attention to the scene on large public datasets.

[CV-35] Unsupervised Non-Rigid Point Cloud Matching through Large Vision Models

链接: https://arxiv.org/abs/2408.08568
作者: Zhangquan Chen,Puhua Jiang,Ruqi Huang
关键词-EN: trained purely, correspondence annotation, extended naturally, point cloud matching, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:In this paper, we propose a novel learning-based framework for non-rigid point cloud matching, which can be trained purely on point clouds without any correspondence annotation and also be extended naturally to partial-to-full matching. Our key insight is to incorporate semantic features derived from large vision models (LVMs) into geometry-based shape feature learning. Our framework effectively leverages the structural information contained in the semantic features to address ambiguities arising from self-similarities among local geometries. Furthermore, our framework also inherits the strong generalizability and robustness of LVMs with respect to partial observations, leading to improvements in the corresponding point cloud matching tasks. In order to achieve the above, we propose a pixel-to-point feature aggregation module, a local and global attention network, as well as a geometrical similarity loss function. Experimental results show that our method achieves state-of-the-art results in matching non-rigid point clouds in both near-isometric and heterogeneous shape collections as well as on more realistic partial and noisy data.
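
At its core, pixel-to-point feature aggregation projects each 3D point into a rendered image and gathers the 2D (LVM) feature at that pixel. A minimal pinhole-camera sketch with a scalar feature per pixel (the function name, nearest-pixel lookup, and zero-fill for out-of-view points are assumptions; real modules aggregate multi-view, multi-channel features):

```python
def pixel_to_point_features(points, feat_map, fx, fy, cx, cy):
    """Project 3D points through a pinhole camera (intrinsics fx, fy,
    cx, cy) and gather the 2D feature at each projected pixel; points
    behind the camera or outside the image get a zero feature."""
    h, w = len(feat_map), len(feat_map[0])
    out = []
    for (x, y, z) in points:
        if z <= 0:          # behind the camera
            out.append(0.0)
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        out.append(feat_map[v][u] if 0 <= u < w and 0 <= v < h else 0.0)
    return out
```

The per-point semantic features gathered this way are then fused with geometric features before matching.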

[CV-36] S3Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

链接: https://arxiv.org/abs/2408.08567
作者: Xue Wang,Tian Zhou,Jianqing Zhu,Jialin Liu,Kun Yuan,Tao Yao,Wotao Yin,Rong Jin,HanQin Cai
关键词-EN: Attention based models, Attention, based Attention structure, Attention based, vanilla Attention based
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S^3Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S^3Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S^3Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting tasks show that S^3Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
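The two ingredients named above, a smoothing block and a sketching step that picks columns and rows of the input matrix, can be illustrated with a toy NumPy sketch. The moving-average smoother and the norm-based row/column selection below are illustrative stand-ins, not the paper's actual modules:

```python
import numpy as np

def smooth(X, w=3):
    """Smoothing block stand-in: moving average along the sequence axis."""
    kernel = np.ones(w) / w
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, X)

def skeleton_sketch(X, k):
    """Skeleton sketching stand-in: keep the k rows and k columns of X with the largest norms."""
    rows = np.argsort(-np.linalg.norm(X, axis=1))[:k]
    cols = np.argsort(-np.linalg.norm(X, axis=0))[:k]
    return X[rows][:, cols], rows, cols

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                  # a sequence of 128 tokens, dim 64
S, rows, cols = skeleton_sketch(smooth(X), k=16)
print(S.shape)                                   # the sketch is k x k, independent of sequence length
```

Because downstream attention would operate on the k-by-k skeleton rather than the full 128-token sequence, the cost stays linear in sequence length, which is the trade-off the abstract describes.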

[CV-37] A New Chinese Landscape Paintings Generation Model based on Stable Diffusion using DreamBooth HPCA

链接: https://arxiv.org/abs/2408.08561
作者: Yujia Gu,Xinyu Fang,Xueyuan Deng
关键词-EN: Stable Diffusion Model, Chinese Landscape Paintings, Stable Diffusion, generating Chinese Landscape, Landscape Paintings
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by AHPCAI

点击查看摘要

Abstract:This study introduces a method combining the Stable Diffusion Model (SDM) with parameter-efficient fine-tuning methods for generating Chinese Landscape Paintings. Training is accelerated by combining the pre-trained SDM with LoRA and with DreamBooth, respectively. On the Chinese Landscape Paintings Internet dataset used in this paper, this study finds that SDM combined with DreamBooth exhibits superior performance, outperforming other models, including the generic pre-trained SDM and the LoRA-based fine-tuned SDM. SDM combined with DreamBooth achieves a FID of 12.75 on the dataset and outperforms all other models in expert evaluation, highlighting the model’s versatility in the field of Chinese Landscape Paintings given the unique identifier, high fidelity, and high quality. This study illustrates the potential of specialised fine-tuning methods to improve the performance of SDM on domain-specific tasks, particularly in the domain of Landscape Paintings.

[CV-38] A training regime to learn unified representations from complementary breast imaging modalities

链接: https://arxiv.org/abs/2408.08560
作者: Umang Sharma,Jungkyu Park,Laura Heacock,Sumit Chopra,Krzysztof Geras
关键词-EN: Full Field Digital, Field Digital Mammograms, Digital Breast Tomosynthesis, Digital Mammograms, Full Field
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Full Field Digital Mammograms (FFDMs) and Digital Breast Tomosynthesis (DBT) are the two most widely used imaging modalities for breast cancer screening. Although DBT has increased cancer detection compared to FFDM, its widespread adoption in clinical practice has been slowed by increased interpretation times and a perceived decrease in the conspicuity of specific lesion types. Specifically, the non-inferiority of DBT for microcalcifications remains under debate. Due to concerns about the decrease in visual acuity, combined DBT-FFDM acquisitions remain popular, leading to overall increased exam times and radiation dosage. Enabling DBT to provide diagnostic information present in both FFDM and DBT would reduce reliance on FFDM, resulting in a reduction in both quantities. We propose a machine learning methodology that learns high-level representations leveraging the complementary diagnostic signal from both DBT and FFDM. Experiments on a large-scale data set validate our claims and show that our representations enable more accurate breast lesion detection than any DBT- or FFDM-based model.

[CV-39] Detection and tracking of MAVs using a LiDAR with rosette scanning pattern

链接: https://arxiv.org/abs/2408.08555
作者: Sándor Gazdag,Tom Möller,Tamás Filep,Anita Keszler,András L. Majdik
关键词-EN: commercial Micro Aerial, usage of commercial, increased drastically, Micro Aerial Vehicles, Micro Aerial
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The usage of commercial Micro Aerial Vehicles (MAVs) has increased drastically during the last decade. While the added value of MAVs to society is apparent, their growing use is also coming with increasing risks like violating public airspace at airports or committing privacy violations. To mitigate these issues it is becoming critical to develop solutions that incorporate the detection and tracking of MAVs with autonomous systems. This work presents a method for the detection and tracking of MAVs using a novel, low-cost rosette scanning LiDAR on a pan-tilt turret. Once the static background is captured, a particle filter is utilized to detect a possible target and track its position with a physical, programmable pan-tilt system. The tracking makes it possible to keep the MAV in the center, maximizing the density of 3D points measured on the target by the LiDAR sensor. The developed algorithm was evaluated within the indoor MIcro aerial vehicle and MOtion capture (MIMO) arena and has state-of-the-art tracking accuracy, stability, and fast re-detection time in case of tracking loss. Based on the outdoor tests, it was possible to significantly increase the detection distance and number of returned points compared to other similar methods using LiDAR.
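The detection-and-tracking loop described above rests on a particle filter over the LiDAR returns. A minimal 2D bootstrap version, with a random-walk motion model, a Gaussian measurement likelihood, and multinomial resampling (all simplifying assumptions, not the authors' implementation), looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, measurement, motion_std=0.5, meas_std=1.0):
    """One predict/update/resample cycle of a bootstrap particle filter on 2D positions."""
    # Predict: random-walk motion model
    particles = particles + rng.normal(scale=motion_std, size=particles.shape)
    # Update: weight each particle by the Gaussian likelihood of the measurement
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = weights * np.exp(-d2 / (2 * meas_std ** 2))
    weights = weights / weights.sum()
    # Resample (multinomial here for brevity; systematic resampling is the usual choice)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

particles = rng.uniform(-10, 10, size=(500, 2))   # uniform prior over the search area
weights = np.full(500, 1 / 500)
target = np.array([3.0, -2.0])                    # hypothetical MAV position
for _ in range(20):
    noisy_meas = target + rng.normal(scale=0.3, size=2)
    particles, weights = particle_filter_step(particles, weights, noisy_meas)
estimate = particles.mean(axis=0)                  # posterior mean, used to steer the pan-tilt
```

In the paper's setting the posterior mean would drive the pan-tilt turret so the rosette scanning pattern stays centered on the target.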

[CV-40] Scaling up Multimodal Pre-training for Sign Language Understanding

链接: https://arxiv.org/abs/2408.08544
作者: Wengang Zhou,Weichao Zhao,Hezhen Hu,Zecheng Li,Houqiang Li
关键词-EN: Sign language, Sign language serves, Sign, language, sign language understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Sign language recognition; Sign language translation; Sign language retrieval

点击查看摘要

Abstract:Sign language serves as the primary means of communication for the deaf-mute community. Different from spoken language, it commonly conveys information by the collaboration of manual features, i.e., hand gestures and body movements, and non-manual features, i.e., facial expressions and mouth cues. To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied in recent years, including isolated/continuous sign language recognition (ISLR/CSLR), gloss-free sign language translation (GF-SLT) and sign language retrieval (SL-RT). Sign language recognition and translation aim to understand the semantic meaning conveyed by sign languages at the gloss level and sentence level, respectively. In contrast, SL-RT focuses on retrieving sign videos or corresponding texts from a closed set under the query-by-example search paradigm. These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos. To advance the development of sign language understanding, exploring a generalized model that is applicable across various SLU tasks is a profound research direction.

[CV-41] Language-Driven Interactive Shadow Detection ACM-MM2024

链接: https://arxiv.org/abs/2408.08543
作者: Hongqiu Wang,Wei Wang,Haipeng Zhou,Huihui Xu,Shaozhi Wu,Lei Zhu
关键词-EN: Traditional shadow detectors, Traditional shadow, RVSD, natural language prompts, detectors often identify
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM MM 2024

点击查看摘要

Abstract:Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at this https URL.

[CV-42] Privacy-Preserving Vision Transformer Using Images Encrypted with Restricted Random Permutation Matrices

链接: https://arxiv.org/abs/2408.08529
作者: Kouki Horio,Kiyoshi Nishikawa,Hitoshi Kiya
关键词-EN: fine-tuning vision transformers, privacy-preserving fine-tuning vision, vision transformers, privacy-preserving fine-tuning, fine-tuning vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 9 figures

点击查看摘要

Abstract:We propose a novel method for privacy-preserving fine-tuning of vision transformers (ViTs) with encrypted images. Conventional methods using encrypted images degrade model performance compared with using plain images due to the influence of image encryption. In contrast, the proposed encryption method using restricted random permutation matrices can provide higher performance than the conventional ones.
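The core idea, encrypting an image with a secret block-level permutation that a ViT (which itself operates on patches) can still learn from, can be sketched as follows. A full random permutation is used here as a stand-in; the "restricted" structure of the permutation matrices is the paper's specific contribution and is not reproduced:

```python
import numpy as np

def block_permute(img, block=4, seed=0):
    """Encrypt a grayscale image by shuffling fixed-size blocks with a secret permutation.
    (Stand-in: unrestricted random permutation instead of the paper's restricted matrices.)"""
    h, w = img.shape
    blocks = img.reshape(h // block, block, w // block, block).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(-1, block, block)
    perm = np.random.default_rng(seed).permutation(len(blocks))  # the secret key
    return blocks[perm], perm

def block_unpermute(blocks, perm, h, w, block=4):
    """Invert the permutation and reassemble the image."""
    inv = np.argsort(perm)
    blocks = blocks[inv].reshape(h // block, w // block, block, block)
    return blocks.transpose(0, 2, 1, 3).reshape(h, w)

img = np.arange(64, dtype=np.uint8).reshape(8, 8)
enc, perm = block_permute(img)
dec = block_unpermute(enc, perm, 8, 8)   # round-trips exactly with the key
```

The round trip shows the encryption is lossless given the key; the open question the paper addresses is how to keep ViT accuracy high when training on the permuted blocks.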

[CV-43] Focus on Focus: Focus-oriented Representation Learning and Multi-view Cross-modal Alignment for Glioma Grading

链接: https://arxiv.org/abs/2408.08527
作者: Li Pan,Yupei Zhang,Qiushi Yang,Tan Li,Xiaohan Xing,Maximus C. F. Yeung,Zhen Chen
关键词-EN: achieved a promising, multimodal deep learning, Recently, molecular biomarkers, integrates histopathology slides
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, multimodal deep learning, which integrates histopathology slides and molecular biomarkers, has achieved a promising performance in glioma grading. Despite great progress, due to the intra-modality complexity and inter-modality heterogeneity, existing studies suffer from inadequate histopathology representation learning and inefficient molecular-pathology knowledge alignment. These two issues hinder existing methods to precisely interpret diagnostic molecular-pathology features, thereby limiting their grading performance. Moreover, the real-world applicability of existing multimodal approaches is significantly restricted as molecular biomarkers are not always available during clinical deployment. To address these problems, we introduce a novel Focus on Focus (FoF) framework with paired pathology-genomic training and applicable pathology-only inference, enhancing molecular-pathology representation effectively. Specifically, we propose a Focus-oriented Representation Learning (FRL) module to encourage the model to identify regions positively or negatively related to glioma grading and guide it to focus on the diagnostic areas with a consistency constraint. To effectively link the molecular biomarkers to morphological features, we propose a Multi-view Cross-modal Alignment (MCA) module that projects histopathology representations into molecular subspaces, aligning morphological features with corresponding molecular biomarker status by supervised contrastive learning. Experiments on the TCGA GBM-LGG dataset demonstrate that our FoF framework significantly improves the glioma grading. Remarkably, our FoF achieves superior performance using only histopathology slides compared to existing multimodal methods. The source code is available at this https URL.

[CV-44] GS-ID: Illumination Decomposition on Gaussian Splatting via Diffusion Prior and Parametric Light Source Optimization

链接: https://arxiv.org/abs/2408.08524
作者: Kang Du,Zhihao Liang,Zeyu Wang
关键词-EN: intuitive light editing, Gaussian Splatting, view synthesis, synthesis and intuitive, illumination decomposition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 13 figures

点击查看摘要

Abstract:We present GS-ID, a novel framework for illumination decomposition on Gaussian Splatting, achieving photorealistic novel view synthesis and intuitive light editing. Illumination decomposition is an ill-posed problem facing three main challenges: 1) priors for geometry and material are often lacking; 2) complex illumination conditions involve multiple unknown light sources; and 3) calculating surface shading with numerous light sources is computationally expensive. To address these challenges, we first introduce intrinsic diffusion priors to estimate the attributes for physically based rendering. Then we divide the illumination into environmental and direct components for joint optimization. Last, we employ deferred rendering to reduce the computational load. Our framework uses a learnable environment map and Spherical Gaussians (SGs) to represent light sources parametrically, therefore enabling controllable and photorealistic relighting on Gaussian Splatting. Extensive experiments and applications demonstrate that GS-ID produces state-of-the-art illumination decomposition results while achieving better geometry reconstruction and rendering performance.
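The Spherical Gaussians (SGs) used above as parametric light sources have a standard closed form, G(v) = mu * exp(lambda * (v . xi - 1)) for a unit lobe axis xi, sharpness lambda, and amplitude mu. A minimal evaluation routine (the parameter names are the conventional ones, not taken from the paper):

```python
import numpy as np

def spherical_gaussian(v, axis, sharpness, amplitude):
    """Evaluate one SG lobe: amplitude * exp(sharpness * (v . axis - 1)).
    v and axis are assumed to be unit direction vectors."""
    return amplitude * np.exp(sharpness * (np.dot(v, axis) - 1.0))

axis = np.array([0.0, 0.0, 1.0])
peak = spherical_gaussian(axis, axis, sharpness=10.0, amplitude=2.0)           # at the axis, G = amplitude
side = spherical_gaussian(np.array([1.0, 0.0, 0.0]), axis, 10.0, 2.0)          # decays away from the axis
```

The appeal for relighting is that each light source is just a few learnable scalars (axis, sharpness, amplitude), so gradients flow through this expression during joint optimization.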

[CV-45] Visual-Friendly Concept Protection via Selective Adversarial Perturbations

链接: https://arxiv.org/abs/2408.08518
作者: Xiaoyue Mi,Fan Tang,Juan Cao,Peng Li,Yang Liu
关键词-EN: Personalized concept generation, tuning diffusion models, raises potential legal, images raises potential, Personalized concept
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:Personalized concept generation by tuning diffusion models with a few images raises potential legal and ethical concerns regarding privacy and intellectual property rights. Researchers attempt to prevent malicious personalization using adversarial perturbations. However, previous efforts have mainly focused on the effectiveness of protection while neglecting the visibility of perturbations. They utilize global adversarial perturbations, which introduce noticeable alterations to original images and significantly degrade visual quality. In this work, we propose the Visual-Friendly Concept Protection (VCPro) framework, which prioritizes the protection of key concepts chosen by the image owner through adversarial perturbations with lower perceptibility. To ensure these perturbations are as inconspicuous as possible, we introduce a relaxed optimization objective to identify the least perceptible yet effective adversarial perturbations, solved using the Lagrangian multiplier method. Qualitative and quantitative experiments validate that VCPro achieves a better trade-off between the visibility of perturbations and protection effectiveness, effectively prioritizing the protection of target concepts in images with less perceptible perturbations.

[CV-46] Efficient Image-to-Image Diffusion Classifier for Adversarial Robustness

链接: https://arxiv.org/abs/2408.08502
作者: Hefei Mei,Minjing Dong,Chang Xu
关键词-EN: superior defense capability, demonstrated great potential, achieve superior defense, superior defense, defense capability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have demonstrated great potential in the field of adversarial robustness, where DM-based defense methods can achieve superior defense capability without adversarial training. However, they all require huge computational costs due to the usage of large-scale pre-trained DMs, making it difficult to conduct full evaluation under strong attacks and compare with traditional CNN-based methods. Simply reducing the network size and timesteps in DMs could significantly harm the image generation quality, which invalidates previous frameworks. To alleviate this issue, we redesign the diffusion framework from generating high-quality images to predicting distinguishable image labels. Specifically, we employ an image translation framework to learn many-to-one mapping from input samples to designed orthogonal image labels. Based on this framework, we introduce an efficient Image-to-Image diffusion classifier with a pruned U-Net structure and reduced diffusion timesteps. Besides the framework, we redesign the optimization objective of DMs to fit the target of image classification, where a new classification loss is incorporated in the DM-based image translation framework to distinguish the generated label from those of other classes. We conduct sufficient evaluations of the proposed classifier under various attacks on popular benchmarks. Extensive experiments show that our method achieves better adversarial robustness with fewer computational costs than DM-based and CNN-based methods. The code is available at this https URL.
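The "many-to-one mapping to designed orthogonal image labels" idea can be illustrated independently of the diffusion model itself: build mutually orthogonal label vectors, then classify a (simulated) translated output by its nearest label. The random orthonormal basis below is an assumption for illustration; the paper's label design is not specified here:

```python
import numpy as np

def make_orthogonal_labels(n_classes, dim):
    """Designed orthogonal 'image labels': rows of a random orthonormal basis (illustrative)."""
    q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(dim, dim)))
    return q[:n_classes]

def classify(generated, labels):
    """Predict the class whose label vector the generated output is closest to (max inner product)."""
    return int(np.argmax(labels @ generated))

labels = make_orthogonal_labels(10, 64)
# A perfect image-to-image translator would emit exactly the label of the true class;
# we simulate a noisy translated output for class 3.
noisy = labels[3] + 0.05 * np.random.default_rng(1).normal(size=64)
pred = classify(noisy, labels)
```

Because the labels are orthogonal, even a fairly noisy translation still projects most strongly onto the correct label, which is what makes the predicted labels "distinguishable" in the abstract's sense.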

[CV-47] CoSEC: A Coaxial Stereo Event Camera Dataset for Autonomous Driving

链接: https://arxiv.org/abs/2408.08500
作者: Shihan Peng,Hanyu Zhou,Hao Dong,Zhiwei Shi,Haoyue Liu,Yuxing Duan,Yi Chang,Luxin Yan
关键词-EN: Conventional frame camera, Conventional frame, driving scene perception, multimodal fusion, multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Conventional frame camera is the mainstream sensor of the autonomous driving scene perception, while it is limited in adverse conditions, such as low light. Event camera with high dynamic range has been applied in assisting frame camera for the multimodal fusion, which relies heavily on the pixel-level spatial alignment between various modalities. Typically, existing multimodal datasets mainly place event and frame cameras in parallel and directly align them spatially via warping operation. However, this parallel strategy is less effective for multimodal fusion, since the large disparity exacerbates spatial misalignment due to the large event-frame baseline. We argue that baseline minimization can reduce alignment error between event and frame cameras. In this work, we introduce hybrid coaxial event-frame devices to build the multimodal system, and propose a coaxial stereo event camera (CoSEC) dataset for autonomous driving. As for the multimodal system, we first utilize the microcontroller to achieve time synchronization, and then spatially calibrate different sensors, where we perform intra- and inter-calibration of stereo coaxial devices. As for the multimodal dataset, we filter LiDAR point clouds to generate depth and optical flow labels using reference depth, which is further improved by fusing aligned event and frame data in nighttime conditions. With the help of the coaxial device, the proposed dataset can promote the all-day pixel-level multimodal fusion. Moreover, we also conduct experiments to demonstrate that the proposed dataset can improve the performance and generalization of the multimodal fusion.

[CV-48] Achieving Complex Image Edits via Function Aggregation with Diffusion Models

链接: https://arxiv.org/abs/2408.08495
作者: Mohammadreza Samadi,Fred X. Han,Mohammad Salameh,Hao Wu,Fengyu Sun,Chunhua Zhou,Di Niu
关键词-EN: demonstrated strong performance, making them ideal, demonstrated strong, strong performance, performance in generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have demonstrated strong performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet two key challenges persist. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations in other parts of the image. In this work, we introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. FunEditor achieves 5 to 24 times faster inference than existing methods on complex tasks like object movement. Our experiments demonstrate that FunEditor significantly outperforms recent baselines, including both inference-time optimization methods and fine-tuned models, across various metrics, such as image quality assessment (IQA) and object-background consistency.

[CV-49] TEXTOC: Text-driven Object-Centric Style Transfer

链接: https://arxiv.org/abs/2408.08461
作者: Jihun Park,Jongmin Gim,Kyoungmin Lee,Seunghun Lee,Sunghoon Im
关键词-EN: present Text-driven Object-Centric, Text-driven Object-Centric Style, present Text-driven, Text-driven Object-Centric, guides style transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present Text-driven Object-Centric Style Transfer (TEXTOC), a novel method that guides style transfer at an object-centric level using textual inputs. The core of TEXTOC is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric transformations that are closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for even CLIP embedding distribution across object regions. It ensures a seamless and harmonious style transfer across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image’s background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style transfers.
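The patch directional loss above builds on the standard CLIP-style directional objective, 1 - cos(image edit direction, text edit direction). A vector-level sketch, using random stand-in vectors in place of real CLIP embeddings:

```python
import numpy as np

def directional_loss(src_img_emb, tgt_img_emb, src_txt_emb, tgt_txt_emb):
    """1 - cosine similarity between the image edit direction and the text edit direction.
    Embeddings here are stand-ins for CLIP features of a patch and of the prompts."""
    d_img = tgt_img_emb - src_img_emb
    d_txt = tgt_txt_emb - src_txt_emb
    cos = np.dot(d_img, d_txt) / (np.linalg.norm(d_img) * np.linalg.norm(d_txt))
    return 1.0 - cos

rng = np.random.default_rng(0)
src_img, src_txt = rng.normal(size=64), rng.normal(size=64)
shift = rng.normal(size=64)
# When the stylized patch moves in embedding space exactly as the text does, the loss vanishes.
aligned = directional_loss(src_img, src_img + shift, src_txt, src_txt + shift)
```

Applying this per patch (rather than per image) is what lets the transfer stay object-centric: only patches matched to the text are pushed along the text direction.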

[CV-50] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

链接: https://arxiv.org/abs/2408.08459
作者: Xiaochuang Han,Marjan Ghazvininejad,Pang Wei Koh,Yulia Tsvetkov
关键词-EN: potentially easy integration, Recent work, generality and potentially, potentially easy, easy integration
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization – representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
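The core recipe, feeding compressed-file bytes to a plain LLM as a 256-symbol vocabulary, can be shown end to end with stdlib tools. zlib stands in for the JPEG/AVC codecs used in the paper so the sketch needs no image library:

```python
import zlib

def bytes_to_tokens(data: bytes):
    """Treat each byte of a compressed file as a token id in a 256-symbol vocabulary."""
    return list(data)

def tokens_to_bytes(tokens):
    """Invert tokenization: an autoregressive model emitting these ids reconstructs the file."""
    return bytes(tokens)

# Stand-in 'image' payload; in the paper the codec is JPEG (and AVC/H.264 for video).
raw = bytes(range(16)) * 8
compressed = zlib.compress(raw)
tokens = bytes_to_tokens(compressed)        # sequence a byte-level LM would be trained on
roundtrip = zlib.decompress(tokens_to_bytes(tokens))
```

The point of the paper is that this trivially invertible tokenization, unlike vector quantization, needs no learned codebook: any byte sequence the LM emits in valid JPEG format decodes directly to an image.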

[CV-51] Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

链接: https://arxiv.org/abs/2408.08454
作者: Zohaib Khan,Muhammad Khaquan,Omer Tafveez,Agha Ali Raza
关键词-EN: revolutionized deep learning, effectively captures contextual, captures contextual information, GQA, architecture has revolutionized
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities.
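The key-norm-driven allocation behind KDGQA/DGQA can be sketched as a proportional assignment of query heads to key-value heads. This is a simplified reading of the idea (allocate by the ratios of key-head norms), not the paper's implementation:

```python
import numpy as np

def allocate_queries(key_heads, n_query_heads):
    """Assign query heads to key-value heads in proportion to key-head norms
    (simplified stand-in for the key-distributed grouping idea)."""
    norms = np.array([np.linalg.norm(k) for k in key_heads])
    ratios = norms / norms.sum()
    alloc = np.floor(ratios * n_query_heads).astype(int)
    # Hand out any leftover query heads to the largest-norm key heads
    for i in np.argsort(-norms)[: n_query_heads - alloc.sum()]:
        alloc[i] += 1
    return alloc

rng = np.random.default_rng(0)
# Four KV heads with increasingly large key activations (16 tokens, head dim 64)
key_heads = [rng.normal(scale=s, size=(16, 64)) for s in (0.5, 1.0, 2.0, 4.0)]
alloc = allocate_queries(key_heads, n_query_heads=16)   # larger-norm heads get more queries
```

In KDGQA this allocation would be recomputed from the norms at each forward pass, and in DGQA it would track how the norms evolve over training.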

[CV-52] SpectralEarth: Training Hyperspectral Foundation Models at Scale

链接: https://arxiv.org/abs/2408.08447
作者: Nassim Ait Ali Braham,Conrad M Albrecht,Julien Mairal,Jocelyn Chanussot,Yi Wang,Xiao Xiang Zhu
关键词-EN: remote sensing, multispectral imagery, triggered a paradigm, paradigm shift, shift in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, we introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models leveraging data from the Environmental Mapping and Analysis Program (EnMAP). SpectralEarth comprises 538,974 image patches covering 415,153 unique locations from more than 11,636 globally distributed EnMAP scenes spanning two years of archive. Additionally, 17.5% of these locations include multiple timestamps, enabling multi-temporal HSI analysis. Utilizing state-of-the-art self-supervised learning (SSL) algorithms, we pretrain a series of foundation models on SpectralEarth. We integrate a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation. Experimental results support the versatility of our models, showcasing their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning. The dataset, models, and source code will be made publicly available.

[CV-53] PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications

链接: https://arxiv.org/abs/2408.08437
作者: Kshitij Bhardwaj
关键词-EN: replacing convolutional neural, convolutional neural networks, computer vision tasks, optimize vision transformers, Vision Transformers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state-of-the-art, they are complex and memory-intensive models. In order to effectively run these models on resource-constrained mobile/edge systems, there is a need to not only compress these models but also to optimize them and convert them into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool is able to support different types of structured pruning based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends. We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that even after pruning a DeiT model by 9.375% and quantizing it from FP32 to int8, followed by optimization for mobile applications, we observe a latency reduction of 7.18X with a small accuracy loss of 2.24%. The tool is open source.
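The two transformations PQV-Mobile combines, pruning and int8 quantization, can be sketched on a single weight matrix. Magnitude pruning and per-tensor symmetric quantization are assumptions for illustration; the tool itself also supports Taylor/Hessian importance and different backends:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights (magnitude-importance pruning)."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; returns the int8 weights and the scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.09375)    # the 9.375% pruning level from the abstract
w_q, scale = quantize_int8(w_pruned)               # 4x smaller than FP32 storage
w_deq = w_q.astype(np.float32) * scale             # dequantize to inspect the error
err = np.abs(w_deq - w_pruned).max()               # bounded by roughly scale / 2
```

The latency-memory-accuracy trade-off the abstract reports comes precisely from tuning these two knobs (sparsity level and quantized precision) before export to a mobile format.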

[CV-54] Penny-Wise and Pound-Foolish in Deepfake Detection

链接: https://arxiv.org/abs/2408.08412
作者: Yabin Wang,Zhiwu Huang,Su Zhou,Adam Prugel-Bennett,Xiaopeng Hong
关键词-EN: deepfake detection, deepfake, prompting the urgent, technologies has sparked, sparked serious concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The diffusion of deepfake technologies has sparked serious concerns about its potential misuse across various domains, prompting the urgent need for robust detection methods. Despite advancement, many current approaches prioritize short-term gains at the expense of long-term effectiveness. This paper critiques the overly specialized approach of fine-tuning pre-trained models solely with a penny-wise objective on a single deepfake dataset, while disregarding the pound-wise balance for generalization and knowledge retention. To address this “Penny-Wise and Pound-Foolish” issue, we propose a novel learning framework (PoundNet) for generalization of deepfake detection on a pre-trained vision-language model. PoundNet incorporates a learnable prompt design and a balanced objective to preserve broad knowledge from upstream tasks (object classification) while enhancing generalization for downstream tasks (deepfake detection). We train PoundNet on a standard single deepfake dataset, following common practice in the literature. We then evaluate its performance across 10 public large-scale deepfake datasets with 5 main evaluation metrics, forming the largest benchmark test set for assessing the generalization ability of deepfake detection models, to our knowledge. The comprehensive benchmark evaluation demonstrates that the proposed PoundNet is significantly less “Penny-Wise and Pound-Foolish”, achieving a remarkable improvement of 19% in deepfake detection performance compared to state-of-the-art methods, while maintaining a strong performance of 63% on object classification tasks, where other deepfake detection models tend to be ineffective. Code and data are open-sourced at this https URL.

[CV-55] Level Up Your Tutorials: VLMs for Game Tutorials Quality Assessment ECCV2024

链接: https://arxiv.org/abs/2408.08396
作者: Daniele Rege Cambrin,Gabriele Scaffidi Militone,Luca Colomba,Giovanni Malnati,Daniele Apiletti,Paolo Garza
关键词-EN: complex core mechanics, Designing effective game, smooth learning curve, Designing effective, core mechanics
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted at ECCV 2024 CV2 Workshop

点击查看摘要

Abstract:Designing effective game tutorials is crucial for a smooth learning curve for new players, especially in games with many rules and complex core mechanics. Evaluating the effectiveness of these tutorials usually requires multiple iterations with testers who have no prior knowledge of the game. Recent Vision-Language Models (VLMs) have demonstrated significant capabilities in understanding and interpreting visual content. VLMs can analyze images, provide detailed insights, and answer questions about their content. They can recognize objects, actions, and contexts in visual data, making them valuable tools for various applications, including automated game testing. In this work, we propose an automated game-testing solution to evaluate the quality of game tutorials. Our approach leverages VLMs to analyze frames from video game tutorials, answer relevant questions to simulate human perception, and provide feedback. This feedback is compared with expected results to identify confusing or problematic scenes and highlight potential errors for developers. In addition, we publish complete tutorial videos and annotated frames from different game versions used in our tests. This solution reduces the need for extensive manual testing, especially by speeding up and simplifying the initial development stages of the tutorial to improve the final game experience.
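The evaluation loop the abstract describes (query a VLM about each tutorial frame, compare its answers with the expected ones, and flag divergent scenes for developers) can be sketched as follows. This is an illustrative outline only: `query_vlm`, the question set, and the expected answers are invented stand-ins, not the authors' actual pipeline.

```python
# Illustrative sketch of the frame-evaluation loop described in the abstract.
# `query_vlm` is a hypothetical stand-in for any vision-language model call;
# the frames, questions, and expected answers below are invented.

def evaluate_tutorial(frames, questions, expected, query_vlm):
    """Flag frames where the VLM's answers diverge from the expected ones."""
    flagged = []
    for frame_id, frame in frames.items():
        mismatches = [
            q for q in questions
            if query_vlm(frame, q) != expected[frame_id][q]
        ]
        if mismatches:  # scene is potentially confusing for a new player
            flagged.append((frame_id, mismatches))
    return flagged

# Toy run with a fake VLM that always answers "yes".
frames = {"intro_01": "<frame bytes>", "intro_02": "<frame bytes>"}
questions = ["Is the objective visible?"]
expected = {"intro_01": {"Is the objective visible?": "yes"},
            "intro_02": {"Is the objective visible?": "no"}}
fake_vlm = lambda frame, q: "yes"
print(evaluate_tutorial(frames, questions, expected, fake_vlm))
# flags intro_02, where the VLM's answer differs from the expected one
```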

[CV-56] Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension

链接: https://arxiv.org/abs/2408.08381
作者: Nicholas Konz,Maciej A. Mazurowski
关键词-EN: geometric properties, medical image models, recent years, generalization ability, important model behavior
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network’s hidden representations evolve through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network’s training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network’s learned representations evolves through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that the shape of this ID evolution curve differs noticeably between natural and medical image models: medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model’s learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network’s learned features are shaped by its training data.

[CV-57] 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

链接: https://arxiv.org/abs/2408.08345
作者: Dongshuo Yin,Leiyi Hu,Bin Li,Youqun Zhang,Xue Yang
关键词-EN: Pre-training fine-tuning, full fine-tuning, fine-tuning, transferring efficiency, visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2311.15010

点击查看摘要

Abstract:Pre-training fine-tuning can enhance the transferring efficiency and performance in visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning art fails to exceed the upper limit of full fine-tuning on challenging tasks like object detection and segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability to process visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add the scaled normalization layer in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, oriented object detection on DOTA/STAR, and image classification on three common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks, and is the only delta-tuning method outperforming full fine-tuning on the above various tasks. For example, Mona achieves 1% performance gain on the COCO dataset compared to full fine-tuning. Comprehensive results suggest that Mona-tuning is more suitable for retaining and utilizing the capabilities of pre-trained models than full fine-tuning. We will make the code publicly available.
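As a rough illustration of the adapter shape the abstract describes (a scaled normalization layer, multiple vision-friendly filters applied in a low-dimensional space, and a residual connection), here is a conceptual numpy sketch. The filter choices, dimensions, and aggregation are invented stand-ins; the actual Mona module is defined in the paper.

```python
import numpy as np

# Conceptual sketch of an adapter with (1) scaled normalization of the input
# features, (2) several simple "filters" in a low-dimensional bottleneck as
# stand-ins for Mona's multi-cognitive visual filters, and (3) a residual
# connection. All shapes and filter forms here are illustrative assumptions.

rng = np.random.default_rng(0)

def layer_norm(x, gamma, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

def mona_like_adapter(x, w_down, w_up, gamma, scale):
    h = layer_norm(x, gamma) * scale          # scaled normalization layer
    h = np.maximum(h @ w_down, 0.0)           # down-projection + ReLU
    # three toy 1D filters over the token axis, then aggregated
    f1 = h
    f2 = (np.roll(h, 1, axis=0) + h + np.roll(h, -1, axis=0)) / 3.0
    f3 = (np.roll(h, 2, axis=0) + h + np.roll(h, -2, axis=0)) / 3.0
    h = (f1 + f2 + f3) / 3.0
    return x + h @ w_up                       # up-projection + residual

tokens, dim, bottleneck = 16, 32, 8
x = rng.standard_normal((tokens, dim))
out = mona_like_adapter(
    x,
    w_down=rng.standard_normal((dim, bottleneck)) * 0.1,
    w_up=rng.standard_normal((bottleneck, dim)) * 0.1,
    gamma=np.ones(dim),
    scale=0.5,
)
print(out.shape)  # the adapter preserves the feature shape: (16, 32)
```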

[CV-58] CT4D: Consistent Text-to-4D Generation with Animatable Meshes

链接: https://arxiv.org/abs/2408.08342
作者: Ce Chen,Shaoli Huang,Xuelin Chen,Guangyi Chen,Xiaoguang Han,Kun Zhang,Mingming Gong
关键词-EN: image diffusion model, video diffusion model, diffusion model, image diffusion, video diffusion
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-4D generation has recently been demonstrated viable by integrating a 2D image diffusion model with a video diffusion model. However, existing models tend to produce results with inconsistent motions and geometric structures over time. To this end, we present a novel framework, coined CT4D, which directly operates on animatable meshes for generating consistent 4D content from arbitrary user-supplied prompts. The primary challenges of our mesh-based framework involve stably generating a mesh with details that align with the text prompt while directly driving it and maintaining surface continuity. Our CT4D framework incorporates a unique Generate-Refine-Animate (GRA) algorithm to enhance the creation of text-aligned meshes. To improve surface continuity, we divide a mesh into several smaller regions and implement a uniform driving function within each area. Additionally, we constrain the animating stage with a rigidity regulation to ensure cross-region continuity. Our experimental results, both qualitative and quantitative, demonstrate that our CT4D framework surpasses existing text-to-4D techniques in maintaining interframe consistency and preserving global geometry. Furthermore, we showcase that this enhanced representation inherently possesses the capability for combinational 4D generation and texture editing.

[CV-59] METR: Image Watermarking with Large Number of Unique Messages

链接: https://arxiv.org/abs/2408.08340
作者: Alexander Varlamov,Daria Diatlova,Egor Spirin
关键词-EN: led researchers, improving watermarking algorithms, focus on improving, METR, Diffusion Model
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 9 figures, code is available at this https URL

点击查看摘要

Abstract:Improvements in diffusion models have boosted the quality of image generation, which has led researchers, companies, and creators to focus on improving watermarking algorithms. This provision would make it possible to clearly identify the creators of generative art. The main challenges that modern watermarking algorithms face have to do with their ability to withstand attacks and encrypt many unique messages, such as user IDs. In this paper, we present METR: Message Enhanced Tree-Ring, which is an approach that aims to address these challenges. METR is built on the Tree-Ring watermarking algorithm, a technique that makes it possible to encode multiple distinct messages without compromising attack resilience or image quality. This ensures the suitability of this watermarking algorithm for any Diffusion Model. In order to surpass the limitations on the quantity of encoded messages, we propose METR++, an enhanced version of METR. This approach, while limited to the Latent Diffusion Model architecture, is designed to inject a virtually unlimited number of unique messages. We demonstrate its robustness to attacks and ability to encrypt many unique messages while preserving image quality, which makes METR and METR++ hold great potential for practical applications in real-world settings. Our code is available at this https URL
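The core Tree-Ring idea that METR builds on (writing information into concentric rings of an image's Fourier spectrum) can be illustrated with a toy encoder/decoder. This is a simplified sketch only: the real algorithms embed into a diffusion model's initial latent noise and differ substantially; the ring width, amplitudes, and sizes here are arbitrary.

```python
import numpy as np

# Toy illustration of ring-based message encoding in the Fourier domain:
# each message bit sets a concentric ring of the (shifted) FFT of a noise
# image to a constant positive or negative value, and decoding reads the
# rings back. Purely pedagogical; not the actual METR algorithm.

def ring_mask(n, r_inner, r_outer):
    yy, xx = np.mgrid[:n, :n]
    dist = np.hypot(yy - n // 2, xx - n // 2)
    return (dist >= r_inner) & (dist < r_outer)

def embed(noise, bits, ring_width=3):
    spec = np.fft.fftshift(np.fft.fft2(noise))
    for i, b in enumerate(bits):
        m = ring_mask(noise.shape[0], i * ring_width, (i + 1) * ring_width)
        spec[m] = 10.0 if b else -10.0   # one constant ring value per bit
    return np.fft.ifft2(np.fft.ifftshift(spec)).real

def decode(img, n_bits, ring_width=3):
    spec = np.fft.fftshift(np.fft.fft2(img))
    bits = []
    for i in range(n_bits):
        m = ring_mask(img.shape[0], i * ring_width, (i + 1) * ring_width)
        bits.append(spec[m].real.mean() > 0)
    return bits

rng = np.random.default_rng(0)
noise = rng.standard_normal((32, 32))
message = [True, False, True, True]
watermarked = embed(noise, message)
print(decode(watermarked, len(message)))  # recovers the embedded bits
```

Because the ring mask is symmetric about the spectrum's center and each ring is set to a real constant, the modified spectrum stays Hermitian and the watermarked image remains real-valued.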

[CV-60] Graph representations of 3D data for machine learning

链接: https://arxiv.org/abs/2408.08336
作者: Tomasz Prytuła
关键词-EN: machine learning algorithms, graphs and meshes, learning algorithms, give an overview, overview of combinatorial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:We give an overview of combinatorial methods to represent 3D data, such as graphs and meshes, from the viewpoint of their amenability to analysis using machine learning algorithms. We highlight pros and cons of various representations and we discuss some methods of generating/switching between the representations. We finally present two concrete applications in life science and industry. Despite its theoretical nature, our discussion is in general motivated by, and biased towards real-world challenges.

[CV-61] TurboEdit: Instant text-based image editing ECCV

链接: https://arxiv.org/abs/2408.08332
作者: Zongze Wu,Nicholas Kolkin,Jonathan Brandt,Richard Zhang,Eli Shechtman
关键词-EN: precise image inversion, address the challenges, challenges of precise, image, input image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: this https URL

点击查看摘要

Abstract:We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompts. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 function evaluations (NFEs) in inversion (a one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

[CV-62] Can ChatGPT assist visually impaired people with micro-navigation?

链接: https://arxiv.org/abs/2408.08321
作者: Junxian He,Shrinivas Pundlik,Gang Luo
关键词-EN: Micro-navigation poses challenges, poses challenges, challenges for blind, SPE, Objective
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Objective: Micro-navigation poses challenges for blind and visually impaired individuals. They often need to ask for sighted assistance. We explored the feasibility of utilizing ChatGPT as a virtual assistant to provide navigation directions. Methods: We created a test set of outdoor and indoor micro-navigation scenarios consisting of 113 scene images and their human-generated text descriptions. A total of 412 way-finding queries and their expected responses were compiled based on the scenarios. Not all queries are answerable based on the information available in the scene image. An "I do not know" response was expected for unanswerable queries, which served as negative cases. High-level orientation responses were expected, and step-by-step guidance was not required. ChatGPT 4o was evaluated based on sensitivity (SEN) and specificity (SPE) under different conditions. Results: The default ChatGPT 4o, with scene images as inputs, resulted in SEN and SPE values of 64.8% and 75.9%, respectively. Instruction on how to respond to unanswerable questions did not improve SEN substantially, but SPE increased by around 14 percentage points. SEN and SPE both improved substantially, by about 17 and 16 percentage points on average respectively, when human-written descriptions of the scenes were provided as input instead of images. Providing further prompt instructions to the assistants when the input was text description did not substantially change the SEN and SPE values. Conclusion: Current native ChatGPT 4o is still unable to provide correct micro-navigation guidance in some cases, probably because its scene understanding is not optimized for navigation purposes. If multi-modal chatbots could interpret scenes with a level of clarity comparable to humans, and are also guided by appropriate prompts, they may have the potential to provide assistance to the visually impaired for micro-navigation.
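The SEN/SPE bookkeeping behind the figures above follows the standard definitions, treating correctly answered answerable queries as true positives and correct "I do not know" responses to unanswerable queries as true negatives. A minimal sketch with invented query records:

```python
# Minimal sketch of the sensitivity/specificity computation: answerable
# queries answered correctly count as true positives, and "I do not know"
# on unanswerable queries counts as a true negative. The records below are
# invented for illustration, not the paper's data.

def sen_spe(records):
    """records: list of (answerable: bool, answered_correctly: bool)."""
    tp = sum(1 for a, ok in records if a and ok)
    fn = sum(1 for a, ok in records if a and not ok)
    tn = sum(1 for a, ok in records if not a and ok)
    fp = sum(1 for a, ok in records if not a and not ok)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# 8 answerable queries (6 answered correctly), 4 unanswerable (3 declined correctly)
records = [(True, True)] * 6 + [(True, False)] * 2 \
        + [(False, True)] * 3 + [(False, False)] * 1
sen, spe = sen_spe(records)
print(f"SEN={sen:.1%}  SPE={spe:.1%}")  # SEN=75.0%  SPE=75.0%
```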

[CV-63] Segment Anything for Videos: A Systematic Survey

链接: https://arxiv.org/abs/2408.08315
作者: Chunhui Zhang,Yawen Cui,Weilin Lin,Guanjie Huang,Yan Rong,Li Liu,Shiguang Shan
关键词-EN: witnessed tremendous success, exploring task-agnostic visual, SAM, computer vision, task-agnostic visual foundation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:The recent wave of foundation models has witnessed tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (\eg, text-to-mask) tasks, but also in the video domain. Additionally, the latest released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks, a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review on SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks by discussing its recent advances, and innovation opportunities of developing foundation models on broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, as well as insightful analysis are offered. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.

[CV-64] HistoGym: A Reinforcement Learning Environment for Histopathological Image Analysis

链接: https://arxiv.org/abs/2408.08847
作者: Zhi-Bo Liu,Xiaobo Pang,Jizhao Wang,Shuai Liu,Chen Li
关键词-EN: decision-making process based, pathological research, critically important, pathological images, based on pathological
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In pathological research, education, and clinical practice, the decision-making process based on pathological images is critically important. This significance extends to digital pathology image analysis: its adequacy is demonstrated by the extensive information contained within tissue structures, which is essential for accurate cancer classification and grading. Additionally, its necessity is highlighted by the inherent requirement for interpretability in the conclusions generated by algorithms. For humans, determining tumor type and grade typically involves multi-scale analysis, which presents a significant challenge for AI algorithms. Traditional patch-based methods are inadequate for modeling such complex structures, as they fail to capture the intricate, multi-scale information inherent in whole slide images. Consequently, there is a pressing need for advanced AI techniques capable of efficiently and accurately replicating this complex analytical process. To address this issue, we introduce HistoGym, an open-source reinforcement learning environment for histopathological image analysis. Following OpenAI Gym APIs, HistoGym aims to foster whole slide image diagnosis by mimicking the real-life processes of doctors. Leveraging the pyramid feature of WSIs and the OpenSlide API, HistoGym provides a unified framework for various clinical tasks, including tumor detection and classification. We detail the observation, action, and reward specifications tailored for the histopathological image analysis domain and provide an open-source Python-based interface for both clinicians and researchers. To accommodate different clinical demands, we offer various scenarios for different organs and cancers, including both WSI-based and selected region-based scenarios, showcasing several noteworthy results.

[CV-65] Assessing Generalization Capabilities of Malaria Diagnostic Models from Thin Blood Smears MICCAI2024

链接: https://arxiv.org/abs/2408.08792
作者: Louise Guillon,Soheib Biga,Axel Puyo,Grégoire Pasquier,Valentin Foucher,Yendoubé E. Kantchire,Stéphane E. Sossou,Ameyo M. Dorkenoo,Laurent Bonnardot,Marc Thellier,Laurence Lachaud,Renaud Piarroux
关键词-EN: global health challenge, accurate diagnostic methods, significant global health, health challenge, necessitating rapid
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: MICCAI 2024 AMAI Workshop, Accepted for presentation, Submitted Manuscript Version, 10 pages

点击查看摘要

Abstract:Malaria remains a significant global health challenge, necessitating rapid and accurate diagnostic methods. While computer-aided diagnosis (CAD) tools utilizing deep learning have shown promise, their generalization to diverse clinical settings remains poorly assessed. This study evaluates the generalization capabilities of a CAD model for malaria diagnosis from thin blood smear images across four sites. We explore strategies to enhance generalization, including fine-tuning and incremental learning. Our results demonstrate that incorporating site-specific data significantly improves model performance, paving the way for broader clinical application.

[CV-66] A Disease-Specific Foundation Model Using Over 100K Fundus Images: Release and Validation for Abnormality and Multi-Disease Classification on Downstream Tasks

链接: https://arxiv.org/abs/2408.08790
作者: Boa Jang,Youngbin Ahn,Eun Kyung Choe,Chang Ki Yoon,Hyuk Jin Choi,Young-Gon Kim
关键词-EN: Artificial intelligence applied, Artificial intelligence, offers significant potential, artificial intelligence models, generalized artificial intelligence
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Artificial intelligence applied to retinal images offers significant potential for recognizing signs and symptoms of retinal conditions and expediting the diagnosis of eye diseases and systemic disorders. However, developing generalized artificial intelligence models for medical data often requires a large number of labeled images representing various disease signs, and most models are typically task-specific, focusing on major retinal diseases. In this study, we developed a Fundus-Specific Pretrained Model (Image+Fundus), a supervised artificial intelligence model trained to detect abnormalities in fundus images. A total of 57,803 images were used to develop this pretrained model, which achieved superior performance across various downstream tasks, indicating that our proposed model outperforms other general methods. Our Image+Fundus model offers a generalized approach to improve model performance while reducing the number of labeled datasets required. Additionally, it provides more disease-specific insights into fundus images, with visualizations generated by our model. These disease-specific foundation models are invaluable in enhancing the performance and efficiency of deep learning models in the field of fundus imaging.

[CV-67] Multi-task Learning Approach for Intracranial Hemorrhage Prognosis

链接: https://arxiv.org/abs/2408.08784
作者: Miriam Cobo,Amaia Pérez del Barrio,Pablo Menéndez Fernández-Miranda,Pablo Sanz Bellón,Lara Lloret Iglesias,Wilson Silva
关键词-EN: intracranial hemorrhage, complex interplay, interplay between imaging, imaging and tabular, Glasgow Coma Scale
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages

点击查看摘要

Abstract:Prognosis after intracranial hemorrhage (ICH) is influenced by a complex interplay between imaging and tabular data. Rapid and reliable prognosis are crucial for effective patient stratification and informed treatment decision-making. In this study, we aim to enhance image-based prognosis by learning a robust feature representation shared between prognosis and the clinical and demographic variables most highly correlated with it. Our approach mimics clinical decision-making by reinforcing the model to learn valuable prognostic data embedded in the image. We propose a 3D multi-task image model to predict prognosis, Glasgow Coma Scale and age, improving accuracy and interpretability. Our method outperforms current state-of-the-art baseline image models, and demonstrates superior performance in ICH prognosis compared to four board-certified neuroradiologists using only CT scans as input. We further validate our model with interpretability saliency maps. Code is available at this https URL.

[CV-68] MicroSSIM: Improved Structural Similarity for Comparing Microscopy Data ECCV24

链接: https://arxiv.org/abs/2408.08747
作者: Ashesh Ashesh,Joran Deschamps,Florian Jug
关键词-EN: image biological structures, structures of interest, SSIM, biological structures, images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at BIC workshop, ECCV 24

点击查看摘要

Abstract:Microscopy is routinely used to image biological structures of interest. Due to imaging constraints, acquired images are typically low-SNR and contain noise. Over the last few years, regression-based tasks like unsupervised denoising and splitting have found utility in working with such noisy micrographs. For evaluation, Structural Similarity (SSIM) is one of the most popular measures used in the field. For such tasks, the best evaluation would be when both low-SNR noisy images and corresponding high-SNR clean images are obtained directly from a microscope. However, due to the following three peculiar properties of the microscopy data, we observe that SSIM is not well suited to this data regime: (a) high-SNR micrographs have higher intensity pixels as compared to low-SNR micrographs, (b) high-SNR micrographs have higher intensity pixels than those found in natural images, the images for which SSIM was developed, and (c) a digitally configurable offset is added by the detector present inside the microscope. We show that SSIM components behave unexpectedly when the prediction generated from low-SNR input is compared with the corresponding high-SNR data. We explain this behavior by introducing the phenomenon of saturation, where the value of SSIM components becomes less sensitive to (dis)similarity between the images. We introduce microSSIM, a variant of SSIM, which overcomes the above-discussed issues. We justify the soundness and utility of microSSIM using theoretical and empirical arguments and show the utility of microSSIM on two tasks: unsupervised denoising and joint image splitting with unsupervised denoising. Since our formulation can be applied to a broad family of SSIM-based measures, we also introduce MicroMS3IM, a microscopy-specific variation of MS-SSIM. The source code and python package are available at this https URL.
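The saturation effect described in the abstract can be reproduced with a single-window SSIM on toy 1D signals: adding a large detector-style offset to both inputs drives the luminance term toward 1 and masks the dissimilarity. The constants follow the usual SSIM defaults for an 8-bit range; the signals and the offset value are invented.

```python
# Toy demonstration of SSIM "saturation": a single-window SSIM computed
# once on raw values and once after adding a large constant offset to both
# signals, mimicking a microscope detector offset.

def ssim(x, y, data_range=255.0):
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))

pred = [10.0, 12.0, 9.0, 11.0]
target = [30.0, 36.0, 27.0, 33.0]       # clearly dissimilar intensities
offset = 2000.0                          # detector-style offset on both signals

raw = ssim(pred, target)
offs = ssim([v + offset for v in pred], [v + offset for v in target])
print(f"SSIM raw={raw:.3f}, with offset={offs:.3f}")
# the offset pushes the luminance term toward 1, hiding the dissimilarity
```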

[CV-69] A lifted Bregman strategy for training unfolded proximal neural network Gaussian denoisers

链接: https://arxiv.org/abs/2408.08742
作者: Xiaoyu Wang,Martin Benning,Audrey Repetti
关键词-EN: proximal neural networks, proximal optimization approaches, Unfolded proximal neural, form a family, combines deep learning
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
*备注: 2024 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 22–25, 2024, London, UK

点击查看摘要

Abstract:Unfolded proximal neural networks (PNNs) form a family of methods that combines deep learning and proximal optimization approaches. They consist in designing a neural network for a specific task by unrolling a proximal algorithm for a fixed number of iterations, where linearities can be learned from prior training procedure. PNNs have shown to be more robust than traditional deep learning approaches while reaching at least as good performances, in particular in computational imaging. However, training PNNs still depends on the efficiency of available training algorithms. In this work, we propose a lifted training formulation based on Bregman distances for unfolded PNNs. Leveraging the deterministic mini-batch block-coordinate forward-backward method, we design a bespoke computational strategy beyond traditional back-propagation methods for solving the resulting learning problem efficiently. We assess the behaviour of the proposed training approach for PNNs through numerical simulations on image denoising, considering a denoising PNN whose structure is based on dual proximal-gradient iterations.

[CV-70] Modeling the Neonatal Brain Development Using Implicit Neural Representations MICCAI2024

链接: https://arxiv.org/abs/2408.08647
作者: Florentin Bieder,Paul Friedrich,Hélène Corbaz,Alicia Durrer,Julia Wolleb,Philippe C. Cattin
关键词-EN: brain undergoes rapid, undergoes rapid development, human brain undergoes, trimester of pregnancy, undergoes rapid
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint, Accepted for PRIME MICCAI 2024

点击查看摘要

Abstract:The human brain undergoes rapid development during the third trimester of pregnancy. In this work, we model the neonatal development of the infant brain in this age range. As a basis, we use MR images of preterm- and term-birth neonates from the developing human connectome project (dHCP). We propose a neural network, specifically an implicit neural representation (INR), to predict 2D- and 3D images of varying time points. In order to model a subject-specific development process, it is necessary to disentangle the age from the subjects’ identity in the latent space of the INR. We propose two methods, Subject Specific Latent Vectors (SSL) and Stochastic Global Latent Augmentation (SGLA), enabling this disentanglement. We perform an analysis of the results and compare our proposed model to an age-conditioned denoising diffusion model as a baseline. We also show that our method can be applied in a memory-efficient way, which is especially important for 3D data.

[CV-71] Reference-free Axial Super-resolution of 3D Microscopy Images using Implicit Neural Representation with a 2D Diffusion Prior MICCAI2024

链接: https://arxiv.org/abs/2408.08616
作者: Kyungryun Lee,Won-Ki Jeong
关键词-EN: pose challenges due, Analysis and visualization, images pose challenges, demanding volumetric super-resolution, pose challenges
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: MICCAI2024 accepted

点击查看摘要

Abstract:Analysis and visualization of 3D microscopy images pose challenges due to anisotropic axial resolution, demanding volumetric super-resolution along the axial direction. While training a learning-based 3D super-resolution model seems to be a straightforward solution, it requires ground truth isotropic volumes and suffers from the curse of dimensionality. Therefore, existing methods utilize 2D neural networks to reconstruct each axial slice, eventually piecing together the entire volume. However, reconstructing each slice in the pixel domain fails to give consistent reconstruction in all directions leading to misalignment artifacts. In this work, we present a reconstruction framework based on implicit neural representation (INR), which allows 3D coherency even when optimized by independent axial slices in a batch-wise manner. Our method optimizes a continuous volumetric representation from low-resolution axial slices, using a 2D diffusion prior trained on high-resolution lateral slices without requiring isotropic volumes. Through experiments on real and synthetic anisotropic microscopy images, we demonstrate that our method surpasses other state-of-the-art reconstruction methods. The source code is available on GitHub: this https URL.

[CV-72] DFT-Based Adversarial Attack Detection in MRI Brain Imaging: Enhancing Diagnostic Accuracy in Alzheimers Case Studies

链接: https://arxiv.org/abs/2408.08489
作者: Mohammad Hossein Najafi,Mohammad Morsali,Mohammadmahdi Vahediahmar,Saeed Bagheri Shouraki
关键词-EN: Recent advancements, adversarial attacks, healthcare systems, attacks, significantly propelled
类目: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures, conference

点击查看摘要

Abstract:Recent advancements in deep learning, particularly in medical imaging, have significantly propelled the progress of healthcare systems. However, examining the robustness of medical images against adversarial attacks is crucial due to their real-world applications and profound impact on individuals’ health. These attacks can result in misclassifications in disease diagnosis, potentially leading to severe consequences. Numerous studies have explored both the implementation of adversarial attacks on medical images and the development of defense mechanisms against these threats, highlighting the vulnerabilities of deep neural networks to such adversarial activities. In this study, we investigate adversarial attacks on images associated with Alzheimer’s disease and propose a defensive method to counteract these attacks. Specifically, we examine adversarial attacks that employ frequency domain transformations on Alzheimer’s disease images, along with other well-known adversarial attacks. Our approach utilizes a convolutional neural network (CNN)-based autoencoder architecture in conjunction with the two-dimensional Fourier transform of images for detection purposes. The simulation results demonstrate that our detection and defense mechanism effectively mitigates several adversarial attacks, thereby enhancing the robustness of deep neural networks against such vulnerabilities.

[CV-73] Efficient Data-Sketches and Fine-Tuning for Early Detection of Distributional Drift in Medical Imaging

Link: https://arxiv.org/abs/2408.08456
Authors: Yusen Wu, Hao Chen, Alex Pissinou Makki, Phuong Nguyen, Yelena Yesha
Keywords-EN: underlying data distribution, treatment decisions, Distributional drift, Distributional drift detection, detect distributional drift
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Distributional drift detection is important in medical applications as it helps ensure the accuracy and reliability of models by identifying changes in the underlying data distribution that could affect diagnostic or treatment decisions. However, current methods have limitations in detecting drift; for example, the inclusion of abnormal datasets can lead to unfair comparisons. This paper presents an accurate and sensitive approach to detect distributional drift in CT-scan medical images by leveraging data-sketching and fine-tuning techniques. We developed a robust baseline library model for real-time anomaly detection, allowing for efficient comparison of incoming images and identification of anomalies. Additionally, we fine-tuned a pre-trained vision transformer to extract relevant features, using breast cancer images as an example, significantly enhancing model accuracy to 99.11%. Combining data-sketching with fine-tuning, our feature-extraction evaluation demonstrated that cosine similarity scores between similar datasets improved substantially, increasing from around 50% to 100%. Finally, the sensitivity evaluation shows that our solution is highly sensitive to even 1% salt-and-pepper and speckle noise, but insensitive to lighting noise (e.g., lighting conditions have no impact on data drift). The proposed methods offer a scalable and reliable solution for maintaining the accuracy of diagnostic models in dynamic clinical environments.
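The core comparison the abstract describes — cosine similarity between extracted features of a baseline and an incoming batch — is easy to sketch. The feature extractor, batch shapes, and 0.95 threshold below are hypothetical stand-ins, not the paper's values:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_detected(baseline_feats: np.ndarray, incoming_feats: np.ndarray,
                   threshold: float = 0.95) -> bool:
    """Compare mean feature vectors of two batches; similarity below the
    threshold is flagged as potential distributional drift."""
    ref = baseline_feats.mean(axis=0)
    cur = incoming_feats.mean(axis=0)
    return cosine_similarity(ref, cur) < threshold

# Synthetic 32-dim "features": same-distribution batch vs. a shifted batch.
rng = np.random.default_rng(1)
baseline = rng.normal(loc=1.0, size=(100, 32))
same_dist = rng.normal(loc=1.0, size=(100, 32))
shifted = rng.normal(loc=-1.0, size=(100, 32))
assert not drift_detected(baseline, same_dist)
assert drift_detected(baseline, shifted)
```

In the paper the features come from a fine-tuned vision transformer and the baseline library is built with data-sketches; the thresholding logic itself is this simple.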

[CV-74] Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts

Link: https://arxiv.org/abs/2408.08432
Authors: Abdur R. Fayjie, Jutika Borah, Florencia Carbone, Jan Tack, Patrick Vandewalle
Keywords-EN: shown tremendous progress, predictive uncertainty, shown tremendous, tremendous progress, wide range
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures, 5 tables

Abstract:Deep learning has shown tremendous progress in a wide range of digital pathology and medical image classification tasks. Its integration into safe clinical decision-making support requires robust and reliable models. However, real-world data comes with diversities that often lie outside the intended source distribution. Moreover, when test samples are dramatically different, clinical decision-making is greatly affected. Quantifying predictive uncertainty in models is crucial for well-calibrated predictions and for determining when (or when not) to trust a model. Unfortunately, many works have overlooked the importance of predictive uncertainty estimation. This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems. We investigate the effect of various carcinoma distribution shift scenarios on predictive performance and calibration. We first systematically investigate three popular methods for improving predictive uncertainty: Monte Carlo dropout, deep ensemble, and few-shot learning on lung adenocarcinoma classification as a primary disease in whole slide images. Secondly, we compare the effectiveness of the methods in terms of performance and calibration under clinically relevant distribution shifts such as in-distribution shifts comprising primary disease sub-types and other characterization analysis data; out-of-distribution shifts comprising well-differentiated cases, different organ origin, and imaging modality shifts. While studies on uncertainty estimation exist, to the best of our knowledge, no rigorous large-scale benchmark compares predictive uncertainty estimation including these dataset shifts for lung carcinoma classification.
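Of the three uncertainty methods compared above, Monte Carlo dropout is the simplest to sketch: keep dropout active at inference and aggregate many stochastic forward passes into a predictive mean and spread. The one-layer "model", dropout rate, and sample count below are hypothetical, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical frozen weights of a tiny linear "model" (illustrative only).
W = rng.standard_normal((16, 3))

def mc_dropout_predict(x: np.ndarray, n_samples: int = 200, p_drop: float = 0.5):
    """Monte Carlo dropout: repeated stochastic passes give a predictive
    mean and a per-class standard deviation (the uncertainty estimate)."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(x.shape) >= p_drop
        h = (x * mask) / (1.0 - p_drop)  # inverted-dropout scaling
        preds.append(h @ W)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.standard_normal(16)
mean, std = mc_dropout_predict(x)
assert mean.shape == (3,) and std.shape == (3,)
assert np.all(std > 0)  # stochastic passes disagree -> nonzero uncertainty
```

A real pathology model would apply the same loop to a trained CNN or transformer with its dropout layers left in train mode.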

Machine Learning

[LG-0] PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars

Link: https://arxiv.org/abs/2408.08869
Authors: Sumanth Prabhu
Keywords-EN: Large Language Models, Language Models, Large Language, demonstrated remarkable gains, diverse reasoning paths
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Self-ensembling techniques with diverse reasoning paths, such as Self-Consistency, have demonstrated remarkable gains in accuracy for Large Language Models (LLMs). However, such techniques depend on the availability of an accurate answer extraction process to aggregate across multiple outputs. Moreover, they incur higher inference cost, in comparison to Greedy Decoding, due to the generation of a relatively higher number of output tokens. Research has shown that the free-form text outputs from Self-Consistency can be aggregated reliably using LLMs to produce the final output. Additionally, recent advancements in LLM inference have demonstrated that the use of diverse exemplars in prompts can induce diversity in the LLM outputs. Such proven techniques can be easily extended to self-ensembling approaches to achieve enhanced results in text generation. In this paper, we introduce PEDAL (Prompts based on Exemplar Diversity Aggregated using LLMs), a hybrid self-ensembling approach that combines the strengths of diverse exemplar-based prompts and LLM-based aggregation to achieve an improvement in overall performance. On the publicly available SVAMP and ARC datasets, our experiments reveal that PEDAL achieves better accuracy than Greedy Decoding-based strategies with lower inference cost than Self-Consistency-based approaches.

[LG-1] A Hassle-free Algorithm for Private Learning in Practice: Don't Use Tree Aggregation, Use BLTs

Link: https://arxiv.org/abs/2408.08868
Authors: H. Brendan McMahan, Zheng Xu, Yanxiang Zhang
Keywords-EN: combines federated learning, mobile keyboard applications, keyboard applications combines, applications combines federated, Buffered Linear Toeplitz
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The state-of-the-art for training on-device language models for mobile keyboard applications combines federated learning (FL) with differential privacy (DP) via the DP-Follow-the-Regularized-Leader (DP-FTRL) algorithm. Two variants of DP-FTRL are used in practice: tree aggregation and matrix factorization. However, tree aggregation suffers from significantly suboptimal privacy/utility tradeoffs, while matrix mechanisms require expensive optimization parameterized by hard-to-estimate-in-advance constants, as well as high runtime memory costs. This paper extends the recently introduced Buffered Linear Toeplitz (BLT) mechanism to multi-participation scenarios. Our BLT-DP-FTRL maintains the ease-of-use advantages of tree aggregation, while essentially matching matrix factorization in terms of utility and privacy. We evaluate BLT-DP-FTRL on the StackOverflow dataset, serving as a reproducible simulation benchmark, and across four on-device language model tasks in a production FL system. Our empirical results highlight the advantages of the BLT mechanism and elevate the practicality and effectiveness of DP in real-world scenarios.

[LG-2] Visual Agents as Fast and Slow Thinkers

Link: https://arxiv.org/abs/2408.08862
Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu
Keywords-EN: human-level intelligence requires, intelligence requires refining, requires refining cognitive, refining cognitive distinctions, requires refining
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address the challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes, tailoring the problem-solving approach to different task complexity. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate that FaST outperforms various well-known baselines, achieving 80.8% accuracy on VQAv2 for visual question answering and a 48.7% GIoU score on ReasonSeg for reasoning segmentation, demonstrating FaST’s superior performance. Extensive testing validates the efficacy and robustness of FaST’s core components, showcasing its potential to advance the development of cognitive visual agents in AI systems.
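The switch-adapter idea — route easy inputs through a cheap System-1 path and escalate uncertain ones to a deliberate System-2 path — can be sketched in a few lines. The stand-in models, confidence threshold, and routing rule below are hypothetical; FaST's actual adapter is a learned module:

```python
def fast_slow_answer(query, fast_model, slow_model, conf_threshold=0.8):
    """System-1/System-2 switch: trust the fast path when its confidence
    is high, otherwise fall back to the slower deliberate path."""
    answer, confidence = fast_model(query)
    if confidence >= conf_threshold:
        return answer, "system1"
    return slow_model(query), "system2"

# Hypothetical stand-in models, purely for illustration.
fast = lambda q: ("cat", 0.95) if "easy" in q else ("cat?", 0.4)
slow = lambda q: "dog"

assert fast_slow_answer("easy image", fast, slow) == ("cat", "system1")
assert fast_slow_answer("ambiguous image", fast, slow) == ("dog", "system2")
```

Returning which path was taken also gives the "transparent decision-making pipeline" the abstract advocates: every answer is tagged with the reasoning mode that produced it.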

[LG-3] Stochastic Bandits Robust to Adversarial Attacks

Link: https://arxiv.org/abs/2408.08859
Authors: Xuchuang Wang, Jinhang Zuo, Xutong Liu, John C.S. Lui, Mohammad Hajiesmaili
Keywords-EN: paper investigates stochastic, investigates stochastic multi-armed, stochastic multi-armed bandit, multi-armed bandit algorithms, paper investigates
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:This paper investigates stochastic multi-armed bandit algorithms that are robust to adversarial attacks, where an attacker can first observe the learner’s action and then alter their reward observation. We study two cases of this model, with or without the knowledge of an attack budget C, defined as an upper bound on the summation of the difference between the actual and altered rewards. For both cases, we devise two types of algorithms with regret bounds having additive or multiplicative C dependence terms. For the known attack budget case, we prove our algorithms achieve the regret bound of O((K/\Delta)\log T + KC) and \tilde{O}(\sqrt{KTC}) for the additive and multiplicative C terms, respectively, where K is the number of arms, T is the time horizon, \Delta is the gap between the expected rewards of the optimal arm and the second-best arm, and \tilde{O} hides the logarithmic factors. For the unknown case, we prove our algorithms achieve the regret bound of \tilde{O}(\sqrt{KT} + KC^2) and \tilde{O}(KC\sqrt{T}) for the additive and multiplicative C terms, respectively. In addition to these upper bound results, we provide several lower bounds showing the tightness of our bounds and the optimality of our algorithms. These results delineate an intrinsic separation between the bandits with attacks and corruption models [Lykouris et al., 2018].
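The paper's robust algorithms are only summarized above; as background intuition for the threat model, here is a plain (non-robust) UCB1 baseline run against a budget-limited attacker who zeroes out observed rewards of the best arm. The arm means, budget, and attack rule are illustrative, not the paper's construction:

```python
import numpy as np

def ucb1_with_corruption(means, T, attack_budget, rng):
    """UCB1 under the observe-then-alter attack model: the adversary may
    corrupt the learner's reward observation, spending budget equal to the
    total alteration (the quantity the paper's C upper-bounds)."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    spent = 0.0
    best = int(np.argmax(means))
    for t in range(T):
        if t < K:                      # pull each arm once to initialize
            arm = t
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < means[arm])   # Bernoulli reward
        if arm == best and spent < attack_budget:   # adversary corrupts
            spent += abs(reward - 0.0)
            reward = 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

rng = np.random.default_rng(7)
pulls = ucb1_with_corruption(np.array([0.9, 0.5]), T=2000, attack_budget=0.0, rng=rng)
assert pulls.sum() == 2000
assert pulls[0] > pulls[1]   # with no corruption, UCB favors the better arm

corrupted = ucb1_with_corruption(np.array([0.9, 0.5]), T=2000, attack_budget=50.0, rng=rng)
assert corrupted.sum() == 2000
```

Vanilla UCB1 has no C-dependent correction, which is exactly why the paper designs new algorithms whose regret degrades gracefully (additively or multiplicatively) in C.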

[LG-4] GeoTransformer: Enhancing Urban Forecasting with Geospatial Attention Mechanisms

Link: https://arxiv.org/abs/2408.08852
Authors: Yuhao Jia, Zile Wu, Shengao Yi, Yifei Sun
Keywords-EN: integrating sociodemographic data, Recent advancements, notable efforts dedicated, high-dimensional spaces, satellite imagery
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements have focused on encoding urban spatial information into high-dimensional spaces, with notable efforts dedicated to integrating sociodemographic data and satellite imagery. These efforts have established foundational models in this field. However, the effective utilization of these spatial representations for urban forecasting applications remains under-explored. To address this gap, we introduce GeoTransformer, a novel structure that synergizes the Transformer architecture with a geospatial statistics prior. GeoTransformer employs an innovative geospatial attention mechanism to incorporate extensive urban information and spatial dependencies into a unified predictive model. Specifically, we compute geospatially weighted attention scores between the target region and surrounding regions and leverage the integrated urban information for predictions. Extensive experiments on GDP and ride-share demand prediction tasks demonstrate that GeoTransformer significantly outperforms existing baseline models, showcasing its potential to enhance urban forecasting tasks.
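One simple way to realize "geospatially weighted attention" is to modulate content scores with a distance kernel, so nearby regions contribute more. The paper's exact weighting is not specified in this listing; the Gaussian penalty and toy shapes below are assumptions:

```python
import numpy as np

def geospatial_attention(query, keys, values, coords_q, coords_k, bandwidth=1.0):
    """Attention whose logits are content similarity minus a squared-distance
    penalty (a sketch of distance-aware, geospatially weighted attention)."""
    scores = keys @ query / np.sqrt(len(query))            # content term
    dists = np.linalg.norm(coords_k - coords_q, axis=1)    # spatial term
    scores = scores - (dists ** 2) / (2 * bandwidth ** 2)  # distance penalty
    weights = np.exp(scores - scores.max())                # stable softmax
    weights /= weights.sum()
    return weights @ values, weights

rng = np.random.default_rng(3)
keys = np.zeros((3, 4))                 # identical content ...
values = rng.standard_normal((3, 2))
query = rng.standard_normal(4)
coords_q = np.zeros(2)
coords_k = np.array([[0.1, 0.0], [2.0, 0.0], [5.0, 0.0]])
out, w = geospatial_attention(query, keys, values, coords_q, coords_k)
assert np.isclose(w.sum(), 1.0)
assert w[0] > w[1] > w[2]               # ... so closer regions get more weight
```

With identical keys the content term is constant, isolating the spatial prior: attention mass decays with distance from the target region.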

[LG-5] Entropy Coding of Unordered Data Structures ICLR2024

Link: https://arxiv.org/abs/2408.08837
Authors: Julius Kunze, Daniel Severo, Giulio Zani, Jan-Willem van de Meent, James Townsend
Keywords-EN: present shuffle coding, general method, method for optimal, sequences of unordered, unordered objects
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)
Comments: Published at ICLR 2024

Abstract:We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.
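The rate advantage of encoding a multiset instead of a sequence is the log of the number of distinct orderings, which bits-back coding recovers. This accounting (not the paper's codec itself) can be checked directly:

```python
import math

def shuffle_saving_bits(multiplicities):
    """Bits saved by discarding order: log2(n! / prod(m_i!)), where m_i are
    the multiplicities of the distinct items in the multiset."""
    n = sum(multiplicities)
    log_orderings = math.lgamma(n + 1) - sum(math.lgamma(m + 1) for m in multiplicities)
    return log_orderings / math.log(2)   # natural log -> bits

# A multiset of 4 distinct items has 4! = 24 orderings -> log2(24) bits saved.
assert abs(shuffle_saving_bits([1, 1, 1, 1]) - math.log2(24)) < 1e-9
# With all 4 items identical there is only one ordering -> nothing to save.
assert shuffle_saving_bits([4]) == 0.0
```

For graphs the same idea applies with the automorphism group in place of the multinomial coefficient, which is where most of the method's subtlety lies.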

[LG-6] LEVIS: Large Exact Verifiable Input Spaces for Neural Networks

Link: https://arxiv.org/abs/2408.08824
Authors: Mohamad Fares El Hajj Chehade, Brian Wesley Bell, Russell Bent, Hao Zhu, Wenting Li
Keywords-EN: LEVIS, neural networks, networks is paramount, paramount in safety-critical
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The robustness of neural networks is paramount in safety-critical applications. While most current robustness verification methods assess the worst-case output under the assumption that the input space is known, identifying a verifiable input space \mathcal{C}, where no adversarial examples exist, is crucial for effective model selection, robustness evaluation, and the development of reliable control strategies. To address this challenge, we introduce a novel framework, LEVIS, comprising LEVIS-\alpha and LEVIS-\beta. LEVIS-\alpha locates the largest possible verifiable ball within the central region of \mathcal{C} that intersects at least two boundaries. In contrast, LEVIS-\beta integrates multiple verifiable balls to encapsulate the entirety of the verifiable space comprehensively. Our contributions are threefold: (1) We propose LEVIS equipped with three pioneering techniques that identify the maximum verifiable ball and the nearest adversarial point along collinear or orthogonal directions. (2) We offer a theoretical analysis elucidating the properties of the verifiable balls acquired through LEVIS-\alpha and LEVIS-\beta. (3) We validate our methodology across diverse applications, including electrical power flow regression and image classification, showcasing performance enhancements and visualizations of the searching characteristics.

[LG-7] Optimal Symmetries in Binary Classification

Link: https://arxiv.org/abs/2408.08823
Authors: Vishal S. Ngairangbam, Michael Spannowsky
Keywords-EN: binary classification tasks, Neyman-Pearson optimality, explore the role, framework that leverages, leverages the principles
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: 13 pages, 1 figure, 2 tables

Abstract:We explore the role of group symmetries in binary classification tasks, presenting a novel framework that leverages the principles of Neyman-Pearson optimality. Contrary to the common intuition that larger symmetry groups lead to improved classification performance, our findings show that selecting the appropriate group symmetries is crucial for optimising generalisation and sample efficiency. We develop a theoretical foundation for designing group equivariant neural networks that align the choice of symmetries with the underlying probability distributions of the data. Our approach provides a unified methodology for improving classification accuracy across a broad range of applications by carefully tailoring the symmetry group to the specific characteristics of the problem. Theoretical analysis and experimental results demonstrate that optimal classification performance is not always associated with the largest equivariant groups possible in the domain, even when the likelihood ratio is invariant under one of its proper subgroups, but rather with those subgroups themselves. This work offers insights and practical guidelines for constructing more effective group equivariant architectures in diverse machine-learning contexts.

[LG-8] An Empirical Examination of Balancing Strategy for Counterfactual Estimation on Time Series ICML2024

Link: https://arxiv.org/abs/2408.08815
Authors: Qiang Huang, Chuizheng Meng, Defu Cao, Biwei Huang, Yi Chang, Yan Liu
Keywords-EN: numerous application fields, balancing strategies, application fields, healthcare and finance, observations represents
Subjects: Machine Learning (cs.LG)
Comments: ICML 2024 Camera Ready Version. 20 pages, 12 figures, 10 tables

Abstract:Counterfactual estimation from observations represents a critical endeavor in numerous application fields, such as healthcare and finance, with the primary challenge being the mitigation of treatment bias. The balancing strategy aimed at reducing covariate disparities between different treatment groups serves as a universal solution. However, when it comes to time series data, the effectiveness of balancing strategies remains an open question, with a thorough analysis of the robustness and applicability of balancing strategies still lacking. This paper revisits counterfactual estimation in the temporal setting and provides a brief overview of recent advancements in balancing strategies. More importantly, we conduct a critical empirical examination of the effectiveness of balancing strategies within the realm of temporal counterfactual estimation, in various settings on multiple datasets. Our findings could be of significant interest to researchers and practitioners and call for a reexamination of the balancing strategy in time series settings.

[LG-9] CAT: Caution Aware Transfer in Reinforcement Learning via Distributional Risk

Link: https://arxiv.org/abs/2408.08812
Authors: Mohamad Fares El Hajj Chehade, Amrit Singh Bedi, Amy Zhang, Hao Zhu
Keywords-EN: improving data efficiency, previously learned tasks, pivotal strategy, strategy for improving, improving data
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Transfer learning in reinforcement learning (RL) has become a pivotal strategy for improving data efficiency in new, unseen tasks by utilizing knowledge from previously learned tasks. This approach is especially beneficial in real-world deployment scenarios where computational resources are constrained and agents must adapt rapidly to novel environments. However, current state-of-the-art methods often fall short in ensuring safety during the transfer process, particularly when unforeseen risks emerge in the deployment phase. In this work, we address these limitations by introducing a novel Caution-Aware Transfer Learning (CAT) framework. Unlike traditional approaches that limit risk considerations to mean-variance, we define “caution” as a more generalized and comprehensive notion of risk. Our core innovation lies in optimizing a weighted sum of reward return and caution (based on state-action occupancy measures) during the transfer process, allowing for a rich representation of diverse risk factors. To the best of our knowledge, this is the first work to explore the optimization of such a generalized risk notion within the context of transfer RL. Our contributions are threefold: (1) We propose a Caution-Aware Transfer (CAT) framework that evaluates source policies within the test environment and constructs a new policy that balances reward maximization and caution. (2) We derive theoretical sub-optimality bounds for our method, providing rigorous guarantees of its efficacy. (3) We empirically validate CAT, demonstrating that it consistently outperforms existing methods by delivering safer policies under varying risk conditions in the test tasks.

[LG-10] Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Link: https://arxiv.org/abs/2408.08808
Authors: Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakkar
Keywords-EN: Large Language Models, Large Language, real-world applications, revolutionized the landscape, landscape of machine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures

Abstract:Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark’s usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC and Arena-Hard v0.1 are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, 84% agreement with Chatbot Arena, and a Spearman correlation of 0.915. The agreement values are 9% better than Arena-Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 higher than that of the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.

[LG-11] Representation Learning of Geometric Trees

Link: https://arxiv.org/abs/2408.08799
Authors: Zheng Zhang, Allen Zhang, Ruth Nelson, Giorgio Ascoli, Liang Zhao
Keywords-EN: spatially constrained nodes, nodes and edges, topological attributes, tree-structured layout, layout and spatially
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Geometric trees are characterized by their tree-structured layout and spatially constrained nodes and edges, which significantly impacts their topological attributes. This inherent hierarchical structure plays a crucial role in domains such as neuron morphology and river geomorphology, but traditional graph representation methods often overlook these specific characteristics of tree structures. To address this, we introduce a new representation learning framework tailored for geometric trees. It first features a unique message passing neural network, which is both provably geometrical structure-recoverable and rotation-translation invariant. To address the data label scarcity issue, our approach also includes two innovative training targets that reflect the hierarchical ordering and geometric structure of these geometric trees. This enables fully self-supervised learning without explicit labels. We validate our method’s effectiveness on eight real-world datasets, demonstrating its capability to represent geometric trees.

[LG-12] Neighbor Overlay-Induced Graph Attention Network

Link: https://arxiv.org/abs/2408.08788
Authors: Tiqiao Wei, Ye Yuan
Keywords-EN: garnered significant attention, significant attention due, represent graph data, graph attention network, Graph neural networks
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph neural networks (GNNs) have garnered significant attention due to their ability to represent graph data. Among various GNN variants, graph attention network (GAT) stands out since it is able to dynamically learn the importance of different nodes. However, present GATs heavily rely on the smoothed node features to obtain the attention coefficients rather than graph structural information, which fails to provide crucial contextual cues for node representations. To address this issue, this study proposes a neighbor overlay-induced graph attention network (NO-GAT) with the following two-fold ideas: a) learning favorable structural information, i.e., overlaid neighbors, outside the node feature propagation process from an adjacency matrix; b) injecting the information of overlaid neighbors into the node feature propagation process to compute the attention coefficient jointly. Empirical studies on graph benchmark datasets indicate that the proposed NO-GAT consistently outperforms state-of-the-art models.
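The structural signal NO-GAT injects — "overlaid neighbors" read off the adjacency matrix rather than smoothed features — has a simple concrete instance: the entry (i, j) of A² counts the common neighbors of nodes i and j. How NO-GAT combines this with the learned attention is not detailed in this listing; the snippet only shows the structural computation:

```python
import numpy as np

# Adjacency matrix of a 4-node undirected graph with edges 0-1, 0-2, 1-2, 2-3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# (A @ A)[i, j] counts length-2 walks from i to j, i.e. the number of
# neighbors that i and j share -- structural context invisible to attention
# coefficients computed from node features alone.
overlap = A @ A

assert overlap[0, 1] == 1   # nodes 0 and 1 share neighbor 2
assert overlap[0, 3] == 1   # nodes 0 and 3 share neighbor 2
assert overlap[1, 3] == 1   # nodes 1 and 3 share neighbor 2
```

Feeding such overlap counts into the attention-coefficient computation is one plausible reading of "injecting the information of overlaid neighbors into the node feature propagation process".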

[LG-13] A Transparency Paradox? Investigating the Impact of Explanation Specificity and Autonomous Vehicle Perceptual Inaccuracies on Passengers

Link: https://arxiv.org/abs/2408.08785
Authors: Daniel Omeiza, Raunak Bhattacharyya, Marina Jirotka, Nick Hawes, Lars Kunze
Keywords-EN: provision of intelligible, Transparency, explanations, perception system, perception
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to Transportation Research Part F: Traffic Psychology and Behaviour. arXiv admin note: text overlap with arXiv:2307.00633

Abstract:Transparency in automated systems could be afforded through the provision of intelligible explanations. While transparency is desirable, might it lead to catastrophic outcomes (such as anxiety), that could outweigh its benefits? It’s quite unclear how the specificity of explanations (level of transparency) influences recipients, especially in autonomous driving (AD). In this work, we examined the effects of transparency mediated through varying levels of explanation specificity in AD. We first extended a data-driven explainer model by adding a rule-based option for explanation generation in AD, and then conducted a within-subject lab study with 39 participants in an immersive driving simulator to study the effect of the resulting explanations. Specifically, our investigation focused on: (1) how different types of explanations (specific vs. abstract) affect passengers’ perceived safety, anxiety, and willingness to take control of the vehicle when the vehicle perception system makes erroneous predictions; and (2) the relationship between passengers’ behavioural cues and their feelings during the autonomous drives. Our findings showed that passengers felt safer with specific explanations when the vehicle’s perception system had minimal errors, while abstract explanations that hid perception errors led to lower feelings of safety. Anxiety levels increased when specific explanations revealed perception system errors (high transparency). We found no significant link between passengers’ visual patterns and their anxiety levels. Our study suggests that passengers prefer clear and specific explanations (high transparency) when they originate from autonomous vehicles (AVs) with optimal perceptual accuracy.

[LG-14] NEAR: A Training-Free Pre-Estimator of Machine Learning Model Performance

Link: https://arxiv.org/abs/2408.08776
Authors: Raphael T. Husistein, Markus Reiher, Marco Eckhoff
Keywords-EN: including natural language, natural language processing, Artificial neural networks, machine learning models, Artificial neural
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Chemical Physics (physics.chem-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 12 pages, 4 figures, 10 tables

Abstract:Artificial neural networks have been shown to be state-of-the-art machine learning models in a wide variety of applications, including natural language processing and image recognition. However, building a performant neural network is a laborious task and requires substantial computing power. Neural Architecture Search (NAS) addresses this issue by an automatic selection of the optimal network from a set of potential candidates. While many NAS methods still require training of (some) neural networks, zero-cost proxies promise to identify the optimal network without training. In this work, we propose the zero-cost proxy Network Expressivity by Activation Rank (NEAR). It is based on the effective rank of the pre- and post-activation matrix, i.e., the values of a neural network layer before and after applying its activation function. We demonstrate the cutting-edge correlation between this network score and the model accuracy on NAS-Bench-101 and NATS-Bench-SSS/TSS. In addition, we present a simple approach to estimate the optimal layer sizes in multi-layer perceptrons. Furthermore, we show that this score can be utilized to select hyperparameters such as the activation function and the neural network weight initialization scheme.
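NEAR is built on the effective rank of pre- and post-activation matrices. The standard effective-rank definition (exp of the Shannon entropy of the normalized singular values, after Roy & Vetterli) is easy to compute; how NEAR aggregates it across layers is not given in this listing:

```python
import numpy as np

def effective_rank(M: np.ndarray) -> float:
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution of M."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # drop exact zeros before the log
    return float(np.exp(-(p * np.log(p)).sum()))

# An identity matrix uses all directions equally -> effective rank = dim.
assert np.isclose(effective_rank(np.eye(5)), 5.0)
# A rank-1 matrix concentrates on one direction -> effective rank ~ 1.
assert np.isclose(effective_rank(np.outer(np.ones(5), np.ones(5))), 1.0)
```

Intuitively, a layer whose activations have a higher effective rank spans more independent directions, which is the expressivity signal the zero-cost proxy scores without any training.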

[LG-15] Speckle Noise Analysis for Synthetic Aperture Radar (SAR) Space Data

Link: https://arxiv.org/abs/2408.08774
Authors: Sanjjushri Varshini R, Rohith Mahadevan, Bagiya Lakshmi S, Mathivanan Periasamy, Raja CSP Raman, Lokesh M
Keywords-EN: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, Filtering, Frost Filtering
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:This research tackles the challenge of speckle noise in Synthetic Aperture Radar (SAR) space data, a prevalent issue that hampers the clarity and utility of SAR images. The study presents a comparative analysis of six distinct speckle noise reduction techniques: Lee Filtering, Frost Filtering, Kuan Filtering, Gaussian Filtering, Median Filtering, and Bilateral Filtering. These methods, selected for their unique approaches to noise reduction and image preservation, were applied to SAR datasets sourced from the Alaska Satellite Facility (ASF). The performance of each technique was evaluated using a comprehensive set of metrics, including Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), Structural Similarity Index (SSIM), Equivalent Number of Looks (ENL), and Speckle Suppression Index (SSI). The study concludes that both the Lee and Kuan Filters are effective, with the choice of filter depending on the specific application requirements for image quality and noise suppression. This work provides valuable insights into optimizing SAR image processing, with significant implications for remote sensing, environmental monitoring, and geological surveying.
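Of the six filters compared, the Lee filter has the most compact formulation: blend each pixel with its local mean, trusting the mean more in flat (low-variance) regions. The window size, noise-variance parameter, and synthetic test image below are assumptions for illustration, and PSNR is one of the paper's metrics:

```python
import numpy as np

def lee_filter(img: np.ndarray, win: int = 3, noise_var: float = 0.05) -> np.ndarray:
    """Classic Lee speckle filter: out = mu + k * (x - mu), with adaptive
    gain k = var / (var + noise_var) from local window statistics."""
    pad = win // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + win, j:j + win]
            mu, var = patch.mean(), patch.var()
            k = var / (var + noise_var)          # ~0 in flat areas, ~1 on edges
            out[i, j] = mu + k * (img[i, j] - mu)
    return out

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    mse = np.mean((ref - test) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

# Multiplicative speckle on a smooth gradient; the filter should raise PSNR.
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 32), (32, 1))
speckled = np.clip(clean * (1 + 0.3 * rng.standard_normal(clean.shape)), 0, 1)
assert psnr(clean, lee_filter(speckled)) > psnr(clean, speckled)
```

Production implementations replace the Python loops with vectorized box filters, but the adaptive-gain logic is the same one evaluated in the study.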

[LG-16] Pessimistic Iterative Planning for Robust POMDPs

Link: https://arxiv.org/abs/2408.08770
Authors: Maris F. L. Galesloot, Marnix Suilen, Thiago D. Simão, Steven Carr, Matthijs T. J. Spaan, Ufuk Topcu, Nils Jansen
Keywords-EN: Markov decision processes, partially observable Markov, observable Markov decision, extend classical POMDPs, observable Markov
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs to handle additional uncertainty on the transition and observation probabilities via so-called uncertainty sets. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network trained using supervision policies optimized for the adversarial POMDP. The empirical evaluation in four benchmark environments showcases improved robustness against a baseline method in an ablation study and competitive performance compared to a state-of-the-art robust POMDP solver.

[LG-17] SYMPOL: Symbolic Tree-Based On-Policy Reinforcement Learning

链接: https://arxiv.org/abs/2408.08761
作者: Sascha Marton,Tim Grams,Florian Vogt,Stefan Lüdtke,Christian Bartelt,Heiner Stuckenschmidt
关键词-EN: neural network policies, Reinforcement learning, making them difficult, difficult to interpret, significant success
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has seen significant success across various domains, but its adoption is often limited by the black-box nature of neural network policies, making them difficult to interpret. In contrast, symbolic policies allow representing decision-making strategies in a compact and interpretable way. However, learning symbolic policies directly within on-policy methods remains challenging. In this paper, we introduce SYMPOL, a novel method for SYMbolic tree-based on-POLicy RL. SYMPOL employs a tree-based model integrated with a policy gradient method, enabling the agent to learn and adapt its actions while maintaining a high level of interpretability. We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. To the best of our knowledge, this is the first method that allows gradient-based end-to-end learning of interpretable, axis-aligned decision trees on-policy. Therefore, SYMPOL can become the foundation for a new class of interpretable RL based on decision trees. Our implementation is available under: this https URL

[LG-18] SE-SGformer: A Self-Explainable Signed Graph Transformer for Link Sign Prediction

链接: https://arxiv.org/abs/2408.08754
作者: Lu Li,Jiale Liu,Xingyu Ji,Maojun Wang,Zeyu Zhang
关键词-EN: analyzing complex patterns, Graph Neural Networks, Signed Graph Neural, Signed Graph, SGNN models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Signed Graph Neural Networks (SGNNs) have been shown to be effective in analyzing complex patterns in real-world situations where positive and negative links coexist. However, SGNN models suffer from poor explainability, which limits their adoption in critical scenarios that require understanding the rationale behind predictions. To the best of our knowledge, there is currently no research work on the explainability of SGNN models. Our goal is to address the explainability of decision-making for the downstream task of link sign prediction specific to signed graph neural networks. Since post-hoc explanations are not derived directly from the models, they may be biased and misrepresent the true explanations. Therefore, in this paper we introduce a Self-Explainable Signed Graph transformer (SE-SGformer) framework, which outputs explainable information while ensuring high prediction accuracy. Specifically, we propose a new Transformer architecture for signed graphs and theoretically demonstrate that using positional encoding based on signed random walks has greater expressive power than current SGNN methods and other positional encoding graph Transformer-based approaches. We construct a novel explainable decision process by discovering the K-nearest (farthest) positive (negative) neighbors of a node to replace the neural network-based decoder for predicting edge signs. These K positive (negative) neighbors represent crucial information about the formation of positive (negative) edges between nodes and thus can serve as important explanatory information in the decision-making process. We conducted experiments on several real-world datasets to validate the effectiveness of SE-SGformer, which outperforms the state-of-the-art methods by improving 2.2% prediction accuracy and 73.1% explainability accuracy in the best-case scenario.

[LG-19] ML Study of Malicious Transactions in Ethereum

链接: https://arxiv.org/abs/2408.08749
作者: Natan Katz
关键词-EN: tool in Ethereum, Ethereum transactions, Smart contracts, major tool, detecting malicious transactions
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Smart contracts are a major tool in Ethereum transactions. Therefore, hackers can exploit them by introducing vulnerabilities into their source code and using these vulnerabilities to perform malicious transactions. This paper presents two successful approaches for detecting malicious contracts: one uses opcodes and relies on GPT2, and the other uses the Solidity source and a LoRA fine-tuned CodeLlama. Finally, we present an XGBoost model that combines gas properties and hexadecimal signatures for detecting malicious transactions. This approach relies on the assumption that maliciousness is manifested in the uncommon usage of the contracts’ functions and the effort to pursue the transaction.

[LG-20] Beyond KAN: Introducing KarSein for Adaptive High-Order Feature Interaction Modeling in CTR Prediction

链接: https://arxiv.org/abs/2408.08713
作者: Yunxiao Shi,Wujiang Wu,Mingyu Jin,Haimin Zhang,Qiang Wu,Yongfeng Zhang,Min Xu
关键词-EN: click-through rate, crucial for click-through, high-order explicit interactions, modeling high-order, interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: KarSein for CTR

点击查看摘要

Abstract:Modeling feature interactions is crucial for click-through rate (CTR) prediction, particularly when it comes to high-order explicit interactions. Traditional methods struggle with this task because they often predefine a maximum interaction order, which relies heavily on prior knowledge and can limit the model’s effectiveness. Additionally, modeling high-order interactions typically leads to increased computational costs. Therefore, the challenge lies in adaptively modeling high-order feature interactions while maintaining efficiency. To address this issue, we introduce the Kolmogorov-Arnold Represented Sparse Efficient Interaction Network (KarSein), designed to optimize both predictive accuracy and computational efficiency. We first identify limitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and then introduce KarSein to overcome these issues. It features a novel architecture that reduces the computational costs of KAN and supports embedding vectors as feature inputs. Additionally, KarSein employs guided symbolic regression to address the challenge of KAN in spontaneously learning multiplicative relationships. Extensive experiments demonstrate KarSein’s superior performance, achieving significant predictive accuracy with minimal computational overhead. Furthermore, KarSein maintains strong global explainability while enabling the removal of redundant features, resulting in a sparse network structure. These advantages also position KarSein as a promising method for efficient inference.

[LG-21] Beam Prediction based on Large Language Models

链接: https://arxiv.org/abs/2408.08707
作者: Yucheng Sheng,Kai Huang,Le Liang,Peng Liu,Shi Jin,Geoffrey Ye Li
关键词-EN: significant path loss, requiring extensive antenna, extensive antenna arrays, frequent beam training, next-generation wireless networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) communication is promising for next-generation wireless networks but suffers from significant path loss, requiring extensive antenna arrays and frequent beam training. Traditional deep learning models, such as long short-term memory (LSTM), enhance beam tracking accuracy but are limited by poor robustness and generalization. In this letter, we use large language models (LLMs) to improve the robustness of beam prediction. By converting time series data into text-based representations and employing the Prompt-as-Prefix (PaP) technique for contextual enrichment, our approach unleashes the strength of LLMs for time series forecasting. Simulation results demonstrate that our LLM-based method offers superior robustness and generalization compared to LSTM-based models, showcasing the potential of LLMs in wireless communications.
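The key preprocessing step, rendering numeric beam measurements as text before handing them to a frozen LLM, can be sketched as below. The prompt wording, field names, and context string are illustrative guesses, not the paper's actual Prompt-as-Prefix template:

```python
def pap_prompt(history, context="Urban street; receiver moving at 10 m/s."):
    # Prompt-as-Prefix style formatting: domain context comes first as a
    # prefix, then the serialized beam-index time series, so a frozen LLM
    # can be used as the forecaster.
    series = ", ".join(str(b) for b in history)
    return (f"[Context] {context}\n"
            f"[Task] The past optimal beam indices were {series}. "
            f"Predict the next optimal beam index.")

prompt = pap_prompt([12, 12, 13, 15, 15])
print(prompt)
```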

[LG-22] Efficient Multi-Policy Evaluation for Reinforcement Learning

链接: https://arxiv.org/abs/2408.08706
作者: Shuze Liu,Yuxin Chen,Shangtong Zhang
关键词-EN: unbiasedly evaluate multiple, evaluate multiple target, dominant approach, target policy separately, multiple target policies
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with many times fewer samples outperforms on-policy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.
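The core idea, one behavior policy whose samples are importance-weighted for every target policy at once, can be sketched on a two-armed bandit. The policies and rewards below are invented toy data, and the uniform behavior policy stands in for the paper's variance-optimized one:

```python
import random
random.seed(0)

true_reward = {0: 1.0, 1: 0.0}                   # deterministic toy rewards
targets = [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]   # two target policies
behavior = {0: 0.5, 1: 0.5}                      # one shared behavior policy

n, estimates = 20000, [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < behavior[0] else 1  # sample behavior only
    r = true_reward[a]
    for i, pi in enumerate(targets):
        # Every sample updates every estimator via its importance ratio,
        # so no per-target rollouts are needed.
        estimates[i] += pi[a] / behavior[a] * r / n

print(estimates)  # each is close to its policy's true value: 0.9 and 0.2
```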

[LG-23] RBLA: Rank-Based-LoRA-Aggregation for Fine-tuning Heterogeneous Models in FLaaS

链接: https://arxiv.org/abs/2408.08699
作者: Shuaijun Chen,Omid Tavallaie,Niousha Nazemi,Albert Y. Zomaya
关键词-EN: privacy-aware distributed learning, distributed learning framework, Federated Learning, server-based Federated Learning, promising privacy-aware distributed
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a promising privacy-aware distributed learning framework that can be deployed on various devices, such as mobile phones, desktops, and devices equipped with CPUs or GPUs. In the context of server-based Federated Learning as a Service (FLaaS), FL enables the central server to coordinate the training process across multiple devices without direct access to the local data, thereby enhancing privacy and data security. Low-Rank Adaptation (LoRA) is a method that fine-tunes models efficiently by focusing on a low-dimensional subspace of the model’s parameters. This approach significantly reduces computational and memory costs compared to fine-tuning all parameters from scratch. When integrated with FL, especially in an FLaaS environment, LoRA allows for flexible and efficient deployment across diverse hardware with varying computational capabilities by adjusting the local model’s rank. However, in LoRA-enabled FL, different clients may train models with varying ranks, which poses a challenge for model aggregation on the server. Current methods of aggregating models of different ranks require padding weights to a uniform shape, which can degrade the global model’s performance. To address this issue, we propose Rank-Based LoRA Aggregation (RBLA), a novel model aggregation method designed for heterogeneous LoRA structures. RBLA preserves key features across models with different ranks. This paper analyzes the issues with current padding methods that reshape models for aggregation in an FLaaS environment. Then, we introduce RBLA, a rank-based aggregation method that maintains both low-rank and high-rank features. Finally, we demonstrate the effectiveness of RBLA through comparative experiments with state-of-the-art methods.
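The padding problem the paper targets can be reproduced in a few lines. The dimensions and ranks below are arbitrary, and this shows only the naive zero-padding baseline that RBLA is designed to improve on, not RBLA itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, ranks = 6, 4, [2, 3]   # layer dims; each client trained a different rank

# Client i holds LoRA factors B_i (d x r_i) and A_i (r_i x k);
# its weight update is the low-rank product B_i @ A_i.
clients = [(rng.normal(size=(d, r)), rng.normal(size=(r, k))) for r in ranks]

def pad_and_average(clients, r_max):
    # Naive aggregation: zero-pad every factor to the maximum rank,
    # then average the padded factors elementwise.
    Bs = [np.pad(B, ((0, 0), (0, r_max - B.shape[1]))) for B, _ in clients]
    As = [np.pad(A, ((0, r_max - A.shape[0]), (0, 0))) for _, A in clients]
    return sum(Bs) / len(Bs), sum(As) / len(As)

B_avg, A_avg = pad_and_average(clients, max(ranks))
print(B_avg.shape, A_avg.shape)   # (6, 3) (3, 4)

# Averaging factors is lossy: mean(B) @ mean(A) != mean(B @ A) in general,
# which is one reason rank-aware schemes are needed.
mean_update = sum(B @ A for B, A in clients) / len(clients)
print(np.allclose(B_avg @ A_avg, mean_update))  # False
```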

[LG-24] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

链接: https://arxiv.org/abs/2408.08696
作者: Xianzhen Luo,Yixuan Wang,Qingfu Zhu,Zhiming Zhang,Xuanyu Zhang,Qing Yang,Dongliang Xu,Wanxiang Che
关键词-EN: limiting broader application, made inference latency, fundamental bottleneck, limiting broader, rapid growth
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: under review

点击查看摘要

Abstract:The rapid growth in the parameters of large language models (LLMs) has made inference latency a fundamental bottleneck, limiting broader application of LLMs. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm, leveraging the parallel capabilities of modern hardware. Some speculative decoding methods rely on additional structures to guess draft tokens, such as small models or parameter-efficient architectures, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first search (BFS)-like algorithm on the matrix to construct a draft tree. The tree is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires less than 2 MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%. It can be directly applied to any existing LLMs and tasks without the need for adaptation.
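The retrieval structure described above (an adjacency store of candidate tokens plus a BFS pass to assemble a draft tree) can be sketched with a plain dictionary. The vocabulary is toy data; the actual method keeps a fixed matrix of top-k candidates per token and verifies the tree with tree attention:

```python
from collections import deque

# Hypothetical adjacency store: token -> candidate next tokens observed
# during earlier decoding steps.
adj = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"], "sat": ["down"]}

def build_draft_tree(root, adj, max_depth=2):
    # BFS from the last accepted token, collecting (parent, child) draft
    # edges up to a fixed depth; the resulting tree is what gets verified.
    tree, frontier = [], deque([(root, 0)])
    while frontier:
        tok, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in adj.get(tok, []):
            tree.append((tok, nxt))
            frontier.append((nxt, depth + 1))
    return tree

draft = build_draft_tree("the", adj)
print(draft)  # [('the', 'cat'), ('the', 'dog'), ('cat', 'sat'), ('dog', 'ran')]
```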

[LG-25] Explore-then-Commit Algorithms for Decentralized Two-Sided Matching Markets

链接: https://arxiv.org/abs/2408.08690
作者: Tejas Pagare,Avishek Ghosh
关键词-EN: received substantial interest, Online learning, received substantial, substantial interest
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); General Economics (econ.GN); Machine Learning (stat.ML)
*备注: Accepted at International Symposium of Information Theory (ISIT) 2024

点击查看摘要

Abstract:Online learning in decentralized two-sided matching markets, where the demand side (players) competes to match with the supply side (arms), has received substantial interest because it abstracts out the complex interactions in matching platforms (e.g. UpWork, TaskRabbit). However, past works assume that each arm knows its preference ranking over the players (one-sided learning), and each player aims to learn the preference over arms through successive interactions. Moreover, several (impractical) assumptions on the problem are usually made for theoretical tractability, such as broadcast player-arm match Liu et al. (2020; 2021); Kong & Li (2023) or serial dictatorship Sankararaman et al. (2021); Basu et al. (2021); Ghosh et al. (2022). In this paper, we study a decentralized two-sided matching market where we do not assume that the preference ranking over players is known to the arms a priori. Furthermore, we do not make any structural assumptions on the problem. We propose a multi-phase explore-then-commit type algorithm, namely epoch-based CA-ETC (collision avoidance explore-then-commit), for this problem that does not require any communication across agents (players and arms) and is hence decentralized. We show that for an initial epoch length of $T_\circ$ and subsequent epoch lengths of $2^{l/\gamma} T_\circ$ (for the $l$-th epoch, with $\gamma \in (0,1)$ an input parameter to the algorithm), CA-ETC yields a player-optimal expected regret of $\mathcal{O}\left(T_\circ \left(\frac{K \log T}{T_\circ \Delta^2}\right)^{1/\gamma} + T_\circ \left(\frac{T}{T_\circ}\right)^{\gamma}\right)$ for the $i$-th player, where $T$ is the learning horizon, $K$ is the number of arms, and $\Delta$ is an appropriately defined problem gap. Furthermore, we propose a blackboard-communication-based baseline achieving logarithmic regret in $T$.
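Under one plausible reading of the epoch schedule (initial length T_∘, the l-th subsequent epoch of length 2^{l/γ}·T_∘, truncated at the horizon T), the phase lengths can be computed as follows; the concrete numbers are illustrative:

```python
def epoch_lengths(T0, gamma, horizon):
    # Epoch 0 has length T0; epoch l >= 1 has length 2**(l / gamma) * T0,
    # with the final epoch truncated so the total equals the horizon T.
    lengths, l, used = [T0], 1, T0
    while used < horizon:
        length = min(int(2 ** (l / gamma) * T0), horizon - used)
        lengths.append(length)
        used += length
        l += 1
    return lengths

schedule = epoch_lengths(T0=10, gamma=0.5, horizon=1000)
print(schedule)  # [10, 40, 160, 640, 150]
```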

[LG-26] Can Large Language Models Improve the Adversarial Robustness of Graph Neural Networks?

链接: https://arxiv.org/abs/2408.08685
作者: Zhongjian Zhang,Xiao Wang,Huichi Zhou,Yue Yu,Mengmei Zhang,Cheng Yang,Chuan Shi
关键词-EN: received considerable attention, Graph neural networks, GNNs, neural networks, considerable attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are vulnerable to adversarial perturbations, especially topology attacks, and many methods that improve the robustness of GNNs have received considerable attention. Recently, we have witnessed the significant success of large language models (LLMs), leading many to explore their great potential on GNNs. However, existing works mainly focus on improving the performance of GNNs by utilizing LLMs to enhance node features. Therefore, we ask: will the robustness of GNNs also be enhanced with the powerful understanding and inference capabilities of LLMs? Our empirical results show that although LLMs can improve the robustness of GNNs, there is still an average decrease of 23.1% in accuracy, implying that GNNs remain extremely vulnerable to topology attacks. Therefore, another question is how to extend the capabilities of LLMs on graph adversarial robustness. In this paper, we propose an LLM-based robust graph structure inference framework, LLM4RGNN, which distills the inference capabilities of GPT-4 into a local LLM for identifying malicious edges and an LM-based edge predictor for finding missing important edges, so as to recover a robust graph structure. Extensive experiments demonstrate that LLM4RGNN consistently improves the robustness across various GNNs. Even in some cases where the perturbation ratio increases to 40%, the accuracy of GNNs is still better than that on the clean graph.

[LG-27] Research on Personalized Compression Algorithm for Pre-trained Models Based on Homomorphic Entropy Increase

链接: https://arxiv.org/abs/2408.08684
作者: Yicong Li,Xing Guo,Haohua Du
关键词-EN: Vision Transformer model, Large Language Model, Vision Transformer captures, Vision Transformer, Large Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article, we explore the challenges and evolution of two key technologies in the current field of AI: the Vision Transformer model and the Large Language Model (LLM). The Vision Transformer captures global information by splitting images into small patches and leveraging the Transformer’s multi-head attention mechanism, but its large parameter count and computational overhead limit deployment on mobile devices. At the same time, the rapid development of LLMs has revolutionized natural language processing, but they also face huge deployment challenges. To address these issues, we investigate model pruning techniques, with a particular focus on how to reduce redundant parameters without losing accuracy, so as to accommodate personalized data and resource-constrained environments. In this paper, a new layered pruning strategy is proposed to distinguish personalized layers from common layers by compressed sensing and random sampling, thus significantly reducing the model parameters. Our experimental results show that the introduced step buffering mechanism further improves the accuracy of the model after pruning, providing new directions and possibilities for the deployment of efficient and personalized AI models on mobile devices in the future.

[LG-28] A Mean Field Ansatz for Zero-Shot Weight Transfer

链接: https://arxiv.org/abs/2408.08681
作者: Xingyuan Chen,Wenwei Kuang,Lei Deng,Wei Han,Bo Bai,Goncalo dos Reis
关键词-EN: large language models, weight transfer, zero-shot weight transfer, language models, pre-training cost
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注: 40 pages, 6 Figures, 1 table

点击查看摘要

Abstract:The pre-training cost of large language models (LLMs) is prohibitive. One cutting-edge approach to reduce the cost is zero-shot weight transfer, also known as model growth in some cases, which magically transfers the weights trained in a small model to a large model. However, there are still some theoretical mysteries behind the weight transfer. In this paper, inspired by prior applications of mean field theory to neural network dynamics, we introduce a mean field ansatz to provide a theoretical explanation for weight transfer. Specifically, we propose the row-column (RC) ansatz under the mean field point of view, which describes the measure structure of the weights in the neural network (NN) and admits a closed measure dynamics. Thus, the weights of NNs of different sizes admit a common distribution under proper assumptions, and weight transfer methods can be viewed as sampling methods. We empirically validate the RC ansatz by exploring simple MLP examples and LLMs such as GPT-3 and Llama-3.1. We show that the mean-field point of view is adequate under suitable assumptions, which can provide theoretical support for zero-shot weight transfer.

[LG-29] Neural Reward Machines

链接: https://arxiv.org/abs/2408.08677
作者: Elena Umili,Francesco Argenziano,Roberto Capobianco
关键词-EN: Non-markovian Reinforcement Learning, Linear Temporal Logic, Non-markovian Reinforcement, Reinforcement Learning, hard to solve
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-Markovian Reinforcement Learning (RL) tasks are very hard to solve, because agents must consider the entire history of state-action pairs to act rationally in the environment. Most works use symbolic formalisms (such as Linear Temporal Logic or automata) to specify the temporally extended task. These approaches only work in finite and discrete state environments, or in continuous problems for which a mapping between the raw state and a symbolic interpretation, known as a symbol grounding (SG) function, is available. Here, we define Neural Reward Machines (NRM), an automata-based neurosymbolic framework that can be used for both reasoning and learning in non-symbolic non-Markovian RL domains, based on the probabilistic relaxation of Moore Machines. We combine RL with semisupervised symbol grounding (SSSG) and show that NRMs can exploit high-level symbolic knowledge in non-symbolic environments without any knowledge of the SG function, outperforming Deep RL methods which cannot incorporate prior knowledge. Moreover, we advance the research in SSSG by proposing an algorithm for analysing the groundability of temporal specifications, which is more efficient than baseline techniques by a factor of 10^3.
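The Moore-machine view of a non-Markovian reward can be made concrete with a hand-built automaton. The task ("reward only after observing symbol a and then b"), state names, and rewards below are illustrative; NRMs additionally relax such machines probabilistically and learn the symbol grounding:

```python
# A tiny Moore machine tracking the history needed for a non-Markovian task:
# reach the rewarded state only after an 'a' is followed (later) by a 'b'.
transitions = {("q0", "a"): "q1", ("q0", "b"): "q0",
               ("q1", "a"): "q1", ("q1", "b"): "q2",
               ("q2", "a"): "q2", ("q2", "b"): "q2"}
reward_of_state = {"q0": 0.0, "q1": 0.0, "q2": 1.0}

def run(trace):
    # Moore output: the reward depends only on the machine state, which
    # summarizes the relevant part of the whole symbol history.
    q, total = "q0", 0.0
    for symbol in trace:
        q = transitions[(q, symbol)]
        total += reward_of_state[q]
    return total

print(run(["b", "a", "b", "b"]))  # 2.0: q2 is absorbing, rewarded twice
```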

[LG-30] A Multivocal Literature Review on Privacy and Fairness in Federated Learning

链接: https://arxiv.org/abs/2408.08666
作者: Beatrice Balbierer,Lukas Heinlein,Domenique Zipperling,Niklas Kühl
关键词-EN: Federated Learning presents, Federated Learning, data sharing, eliminating the necessity, necessity for data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the Internationale Tagung Wirtschaftsinformatik 2024

点击查看摘要

Abstract:Federated Learning presents a way to revolutionize AI applications by eliminating the necessity for data sharing. Yet, research has shown that information can still be extracted during training, making additional privacy-preserving measures such as differential privacy imperative. To implement real-world federated learning applications, fairness, ranging from a fair distribution of performance to non-discriminative behaviour, must be considered. Particularly in high-risk applications (e.g. healthcare), avoiding the repetition of past discriminatory errors is paramount. As recent research has demonstrated an inherent tension between privacy and fairness, we conduct a multivocal literature review to examine the current methods to integrate privacy and fairness in federated learning. Our analyses illustrate that the relationship between privacy and fairness has been neglected, posing a critical risk for real-world applications. We highlight the need to explore the relationship between privacy, fairness, and performance, advocating for the creation of integrated federated learning frameworks.

[LG-31] MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector

链接: https://arxiv.org/abs/2408.08661
作者: Wenjie Fu,Huandong Wang,Chen Gao,Guanghua Liu,Yong Li,Tao Jiang
关键词-EN: large language models, language models, highlight the urgent, increasing parameters, parameters and expansive
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: code and dataset: this https URL

点击查看摘要

Abstract:The increasing parameters and expansive dataset of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance in pre-trained LLMs, how to achieve high-confidence detection and how to perform MIA on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than design an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly high level of 0.9.

[LG-32] Mitigating Backdoor Attacks in Federated Learning via Flipping Weight Updates of Low-Activation Input Neurons

链接: https://arxiv.org/abs/2408.08655
作者: Binbin Ding,Penghui Yang,Zeqing Ge,Shengjun Huang
关键词-EN: collaboratively train machine, enables multiple clients, train machine learning, learning enables multiple, Federated learning enables
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated learning enables multiple clients to collaboratively train machine learning models under the overall planning of the server while adhering to privacy requirements. However, the server cannot directly oversee the local training process, creating an opportunity for malicious clients to introduce backdoors. Existing research shows that backdoor attacks activate specific neurons in the compromised model, which remain dormant when processing clean data. Leveraging this insight, we propose a method called Flipping Weight Updates of Low-Activation Input Neurons (FLAIN) to defend against backdoor attacks in federated learning. Specifically, after completing global training, we employ an auxiliary dataset to identify low-activation input neurons and flip the associated weight updates. We incrementally raise the threshold for low-activation inputs and flip the weight updates iteratively, until the performance degradation on the auxiliary data becomes unacceptable. Extensive experiments validate that our method can effectively reduce the success rate of backdoor attacks to a low level in various attack scenarios including those with non-IID data distribution or high MCRs, causing only minimal performance degradation on clean data.
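The core defense step, negating the aggregated updates that feed neurons staying inactive on clean auxiliary data, can be sketched as follows. The activation values and threshold are illustrative, and the actual method raises the threshold iteratively while monitoring accuracy on the auxiliary set:

```python
import numpy as np

def flain_flip(update, mean_activation, threshold):
    # Flip (negate) the weight updates of input neurons whose mean
    # activation on the clean auxiliary dataset falls below the threshold;
    # such dormant-on-clean-data neurons are suspected backdoor carriers.
    low = mean_activation < threshold
    flipped = np.where(low, -update, update)
    return flipped, low

update = np.array([0.5, -0.3, 0.2, 0.8])           # aggregated updates
mean_activation = np.array([0.02, 0.90, 0.01, 0.70])
flipped, low = flain_flip(update, mean_activation, threshold=0.1)
print(flipped)  # [-0.5 -0.3 -0.2  0.8]
```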

[LG-33] TextCAVs: Debugging vision models using text MICCAI2024

链接: https://arxiv.org/abs/2408.08652
作者: Angus Nicolson,Yarin Gal,J. Alison Noble
关键词-EN: Concept-based interpretability methods, high-level human interpretable, Concept-based interpretability, human interpretable concepts, popular form
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 11 pages, 2 figures. Accepted at iMIMIC Workshop at MICCAI 2024

点击查看摘要

Abstract:Concept-based interpretability methods are a popular form of explanation for deep learning models which provide explanations in the form of high-level human interpretable concepts. These methods typically find concept activation vectors (CAVs) using a probe dataset of concept examples. This requires labelled data for these concepts – an expensive task in the medical domain. We introduce TextCAVs: a novel method which creates CAVs using vision-language models such as CLIP, allowing for explanations to be created solely using text descriptions of the concept, as opposed to image exemplars. This reduced cost in testing concepts allows for many concepts to be tested and for users to interact with the model, testing new ideas as they are thought of, rather than a delay caused by image collection and annotation. In early experimental results, we demonstrate that TextCAVs produces reasonable explanations for a chest x-ray dataset (MIMIC-CXR) and natural images (ImageNet), and that these explanations can be used to debug deep learning-based models.
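The key trick, building a concept activation vector from text alone, reduces to a normalized difference of two encoded prompts. The stand-in encoder below is a seeded random projection purely for illustration; TextCAVs would call a real CLIP-style text encoder here (plus a learned mapping into the probed model's feature space):

```python
import numpy as np

def fake_text_embed(text, dim=8):
    # Deterministic stand-in for a CLIP text encoder; the real method uses
    # actual text embeddings, not seeded random vectors.
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def text_cav(concept, baseline="an image"):
    # CAV = normalized difference between the concept text and a neutral
    # baseline prompt, giving a direction in feature space.
    d = fake_text_embed(concept) - fake_text_embed(baseline)
    return d / np.linalg.norm(d)

cav = text_cav("a chest x-ray showing an enlarged heart")
activation = np.ones(8) / np.sqrt(8)   # a model feature vector to probe
score = float(activation @ cav)        # directional sensitivity to concept
print(round(float(np.linalg.norm(cav)), 6))  # 1.0 -- CAVs are unit directions
```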

[LG-34] he Power of Bias: Optimizing Client Selection in Federated Learning with Heterogeneous Differential Privacy

链接: https://arxiv.org/abs/2408.08642
作者: Jiating Ma,Yipeng Zhou,Qi Li,Quan Z. Sheng,Laizhong Cui,Jiangchuan Liu
关键词-EN: conducting model training, expose model gradients, private federated learning, federated learning, paradigm emerges
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To preserve data privacy, the federated learning (FL) paradigm emerges, in which clients only expose model gradients rather than original data for conducting model training. To enhance the protection of model gradients in FL, differentially private federated learning (DPFL) is proposed, which incorporates differentially private (DP) noises to obfuscate gradients before they are exposed. Yet, an essential but largely overlooked problem in DPFL is the heterogeneity of clients’ privacy requirements, which can vary significantly across clients and greatly complicates the client selection problem in DPFL. In other words, both the data quality and the influence of DP noises should be taken into account when selecting clients. To address this problem, we conduct a convergence analysis of DPFL under heterogeneous privacy, a generic client selection strategy, popular DP mechanisms, and convex loss. Based on the convergence analysis, we formulate the client selection problem to minimize the value of the loss function in DPFL with heterogeneous privacy, which is a convex optimization problem and can be solved efficiently. Accordingly, we propose the DPFL-BCS (biased client selection) algorithm. Extensive experiment results with real datasets under both convex and non-convex loss functions indicate that DPFL-BCS can remarkably improve model utility compared with the SOTA baselines.
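The gradient obfuscation the abstract refers to is typically the Gaussian mechanism: clip each client update to a bounded norm, then add noise scaled to that bound. The clip bound and noise multiplier below are illustrative; in the heterogeneous setting the paper studies, each client's noise multiplier would follow its own privacy requirement:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sanitize(grad, clip_norm, noise_multiplier):
    # Clip the client gradient to an L2 ball of radius clip_norm, then add
    # Gaussian noise proportional to the clipping bound. Clients demanding
    # stricter privacy use a larger noise_multiplier, degrading their updates.
    scale = min(1.0, clip_norm / np.linalg.norm(grad))
    clipped = grad * scale
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

grad = np.array([3.0, 4.0])   # raw gradient, L2 norm 5
noisy = dp_sanitize(grad, clip_norm=1.0, noise_multiplier=0.5)
print(noisy.shape)  # (2,)
```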

[LG-35] Navigating Uncertainties in Machine Learning for Structural Dynamics: A Comprehensive Review of Probabilistic and Non-Probabilistic Approaches in Forward and Inverse Problems

链接: https://arxiv.org/abs/2408.08629
作者: Wang-Ji Yan(1 and 2),Lin-Feng Mei(1),Jiang Mo(1),Costas Papadimitriou(3),Ka-Veng Yuen(1 and 2),Michael Beer(4,5, and 6) ((1) State Key Laboratory of Internet of Things for Smart City and Department of Civil and Environmental Engineering, University of Macau, (2) Guangdong-Hong Kong-Macau Joint Laboratory for Smart Cities, China, (3) Department of Mechanical Engineering, University of Thessaly, (4) Leibniz University Hannover, Institute for Risk and Reliability, (5) Department of Civil and Environmental Engineering, University of Liverpool, (6) International Joint Research Center for Resilient Infrastructure & International Joint Research Center for Engineering Reliability and Stochastic Mechanics, Tongji University)
关键词-EN: notably impacting structural, notably impacting, era of big, powerful tool, big data
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 114 pages, 27 figures, 6 tables, references added

点击查看摘要

Abstract:In the era of big data, machine learning (ML) has become a powerful tool in various fields, notably impacting structural dynamics. ML algorithms offer advantages by modeling physical phenomena based on data, even in the absence of underlying mechanisms. However, uncertainties such as measurement noise and modeling errors can compromise the reliability of ML predictions, highlighting the need for effective uncertainty awareness to enhance prediction robustness. This paper presents a comprehensive review on navigating uncertainties in ML, categorizing uncertainty-aware approaches into probabilistic methods (including Bayesian and frequentist perspectives) and non-probabilistic methods (such as interval learning and fuzzy learning). Bayesian neural networks, known for their uncertainty quantification and nonlinear mapping capabilities, are emphasized for their superior performance and potential. The review covers various techniques and methodologies for addressing uncertainties in ML, discussing fundamentals and implementation procedures of each method. While providing a concise overview of fundamental concepts, the paper refrains from in-depth critical explanations. Strengths and limitations of each approach are examined, along with their applications in structural dynamic forward problems like response prediction, sensitivity assessment, and reliability analysis, and inverse problems like system identification, model updating, and damage identification. Additionally, the review identifies research gaps and suggests future directions for investigations, aiming to provide comprehensive insights to the research community. By offering an extensive overview of both probabilistic and non-probabilistic approaches, this review aims to assist researchers and practitioners in making informed decisions when utilizing ML techniques to address uncertainties in structural dynamic problems.
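As a minimal concrete instance of the probabilistic (Bayesian) methods the review covers, the sketch below computes the closed-form posterior and predictive variance of Bayesian linear regression. Function names and the precision values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bayes_linreg_predict(X, y, x_new, alpha=1.0, beta=25.0):
    """Closed-form Bayesian linear regression (alpha: prior precision,
    beta: noise precision; values here are illustrative)."""
    S = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)  # posterior covariance
    m = beta * S @ X.T @ y                                          # posterior mean
    return x_new @ m, 1.0 / beta + x_new @ S @ x_new                # predictive mean, variance

# Predictive uncertainty grows away from the observed inputs.
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])  # bias term + one feature
y = np.array([0.1, 0.6, 1.1])
_, var_near = bayes_linreg_predict(X, y, np.array([1.0, 0.5]))
_, var_far = bayes_linreg_predict(X, y, np.array([1.0, 5.0]))
```

This per-prediction variance is exactly the kind of uncertainty quantification that Bayesian neural networks generalize to nonlinear mappings.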

[LG-36] A survey on secure decentralized optimization and learning

链接: https://arxiv.org/abs/2408.08628
作者: Changxin Liu,Nicola Bastianello,Wei Huo,Yang Shi,Karl H. Johansson
关键词-EN: solving large-scale decision-making, large-scale decision-making problems, training large machine, large machine learning, machine learning models
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 38 pages

点击查看摘要

Abstract:Decentralized optimization has become a standard paradigm for solving large-scale decision-making problems and training large machine learning models without centralizing data. However, this paradigm introduces new privacy and security risks, with malicious agents potentially able to infer private data or impair the model accuracy. Over the past decade, significant advancements have been made in developing secure decentralized optimization and learning frameworks and algorithms. This survey provides a comprehensive tutorial on these advancements. We begin with the fundamentals of decentralized optimization and learning, highlighting centralized aggregation and distributed consensus as key modules exposed to security risks in federated and distributed optimization, respectively. Next, we focus on privacy-preserving algorithms, detailing three cryptographic tools and their integration into decentralized optimization and learning systems. Additionally, we examine resilient algorithms, exploring the design and analysis of resilient aggregation and consensus protocols that support these systems. We conclude the survey by discussing current trends and potential future directions.
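The centralized-aggregation risk mentioned above is often mitigated with secure aggregation; a toy version of the pairwise-masking idea (not a full cryptographic protocol, and not any specific scheme from the survey) looks like this:

```python
import random

def mask_updates(updates, seed=0):
    """Pairwise additive masking: each pair (i, j) shares a random mask that is
    added to one update and subtracted from the other, so the masks cancel in
    the aggregate and the server learns only the sum."""
    rng = random.Random(seed)
    masked = list(updates)
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-10.0, 10.0)
            masked[i] += m
            masked[j] -= m
    return masked

updates = [0.2, -0.5, 1.3]
masked = mask_updates(updates)
```

Real protocols derive the pairwise masks from key agreement and handle dropouts; the sketch only shows why the sum survives while individual updates are hidden.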

[LG-37] DeepDFA: Automata Learning through Neural Probabilistic Relaxations

链接: https://arxiv.org/abs/2408.08622
作者: Elena Umili,Roberto Capobianco
关键词-EN: Deterministic Finite Automata, identifying Deterministic Finite, Finite Automata, Deterministic Finite, Recurrent Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we introduce DeepDFA, a novel approach to identifying Deterministic Finite Automata (DFAs) from traces, harnessing a differentiable yet discrete model. Inspired by both the probabilistic relaxation of DFAs and Recurrent Neural Networks (RNNs), our model offers interpretability post-training, alongside reduced complexity and enhanced training efficiency compared to traditional RNNs. Moreover, by leveraging gradient-based optimization, our method surpasses combinatorial approaches in both scalability and noise resilience. Validation experiments conducted on target regular languages of varying size and complexity demonstrate that our approach is accurate, fast, and robust to noise in both the input symbols and the output labels of training data, integrating the strengths of both logical grammar induction and deep learning.
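The probabilistic relaxation of a DFA can be illustrated directly: keep a distribution over states and make transitions row-stochastic via softmax, so acceptance becomes a differentiable function of the logits. This is a hypothetical toy in the spirit of DeepDFA, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relaxed_dfa_accept(trace, trans_logits, accept_logits):
    """Probability that a relaxed DFA accepts a trace.
    trans_logits: (num_symbols, n_states, n_states); each row becomes a
    stochastic transition distribution, keeping the automaton differentiable."""
    T = softmax(trans_logits)                       # soft transition matrices
    state = np.zeros(trans_logits.shape[1])
    state[0] = 1.0                                  # start in state 0
    for sym in trace:
        state = state @ T[sym]                      # soft transition per symbol
    accept = 1.0 / (1.0 + np.exp(-accept_logits))   # sigmoid: soft accepting states
    return float(state @ accept)

# Near-hard parity automaton over one symbol: state 0 <-> state 1, accept state 0.
trans_logits = np.array([[[-10.0, 10.0], [10.0, -10.0]]])
accept_logits = np.array([10.0, -10.0])
```

After gradient training, rounding the logits yields a discrete, interpretable DFA, which is the post-training interpretability the abstract highlights.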

[LG-38] Generative Dataset Distillation Based on Diffusion Model ECCV2024

链接: https://arxiv.org/abs/2408.08610
作者: Duo Su,Junjie Hou,Guang Li,Ren Togo,Rui Song,Takahiro Ogawa,Miki Haseyama
关键词-EN: Dataset Distillation Challenge, generative dataset distillation, dataset distillation method, Dataset Distillation, generative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The Third Place Winner in Generative Track of the ECCV 2024 DD Challenge

点击查看摘要

Abstract:This paper presents our method for the generative track of The First Dataset Distillation Challenge at ECCV 2024. Since the diffusion model has become the mainstay of generative models because of its high-quality generative effects, we focus on distillation methods based on the diffusion model. Considering that the track can only generate a fixed number of images in 10 minutes using a generative model for the CIFAR-100 and Tiny-ImageNet datasets, we need a generative model that can produce images at high speed. In this study, we propose a novel generative dataset distillation method based on Stable Diffusion. Specifically, we use the SDXL-Turbo model, which can generate images at high speed and quality. Compared to other diffusion models that can only reach images per class (IPC) = 1, our method achieves IPC = 10 for Tiny-ImageNet and IPC = 20 for CIFAR-100, respectively. Additionally, to generate high-quality distilled datasets for CIFAR-100 and Tiny-ImageNet, we use the class information as text prompts for the SDXL-Turbo model and apply post data augmentation. Experimental results show the effectiveness of the proposed method, and we achieved third place in the generative track of the ECCV 2024 DD Challenge. Codes are available at this https URL.

[LG-39] RadioDiff: An Effective Generative Diffusion Model for Sampling-Free Dynamic Radio Map Construction

链接: https://arxiv.org/abs/2408.08593
作者: Xiucheng Wang,Keda Tao,Nan Cheng,Zhisheng Yin,Zan Li,Yuan Zhang,Xuemin Shen
关键词-EN: Radio map, obtain pathloss based, promising technology, applications to reduce, reduce the communication
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Radio map (RM) is a promising technology that can obtain pathloss based on location alone, which is significant for 6G network applications to reduce the communication costs of pathloss estimation. However, traditional RM construction is either computationally intensive or depends on costly sampling-based pathloss measurements. Although the neural network (NN)-based method can efficiently construct the RM without sampling, its performance is still suboptimal. This is primarily due to the misalignment between the generative characteristics of the RM construction problem and the discriminative modeling exploited by existing NN-based methods. Thus, to enhance RM construction performance, in this paper, sampling-free RM construction is modeled as a conditional generative problem, where a denoised diffusion-based method, named RadioDiff, is proposed to achieve high-quality RM construction. In addition, to enhance the diffusion model’s capability of extracting features from dynamic environments, an attention U-Net with an adaptive fast Fourier transform module is employed as the backbone network. Meanwhile, the decoupled diffusion model is utilized to further enhance the construction performance of RMs. Moreover, a comprehensive theoretical analysis of why RM construction is a generative problem is provided for the first time, from the perspectives of both data features and NN training methods. Experimental results show that the proposed RadioDiff achieves state-of-the-art performance in all three metrics of accuracy, structural similarity, and peak signal-to-noise ratio. The code is available at this https URL.
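RadioDiff builds on standard denoising-diffusion machinery; as background, the generic DDPM forward-noising step (not the paper's conditional attention U-Net, which is not reproduced here) can be written as:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Generic DDPM forward step: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of (1 - beta_i)."""
    alphas_bar = np.cumprod(1.0 - np.asarray(betas))
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

# A denoiser is trained to recover eps from x_t; generation runs this in reverse,
# here conditioned on the environment map in RadioDiff's setting.
rng = np.random.default_rng(0)
x0 = np.ones((2, 2))  # stand-in for a tiny radio-map patch
xt, eps = forward_diffuse(x0, t=4, betas=np.full(5, 0.1), rng=rng)
```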

[LG-40] A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models

链接: https://arxiv.org/abs/2408.08590
作者: Geonhee Kim,Marco Valentino,André Freitas
关键词-EN: auto-regressive Language Models, exploit superficial patterns, Recent studies, auto-regressive Language, systematic reasoning principles
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies on logical reasoning in auto-regressive Language Models (LMs) have sparked a debate on whether such models can learn systematic reasoning principles during pre-training or merely exploit superficial patterns in the training data. This paper presents a mechanistic interpretation of syllogistic reasoning in LMs to further enhance our understanding of internal dynamics. Specifically, we present a methodology for circuit discovery aimed at disentangling content-independent reasoning mechanisms from world knowledge acquired during pre-training. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic reasoning, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes and model sizes, finding that the identified circuit is sufficient and necessary for all the schemes on which the model achieves high downstream accuracy (≥ 60%). Overall, our findings suggest that LMs indeed learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalizable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.

[LG-41] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction CIKM2024

链接: https://arxiv.org/abs/2408.08585
作者: Yunpeng Weng,Xing Tang,Zhenhao Xu,Fuyuan Lyu,Dugang Liu,Zexu Sun,Xiuqiang He
关键词-EN: CLTV, Customer Lifetime, distribution, critical task, business applications
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:Customer Lifetime Value (CLTV) prediction is a critical task in business applications. Accurately predicting CLTV is challenging in real-world business scenarios, as the distribution of CLTV is complex and mutable. Firstly, there are a large number of users without any consumption, together with a long-tailed part that is too complex to fit. Secondly, the small set of high-value users spend orders of magnitude more than a typical user, leading to a wide range of the CLTV distribution that is hard to capture with a single distribution. Existing approaches for CLTV estimation either assume a prior probability distribution and fit a single group of distribution-related parameters for all samples, or directly learn from the posterior distribution with manually predefined buckets in a heuristic manner. However, all these methods fail to handle complex and mutable distributions. In this paper, we propose a novel optimal distribution selection model OptDist for CLTV prediction, which utilizes an adaptive optimal sub-distribution selection mechanism to improve the accuracy of complex distribution modeling. Specifically, OptDist trains several candidate sub-distribution networks in the distribution learning module (DLM) for modeling the probability distribution of CLTV. Then, a distribution selection module (DSM) is proposed to select the sub-distribution for each sample, thus making the selection automatically and adaptively. Besides, we design an alignment mechanism that connects both modules, which effectively guides the optimization. We conduct extensive experiments on two public datasets and one private dataset to verify that OptDist outperforms state-of-the-art baselines. Furthermore, OptDist has been deployed on a large-scale financial platform for customer acquisition marketing campaigns and the online experiments also demonstrate the effectiveness of OptDist.
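The sub-distribution selection idea can be pictured with a deliberately simplified, hypothetical stand-in: given hand-written candidate distributions for "typical" and "high-value" users, assign each sample to the candidate under which it is most likely. OptDist instead learns both the candidate sub-distribution networks and the selection end-to-end.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def select_subdistribution(value, candidates):
    """Return the name of the candidate sub-distribution that assigns the
    sample the highest likelihood (a hand-rolled stand-in for learned selection)."""
    return max(candidates, key=lambda c: c[1](value))[0]

# Hypothetical candidates: a 'typical' spender vs a heavy-tailed 'high_value' one.
candidates = [
    ("typical", lambda v: normal_pdf(v, mu=10.0, sigma=5.0)),
    ("high_value", lambda v: normal_pdf(v, mu=1000.0, sigma=300.0)),
]
```

A single Gaussian fitted to both groups would model neither well, which is the failure mode of single-distribution approaches described above.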

[LG-42] S-RAF: A Simulation-Based Robustness Assessment Framework for Responsible Autonomous Driving

链接: https://arxiv.org/abs/2408.08584
作者: Daniel Omeiza,Pratik Somaiya,Jo-Ann Pattinson,Carolyn Ten-Holter,Jack Stilgoe,Marina Jirotka,Lars Kunze
关键词-EN: technology advances, artificial intelligence, Robustness Assessment Framework, AI-driven systems, Simulation-Based Robustness Assessment
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As artificial intelligence (AI) technology advances, ensuring the robustness and safety of AI-driven systems has become paramount. However, varying perceptions of robustness among AI developers create misaligned evaluation metrics, complicating the assessment and certification of safety-critical and complex AI systems such as autonomous driving (AD) agents. To address this challenge, we introduce Simulation-Based Robustness Assessment Framework (S-RAF) for autonomous driving. S-RAF leverages the CARLA Driving simulator to rigorously assess AD agents across diverse conditions, including faulty sensors, environmental changes, and complex traffic situations. By quantifying robustness and its relationship with other safety-critical factors, such as carbon emissions, S-RAF aids developers and stakeholders in building safe and responsible driving agents, and streamlining safety certification processes. Furthermore, S-RAF offers significant advantages, such as reduced testing costs, and the ability to explore edge cases that may be unsafe to test in the real world. The code for this framework is available here: this https URL

[LG-43] GrassNet: State Space Model Meets Graph Neural Network

链接: https://arxiv.org/abs/2408.08583
作者: Gongpei Zhao,Tao Wang,Yi Jin,Congyan Lang,Yidong Li,Haibin Ling
关键词-EN: spectral convolutional networks, Designing spectral convolutional, graph, State Space Network, graph neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing spectral convolutional networks is a formidable task in graph learning. In traditional spectral graph neural networks (GNNs), polynomial-based methods are commonly used to design filters via the Laplacian matrix. In practical applications, however, these polynomial methods encounter inherent limitations, which primarily arise from the low-order truncation of polynomial filters and the lack of overall modeling of the graph spectrum. This leads to poor performance of existing spectral approaches on real-world graph data, especially when the spectrum is highly concentrated or contains many numerically identical values, as they tend to apply the exact same modulation to signals with the same frequencies. To overcome these issues, in this paper, we propose Graph State Space Network (GrassNet), a novel graph neural network with theoretical support that provides a simple yet effective scheme for designing and learning arbitrary graph spectral filters. In particular, our GrassNet introduces structured state space models (SSMs) to model the correlations of graph signals at different frequencies and derives a unique rectification for each frequency in the graph spectrum. To the best of our knowledge, our work is the first to employ SSMs for the design of GNN spectral filters, and it theoretically offers greater expressive power compared with polynomial filters. Extensive experiments on nine public benchmarks reveal that GrassNet achieves superior performance in real-world graph modeling tasks.
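The classic spectral-filtering operation that polynomial GNNs approximate, and that GrassNet re-parameterizes with SSMs, can be written explicitly (the SSM-based per-frequency rectification itself is not reproduced here):

```python
import numpy as np

def spectral_filter(adj, signal, h):
    """Graph spectral filtering: project the signal onto the Laplacian
    eigenvectors, scale frequency i by h(lambda_i), project back."""
    L = np.diag(adj.sum(axis=1)) - adj   # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)           # graph frequencies and Fourier basis
    return U @ (h(lam) * (U.T @ signal))

# Path graph 1-2-3: an all-pass filter returns the signal unchanged, while a
# low-pass filter keeps only the constant (lambda = 0) component.
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
x = np.array([1.0, -2.0, 3.0])
```

A polynomial filter constrains h to a low-degree polynomial of lambda, which is exactly why it must treat numerically identical eigenvalues identically, the limitation the abstract points out.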

[LG-44] S3Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

链接: https://arxiv.org/abs/2408.08567
作者: Xue Wang,Tian Zhou,Jianqing Zhu,Jialin Liu,Kun Yuan,Tao Yao,Wotao Yin,Rong Jin,HanQin Cai
关键词-EN: Attention based models, Attention, based Attention structure, Attention based, vanilla Attention based
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S^3Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S^3Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S^3Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting datasets show that S^3Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
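The row/column-selection intuition can be illustrated with a deliberately crude sketch: attend only to the k largest-norm keys, making the cost linear in sequence length for fixed k. This is a hypothetical simplification for illustration; S^3Attention's smoothing block and joint row/column matrix sketching are more sophisticated.

```python
import numpy as np

def sketch_attention(Q, K, V, k):
    """Attend only to the k largest-norm keys: O(n*k) instead of O(n^2)."""
    idx = np.argsort(-np.linalg.norm(K, axis=1))[:k]   # crude column selection
    scores = (Q @ K[idx].T) / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # softmax over kept keys
    return w @ V[idx]

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = sketch_attention(Q, K, V, k=2)   # each query sees only 2 of 5 keys
```

With k equal to the full sequence length, the sketch reduces to vanilla attention; the trade-off the abstract describes is how much information survives as k shrinks.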

[LG-45] A training regime to learn unified representations from complementary breast imaging modalities

链接: https://arxiv.org/abs/2408.08560
作者: Umang Sharma,Jungkyu Park,Laura Heacock,Sumit Chopra,Krzysztof Geras
关键词-EN: Full Field Digital, Field Digital Mammograms, Digital Breast Tomosynthesis, Digital Mammograms, Full Field
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Full Field Digital Mammograms (FFDMs) and Digital Breast Tomosynthesis (DBT) are the two most widely used imaging modalities for breast cancer screening. Although DBT has increased cancer detection compared to FFDM, its widespread adoption in clinical practice has been slowed by increased interpretation times and a perceived decrease in the conspicuity of specific lesion types. Specifically, the non-inferiority of DBT for microcalcifications remains under debate. Due to concerns about the decrease in visual acuity, combined DBT-FFDM acquisitions remain popular, leading to overall increased exam times and radiation dosage. Enabling DBT to provide diagnostic information present in both FFDM and DBT would reduce reliance on FFDM, resulting in a reduction in both quantities. We propose a machine learning methodology that learns high-level representations leveraging the complementary diagnostic signal from both DBT and FFDM. Experiments on a large-scale data set validate our claims and show that our representations enable more accurate breast lesion detection than any DBT- or FFDM-based model.

[LG-46] ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

链接: https://arxiv.org/abs/2408.08554
作者: Chao Zeng,Songwei Liu,Yusheng Xie,Hong Liu,Xiaojian Wang,Miao Wei,Shu Yang,Fangmin Chen,Xing Mei
关键词-EN: Large Language Models, language processing tasks, revolutionized natural language, natural language processing, Large Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit type on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths. (2) the bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit). (3) an innovative quantization acceleration framework that reconstructs the quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, gets rid of the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component bit width gain into actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). Based on the W2A8 quantization configuration on the LLaMA-7B model, it achieved a WikiText2 perplexity of 7.59 (2.17↓ vs 9.76 in AffineQuant). Compared to SmoothQuant, we realized 1.6× acceleration improvement and 2.7× memory compression gain.
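As background for the bit-width trade-off, a generic uniform symmetric post-training quantizer (a textbook baseline, not ABQ-LLM's actual scheme) shows how reconstruction error grows at very low bit-widths:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric per-tensor quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax          # map the largest magnitude to qmax
    return np.clip(np.round(w / scale), -qmax, qmax), scale

def dequantize(q, scale):
    return q * scale

# Reconstruction error at 2 bits is far larger than at 8 bits.
w = np.linspace(-1.0, 1.0, 9)
err2 = np.abs(dequantize(*quantize_symmetric(w, 2)) - w).max()
err8 = np.abs(dequantize(*quantize_symmetric(w, 8)) - w).max()
```

Techniques like the paper's distribution correction and bit balance exist precisely to claw back the accuracy this naive 2-bit rounding loses.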

[LG-47] Where is the signal in tokenization space?

链接: https://arxiv.org/abs/2408.08541
作者: Renato Lui Geh,Honghua Zhang,Kareem Ahmed,Benjie Wang,Guy Van den Broeck
关键词-EN: Large Language Models, Large Language, so-called canonical token, canonical token sequences, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes Tokens as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
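The non-uniqueness of tokenizations is easy to reproduce with a toy vocabulary; the brute-force enumeration below illustrates the marginal the paper studies (the paper proves computing it exactly is hard for real LLM tokenizers, and the unigram probabilities here are made up for illustration):

```python
import math

def tokenizations(s, vocab):
    """Enumerate every segmentation of s into vocabulary tokens (brute force)."""
    if not s:
        return [[]]
    return [[s[:i]] + rest
            for i in range(1, len(s) + 1) if s[:i] in vocab
            for rest in tokenizations(s[i:], vocab)]

# The paper's Llama2 example: "Tokens" has a canonical and a non-canonical split.
vocab = {"Tok", "en", "ens", "s"}
segs = tokenizations("Tokens", vocab)

# Toy marginal: sum each segmentation's probability under a unigram model.
probs = {"Tok": 0.4, "en": 0.2, "ens": 0.3, "s": 0.1}
marginal = sum(math.prod(probs[t] for t in seg) for seg in segs)
```

The marginal always dominates any single segmentation's probability, which is the signal the paper aggregates from non-canonical tokenizations.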

[LG-48] Blockchain-Enabled Accountability in Data Supply Chain: A Data Bill of Materials Approach

链接: https://arxiv.org/abs/2408.08536
作者: Yue Liu,Dawen Zhang,Boming Xia,Julia Anticev,Tunde Adebayo,Zhenchang Xing,Moses Machao
关键词-EN: advanced artificial intelligence, large-scale generative models, ensuring the traceability, artificial intelligence, highlighted by large-scale
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of advanced artificial intelligence, highlighted by large-scale generative models like GPT-4, ensuring the traceability, verifiability, and reproducibility of datasets throughout their lifecycle is paramount for research institutions and technology companies. These organisations increasingly rely on vast corpora to train and fine-tune advanced AI models, resulting in intricate data supply chains that demand effective data governance mechanisms. In addition, the challenge intensifies as diverse stakeholders may use assorted tools, often without adequate measures to ensure the accountability of data and the reliability of outcomes. In this study, we adapt the concept of "Software Bill of Materials" into the field of data governance and management to address the above challenges, and introduce "Data Bill of Materials" (DataBOM) to capture the dependency relationship between different datasets and stakeholders by storing specific metadata. We demonstrate a platform architecture for providing blockchain-based DataBOM services, present the interaction protocol for stakeholders, and discuss the minimal requirements for DataBOM metadata. The proposed solution is evaluated in terms of feasibility and performance via a case study and quantitative analysis, respectively.
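One way to picture a DataBOM entry is as a content-addressed metadata record whose id hashes the dataset's name, parent-dataset ids, and metadata, so tampering anywhere in the lineage changes the id. This is a hypothetical sketch of the dependency-capture idea, not the paper's blockchain platform or metadata schema:

```python
import hashlib
import json

def databom_record(dataset, parents, metadata):
    """Content-addressed DataBOM-style entry: the id is a SHA-256 hash over
    the dataset name, its parent ids, and its metadata."""
    body = {"dataset": dataset, "parents": sorted(parents), "metadata": metadata}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "id": digest}

# A derived dataset references its parent's id, forming a verifiable chain.
raw = databom_record("corpus-v1", [], {"license": "CC-BY"})
derived = databom_record("corpus-v1-dedup", [raw["id"]], {"tool": "dedup"})
```

Anchoring such ids on a blockchain, as the paper proposes, makes the lineage additionally tamper-evident across mutually distrusting stakeholders.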

[LG-49] Detecting Unsuccessful Students in Cybersecurity Exercises in Two Different Learning Environments

链接: https://arxiv.org/abs/2408.08531
作者: Valdemar Švábenský,Kristián Tkáčik,Aubrey Birdwell,Richard Weiss,Ryan S. Baker,Pavel Čeleda,Jan Vykopal,Jens Mache,Ankur Chattopadhyay
关键词-EN: research track evaluates, performing poorly, research track, track evaluates, evaluates the usage
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注: To appear for publication in the FIE 2024 conference proceedings

点击查看摘要

Abstract:This full paper in the research track evaluates the usage of data logged from cybersecurity exercises in order to predict students who are potentially at risk of performing poorly. Hands-on exercises are essential for learning since they enable students to practice their skills. In cybersecurity, hands-on exercises are often complex and require knowledge of many topics. Therefore, students may miss solutions due to gaps in their knowledge and become frustrated, which impedes their learning. Targeted aid by the instructor helps, but since the instructor’s time is limited, efficient ways to detect struggling students are needed. This paper develops automated tools to predict when a student is having difficulty. We formed a dataset with the actions of 313 students from two countries and two learning environments: KYPO CRP and EDURange. These data are used in machine learning algorithms to predict the success of students in exercises deployed in these environments. After extracting features from the data, we trained and cross-validated eight classifiers for predicting the exercise outcome and evaluated their predictive power. The contribution of this paper is comparing two approaches to feature engineering, modeling, and classification performance on data from two learning environments. Using the features from either learning environment, we were able to detect and distinguish between successful and struggling students. A decision tree classifier achieved the highest balanced accuracy and sensitivity with data from both learning environments. The results show that activity data from cybersecurity exercises are suitable for predicting student success. In a potential application, such models can aid instructors in detecting struggling students and providing targeted help. We publish data and code for building these models so that others can adopt or adapt them.

[LG-50] Inverse design with conditional cascaded diffusion models

链接: https://arxiv.org/abs/2408.08526
作者: Milad Habibi,Mark Fuge
关键词-EN: Adjoint-based design optimizations, Adjoint-based design, computationally expensive, diffusion model, model
类目: Machine Learning (cs.LG)
*备注: Accepted for presentation at IDETC/CIE 2024 conference, Washington, DC. 11 pages, 9 figures

点击查看摘要

Abstract:Adjoint-based design optimizations are usually computationally expensive and those costs scale with resolution. To address this, researchers have proposed machine learning approaches for inverse design that can predict higher-resolution solutions from lower cost/resolution ones. Due to the recent success of diffusion models over traditional generative models, we extend the use of diffusion models for multi-resolution tasks by proposing the conditional cascaded diffusion model (cCDM). Compared to GANs, cCDM is more stable to train, and each diffusion model within the cCDM can be trained independently, thus each model’s parameters can be tuned separately to maximize the performance of the pipeline. Our study compares cCDM against a cGAN model with transfer learning. Our results demonstrate that the cCDM excels in capturing finer details, preserving volume fraction constraints, and minimizing compliance errors in multi-resolution tasks when a sufficient amount of high-resolution training data (more than 10^2 designs) is available. Furthermore, we explore the impact of training data size on the performance of both models. While both models show decreased performance with reduced high-resolution training data, the cCDM loses its superiority to the cGAN model with transfer learning when training data is limited (less than 10^2), and we show the break-even point for this transition. Also, we highlight that while the diffusion model may achieve better pixel-wise performance in both low-resolution and high-resolution scenarios, this does not necessarily guarantee that the model produces optimal compliance error or constraint satisfaction.

[LG-51] Mitigating Degree Bias in Signed Graph Neural Networks AAAI

链接: https://arxiv.org/abs/2408.08508
作者: Fang He,Jinhai Deng,Ruizhan Xue,Maojun Wang,Zeyu Zhang
关键词-EN: Graph Neural Networks, Signed Graph Neural, Graph Neural, Neural Networks, Debiased Signed Graph
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, The 39th Annual AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:Like Graph Neural Networks (GNNs), Signed Graph Neural Networks (SGNNs) are also up against fairness issues arising from source data and typical aggregation methods. In this paper, we pioneer the investigation of fairness in SGNNs by extending it from GNNs. We identify the issue of degree bias within signed graphs, offering a new perspective on the fairness issues related to SGNNs. To handle this bias issue, inspired by previous work on degree bias, we propose a new model-agnostic method to enhance the representation of nodes with different degrees, named Degree Debiased Signed Graph Neural Network (DD-SGNN). More specifically, in each layer, we make a transfer from nodes with high degree to nodes with low degree inside a head-to-tail triplet, which supplements the structure missing from the tail nodes and meanwhile maintains the positive and negative semantics specified by balance theory in signed graphs. We make extensive experiments on four real-world datasets. The results verify the validity of the model, that is, our model mitigates the degree bias issue without compromising performance (i.e., AUC, F1). The code is provided in supplementary material.

[LG-52] The Limitations of Model Retraining in the Face of Performativity ICML

链接: https://arxiv.org/abs/2408.08499
作者: Anmol Kabra,Kumar Kshitij Patel
关键词-EN: study stochastic optimization, study stochastic, stochastic optimization, distribution shifts, simple distribution shifts
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: Accepted to 2024 ICML Workshop on Humans, Algorithmic Decision-Making and Society

点击查看摘要

Abstract:We study stochastic optimization in the context of performative shifts, where the data distribution changes in response to the deployed model. We demonstrate that naive retraining can be provably suboptimal even for simple distribution shifts. The issue worsens when models are retrained given a finite number of samples at each retraining step. We show that adding regularization to retraining corrects both of these issues, attaining provably optimal models in the face of distribution shifts. Our work advocates rethinking how machine learning models are retrained in the presence of performative effects.
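The failure of naive retraining under a performative shift, and the fix via regularization, can be reproduced on a one-dimensional toy problem (the shift map `1.2*theta + 1` and the penalty strength are illustrative assumptions, not the paper's setup):

```python
import random

random.seed(0)

def sample_mean(theta, n=2000):
    """Draw n samples from the performative distribution N(1.2*theta + 1, 1)
    induced by the deployed model parameter theta, and return their mean."""
    return sum(random.gauss(1.2 * theta + 1.0, 1.0) for _ in range(n)) / n

def retrain(lam, steps=30):
    """Repeated risk minimization of the squared loss (theta - z)^2 plus an
    L2 penalty lam*theta^2; lam = 0 recovers naive retraining."""
    theta = 0.0
    for _ in range(steps):
        theta = sample_mean(theta) / (1.0 + lam)  # closed-form minimizer
    return theta

naive = retrain(lam=0.0)        # shift responds more than one-to-one: diverges
regularized = retrain(lam=1.0)  # update becomes a contraction: settles near 1.25
print(naive, regularized)
```

Because the data mean responds with slope 1.2 to the deployed model, each naive retraining step amplifies the previous one, while the L2 penalty shrinks the update map into a contraction with a stable fixed point.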

[LG-53] Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms ICLR2024

链接: https://arxiv.org/abs/2408.08494
作者: Yi Li,Honghao Lin,David P.Woodruff
关键词-EN: residual error estimation, error estimation, bound, epsilon, residual error
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2024

点击查看摘要

Abstract:We study the problem of residual error estimation for matrix and vector norms using a linear sketch. Such estimates can be used, for example, to quickly assess how useful a more expensive low-rank approximation computation will be. The matrix case concerns the Frobenius norm and the task is to approximate the k-residual \|A - A_k\|_F of the input matrix A within a (1+\epsilon)-factor, where A_k is the optimal rank-k approximation. We provide a tight bound of \Theta(k^2/\epsilon^4) on the size of bilinear sketches, which have the form of a matrix product S A T. This improves the previous O(k^2/\epsilon^6) upper bound in (Andoni et al. SODA 2013) and gives the first non-trivial lower bound, to the best of our knowledge. In our algorithm, our sketching matrices S and T can both be sparse matrices, allowing for a very fast update time. We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work. For the vector case, we consider the \ell_p-norm for p > 2, where the task is to approximate the k-residual \|x - x_k\|_p up to a constant factor, where x_k is the optimal k-sparse approximation to x. Such vector norms are frequently studied in the data stream literature and are useful for finding frequent items or so-called heavy hitters. We establish an upper bound of O(k^{2/p} n^{1-2/p} \mathrm{poly}(\log n)) for constant \epsilon on the dimension of a linear sketch for this problem. Our algorithm can be extended to the \ell_p sparse recovery problem with the same sketching dimension, which seems to be the first such bound for p > 2. We also show an \Omega(k^{2/p} n^{1-2/p}) lower bound for the sparse recovery problem, which is tight up to a \mathrm{poly}(\log n) factor.
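As a minimal illustration of the linear-sketch primitive underlying such results, the following shows a dense Gaussian sketch preserving the \ell_2 norm of a vector up to small relative error (this is the classic Johnson-Lindenstrauss-style building block, not the paper's bilinear residual sketch; the dimensions are toy values):

```python
import math
import random

random.seed(1)

n, m = 400, 300          # ambient dimension and sketch size (assumed toy values)
x = [random.gauss(0, 1) for _ in range(n)]

# Gaussian linear sketch: y = S x with S_ij ~ N(0, 1/m), so E||y||^2 = ||x||^2.
S = [[random.gauss(0, 1.0 / math.sqrt(m)) for _ in range(n)] for _ in range(m)]
y = [sum(S[i][j] * x[j] for j in range(n)) for i in range(m)]

true_norm = math.sqrt(sum(v * v for v in x))
est_norm = math.sqrt(sum(v * v for v in y))
rel_err = abs(est_norm - true_norm) / true_norm
print(rel_err)  # concentrates around 0 at rate ~ 1/sqrt(m)
```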

[LG-54] Fishers Harvest Parallel Unlearning in Inherited Model Networks

链接: https://arxiv.org/abs/2408.08493
作者: Xiao Liu,Mingyuan Li,Xu Wang,Guangsheng Yu,Wei Ni,Lixiang Li,Haipeng Peng,Renping Liu
关键词-EN: complex inheritance relationships, exhibiting complex inheritance, frameworks remains challenging, Directed Acyclic Graph, learning frameworks remains
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Unlearning in various learning frameworks remains challenging, with the continuous growth and updates of models exhibiting complex inheritance relationships. This paper presents a novel unlearning framework, which enables fully parallel unlearning among models exhibiting inheritance. A key enabler is the new Unified Model Inheritance Graph (UMIG), which captures the inheritance using a Directed Acyclic Graph (DAG). Central to our framework is the new Fisher Inheritance Unlearning (FIUn) algorithm, which utilizes the Fisher Information Matrix (FIM) from initial unlearning models to pinpoint impacted parameters in inherited models. By employing the FIM, the FIUn method breaks the sequential dependencies among the models, facilitating simultaneous unlearning and reducing computational overhead. We further design a merging mechanism that combines disparate FIMs into a single matrix, synchronizing updates across inherited models. Experiments confirm the effectiveness of our unlearning framework. For single-class tasks, it achieves complete unlearning with 0% accuracy for unlearned labels while maintaining 94.53% accuracy for retained labels on average. For multi-class tasks, the accuracy is 1.07% for unlearned labels and 84.77% for retained labels on average. Our framework accelerates unlearning by 99% compared to alternative methods.
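The Fisher-based pinpointing step can be sketched with a diagonal Fisher approximation (a common simplification: mean squared per-example gradients); the gradient statistics and threshold below are toy assumptions, not the paper's configuration:

```python
import random

random.seed(2)

def diagonal_fim(per_example_grads):
    """Diagonal Fisher Information approximation: the mean squared
    per-example gradient for each parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[d] ** 2 for g in per_example_grads) / n for d in range(dim)]

# Toy per-example gradients of the loss on the *forget* set for a 6-parameter
# model: parameters 1 and 4 are strongly involved, the rest barely move.
forget_grads = [[0.01 * random.gauss(0, 1) if d not in (1, 4)
                 else random.gauss(0, 1) + 2.0
                 for d in range(6)] for _ in range(64)]

fim = diagonal_fim(forget_grads)

# Pinpoint the impacted parameters: those whose Fisher information on the
# forget set exceeds a threshold; only these need updating in inherited models.
threshold = 0.5
impacted = [d for d, f in enumerate(fim) if f > threshold]
print(impacted)
```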

[LG-55] Adversarial Contrastive Learning Based Physics-Informed Temporal Networks for Cuffless Blood Pressure Estimation

链接: https://arxiv.org/abs/2408.08488
作者: Rui Wang,Mengshi Qi,Yingxia Shao,Anfu Zhou,Huadong Ma
关键词-EN: extensive applications, mining is immensely, immensely important, important in extensive, Time series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Time series data mining is immensely important in extensive applications, such as traffic, medical, and e-commerce. In this paper, we focus on medical temporal variation modeling, i.e., cuffless blood pressure (BP) monitoring, which has great value in cardiovascular healthcare. Although providing a comfortable user experience, such methods suffer from the demand for a significant amount of realistic data to train an individual model for each subject, especially considering the invasive or obtrusive BP ground-truth measurements. To tackle this challenge, we introduce a novel physics-informed temporal network (PITN) with adversarial contrastive learning to enable precise BP estimation with very limited data. Specifically, we first enhance the physics-informed neural network (PINN) with a temporal block for investigating BP dynamics' multi-periodicity for personal cardiovascular cycle modeling and temporal variation. We then employ adversarial training to generate extra physiological time series data, improving PITN's robustness in the face of sparse subject-specific training data. Furthermore, we utilize contrastive learning to capture the discriminative variations of cardiovascular physiologic phenomena. This approach aggregates physiological signals with similar blood pressure values in latent space while separating clusters of samples with dissimilar blood pressure values. Experiments on three widely-adopted datasets with different modalities (i.e., bioimpedance, PPG, millimeter-wave) demonstrate the superiority and effectiveness of the proposed methods over previous state-of-the-art approaches. The code is available at this https URL.
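The contrastive objective described above (pulling together signals with similar BP values, pushing apart dissimilar ones) can be sketched with a simplified InfoNCE-style loss; the threshold `tau`, the temperature, and the exact formulation are assumptions, not the paper's loss:

```python
import math

def contrastive_bp_loss(embeddings, bp_values, tau=5.0, temp=0.5):
    """Simplified supervised-contrastive loss: pairs whose BP labels differ by
    less than tau act as positives; all other samples act as negatives."""
    def sim(a, b):  # cosine similarity
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    loss, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(n):
            if i == j or abs(bp_values[i] - bp_values[j]) >= tau:
                continue  # only anchor-positive pairs contribute a term
            denom = sum(math.exp(sim(embeddings[i], embeddings[k]) / temp)
                        for k in range(n) if k != i)
            loss -= math.log(math.exp(sim(embeddings[i], embeddings[j]) / temp) / denom)
            count += 1
    return loss / count

bp = [120.0, 121.0, 160.0, 161.0]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]    # similar BP close
scrambled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]  # similar BP far
print(contrastive_bp_loss(aligned, bp), contrastive_bp_loss(scrambled, bp))
```

The loss is lower when embeddings of similar-BP signals sit close together, which is exactly the latent-space structure the abstract describes.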

[LG-56] An Unsupervised Learning Framework Combined with Heuristics for the Maximum Minimal Cut Problem

链接: https://arxiv.org/abs/2408.08484
作者: Huaiyuan Liu,Xianzhang Liu,Donghua Yang,Hongzhi Wang,Yingchi Long,Mengtong Ji,Dongjing Miao,Zhiyu Liang
关键词-EN: Maximum Minimal Cut, Minimal Cut Problem, Maximum Minimal, Minimal Cut, NP-hard combinatorial optimization
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Maximum Minimal Cut Problem (MMCP), an NP-hard combinatorial optimization (CO) problem, has not received much attention due to the demanding and challenging bi-connectivity constraint. Moreover, as a CO problem, it is also a daunting task for machine learning, especially without labeled instances. To deal with these problems, this work proposes an unsupervised learning framework combined with heuristics for MMCP that can provide valid and high-quality solutions. As far as we know, this is the first work that explores machine learning and heuristics to solve MMCP. The unsupervised solver is inspired by a relaxation-plus-rounding approach: the relaxed solution is parameterized by graph neural networks, and the cost and penalty of MMCP are explicitly written out, which allows the model to be trained end-to-end. A crucial observation is that each solution corresponds to at least one spanning tree. Based on this finding, a heuristic solver that implements tree transformations by adding vertices is utilized to repair and improve the solution quality of the unsupervised solver. Alternatively, the graph is simplified while guaranteeing solution consistency, which reduces the running time. We conduct extensive experiments to evaluate our framework and give a specific application. The results demonstrate the superiority of our method against two specifically designed techniques.

[LG-57] Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

链接: https://arxiv.org/abs/2408.08470
作者: Jerry Huang,Prasanna Parthasarathi,Mehdi Rezagholizadeh,Sarath Chandar
关键词-EN: large language models, widespread adoption, resource constraints, growing sizes, sizes only increasing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages (9 pages main content + references + appendix)

点击查看摘要

Abstract:Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier for use. One noted issue is the high latency associated with auto-regressive generation, rendering the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable on some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision making problem, we observe it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains provided the candidates are effective. Further results show this to hold on various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.
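The contextual-bandit view of draft-model selection can be sketched with a simple epsilon-greedy policy; the domains, acceptance rates, and reward signal below are toy assumptions standing in for real draft/target alignment data:

```python
import random

random.seed(3)

contexts = ["code", "medical"]
# Assumed per-domain token-acceptance rates of each draft model (ground truth,
# unknown to the policy): draft 0 is better on code, draft 1 on medical.
true_accept = {"code": [0.8, 0.4], "medical": [0.3, 0.9]}

counts = {c: [0, 0] for c in contexts}
values = {c: [0.0, 0.0] for c in contexts}

def choose(context, eps=0.1):
    """Epsilon-greedy contextual bandit: pick a draft model given the context."""
    if random.random() < eps:
        return random.randrange(2)
    return max(range(2), key=lambda a: values[context][a])

for _ in range(3000):
    c = random.choice(contexts)
    a = choose(c)
    reward = 1.0 if random.random() < true_accept[c][a] else 0.0  # token accepted?
    counts[c][a] += 1
    # incremental mean update of the estimated acceptance rate
    values[c][a] += (reward - values[c][a]) / counts[c][a]

best = {c: max(range(2), key=lambda a: values[c][a]) for c in contexts}
print(best)
```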

[LG-58] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

链接: https://arxiv.org/abs/2408.08459
作者: Xiaochuang Han,Marjan Ghazvininejad,Pang Wei Koh,Yulia Tsvetkov
关键词-EN: potentially easy integration, Recent work, generality and potentially, potentially easy, easy integration
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization – representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
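The core idea, exposing canonical-codec file bytes as discrete LM tokens, can be sketched with zlib standing in for JPEG (a real pipeline would emit actual JPEG/AVC bytes; the toy image below is an assumption):

```python
import zlib

def to_tokens(pixels):
    """Serialize raw pixel bytes through a canonical codec (zlib here as a
    stand-in for JPEG) and expose the compressed file bytes as LM tokens."""
    compressed = zlib.compress(bytes(pixels))
    return list(compressed)  # each byte value in 0..255 is one token

def from_tokens(tokens):
    """Decode a token sequence back into pixel values."""
    return list(zlib.decompress(bytes(tokens)))

pixels = [(x * 7 + 3) % 256 for x in range(1024)]  # toy 32x32 grayscale image
tokens = to_tokens(pixels)
print(len(pixels), len(tokens))  # the token sequence is shorter than raw pixels
```

An LM trained on such byte streams needs only a 256-entry vocabulary, and the codec, not a learned tokenizer, handles the continuous-to-discrete mapping.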

[LG-59] Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

链接: https://arxiv.org/abs/2408.08454
作者: Zohaib Khan,Muhammad Khaquan,Omer Tafveez,Agha Ali Raza
关键词-EN: revolutionized deep learning, effectively captures contextual, captures contextual information, GQA, architecture has revolutionized
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities.
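The key-norm-driven allocation can be sketched as follows; proportional rounding is an assumed scheme for illustration, and the paper's exact allocation may differ in detail:

```python
import math

def key_head_norms(key_heads):
    """L2 norm of each key head's parameter (or activation) matrix."""
    return [math.sqrt(sum(w * w for row in head for w in row)) for head in key_heads]

def allocate_queries(key_heads, num_query_heads):
    """Assign query heads to key-value groups proportionally to key-head norms
    (the KDGQA idea: larger-norm keys attract more query heads), instead of
    the uniform split used by vanilla GQA."""
    norms = key_head_norms(key_heads)
    total = sum(norms)
    # proportional allocation, with rounding fixed up to preserve the total
    alloc = [max(1, round(num_query_heads * nm / total)) for nm in norms]
    while sum(alloc) > num_query_heads:
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < num_query_heads:
        alloc[alloc.index(min(alloc))] += 1
    return alloc

# Toy key heads: 2x2 weight matrices with clearly different magnitudes.
key_heads = [[[3.0, 0.0], [0.0, 3.0]],
             [[1.0, 0.0], [0.0, 1.0]],
             [[0.5, 0.0], [0.0, 0.5]]]
alloc = allocate_queries(key_heads, num_query_heads=8)
print(alloc)  # the dominant key head receives the most query heads
```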

[LG-60] Exploring Cross-model Neuronal Correlations in the Context of Predicting Model Performance and Generalizability

链接: https://arxiv.org/abs/2408.08448
作者: Haniyeh Ehsani Oskouie,Lionel Levine,Majid Sarrafzadeh
关键词-EN: Artificial Intelligence, increasingly paramount, increasingly integrated, establish the trustworthiness, critical systems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Artificial Intelligence (AI) models are increasingly integrated into critical systems, the need for a robust framework to establish the trustworthiness of AI is increasingly paramount. While collaborative efforts have established conceptual foundations for such a framework, there remains a significant gap in developing concrete, technically robust methods for assessing AI model quality and performance. A critical drawback in the traditional methods for assessing the validity and generalizability of models is their dependence on internal developer datasets, rendering it challenging to independently assess and verify their performance claims. This paper introduces a novel approach for assessing a newly trained model’s performance based on another known model by calculating correlation between neural networks. The proposed method evaluates correlations by determining if, for each neuron in one network, there exists a neuron in the other network that produces similar output. This approach has implications for memory efficiency, allowing for the use of smaller networks when high correlation exists between networks of different sizes. Additionally, the method provides insights into robustness, suggesting that if two highly correlated networks are compared and one demonstrates robustness when operating in production environments, the other is likely to exhibit similar robustness. This contribution advances the technical toolkit for responsible AI, supporting more comprehensive and nuanced evaluations of AI models to ensure their safe and effective deployment.
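A minimal sketch of the proposed correlation check: for every neuron in one network, search the other network for a neuron producing similar outputs over a shared probe set (the toy activation traces below are assumptions):

```python
import math

def pearson(a, b):
    """Pearson correlation between two activation traces."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va > 0 and vb > 0 else 0.0

def cross_model_correlation(acts_a, acts_b):
    """For each neuron in network A (a list of activation traces over a shared
    probe set), find the best-matching neuron in network B and return the
    per-neuron maximum correlations."""
    return [max(pearson(na, nb) for nb in acts_b) for na in acts_a]

# Toy activation traces over 5 probe inputs; net B contains scaled copies of
# net A's neurons plus one unrelated neuron.
acts_a = [[0.1, 0.9, 0.3, 0.7, 0.5], [1.0, 0.2, 0.8, 0.4, 0.6]]
acts_b = [[0.2, 1.8, 0.6, 1.4, 1.0],   # 2x the first neuron of A
          [5.0, 1.0, 4.0, 2.0, 3.0],   # 5x the second neuron of A
          [0.0, 0.1, 0.0, 0.1, 0.0]]   # unrelated

scores = cross_model_correlation(acts_a, acts_b)
print(scores)  # near 1.0 for every neuron: the networks are highly correlated
```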

[LG-61] Lifelong Reinforcement Learning via Neuromodulation

链接: https://arxiv.org/abs/2408.08446
作者: Sebastian Lee,Samuel Liebana Garcia,Claudia Clopath,Will Dabney
关键词-EN: Navigating multiple tasks, Navigating multiple, multiple tasks, requires some notion, notion of adaptation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Navigating multiple tasks, for instance in succession as in continual or lifelong learning, or in distributions as in meta or multi-task learning, requires some notion of adaptation. Evolution over timescales of millennia has imbued humans and other animals with highly effective adaptive learning and decision-making strategies. Central to these functions are so-called neuromodulatory systems. In this work we introduce an abstract framework for integrating theories and evidence from neuroscience and the cognitive sciences into the design of adaptive artificial reinforcement learning algorithms. We give a concrete instance of this framework built on literature surrounding the neuromodulators Acetylcholine (ACh) and Noradrenaline (NA), and empirically validate the effectiveness of the resulting adaptive algorithm in a non-stationary multi-armed bandit problem. We conclude with a theory-based experiment proposal providing an avenue to link our framework back to efforts in experimental neuroscience.
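A minimal sketch of a neuromodulation-inspired agent in a non-stationary bandit: an NA-like surprise signal (a large prediction error) transiently boosts the learning rate, enabling rapid relearning after a change-point. The thresholds, rates, and reward scheme are assumptions for illustration, not the paper's instantiation:

```python
import random

def run(adaptive, seed=4):
    """Two-armed epsilon-greedy bandit whose reward means swap mid-run.
    The adaptive agent boosts its learning rate when the prediction error is
    large (surprise gating); the fixed agent keeps a small constant rate and
    adapts sluggishly."""
    rng = random.Random(seed)
    q = [0.5, 0.5]
    for t in range(520):
        means = [0.9, 0.1] if t < 400 else [0.1, 0.9]  # non-stationarity
        a = rng.randrange(2) if rng.random() < 0.1 else q.index(max(q))
        reward = means[a]                 # deterministic payout for clarity
        error = reward - q[a]
        alpha = 0.5 if adaptive and abs(error) > 0.3 else 0.002
        q[a] += alpha * error
    return q

q_adapt = run(adaptive=True)
q_fixed = run(adaptive=False)
print(q_adapt, q_fixed)
```

After the swap, the adaptive agent's surprise-boosted updates let it prefer the new best arm, while the fixed-rate agent's value estimates still favor the stale one.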

[LG-62] W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering

链接: https://arxiv.org/abs/2408.08444
作者: Jinming Nian,Zhiyuan Peng,Qifan Wang,Yi Fang
关键词-EN: Large Language Models, Large Language, open-domain question answering, factual answers relying, answers relying solely
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In knowledge-intensive tasks such as open-domain question answering (OpenQA), Large Language Models (LLMs) often struggle to generate factual answers relying solely on their internal (parametric) knowledge. To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources, thereby positioning the retriever as a pivotal component. Although dense retrieval demonstrates state-of-the-art performance, its training poses challenges due to the scarcity of ground-truth evidence, largely attributed to the high costs of human annotation. In this paper, we propose W-RAG by utilizing the ranking capabilities of LLMs to create weakly labeled data for training dense retrievers. Specifically, we rerank the top-K passages retrieved via BM25 by assessing the probability that LLMs will generate the correct answer based on the question and each passage. The highest-ranking passages are then used as positive training examples for dense retrieval. Our comprehensive experiments across four publicly available OpenQA datasets demonstrate that our approach enhances both retrieval and OpenQA performance compared to baseline models.
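The weak-labeling step can be sketched as follows; the LLM's answer probability is replaced here by a toy word-overlap score purely for illustration (a real system would query the LLM for the answer's log-probability):

```python
def answer_score(question, passage, answer):
    """Toy stand-in for an LLM's probability of generating `answer` given the
    question and a passage: the fraction of answer tokens found in the
    passage. Only for illustration; not an actual LLM call."""
    p_words = set(passage.lower().split())
    a_words = answer.lower().split()
    return sum(1 for w in a_words if w in p_words) / len(a_words)

def weak_label(question, answer, bm25_top_k):
    """W-RAG-style weak supervision: rerank BM25 candidates by how likely the
    (simulated) LLM is to produce the gold answer from each passage, and take
    the best one as a positive example for dense-retriever training."""
    ranked = sorted(bm25_top_k,
                    key=lambda p: answer_score(question, p, answer),
                    reverse=True)
    return ranked[0]

question = "Who wrote The Origin of Species?"
answer = "Charles Darwin"
candidates = [
    "The Galapagos Islands host many unique finches.",
    "Charles Darwin wrote On the Origin of Species in 1859.",
    "Species diversity is studied in ecology.",
]
positive = weak_label(question, answer, candidates)
print(positive)
```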

[LG-63] A semi-centralized multi-agent RL framework for efficient irrigation scheduling

链接: https://arxiv.org/abs/2408.08442
作者: Bernard T. Agyeman,Benjamin Decard-Nelson,Jinfeng Liu(University of Alberta),Sirish L. Shah
关键词-EN: Multi-Agent Reinforcement Learning, Reinforcement Learning, address spatial variability, Semi-Centralized Multi-Agent Reinforcement, spatially variable agricultural
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a Semi-Centralized Multi-Agent Reinforcement Learning (SCMARL) approach for irrigation scheduling in spatially variable agricultural fields, where management zones address spatial variability. The SCMARL framework is hierarchical in nature, with a centralized coordinator agent at the top level and decentralized local agents at the second level. The coordinator agent makes daily binary irrigation decisions based on field-wide conditions, which are communicated to the local agents. Local agents determine appropriate irrigation amounts for specific management zones using local conditions. The framework employs a state augmentation approach to handle non-stationarity in the local agents' environments. An extensive evaluation on a large-scale field in Lethbridge, Canada, compares the SCMARL approach with a learning-based multi-agent model predictive control scheduling approach, highlighting its enhanced performance, resulting in water conservation and improved Irrigation Water Use Efficiency (IWUE). Notably, the proposed approach achieved a 4.0% savings in irrigation water while enhancing the IWUE by 6.3%.

[LG-64] D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

链接: https://arxiv.org/abs/2408.08441
作者: Rafael Rafailov,Kyle Hatch,Anikait Singh,Laura Smith,Aviral Kumar,Ilya Kostrikov,Philippe Hansen-Estruch,Victor Kolev,Philip Ball,Jiajun Wu,Chelsea Finn,Sergey Levine
关键词-EN: Offline reinforcement learning, large pre-collected datasets, reinforcement learning algorithms, learning algorithms hold, dangerous real-world exploration
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: RLC 2024

点击查看摘要

Abstract:Offline reinforcement learning algorithms hold the promise of enabling data-driven RL methods that do not require costly or dangerous real-world exploration and benefit from large pre-collected datasets. This in turn can facilitate real-world applications, as well as a more standardized approach to RL research. Furthermore, offline RL methods can provide effective initializations for online finetuning to overcome challenges with exploration. However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data). While considerable progress in offline RL in recent years has been enabled by simpler benchmark tasks, the most widely used datasets are increasingly saturating in performance and may fail to reflect properties of realistic tasks. We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments, based on models of real-world robotic systems, and comprising a variety of data sources, including scripted data, play-style data collected by human teleoperators, and other data sources. Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation, with some of the tasks specifically designed to require both pre-training and fine-tuning. We hope that our proposed benchmark will facilitate further progress on both offline RL and fine-tuning algorithms. Website with code, examples, tasks, and data is available at this https URL

[LG-65] Random Gradient Masking as a Defensive Measure to Deep Leakage in Federated Learning

链接: https://arxiv.org/abs/2408.08430
作者: Joon Kim,Sejin Park
关键词-EN: machine learning models, quality machine learning, Federated Learning, individual clients’ data, producing quality machine
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 13 pages, 5 figures, to be submitted to Applied Intelligence

点击查看摘要

Abstract:Federated Learning (FL), in theory, preserves the privacy of individual clients' data while producing quality machine learning models. However, attacks such as Deep Leakage from Gradients (DLG) severely question the practicality of FL. In this paper, we empirically evaluate the efficacy of four defensive methods against DLG: Masking, Clipping, Pruning, and Noising. Masking, while previously studied only as a way to compress information during parameter transfer, shows surprisingly robust defensive utility when compared to the other three established methods. Our experimentation is two-fold. We first evaluate the minimum hyperparameter threshold for each method across the MNIST, CIFAR-10, and LFW datasets. Then, we train FL clients with each method and their minimum threshold values to investigate the trade-off between DLG defense and training performance. Results reveal that Masking and Clipping show little to no degradation in performance while obfuscating enough information to effectively defend against DLG.
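Random gradient masking itself is simple to sketch: zero out a random fraction of gradient entries before sharing them with the server, hiding information that gradient-inversion attacks such as DLG rely on (the masking fraction below is an assumed hyperparameter):

```python
import random

random.seed(5)

def mask_gradient(grad, p):
    """Zero out a random fraction p of gradient entries before transmission,
    leaving the remaining entries untouched."""
    masked = list(grad)
    k = int(p * len(grad))
    for i in random.sample(range(len(grad)), k):
        masked[i] = 0.0
    return masked

grad = [random.gauss(0, 1) for _ in range(1000)]
masked = mask_gradient(grad, p=0.4)

zeroed = sum(1 for v in masked if v == 0.0)
unchanged = sum(1 for g, m in zip(grad, masked) if m == g)
print(zeroed, unchanged)
```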

[LG-66] An Efficient and Explainable Transformer-Based Few-Shot Learning for Modeling Electricity Consumption Profiles Across Thousands of Domains

链接: https://arxiv.org/abs/2408.08399
作者: Weijie Xia,Gao Peng,Chenguang Wang,Peter Palensky,Eric Pauwels,Pedro P. Vergara
关键词-EN: Electricity Consumption Profiles, Electricity Consumption, Consumption Profiles, ECP modeling, ECP
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing numbers of various low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues or the absence of metering devices. Few-shot learning (FSL) has emerged as a promising solution for ECP modeling in data-scarce scenarios. Nevertheless, standard FSL methods, such as those used for images, are unsuitable for ECP modeling because (1) these methods usually assume several source domains with sufficient data and several target domains. However, in the context of ECP modeling, there may be thousands of source domains with a moderate amount of data and thousands of target domains. (2) Standard FSL methods usually involve cumbersome knowledge transfer mechanisms, such as pre-training and fine-tuning, whereas ECP modeling requires more lightweight methods. (3) Deep learning models often lack explainability, hindering their application in industry. This paper proposes a novel FSL method that exploits Transformers and Gaussian Mixture Models (GMMs) for ECP modeling to address the above-described issues. Results show that our method can accurately restore the complex ECP distribution with a minimal amount of ECP data (e.g., only 1.6% of the complete domain dataset) while it outperforms state-of-the-art time series modeling methods, maintaining the advantages of being both lightweight and interpretable. The project is open-sourced at this https URL.

[LG-67] Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension

链接: https://arxiv.org/abs/2408.08381
作者: Nicholas Konz,Maciej A. Mazurowski
关键词-EN: geometric properties, medical image models, recent years, generalization ability, important model behavior
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network’s hidden representations evolve through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network’s training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network’s learned representations evolves through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that the shape of this ID evolution curve differs noticeably between natural and medical image models: medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model’s learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network’s learned features are shaped by its training data.
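Intrinsic dimension of hidden representations is typically estimated from nearest-neighbor statistics; below is a compact sketch using the TwoNN estimator (Facco et al.), a common choice for this kind of analysis, though the paper may use a different estimator. The embedded 2-D toy data is an assumption:

```python
import math
import random

random.seed(6)

def two_nn_id(points):
    """TwoNN intrinsic-dimension estimator: for each point, take the ratio
    mu = r2/r1 of its two nearest-neighbor distances; the maximum-likelihood
    ID estimate is n / sum(log mu)."""
    logs = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = d[0], d[1]
        logs.append(math.log(r2 / r1))
    return len(points) / sum(logs)

def embed(u, v):
    """Linearly embed a 2-D sample into 5-D ambient space."""
    return (u, v, u + v, u - v, 0.5 * u)

# 2-D data living in a 5-D ambient space: the estimator should recover ID ~ 2.
points = [embed(random.random(), random.random()) for _ in range(400)]
est = two_nn_id(points)
print(est)
```

Requires Python 3.8+ for `math.dist`. Applied layer by layer to a network's activations, this is the kind of curve whose shape the abstract compares across natural and medical imaging domains.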

[LG-68] owards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions

链接: https://arxiv.org/abs/2408.08379
作者: Krisztian Balog,John Palowitch,Barbara Ikica,Filip Radlinski,Hamidreza Alvari,Mehdi Manshadi
关键词-EN: modern machine learning, synthetic data represents, highly private, machine learning, offering a solution
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework.

[LG-69] Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples

链接: https://arxiv.org/abs/2408.08374
作者: Anahita Samadi,Allison Sullivan
关键词-EN: machine learning systems, safety critical applications, machine learning, text-based adversarial, learning systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As machine learning systems become more widely used, especially for safety critical applications, there is a growing need to ensure that these systems behave as intended, even in the face of adversarial examples. Adversarial examples are inputs that are designed to trick the decision making process, and are intended to be imperceptible to humans. However, for text-based classification systems, changes to the input, a string of text, are always perceptible. Therefore, text-based adversarial examples instead focus on trying to preserve semantics. Unfortunately, recent work has shown this goal is often not met. To improve the quality of text-based adversarial examples, we need to know what elements of the input text are worth focusing on. To address this, in this paper, we explore which parts of speech have the highest impact on text-based classifiers. Our experiments highlight a distinct bias in CNN algorithms against certain parts of speech tokens within review datasets. This finding underscores a critical vulnerability in the linguistic processing capabilities of CNNs.

[LG-70] METR: Image Watermarking with Large Number of Unique Messages

链接: https://arxiv.org/abs/2408.08340
作者: Alexander Varlamov,Daria Diatlova,Egor Spirin
关键词-EN: led researchers, improving watermarking algorithms, focus on improving, METR, Diffusion Model
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 9 figures, code is available at this https URL

点击查看摘要

Abstract:Improvements in diffusion models have boosted the quality of image generation, which has led researchers, companies, and creators to focus on improving watermarking algorithms. This provision would make it possible to clearly identify the creators of generative art. The main challenges that modern watermarking algorithms face have to do with their ability to withstand attacks and encrypt many unique messages, such as user IDs. In this paper, we present METR: Message Enhanced Tree-Ring, which is an approach that aims to address these challenges. METR is built on the Tree-Ring watermarking algorithm, a technique that makes it possible to encode multiple distinct messages without compromising attack resilience or image quality. This ensures the suitability of this watermarking algorithm for any Diffusion Model. In order to surpass the limitations on the quantity of encoded messages, we propose METR++, an enhanced version of METR. This approach, while limited to the Latent Diffusion Model architecture, is designed to inject a virtually unlimited number of unique messages. We demonstrate its robustness to attacks and ability to encrypt many unique messages while preserving image quality, which makes METR and METR++ hold great potential for practical applications in real-world settings. Our code is available at this https URL

[LG-71] Activation Space Selectable Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2408.08338
作者: Zhuoqin Yang,Jiansong Zhang,Xiaoling Luo,Zheng Lu,Linlin Shen
关键词-EN: current artificial intelligence, natural language processing, multilayer perceptron, artificial intelligence, language processing
类目: Machine Learning (cs.LG)
*备注: 12 pages, 6 figures. The code for this work will be released soon

点击查看摘要

Abstract:The multilayer perceptron (MLP), a fundamental paradigm in current artificial intelligence, is widely applied in fields such as computer vision and natural language processing. However, the recently proposed Kolmogorov-Arnold Network (KAN), based on nonlinear additive connections, has been proven to achieve performance comparable to MLPs with significantly fewer parameters. Despite this potential, the use of a single activation function space results in reduced performance of KAN and related works across different tasks. To address this issue, we propose an activation space Selectable KAN (S-KAN). S-KAN employs an adaptive strategy to choose the possible activation mode for data at each feedforward KAN node. Our approach outperforms baseline methods in seven representative function fitting tasks and significantly surpasses MLP methods with the same level of parameters. Furthermore, we extend the structure of S-KAN and propose an activation space selectable Convolutional KAN (S-ConvKAN), which achieves leading results on four general image classification datasets. Our method mitigates the performance variability of the original KAN across different tasks and demonstrates through extensive experiments that feedforward KANs with selectable activations can achieve or even exceed the performance of MLP-based methods. This work contributes to the understanding of the data-centric design of new AI paradigms and provides a foundational reference for innovations in KAN-based network architectures.
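The selection mechanism described above can be pictured as a learnable blend over a small bank of candidate activations. The sketch below is a generic softmax-selection toy in plain Python; the activation set, logits, and blending rule are illustrative assumptions, not the paper's actual S-KAN design:

```python
import math

# Candidate activation bank for one node (an illustrative set; the
# paper's actual activation space may differ).
ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "tanh": math.tanh,
    "sin": math.sin,
}

def softmax(scores):
    # Numerically stable softmax over the selection logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def selectable_node(z, logits):
    """Blend the candidate activations with selection logits; in
    training, the logits would be learned per node, letting each
    node converge to the activation best suited to the data."""
    probs = softmax(logits)
    return sum(p * f(z) for p, f in zip(probs, ACTIVATIONS.values()))
```

With logits strongly favoring one candidate, the node behaves like that single activation; with uniform logits it averages all candidates, which is the soft starting point a learned selector would sharpen during training.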

[LG-72] Training Large-Scale Optical Neural Networks with Two-Pass Forward Propagation

链接: https://arxiv.org/abs/2408.08337
作者: Amirreza Ahmadnejad,Somayyeh Koohi
关键词-EN: input data processing, nonlinear function implementation, large input data, Two-Pass Forward Propagation, Neural Networks
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:This paper addresses the limitations in Optical Neural Networks (ONNs) related to training efficiency, nonlinear function implementation, and large input data processing. We introduce Two-Pass Forward Propagation, a novel training method that avoids specific nonlinear activation functions by modulating and re-entering error with random noise. Additionally, we propose a new way to implement convolutional neural networks using simple neural networks in integrated optical systems. Theoretical foundations and numerical results demonstrate significant improvements in training speed, energy efficiency, and scalability, advancing the potential of optical computing for complex data tasks.

[LG-73] Graph representations of 3D data for machine learning

链接: https://arxiv.org/abs/2408.08336
作者: Tomasz Prytuła
关键词-EN: machine learning algorithms, graphs and meshes, learning algorithms, give an overview, overview of combinatorial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:We give an overview of combinatorial methods to represent 3D data, such as graphs and meshes, from the viewpoint of their amenability to analysis using machine learning algorithms. We highlight pros and cons of various representations and we discuss some methods of generating/switching between the representations. We finally present two concrete applications in life science and industry. Despite its theoretical nature, our discussion is in general motivated by, and biased towards real-world challenges.

[LG-74] TurboEdit: Instant text-based image editing ECCV

链接: https://arxiv.org/abs/2408.08332
作者: Zongze Wu,Nicholas Kolkin,Jonathan Brandt,Richard Zhang,Eli Shechtman
关键词-EN: precise image inversion, address the challenges, challenges of precise, image, input image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: this https URL

点击查看摘要

Abstract:We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. The method can further control the editing strength and accepts instructive text prompts. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 function evaluations (NFEs) in inversion (a one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

[LG-75] Unleash The Power of Pre-Trained Language Models for Irregularly Sampled Time Series

链接: https://arxiv.org/abs/2408.08328
作者: Weijia Zhang,Chenlong Yin,Hao Liu,Hui Xiong
关键词-EN: Pre-trained Language Models, natural language processing, Pre-trained Language, Sampled Time Series, language processing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Pre-trained Language Models (PLMs), such as ChatGPT, have significantly advanced the field of natural language processing. This progress has inspired a series of innovative studies that explore the adaptation of PLMs to time series analysis, intending to create a unified foundation model that addresses various time series analytical tasks. However, these efforts predominantly focus on Regularly Sampled Time Series (RSTS), neglecting the unique challenges posed by Irregularly Sampled Time Series (ISTS), which are characterized by non-uniform sampling intervals and prevalent missing data. To bridge this gap, this work explores the potential of PLMs for ISTS analysis. We begin by investigating the effect of various methods for representing ISTS, aiming to maximize the efficacy of PLMs in this under-explored area. Furthermore, we present a unified PLM-based framework, ISTS-PLM, which integrates time-aware and variable-aware PLMs tailored for comprehensive intra and inter-time series modeling and includes a learnable input embedding layer and a task-specific output layer to tackle diverse ISTS analytical tasks. Extensive experiments on a comprehensive benchmark demonstrate that the ISTS-PLM, utilizing a simple yet effective series-based representation for ISTS, consistently achieves state-of-the-art performance across various analytical tasks, such as classification, interpolation, and extrapolation, as well as few-shot and zero-shot learning scenarios, spanning scientific domains like healthcare and biomechanics.

[LG-76] Accelerating Giant Impact Simulations with Machine Learning

链接: https://arxiv.org/abs/2408.08873
作者: Caleb Lammers,Miles Cranmer,Sam Hadden,Shirley Ho,Norman Murray,Daniel Tamayo
关键词-EN: Constraining planet formation, observed exoplanet population, exoplanet population requires, population requires generating, requires generating large
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures, 1 table. Easy-to-use API available at this https URL

点击查看摘要

Abstract:Constraining planet formation models based on the observed exoplanet population requires generating large samples of synthetic planetary systems, which can be computationally prohibitive. A significant bottleneck is simulating the giant impact phase, during which planetary embryos evolve gravitationally and combine to form planets, which may themselves experience later collisions. To accelerate giant impact simulations, we present a machine learning (ML) approach to predicting collisional outcomes in multiplanet systems. Trained on more than 500,000 N -body simulations of three-planet systems, we develop an ML model that can accurately predict which two planets will experience a collision, along with the state of the post-collision planets, from a short integration of the system’s initial conditions. Our model greatly improves on non-ML baselines that rely on metrics from dynamics theory, which struggle to accurately predict which pair of planets will experience a collision. By combining with a model for predicting long-term stability, we create an efficient ML-based giant impact emulator, which can predict the outcomes of giant impact simulations with a speedup of up to four orders of magnitude. We expect our model to enable analyses that would not otherwise be computationally feasible. As such, we release our full training code, along with an easy-to-use API for our collision outcome model and giant impact emulator.

[LG-77] HistoGym: A Reinforcement Learning Environment for Histopathological Image Analysis

链接: https://arxiv.org/abs/2408.08847
作者: Zhi-Bo Liu,Xiaobo Pang,Jizhao Wang,Shuai Liu,Chen Li
关键词-EN: decision-making process based, pathological research, critically important, pathological images, based on pathological
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In pathological research, education, and clinical practice, the decision-making process based on pathological images is critically important. This significance extends to digital pathology image analysis: its adequacy is demonstrated by the extensive information contained within tissue structures, which is essential for accurate cancer classification and grading. Additionally, its necessity is highlighted by the inherent requirement for interpretability in the conclusions generated by algorithms. For humans, determining tumor type and grade typically involves multi-scale analysis, which presents a significant challenge for AI algorithms. Traditional patch-based methods are inadequate for modeling such complex structures, as they fail to capture the intricate, multi-scale information inherent in whole slide images. Consequently, there is a pressing need for advanced AI techniques capable of efficiently and accurately replicating this complex analytical process. To address this issue, we introduce HistoGym, an open-source reinforcement learning environment for histopathological image analysis. Following OpenAI Gym APIs, HistoGym aims to foster whole slide image diagnosis by mimicking the real-life processes of doctors. Leveraging the pyramid feature of WSIs and the OpenSlide API, HistoGym provides a unified framework for various clinical tasks, including tumor detection and classification. We detail the observation, action, and reward specifications tailored for the histopathological image analysis domain and provide an open-source Python-based interface for both clinicians and researchers. To accommodate different clinical demands, we offer various scenarios for different organs and cancers, including both WSI-based and selected region-based scenarios, showcasing several noteworthy results.

[LG-78] Shapley Marginal Surplus for Strong Models

链接: https://arxiv.org/abs/2408.08845
作者: Daniel de Marchi,Michael Kosorok,Scott de Marchi
关键词-EN: explain model predictions, Shapley Marginal Surplus, machine learning, DGP, machine learning models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Shapley values have seen widespread use in machine learning as a way to explain model predictions and estimate the importance of covariates. Accurately explaining models is critical in real-world models to both aid in decision making and to infer the properties of the true data-generating process (DGP). In this paper, we demonstrate that while model-based Shapley values might be accurate explainers of model predictions, machine learning models themselves are often poor explainers of the DGP even if the model is highly accurate. Particularly in the presence of interrelated or noisy variables, the output of a highly predictive model may fail to account for these relationships. This implies explanations of a trained model’s behavior may fail to provide meaningful insight into the DGP. In this paper we introduce a novel variable importance algorithm, Shapley Marginal Surplus for Strong Models, that samples the space of possible models to come up with an inferential measure of feature importance. We compare this method to other popular feature importance methods, both Shapley-based and non-Shapley based, and demonstrate significant outperformance in inferential capabilities relative to other methods.
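For intuition, exact Shapley values for a tiny feature set can be computed directly from the coalition-weight formula that underlies all Shapley-based importance methods. The toy value function below (with an interaction term) is purely illustrative; it is not the paper's sampling-based Shapley Marginal Surplus estimator:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: the weighted average marginal
    contribution of each feature over all coalitions of the rest."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for s in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(s) | {f}) - value(set(s)))
        phi[f] = total
    return phi

# Toy value function with an interaction between x1 and x2
# (made up for illustration).
def v(coalition):
    score = 0.0
    if "x1" in coalition:
        score += 2.0
    if "x2" in coalition:
        score += 1.0
    if {"x1", "x2"} <= coalition:
        score += 0.5  # interaction term, split between x1 and x2
    return score

phi = shapley_values(["x1", "x2"], v)
```

The interaction's 0.5 gets split evenly between the two features (phi["x1"] = 2.25, phi["x2"] = 1.25), and the values sum to v of the full coalition (the efficiency property); the paper's point is that when the model, not the data-generating process, defines `value`, these attributions explain the model rather than the DGP.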

[LG-79] Misclassification excess risk bounds for PAC-Bayesian classification via convexified loss

链接: https://arxiv.org/abs/2408.08675
作者: TheTien Mai
关键词-EN: deriving generalization bounds, learning algorithms, machine learning, valuable tool, tool for deriving
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:PAC-Bayesian bounds have proven to be a valuable tool for deriving generalization bounds and for designing new learning algorithms in machine learning. However, these bounds typically focus on providing generalization guarantees with respect to a chosen loss function. In classification tasks, due to the non-convex nature of the 0-1 loss, a convex surrogate loss is often used, and thus current PAC-Bayesian bounds are primarily specified for this convex surrogate. This work shifts the focus to providing misclassification excess risk bounds for PAC-Bayesian classification when using a convex surrogate loss. Our key ingredient here is to leverage PAC-Bayesian relative bounds in expectation rather than relying on PAC-Bayesian bounds in probability. We demonstrate our approach in several important applications.

[LG-80] A new perspective on Bayesian Operational Modal Analysis

链接: https://arxiv.org/abs/2408.08664
作者: Brandon J. O’Connell,Max D. Champneys,Timothy J. Rogers
关键词-EN: Bayesian OMA, state of aerospace, offshore and civil, Bayesian, assess the current
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In the field of operational modal analysis (OMA), obtained modal information is frequently used to assess the current state of aerospace, mechanical, offshore and civil structures. However, the stochasticity of operational systems and the lack of forcing information can lead to inconsistent results. Quantifying the uncertainty of the recovered modal parameters through OMA is therefore of significant value. In this article, a new perspective on Bayesian OMA is proposed: a Bayesian stochastic subspace identification (SSI) algorithm. Distinct from existing approaches to Bayesian OMA, a hierarchical probabilistic model is embedded at the core of covariance-driven SSI. Through substitution of canonical correlation analysis with a Bayesian equivalent, posterior distributions over the modal properties are obtained. Two inference schemes are presented for the proposed Bayesian formulation: Markov Chain Monte Carlo and variational Bayes. Two case studies are then explored. The first is benchmark study using data from a simulated, multi degree-of-freedom, linear system. Following application of Bayesian SSI, it is shown that the same posterior is targeted and recovered by both inference schemes, with good agreement between the posterior mean and the conventional SSI result. The second study applies the variational form to data obtained from an in-service structure: The Z24 bridge. The results of this study are presented at single model orders, and then using a stabilisation diagram. The recovered posterior uncertainty is presented and compared to the classic SSI result. It is observed that the posterior distributions with mean values coinciding with the natural frequencies exhibit lower variance than values situated away from the natural frequencies.

[LG-81] Modeling the Neonatal Brain Development Using Implicit Neural Representations MICCAI2024

链接: https://arxiv.org/abs/2408.08647
作者: Florentin Bieder,Paul Friedrich,Hélène Corbaz,Alicia Durrer,Julia Wolleb,Philippe C. Cattin
关键词-EN: brain undergoes rapid, undergoes rapid development, human brain undergoes, trimester of pregnancy, undergoes rapid
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint, Accepted for PRIME MICCAI 2024

点击查看摘要

Abstract:The human brain undergoes rapid development during the third trimester of pregnancy. In this work, we model the neonatal development of the infant brain in this age range. As a basis, we use MR images of preterm- and term-birth neonates from the developing human connectome project (dHCP). We propose a neural network, specifically an implicit neural representation (INR), to predict 2D- and 3D images of varying time points. In order to model a subject-specific development process, it is necessary to disentangle the age from the subjects’ identity in the latent space of the INR. We propose two methods, Subject Specific Latent Vectors (SSL) and Stochastic Global Latent Augmentation (SGLA), enabling this disentanglement. We perform an analysis of the results and compare our proposed model to an age-conditioned denoising diffusion model as a baseline. We also show that our method can be applied in a memory-efficient way, which is especially important for 3D data.

[LG-82] Solving The Quantum Many-Body Hamiltonian Learning Problem with Neural Differential Equations

链接: https://arxiv.org/abs/2408.08639
作者: Timothy Heightman,Edward Jiang,Antonio Acín
关键词-EN: exponential complexity required, Understanding and characterising, significant challenge due, accurately track states, characterising quantum many-body
类目: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding and characterising quantum many-body dynamics remains a significant challenge due to both the exponential complexity required to represent quantum many-body Hamiltonians, and the need to accurately track states in time under the action of such Hamiltonians. This inherent complexity limits our ability to characterise quantum many-body systems, highlighting the need for innovative approaches to unlock their full potential. To address this challenge, we propose a novel method to solve the Hamiltonian Learning (HL) problem-inferring quantum dynamics from many-body state trajectories-using Neural Differential Equations combined with an Ansatz Hamiltonian. Our method is reliably convergent, experimentally friendly, and interpretable, making it a stable solution for HL on a set of Hamiltonians previously unlearnable in the literature. In addition to this, we propose a new quantitative benchmark based on power laws, which can objectively compare the reliability and generalisation capabilities of any two HL algorithms. Finally, we benchmark our method against state-of-the-art HL algorithms with a 1D spin-1/2 chain proof of concept.

[LG-83] Linear combinations of latents in diffusion models: interpolation and beyond

链接: https://arxiv.org/abs/2408.08558
作者: Erik Bodin,Henry Moss,Carl Henrik Ek
关键词-EN: Continuous Normalizing Flows, synthesis and augmentation, Flow Matching, Normalizing Flows, Matching and Continuous
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative models are crucial for applications like data synthesis and augmentation. Diffusion, Flow Matching and Continuous Normalizing Flows have shown effectiveness across various modalities, and rely on Gaussian latent variables for generation. As any generated object is directly associated with a particular latent variable, we can manipulate the variables to exert control over the generation process. However, standard approaches for combining latent variables, such as spherical interpolation, only apply or work well in special cases. Moreover, current methods for obtaining low-dimensional representations of the data, important for e.g. surrogate models for search and creative applications, are network and data modality specific. In this work we show that the standard methods to combine variables do not yield intermediates following the distribution the models are trained to expect. We propose Combination of Gaussian variables (COG), a novel interpolation method that addresses this, is easy to implement yet matches or improves upon current methods. COG addresses linear combinations in general and, as we demonstrate, also supports other operations including e.g. defining subspaces of the latent space, simplifying the creation of expressive low-dimensional spaces of high-dimensional objects using generative models based on Gaussian latents.
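The core issue can be seen with a two-line experiment: naively averaging two standard-normal latents shrinks their scale, pushing the result off the prior the model was trained on, while rescaling the weights restores it. The sketch below (plain Python, synthetic vectors) only illustrates that variance-preservation idea under Gaussian assumptions; COG itself is defined in the paper:

```python
import random
from statistics import pstdev

random.seed(0)
dim = 10000
z1 = [random.gauss(0, 1) for _ in range(dim)]
z2 = [random.gauss(0, 1) for _ in range(dim)]

# Naive midpoint: 0.5*z1 + 0.5*z2 has std sqrt(0.25 + 0.25) ~ 0.707,
# i.e. it no longer follows the unit-Gaussian prior.
naive = [0.5 * a + 0.5 * b for a, b in zip(z1, z2)]

# Variance-preserving combination: divide by the l2-norm of the
# weights so the linear combination is again standard Gaussian.
s = (0.5 ** 2 + 0.5 ** 2) ** 0.5
cog = [(0.5 * a + 0.5 * b) / s for a, b in zip(z1, z2)]
```

The same weight-rescaling argument extends to any linear combination of independent standard Gaussians, which is why it supports more than pairwise interpolation (e.g. averaging many latents or projecting onto latent subspaces).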

[LG-84] Unsupervised Transfer Learning via Adversarial Contrastive Training

链接: https://arxiv.org/abs/2408.08533
作者: Chenguang Duan,Yuling Jiao,Huazhen Lin,Wensen Ma,Jerry Zhijian Yang
关键词-EN: critical and challenging, downstream supervised learning, supervised learning tasks, Learning, supervised learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning a data representation for downstream supervised learning tasks under unlabeled scenario is both critical and challenging. In this paper, we propose a novel unsupervised transfer learning approach using adversarial contrastive training (ACT). Our experimental results demonstrate outstanding classification accuracy with both fine-tuned linear probe and K-NN protocol across various datasets, showing competitiveness with existing state-of-the-art self-supervised learning methods. Moreover, we provide an end-to-end theoretical guarantee for downstream classification tasks in a misspecified, over-parameterized setting, highlighting how a large amount of unlabeled data contributes to prediction accuracy. Our theoretical findings suggest that the testing error of downstream tasks depends solely on the efficiency of data augmentation used in ACT when the unlabeled sample size is sufficiently large. This offers a theoretical understanding of learning downstream tasks with a small sample size.

[LG-85] Enhancing Events in Neutrino Telescopes through Deep Learning-Driven Super-Resolution

链接: https://arxiv.org/abs/2408.08474
作者: Felix J. Yu,Nicholas Kamp,Carlos A. Argüelles
关键词-EN: IceCube Neutrino Observatory, infer physical quantities, Recent discoveries, Neutrino Observatory, photon hits detected
类目: High Energy Physics - Experiment (hep-ex); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 5+1 pages, 4+1 figures

点击查看摘要

Abstract:Recent discoveries by neutrino telescopes, such as the IceCube Neutrino Observatory, relied extensively on machine learning (ML) tools to infer physical quantities from the raw photon hits detected. Neutrino telescope reconstruction algorithms are limited by the sparse sampling of photons by the optical modules due to the relatively large spacing (~10-100 m) between them. In this letter, we propose a novel technique that learns photon transport through the detector medium through the use of deep learning-driven super-resolution of data events. These "improved" events can then be reconstructed using traditional or ML techniques, resulting in improved resolution. Our strategy arranges additional "virtual" optical modules within an existing detector geometry and trains a convolutional neural network to predict the hits on these virtual optical modules. We show that this technique improves the angular reconstruction of muons in a generic ice-based neutrino telescope. Our results readily extend to water-based neutrino telescopes and other event morphologies.

[LG-86] Efficient Data-Sketches and Fine-Tuning for Early Detection of Distributional Drift in Medical Imaging

链接: https://arxiv.org/abs/2408.08456
作者: Yusen Wu,Hao Chen,Alex Pissinou Makki,Phuong Nguyen,Yelena Yesha
关键词-EN: underlying data distribution, treatment decisions, Distributional drift, Distributional drift detection, detect distributional drift
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributional drift detection is important in medical applications as it helps ensure the accuracy and reliability of models by identifying changes in the underlying data distribution that could affect diagnostic or treatment decisions. However, current methods have limitations in detecting drift; for example, the inclusion of abnormal datasets can lead to unfair comparisons. This paper presents an accurate and sensitive approach to detect distributional drift in CT-scan medical images by leveraging data-sketching and fine-tuning techniques. We developed a robust baseline library model for real-time anomaly detection, allowing for efficient comparison of incoming images and identification of anomalies. Additionally, we fine-tuned a vision transformer pre-trained model to extract relevant features, using breast cancer images as an example, significantly enhancing model accuracy to 99.11%. Combining data-sketching with fine-tuning, our feature extraction evaluation demonstrated that cosine similarity scores between similar datasets improved substantially, increasing from around 50% to 100%. Finally, the sensitivity evaluation shows that our solution is highly sensitive to even 1% salt-and-pepper and speckle noise, and insensitive to lighting noise (i.e., lighting conditions have no impact on data drift). The proposed methods offer a scalable and reliable solution for maintaining the accuracy of diagnostic models in dynamic clinical environments.
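The comparison step of such a pipeline reduces to scoring incoming feature vectors against a baseline and flagging drift below a similarity threshold. The sketch below is a minimal cosine-similarity drift check; the feature vectors and the 0.9 threshold are made-up placeholders, not values from the paper:

```python
from math import sqrt

def cosine_similarity(u, v):
    # Standard cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical feature vectors: one extracted from the baseline
# library, two from incoming batches (values are illustrative only).
baseline = [0.9, 0.1, 0.4, 0.2]
similar_batch = [0.88, 0.12, 0.41, 0.19]  # same distribution
drifted_batch = [0.1, 0.9, 0.2, 0.7]      # shifted distribution

# Flag drift when similarity to the baseline drops below a threshold.
THRESHOLD = 0.9
```

In a real pipeline the vectors would come from the fine-tuned feature extractor, and the threshold would be calibrated on held-out in-distribution data rather than fixed by hand.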

[LG-87] Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts

链接: https://arxiv.org/abs/2408.08432
作者: Abdur R. Fayjie,Jutika Borah,Florencia Carbone,Jan Tack,Patrick Vandewalle
关键词-EN: shown tremendous progress, predictive uncertainty, shown tremendous, tremendous progress, wide range
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Deep learning has shown tremendous progress in a wide range of digital pathology and medical image classification tasks. Its integration into safe clinical decision-making support requires robust and reliable models. However, real-world data comes with diversities that often lie outside the intended source distribution. Moreover, when test samples are dramatically different, clinical decision-making is greatly affected. Quantifying predictive uncertainty in models is crucial for well-calibrated predictions and determining when (or not) to trust a model. Unfortunately, many works have overlooked the importance of predictive uncertainty estimation. This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems. We investigate the effect of various carcinoma distribution shift scenarios on predictive performance and calibration. We first systematically investigate three popular methods for improving predictive uncertainty: Monte Carlo dropout, deep ensemble, and few-shot learning on lung adenocarcinoma classification as a primary disease in whole slide images. Secondly, we compare the effectiveness of the methods in terms of performance and calibration under clinically relevant distribution shifts such as in-distribution shifts comprising primary disease sub-types and other characterization analysis data; out-of-distribution shifts comprising well-differentiated cases, different organ origin, and imaging modality shifts. While studies on uncertainty estimation exist, to our best knowledge, no rigorous large-scale benchmark compares predictive uncertainty estimation including these dataset shifts for lung carcinoma classification.
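Of the three methods compared, Monte Carlo dropout is the simplest to sketch: keep dropout active at test time and read the spread of repeated stochastic forward passes as predictive uncertainty. The toy one-layer "network" below (weights, input, and dropout rate are all made up) only illustrates that idea, not the paper's models:

```python
import random
from statistics import mean, pstdev

random.seed(1)

# Toy linear "network" with dropout on the inputs.
weights = [0.8, -0.5, 0.3, 0.6]
p_drop = 0.5

def stochastic_forward(x):
    # Monte Carlo dropout: keep each input with prob 1 - p_drop,
    # rescale the survivors (inverted dropout), then weight-sum.
    kept = [(xi / (1 - p_drop)) if random.random() > p_drop else 0.0
            for xi in x]
    return sum(w * k for w, k in zip(weights, kept))

x = [1.0, 2.0, 0.5, 1.5]
samples = [stochastic_forward(x) for _ in range(1000)]
pred_mean, pred_std = mean(samples), pstdev(samples)
```

A large `pred_std` relative to the decision margin is the cue to abstain or defer to a clinician; deep ensembles replace the dropout masks with independently trained models but aggregate predictions the same way.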

[LG-88] Phononic materials with effectively scale-separated hierarchical features using interpretable machine learning

链接: https://arxiv.org/abs/2408.08428
作者: Mary V. Bastawrous,Zhi Chen,Alexander C. Ogren,Chiara Daraio,Cynthia Rudin,L. Catherine Brinson
关键词-EN: Manipulating the dispersive, high-precision instruments, multiple frequency ranges, dispersive characteristics, characteristics of vibrational
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Manipulating the dispersive characteristics of vibrational waves is beneficial for many applications, e.g., high-precision instruments. Architected hierarchical phononic materials have shown promise for tunability of elastodynamic waves and vibrations over multiple frequency ranges. In this article, hierarchical unit-cells are obtained, where features at each length scale result in a band gap within a targeted frequency range. Our novel approach, the "hierarchical unit-cell template method," is an interpretable machine-learning approach that uncovers global unit-cell shape/topology patterns corresponding to predefined band-gap objectives. A scale-separation effect is observed where the coarse-scale band-gap objective is mostly unaffected by the fine-scale features despite the closeness of their length scales, thus enabling an efficient hierarchical algorithm. Moreover, the hierarchical patterns revealed are not predefined or self-similar hierarchies, as is common in current hierarchical phononic materials. Thus, our approach offers a flexible and efficient method for the exploration of new regions in the hierarchical design space, extracting minimal effective patterns for inverse design in applications targeting multiple frequency ranges.

[LG-89] Classification of High-dimensional Time Series in Spectral Domain using Explainable Features

链接: https://arxiv.org/abs/2408.08388
作者: Sarbojit Roy,Malik Shahid Sultan,Hernando Ombao
关键词-EN: presents significant challenges, series presents significant, time series presents, high dimensions, Interpretable classification
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Interpretable classification of time series presents significant challenges in high dimensions. Traditional feature selection methods in the frequency domain often assume sparsity in spectral density matrices (SDMs) or their inverses, which can be restrictive for real-world applications. In this article, we propose a model-based approach for classifying high-dimensional stationary time series by assuming sparsity in the difference between inverse SDMs. Our approach emphasizes the interpretability of model parameters, making it especially suitable for fields like neuroscience, where understanding differences in brain network connectivity across various states is crucial. The estimators for model parameters demonstrate consistency under appropriate conditions. We further propose using standard deep learning optimizers for parameter estimation, employing techniques such as mini-batching and learning rate scheduling. Additionally, we introduce a method to screen the most discriminatory frequencies for classification, which exhibits the sure screening property under general conditions. The flexibility of the proposed model allows the significance of covariates to vary across frequencies, enabling nuanced inferences and deeper insights into the underlying problem. The novelty of our method lies in the interpretability of the model parameters, addressing critical needs in neuroscience. The proposed approaches have been evaluated on simulated examples and the 'Alert-vs-Drowsy' EEG dataset.
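The idea of screening discriminatory frequencies can be illustrated with a toy version: estimate each class's spectrum and rank frequencies by the between-class difference. The sketch below uses a plain periodogram on synthetic univariate data; the paper's actual estimator, model, and screening statistic are more sophisticated, so treat this only as a picture of the concept.

```python
import numpy as np

rng = np.random.default_rng(1)
n, fs = 256, 128                      # samples per trial, sampling rate (toy values)

# Two toy classes: class B carries extra power near 10 Hz.
t = np.arange(n) / fs
trials_a = rng.normal(size=(30, n))
trials_b = rng.normal(size=(30, n)) + 2.0 * np.sin(2 * np.pi * 10 * t)

def mean_periodogram(trials):
    """Average periodogram across trials as a crude spectral estimate."""
    spec = np.abs(np.fft.rfft(trials, axis=1)) ** 2 / n
    return spec.mean(axis=0)

freqs = np.fft.rfftfreq(n, d=1 / fs)
# Score each frequency by the absolute between-class power difference,
# then keep the top-scoring frequencies for classification.
score = np.abs(mean_periodogram(trials_a) - mean_periodogram(trials_b))
top = freqs[np.argsort(score)[::-1][:3]]
print(top)
```

On this synthetic example the planted 10 Hz band dominates the screening score, which is exactly the behavior a sure-screening procedure aims to guarantee for truly discriminative frequencies.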

[LG-90] Exploring Latent Space for Generating Peptide Analogs Using Protein Language Models

链接: https://arxiv.org/abs/2408.08341
作者: Po-Yu Liang,Xueting Huang,Tibo Duran,Andrew J. Wiemer,Jun Bai
关键词-EN: discovery and biotechnology, Generating peptides, crucial for drug, drug discovery, Generating
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating peptides with desired properties is crucial for drug discovery and biotechnology. Traditional sequence-based and structure-based methods often require extensive datasets, which limits their effectiveness. In this study, we proposed a novel method that utilized autoencoder-shaped models to explore the protein embedding space and generate novel peptide analogs by leveraging protein language models. The proposed method requires only a single sequence of interest, avoiding the need for large datasets. Our results show significant improvements over baseline models in similarity indicators of peptide structures, descriptors and bioactivities. The proposed method, validated through molecular dynamics simulations on TIGIT inhibitors, produces peptide analogs with similar yet distinct properties, highlighting its potential to enhance peptide screening processes.

[LG-91] SepAl: Sepsis Alerts On Low Power Wearables With Digital Biomarkers and On-Device Tiny Machine Learning

链接: https://arxiv.org/abs/2408.08316
作者: Marco Giordano,Kanika Dheman,Michele Magno
关键词-EN: infection and claims, million lives, year globally, lethal syndrome, lives per year
类目: Tissues and Organs (q-bio.TO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sepsis is a lethal syndrome of organ dysfunction that is triggered by an infection and claims 11 million lives per year globally. Prognostic algorithms based on deep learning have shown promise in detecting the onset of sepsis hours before the actual event but use a large number of bio-markers, including vital signs and laboratory tests. The latter makes the deployment of such systems outside hospitals or in resource-limited environments extremely challenging. This paper introduces SepAl, an energy-efficient and lightweight neural network, using only data from low-power wearable sensors, such as photoplethysmography (PPG), inertial measurement units (IMU), and body temperature sensors, designed to deliver alerts in real-time. SepAl leverages only six digitally acquirable vital signs and tiny machine learning algorithms, enabling on-device real-time sepsis prediction. SepAl uses a lightweight temporal convolution neural network capable of providing sepsis alerts with a median predicted time to sepsis of 9.8 hours. The model has been fully quantized, being able to be deployed on any low-power processors, and evaluated on an ARM Cortex-M33 core. Experimental evaluations show an inference efficiency of 0.11MAC/Cycle and a latency of 143ms, with an energy per inference of 2.68mJ. This work aims at paving the way toward accurate disease prediction, deployable in a long-lasting multi-vital sign wearable device, suitable for providing sepsis onset alerts at the point of care. The code used in this work has been open-sourced and is available at this https URL (related DOI: https://doi.org/10.1109/JSEN.2024.3424655).
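The building block of the temporal convolution network mentioned above, a dilated causal convolution, can be sketched in a few lines. The layer count, kernel sizes, and weights below are toy values, not SepAl's architecture; the point is only the causal structure, where the output at time t never looks at future samples.

```python
import numpy as np

rng = np.random.default_rng(2)

def causal_conv1d(x, w, dilation=1):
    """Dilated causal 1D convolution: the output at step t sees only x[<= t],
    achieved by left-padding the input with zeros."""
    k, pad = len(w), (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

# Two stacked dilated layers: a miniature stand-in for a full TCN.
w1, w2 = rng.normal(size=3), rng.normal(size=3)

def tiny_tcn(x):
    h = np.maximum(causal_conv1d(x, w1, dilation=1), 0.0)   # ReLU
    return causal_conv1d(h, w2, dilation=2)                 # wider receptive field

x = rng.normal(size=32)          # one vital-sign channel, toy length
y = tiny_tcn(x)
print(y.shape)
```

Stacking layers with growing dilation widens the receptive field exponentially while keeping the per-step multiply count small, which is what makes this family of models attractive on microcontroller-class hardware.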

信息检索

[IR-0] EasyRec: Simple yet Effective Language Models for Recommendation

链接: https://arxiv.org/abs/2408.08821
作者: Xubin Ren,Chao Huang
关键词-EN: Deep neural networks, user-item interaction data, Deep neural, neural networks, powerful technique
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep neural networks have become a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which limits their ability to perform well in practical zero-shot learning scenarios where sufficient training data may be unavailable. Inspired by the success of language models (LMs) and their strong generalization capabilities, a crucial question arises: How can we harness the potential of language models to empower recommender systems and elevate its generalization capabilities to new heights? In this study, we propose EasyRec - an effective and easy-to-use approach that seamlessly integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework, which combines contrastive learning with collaborative language model tuning, to ensure a strong alignment between the text-enhanced semantic space and the collaborative behavior information. Extensive empirical evaluations across diverse real-world datasets demonstrate the superior performance of EasyRec compared to state-of-the-art alternative models, particularly in the challenging text-based zero-shot recommendation scenarios. Furthermore, the study highlights the potential of seamlessly integrating EasyRec as a plug-and-play component into text-enhanced collaborative filtering frameworks, thereby empowering existing recommender systems to elevate their recommendation performance and adapt to the evolving user preferences in dynamic environments. For better result reproducibility of our EasyRec framework, the model implementation details, source code, and datasets are available at the link: this https URL.
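The text-behavior alignment idea, pulling a text-derived embedding and a collaborative embedding of the same user or item together while pushing mismatched pairs apart, is commonly implemented with an InfoNCE-style contrastive loss. The following is a generic numpy sketch under that assumption; the actual EasyRec objective may differ in detail, and all embeddings and sizes here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
B, d = 8, 16                              # batch size and embedding dim (toy values)

text_emb = rng.normal(size=(B, d))                    # stand-in for LM text embeddings
cf_emb = text_emb + 0.1 * rng.normal(size=(B, d))     # collaborative embeddings, roughly aligned

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, tau=0.1):
    """Contrastive alignment loss: matched (text, behavior) pairs on the
    diagonal are positives; every other pair in the batch is a negative."""
    logits = l2norm(a) @ l2norm(b).T / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()

loss = info_nce(text_emb, cf_emb)
print(loss)
```

Minimizing this loss makes the text-enhanced semantic space and the collaborative behavior space agree on which rows belong together, which is what enables zero-shot recommendation from text alone.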

[IR-1] Beyond KAN: Introducing KarSein for Adaptive High-Order Feature Interaction Modeling in CTR Prediction

链接: https://arxiv.org/abs/2408.08713
作者: Yunxiao Shi,Wujiang Wu,Mingyu Jin,Haimin Zhang,Qiang Wu,Yongfeng Zhang,Min Xu
关键词-EN: click-through rate, crucial for click-through, high-order explicit interactions, modeling high-order, interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: KarSein for CTR

点击查看摘要

Abstract:Modeling feature interactions is crucial for click-through rate (CTR) prediction, particularly when it comes to high-order explicit interactions. Traditional methods struggle with this task because they often predefine a maximum interaction order, which relies heavily on prior knowledge and can limit the model’s effectiveness. Additionally, modeling high-order interactions typically leads to increased computational costs. Therefore, the challenge lies in adaptively modeling high-order feature interactions while maintaining efficiency. To address this issue, we introduce Kolmogorov-Arnold Represented Sparse Efficient Interaction Network (KarSein), designed to optimize both predictive accuracy and computational efficiency. We firstly identify limitations of directly applying Kolmogorov-Arnold Networks (KAN) to CTR and then introduce KarSein to overcome these issues. It features a novel architecture that reduces the computational costs of KAN and supports embedding vectors as feature inputs. Additionally, KarSein employs guided symbolic regression to address the challenge of KAN in spontaneously learning multiplicative relationships. Extensive experiments demonstrate KarSein’s superior performance, achieving significant predictive accuracy with minimal computational overhead. Furthermore, KarSein maintains strong global explainability while enabling the removal of redundant features, resulting in a sparse network structure. These advantages also position KarSein as a promising method for efficient inference.

[IR-2] Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

链接: https://arxiv.org/abs/2408.08709
作者: Lei Hei,Ning An,Tingjing Liao,Qi Ma,Jiaqi Wang,Feiliang Ren
关键词-EN: realistic knowledge graphs, Multimodal Relation Extraction, knowledge graphs, crucial for constructing, constructing flexible
类目: Information Retrieval (cs.IR)
*备注: 15 pages, 7 figures, preprint

点击查看摘要

Abstract:Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type with entity pairs present in different modalities, such as one entity in the text and another in the image. However, existing approaches require entities and objects given beforehand, which is costly and impractical. To address the limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified a multimodal relation extraction dataset MORE, which includes 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces error accumulation due to the pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms the existing baselines by 8.06% and achieves state-of-the-art performance.

[IR-3] SC-Rec: Enhancing Generative Retrieval with Self-Consistent Reranking for~Sequential Recommendation

链接: https://arxiv.org/abs/2408.08686
作者: Tongyoung Kim,Soojin Yoon,Seongku Kang,Jinyoung Yeo,Dongha Lee
关键词-EN: advanced language understanding, advanced language, language understanding, generation capabilities, recommendation systems due
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language Models (LMs) are increasingly employed in recommendation systems due to their advanced language understanding and generation capabilities. Recent recommender systems based on generative retrieval have leveraged the inferential abilities of LMs to directly generate the index tokens of the next item, based on item sequences within the user’s interaction history. Previous studies have mostly focused on item indices based solely on textual semantic or collaborative information. However, although the standalone effectiveness of these aspects has been demonstrated, the integration of this information has remained unexplored. Our in-depth analysis finds that there is a significant difference in the knowledge captured by the model from heterogeneous item indices and diverse input prompts, which can have a high potential for complementarity. In this paper, we propose SC-Rec, a unified recommender system that learns diverse preference knowledge from two distinct item indices and multiple prompt templates. Furthermore, SC-Rec adopts a novel reranking strategy that aggregates a set of ranking results, inferred based on different indices and prompts, to achieve the self-consistency of the model. Our empirical evaluation on three real-world datasets demonstrates that SC-Rec considerably outperforms the state-of-the-art methods for sequential recommendation, effectively incorporating complementary knowledge from varied outputs of the model.
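Aggregating the ranking results produced by different (item index, prompt template) combinations can be done with any rank-fusion rule. The sketch below uses simple Borda counting as a stand-in, since the abstract does not fully specify SC-Rec's own reranking strategy; the item names and lists are invented.

```python
from collections import defaultdict

def aggregate_rankings(ranked_lists, top_k=3):
    """Borda-style fusion: each list votes, items ranked higher get more
    points; ties fall back to item id for a deterministic order."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos          # top of a list -> most points
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_k]

# Rankings from different (index, prompt) combinations for one user.
lists = [
    ["itemA", "itemB", "itemC", "itemD"],
    ["itemB", "itemA", "itemD", "itemC"],
    ["itemA", "itemC", "itemB", "itemD"],
]
print(aggregate_rankings(lists))
```

Items that the model ranks highly regardless of which index or prompt was used rise to the top, which is one concrete way to operationalize "self-consistency" across heterogeneous outputs.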

[IR-4] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction CIKM2024

链接: https://arxiv.org/abs/2408.08585
作者: Yunpeng Weng,Xing Tang,Zhenhao Xu,Fuyuan Lyu,Dugang Liu,Zexu Sun,Xiuqiang He
关键词-EN: CLTV, Customer Lifetime, distribution, critical task, business applications
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:Customer Lifetime Value (CLTV) prediction is a critical task in business applications. Accurately predicting CLTV is challenging in real-world business scenarios, as the distribution of CLTV is complex and mutable. Firstly, there is a large number of users without any consumption consisting of a long-tailed part that is too complex to fit. Secondly, the small set of high-value users spent orders of magnitude more than a typical user leading to a wide range of the CLTV distribution which is hard to capture in a single distribution. Existing approaches for CLTV estimation either assume a prior probability distribution and fit a single group of distribution-related parameters for all samples, or directly learn from the posterior distribution with manually predefined buckets in a heuristic manner. However, all these methods fail to handle complex and mutable distributions. In this paper, we propose a novel optimal distribution selection model OptDist for CLTV prediction, which utilizes an adaptive optimal sub-distribution selection mechanism to improve the accuracy of complex distribution modeling. Specifically, OptDist trains several candidate sub-distribution networks in the distribution learning module (DLM) for modeling the probability distribution of CLTV. Then, a distribution selection module (DSM) is proposed to select the sub-distribution for each sample, thus making the selection automatically and adaptively. Besides, we design an alignment mechanism that connects both modules, which effectively guides the optimization. We conduct extensive experiments on two public datasets and one private dataset to verify that OptDist outperforms state-of-the-art baselines. Furthermore, OptDist has been deployed on a large-scale financial platform for customer acquisition marketing campaigns and the online experiments also demonstrate the effectiveness of OptDist.
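The core selection step, picking per sample the sub-distribution that best explains it, can be illustrated with fixed toy Gaussians standing in for the learned sub-distribution networks. This is a hard, likelihood-based version of the idea; OptDist's actual selection module is learned jointly with the distributions.

```python
import numpy as np

# Two candidate sub-distributions as (mean, std) pairs -- toy stand-ins for
# the candidate networks in the distribution learning module (DLM).
candidates = [(0.0, 1.0), (8.0, 3.0)]

def log_pdf(x, mu, sigma):
    """Gaussian log-density, the per-candidate likelihood score."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def select_subdistribution(x):
    """Assign each sample to the candidate under which it is most likely,
    mimicking the role of the distribution selection module (DSM)."""
    ll = np.stack([log_pdf(x, mu, s) for mu, s in candidates])
    return ll.argmax(axis=0)

values = np.array([0.2, -0.5, 7.0, 12.0])
print(select_subdistribution(values))
```

Low-spend samples route to the narrow near-zero component and high-value samples to the wide high-mean one, which is the intuition behind modeling a long-tailed, multi-regime CLTV distribution with selectable sub-distributions.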

[IR-5] Collaborative Cross-modal Fusion with Large Language Model for Recommendation CIKM2024

链接: https://arxiv.org/abs/2408.08564
作者: Zhongzhou Liu,Hao Zhang,Kuicai Dong,Yuan Fang
关键词-EN: conventional collaborative filtering, large language models, leveraging semantic knowledge, success of conventional, exhibit limitations
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: 10 pages, 4 figures, accepted by CIKM 2024

点击查看摘要

Abstract:Despite the success of conventional collaborative filtering (CF) approaches for recommendation systems, they exhibit limitations in leveraging semantic knowledge within the textual attributes of users and items. Recent focus on the application of large language models for recommendation (LLM4Rec) has highlighted their capability for effective semantic knowledge capture. However, these methods often overlook the collaborative signals in user behaviors. Some simply instruct-tune a language model, while others directly inject the embeddings of a CF-based model, lacking a synergistic fusion of different modalities. To address these issues, we propose a framework of Collaborative Cross-modal Fusion with Large Language Models, termed CCF-LLM, for recommendation. In this framework, we translate the user-item interactions into a hybrid prompt to encode both semantic knowledge and collaborative signals, and then employ an attentive cross-modal fusion strategy to effectively fuse latent embeddings of both modalities. Extensive experiments demonstrate that CCF-LLM outperforms existing methods by effectively utilizing semantic and collaborative signals in the LLM4Rec context.

[IR-6] Don't Click the Bait: Title Debiasing News Recommendation via Cross-Field Contrastive Learning

链接: https://arxiv.org/abs/2408.08538
作者: Yijie Shu,Xiaokun Zhang,Youlin Wu,Bo Xu,Liang Yang,Hongfei Lin
关键词-EN: access content, content of interest, vast amount, recommendation emerges, Cross-field Contrastive learning
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:News recommendation emerges as a primary means for users to access content of interest from the vast amount of news. Title clickbait extensively exists in the news domain and increases the difficulty for news recommendation to offer satisfactory services for users. Fortunately, we find that the news abstract, as a critical field of news, aligns cohesively with the news authenticity. To this end, we propose a Title Debiasing News Recommendation with Cross-field Contrastive learning (TDNR-C2) to overcome the title bias by incorporating news abstract. Specifically, a multi-field knowledge extraction module is devised to extract multi-view knowledge about news from various fields. Afterwards, we present a cross-field contrastive learning module to conduct bias removal via contrasting learned knowledge from title and abstract fields. Experimental results on a real-world dataset demonstrate the superiority of the proposed TDNR-C2 over existing state-of-the-art methods. Further analysis also indicates the significance of news abstract for title debiasing.

[IR-7] MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering

链接: https://arxiv.org/abs/2408.08521
作者: Zhengyuan Zhu,Daniel Lee,Hong Zhang,Sai Sree Harsha,Loic Feujio,Akash Maharaj,Yunyao Li
关键词-EN: demonstrated impressive performance, Recent advancements, retrieval-augmented generation, advancements in retrieval-augmented, demonstrated impressive
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: Preprint

点击查看摘要

Abstract:Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots and settings such as customer service and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This framework can be easily extended to support multimodal answers in enterprise chatbots with minimal modifications. Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable compared to plain text answers.

[IR-8] W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering

链接: https://arxiv.org/abs/2408.08444
作者: Jinming Nian,Zhiyuan Peng,Qifan Wang,Yi Fang
关键词-EN: Large Language Models, Large Language, open-domain question answering, factual answers relying, answers relying solely
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In knowledge-intensive tasks such as open-domain question answering (OpenQA), Large Language Models (LLMs) often struggle to generate factual answers relying solely on their internal (parametric) knowledge. To address this limitation, Retrieval-Augmented Generation (RAG) systems enhance LLMs by retrieving relevant information from external sources, thereby positioning the retriever as a pivotal component. Although dense retrieval demonstrates state-of-the-art performance, its training poses challenges due to the scarcity of ground-truth evidence, largely attributed to the high costs of human annotation. In this paper, we propose W-RAG by utilizing the ranking capabilities of LLMs to create weakly labeled data for training dense retrievers. Specifically, we rerank the top- K passages retrieved via BM25 by assessing the probability that LLMs will generate the correct answer based on the question and each passage. The highest-ranking passages are then used as positive training examples for dense retrieval. Our comprehensive experiments across four publicly available OpenQA datasets demonstrate that our approach enhances both retrieval and OpenQA performance compared to baseline models.
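The weak-labeling recipe above is: rerank the BM25 top-K by the probability that an LLM generates the gold answer from each passage, then take the best passage as the positive training example. The sketch below substitutes a crude lexical-overlap proxy for the LLM score so it runs standalone; `answer_logprob`, its length penalty, and the example data are all invented for illustration. A real implementation would sum the answer's token log-probabilities under the LLM.

```python
import math

def answer_logprob(question, passage, answer):
    """Stand-in for log P(answer | question, passage) under an LLM.
    Here: lexical overlap with the answer, minus a mild length penalty."""
    overlap = len(set(answer.lower().split()) & set(passage.lower().split()))
    return math.log(1 + overlap) - len(passage.split()) * 1e-3

def weak_label(question, answer, bm25_passages):
    """Rerank BM25 candidates by the scorer; the best passage becomes the
    positive example for training the dense retriever."""
    ranked = sorted(bm25_passages,
                    key=lambda p: answer_logprob(question, p, answer),
                    reverse=True)
    return ranked[0], ranked[1:]

question = "Who wrote The Origin of Species?"
answer = "Charles Darwin"
passages = [
    "Charles Darwin wrote On the Origin of Species in 1859.",
    "The finches of the Galapagos vary in beak shape.",
    "Darwin studied at the University of Edinburgh.",
]
positive, negatives = weak_label(question, answer, passages)
print(positive)
```

Because the label comes from the LLM's own answer likelihood rather than from human judgments, the dense retriever can be trained at scale without ground-truth relevance annotations.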

[IR-9] Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions

链接: https://arxiv.org/abs/2408.08379
作者: Krisztian Balog,John Palowitch,Barbara Ikica,Filip Radlinski,Hamidreza Alvari,Mehdi Manshadi
关键词-EN: modern machine learning, synthetic data represents, highly private, machine learning, offering a solution
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework.

附件下载

点击下载今日全部论文列表