本篇博文主要展示每日从arXiv论文网站获取的最新论文列表,每天早上11:30左右定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分;若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从arXiv网站获取,每天早上11:30左右定时自动更新。

友情提示:如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱地址,邮件同样会在每天11:30左右定时自动发送。

目录

概览 (2024-07-01)

今日共更新341篇论文,其中:

  • 自然语言处理68篇(Computation and Language (cs.CL))
  • 计算机视觉69篇(Computer Vision and Pattern Recognition (cs.CV))
  • 人工智能78篇(Artificial Intelligence (cs.AI))
  • 机器学习89篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
[NLP-0] Web2Code:用于多模态LLM的大规模网页到代码数据集和评估框架

链接: https://arxiv.org/abs/2406.20098
作者: Sukmin Yun,Haokun Lin,Rusiru Thushara,Mohammad Qazim Bhat,Yongxin Wang,Zutao Jiang,Mingkai Deng,Jinhong Wang,Tianhua Tao,Junbo Li,Haonan Li,Preslav Nakov,Timothy Baldwin,Zhengzhong Liu,Eric P. Xing,Xiaodan Liang,Zhiqiang Shen
关键词: Multimodal large language, shown impressive success, Multimodal large, HTML code, shown impressive
中文关键词: 多模式大型语言,表现出令人印象深刻的成功,多模式大型,HTML代码,表现出令人印象深刻的
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Website at this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage’s HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs’ abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at this https URL.
摘要:在各种理解和生成任务中,多通道大语言模型(MLLMS)在图像、视频和音频等通道上取得了令人印象深刻的成功。然而,目前的MLLMS在理解网页截图和生成相应的HTML代码方面出人意料地差。为了解决这一问题,我们提出了Web2Code,这是一个由用于指令调优的新的大规模网页到代码的数据集和MLLMS的网页理解和HTML代码翻译能力的评估框架组成的基准。对于数据集构建,我们利用预先训练的LLM来增强现有的网页到代码的数据集,以及生成呈现为图像的不同的新网页池。具体地说,输入是网页图像和说明,而响应是网页的HTML码。我们还在回复中加入了关于网页内容的不同自然语言问答对,以使人们能够更全面地了解网页内容。为了评估模型在这些任务中的性能,我们开发了一个评估框架,用于测试MLLMS在网页理解和Web到代码生成方面的能力。大量的实验表明,我们提出的数据集不仅对我们提出的任务有益,而且在一般的视觉领域也是有益的,而以前的数据集的性能较差。我们希望我们的工作将有助于开发适合于基于Web的内容生成和任务自动化的通用MLLMS。我们的数据和代码将在此HTTPS URL上提供。

[NLP-1] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
[NLP-1] LLaRA:为视觉-语言策略增强机器人学习数据

链接: https://arxiv.org/abs/2406.20095
作者: Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
关键词: Large Language Models, Large Language, extensive world knowledge, strong reasoning skills, Vision Language Models
中文关键词: 大型语言模型,大型语言,广泛的世界知识,强大的推理能力,视觉语言模型
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at this https URL.
摘要:大型语言模型具有广博的世界知识和较强的推理能力,能够处理跨域的不同任务,通常通过将它们设定为对话式的教学-反应对来实现。在本文中,我们提出了LLaRA:Large Language and Robotics Assistant,这是一个框架,它将机器人的动作策略制定为会话,并在使用辅助数据进行训练时提供更好的响应,以补充策略学习。具有视觉输入的LLM,即视觉语言模型(VLM),具有将状态信息处理为视觉文本提示并以文本形式生成最佳政策决策的能力。为了训练这样的动作策略VLM,我们首先引入了一个自动化管道来从现有的行为克隆数据中生成各种高质量的机器人指令数据。VLM基于为机器人任务量身定做的对话式公式而得到的数据集集合进行了微调,可以生成有意义的机器人行动策略决策。我们在多个模拟和真实环境中的实验证明了所提出的LLaRA框架具有最先进的性能。代码、数据集和预先训练的模型可在此HTTPS URL中找到。

[NLP-2] Scaling Synthetic Data Creation with 1000000000 Personas
[NLP-2] 使用1000000000个角色扩展合成数据创建

链接: https://arxiv.org/abs/2406.20094
作者: Xin Chan,Xiaoyang Wang,Dian Yu,Haitao Mi,Dong Yu
关键词: large language model, create diverse synthetic, diverse synthetic data, language model, large language
中文关键词: 大语言模型,创建多样化的合成,多样化的合成数据,语言模型,大语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub – a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
摘要:我们提出了一种新的人物角色驱动的数据合成方法,该方法利用大型语言模型(LLM)中的不同视角来创建不同的合成数据。为了在规模上充分利用这一方法,我们引入了Persona Hub–从网络数据中自动管理的10亿个不同的角色的集合。这10亿个人物角色(约占世界总人口的13%)作为世界知识的分布式载体,可以利用LLM中封装的几乎每个角度,从而促进为各种场景创建规模多样的合成数据。通过展示Persona Hub在大规模合成高质量数学和逻辑推理问题、指令(即用户提示)、富知识文本、游戏NPC和工具(功能)方面的用例,我们展示了人物角色驱动的数据合成是通用的、可扩展的、灵活的和易于使用的,潜在地推动了合成数据创建和实践应用的范式转变,这可能会对LLM研究和开发产生深远影响。
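
下面给出一个极简的Python示意,帮助理解“人物角色驱动的数据合成”这一思路:把角色描述与任务指令拼成提示词,让同一个LLM从不同视角生成合成数据。示例中的人物角色、任务以及call_llm接口均为假设的占位,并非论文或Persona Hub的官方实现。

```python
# 极简示意:人物角色驱动的合成数据生成(非论文官方实现)

# 假设的人物角色池;实际的 Persona Hub 含 10 亿条自动整理的角色描述
personas = [
    "a retired civil engineer who enjoys model trains",
    "a high-school chemistry teacher in a small town",
    "a freelance game translator who plays competitive chess",
]

def build_prompt(persona: str, task: str) -> str:
    # 将角色描述与任务指令拼接,诱导 LLM 从该角色的视角生成数据
    return (
        f"You are {persona}.\n"
        f"From this persona's perspective, {task}\n"
        "Return only the problem statement."
    )

def call_llm(prompt: str) -> str:
    # 假设的模型调用占位函数,实际可替换为任意对话式 LLM 接口
    return f"[model output for: {prompt[:40]}...]"

task = "write one challenging but self-contained math word problem."
synthetic_samples = [call_llm(build_prompt(p, task)) for p in personas]
for sample in synthetic_samples:
    print(sample)
```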

[NLP-3] ProgressGym: Alignment with a Millennium of Moral Progress
[NLP-3] ProgressGym:与千年道德进步对齐

链接: https://arxiv.org/abs/2406.20087
作者: Tianyi Qiu,Yang Zhang,Xuchuan Huang,Jasmine Xinze Li,Jiaming Ji,Yaodong Yang
关键词: large language models, including large language, Frontier AI systems, hold increasing influence, including large
中文关键词: 大型语言模型,包括大型语言、Frontier AI系统,具有越来越大的影响力,包括大型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at this https URL and this https URL respectively.
摘要:包括大语言模型在内的前沿人工智能系统对人类用户的认识论产生了越来越大的影响。这种影响可以加强普遍的社会价值观,潜在地导致被误导的道德信仰被锁定,从而导致有问题的道德做法在大范围内永久化。我们引入进步对齐(progress alignment)作为一种技术解决方案来缓解这一迫在眉睫的风险。进步对齐算法学习模仿人类道德进步的机制,从而解决现有对齐方法对当代道德盲点的敏感性。为了推动进步对齐研究,我们引入了ProgressGym,这是一个允许从历史中学习道德进步机制的实验框架,以促进现实世界道德决策的未来进步。ProgressGym利用9个世纪的历史文本和18个历史LLM,能够将现实世界的进步对齐挑战编纂为具体的基准。具体地说,我们引入了三个核心挑战:跟踪不断演变的价值观(PG-Follow),先发制人地预测道德进步(PG-Predict),以及调节人类和人工智能价值转移之间的反馈回路(PG-Coevolve)。没有时间维度的对齐方法不适用于这些任务。作为回应,我们提出了终身和外推算法作为进步对齐的基线方法,并建立了一个公开的排行榜,征求新的算法和挑战。框架和排行榜分别位于此HTTPS URL和此HTTPS URL。

[NLP-4] Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
[NLP-4] 作为LLM中隐性词汇项目足迹的Token Erasure

链接: https://arxiv.org/abs/2406.20086
作者: Sheridan Feucht,David Atkinson,Byron Wallace,David Bau
关键词: LLMs process text, process text, text as sequences, represented by multiple, tokens
中文关键词: LLM处理文本,处理文本,文本作为序列,由多个标记表示
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 14 figures. Code and data at this https URL

点击查看摘要

Abstract:LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s tokenizer splits the word “northeastern” into the tokens [‘_n’, ‘ort’, ‘he’, ‘astern’], none of which correspond to semantically meaningful units like “north” or “east.” Similarly, the overall meanings of named entities like “Neil Young” and multi-word expressions like “break a leg” cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced “erasure” effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to “read out” the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
摘要:LLM将文本处理为与单词大致对应的标记序列,其中不常用的单词由多个标记表示。然而,个别标记通常在语义上与它们所包含的单词/概念的含义无关。例如,Llama-2-7b的标记器将单词“northeastern”拆分成记号['_n'、'ort'、'he'、'astern'],这些记号都不对应于“north”或“east”等有语义意义的单位。同样,像“Neil Young”这样的命名实体和像“break a leg”这样的多词表达的整体含义也不能直接从它们的组成标记中推断出来。从机制上讲,LLM如何将这种任意的令牌组转换为有用的更高级别的表示?在这项工作中,我们发现命名实体和多标记词的最后一个标记的表征呈现出明显的“擦除”效应,即关于先前和当前标记的信息在早期层被迅速遗忘。利用这一观察结果,我们提出了一种通过检查各层标记表征的差异来“读出”自回归LLM的隐含词汇表的方法,并给出了该方法在Llama-2-7b和Llama-3-8B上的结果。据我们所知,这是第一次对LLM的内隐词汇进行探测研究。
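
作为补充理解,下面是一个极简的探测示意:取序列中某个位置的token,比较其隐状态在相邻层之间的余弦相似度,观察表征在早期层的快速变化。这里仅用gpt2做演示(论文实际分析的是Llama-2-7b/Llama-3-8B),位置选取与度量方式只是假设性简化,并非论文的完整“擦除”分析。

```python
# 极简示意:比较某个 token 的隐状态在相邻层之间的余弦相似度
# 仅用小模型 gpt2 演示思路;位置选取与度量方式为假设性简化
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"  # 论文中实际分析的是 Llama-2-7b / Llama-3-8B
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

text = "They moved to the northeastern coast."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states  # (层数+1) 个 [1, seq_len, dim] 张量

pos = inputs["input_ids"].shape[1] - 1      # 这里简单取序列最后一个 token 的位置
prev = hidden[0][0, pos]
for layer, h in enumerate(hidden[1:], start=1):
    cur = h[0, pos]
    sim = torch.cosine_similarity(prev, cur, dim=0).item()
    print(f"layer {layer:2d}  与上一层的余弦相似度 = {sim:.3f}")
    prev = cur
```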

[NLP-5] Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
[NLP-5] 分子事实:LLM事实验证中去语境化的愿望

链接: https://arxiv.org/abs/2406.20079
作者: Anisha Gunjal,Greg Durrett
关键词: large language model, Automatic factuality verification, Automatic factuality, language model, combat hallucinations
中文关键词: 大型语言模型、自动真实性验证、自动真实性、语言模型、战斗幻觉
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic factuality verification of large language model (LLM) generations is becoming more and more widely used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontexuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.
摘要:大型语言模型(LLM)生成的自动真实性验证正越来越广泛地用于对抗幻觉。文献中的一个主要紧张点是这种事实核查的粒度:较大的文本块很难进行事实核查,但更多的原子事实,如命题,可能缺乏正确解释的背景。在这项工作中,我们评估了上下文在这些原子事实中的作用。我们认为完全原子的事实并不是正确的表征,并为分子事实定义了两个标准:去文本性,或者它们能多好地独立,以及最小性,或者几乎没有额外的信息来实现去文本。我们量化了去上下文对最小化的影响,然后提出了一种自动生成分子事实的基线方法,旨在添加合适的信息量。我们比较了不同的去文本化方法,发现分子事实在歧义环境中平衡了最小性和事实验证的准确性。

[NLP-6] Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
[NLP-6] 在轻量级LLM中应用RLAIF进行带API调用的代码生成

链接: https://arxiv.org/abs/2406.20060
作者: Sujan Dutta,Sayantan Mahinder,Raviteja Anantha,Bortik Bandyopadhyay
关键词: Reinforcement Learning, enhancing text summarization, demonstrated significant potential, including mitigating harm, enhancing text
中文关键词: 强化学习、增强文本摘要,表现出巨大的潜力,包括减轻伤害、增强文本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model towards better alignment from smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline’s performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.
摘要:来自人工智能反馈的强化学习(RLAIF)在各个领域都显示出了巨大的潜力,包括减轻LLM输出中的危害,增强文本摘要和数学推理。介绍了一种改进轻量级(1B参数)LLM代码生成能力的RLAIF框架。我们特别关注需要编写适当的API调用的代码生成任务,由于众所周知的LLM中的幻觉问题,这是具有挑战性的。我们的框架通过一种专门的提示策略从较大的LLM(例如,GPT-3.5)中提取人工智能反馈,并使用这些数据来训练奖励模型,使其与较小的LLM更好地对齐。我们在Gorilla数据集上运行实验,仔细评估模型生成的代码在各种指标上的质量,包括AST、Rouge和Code-BLEU,并开发了一个管道来准确计算其可执行率。我们的方法显著提高了微调的LLM基线的性能,实现了4.5%的可执行率改进。值得注意的是,使用RLAIF训练的较小的LLM模型(780M参数)超过了使用7B参数的更大的微调基线,实现了1.0%的代码可执行率。
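
摘要中提到作者构建了流水线来精确计算生成代码的可执行率。下面是一个极简的示意(并非论文原始评测代码):把候选代码写入临时文件、在子进程中限时运行,并统计无异常退出的比例;真实评测还需要沙箱隔离与依赖管理。

```python
# 极简示意:统计模型生成代码的可执行率(executability rate)
# 并非论文原始评测流水线;真实评测还需沙箱隔离与依赖管理
import subprocess
import sys
import tempfile

def is_executable(code: str, timeout: float = 5.0) -> bool:
    # 将候选代码写入临时文件,在子进程中限时运行,无异常退出视为可执行
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

generated = [
    "print(sum(range(10)))",              # 可执行
    "import jsn\nprint(jsn.dumps({}))",   # 不可执行:模块名属于幻觉
]
rate = sum(is_executable(c) for c in generated) / len(generated)
print(f"executability rate = {rate:.1%}")
```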

[NLP-7] To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
[NLP-7] 词义及其超越:用上下文化语言模型归纳概念

链接: https://arxiv.org/abs/2406.20054
作者: Bastien Liétard,Pascal Denis,Mikaella Keller
关键词: crucial interrelated facets, Word Sense Induction, lexical ambiguity, Word Sense Disambiguiation, crucial interrelated
中文关键词: 重要的相互关联的方面,词意归纳,词汇歧义,词意消除,重要的相互关联
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena have been studied extensively in NLP, leading to dedicated systems, they are often been considered independently. While many tasks dealing with polysemy (e.g. Word Sense Disambiguiation or Induction) highlight the role of a word’s senses, the study of synonymy is rooted in the study of concepts, i.e. meaning shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes that of Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon perspective to induce concepts. We evaluate the obtained clustering on SemCor’s annotated data and obtain good performances (BCubed F1 above 0.60). We find that the local and the global levels are mutually beneficial to induce concepts and also senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining competitive performances with the State-of-the-Art.
摘要:一词多义和同义词是词汇歧义的两个相互关联的重要方面。虽然这两种现象在NLP中得到了广泛的研究,导致了专门的系统,但它们往往被独立考虑。虽然许多处理多义词的任务(如词义消歧或归纳)强调了词义的作用,但同义词的研究植根于对概念的研究,即在整个词典中共享的意义。在本文中,我们引入了概念归纳,这是一种无监督的任务,学习词之间的软聚类,直接从数据中定义一组概念。这项任务概括了词义归纳的任务。我们提出了一种双层的概念归纳方法,该方法利用局部词条中心的观点和全局跨词典的观点来归纳概念。我们在SemCor的标注数据上对所得到的聚类进行了评估,并获得了良好的性能(BCued F1在0.60以上)。我们发现,在我们的环境中,局部和全球层面对于诱导概念和感觉是互惠互利的。最后,我们创建代表我们诱导的概念的静态嵌入,并将它们用于上下文中的单词任务,获得与最先进技术相竞争的性能。

[NLP-8] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
[NLP-8] 秘密恶意微调:保障LLM适应的挑战

链接: https://arxiv.org/abs/2406.20053
作者: Danny Halawi,Alexander Wei,Eric Wallace,Tony T. Wang,Nika Haghtalab,Jacob Steinhardt
关键词: language models, finetuning, emerging interface, model, Black-box finetuning
中文关键词: 语言模型、微调、新兴界面、模型、黑匣子微调
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages

点击查看摘要

Abstract:Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
摘要:黑匣子微调是一种新兴的界面,用于根据用户需求调整最先进的语言模型。然而,这种访问也可能会让恶意行为者破坏模型安全性。为了展示保护微调接口的挑战,我们引入了隐蔽恶意微调,这是一种通过微调损害模型安全性同时逃避检测的方法。我们的方法构建了一个恶意数据集,其中每个数据点都显得无害,但对数据集的微调教会模型通过编码的有害响应来响应编码的有害请求。应用于GPT-4,我们的方法产生了一个微调模型,该模型99%的时间作用于有害指令,并避免数据集检查、安全评估和输入/输出分类器等防御机制的检测。我们的研究结果质疑黑匣子微调访问是否可以防止复杂的对手。

[NLP-9] Understanding and Mitigating Language Confusion in LLMs
[NLP-9] 理解和缓解LLM中的语言混乱

链接: https://arxiv.org/abs/2406.20052
作者: Kelly Marchisio,Wei-Yin Ko,Alexandre Bérard,Théo Dehaze,Sebastian Ruder
关键词: consistently generate text, user desired language, Language Confusion Benchmark, Language Confusion, investigate a surprising
中文关键词: 一致生成文本、用户所需语言、语言混乱基准、语言混乱、调查令人惊讶的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user’s desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation at this https URL.
摘要:我们调查了LLM的一个令人惊讶的局限性:它们无法以用户所需的语言一致地生成文本。我们创建了语言混淆基准(LCB)来评估此类故障,涵盖了15种不同类型的语言,以及现有的和新创建的英语和多语言提示。我们评估了一系列反映实际用例的单语和跨语言生成的LLM,发现Llama Instruct和Mistral模型表现出高度的语言混乱,即使是最强的模型也无法以正确的语言一致响应。我们发现,基础模型和以英语为中心的指令微调模型更容易出现语言混乱,而复杂的提示和较高的采样温度加剧了这一现象。我们发现,语言混乱可以通过少样本提示、多语言SFT和偏好调整来部分缓解。我们发布了我们的语言混淆基准测试,该基准测试是在此HTTPS URL上进行高效、可伸缩的多语言评估的第一层。
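
为便于理解“语言混乱”的度量思路,下面给出一个极简示意:检查模型回复的语言是否与期望语言一致,并统计行级语言准确率。其中langdetect只是一个示例检测器,期望语言代码与样例回复均为虚构,并非LCB基准的官方实现。

```python
# 极简示意:检查回复语言是否与期望语言一致,统计行级语言准确率
# langdetect 仅作示例检测器(pip install langdetect),样例数据为虚构
from langdetect import detect

expected_and_replies = [
    ("de", "Die Hauptstadt von Frankreich ist Paris."),
    ("de", "The capital of France is Paris."),  # 语言混乱:要求德语却回复英语
    ("zh-cn", "法国的首都是巴黎。"),
]

correct = 0
for expected, reply in expected_and_replies:
    detected = detect(reply)
    correct += int(detected == expected)
    print(f"期望 {expected:>5} | 检测 {detected:>5} | {reply}")

print(f"line-level language accuracy ≈ {correct / len(expected_and_replies):.2f}")
```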

[NLP-10] BioMNER: A Dataset for Biomedical Method Entity Recognition
[NLP-10] BioMNER:用于生物医学方法实体识别的数据集

链接: https://arxiv.org/abs/2406.20038
作者: Chen Tang,Bohao Yang,Kun Zhao,Bo Lv,Chenghao Xiao,Frank Guerin,Chenghua Lin
关键词: Natural Language Processing, Named entity recognition, realm of Natural, Biomedical Method NER, Biomedical Method
中文关键词: 自然语言处理、命名实体识别、自然领域、生物医学方法NER、生物医学方法
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.
摘要:命名实体识别(NER)是自然语言处理领域的一项基础和关键任务。特别是在生物医学方法NER领域,这项任务提出了显著的挑战,源于学术文献中特定领域术语的持续涌入。目前在生物医学方法(BioMethod)NER方面的研究缺乏资源,这主要是由于方法学概念的错综复杂的性质,这需要深刻的理解才能准确地描述。在这项研究中,我们提出了一种新的用于生物医学方法实体识别的数据集,使用一个自动化的BioMethod实体识别和信息检索系统来辅助人类标注。此外,我们全面探索了一系列传统和当代的开放领域NER方法,包括使用为我们的数据集定制的尖端大型语言模型(LLM)。我们的经验结果表明,语言模型的大参数计数出人意料地抑制了与生物医学方法有关的实体提取模式的有效同化。值得注意的是,该方法利用中等大小的阿尔伯特模型(仅11Mb),结合条件随机场(CRF),实现了最先进的性能(SOTA)。

[NLP-11] LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models
[NLP-11] LEMoE:大型语言模型终身模型编辑的高级专家混合适配器

链接: https://arxiv.org/abs/2406.20030
作者: Renzhi Wang,Piji Li
关键词: Large language models, require continual knowledge, ever-changing world facts, continual knowledge updates, Large language
中文关键词: 大型语言模型,需要持续的知识、不断变化的世界事实、持续的知识更新、大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) require continual knowledge updates to stay abreast of the ever-changing world facts, prompting the formulation of lifelong model editing task. While recent years have witnessed the development of various techniques for single and batch editing, these methods either fail to apply or perform sub-optimally when faced with lifelong editing. In this paper, we introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong model editing. We first analyze the factors influencing the effectiveness of conventional MoE adaptor in lifelong editing, including catastrophic forgetting, inconsistent routing and order sensitivity. Based on these insights, we propose a tailored module insertion method to achieve lifelong editing, incorporating a novel KV anchor routing to enhance routing consistency between training and inference stage, along with a concise yet effective clustering-based editing order planning. Experimental results demonstrate the effectiveness of our method in lifelong editing, surpassing previous model editing techniques while maintaining outstanding performance in batch editing task. Our code will be available.
摘要:大型语言模型需要不断地更新知识,以跟上不断变化的世界事实,这促使制定了终身模型编辑任务。虽然近年来见证了各种单次和批量编辑技术的发展,但这些方法要么无法应用,要么在面对终身编辑时表现不佳。本文介绍了一种用于终身模型编辑的高级混合专家(MOE)适配器LEMoE。我们首先分析了影响传统MOE适配器在终身编辑中有效性的因素,包括灾难性遗忘、不一致的路线和顺序敏感性。基于这些见解,我们提出了一种定制的模块插入方法来实现终身编辑,该方法结合了一种新颖的KV锚路由来增强训练和推理阶段的路由一致性,以及一种简洁而有效的基于聚类的编辑顺序规划。实验结果证明了该方法在终身编辑中的有效性,超越了以往的模型编辑技术,同时在批处理编辑任务中保持了优异的性能。我们的代码将可用。

[NLP-12] ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
[NLP-12] ToolBeHonest:工具增强大型语言模型的多级别幻觉诊断基准

链接: https://arxiv.org/abs/2406.20015
作者: Yuxiang Zhang,Jing Chen,Junjie Wang,Yaxin Liu,Cheng Yang,Chufan Shi,Xinyu Zhu,Zihao Lin,Hanwen Wan,Yujiu Yang,Tetsuya Sakai,Tian Feng,Hayato Yamana
关键词: Tool-augmented large language, Tool-augmented large, large language models, real-world applications, large language
中文关键词: 工具增强大型语言、工具增强大型语言模型、现实世界应用程序、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM’s hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve a total score of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play a crucial role in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.
摘要:工具扩充的大型语言模型(LLM)正迅速被集成到实际应用中。由于缺乏基准,社区仍然需要充分了解这些模型中的幻觉问题。为了应对这一挑战,我们引入了一个全面的诊断基准–ToolBH。具体地说,我们从两个角度评估LLM的幻觉:深度和广度。在深度方面,我们提出了一个多层次的诊断过程,包括(1)可解性检测,(2)解规划,(3)缺失工具分析。对于广度,我们根据工具集的特征考虑三种情况:缺少必要的工具、潜在的工具和功能有限的工具。此外,我们还开发了七个任务,通过多轮人工标注收集了700个评价样本。结果显示了ToolBH基准带来的重大挑战。目前先进的Gemini-1.5-Pro和GPT-4o在满分为100分的情况下,总分分别为45.3分和37.0分。在此基准测试中,较大的模型参数并不能保证更好的性能;培训数据和响应策略在工具增强的LLM场景中也起着至关重要的作用。我们的诊断分析表明,模型误差的主要原因在于评估任务的可解性。此外,开放重量模型的性能会下降,回复冗长,而专有模型的推理时间更长。

[NLP-13] The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
[NLP-13] SIFo基准:调查大型语言模型的顺序指令跟随能力

链接: https://arxiv.org/abs/2406.19999
作者: Xinyi Chen,Baohao Liao,Jirui Qi,Panagiotis Eustratiadis,Christof Monz,Arianna Bisazza,Maarten de Rijke
关键词: multiple instructions, instructions, crucial ability, large language models, multiple
中文关键词: 多指令,指令,关键能力,大型语言模型,多
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models’ abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark’s effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today’s language models.
摘要:遵循多条指令是大型语言模型(LLM)的一项重要能力。评估这一能力面临着巨大的挑战:(I)多条指令之间的有限一致性,(Ii)指令顺序影响模型性能的位置偏差,以及(Iii)缺乏可客观验证的任务。为了解决这些问题,我们引入了一个基准测试,旨在评估模型通过顺序指令遵循(Sifo)任务遵循多条指令的能力。在Sifo中,可以通过只检查最后一条指令来验证多条指令的成功完成。我们的基准使用四个任务(文本修改、问题回答、数学和安全规则遵循)来评估指令遵循,每个任务都评估顺序指令遵循的不同方面。我们对流行的LLM的评估显示,较新和较大的模型在Sifo任务中的表现明显优于较旧和较小的对应模型,验证了基准测试的有效性。所有的模型都在努力遵循指令序列,这暗示着今天的语言模型严重缺乏健壮性。

[NLP-14] Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
[NLP-14] 单亲家庭:来自单个预训练基础模型的一系列家族成员模型

链接: https://arxiv.org/abs/2406.19995
作者: Habib Hajimolahoseini,Mohammad Hassanpour,Foozhan Ataiefard,Boxing Chen,Yang Liu
关键词: Low Rank Decomposition, Progressive Low Rank, Progressive Low, Rank Decomposition, Low Rank
中文关键词: 低等级分解,渐进低等级,渐进低,等级分解,低等级
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
摘要:介绍了一种适合于大型语言模型压缩的递进低阶分解(PLRD)方法。我们的方法利用预先训练的模型,然后使用逐步较低的等级递增地解压缩到较小的大小。这种方法可以显著减少计算开销和能源消耗,因为后续模型是从原始模型推导而来的,而不需要从头开始重新训练。我们详细介绍了PLRD的实现,它战略性地降低了张量等级,从而优化了模型性能和资源使用之间的权衡。通过大量的实验证明了PLRD的有效性,实验表明,仅在1B令牌上使用PLRD方法训练的模型在使用0.1%的令牌的情况下仍保持与传统训练模型相当的性能。PLRD的多功能性突出表现在它能够从单个基础模型生成多个模型大小,并流畅地适应不同的计算和内存预算。我们的发现表明,PLRD可以为LLMS的有效扩展设定一个新的标准,使先进的人工智能在不同的平台上更加可行。
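
PLRD的核心是对权重做低秩分解并逐步降低秩。下面用NumPy的截断SVD给出一个最小示意,观察不同秩下单个(随机生成的)权重矩阵的参数量与近似误差;这只是演示“渐进降秩”的直觉,并非论文的训练与解压流程。

```python
# 极简示意:用截断 SVD 对单个权重矩阵做低秩近似,并逐步降低秩
# 仅演示"渐进降秩"的直觉,并非 PLRD 的完整训练/解压流程
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))        # 随机生成的"权重矩阵",仅作示例

U, S, Vt = np.linalg.svd(W, full_matrices=False)
for rank in (512, 256, 128, 64, 32):       # 逐步降低秩
    W_low = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]
    rel_err = np.linalg.norm(W - W_low) / np.linalg.norm(W)
    n_params = rank * (W.shape[0] + W.shape[1])
    print(f"rank={rank:3d}  参数量≈{n_params:7d}  相对误差={rel_err:.3f}")
```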

[NLP-15] Into the Unknown: Generating Geospatial Descriptions for New Environments
[NLP-15] 走进未知:为新环境生成地理空间描述

链接: https://arxiv.org/abs/2406.19967
作者: Tzuf Paz-Argaman,John Palowitch,Sayali Kulkarni,Reut Tsarfaty,Jason Baldridge
关键词: task requires reasoning, task requires, observer viewpoint, focus on bridging, bridging the gap
中文关键词: 任务需要推理,任务需要,观察者观点,专注于弥合,弥合差距
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer’s viewpoint) using non-sequential navigation instructions and maps. However, performance substantially drops in new environments with no training data. Using opensource descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text resulting in low geolocation resolution. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments using readily available geospatial data. Our method constructs a grounded knowledge-graph, capturing entity relationships. Sampled entities and relations (`shop north of school’) generate navigation instructions via (i) generating numerous templates using context-free grammar (CFG) to embed specific entities and relations; (ii) feeding the entities and relation into a large language model (LLM) for instruction generation. A comprehensive evaluation on RVS, showed that our approach improves the 100-meter accuracy by 45.83% on unseen environments. Furthermore, we demonstrate that models trained with CFG-based augmentation achieve superior performance compared with those trained with LLM-based augmentation, both in unseen and seen environments. These findings suggest that the potential advantages of explicitly structuring spatial information for text-based geospatial reasoning in previously unknown, can unlock data-scarce scenarios.
摘要:类似于视觉和语言导航(VLN)任务的重点是弥合视觉和语言之间的差距的具身导航,新的会合(RVS)任务需要使用非顺序导航指令和地图对同心空间关系(独立于观察者的视点)进行推理。然而,在没有训练数据的新环境中,性能会大幅下降。使用与坐标配对的开源描述(例如,维基百科)提供了训练数据,但存在面向空间的文本有限,导致地理位置分辨率低的问题。我们提出了一种大规模增强方法,利用现成的地理空间数据为新环境生成高质量的合成数据。我们的方法构建了一个扎根的知识图,捕获了实体关系。被采样的实体和关系(学校以北的商店)通过以下方式生成导航指令:(I)使用上下文无关文法(CFG)生成大量模板以嵌入特定实体和关系;(Ii)将实体和关系馈入大型语言模型(LLM)以用于指令生成。对RVS的综合评估表明,在不可见环境下,该方法将100米的准确率提高了45.83%。此外,我们还证明了在不可见和可见环境中,使用基于CFG的增强训练的模型比使用基于LLM的增强训练的模型获得了更好的性能。这些发现表明,在以前未知的情况下,显式结构化空间信息用于基于文本的地理空间推理的潜在优势可以释放数据稀缺的场景。
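
摘要中提到用上下文无关文法(CFG)模板把实体与空间关系嵌入导航指令。下面是一个极小的示意文法与展开器,其中的规则、实体与关系均为虚构示例,仅用于说明这一生成方式。

```python
# 极简示意:用一个很小的上下文无关文法,把实体与空间关系嵌入导航指令模板
# 文法规则、实体与关系均为虚构示例,并非论文数据
import random

grammar = {
    "S": [
        ["Meet me at the", "ENTITY", "located", "REL", "the", "ANCHOR"],
        ["Go to the", "ENTITY", "that is", "REL", "the", "ANCHOR"],
    ],
    "ENTITY": [["coffee shop"], ["pharmacy"], ["bookstore"]],
    "REL": [["north of"], ["two blocks east of"], ["right next to"]],
    "ANCHOR": [["school"], ["train station"]],
}

def expand(symbol: str) -> str:
    if symbol not in grammar:              # 终结符直接返回
        return symbol
    production = random.choice(grammar[symbol])
    return " ".join(expand(s) for s in production)

random.seed(3)
for _ in range(3):
    print(expand("S"))
```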

[NLP-16] Simulating Financial Market via Large Language Model based Agents
[NLP-16] 通过基于大语言模型的代理模拟金融市场

链接: https://arxiv.org/abs/2406.19966
作者: Shen Gao,Yuntao Wen,Minghang Zhu,Jianing Wei,Yuhan Cheng,Qunzi Zhang,Shuo Shang
关键词: theories typically assume, fully rational individuals, simulate human behavior, economic theories typically, human behavior
中文关键词: 理论通常假设,完全理性的个人,模拟人类行为,经济理论通常,人类行为
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Most economic theories typically assume that financial market participants are fully rational individuals and use mathematical models to simulate human behavior in financial markets. However, human behavior is often not entirely rational and is challenging to predict accurately with mathematical models. In this paper, we propose Agent-based Simulated Financial Market (ASFM), which first constructs a simulated stock market with a real order matching system. Then, we propose a large language model based agent as the stock trader, which contains the profile, observation, and tool-learning based action module. The trading agent can comprehensively understand current market dynamics and financial policy information, and make decisions that align with their trading strategy. In the experiments, we first verify that the reactions of our ASFM are consistent with the real stock market in two controllable scenarios. In addition, we also conduct experiments in two popular economics research directions, and we find that conclusions drawn in our model align with the preliminary findings in economics research. Based on these observations, we believe our proposed ASFM provides a new paradigm for economic research.
摘要:大多数经济学理论通常假设金融市场参与者是完全理性的个体,并使用数学模型来模拟金融市场中的人类行为。然而,人类的行为往往不是完全理性的,用数学模型进行准确预测是具有挑战性的。在本文中,我们提出了基于智能体的模拟金融市场(ASFM),它首先构建了一个具有真实订单撮合系统的模拟股票市场。然后,我们提出了一个基于大语言模型的智能体作为股票交易者,它包含画像、观察和基于工具学习的动作模块。交易智能体可以全面了解当前的市场动态和金融政策信息,并做出与其交易策略一致的决定。在实验中,我们首先验证了我们的ASFM在两个可控场景下的反应与真实股市一致。此外,我们还在两个热门的经济学研究方向上进行了实验,我们发现我们的模型得出的结论与经济学研究中的初步结果是一致的。基于这些观察,我们认为我们提出的ASFM为经济研究提供了一个新的范式。
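
为了说明“真实订单撮合系统”大致指什么,下面给出一个只考虑价格优先的极简限价单撮合示意(忽略时间优先、撤单等细节),与论文的实际实现无关。

```python
# 极简示意:仅考虑价格优先的限价单撮合(忽略时间优先、撤单等细节)
import heapq

buy_book, sell_book = [], []   # 买单以 -价格 入堆(最高买价在堆顶),卖单以价格入堆

def match():
    # 只要最高买价 >= 最低卖价就撮合成交
    while buy_book and sell_book and -buy_book[0][0] >= sell_book[0][0]:
        neg_bid, bid_qty = heapq.heappop(buy_book)
        ask, ask_qty = heapq.heappop(sell_book)
        traded = min(bid_qty, ask_qty)
        print(f"成交 {traded} 股 @ {ask:.2f}")
        if bid_qty > traded:
            heapq.heappush(buy_book, (neg_bid, bid_qty - traded))
        if ask_qty > traded:
            heapq.heappush(sell_book, (ask, ask_qty - traded))

def submit(side: str, price: float, qty: int):
    if side == "buy":
        heapq.heappush(buy_book, (-price, qty))
    else:
        heapq.heappush(sell_book, (price, qty))
    match()

submit("buy", 10.10, 100)
submit("sell", 10.00, 60)   # 与已有买单成交 60 股
submit("sell", 10.20, 50)   # 高于最高买价,继续挂单
```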

[NLP-17] BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
[NLP-17] BESTOW:高效且可流式化的语音语言模型,兼具GPT与T5两者之长

链接: https://arxiv.org/abs/2406.19954
作者: Zhehuai Chen,He Huang,Oleksii Hrinchuk,Krishna C. Puvvada,Nithin Rao Koluguri,Piotr Żelasko,Jagadeesh Balam,Boris Ginsburg
关键词: Incorporating speech understanding, vital research direction, pretrained large-language models, Incorporating speech, speech understanding capabilities
中文关键词: 融合语音理解、重要的研究方向、预训练的大语言模型、融合语音、语音理解能力
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unifies the offline and streaming research with BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.
摘要:将语音理解能力融入到预先训练的大语言模型中已成为一个重要的研究方向(SpeechLLM)。以前的体系结构可以被分类为:i)GPT风格,将语音提示作为一系列LLM输入预先添加到文本提示中,类似于仅解码器模型;ii)T5风格,将语音交叉注意引入到预先训练的LLM的每一层。我们提出BESTOW体系结构,将两种风格的最佳特性整合到一个高效且具有强大多任务能力的单一模型中。此外,对于这两种风格都没有明确的流式解决方案,特别是考虑到该解决方案应该推广到语音多任务。我们将可流式的SpeechLLM重新描述为一个读写策略问题,并用BESTOW体系结构统一了离线和流式研究。因此,我们展示了第一个同时支持大规模流式和多任务(超越ASR)的开源SpeechLLM解决方案。这一可流式化的解决方案在各种语音任务(ASR、AST、SQA、未见过的DynamicSuperb)上实现了非常强大的性能。它是端到端可优化的,具有较低的训练/推理代价,并展示了LLM知识到语音的可传递性。

[NLP-18] Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation
[NLP-18] 使用利基采购和人工智能数据增强从非结构化数据中挖掘支持和反对疫苗接种的原因

链接: https://arxiv.org/abs/2406.19951
作者: Damián Ariel Furman,Juan Junqueras,Z. Burçe Gümüslü,Edgar Altszyler,Joaquin Navajas,Ophelia Deroy,Justin Sulik
关键词: annotated through nichesourcing, scientific authorities, present Reasons, predicting reasons, Vaccination
中文关键词: 通过利基采购、科学权威、当前原因、预测原因、疫苗接种进行注释
类目: Computation and Language (cs.CL)
备注: 8 pages + references and appendix

点击查看摘要

Abstract:We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT4 and GPT3.5-Turbo. We show how it is possible to mine these reasons in non-structured text, under different task definitions, despite the high level of subjectivity involved and explore the impact of artificially augmented data using in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset and the trained models along with the annotation manual used to train annotators and define the task.
摘要:我们介绍了支持和反对疫苗接种的原因(RFAV),这是一个用于预测支持和反对疫苗接种的原因的数据集,以及用于证明其合理性的科学权威机构,通过利基采购(nichesourcing)进行注释并使用GPT-4和GPT-3.5-Turbo进行增强。我们展示了如何在不同的任务定义下在非结构化文本中挖掘这些原因,尽管涉及的主观性很高,并使用GPT-4和GPT-3.5-Turbo的上下文学习探索人工增强数据的影响。我们发布数据集和训练模型以及用于训练注释者和定义任务的注释手册。

[NLP-19] Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
[NLP-19] 通过思维树上的偏好优化来校准LLM,以生成科学问题评分的基本原理

链接: https://arxiv.org/abs/2406.19949
作者: Jiazheng Li,Hainiu Xu,Zhaoyue Sun,Yuxiang Zhou,David West,Cesare Aloisi,Yulan He
关键词: automated scoring systems, facilitate explainability, explainability in automated, Large Language Models, scoring systems
中文关键词: 自动评分系统,促进可解释性,自动化中的可解释性,大型语言模型,评分系统
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths.
摘要:在自动评分系统中,生成理由来证明评分决定是一种很有前途的促进可解释性的方法。然而,现有的方法并不能与基于分类器的方法的精度相匹配。此外,产生的理由往往包含幻觉信息。为了解决这些问题,我们提出了一个新的框架,能够产生更可信的理由,更重要的是,匹配性能与基于分类器的黑盒评分系统。我们首先通过查询大型语言模型(LLM)来模拟人类评估过程,以生成思维树。然后,我们总结来自每条思维树路径的中间评估决策,以创建合成理由数据和理由偏好数据。最后,我们利用生成的合成数据通过两个步骤的训练过程来校准LLMS:有监督的微调和偏好优化。大量的实验结果表明,与以前的工作相比,我们的框架在QWK分数上获得了38%的评估性能改进,同时产生了更高质量的基本原理,这一点得到了人类评估者和LLMS的认可。我们的工作揭示了使用从思维树路径获得的合成偏好数据执行偏好优化的有效性。

[NLP-20] From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
[NLP-20] 从最少到最多:通过数据合成构建即插即用视觉推理器

链接: https://arxiv.org/abs/2406.19934
作者: Chuanqi Cheng,Jian Guan,Wei Wu,Rui Yan
关键词: explore multi-step reasoning, reasoning, explore multi-step, multi-step reasoning, visual
中文关键词: 探索多步骤推理,推理,探索多步骤,多步骤推理,视觉
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we construct 50 k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at this https URL.
摘要:探讨了视觉语言模型中的多步推理问题。这个问题很有挑战性,因为由视觉和语言处理的多个步骤组成的推理数据几乎不可用。为了克服这一挑战,我们首先引入了从最少到最多的可视化推理范式,该范式交错了将问题分解为子问题和调用外部工具解决子问题的步骤。基于该范式,我们进一步提出了一种新的数据合成方法,该方法可以自下而上地为图像自动生成问题和多步推理路径。我们的方法将复杂的合成任务划分为几个简单的子任务,并且(几乎全部)依赖于开源模型来完成子任务。因此,整个合成过程是可重复性和成本效益的,合成的数据是有质量保证的。利用该方法,我们构建了5万个可视化推理实例。然后,我们通过有监督的微调开发了一个视觉推理机,它能够以即插即用的方式普遍增强现有的各种VLM的推理能力。大量的实验表明,视觉推理机可以在四个VQA基准上一致且显着地提高四个VLM。我们的代码和数据集可以在这个HTTPS URL上找到。

[NLP-21] Interactive Topic Models with Optimal Transport
[NLP-21] 具有最佳运输的交互式主题模型

链接: https://arxiv.org/abs/2406.19928
作者: Garima Dhanania,Sheshera Mysore,Chau Minh Pham,Mohit Iyyer,Hamed Zamani,Andrew McCallum
关键词: analyze document collections, document collections, corpus, Topic, analyze document
中文关键词: 分析文档集、文档集、文集、主题、分析文档
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: Pre-print; Work in progress

点击查看摘要

Abstract:Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high level theoretical framework (e.g. political ideology). In these scenarios analysts desire a topic modeling approach which incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, as an approach for label name supervised topic modeling. EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities and using optimal transport for making globally coherent topic-assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers, and topic models based on clustering and LDA. Further, we show EdTM’s ability to incorporate various forms of analyst feedback and while remaining robust to noisy analyst inputs.
摘要:主题模型被广泛用于分析文档集合。虽然当分析人员不熟悉语料库时,它们对于发现语料库中的潜在主题很有价值,但分析人员通常也会从理解语料库中存在的内容开始。这可能是通过从语料库的初始传递获得的类别,或者通过从高级理论框架(例如,政治意识形态)派生的预定义类别集来分析语料库的愿望。在这些场景中,分析人员需要一种主题建模方法,该方法包含了他们对语料库的理解,同时支持与模型的各种形式的交互。在这项工作中,我们提出了EDTM,作为一种标签名称监督主题建模的方法。EdTM将主题建模建模为分配问题,同时利用基于LM/LLM的文档主题亲和力,并使用最优传输进行全局连贯的主题分配。在实验中,我们与少镜头LLM分类器以及基于聚类和LDA的主题模型进行了比较,证明了该框架的有效性。此外,我们展示了EdTM的能力,包括各种形式的分析师反馈,同时保持稳健的噪音分析师输入。
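
EdTM用最优传输在全局上协调文档-主题分配。下面用NumPy手写的Sinkhorn迭代给出一个示意:在随机生成的文档-主题亲和度上求熵正则的传输计划,再按计划取argmax得到分配。边缘分布、正则强度等均为假设的示例设置,并非论文实现。

```python
# 极简示意:用手写的 Sinkhorn 迭代求文档-主题的熵正则最优传输计划
# 亲和度为随机示例,边缘分布与正则强度均为假设设置,并非 EdTM 的实现
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics = 6, 3
affinity = rng.random((n_docs, n_topics))   # 假设由 LM/LLM 给出的文档-主题亲和度
cost = 1.0 - affinity                       # 亲和度越高,传输代价越低

a = np.full(n_docs, 1.0 / n_docs)           # 文档侧边缘分布
b = np.full(n_topics, 1.0 / n_topics)       # 主题侧边缘分布(假设各主题规模大致均衡)
K = np.exp(-cost / 0.1)                     # 熵正则核,0.1 为正则强度
u = np.ones(n_docs)
for _ in range(200):                        # Sinkhorn 交替缩放
    v = b / (K.T @ u)
    u = a / (K @ v)

plan = u[:, None] * K * v[None, :]          # 传输计划,行和/列和分别逼近 a 和 b
print("每篇文档分配到的主题:", plan.argmax(axis=1))
```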

[NLP-22] Paraphrase Types Elicit Prompt Engineering Capabilities
[NLP-22] 释义类型激发提示工程能力

链接: https://arxiv.org/abs/2406.19898
作者: Jan Philip Wahle,Terry Ruas,Yang Xu,Bela Gipp
关键词: success of modern, models, modern language models, language models depends, language models
中文关键词: 现代的成功,模型,现代语言模型,语言模型取决于,语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure behavioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic expression.
摘要:现代语言模型的成功很大程度上取决于找到合适的提示来指导模型。到目前为止,提示语的语言表达的差异如何影响这些模型在很大程度上是未知的。本研究系统地、实证地评估了哪些语言特征通过释义类型,即特定位置的不同语言变化来影响模型。我们通过120个任务和六个释义家族(即词法、句法、词汇、词汇句法、语篇等)测量了五个模型的行为变化。我们还控制了其他提示工程因素(例如,提示长度、词汇多样性和与训练数据的接近程度)。我们的结果表明,当语言模型的提示被改编成特定的释义类型时,它们有改进任务的潜力(例如,Mixtral 8x7B的中位数收益为6.7%;LLaMA 3 8B的中位数收益为5.5%)。特别是,词法和词汇的变化,即使用的词汇,在改善提示方面表现出了希望。这些发现有助于开发更健壮的语言模型,能够处理语言表达的可变性。

[NLP-23] Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
[NLP-23] 解开不受限制的网络:多语言注册表的自动识别

链接: https://arxiv.org/abs/2406.19892
作者: Erik Henriksson,Amanda Myntti,Anni Eskelinen,Selcen Erten-Johansson,Saara Hellström,Veronika Laippala
关键词: article explores deep, text varieties, discussion forums, explores deep learning, article explores
中文关键词: 文章探讨深度、文本多样性、论坛、探讨深度学习、文章探讨
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This article explores deep learning models for the automatic identification of registers - text varieties such as news reports and discussion forums - in web-based datasets across 16 languages. Web register (or genre) identification would provide a robust solution for understanding the content of web-scale datasets, which have become crucial in computational linguistics. Despite recent advances, the potential of register classifiers on the noisy web remains largely unexplored, particularly in multilingual settings and when targeting the entire unrestricted web. We experiment with a range of deep learning models using the new Multilingual CORE corpora, which includes 16 languages annotated using a detailed, hierarchical taxonomy of 25 registers designed to cover the entire unrestricted web. Our models achieve state-of-the-art results, showing that a detailed taxonomy in a hierarchical multi-label setting can yield competitive classification performance. However, all models hit a glass ceiling at approximately 80% F1 score, which we attribute to the non-discrete nature of web registers and the inherent uncertainty in labeling some documents. By pruning ambiguous examples, we improve model performance to over 90%. Finally, multilingual models outperform monolingual ones, particularly benefiting languages with fewer training examples and smaller registers. Although a zero-shot setting decreases performance by an average of 7%, these drops are not linked to specific registers or languages. Instead, registers show surprising similarity across languages.
摘要:本文探讨了深度学习模型在16种语言的网络数据集中自动识别语域–文本变体,如新闻报道和讨论论坛–中的应用。网络语域(或语类)识别将为理解网络规模的数据集的内容提供一个健壮的解决方案,这在计算语言学中已变得至关重要。尽管最近取得了进展,但在嘈杂的网络上注册分类器的潜力仍然很大程度上仍未开发,特别是在多语言环境中和在针对整个不受限制的网络时。我们使用新的多语言核心语料库试验了一系列深度学习模型,其中包括16种语言,使用25个注册表的详细分层分类进行标注,旨在覆盖整个不受限制的网络。我们的模型获得了最先进的结果,表明在分层多标签设置中的详细分类可以产生具有竞争力的分类性能。然而,所有模型都达到了大约80%的F1分数的玻璃天花板,我们将其归因于网络注册的非离散性质以及在标记一些文件时固有的不确定性。通过剪枝歧义实例,我们将模型性能提高到90%以上。最后,多语言模型的表现优于单语言模型,尤其是对训练样本较少和语域较小的语言有利。尽管零拍设置会使性能平均下降7%,但这些下降与特定的寄存器或语言无关。相反,语域在不同语言之间表现出惊人的相似性。

[NLP-24] Investigating the Timescales of Language Processing with EEG and Language Models
[NLP-24] 利用脑电和语言模型研究语言处理的时间尺度

链接: https://arxiv.org/abs/2406.19884
作者: Davide Turco,Conor Houghton
关键词: pre-trained transformer-based language, Temporal Response Function, transformer-based language model, examining the alignment, alignment between word
中文关键词: 预训练的基于转换器的语言、时间响应函数、基于转换器的语言模型、检查对齐、单词之间的对齐
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: Accepted at the 2024 Conference on Cognitive Computational Neuroscience (CCN 2024)

点击查看摘要

Abstract:This study explores the temporal dynamics of language processing by examining the alignment between word representations from a pre-trained transformer-based language model, and EEG data. Using a Temporal Response Function (TRF) model, we investigate how neural activity corresponds to model representations across different layers, revealing insights into the interaction between artificial language models and brain responses during language comprehension. Our analysis reveals patterns in TRFs from distinct layers, highlighting varying contributions to lexical and compositional processing. Additionally, we used linear discriminant analysis (LDA) to isolate part-of-speech (POS) representations, offering insights into their influence on neural responses and the underlying mechanisms of syntactic processing. These findings underscore EEG’s utility for probing language processing dynamics with high temporal resolution. By bridging artificial language models and neural activity, this study advances our understanding of their interaction at fine timescales.
摘要:这项研究通过检验来自预先训练的基于变压器的语言模型的单词表征与脑电数据之间的一致性,来探索语言处理的时间动力学。使用时间反应函数(TRF)模型,我们研究了神经活动如何对应于不同层的模型表征,揭示了人工语言模型与语言理解过程中大脑反应之间的相互作用。我们的分析揭示了TRF中不同层次的模式,突出了对词汇和成分加工的不同贡献。此外,我们使用线性判别分析(LDA)来分离词性(POS)表征,以深入了解它们对神经反应的影响以及句法处理的潜在机制。这些发现强调了脑电在探索具有高时间分辨率的语言加工动力学方面的作用。通过将人工语言模型和神经活动联系起来,这项研究促进了我们对它们在精细时间尺度上相互作用的理解。

[NLP-25] Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
[NLP-25] 使用相对似然谱检测人类语言和模型语言之间的微妙差异

链接: https://arxiv.org/abs/2406.19874
作者: Yang Xu,Yu Wang,Hao An,Zhichen Liu,Yongyuan Li
关键词: distinguished by examining, examining the magnitude, Human and model-generated, model-generated texts, Abstract
中文关键词: 通过检查、检查幅度、人类和模型生成的、模型生成的文本、抽象来区分
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, it is becoming increasingly difficult as language model’s capabilities of generating human-like texts keep evolving. This study provides a new perspective by using the relative likelihood values instead of absolute ones, and extracting useful features from the spectrum-view of likelihood for the human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, respectively, which results in competitive performances with previous zero-shot detection methods and a new state-of-the-art on short-text detection. Our method can also reveal subtle differences between human and model languages, which find theoretical roots in psycholinguistics studies. Our code is available at this https URL
摘要:人类文本和模型生成文本可以通过检查语言中似然值的大小来区分。然而,随着语言模型生成类人文本的能力不断发展,这变得越来越困难。这项研究通过使用相对似然值而不是绝对似然值,并从似然的谱视图中提取有用的特征,为人类与模型文本检测任务提供了一个新的视角。我们提出了一种采用两种分类方法(分别是有监督的和基于启发式的)的检测过程,其性能与之前的零样本检测方法相比具有竞争力,并在短文本检测上达到了新的最优水平。我们的方法还可以揭示人类语言和模型语言之间的微妙差异,这些差异在心理语言学研究中找到了理论根源。我们的代码可在此https URL上获取。
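
论文利用相对似然及其谱特征来区分人类与模型文本。下面仅示意最基础的一步:用一个小型语言模型提取文本的逐token对数似然序列,并做一个最简单的“去均值”相对化处理;模型选择与相对化方式都是假设性的,并非论文的完整特征提取。

```python
# 极简示意:提取文本的逐 token 对数似然序列,并做最简单的"去均值"相对化
# 模型与相对化方式均为假设性示例,并非论文的完整特征(谱)提取
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_loglikelihoods(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # 取每个位置上"真实的下一个 token"的对数概率
    return logprobs.gather(-1, ids[:, 1:, None]).squeeze()

seq = token_loglikelihoods("The quick brown fox jumps over the lazy dog.")
relative = seq - seq.mean()   # 一种最朴素的"相对似然"处理
print([round(x, 2) for x in relative.tolist()])
```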

[NLP-26] YuLan: An Open-source Large Language Model
[NLP-26] YuLan:开源大型语言模型

链接: https://arxiv.org/abs/2406.19853
作者: Yutao Zhu,Kun Zhou,Kelong Mao,Wentong Chen,Yiding Sun,Zhipeng Chen,Qian Cao,Yihan Wu,Yushuo Chen,Feng Wang,Lei Zhang,Junyi Li,Xiaolei Wang,Lei Wang,Beichen Zhang,Zican Dong,Xiaoxue Cheng,Yuhan Chen,Xinyu Tang,Yupeng Hou,Qiangqiang Ren,Xincheng Pang,Shufang Xie,Wayne Xin Zhao,Zhicheng Dou,Jiaxin Mao,Yankai Lin,Ruihua Song,Jun Xu,Xu Chen,Rui Yan,Zhewei Wei,Di Hu,Wenbing Huang,Ze-Feng Gao,Yueguo Chen,Weizheng Lu,Ji-Rong Wen
关键词: Large language models, understanding natural language, Large language, natural language, leveraging their extensive
中文关键词: 大型语言模型,理解自然语言,大型语言,自然语言,利用其广泛的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7 T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan’s overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework throughout across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan’s training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at this https URL.
摘要:大型语言模型利用其在处理和理解自然语言方面的广泛能力,已经成为许多应用程序的基础。虽然许多开放源码的LLM已经发布了技术报告,但缺乏训练细节阻碍了进一步的研究和开发。本文介绍了玉兰(YuLan)的开发,这是一个开源的、拥有120亿个参数的大语言模型系列。玉兰的基础模型是在来自不同语料库的大约1.7T个标记上进行预训练的,这些语料库包括大量的英语、汉语和多语言文本。我们设计了三阶段预训练方法,以提升玉兰的整体能力。随后的训练阶段包括指令微调和人类对齐,使用了大量高质量的合成数据。为了促进复杂和长尾知识的学习,我们设计了一个贯穿这些阶段的课程学习框架,帮助LLM以一种由易到难的方式学习知识。玉兰的训练于2024年1月结束,在各种英文和中文基准上取得了与最先进的LLM不相上下的表现。本文概述了从零开始开发大语言模型的全面技术路线图。我们的模型和代码可以在这个HTTPS URL上找到。

[NLP-27] AnomaLLMy – Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions
[NLP-27] AnomaLLMy --通过低置信度单令牌预测检测黑匣子LLM中的异常令牌

链接: https://arxiv.org/abs/2406.19840
作者: Waligóra Witold
关键词: black-box Large Language, Large Language Models, Large Language, paper introduces AnomaLLMy, black-box Large
中文关键词: 黑匣子大型语言,大型语言模型,大型语言,论文介绍AnomaLLMy,大型黑匣子
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages

点击查看摘要

Abstract:This paper introduces AnomaLLMy, a novel technique for the automatic detection of anomalous tokens in black-box Large Language Models (LLMs) with API-only access. Utilizing low-confidence single-token predictions as a cost-effective indicator, AnomaLLMy identifies irregularities in model behavior, addressing the issue of anomalous tokens degrading the quality and reliability of models. Validated on the cl100k_base dataset, the token set of GPT-4, AnomaLLMy detected 413 major and 65 minor anomalies, demonstrating the method’s efficiency with just $24.39 spent in API credits. The insights from this research are expected to be beneficial for enhancing the robustness and accuracy of LLMs, particularly in the development and assessment of tokenizers.
摘要:本文介绍了AnomaLLMy,这是一种新型技术,用于自动检测仅限API访问的黑匣子大型语言模型(LLM)中的异常标记。AnomaLLMy利用低置信度单令牌预测作为具有成本效益的指标,识别模型行为中的违规行为,解决异常令牌降低模型质量和可靠性的问题。AnomaLLMy在cl100k_base数据集(GPT-4的令牌集)上进行验证,检测到413个主要异常和65个次要异常,证明了该方法的效率,仅花费了24.39美元API积分。这项研究的见解预计将有助于增强LLM的稳健性和准确性,特别是在标记器的开发和评估方面。
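
以下是一个示意性草图,演示"用低置信度的单 token 预测筛查异常 token"的基本思路。其中 `get_top1_prob` 是假设的辅助函数(代表对仅有 API 访问权限的黑盒模型的一次调用,返回首选 token 的概率),阈值与提示模板也均为假设,并非论文原始设置。

```python
# 假设性示意代码:对候选 token 逐个构造"复述"提示,
# 若模型对该 token 的单步预测置信度过低,则记为异常 token。
from typing import Callable, Iterable

def find_anomalous_tokens(
    tokens: Iterable[str],
    get_top1_prob: Callable[[str], float],  # 假设:调用黑盒 LLM API,返回下一个 token 的最高概率
    threshold: float = 0.3,                 # 假设的置信度阈值
) -> list[str]:
    anomalies = []
    for tok in tokens:
        prompt = f'Please repeat the following string exactly: "{tok}"\n'
        if get_top1_prob(prompt) < threshold:   # 低置信度 -> 可能是异常 token
            anomalies.append(tok)
    return anomalies

# 用法示例(用一个玩具打分函数代替真实 API 调用):
if __name__ == "__main__":
    fake_scorer = lambda p: 0.1 if "SolidGoldMagikarp" in p else 0.9
    print(find_anomalous_tokens(["hello", "SolidGoldMagikarp"], fake_scorer))
```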

[NLP-28] BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
[NLP-28] BeamAggR:基于多源知识的束聚合推理,用于多跳问题回答

链接: https://arxiv.org/abs/2406.19820
作者: Zheng Chu,Jingchang Chen,Qianglong Chen,Haotian Wang,Kun Zhu,Xiyuan Du,Weijiang Yu,Ming Liu,Bing Qin
关键词: Large language models, Large language, strong reasoning capabilities, demonstrated strong reasoning, language models
中文关键词: 大型语言模型,大型语言,推理能力强,表现出强大的推理,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of question. Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
摘要:大型语言模型具有很强的推理能力。然而,在处理知识密集型任务时,他们仍然受到事实错误的影响。检索-增强推理是一种很有前途的方法。然而,仍然存在重大挑战,包括对复杂问题的检索不准确和不足,以及整合多源知识的困难。为了解决这个问题,我们提出了一种面向知识密集型多跳问答的推理框架–波束聚合推理BeamAggR。BeamAggR在每一跳问题中探索并确定有希望的答案的优先顺序。具体地,我们将复杂问题解析成树,其中包括原子问题和复合问题,然后进行自底向上的推理。对于原子问题,LLM对多源知识进行推理,得到候选答案。对于复合问题,LLM结合BEAM候选,通过概率聚集探索多条推理路径,并对最有希望的轨迹进行优先排序。在四个开放领域多跳推理数据集上的大量实验表明,该方法的性能明显优于SOTA方法8.5%。此外,我们的分析表明,BeamAggR可以带来更好的知识协作和答案聚合。
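
以下为一个高度简化的示意草图,演示摘要所述"对解析好的问题树做自底向上推理、在每一跳对候选答案做概率聚合"的框架形态。其中 `Question`、`answer_atomic` 与 `compose` 均为假设的占位实现,仅用于说明流程,并非论文官方代码。

```python
# 假设性示意代码:对问题树做自底向上的束聚合推理
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Question:
    text: str
    children: list = field(default_factory=list)   # 空列表表示原子问题

def answer_atomic(q: Question) -> dict[str, float]:
    """假设:用多源知识(LLM 推理)给出候选答案及其归一化得分,此处用占位答案代替。"""
    return {"placeholder answer": 1.0}

def compose(q: Question, child_beams: list[dict[str, float]],
            beam_size: int = 3) -> dict[str, float]:
    """极度简化的聚合:合并各子问题候选的得分,保留 top-k 并归一化。
    论文中是对多条推理路径做概率聚合,此处仅示意接口形态。"""
    scores: dict[str, float] = defaultdict(float)
    for beam in child_beams:
        for ans, p in beam.items():
            scores[ans] += p
    top = sorted(scores.items(), key=lambda kv: -kv[1])[:beam_size]
    total = sum(p for _, p in top) or 1.0
    return {a: p / total for a, p in top}

def beam_aggregate(q: Question, beam_size: int = 3) -> dict[str, float]:
    if not q.children:                      # 原子问题:直接多源推理
        return answer_atomic(q)
    child_beams = [beam_aggregate(c, beam_size) for c in q.children]
    return compose(q, child_beams, beam_size)
```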

[NLP-29] Scalable and Domain-General Abstractive Proposition Segmentation
[NLP-29] 可扩展和领域通用抽象命题分割

链接: https://arxiv.org/abs/2406.19803
作者: Mohammad Javad Hosseini,Yang Gao,Tim Baumgärtner,Alex Fabrikant,Reinald Kim Amplayo
关键词: Segmenting text, units of meaning, wide range, proposition segmentation, fine-grained units
中文关键词: 文本分段、含义单位、广泛范围、命题分段、细粒度单位
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Segmenting text into fine-grained units of meaning is important to a wide range of NLP applications. The default approach of segmenting text into sentences is often insufficient, especially since sentences are usually complex enough to include multiple units of meaning that merit separate treatment in the downstream task. We focus on the task of abstractive proposition segmentation: transforming text into simple, self-contained, well-formed sentences. Several recent works have demonstrated the utility of proposition segmentation with few-shot prompted LLMs for downstream tasks such as retrieval-augmented grounding and fact verification. However, this approach does not scale to large amounts of text and may not always extract all the facts from the input text. In this paper, we first introduce evaluation metrics for the task to measure several dimensions of quality. We then propose a scalable, yet accurate, proposition segmentation model. We model proposition segmentation as a supervised task by training LLMs on existing annotated datasets and show that training yields significantly improved results. We further show that by using the fine-tuned LLMs as teachers for annotating large amounts of multi-domain synthetic distillation data, we can train smaller student models with results similar to the teacher LLMs. We then demonstrate that our technique leads to effective domain generalization, by annotating data in two domains outside the original training data and evaluating on them. Finally, as a key contribution of the paper, we share an easy-to-use API for NLP practitioners to use.
摘要:将文本分割成细粒度的语义单元对于广泛的自然语言处理应用非常重要。将文本分割成句子的默认方法通常是不够的,特别是因为句子通常足够复杂,包括需要在下游任务中单独处理的多个意义单元。我们专注于抽象命题切分的任务:将文本转换为简单、自包含、格式良好的句子。最近的几项工作已经证明了少样本提示LLM做命题分割对下游任务(如检索增强的事实支撑和事实验证)的作用。然而,这种方法不适用于大量文本,并且可能并不总是从输入文本中提取所有事实。在本文中,我们首先引入了任务的评估度量来衡量质量的几个维度。然后,我们提出了一个可扩展的,但又准确的命题分割模型。我们通过在已有的标注数据集上训练大语言模型(LLM),将命题分割建模为一项有监督的任务,并表明训练得到了显著改善的结果。我们进一步表明,通过使用微调的LLM作为教师来标注大量的多域合成蒸馏数据,我们可以训练更小的学生模型,得到与教师LLM相似的结果。然后,我们证明了我们的技术导致了有效的领域概括,通过标注原始训练数据之外的两个领域的数据并对它们进行评估。最后,作为本文的一个主要贡献,我们分享了一个易于使用的API,供NLP从业者使用。
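
论文主体是监督微调与蒸馏,下面仅用一个示意草图说明"把文本切分成自包含的简单命题"这一任务形式最直观的做法之一:少样本提示。其中 `call_llm` 为假设的模型调用接口,提示词也只是示例。

```python
# 假设性示意代码:few-shot 提示式命题切分(仅演示任务形式,非论文的监督/蒸馏方法)
from typing import Callable

FEW_SHOT_PROMPT = """Split the passage into simple, self-contained propositions, one per line.

Passage: Marie Curie, a physicist born in Warsaw, won two Nobel Prizes.
Propositions:
- Marie Curie was a physicist.
- Marie Curie was born in Warsaw.
- Marie Curie won two Nobel Prizes.

Passage: {passage}
Propositions:
"""

def segment_propositions(passage: str, call_llm: Callable[[str], str]) -> list[str]:
    """call_llm 为假设的 LLM 接口:输入提示词,返回文本补全。"""
    completion = call_llm(FEW_SHOT_PROMPT.format(passage=passage))
    return [line.lstrip("- ").strip()
            for line in completion.splitlines() if line.strip().startswith("-")]
```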

[NLP-30] NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations
[NLP-30] NLPerturbator:研究代码LLM对自然语言变体的鲁棒性

链接: https://arxiv.org/abs/2406.19783
作者: Junkai Chen,Zhenhao Li,Xing Hu,Xin Xia
关键词: Large language models, achieve promising results, natural language description, Large language, natural language
中文关键词: 大型语言模型,取得有希望的结果,自然语言描述,大型语言,自然语言
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The natural language description in the prompt is crucial for LLMs to comprehend users’ requirements. Prior studies uncover that LLMs are sensitive to the changes in the prompts, including slight changes that look inconspicuous. However, the natural language descriptions often vary in real-world scenarios (e.g., different formats, grammar, and wording). Prior studies on the robustness of LLMs are often based on random perturbations and such perturbations may not actually happen. In this paper, we conduct a comprehensive study to investigate how are code LLMs robust to variations of natural language description in real-world scenarios. We summarize 18 categories of perturbations of natural language and 3 combinations of co-occurred categories based on our literature review and an online survey with practitioners. We propose an automated framework, NLPerturbator, which can perform perturbations of each category given a set of prompts. Through a series of experiments on code generation using six code LLMs, we find that the perturbed prompts can decrease the performance of code generation by a considerable margin (e.g., up to 21.2%, and 4.8% to 6.1% on average). Our study highlights the importance of enhancing the robustness of LLMs to real-world variations in the prompts, as well as the essentiality of attentively constructing the prompts.
摘要:大型语言模型在基于给定自然语言描述的代码生成方面取得了良好的效果。它们已经集成到开源项目和商业产品中,以方便日常编码活动。提示中的自然语言描述是LLMS理解用户需求的关键。先前的研究发现,LLM对提示中的变化很敏感,包括看起来不明显的微小变化。然而,自然语言描述在真实世界场景中往往不同(例如,不同的格式、语法和措辞)。以往关于LLMS稳健性的研究往往是基于随机扰动的,并且这种扰动可能并不实际发生。在本文中,我们进行了一项全面的研究,以调查代码LLM如何在现实世界场景中对自然语言描述的变化具有健壮性。在文献回顾和在线实践者调查的基础上,我们总结了自然语言扰动的18种类型和3种共现类型的组合。我们提出了一个自动化框架NLPerturator,它可以在给定一组提示的情况下执行每个类别的扰动。通过使用六个代码LLM进行的一系列代码生成实验,我们发现干扰提示会显著降低代码生成的性能(例如,高达21.2%,平均为4.8%到6.1%)。我们的研究强调了增强LLMS对现实世界提示变化的稳健性的重要性,以及用心构建提示的重要性。
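
以下是一个示意草图,展示对代码生成提示中的自然语言描述施加几类"真实场景中可能出现"的扰动。论文定义了18类扰动,这里的三个类别与实现均为博主举例的假设,并非论文原始分类。

```python
# 假设性示意代码:对提示的自然语言部分施加几类简单扰动
import random
import re

def drop_punctuation(text: str) -> str:
    return re.sub(r"[.,!?;:]", "", text)

def lowercase_all(text: str) -> str:
    return text.lower()

def shuffle_sentences(text: str) -> str:
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    random.shuffle(sents)
    return " ".join(sents)

PERTURBATIONS = {
    "no_punct": drop_punctuation,
    "lowercase": lowercase_all,
    "sent_shuffle": shuffle_sentences,
}

def perturb(prompt: str, category: str) -> str:
    return PERTURBATIONS[category](prompt)

# 用法:对同一提示生成多个扰动版本,再比较代码 LLM 生成质量的变化
original = "Write a function that returns the n-th Fibonacci number. Use iteration, not recursion."
for name in PERTURBATIONS:
    print(name, "->", perturb(original, name))
```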

[NLP-31] Direct Preference Knowledge Distillation for Large Language Models
[NLP-31] 大型语言模型的直接偏好知识提炼

链接: https://arxiv.org/abs/2406.19774
作者: Yixing Li,Yuxian Gu,Li Dong,Dequan Wang,Yu Cheng,Furu Wei
关键词: large language models, Preference Knowledge Distillation, Knowledge Distillation, language models, Direct Preference Knowledge
中文关键词: 大型语言模型、偏好知识蒸馏、知识蒸馏、语言模型、直接偏好知识
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in distillation of LLMs, including efficiency and insufficient measurement capabilities of traditional KL divergence. It is shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conducted experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B and demonstrate the broad applicability and effectiveness of our DPKD approach. Meanwhile, we prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact match percentage. Code and data are available at this https URL.
摘要:在大型语言模型领域,知识蒸馏是将能力从教师模型转换为学生模型的关键技术。然而,现有的KD方法在蒸馏LLMS方面面临着局限性和挑战,包括传统KL发散的效率和测量能力不足。结果表明,LLMS可以作为一种隐式奖励函数,我们将其定义为对KL发散性的补充。在这项工作中,我们提出了用于LLMS的直接偏好知识提取(DPKD)。DPKD利用分布发散性来表示偏好损失和隐含奖励函数。我们将LLMS的Kd重新定义为两个阶段:首先是由隐性奖励和反向KL发散组成的优化和目标阶段,然后是提高教师输出相对于学生输出的偏好概率。我们在LLM参数从120M到13B的各种数据集上进行了实验和分析,证明了我们的DPKD方法的广泛适用性和有效性。同时,通过实验和理论分析,证明了引入的隐性报酬和产出偏好在KD中的价值和有效性。DPKD方法在输出响应精度和精确匹配率方面均优于基线方法。代码和数据可在此HTTPS URL上找到。
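
下面是博主按摘要字面意思拼出的损失函数草图:一项逐 token 的反向 KL 蒸馏项,加一项 DPO 风格的"教师输出优于学生输出"的隐式奖励偏好项。具体目标函数、两阶段的组织方式以论文为准,此处仅作示意。

```python
# 假设性示意代码:反向 KL + DPO 风格偏好项(并非论文官方实现)
import torch
import torch.nn.functional as F

def reverse_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """逐 token 的反向 KL:KL(student || teacher)。logits 形状 [batch, seq, vocab]。"""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

def preference_loss(logp_teacher_out: torch.Tensor,
                    logp_student_out: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO 风格的偏好项:提高"教师输出被偏好于学生输出"的概率(隐式奖励)。
    两个输入均为学生模型对整条序列的对数概率之和,形状 [batch];beta 为假设的温度系数。"""
    margin = beta * (logp_teacher_out - logp_student_out)
    return -F.logsigmoid(margin).mean()
```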

[NLP-32] Belief Revision: The Adaptability of Large Language Models Reasoning
[NLP-32] 信念修正:大型语言模型推理的适应性

链接: https://arxiv.org/abs/2406.19764
作者: Bryan Wilie,Samuel Cahyawijaya,Etsuko Ishii,Junxian He,Pascale Fung
关键词: real-world NLP applications, NLP applications, real-world NLP, capability to reason, reason from text
中文关键词: 现实世界的NLP应用程序,NLP应用程序,现实世界的NLP,推理能力,根据文本推理
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data. In response, individuals update their beliefs and understandings accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs’ belief revision ability when presented with new evidence. Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning (ΔR) framework. Belief-R features sequences of premises designed to simulate scenarios where additional information could necessitate revising prior conclusions drawn by LMs. We evaluate ~30 LMs across diverse prompting strategies and found that LMs generally struggle to appropriately revise their beliefs in response to new information. Further, models adept at updating often underperformed in scenarios without necessary updates, highlighting a critical trade-off. These insights underscore the importance of improving LMs’ adaptiveness to changing information, a step toward more reliable AI systems.
摘要:从文本中进行推理的能力对于实际的自然语言处理应用至关重要。现实世界中的场景通常涉及不完整或不断变化的数据。作为回应,个人会相应地更新他们的信念和理解。然而,现有的大多数评估都假设语言模型(LM)在信息一致的情况下运行。我们介绍了Belief-R,这是一个新的数据集,用于测试LM在出现新证据时的信念修正能力。受人类如何抑制先前推理的启发,本任务在新提出的增量推理(ΔR)框架内评估LM。Belief-R以前提序列为特色,旨在模拟新增信息可能要求LM修正其先前结论的场景。我们通过不同的提示策略对约30个LM进行评估,发现LM通常很难根据新信息适当修正其信念。此外,善于更新的模型在没有必要更新的情况下往往表现不佳,这突显了一个关键的权衡。这些见解突显了提高LM对不断变化的信息的适应能力的重要性,这是迈向更可靠的人工智能系统的一步。

[NLP-33] Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation
[NLP-33] 通过知识引导的案例重组学习可解释的法律案例检索

链接: https://arxiv.org/abs/2406.19760
作者: Chenlong Deng,Kelong Mao,Zhicheng Dou
关键词: upholding judicial fairness, Legal case retrieval, sourcing similar cases, Legal case, case retrieval
中文关键词: 维护司法公正,法律案件检索,寻找类似案件,法律案件,案件检索
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.
摘要:相似案件的法律检索是维护司法公正的关键。与一般的网络搜索不同,法律案例检索涉及处理冗长、复杂和高度专业化的法律文档。现有的方法往往忽略了法律专家知识的融合,而法律专家知识是准确理解和建模法律案件的关键,导致检索性能不佳。本文介绍了Keller,一种基于大型语言模型(LLMS)的以法律知识为导向的案例重构方法,用于有效和可解释的法律案例检索。通过整合关于犯罪的专业法律知识和法律条款,我们使大型语言模型能够准确地将原始法律案件重新表述为包含案件基本信息的简洁的犯罪子事实。在两个法律案例检索基准上的大量实验表明,与现有方法相比,Keller的检索性能和对复杂法律案例查询的健壮性更好。

[NLP-34] Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
[NLP-34] 通过基于音译的训练后对齐打破多语言预训练语言模型中的文字障碍

链接: https://arxiv.org/abs/2406.19759
作者: Orgest Xhelili,Yihong Liu,Hinrich Schütze
关键词: Multilingual pre-trained models, Multilingual pre-trained, shown impressive performance, shown impressive, Multilingual
中文关键词: 多语言预训练模型,多语言预训练,表现出令人印象深刻的性能,表现出令人印象深刻的,多语言
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at this https URL.
摘要:多语种预训练模型在跨语言迁移任务中表现出了令人印象深刻的表现。然而,当低资源目标语言用与高资源源语言不同的脚本编写时,即使这两种语言可能相关或共享它们的部分词汇表,迁移性能也经常受到阻碍。受最近利用音译来解决这一问题的工作的启发,本文提出了一种基于音译的训练前后对齐(PPA)方法,旨在提高使用不同脚本的语言之间的跨语言对齐。我们选择了两个区域语言组,\textbf地中海-阿姆哈拉语-波斯语和\textbf南亚+东亚语言,这两种语言相互影响,但使用不同的脚本。我们将我们的方法应用于这些语言组,并在一系列下游任务上进行了广泛的实验。结果表明,在PPA之后,在以英语为中心的迁移中,模型的成绩一直优于原模型(某些任务高达50%)。此外,当我们使用英语以外的语言作为转移的来源时,我们的方法获得了更大的改进。我们将在此HTTPS URL上公开提供我们的代码和模型。

[NLP-35] MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
[NLP-35] MM-Instruct:为大型多模态模型对齐生成的视觉指令

链接: https://arxiv.org/abs/2406.19736
作者: Jihao Liu,Xin Huang,Jinliang Zheng,Boxiao Liu,Jia Wang,Osamu Yoshie,Yu Liu,Hongsheng Li
关键词: high-quality visual instruction, visual instruction data, visual instruction, instruction-following capabilities, instruction data designed
中文关键词: 高质量的视觉教学、视觉教学数据、视觉教学、遵循描述的能力、设计的教学数据
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Dataset and models are available at this https URL

点击查看摘要

Abstract:This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at this https URL.
摘要:本文介绍了MM-Indict,这是一个包含各种高质量可视化教学数据的大规模数据集,旨在增强大型多通道模型(LMM)的指令跟踪能力。虽然现有的可视化教学数据集通常侧重于问题回答,但它们难以推广到更广泛的应用场景,如创造性写作、摘要或图像分析。为了克服这些局限性,我们提出了一种新的构建MM-指令集的方法,该方法利用现有LLM强大的指令跟随能力,从大规模但传统的图像字幕数据集中生成新的视觉指令数据。MM-Indict首先利用ChatGPT通过扩充和汇总从一小组种子指令中自动生成不同的指令。然后,它将这些指令与图像进行匹配,并使用开源的大型语言模型(LLM)来生成指令-图像对的连贯答案。LLM以整个答案生成过程中图像的详细文本描述为基础,以保证指令数据的对齐。此外,我们还引入了一个基于生成的指令数据的基准测试来评估现有LMM的指令跟随能力。我们通过在生成的数据上训练LLaVA-1.5模型来证明MM-Indict的有效性,该模型与LLaVA-1.5模型相比,在指令跟随能力方面显示出显著的改进。MM-指令数据集、基准和预先训练的模型可在此HTTPS URL中找到。

[NLP-36] Message du troisième type : irruption d'un tiers dans un dialogue en ligne
[NLP-36] 第三类消息:第三方介入在线对话

链接: https://arxiv.org/abs/2406.19731
作者: Ludovic Tanguy(CLLE),Céline Poudat(BCL),Lydia-Mai Ho-Dac(CLLE)
关键词: Wikipedia talk pages, global perspective analyzing, perspective analyzing contributors’, analyzing contributors’ behaviors, Wikipedia talk
中文关键词: 维基百科谈话页面、全球视角分析、视角分析贡献者、分析贡献者行为、维基百科谈话
类目: Computation and Language (cs.CL)
备注: in French language. JADT 2024 - 17es Journées internationales d’Analyse statistique des Données Textuelles, SeSLa (Séminaire des Sciences du Langage de l’UCLouvain – Site Saint-Louis); LASLA (Laboratoire d’Analyse statistique des Langues anciennes de l’Université de Liège), 2024, Bruxelles, Belgique

点击查看摘要

Abstract:Our study focuses on Wikipedia talk pages, from a global perspective analyzing contributors’ behaviors in online interactions. Using a corpus comprising all Wikipedia talk pages in French, totaling more than 300,000 discussion threads, we examine how discussions with more than two participants (multiparty conversation) unfold and we specifically investigate the role of a third participant’s intervention when two Wikipedians have already initiated an exchange. In this regard, we concentrate on the sequential structure of these interactions in terms of articulation among different participants and aim to specify this third message by exploring its lexical particularities, while also proposing an initial typology of the third participant’s message role and how it aligns with preceding messages.
摘要:我们的研究重点关注维基百科的讨论页面,从全球角度分析贡献者在在线互动中的行为。我们使用包含所有维基百科法语讨论页面(总计超过300,000个讨论线程)的文集,研究与两个以上参与者的讨论(多方对话)如何展开,并专门研究当两个维基人已经发起交流时第三个参与者干预的作用。在这方面,我们专注于不同参与者之间的清晰度方面的这些互动的顺序结构,并旨在通过探索第三个信息的词汇特殊性来指定第三个信息,同时还提出了第三个参与者的信息角色的初始类型学以及它如何与之前的信息保持一致。

[NLP-37] Le sens de la famille : analyse du vocabulaire de la parenté par les plongements de mots
[NLP-37] 家庭的意义:基于词嵌入的亲属关系词汇分析

链接: https://arxiv.org/abs/2406.19729
作者: Ludovic Tanguy(CLLE),Cécile Fabre(CLLE),Nabil Hathout(UT, CNRS, CLLE),Lydia-Mai Ho-Dac(CLLE)
关键词: highly structured, French lexicon, propose a corpus, corpus analysis, dense and highly
中文关键词: 高度结构化,法语词典,提出一个文集,文集分析,密集且高度
类目: Computation and Language (cs.CL)
备注: in French language. JADT 2024 - 17es Journées internationales d’Analyse statistique des Données Textuelles, SeSLa (Séminaire des Sciences du Langage de l’UCLouvain – Site Saint-Louis), 2024, Bruxelles, Belgique

点击查看摘要

Abstract:In this study, we propose a corpus analysis of an area of the French lexicon that is both dense and highly structured: the vocabulary of family relationships. Starting with a lexicon of 25 nouns designating the main relationships (son, cousin, mother, grandfather, sister-in-law etc.), we examine how these terms are positioned in relation to each other through distributional analyses based on the use of these terms in corpora. We show that distributional information can capture certain features that organize this vocabulary (descent, alliance, siblings, genre), in ways that vary according to the different corpora compared.
摘要:在这项研究中,我们提出了对法语词汇中一个密集且结构化程度高的领域进行一次文集分析:家庭关系词汇。从指定主要关系(儿子、表弟、母亲、祖父、嫂子等)的25个名词的词典开始,我们通过基于这些术语在数据库中的使用的分布分析来研究这些术语如何相互关联地定位。我们表明,分布信息可以捕捉组织该词汇的某些特征(血统、联盟、兄弟姐妹、流派),其方式根据所比较的不同的语料库而异。

[NLP-38] Uncertainty Quantification in Large Language Models Through Convex Hull Analysis
[NLP-38] 通过凸包分析进行大型语言模型中的不确定性量化

链接: https://arxiv.org/abs/2406.19712
作者: Ferhat Ozgur Catak,Murat Kuzlu
关键词: requiring reliable outputs, Uncertainty quantification approaches, large language models, high-risk applications requiring, applications requiring reliable
中文关键词: 需要可靠的输出、不确定性量化方法、大型语言模型、需要高风险应用、需要可靠的应用
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:Uncertainty quantification approaches have been more critical in large language models (LLMs), particularly high-risk applications requiring reliable outputs. However, traditional methods for uncertainty quantification, such as probabilistic models and ensemble techniques, face challenges when applied to the complex and high-dimensional nature of LLM-generated outputs. This study proposes a novel geometric approach to uncertainty quantification using convex hull analysis. The proposed method leverages the spatial properties of response embeddings to measure the dispersion and variability of model outputs. The prompts are categorized into three types, i.e., 'easy', 'moderate', and 'confusing', to generate multiple responses using different LLMs at varying temperature settings. The responses are transformed into high-dimensional embeddings via a BERT model and subsequently projected into a two-dimensional space using Principal Component Analysis (PCA). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is utilized to cluster the embeddings and compute the convex hull for each selected cluster. The experimental results indicate that the uncertainty of the model for LLMs depends on the prompt complexity, the model, and the temperature setting.
摘要:不确定性量化方法在大型语言模型(LLM)中变得更加关键,尤其是需要可靠输出的高风险应用。然而,传统的不确定性量化方法,如概率模型和集成技术,在应用于LLM生成的输出的复杂和高维性质时面临挑战。本研究提出了一种新的基于凸壳分析的不确定性几何量化方法。该方法利用响应嵌入的空间特性来度量模型输出的离散性和变异性。这些提示被分为三种类型,即“轻松”、“中等”和“令人困惑的”,以便在不同的温度设置下使用不同的LLM生成多个响应。通过BERT模型将响应转换为高维嵌入,然后使用主成分分析(PCA)将其投影到二维空间。基于密度的带噪声应用程序空间聚类(DBSCAN)算法被用来对嵌入进行聚类,并计算每个所选聚类的凸壳。实验结果表明,LLMS模型的不确定性取决于即时复杂性、模型和温度设置。
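
以下是一个可运行的示意草图,把摘要描述的流程串起来:多次采样回答的嵌入 -> PCA 降到二维 -> DBSCAN 聚类 -> 计算各簇凸包面积,作为回答离散度/不确定性的代理指标。其中句向量用随机数占位(真实流程应来自 BERT 等编码器),各超参数均为博主假设。

```python
# 假设性示意代码:凸包面积作为回答离散度的代理(嵌入用随机数占位)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 768))        # 占位:20 个回答的 BERT 句向量

points_2d = PCA(n_components=2).fit_transform(embeddings)
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(points_2d)   # 假设的聚类超参数

hull_areas = []
for label in set(labels) - {-1}:               # -1 为噪声点
    cluster = points_2d[labels == label]
    if len(cluster) >= 3:                      # 凸包至少需要 3 个点
        hull_areas.append(ConvexHull(cluster).volume)  # 二维情形下 volume 即面积

uncertainty = float(np.sum(hull_areas)) if hull_areas else 0.0
print("convex-hull uncertainty proxy:", uncertainty)
```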

[NLP-39] Less is More: Accurate Speech Recognition Translation without Web-Scale Data
[NLP-39] 少即是多:无需网络规模数据即可准确语音识别翻译

链接: https://arxiv.org/abs/2406.19674
作者: Krishna C. Puvvada,Piotr Żelasko,He Huang,Oleksii Hrinchuk,Nithin Rao Koluguri,Kunal Dhawan,Somshubra Majumdar,Elena Rastorgueva,Zhehuai Chen,Vitaly Lavrukhin,Jagadeesh Balam,Boris Ginsburg
关键词: Recent advances, hours of Internet, Internet speech data, Internet speech, rely on hundreds
中文关键词: 最近的进步、互联网时间、互联网语音数据、互联网语音,依赖于数百个
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech-2024

点击查看摘要

Abstract:Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enables such data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture (2) training on synthetic data generated with machine translation and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
摘要:语音识别和翻译的最新进展依赖于数十万小时的互联网语音数据。我们认为,无需依赖网络规模的数据即可达到最先进的准确性。Canary(金丝雀)是一个多语言ASR和语音翻译模型,在英语、法语、西班牙语和德语上优于当前最先进的模型 Whisper、OWSM 和 Seamless-M4T,同时其训练数据比这些模型少一个数量级。实现这种数据高效模型的三个关键因素:(1)基于FastConformer的注意力编码器-解码器架构;(2)对通过机器翻译生成的合成数据进行训练;(3)高级训练技术:数据平衡、动态数据混合、动态分桶和噪音稳健的微调。模型、权重和训练代码将开源。

[NLP-40] DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection Reasoning and Rewriting
[NLP-40] DECOR:利用不连贯检测推理和重写的新型基准提高L2英语写作的连贯性

链接: https://arxiv.org/abs/2406.19650
作者: Xuanming Zhang,Anthony Diaz,Zixun Chen,Qingyang Wu,Kun Qian,Erik Voss,Zhou Yu
关键词: English writing, aspect that second-language, crucial in assessing, English, writing
中文关键词: 英语写作,第二语言方面,对评估、英语、写作至关重要
类目: Computation and Language (cs.CL)
备注: 21 pages, 5 figures, 20 tables

点击查看摘要

Abstract:Coherence in writing, an aspect that second-language (L2) English learners often struggle with, is crucial in assessing L2 English writing. Existing automated writing evaluation systems primarily use basic surface linguistic features to detect coherence in writing. However, little effort has been made to correct the detected incoherence, which could significantly benefit L2 language learners seeking to improve their writing. To bridge this gap, we introduce DECOR, a novel benchmark that includes expert annotations for detecting incoherence in L2 English writing, identifying the underlying reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the first coherence assessment dataset specifically designed for improving L2 English writing, featuring pairs of original incoherent sentences alongside their expert-rewritten counterparts. Additionally, we fine-tuned models to automatically detect and rewrite incoherence in student essays. We find that incorporating specific reasons for incoherence during fine-tuning consistently improves the quality of the rewrites, achieving a result that is favored in both automatic and human evaluations.
摘要:写作中的连贯是二语英语学习者经常遇到的一个问题,也是评估二语英语写作的关键。现有的自动化写作评价系统主要使用基本的表层语言特征来检测写作中的连贯性。然而,几乎没有人努力纠正被发现的不连贯,这对寻求提高写作水平的二语学习者有很大的好处。为了弥补这一差距,我们引入了DECOR,这是一个新的基准,包括专家注释,用于检测二语英语写作中的不连贯,识别潜在原因,并重写不连贯的句子。据我们所知,DECOR是第一个专门为提高第二语言写作水平而设计的连贯评估数据集,它将原来的不连贯句子与专家重写的句子放在一起。此外,我们对模型进行了微调,以自动检测和重写学生论文中的不连贯之处。我们发现,在微调期间纳入不连贯的特定原因始终可以提高重写的质量,从而获得在自动和人工评估中都有利的结果。

[NLP-41] Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
[NLP-41] 设计和评估用于人机通信的多聊天机器人界面:说服任务的初步发现

链接: https://arxiv.org/abs/2406.19648
作者: Sion Yoon,Tae Eun Kim,Yoo Jung Oh
关键词: dynamics of human-AI, human-AI communication, communication, multiple language model, language model chatbots
中文关键词: 人类-人工智能动态、人类-人工智能通信、通信、多语言模型、语言模型聊天机器人
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction between humans and multiple chatbots. This study examines the impact of multi-chatbot communication in a specific persuasion setting: promoting charitable donations. We developed an online environment that enables multi-chatbot communication and conducted a pilot experiment utilizing two GPT-based chatbots, Save the Children and UNICEF chatbots, to promote charitable donations. In this study, we present our development process of the multi-chatbot interface and present preliminary findings from a pilot experiment. Analysis of qualitative and quantitative feedback are presented, and limitations are addressed.
摘要:ChatGPT等语言模型重塑了人类与人工智能交流的动态。然而,现有的研究主要集中在二元交流上,对于群体环境中人类-人工智能交流的动态,还有很多有待探索的地方。多语言模型聊天机器人的出现为学者们更好地了解人类与多个聊天机器人之间的互动提供了一个独特的机会。这项研究考察了多个聊天机器人在一个特定的说服环境中的影响:促进慈善捐赠。我们开发了一个支持多聊天机器人交流的在线环境,并利用两个基于GPT的聊天机器人–救助儿童会和联合国儿童基金会聊天机器人–进行了试点实验,以促进慈善捐款。在这项研究中,我们介绍了我们的多聊天机器人界面的开发过程,并给出了初步的初步实验结果。对定性反馈和定量反馈进行了分析,并指出了局限性。

[NLP-42] Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
[NLP-42] 解锁各种观点:基于人物的多代理框架,具有辩论驱动的文本规划,用于论点生成

链接: https://arxiv.org/abs/2406.19643
作者: Zhe Hu,Hou Pong Chan,Jing Li,Yu Yin
关键词: challenging task, argument writing, Writing, high-level beliefs, persuasive arguments
中文关键词: 具有挑战性的任务、论点写作、写作、高层信念、有说服力的论点
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by the human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.
摘要:撰写有说服力的论据对人类和机器来说都是一项具有挑战性的任务。它需要结合不同角度对该主题的高级信念,以及深思熟虑的推理和计划,以构建一个连贯的叙事。目前的语言模型往往自回归地生成表层标记,缺乏对这些基本控制的明确整合,导致输出多样性和连贯性有限。在这项工作中,我们提出了一个基于人物角色的多主体论据写作框架。受人类辩论的启发,我们首先从一个独特的角度为每个智能体分配一个代表其高层信念的角色,然后设计一个智能体交互过程,使智能体可以协作地辩论和讨论这个想法,形成论点写作的总体计划。这样的辩论过程使思想的流动和非线性发展成为可能。我们评估我们的议论文写作框架。结果表明,我们的框架可以通过自动和人工评估生成更多样化和更有说服力的论点。
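
以下给出一个极简的示意草图,演示"给每个智能体分配人设、多轮辩论后汇总为写作计划"的交互骨架。`call_llm` 为假设的模型接口,人设与提示词仅为举例,并非论文官方实现。

```python
# 假设性示意代码:基于人设的多智能体辩论式文本规划骨架
from typing import Callable

def debate_plan(topic: str, personas: list[str],
                call_llm: Callable[[str], str], rounds: int = 2) -> str:
    transcript: list[str] = []
    for _ in range(rounds):
        for persona in personas:
            prompt = (f"You are {persona}. Topic: {topic}\n"
                      "Discussion so far:\n" + "\n".join(transcript) +
                      "\nGive your argument or rebuttal in 2-3 sentences.")
            transcript.append(f"[{persona}] " + call_llm(prompt))
    summary_prompt = ("Summarize the debate below into an outline (plan) "
                      "for a persuasive essay:\n" + "\n".join(transcript))
    return call_llm(summary_prompt)

# 用法示例:personas 可以是"经济学家""环保主义者"等代表不同立场与信念的人设
```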

[NLP-43] IDT: Dual-Task Adversarial Attacks for Privacy Protection
[NLP-43] IDT:隐私保护的双重任务对抗攻击

链接: https://arxiv.org/abs/2406.19642
作者: Pedro Faustini,Shakila Mahjabin Tonni,Annabelle McIver,Qiongkai Xu,Mark Dras
关键词: Natural language processing, including membership inference, leak private information, Natural language, membership inference
中文关键词: 自然语言处理,包括成员资格推断、泄露私人信息、自然语言、成员资格推断
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 28 pages, 1 figure

点击查看摘要

Abstract:Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or – for instance, in cases where users might not trust a model, the sort of scenario of interest here – changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g. the gender of the author, or their location by the writing style) whilst keeping the text useful for its original intention (e.g. the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create extensively different texts from the original ones or face problems such as mode collapse. This paper explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t one task (privacy) whilst keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate different datasets for NLP suitable for different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t privacy task.
摘要:自然语言处理(NLP)模型可能会以不同的方式泄露隐私信息,包括成员关系推理、重构或属性推理攻击。敏感信息可能在文本中不是显性的,而是隐藏在潜在的写作特征中。保护隐私的方法可以涉及在模型中使用表示法,这些表示法被演示为不检测敏感属性,或者–例如,在用户可能不信任模型的情况下,这里涉及的场景–在模型可以访问它之前更改原始文本。目标是重写文本,以防止有人推断敏感属性(例如,作者的性别或他们的写作风格),同时保持文本对其原始意图(例如,产品评论的情绪)的有用。解决这一问题的少数作品都集中在生成技术上。然而,这些通常会产生与原始文本截然不同的文本,或者面临模式崩溃等问题。本文探索了一种新颖的对抗性攻击技术,以操纵文本来欺骗一个任务(隐私)的分类器,同时保持为另一个任务(效用)训练的另一个分类器的预测不变。我们提出了IDT,这是一种分析辅助模型和可解释模型所做预测的方法,以确定哪些令牌对于隐私任务是重要的,哪些应该为实用任务保留。我们评估了适合不同任务的自然语言处理的不同数据集。自动和人工评估表明,IDT保留了文本的效用,同时在欺骗分类器w.r.t隐私任务时也优于现有方法。
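
以下是一个示意性草图,演示"用两个分类器的预测来判断哪些词对隐私任务重要、哪些应为效用任务保留"的一种可能实现:逐词遮蔽并比较两个分类器的置信度变化。两个打分函数均为假设接口,具体判别方法以论文为准。

```python
# 假设性示意代码:基于遮蔽的重要度分析,区分"该改写"与"该保留"的词
from typing import Callable

def token_roles(tokens: list[str],
                privacy_conf: Callable[[str], float],   # 假设:隐私分类器对敏感标签的置信度
                utility_conf: Callable[[str], float],   # 假设:效用分类器对正确标签的置信度
                tau: float = 0.05) -> dict[str, list[str]]:
    base_text = " ".join(tokens)
    base_priv, base_util = privacy_conf(base_text), utility_conf(base_text)
    to_change, to_keep = [], []
    for i, tok in enumerate(tokens):
        masked = " ".join(tokens[:i] + ["[MASK]"] + tokens[i + 1:])
        drop_priv = base_priv - privacy_conf(masked)   # 遮蔽后隐私分类器置信度的下降量
        drop_util = base_util - utility_conf(masked)   # 遮蔽后效用分类器置信度的下降量
        if drop_priv > tau and drop_util <= tau:
            to_change.append(tok)        # 对隐私任务重要、对效用影响小 -> 适合改写
        else:
            to_keep.append(tok)          # 其余词保留,以保护下游效用
    return {"change": to_change, "keep": to_keep}
```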

[NLP-44] Mixture of In-Context Experts Enhance LLMs Long Context Awareness
[NLP-44] 背景专家的混合增强了LLM的长期背景意识

链接: https://arxiv.org/abs/2406.19598
作者: Hongzhan Lin,Ang Lv,Yuhan Chen,Chen Zhu,Yang Song,Hengshu Zhu,Rui Yan
关键词: large language models, exhibit uneven awareness, positions.Their limited context, limited context awareness, subsequent task failures
中文关键词: 大型语言模型,表现出不平衡的意识和立场。它们的上下文有限,上下文意识有限,随后的任务失败
类目: Computation and Language (cs.CL)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Many studies have revealed that large language models (LLMs) exhibit uneven awareness of different contextual positions. Their limited context awareness can lead to overlooking critical information and subsequent task failures. While several approaches have been proposed to enhance LLMs’ context awareness, achieving both effectiveness and efficiency remains challenging. In this paper, for LLMs utilizing RoPE as position embeddings, we introduce a novel method called “Mixture of In-Context Experts” (MoICE) to address this challenge. MoICE comprises two key components: a router integrated into each attention head within LLMs and a lightweight router-only training optimization strategy: (1) MoICE views each RoPE angle as an ‘in-context’ expert, demonstrated to be capable of directing the attention of a head to specific contextual positions. Consequently, each attention head flexibly processes tokens using multiple RoPE angles dynamically selected by the router to attend to the needed positions. This approach mitigates the risk of overlooking essential contextual information. (2) The router-only training strategy entails freezing LLM parameters and exclusively updating routers for only a few steps. When applied to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods across multiple tasks on long context understanding and generation, all while maintaining commendable inference efficiency.
摘要:许多研究表明,大型语言模型对不同的语境位置的感知参差不齐,它们有限的语境感知可能导致忽视关键信息和随后的任务失败。虽然已经提出了几种方法来增强LLM的上下文感知能力,但同时实现有效性和效率仍然颇具挑战。本文针对使用RoPE作为位置嵌入的LLM,引入了一种新的方法"上下文内专家混合"(MoICE)来解决这一挑战。MoICE由两个关键组件组成:集成到LLM内每个注意力头部的路由器和仅针对路由器的轻量级训练优化策略:(1)MoICE将每个RoPE角度视为"上下文内"的专家,被证明能够将头部的注意力引导到特定的上下文位置。因此,每个注意力头使用路由器动态选择的多个RoPE角度来灵活地处理令牌,以关注所需的位置。这种方法降低了忽略重要上下文信息的风险。(2)仅路由器训练策略需要冻结LLM参数并仅在几个步骤内独占地更新路由器。当应用于包括Llama和Mistral在内的开源LLM时,MoICE在长上下文理解和生成方面超过了现有的跨多个任务的方法,同时保持了值得称赞的推理效率。

[NLP-45] SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
[NLP-45] SK-VQA:以规模生成合成知识,用于训练上下文增强多模式LLM

链接: https://arxiv.org/abs/2406.19593
作者: Xin Su,Man Luo,Kris W Pan,Tien Pei Chou,Vasudev Lal,Phillip Howard
关键词: gained significant attention, significant attention recently, gained significant, significant attention, attention recently
中文关键词: 最近引起了极大的关注,最近引起了极大的关注,最近引起了极大的关注
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever is used to gather relevant information that is then subsequently provided to a generative model via context augmentation. To address this challenging problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs which require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously-proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.
摘要:合成数据生成因其在训练大视觉和语言模型方面的应用而受到广泛关注。然而,将合成数据应用于多通道上下文增强生成系统的训练还相对未被探索。现有工作中的这一差距很重要,因为现有的视觉和语言模型(VLM)没有专门针对上下文增强的生成进行培训。因此,用于适配这些模型的资源对于能够在检索增强生成(RAG)设置中使用它们至关重要,在RAG设置中,检索器被用来收集相关信息,然后经由上下文增强将这些信息提供给生成模型。为了解决这个具有挑战性的问题,我们生成了SK-VQA:一个包含200多万个问答对的大型合成多模式数据集,这些问答对需要外部知识来确定最终答案。我们的数据集比现有的同类资源更大,也更多样化,拥有的独特问题比以前提议的数据集多11倍,包含的图像来源也更多。通过大量的实验,我们证明了我们的合成数据集不仅可以作为一个具有挑战性的基准,而且对于适应现有的生成性多通道模型以进行上下文增强生成也是非常有效的。

[NLP-46] PathAlign: A vision-language model for whole slide images in histopathology
[NLP-46] PathAlign:组织病理学中整个幻灯片图像的视觉语言模型

链接: https://arxiv.org/abs/2406.19578
作者: Faruk Ahmed,Andrew Sellergren,Lin Yang,Shawn Xu,Boris Babenko,Abbi Ward,Niels Olson,Arash Mohtashamian,Yossi Matias,Greg S. Corrado,Quang Duong,Dale R. Webster,Shravya Shetty,Daniel Golden,Yun Liu,David F. Steiner,Ellery Wulczyn
关键词: histopathology images underlies, Microscopic interpretation, treatment decisions, underlies many important, Microscopic
中文关键词: 组织病理学图像的基础,显微镜解释,治疗决策,许多重要的基础,显微镜
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures

点击查看摘要

Abstract:Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
摘要:组织病理学图像的显微解释是许多重要的诊断和治疗决策的基础。虽然视觉语言建模的进步为分析这类图像提供了新的机会,但整个幻灯片图像(WSIS)的千兆像素大小带来了独特的挑战。此外,病理报告同时突出小区域的关键发现,同时还聚合了多张幻灯片的解释,这往往使创建稳健的图文对变得困难。因此,病理报告在很大程度上仍然是计算病理学中尚未开发的监督来源,大多数努力依赖于感兴趣区域注释或补丁级别的自我监督。在这项工作中,我们开发了一个基于BLIP-2框架的视觉语言模型,使用WSIS与来自病理报告的精选文本配对。这使得应用程序能够利用共享的图像-文本嵌入空间,例如用于查找感兴趣案例的文本或图像检索,以及将WSI编码器与冻结的大型语言模型(LLM)集成,以实现基于WSI的生成文本功能,例如报告生成或AI in-the-loop交互。我们利用了一个超过350,000个WSIS和诊断文本对的去识别数据集,跨越了广泛的诊断、手术类型和组织类型。我们介绍了病理学家对使用WSI嵌入的文本生成和文本检索的评估,以及WSI分类和工作流优先顺序(幻灯片级别分类)的结果。WSIS的模型生成文本被病理学家评为准确,没有临床上显著的错误或遗漏,平均78%的WSIS。这项工作展示了与语言一致的WSI嵌入的令人兴奋的潜在功能。

[NLP-47] Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
[NLP-47] 闻所未闻的声音:Yorùbá地区方言的NLP资源和模型

链接: https://arxiv.org/abs/2406.19564
作者: Orevaoghene Ahia,Anuoluwapo Aremu,Diana Abagyan,Hila Gonen,David Ifeoluwa Adelani,Daud Abolade,Noah A. Smith,Yulia Tsvetkov
关键词: million speakers encompasses, African languages, encompasses a continuum, African, dialects
中文关键词: 数百万使用者涵盖非洲语言、非洲方言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YORÙLECT across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.
摘要:尤鲁巴语是一种非洲语言,大约有4700万人说,它包含了几种方言。最近为非洲语言开发自然语言处理技术的努力侧重于其标准方言,导致方言和变种之间存在差异,几乎没有资源或工具。我们采取步骤弥合这一差距,引入了一个新的高质量平行文本和语音语料库YOR HighLect,涵盖三个领域和四个地区Yorúbá方言。为了开发这个语料库,我们聘请了以英语为母语的人,前往使用这些方言的社区,收集文本和语音数据。使用我们新创建的语料库,我们在(文本)机器翻译、自动语音识别和语音到文本翻译方面进行了广泛的实验。我们的结果显示,在所有任务中,标准Yorúbá和其他方言之间的性能差异很大。然而,我们也表明,通过方言自适应微调,我们能够缩小这一差距。我们相信,我们的数据集和实验分析将通过提高我们对现有挑战的理解并为进一步发展提供高质量的数据集,为开发适用于约尔巴省及其方言的自然语言处理工具做出很大贡献,并可能为其他非洲语言开发自然语言处理工具。我们在开放许可下公开发布YOR精选数据集和模型。

[NLP-48] Rethinking harmless refusals when fine-tuning foundation models
[NLP-48] 微调基金会模型时重新思考无害的拒绝

链接: https://arxiv.org/abs/2406.19552
作者: Florin Pop,Judd Rosenblatt,Diogo Schwerz de Lucena,Michael Vaiana
关键词: Large Language Models, Large Language, effectively mitigates versus, conceals undesirable behavior, Language Models
中文关键词: 大型语言模型,大型语言,有效地缓解、隐藏不良行为,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2024 AGI Workshop Poster

点击查看摘要

Abstract:In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term reason-based deception, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
摘要:在这篇文章中,我们调查了在大语言模型(LLM)中的微调在多大程度上有效地缓解了而不是仅仅隐藏了不良行为。通过旨在诱导此类行为的半现实角色扮演练习的镜头,我们探索了微调干预后LLMS的反应动力学。我们的方法论包括促进思想链(COT)推理的模型,并分析推理痕迹和结果输出之间的一致性。值得注意的是,我们发现了一种普遍存在的现象,我们称之为基于理性的欺骗,其中模型要么停止产生推理痕迹,要么产生看似道德的推理痕迹,掩盖了其最终输出的不道德性质。我们进一步考察了应对策略(礼貌拒绝与明确反驳)在多轮互动的后续输出中抑制不良行为发生的有效性。我们的研究结果表明,显性反驳在防止不想要的输出的继续方面明显优于礼貌拒绝,并几乎消除了基于原因的欺骗,挑战了当前模型微调的实践。因此,本文的两个关键贡献是:(1)定义和研究了基于理性的欺骗,这是一种新型的隐藏行为;(2)证明了反驳提供了一个比拒绝更健壮的响应模型,从而强调了在微调方法中重新考虑响应策略的必要性。

[NLP-49] Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations
[NLP-49] 利用机器生成的推理依据促进对话中的社会意义检测

链接: https://arxiv.org/abs/2406.19545
作者: Ritam Dutt,Zhen Wu,Kelly Shi,Divyanshu Sheth,Prakhar Gupta,Carolyn Penstein Rose
关键词: Large Language Models, leverages Large Language, Language Models, Large Language, implicitly encoded social
中文关键词: 大型语言模型,利用大型语言,语言模型,大型语言,隐式编码社交
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear at The Proceedings of the Association for Computational Linguistics, 2024

点击查看摘要

Abstract:We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.
摘要:我们提出了一种可推广的分类方法,该方法利用大型语言模型(LLM)来促进检测对话中隐式编码的社会意义。我们设计了一个多方面的提示来提取推理的文本解释,将可见线索与潜在的社会意义联系起来。这些提取的解释或理由作为对话文本的补充,以促进对话理解和转移。我们在2,340个实验环境中的经验结果证明了添加这些理由的显着积极影响。我们的研究结果适用于两个不同的社会意义检测任务(每个任务跨越两个不同的语料库)的领域内分类、零镜头和少镜头领域转移。
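
以下是一个示意草图,展示"先让 LLM 生成连接表层线索与社会意义的解释,再把解释拼接到对话文本后送入分类器"的数据增强形态。`call_llm` 为假设的模型接口,提示词仅作示例,并非论文中的多面提示原文。

```python
# 假设性示意代码:为对话生成"解释/理由",并把它拼接到原文后用于分类
from typing import Callable

RATIONALE_PROMPT = (
    "Conversation:\n{dialogue}\n\n"
    "Explain in one or two sentences which visible cues in this conversation "
    "signal the underlying social meaning (e.g., politeness, power, resistance)."
)

def augment_with_rationale(dialogue: str, call_llm: Callable[[str], str]) -> str:
    rationale = call_llm(RATIONALE_PROMPT.format(dialogue=dialogue))
    # 拼接后的文本可直接喂给任意文本分类器(域内、零样本或少样本迁移)
    return dialogue + "\n[RATIONALE] " + rationale
```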

[NLP-50] Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech Detoxification and Message Management
[NLP-50] Demarked:通过反制言论净化与消息管理增强辱骂言论治理的策略

链接: https://arxiv.org/abs/2406.19543
作者: Seid Muhie Yimam,Daryna Dementieva,Tim Fischer,Daniil Moskovskiy,Naquee Rizwan,Punyajoy Saha,Sarthak Roy,Martin Semmann,Alexander Panchenko,Chris Biemann,Animesh Mukherjee
关键词: targeting digital violence, social media platforms, abusive content persists, regulations targeting digital, digital violence
中文关键词: 针对数字暴力、社交媒体平台、辱骂内容持续存在、针对数字、数字暴力的法规
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Despite regulations imposed by nations and social media platforms, such as recent EU regulations targeting digital violence, abusive content persists as a significant challenge. Existing approaches primarily rely on binary solutions, such as outright blocking or banning, yet fail to address the complex nature of abusive speech. In this work, we propose a more comprehensive approach, called Demarcation, that scores abusive speech on four aspects – (i) severity scale; (ii) presence of a target; (iii) context scale; (iv) legal scale – and suggests a wider range of actions, such as detoxification, counter speech generation, blocking, or, as a final measure, human intervention. Through a thorough analysis of abusive speech regulations across diverse jurisdictions, platforms, and research papers, we highlight the gap in preventive measures and advocate for tailored proactive steps to combat its multifaceted manifestations. Our work aims to inform future strategies for effectively addressing abusive speech online.
摘要:尽管各国和社交媒体平台实施了法规,例如最近针对数字暴力的欧盟法规,但辱骂内容仍然是一个重大挑战。现有的方法主要依赖于二元解决方案,如直接阻止或禁止,但无法解决辱骂言论的复杂性质。在这项工作中,我们提出了一种更全面的方法,称为区分辱骂言语,基于四个方面–(I)严重程度衡量;(Ii)目标的存在;(Iii)语境衡量;(Iv)法律衡量–并建议更多的行动选择,如戒毒、反言语生成、阻止,或者作为最后的措施,人为干预。通过对跨不同司法管辖区、平台和研究论文的辱骂言论监管的透彻分析,我们强调了预防措施方面的差距,并倡导采取有针对性的积极步骤来打击其多方面的表现。我们的工作旨在为未来有效解决在线辱骂言论的策略提供信息。
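
以下用一个小示例说明摘要中"四个维度打分 + 多种处置动作"的决策形态。各维度的取值范围、阈值和动作映射均为博主假设,仅作示意。

```python
# 假设性示意代码:四维度打分 -> 处置动作的示意映射(阈值与规则均为假设)
from dataclasses import dataclass

@dataclass
class DemarcationScore:
    severity: int      # 严重程度(假设 0-3)
    has_target: bool   # 是否指向具体目标
    context: int       # 语境维度(假设 0-3,越高表示越公开、影响面越大)
    legal: int         # 法律维度(假设 0-3,越高越可能违法)

def decide_action(s: DemarcationScore) -> str:
    if s.legal >= 2:
        return "human intervention"           # 可能违法:转人工处理
    if s.severity >= 2 and s.has_target:
        return "blocking"                     # 严重且有明确目标:屏蔽
    if s.severity >= 1:
        return "counter speech generation"    # 中等:生成反制言论
    return "detoxification"                   # 轻微:仅做去毒改写

print(decide_action(DemarcationScore(severity=2, has_target=True, context=1, legal=0)))
```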

[NLP-51] Context Matters: An Empirical Study of the Impact of Contextual Information in Temporal Question Answering Systems
[NLP-51] 上下文很重要:上下文信息对时间问题回答系统影响的实证研究

链接: https://arxiv.org/abs/2406.19538
作者: Dan Schumacher,Fatemeh Haji,Tara Grey,Niharika Bandlamudi,Nupoor Karnik,Gagana Uday Kumar,Jason Cho-Yu Chiang,Paul Rad,Nishant Vishwamitra,Anthony Rios
关键词: Large language models, historical event analysis, time-sensitive information retrieval, Large language, crucial for tasks
中文关键词: 大型语言模型、历史事件分析、时间敏感信息检索、大型语言,对任务至关重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often struggle with temporal reasoning, crucial for tasks like historical event analysis and time-sensitive information retrieval. Despite advancements, state-of-the-art models falter in handling temporal information, especially when faced with irrelevant or noisy contexts. This paper addresses this gap by empirically examining the robustness of temporal question-answering (TQA) systems trained on various context types, including relevant, irrelevant, slightly altered, and no context. Our findings indicate that training with a mix of these contexts enhances model robustness and accuracy. Additionally, we show that the position of context relative to the question significantly impacts performance, with question-first positioning yielding better results. We introduce two new context-rich TQA datasets, ContextAQA and ContextTQE, and provide comprehensive evaluations and guidelines for training robust TQA models. Our work lays the foundation for developing reliable and context-aware temporal QA systems, with broader implications for enhancing LLM robustness against diverse and potentially adversarial information.
摘要:大型语言模型(LLM)常常难以进行时序推理,而时序推理对于历史事件分析和时间敏感信息检索等任务至关重要。尽管取得了进步,但最先进的模型在处理时间信息方面步履蹒跚,特别是在面对无关或嘈杂的环境时。本文通过经验检验时间问答系统在各种语境类型上的稳健性来解决这一差距,这些语境类型包括相关的、不相关的、略有改变的和没有语境的。我们的发现表明,在这些背景下混合进行训练可以提高模型的稳健性和准确性。此外,我们还表明,上下文相对于问题的位置显著影响性能,问题优先定位会产生更好的结果。我们引入了两个新的上下文丰富的TQA数据集,ConextAQA和ConextTQE,并为训练健壮的TQA模型提供了全面的评估和指导。我们的工作为开发可靠和上下文感知的时态QA系统奠定了基础,对于增强LLM对多样化和潜在敌意信息的稳健性具有更广泛的意义。

[NLP-52] Handling Ontology Gaps in Semantic Parsing
[NLP-52] 语义解析中的本体空白处理

链接: https://arxiv.org/abs/2406.19537
作者: Andrea Bacciu,Marco Damonte,Marco Basaldella,Emilio Monti
关键词: Neural Semantic Parsing, Semantic Parsing, Neural Semantic, majority of Neural, target symbols
中文关键词: 神经语义解析,语义解析,神经语义,大多数神经,目标符号
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The majority of Neural Semantic Parsing (NSP) models are developed with the assumption that there are no concepts outside the ones such models can represent with their target symbols (closed-world assumption). This assumption leads models to generate hallucinated outputs rather than admit their lack of knowledge. Hallucinations can lead to wrong or potentially offensive responses to users. Hence, a mechanism to prevent this behavior is crucial to build trusted NSP-based Question Answering agents. To that end, we propose the Hallucination Simulation Framework (HSF), a general setting for stimulating and analyzing NSP model hallucinations. The framework can be applied to any NSP task with a closed-ontology. Using the proposed framework and KQA Pro as the benchmark dataset, we assess state-of-the-art techniques for hallucination detection. We then present a novel hallucination detection strategy that exploits the computational graph of the NSP model to detect NSP hallucinations in the presence of ontology gaps and out-of-domain utterances, and to recognize NSP errors, improving the F1-Score by ~21%, ~24%, and ~1%, respectively. This is the first work in closed-ontology NSP that addresses the problem of recognizing ontology gaps. We release our code and checkpoints at this https URL.
摘要:大多数神经语义分析(NSP)模型都是基于这样一个假设:除了这些模型可以用目标符号表示的概念之外,没有其他概念(封闭世界假设)。这种假设导致产生幻觉的产出,而不是承认他们缺乏知识。幻觉可能会导致对用户的错误或潜在的冒犯反应。因此,一种防止这种行为的机制对于构建基于可信NSP的问答代理至关重要。为此,我们提出了幻觉模拟框架(HSF),这是一个用于刺激和分析NSP模型幻觉的通用环境。该框架可以应用于任何具有封闭本体的NSP任务。使用提出的框架和KQA Pro作为基准数据集,我们评估了最新的幻觉检测技术。然后,我们提出了一种新的幻觉检测策略,该策略利用NSP模型的计算图来检测存在本体空白、域外话语的NSP幻觉,并识别NSP错误,使F1-Score分别提高~21%、~24%和~1%。这是封闭本体NSP中第一项解决本体差距识别问题的工作。我们在这个HTTPS URL上发布代码和检查点。

[NLP-53] TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
[NLP-53] TocBERT:使用双向Transformer提取医疗文档结构

链接: https://arxiv.org/abs/2406.19526
作者: Majd Saleh,Sarra Baghdadi,Stéphane Paquelet
关键词: Natural Language Processing, Language Processing, Natural Language, holds paramount importance, field of Natural
中文关键词: 自然语言处理,语言处理,自然语言,具有至关重要的意义,自然领域
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
摘要:文本分割在自然语言处理(NLP)领域占有重要地位,在信息检索和文档摘要等多个NLP下游任务中发挥着重要作用。在这项工作中,我们提出了一种新的解决方案TocBERT,利用双向转换器进行文本分割。TocBERT是一种有监督的方案,通过标题和子标题的语义表示来训练检测它们,该任务被形式化为命名实体识别(NER)问题。该方案已应用于医学文本分割用例:对Bio-ClinicalBERT模型进行微调,用于切分MIMIC-III数据集中的出院摘要。TocBERT的性能在一个由250份人工标注笔记构成的真实标注语料库上进行了评估,在线性文本分割问题上取得了84.6%的F1得分,在层次文本分割问题上取得了72.8%。它的表现优于精心设计的基于规则的方案,特别是在区分标题和子标题方面。
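
下面给出一个极简的Python示意,说明“把标题/子标题检测形式化为NER(BIO标注)问题”这一思路;其中的分词方式、标签名称(TITLE/SUBTITLE)和示例文本均为假设,并非论文TocBERT的实际实现。

```python
def to_bio_tags(tokens, spans):
    """将标题/子标题的词级区间转换为BIO标签(简化示意)。
    spans: [(start, end, label), ...],start/end 为闭区间的词索引。"""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags

# 假设的出院摘要片段:前两个词是标题,第4~7个词是子标题
tokens = ["Discharge", "Summary", ":", "History", "of", "Present", "Illness",
          "The", "patient", "was", "admitted", "..."]
spans = [(0, 1, "TITLE"), (3, 6, "SUBTITLE")]
print(list(zip(tokens, to_bio_tags(tokens, spans))))
```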

[NLP-54] Captioning Visualizations with Large Language Models (CVLLM): A Tutorial
[NLP-54] 使用大型语言模型(CVLLM)为可视化添加字幕:教程

链接: https://arxiv.org/abs/2406.19512
作者: Giuseppe Carenini,Jordon Johnson,Ali Salamatian
关键词: Automatically captioning visualizations, large language models, Automatically captioning, open exciting, exciting new possibilities
中文关键词: 自动字幕可视化、大型语言模型、自动字幕、打开令人兴奋、令人兴奋的新可能性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Automatically captioning visualizations is not new, but recent advances in large language models(LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.
摘要:自动字幕可视化并不新鲜,但大型语言模型(LLM)的最新进展开辟了令人兴奋的新可能性。在本教程中,在简要回顾了信息可视化(InfoVis)原则和过去在字幕方面的工作后,我们介绍了神经模型和通用LLM中使用的Transformer架构。然后我们讨论它们最近在InfoVis中的应用,重点关注字幕。此外,我们还探索该领域有前途的未来方向。

[NLP-55] Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT
[NLP-55] 生成语言模型是多元文化的吗?使用ChatGPT研究豪萨文化和情感

链接: https://arxiv.org/abs/2406.19504
作者: Ibrahim Said Ahmad,Shiran Dudy,Resmi Ramachandranpillai,Kenneth Church
关键词: Large Language Models, Large Language, purposes and audiences, generate content, Language Models
中文关键词: 大型语言模型、大型语言、目的和受众、生成内容、语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa’s culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis and applied two similarity metrics to measure the alignment between human and ChatGPT responses. We also collected human participants ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of the Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.
摘要:大型语言模型(LLM),如ChatGPT,被广泛用于为各种目的和受众生成内容。然而,这些模式可能没有反映其用户的文化和情感多样性,特别是对于资源较少的语言。在这篇文章中,我们调查了ChatGPT如何代表豪萨的文化和情感。我们比较了ChatGPT和以豪萨语为母语的人对37个与文化相关的问题的回答。我们使用情感分析进行了实验,并应用了两个相似性度量来衡量人和ChatGPT响应之间的比对。我们还收集了人类参与者对ChatGPT回复的评分和反馈。我们的结果表明,ChatGPT在一定程度上与人类的反应相似,但在对豪萨族文化和情感的知识和意识方面也表现出一些差距和偏见。我们讨论了我们的方法和分析的含义和局限性,并提出了改进低资源语言的LLMS性能和评估的方法。

[NLP-56] Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
[NLP-56] 调查大型语言模型如何利用内部知识来执行复杂推理

链接: https://arxiv.org/abs/2406.19502
作者: Miyoung Ko,Sue Hyun Park,Joonsuk Park,Minjoon Seo
关键词: significant advancements, large language models, large language, knowledge, questions
中文关键词: 重大进步、大型语言模型、大型语言、知识、问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite significant advancements, there is a limited understanding of how large language models (LLMs) utilize knowledge for reasoning. To address this, we propose a method that deconstructs complex real-world questions into a graph, representing each question as a node with parent nodes of background knowledge needed to solve the question. We develop the DepthQA dataset, deconstructing questions into three depths: (i) recalling conceptual knowledge, (ii) applying procedural knowledge, and (iii) analyzing strategic knowledge. Based on a hierarchical graph, we quantify forward discrepancy, discrepancies in LLMs’ performance on simpler sub-problems versus complex questions. We also measure backward discrepancy, where LLMs answer complex questions but struggle with simpler ones. Our analysis shows that smaller models have more discrepancies than larger models. Additionally, guiding models from simpler to complex questions through multi-turn interactions improves performance across model sizes, highlighting the importance of structured intermediate steps in knowledge reasoning. This work enhances our understanding of LLM reasoning and suggests ways to improve their problem-solving abilities.
摘要:尽管已取得显著进展,人们对大型语言模型(LLM)如何利用知识进行推理仍了解有限。为此,我们提出一种方法,将复杂的现实世界问题解构为一个图,把每个问题表示为一个节点,其父节点为解决该问题所需的背景知识。我们构建了DepthQA数据集,将问题分解为三个深度:(i)回忆概念性知识,(ii)运用程序性知识,(iii)分析策略性知识。基于层次图,我们量化了前向差异,即LLM在较简单子问题与复杂问题上的表现差异;我们还度量了后向差异,即LLM能回答复杂问题却难以解决较简单的问题。我们的分析表明,较小的模型比较大的模型表现出更多差异。此外,通过多轮交互引导模型从简单问题过渡到复杂问题,可以提升各种规模模型的性能,凸显了结构化中间步骤在知识推理中的重要性。这项工作加深了我们对LLM推理的理解,并为提升其解决问题的能力提供了思路。
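
下面用一小段Python示意“前向/后向差异”的一种简化计算方式:前向差异指子问题全部答对但复杂问题答错,后向差异指复杂问题答对但存在子问题答错。这只是按摘要含义给出的假设性定义,具体指标请以论文为准。

```python
def discrepancies(graph, correct):
    """graph: {复杂问题: [前置子问题, ...]};correct: {问题: 是否答对}。
    返回 (前向差异比例, 后向差异比例) 的简化估计。"""
    fwd = bwd = 0
    for q, subs in graph.items():
        subs_all_correct = all(correct[s] for s in subs)
        if subs_all_correct and not correct[q]:
            fwd += 1          # 会做简单子问题,却答错复杂问题
        if correct[q] and not subs_all_correct:
            bwd += 1          # 答对复杂问题,却答错某些子问题
    n = len(graph)
    return fwd / n, bwd / n

graph = {"Q1": ["q1a", "q1b"], "Q2": ["q2a"]}
correct = {"Q1": False, "q1a": True, "q1b": True, "Q2": True, "q2a": False}
print(discrepancies(graph, correct))  # (0.5, 0.5)
```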

[NLP-57] Monitoring Latent World States in Language Models with Propositional Probes
[NLP-57] 用命题探针监控语言模型中的潜在世界状态

链接: https://arxiv.org/abs/2406.19501
作者: Jiahai Feng,Stuart Russell,Jacob Steinhardt
关键词: Language models, Language, tendencies that lead, input context, Greg
中文关键词: 语言模型,语言,导致的倾向,输入上下文,格雷格
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with ‘propositional probes’, which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context ‘‘Greg is a nurse. Laura is a physicist.’’, we decode the propositions ‘‘WorksAs(Greg, nurse)’’ and ‘‘WorksAs(Laura, physicist)’’ from the model’s activations. Key to this is identifying a ‘binding subspace’ in which bound tokens have high similarity (‘‘Greg’’ and ‘‘nurse’’) but unbound ones do not (‘‘Greg’’ and ‘‘physicist’’). We validate propositional probes in a closed-world setting with finitely many predicates and properties. Despite being trained on simple templated contexts, propositional probes generalize to contexts rewritten as short stories and translated to Spanish. Moreover, we find that in three settings where language models respond unfaithfully to the input context – prompt injections, backdoor attacks, and gender bias – the decoded propositions remain faithful. This suggests that language models often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.
摘要:语言模型容易受到偏见、阿谀迎合、后门及其他倾向的影响,从而对输入语境做出不忠实的回应。解读语言模型的内部状态有助于监控并纠正这种不忠实行为。我们假设语言模型以一个潜在世界模型来表示其输入语境,并尝试从激活中提取这种潜在世界状态。为此我们使用“命题探针”:它按成分探测各词元的词汇信息,并将它们绑定成表示世界状态的逻辑命题。例如,给定输入语境“Greg是一名护士。Laura是一名物理学家。”,我们可以从模型激活中解码出命题WorksAs(Greg, nurse)和WorksAs(Laura, physicist)。其关键在于找到一个“绑定子空间”,在该子空间中相互绑定的词元具有很高的相似度(如Greg与nurse),而未绑定的词元则没有(如Greg与physicist)。我们在谓词和属性数量有限的封闭世界设定中验证了命题探针。尽管只在简单的模板化语境上训练,命题探针仍能泛化到被改写成短篇故事或翻译成西班牙语的语境。此外,我们发现在语言模型对输入语境做出不忠实回应的三种情形(提示注入、后门攻击和性别偏见)中,解码出的命题仍保持忠实。这表明语言模型往往编码了一个忠实的世界模型却进行了不忠实的解码,这促使我们寻找更好的可解释性工具来监控语言模型。
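
下面是一个极简的PyTorch示意,演示“在绑定子空间中按相似度把属性词元绑定到实体词元、再拼成命题”的计算形式;其中的投影矩阵P和词元激活都是随机占位数据(论文中的子空间与激活需从真实模型中学习和提取),仅用于说明流程,并非论文的实际实现。

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, subspace_dim = 64, 8
P = torch.randn(hidden_dim, subspace_dim)          # 假设的“绑定子空间”投影矩阵
acts = {name: torch.randn(hidden_dim)              # 假设的词元激活
        for name in ["Greg", "Laura", "nurse", "physicist"]}

def binding_score(a: str, b: str) -> float:
    # 在绑定子空间中比较两个词元的余弦相似度
    return F.cosine_similarity(acts[a] @ P, acts[b] @ P, dim=0).item()

# 把每个属性绑定到子空间相似度最高的实体,得到 WorksAs(entity, attribute) 形式的命题
for attr in ["nurse", "physicist"]:
    entity = max(["Greg", "Laura"], key=lambda e: binding_score(e, attr))
    print(f"WorksAs({entity}, {attr})")
```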

[NLP-58] Knowledge acquisition for dialogue agents using reinforcement learning on graph representations
[NLP-58] 使用图表示上的强化学习进行对话代理的知识获取

链接: https://arxiv.org/abs/2406.19500
作者: Selene Baez Santamaria,Shihan Wang,Piek Vossen
关键词: artificial agent motivated, initial training, develop an artificial, motivated to augment, Abstract
中文关键词: 人工代理激励,初始训练,开发人工,激励增强,摘要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We develop an artificial agent motivated to augment its knowledge base beyond its initial training. The agent actively participates in dialogues with other agents, strategically acquiring new information. The agent models its knowledge as an RDF knowledge graph, integrating new beliefs acquired through conversation. Responses in dialogue are generated by identifying graph patterns around these new integrated beliefs. We show that policies can be learned using reinforcement learning to select effective graph patterns during an interaction, without relying on explicit user feedback. Within this context, our study is a proof of concept for leveraging users as effective sources of information.
摘要:我们开发了一个人工代理,其动机是在初始训练之外扩大其知识库。代理积极参与与其他代理的对话,战略性地获取新信息。代理将其知识建模为RDF知识图,集成通过对话获得的新信念。对话中的回应是通过识别围绕这些新的综合信念的图形模式来生成的。我们表明,可以使用强化学习来学习策略,以在交互期间选择有效的图模式,而无需依赖明确的用户反馈。在此背景下,我们的研究证明了利用用户作为有效信息来源的概念。

[NLP-59] Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts
[NLP-59] 大型语言模型中的包容性:科学摘要中的人格特征和性别偏见

链接: https://arxiv.org/abs/2406.19497
作者: Naseela Pervez,Alexander J. Titus
关键词: helping authors enhance, Large language models, helping authors, increasingly utilized, utilized to assist
中文关键词: 帮助作者增强,大型语言模型,帮助作者,越来越多地被利用,被用来协助
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilized to assist in scientific and academic writing, helping authors enhance the coherence of their articles. Previous studies have highlighted stereotypes and biases present in LLM outputs, emphasizing the need to evaluate these models for their alignment with human narrative styles and potential gender biases. In this study, we assess the alignment of three prominent LLMs - Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash - by analyzing their performance on benchmark text-generation tasks for scientific abstracts. We employ the Linguistic Inquiry and Word Count (LIWC) framework to extract lexical, psychological, and social features from the generated texts. Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases. This research highlights the importance of developing LLMs that maintain a diversity of writing styles to promote inclusivity in academic discourse.
摘要:大型语言模型(LLM)越来越多地被用于辅助科学与学术写作,帮助作者提高文章的连贯性。以往的研究指出了LLM输出中存在的刻板印象和偏见,强调有必要评估这些模型与人类叙事风格的契合程度及其潜在的性别偏见。在这项研究中,我们通过分析Claude 3 Opus、Mistral AI Large和Gemini 1.5 Flash这三个知名LLM在科学摘要基准文本生成任务中的表现来评估它们的契合度。我们使用语言查询与字数统计(LIWC)框架从生成文本中提取词汇、心理和社会特征。我们的发现表明,虽然这些模型生成的文本总体上与人类撰写的内容非常相似,但文体特征上的差异表明存在显著的性别偏见。这项研究强调了开发能保持写作风格多样性的LLM的重要性,以促进学术话语的包容性。

[NLP-60] Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems
[NLP-60] 用于创建人工系统的SAPPhIRE模型的检索增强生成工具的开发和评估

链接: https://arxiv.org/abs/2406.19493
作者: Anubhab Majumder,Kausik Bhattacharya,Amaresh Chakrabarti
关键词: Large Language Models, Representing systems, SAPPhIRE causality model, SAPPhIRE model, leverage Large Language
中文关键词: 大型语言模型,表示系统,SAPPhIRE因果关系模型,SAPPhIRE模型,利用大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool’s success - focusing on the factual accuracy and reliability of outcomes.
摘要:使用SAPPhIRE因果模型来表示系统,已被证明有助于支持类比设计。然而,为人工或生物系统构建SAPPhIRE模型是一个耗费精力的过程,需要人类专家从多份技术文档中获取有关系统如何工作的技术知识。本研究探讨了如何利用大型语言模型(LLM),基于SAPPhIRE因果模型为系统生成结构化描述。本文是两部分研究的第二部分,提出了一种新的检索增强生成(RAG)工具,用于生成与人工系统SAPPhIRE结构相关的信息,并报告了对该工具效果的初步评估结果,重点关注结果的事实准确性和可靠性。

[NLP-61] LoPT: Low-Rank Prompt Tuning for Parameter Efficient Language Models
[NLP-61] LoPT:参数高效语言模型的低级别提示调优

链接: https://arxiv.org/abs/2406.19486
作者: Shouchang Guo,Sonam Damani,Keng-hao Chang
关键词: prompt tuning, suffix text, token indices, text is added, optimized to gain
中文关键词: 提示调整、后缀文本、标记索引、添加文本、优化以获得
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In prompt tuning, a prefix or suffix text is added to the prompt, and the embeddings (soft prompts) or token indices (hard prompts) of the prefix/suffix are optimized to gain more control over language models for specific tasks. This approach eliminates the need for hand-crafted prompt engineering or explicit model fine-tuning. Prompt tuning is significantly more parameter-efficient than model fine-tuning, as it involves optimizing partial inputs of language models to produce desired outputs. In this work, we aim to further reduce the amount of trainable parameters required for a language model to perform well on specific tasks. We propose Low-rank Prompt Tuning (LoPT), a low-rank model for prompts that achieves efficient prompt optimization. The proposed method demonstrates similar outcomes to full parameter prompt tuning while reducing the number of trainable parameters by a factor of 5. It also provides promising results compared to the state-of-the-art methods that would require 10 to 20 times more parameters.
摘要:在提示调优中,会在提示前后添加前缀或后缀文本,并优化其嵌入(软提示)或词元索引(硬提示),以便在特定任务上更好地控制语言模型。这种方法无需手工设计提示工程,也无需对模型进行显式微调。提示调优只优化语言模型的部分输入来产生期望输出,因此比模型微调的参数效率高得多。在这项工作中,我们的目标是进一步减少语言模型在特定任务上取得良好表现所需的可训练参数量。我们提出了低秩提示调优(LoPT),一种面向提示的低秩模型,可实现高效的提示优化。该方法在将可训练参数量减少5倍的同时,取得了与全参数提示调优相近的效果;与需要10到20倍参数量的最新方法相比,也给出了有希望的结果。
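
下面给出一个极简的PyTorch示意,展示“用低秩分解参数化软提示”这一思路:完整软提示是 L×d 的可训练矩阵,低秩化后只训练 L×r 和 r×d 两个小矩阵。其中 prompt_len、hidden_dim、rank 等数值均为假设的示例,并非LoPT论文的实际实现或超参数。

```python
import torch
import torch.nn as nn

class LowRankSoftPrompt(nn.Module):
    """低秩软提示:用 U(L x r) 与 V(r x d) 的乘积近似完整的 L x d 提示嵌入。"""
    def __init__(self, prompt_len: int = 20, hidden_dim: int = 768, rank: int = 4):
        super().__init__()
        self.U = nn.Parameter(torch.randn(prompt_len, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        prompt = self.U @ self.V                         # 重构出 (L, d) 的软提示
        batch = input_embeds.size(0)
        prompt = prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # 拼接到输入嵌入之前

module = LowRankSoftPrompt()
x = torch.randn(2, 16, 768)             # 假设的批量输入嵌入
print(module(x).shape)                  # torch.Size([2, 36, 768])

full_params = 20 * 768                  # 完整软提示的可训练参数量:15360
low_rank_params = 20 * 4 + 4 * 768      # 低秩分解后的参数量:3152
print(full_params, low_rank_params)
```

当 r 远小于 L 和 d 时,可训练参数量从 L·d 降到 r·(L+d),在上述假设数值下恰好约为原来的五分之一。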

[NLP-62] xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
[NLP-62] xTower:用于解释和纠正翻译错误的多语言LLM

链接: https://arxiv.org/abs/2406.19482
作者: Marcos Treviso,Nuno M. Guerreiro,Sweta Agrawal,Ricardo Rei,José Pombal,Tania Vaz,Helena Wu,Beatriz Silva,Daan van Stigt,André F. T. Martins
关键词: achieving increasingly strong, increasingly strong performance, systems are achieving, performance on benchmarks, achieving increasingly
中文关键词: 实现越来越强大,越来越强大的性能,系统正在实现,性能达到基准,实现越来越多
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the generated explanations by xTower are assessed via both intrinsic and extrinsic evaluation. We ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness towards the error span being explained and helpfulness in error understanding and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower’s potential towards not only producing plausible and helpful explanations of automatic translations, but also leveraging them to suggest corrected translations.
摘要:虽然机器翻译系统在基准测试上取得了越来越好的性能,但它们产生的翻译经常出现错误和异常。了解这些错误可能有助于提高翻译质量和用户体验。本文介绍了xTower,这是一个建立在TowerBase之上的开放大型语言模型(LLM),旨在为翻译错误提供自由文本解释,以指导生成更正后的翻译。xTower生成的解释质量通过内在和外在两种评估方式加以评估。我们请专业译员从两个维度评估解释的质量:与所解释错误跨度的相关性,以及对理解错误和提高翻译质量的帮助。在外在评估方面,我们在多种生成翻译更正的实验设置中测试了xTower,结果显示翻译质量得到显著改善。我们的发现突出了xTower的潜力:它不仅能为自动翻译提供可信且有用的解释,还能利用这些解释来建议更正后的翻译。

[NLP-63] Sparse Regression for Machine Translation
[NLP-63] 机器翻译的稀疏回归

链接: https://arxiv.org/abs/2406.19478
作者: Ergun Biçici
关键词: machine translation outputs, transductive regression techniques, generate machine translation, regularized regression, learn correct feature
中文关键词: 机器翻译输出、转化回归技术、生成机器翻译、正规化回归、学习正确特征
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of L_1 regularized regression (lasso) to learn the mappings between sparsely observed feature sets versus L_2 regularized regression. Proper selection of training instances plays an important role to learn correct feature mappings within limited computational resources and at expected accuracy levels. We introduce the dice instance selection method for proper selection of training instances, which plays an important role to learn correct feature mappings for improving the source and target coverage of the training set. We show that L_1 regularized regression performs better than L_2 regularized regression both in regression measurements and in the translation experiments using graph decoding. We present encouraging results when translating from German to English and Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.
摘要:我们使用转导回归技术学习给定平行语料库的源端特征与目标端特征之间的映射,并利用这些映射生成机器翻译输出。我们展示了与L_2正则化回归相比,L_1正则化回归(lasso)在学习稀疏观测特征集之间映射上的有效性。在有限的计算资源和期望的精度水平下,正确选择训练实例对学习正确的特征映射起着重要作用。文中引入了dice训练实例选择方法,用于恰当地选择训练实例,以学习正确的特征映射并提高训练集的源端和目标端覆盖率。结果表明,无论是在回归度量上,还是在使用图解码的翻译实验中,L_1正则化回归都优于L_2正则化回归。我们在德语到英语、西班牙语到英语的翻译中取得了令人鼓舞的结果。我们还展示了用回归模型得到的映射替换基于短语的解码器的短语表时的结果。
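
下面用scikit-learn给出一个玩具示意,对比L_1正则化(lasso)与L_2正则化(ridge)回归在学习“源特征到目标特征”的稀疏线性映射时的差别;数据为随机合成,特征维度等均为假设,与论文的真实特征和dice实例选择方法无关。

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d_src, d_tgt = 200, 50, 30
X = rng.normal(size=(n, d_src))                      # 源端特征
W_true = np.zeros((d_src, d_tgt))
W_true[:5] = rng.normal(size=(5, d_tgt))             # 只有少数源特征真正起作用(稀疏映射)
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_tgt))  # 目标端特征

lasso = Lasso(alpha=0.05).fit(X, Y)                  # L1 正则,得到稀疏系数
ridge = Ridge(alpha=1.0).fit(X, Y)                   # L2 正则,系数普遍非零

print("lasso 非零系数比例:", np.mean(lasso.coef_ != 0))
print("ridge 非零系数比例:", np.mean(ridge.coef_ != 0))
```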

[NLP-64] Changing Answer Order Can Decrease MMLU Accuracy
[NLP-64] 改变答案顺序可能会降低MMLU准确性

链接: https://arxiv.org/abs/2406.19470
作者: Vipul Gupta,David Pantoja,Candace Ross,Adina Williams,Megan Ung
关键词: understanding model capabilities, large language models, grown in prevalence, large language, model capabilities
中文关键词: 理解模型能力、大型语言模型、流行率增长、大型语言、模型能力
类目: Computation and Language (cs.CL)
备注: Short paper, 9 pages

点击查看摘要

Abstract:As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
摘要:随着大型语言模型(LLM)的普及,特定的基准对于评估这些模型和理解模型能力至关重要。最常见的做法是,我们使用多个子任务的平均测试准确率来对排行榜上的模型进行排名,从而确定哪个模型最适合我们的目的。在本文中,我们研究了广泛使用的多项选择题问答数据集MMLU上准确率度量的稳健性。当打乱答案选项的内容顺序时,我们发现所有被考察的模型在MMLU上的准确率都会下降,但并非每个模型都同样敏感。这些发现表明,排行榜测试的标准做法或许需要调整,即额外考虑每个模型仅凭随机猜测就能答对的样例比例。
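
下面是一段极简的Python示意,说明“打乱选项内容并同步更新正确答案标签”的做法,可用来复现这类鲁棒性测试的基本流程;题目与字段名均为假设,并非MMLU的真实数据格式。

```python
import random

def shuffle_choices(question: dict, seed: int = 0) -> dict:
    """打乱一道多选题的选项内容,并同步更新正确答案对应的标签。"""
    rng = random.Random(seed)
    labels = ["A", "B", "C", "D"]
    choices = list(question["choices"])
    correct_text = choices[labels.index(question["answer"])]  # 先记下正确选项的文本
    rng.shuffle(choices)
    new_answer = labels[choices.index(correct_text)]          # 打乱后重新定位正确标签
    return {"question": question["question"], "choices": choices, "answer": new_answer}

q = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}
print(shuffle_choices(q))
```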

[NLP-65] Can Large Language Models Generate High-quality Patent Claims?
[NLP-65] 大型语言模型能否生成高质量的专利声明?

链接: https://arxiv.org/abs/2406.19465
作者: Lekang Jiang,Caiqi Zhang,Pascal A Scherz,Stephan Goetz
关键词: Large language models, offers highly structured, Large language, text generation tasks, shown exceptional performance
中文关键词: 大型语言模型,提供高度结构化的大型语言文本生成任务,表现出出色的性能
类目: Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:Large language models (LLMs) have shown exceptional performance across various text generation tasks but remain under-explored in the patent domain, which offers highly structured and precise language. This paper constructs a dataset to investigate the performance of current LLMs in patent claim generation. Our results demonstrate that generating claims based on patent descriptions outperforms previous research relying on abstracts. Interestingly, current patent-specific LLMs perform much worse than state-of-the-art general LLMs, highlighting the necessity for future research on in-domain LLMs. We also find that LLMs can produce high-quality first independent claims, but their performances markedly decrease for subsequent dependent claims. Moreover, fine-tuning can enhance the completeness of inventions’ features, conceptual clarity, and feature linkage. Among the tested LLMs, GPT-4 demonstrates the best performance in comprehensive human evaluations by patent experts, with better feature coverage, conceptual clarity, and technical coherence. Despite these capabilities, comprehensive revision and modification are still necessary to pass rigorous patent scrutiny and ensure legal robustness.
摘要:大型语言模型(LLM)在各种文本生成任务中表现出了优异的性能,但在语言高度结构化且精确的专利领域仍未得到充分探索。本文构建了一个数据集来研究现有LLM在专利权利要求生成中的性能。我们的结果表明,基于专利说明书生成权利要求的效果优于此前依赖摘要的研究。有趣的是,当前面向专利的专用LLM的表现远不如最先进的通用LLM,这突显了未来研究领域内LLM的必要性。我们还发现,LLM能够生成高质量的第一条独立权利要求,但在后续的从属权利要求上性能明显下降。此外,微调可以提高发明特征的完整性、概念清晰度和特征关联性。在测试的LLM中,GPT-4在专利专家的综合人工评估中表现最佳,具有更好的特征覆盖率、概念清晰度和技术连贯性。尽管具备这些能力,仍需全面的修订和修改才能通过严格的专利审查并确保法律上的稳健性。

[NLP-66] An Analysis of Multilingual FActScore
[NLP-66] 多语言FActScore分析

链接: https://arxiv.org/abs/2406.19415
作者: Kim Trong Vu,Michael Krumdick,Varshini Reddy,Franck Dernoncourt,Viet Dac Lai
关键词: Large Language Models, generated by Large, Language Models, Large Language, gained popularity
中文关键词: 由Large、Language Models、Large语言生成的Large语言模型受到欢迎
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English. However, there has not been any work in studying the behavior of FActScore in other languages. This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting. We introduce a new dataset for FActScore on texts generated by strong multilingual LLMs. Our evaluation shows that LLMs exhibit distinct behaviors in both fact extraction and fact scoring tasks. No LLM produces consistent and reliable FActScore across languages with varying levels of resources. We also find that the knowledge source plays an important role in the quality of the estimated FActScore. Using Wikipedia as the knowledge source may hinder the true FActScore of long-form text due to its limited coverage in medium- and low-resource languages. We also incorporate three mitigations to our knowledge source that ultimately improve FActScore estimation across all languages.
摘要:FActScore作为一种度量大型语言模型(LLM)生成的英文长文本事实性的指标,越来越受欢迎。然而,目前还没有研究FActScore在其他语言中的表现。本文研究了多语言环境下FActScore四组件流水线中各组件的局限性。我们引入了一个新的数据集,用于在强大的多语言LLM生成的文本上计算FActScore。我们的评估表明,LLM在事实提取和事实评分任务中都表现出不同的行为。没有任何LLM能在资源水平各异的语言上产生一致且可靠的FActScore。我们还发现,知识源对估计出的FActScore的质量起着重要作用。使用维基百科作为知识源,由于其对中低资源语言的覆盖有限,可能会妨碍对长文本真实FActScore的估计。我们还在知识源上引入了三项缓解措施,最终改进了所有语言的FActScore估计。
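
作为参考,FActScore的核心思想可以用下面几行Python示意:把长文本拆成原子事实后,统计其中被知识源支持的比例。这里的知识源与判定函数都是玩具化的假设,实际流程依赖检索与模型判定,细节请以原论文为准。

```python
def fact_score(atomic_facts, is_supported) -> float:
    """FActScore 的极简示意:被知识源支持的原子事实所占比例。"""
    if not atomic_facts:
        return 0.0
    return sum(1 for f in atomic_facts if is_supported(f)) / len(atomic_facts)

# 玩具知识源与判定方式,仅作演示
knowledge = {"Paris is the capital of France", "The Seine flows through Paris"}
facts = [
    "Paris is the capital of France",
    "The Seine flows through Paris",
    "Paris has a population of 50 million",   # 不被知识源支持
]
print(round(fact_score(facts, lambda f: f in knowledge), 3))  # 0.667
```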

[NLP-67] Saliency Attention and Semantic Similarity-Driven Adversarial Perturbation
[NLP-67] 显著性注意和语义相似性驱动的对抗性扰动

链接: https://arxiv.org/abs/2406.19413
作者: Hetvi Waghela,Jaydip Sen,Sneha Rakshit
关键词: enhanced textual adversarial, Semantic Similarity driven, introduce an enhanced, enhanced textual, Similarity driven adversarial
中文关键词: 增强的文本对抗,语义相似性驱动,引入增强的文本,相似性驱动的对抗
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The paper is 12 pages long. and it contains 5 tables. It is the pre-reviewed version of the paper that has been accepted for oral presentation and publication in the 5th International Conference on Data Science and Applications which will be organized in Jaipur, India from July 17 to 19, 2024. This is not the final version

点击查看摘要

Abstract:In this paper, we introduce an enhanced textual adversarial attack method, known as Saliency Attention and Semantic Similarity driven adversarial Perturbation (SASSP). The proposed scheme is designed to improve the effectiveness of contextual perturbations by integrating saliency, attention, and semantic similarity. Traditional adversarial attack methods often struggle to maintain semantic consistency and coherence while effectively deceiving target models. Our proposed approach addresses these challenges by incorporating a three-pronged strategy for word selection and perturbation. First, we utilize a saliency-based word selection to prioritize words for modification based on their importance to the model’s prediction. Second, attention mechanisms are employed to focus perturbations on contextually significant words, enhancing the attack’s efficacy. Finally, an advanced semantic similarity-checking method is employed that includes embedding-based similarity and paraphrase detection. By leveraging models like Sentence-BERT for embedding similarity and fine-tuned paraphrase detection models from the Sentence Transformers library, the scheme ensures that the perturbed text remains contextually appropriate and semantically consistent with the original. Empirical evaluations demonstrate that SASSP generates adversarial examples that not only maintain high semantic fidelity but also effectively deceive state-of-the-art natural language processing models. Moreover, in comparison to the original scheme of contextual perturbation CLARE, SASSP has yielded a higher attack success rate and lower word perturbation rate.
摘要:本文介绍了一种增强的文本对抗攻击方法,称为显著性注意和语义相似性驱动的对抗扰动(SASSP)。该方案旨在通过整合显著性、注意力和语义相似性来提高上下文扰动的有效性。传统的对抗攻击方法在有效欺骗目标模型的同时,往往难以保持语义的一致性和连贯性。我们提出的方法通过三管齐下的选词与扰动策略来应对这些挑战。首先,我们利用基于显著性的单词选择,按单词对模型预测的重要性来确定待修改单词的优先级。其次,借助注意力机制把扰动集中在上下文中重要的单词上,从而增强攻击的有效性。最后,采用一种先进的语义相似度检查方法,包括基于嵌入的相似度和释义检测。通过利用Sentence-BERT等模型计算嵌入相似度,并使用来自Sentence Transformers库的微调释义检测模型,该方案确保被扰动的文本在上下文上仍然合适且与原文语义一致。实验评估表明,SASSP生成的对抗样本不仅保持了较高的语义保真度,而且能有效欺骗最先进的自然语言处理模型。此外,与原有的上下文扰动方案CLARE相比,SASSP取得了更高的攻击成功率和更低的单词扰动率。

计算机视觉

[CV-0] Odd-One-Out: Anomaly Detection by Comparing with Neighbors

链接: https://arxiv.org/abs/2406.20099
作者: Ankan Bhunia,Changjian Li,Hakan Bilen
关键词: anomaly detection, problem that focuses, focuses on identifying, objects relative, paper introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Codes Dataset at this https URL

点击查看摘要

Abstract:This paper introduces a novel anomaly detection (AD) problem that focuses on identifying `odd-looking’ objects relative to the other instances within a scene. Unlike the traditional AD benchmarks, in our setting, anomalies in this context are scene-specific, defined by the regular instances that make up the majority. Since object instances are often partly visible from a single viewpoint, our setting provides multiple views of each scene as input. To provide a testbed for future research in this task, we introduce two benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates 3D object-centric representations for each instance and detects the anomalous ones through a cross-examination between the instances. We rigorously analyze our method quantitatively and qualitatively in the presented benchmarks.

[CV-1] Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

链接: https://arxiv.org/abs/2406.20098
作者: Sukmin Yun,Haokun Lin,Rusiru Thushara,Mohammad Qazim Bhat,Yongxin Wang,Zutao Jiang,Mingkai Deng,Jinhong Wang,Tianhua Tao,Junbo Li,Haonan Li,Preslav Nakov,Timothy Baldwin,Zhengzhong Liu,Eric P. Xing,Xiaodan Liang,Zhiqiang Shen
关键词: Multimodal large language, shown impressive success, Multimodal large, HTML code, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Website at this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage’s HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs’ abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at this https URL.

[CV-2] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

链接: https://arxiv.org/abs/2406.20095
作者: Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
关键词: Large Language Models, Large Language, extensive world knowledge, strong reasoning skills, Vision Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at this https URL.

[CV-3] LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

链接: https://arxiv.org/abs/2406.20092
作者: Jieneng Chen,Luoxin Ye,Ju He,Zhao-Yang Wang,Daniel Khashabi,Alan Yuille
关键词: large language models, large multi-modal models, largely overlooked area, visual tokens, visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available at this https URL

点击查看摘要

Abstract:While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to progressively compress the visual tokens from heavily to lightly, and finally no compression at the end of training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs. Code is available at this https URL
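
摘要中提到“在测试阶段用简单的平均池化去掉最多70%的视觉词元”。下面用PyTorch给出平均池化压缩视觉词元序列的一个极简示意;其中的词元数、维度和保留比例均为假设数值,并非LLaVolta的实际实现。

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """用一维平均池化压缩视觉词元序列:(B, N, D) -> (B, 约N*keep_ratio, D)。"""
    stride = max(int(round(1.0 / keep_ratio)), 1)
    x = tokens.transpose(1, 2)                     # (B, D, N),便于使用 avg_pool1d
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    return x.transpose(1, 2)

vis = torch.randn(2, 576, 1024)                    # 例如 24x24=576 个视觉词元
out = compress_visual_tokens(vis, keep_ratio=0.3)
print(out.shape)                                   # torch.Size([2, 192, 1024]),约去掉三分之二词元
```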

[CV-4] Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

链接: https://arxiv.org/abs/2406.20085
作者: Yicheng Chen,Xiangtai Li,Yining Li,Yanhong Zeng,Jianzong Wu,Xiangyu Zhao,Kai Chen
关键词: Diffusion-based models, shown great potential, shown great, Diffusion-based, generating high-quality images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, a fully automatic layout generation driven only by language and a suitable metric for measuring multiple generated instances has not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment perception and multi-modal training. Starting with a simple list of natural language concepts, we prompt large language models (LLMs) to generate a detailed description and design reasonable layouts. Next, we use an off-the-shelf text-to-image model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric to ensure quality. In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly. Our synthetic high-quality examples boost performance in various scenarios by customizing the initial concept list, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we have thoroughly investigated the correlation between CLIS and performance gains in downstream tasks, and we find that a better CLIS score results in better performance. This finding shows the potential for evaluation metrics as the role for various visual perception and MLLM tasks. Code will be available.

[CV-5] PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

链接: https://arxiv.org/abs/2406.20083
作者: Kuo-Hao Zeng,Zichen Zhang,Kiana Ehsani,Rose Hendrix,Jordi Salvador,Alvaro Herrasti,Ross Girshick,Aniruddha Kembhavi,Luca Weihs
关键词: Policy Transformer, purely in simulation, RGB-only indoor navigation, indoor navigation agent, RGB-only indoor
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.

[CV-6] Segment Anything without Supervision

链接: https://arxiv.org/abs/2406.20081
作者: XuDong Wang,Jingfeng Yang,Trevor Darrell
关键词: labor-intensive data labeling, requires labor-intensive data, data labeling, labor-intensive data, SAM
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code: this https URL

点击查看摘要

Abstract:The Segmentation Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to “discover” the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B’s ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM’s AR by over 6.7% and AP by 3.9% on SA-1B.

[CV-7] GM-DF: Generalized Multi-Scenario Deepfake Detection

链接: https://arxiv.org/abs/2406.20078
作者: Yingxin Lai,Zitong Yu,Jing Yang,Bin Li,Xiangui Kang,Linlin Shen
关键词: Existing face forgery, unknown attacks occur, face forgery detection, Existing face, limited generalization capacity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios by a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. Besides, as for the commonality representation, we use CLIP to extract the common features for better aligning visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. In consideration of the lack of study of multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models’ generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets conducted on traditional protocols as well as the proposed benchmark demonstrate the effectiveness of our approach.

[CV-8] HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

链接: https://arxiv.org/abs/2406.20077
作者: Hieu T. Nguyen,Yiwen Chen,Vikram Voleti,Varun Jampani,Huaizu Jiang
关键词: introduce HouseCrafter, complete large, diffusion model, images, indoor scene
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as condition to the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensures the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCraft can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: this https URL

[CV-9] EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

链接: https://arxiv.org/abs/2406.20076
作者: Yuxuan Zhang,Tianheng Cheng,Rui Hu,ei Liu,Heng Liu,Longjin Ran,Xiaoxin Chen,Wenyu Liu,Xinggang Wang
关键词: attracted widespread attention, Vision-language Fusion-based SAM, superior interactive segmentation, interactive segmentation capabilities, SAM
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.

[CV-10] ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction

链接: https://arxiv.org/abs/2406.20066
作者: Ding-Jiun Huang,Zi-Ting Chou,Yu-Chiang Frank Wang,Cheng Sun
关键词: NeRF-based methods reconstruct, view synthesis, explicit representations, building a radiance, radiance field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:NeRF-based methods reconstruct 3D scenes by building a radiance field with implicit or explicit representations. While NeRF-based methods can perform novel view synthesis (NVS) at arbitrary scale, the performance in high-resolution novel view synthesis (HRNVS) with low-resolution (LR) optimization often results in oversmoothing. On the other hand, single-image super-resolution (SR) aims to enhance LR images to HR counterparts but lacks multi-view consistency. To address these challenges, we propose Arbitrary-Scale Super-Resolution NeRF (ASSR-NeRF), a novel framework for super-resolution novel view synthesis (SRNVS). We propose an attention-based VoxelGridSR model to directly perform 3D super-resolution (SR) on the optimized volume. Our model is trained on diverse scenes to ensure generalizability. For unseen scenes trained with LR views, we then can directly apply our VoxelGridSR to further refine the volume and achieve multi-view consistent SR. We demonstrate quantitative and qualitatively that the proposed method achieves significant performance in SRNVS.

[CV-11] SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting

链接: https://arxiv.org/abs/2406.20055
作者: Sara Sabour,Lily Goli,George Kopanas,Mark Matthews,Dmitry Lagun,Leonidas Guibas,Alec Jacobson,David J. Fleet,Andrea Tagliasacchi
关键词: Gaussian Splatting, offering efficient training, highly controlled environments, require highly controlled, inter-view consistency assumption
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications.However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world captures problematic. We present SpotlessSplats, an approach that leverages pre-trained and general-purpose features coupled with robust optimization to effectively ignore transient distractors. Our method achieves state-of-the-art reconstruction quality both visually and quantitatively, on casual captures.

[CV-12] eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking

链接: https://arxiv.org/abs/2406.20024
作者: Yucheng Chen,Lin Wang
关键词: multi-modal fusion approaches, high frame rate, develop multi-modal fusion, frame rate object, fusion approaches
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: RGB-event single object tracking

点击查看摘要

Abstract:The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired some research attempts to develop multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, scale variation, etc. Meanwhile, no interaction between search and template features makes distinguishing target objects and backgrounds difficult. As a result, performance degradation is induced especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve the goal, we first propose an environmental Mix-of-Experts (eMoE) module that is built upon the environmental Attributes Disentanglement to learn attribute-specific features and environmental Attributes Gating to assemble the attribute-specific features by the learnable attribute scores dynamically. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to the prior arts.

[CV-13] Wavelets Are All You Need for Autoregressive Image Generation

链接: https://arxiv.org/abs/2406.19997
作者: Wael Mattar,Idan Levy,Nir Sharon,Shai Dekel
关键词: autoregressive image generation, main ingredients, approach to autoregressive, autoregressive image, wavelet image coding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures

点击查看摘要

Abstract:In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows to tokenize the visual details of an image from coarse to fine details by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
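
下面用PyWavelets给出一个极简示意:对图像做多层小波分解,并按系数绝对值大小排序得到“由显著到次要”的词元顺序。注意这只是对“以小波系数作为图像词元”这一思路的粗略演示,论文中基于最高有效位的具体排序方式与此不同,图像与参数均为假设。

```python
import numpy as np
import pywt

img = np.random.rand(64, 64).astype(np.float32)       # 占位图像
coeffs = pywt.wavedec2(img, wavelet="haar", level=3)   # 3 层二维小波分解
arr, slices = pywt.coeffs_to_array(coeffs)             # 把各子带系数拼成一个数组

# 按系数绝对值从大到小排序,形成“先粗后细”的词元顺序(仅为示意)
order = np.argsort(-np.abs(arr).ravel())
tokens = arr.ravel()[order][:256]                      # 取最显著的前 256 个系数作为词元
print(arr.shape, tokens.shape)
```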

[CV-14] STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

链接: https://arxiv.org/abs/2406.19973
作者: Guohao Sun,Can Qin,Huazhu Fu,Linwei Wang,Zhiqiang Tao
关键词: shown significant potential, assisting medical diagnosis, Large Vision-Language Models, extensive biomedical datasets, leveraging extensive biomedical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.

[CV-15] GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection

链接: https://arxiv.org/abs/2406.19941
作者: Chih-Chung Hsu,Shao-Ning Chen,Mei-Hsuan Wu,Yi-Fang Wang,Chia-Ming Lee,Yi-Shiuan Chou
关键词: DeepFake video detection, posing profound threats, manipulation techniques escalate, efficient detection strategies, develop efficient detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to TPAMI 2024

点击查看摘要

Abstract:As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with graph Laplacian to address the aforementioned challenges. First, conventional Convolution Neural Networks are deployed to perform spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with sparse constraint, enforcing essential features of valid face images in the noisy face sequences remaining, thus augmenting stability and performance for DeepFake video detection. Furthermore, the Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space to further improve the performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at this https URL.

[CV-16] Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

链接: https://arxiv.org/abs/2406.19922
作者: Tianli Liao,Ce Wang,Lei Li,Guangen Liu,Nan Li
关键词: image stitching method, intractable issue, image stitching, image, Large parallax
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:Large parallax between images is an intractable issue in image stitching. Various warping-based methods are proposed to address it, yet the results are unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via the energy-based multi-homography fitting algorithm. The multiple subsets of feature points are used to calculate the corresponding multiple homographies. For each segmented content in the overlapping region, we select its best-fitting homography with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on the public datasets demonstrate that our method provides the best alignment accuracy by a large margin, compared with the state-of-the-art methods. The source code is available at this https URL.
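
下面用OpenCV给出“对不同特征点子集分别拟合单应矩阵”的一个玩具示意:两组合成点分别服从两个不同的单应变换,各自用RANSAC估计。特征点的划分方式、阈值等均为假设,并非论文中基于分割与能量函数的多单应拟合实现。

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def apply_h(H, pts):
    """把 3x3 单应矩阵作用到 Nx2 的点集上。"""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# 两个平面各自对应不同的真值单应(合成数据)
H1 = np.array([[1.0, 0.02, 30.0], [0.01, 1.0, 5.0], [0.0, 0.0, 1.0]])
H2 = np.array([[0.9, -0.05, 80.0], [0.03, 1.1, -10.0], [0.0, 0.0, 1.0]])
pts1 = rng.uniform(0, 500, size=(40, 2))
pts2 = rng.uniform(0, 500, size=(40, 2))
subsets = [(pts1, apply_h(H1, pts1)), (pts2, apply_h(H2, pts2))]

# 对每个特征点子集分别用 RANSAC 估计单应
for i, (src, dst) in enumerate(subsets):
    H, mask = cv2.findHomography(src.astype(np.float32), dst.astype(np.float32),
                                 cv2.RANSAC, 3.0)
    print(f"subset {i} estimated H:\n", np.round(H, 3))
```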

[CV-17] Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

链接: https://arxiv.org/abs/2406.19905
作者: Longrong Yang,Dong Sheng,Chaoxiang Cai,Fan Yang,Size Li,Di Zhang,Xi Li
关键词: gained increasing attention, Large Vision-Language Models, gained increasing, increasing attention, Vision-Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token. However, the predictions are based solely on sample features and do not truly reveal the optimization direction of tokens. This can lead to severe optimization conflicts between different tokens within an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. Then, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate the effectiveness of our method. The code will be publicly available at this https URL.

[CV-18] On the Value of PHH3 for Mitotic Figure Detection on HE-stained Images

链接: https://arxiv.org/abs/2406.19899
作者: Jonathan Ganz,Christian Marzahl,Jonas Ammeling,Barbara Richter,Chloé Puget,Daniela Denk,Elena A. Demeter,Flaviu A. Tabaran,Gabriel Wasinger,Karoline Lipnik,Marco Tecilla,Matthew J. Valentine,Michael J. Dark,Niklas Abele,Pompei Bolfa,Ramona Erber,Robert Klopfleisch,Sophie Merz,Taryn A. Donovan,Samir Jabari,Christof A. Bertram,Katharina Breininger,Marc Aubreville
关键词: important prognostic marker, tumor cell proliferation, mitotic figures, observed in hematoxylin, hematoxylin and eosin
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注: 10 pages, 5 figures, 1 Table

点击查看摘要

Abstract:The count of mitotic figures (MFs) observed in hematoxylin and eosin (HE)-stained slides is an important prognostic marker as it is a measure for tumor cell proliferation. However, the identification of MFs has a known low inter-rater agreement. Deep learning algorithms can standardize this task, but they require large amounts of annotated data for training and validation. Furthermore, label noise introduced during the annotation process may impede the algorithm’s performance. Unlike HE, the mitosis-specific antibody phospho-histone H3 (PHH3) specifically highlights MFs. Counting MFs on slides stained against PHH3 leads to higher agreement among raters and has therefore recently been used as a ground truth for the annotation of MFs in HE. However, as PHH3 facilitates the recognition of cells indistinguishable from HE stain alone, the use of this ground truth could potentially introduce noise into the HE-related dataset, impacting model performance. This study analyzes the impact of PHH3-assisted MF annotation on inter-rater reliability and object level agreement through an extensive multi-rater experiment. We found that the annotators’ object-level agreement increased when using PHH3-assisted labeling. Subsequently, MF detectors were evaluated on the resulting datasets to investigate the influence of PHH3-assisted labeling on the models’ performance. Additionally, a novel dual-stain MF detector was developed to investigate the interpretation-shift of PHH3-assisted labels used in HE, which clearly outperformed single-stain detectors. However, the PHH3-assisted labels did not have a positive effect on solely HE-based models. The high performance of our dual-input detector reveals an information mismatch between the HE and PHH3-stained images as the cause of this effect.

[CV-19] InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

链接: https://arxiv.org/abs/2406.19875
作者: Kirolos Ataallah,Chenhui Gou,Eslam Abdelrahman,Khushbu Pahwa,Jian Ding,Mohamed Elhoseiny
关键词: ranging from tens, presents unique challenges, Movie Spoiler Questions, video, video comprehension
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 page ,17 figures

点击查看摘要

Abstract:Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding which presents 1) The longest video duration, averaging 76.34 minutes; 2) The largest number of question-answer pairs, 108.2K; 3) Diversity in questions that examine nine different skills and include both multiple-choice questions and open-ended questions; 4) Human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large MultiModality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and the open-source models. The evaluation reveals significant challenges posed by our benchmark. Our results show that even the best AI models, such as Gemini, struggle to perform well, with 42.72% average accuracy and an average score of 2.71 out of 5. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at this https URL

[CV-20] FootBots: A Transformer-based Architecture for Motion Prediction in Soccer

链接: https://arxiv.org/abs/2406.19852
作者: Guillem Capellera,Luis Ferraz,Antonio Rubio,Antonio Agudo,Francesc Moreno-Noguer
关键词: involves capturing complex, Motion prediction, capturing complex dynamics, soccer involves capturing, conditioned motion prediction
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注: Published as a conference paper at IEEE ICIP 2024

点击查看摘要

Abstract:Motion prediction in soccer involves capturing complex dynamics from player and ball interactions. We present FootBots, an encoder-decoder transformer-based architecture addressing motion prediction and conditioned motion prediction through equivariance properties. FootBots captures temporal and social dynamics using set attention blocks and multi-attention block decoder. Our evaluation utilizes two datasets: a real soccer dataset and a tailored synthetic one. Insights from the synthetic dataset highlight the effectiveness of FootBots’ social attention mechanism and the significance of conditioned motion prediction. Empirical results on real soccer data demonstrate that FootBots outperforms baselines in motion prediction and excels in conditioned tasks, such as predicting the players based on the ball position, predicting the offensive (defensive) team based on the ball and the defensive (offensive) team, and predicting the ball position based on all players. Our evaluation connects quantitative and qualitative findings. this https URL

[CV-21] StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction

链接: https://arxiv.org/abs/2406.19844
作者: Jiaheng Zhuang,Guoan Wang,Siyu Zhang,Xiyang Wang,Hangning Zhou,Ziyao Xu,Chi Zhang,Zhiheng Li
关键词: crucial modules, trajectory prediction, multi-object tracking, autonomous driving systems, tracking and trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Generally, the two tasks are handled separately in traditional paradigms and a few methods have started to explore modeling these two tasks in a joint manner recently. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations between tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address the above challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features for tracked objects more effectively. Secondly, a relative spatio-temporal positional encoding strategy is introduced to bridge the gap of coordinate representations between the two tasks and maintain the pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on the popular nuScenes dataset and the experimental results demonstrate the effectiveness and superiority of StreamMOTP, which outperforms previous methods significantly on both tasks. Furthermore, we also prove that the proposed framework has great potential and advantages in actual applications of autonomous driving.

[CV-22] LightStereo: Channel Boost Is All Your Need for Efficient 2D Cost Aggregation

链接: https://arxiv.org/abs/2406.19833
作者: Xianda Guo,Chenming Zhang,Dujun Nie,Wenzhao Zheng,Youmin Zhang,Long Chen
关键词: cutting-edge stereo-matching network, stereo-matching network crafted, cutting-edge stereo-matching, stereo-matching network, network crafted
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code will be available at \url{ this https URL }

点击查看摘要

Abstract:We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration has yielded plenty of strategies to amplify the capacity of the pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric in the SceneFlow datasets while demanding a minimum of only 22 GFLOPs, with an inference time of just 17 ms. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code will be available at this https URL.
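
For readers unfamiliar with the 3D-vs-4D cost volume distinction, here is a hedged PyTorch sketch of a correlation-based 3D cost volume (disparity × height × width), the lightweight alternative the abstract refers to. The mean-over-channels correlation and the maximum disparity are illustrative choices, not LightStereo's exact design.

```python
import torch

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """left_feat, right_feat: (B, C, H, W) feature maps.

    Returns a 3D cost volume of shape (B, max_disp, H, W) where slice d holds
    the per-pixel correlation between the left feature and the right feature
    shifted d pixels.
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            volume[:, d, :, d:] = (left_feat[:, :, :, d:] *
                                   right_feat[:, :, :, :-d]).mean(dim=1)
    return volume

cost = correlation_cost_volume(torch.randn(1, 32, 64, 128),
                               torch.randn(1, 32, 64, 128), max_disp=24)
print(cost.shape)  # torch.Size([1, 24, 64, 128])
```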

[CV-23] Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features

链接: https://arxiv.org/abs/2406.19815
作者: Feng Liu,Qing Xu,Qijian Zheng
关键词: hot topic, skeletal motions, Adversarial attack, adversarial attack method, skeletal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Adversarial attack on skeletal motion is a hot topic. However, existing research only considers part of the dynamic features when measuring the distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method to attack action recognizers for skeletal motions. Firstly, our method systematically proposes a dynamic distance function to measure the difference between skeletal motions. Meanwhile, we innovatively introduce emotional features for complementary information. In addition, we use the Alternating Direction Method of Multipliers (ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by the l-norm is the same, the dynamic perturbations generated by our method are much lower than those of other methods. What’s more, we are the first to prove the effectiveness of emotional features, and provide a new idea for measuring the distance between skeletal motions.

[CV-24] Extract More from Less: Efficient Fine-Grained Visual Recognition in Low-Data Regimes

链接: https://arxiv.org/abs/2406.19814
作者: Dmitry Demidov,Abduragim Shtanchaev,Mihail Mihaylov,Mohammad Almansoori
关键词: low inter-class variance, large intra-class variation, highly limited amount, low-data regimes assumes, fine-grained image classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Main paper and Appendices

点击查看摘要

Abstract:The emerging task of fine-grained image classification in low-data regimes assumes the presence of low inter-class variance and large intra-class variation along with a highly limited amount of training samples per class. However, traditional ways of separately dealing with fine-grained categorisation and extremely scarce data may be inefficient under both these harsh conditions presented together. In this paper, we present a novel framework, called AD-Net, aiming to enhance deep neural network performance on this challenge by leveraging the power of Augmentation and Distillation techniques. Specifically, our approach is designed to refine learned features through self-distillation on augmented samples, mitigating harmful overfitting. We conduct comprehensive experiments on popular fine-grained image classification benchmarks where our AD-Net demonstrates consistent improvement over traditional fine-tuning and state-of-the-art low-data techniques. Remarkably, with the smallest data available, our framework shows an outstanding relative accuracy increase of up to 45 % compared to standard ResNet-50 and up to 27 % compared to the closest SOTA runner-up. We emphasise that our approach is practically architecture-independent and adds zero extra cost at inference time. Additionally, we provide an extensive study on the impact of every framework’s component, highlighting the importance of each in achieving optimal performance. Source code and trained models are publicly available at this http URL.

[CV-25] EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

链接: https://arxiv.org/abs/2406.19811
作者: Daiwei Zhang,Gengyan Li,Jiajie Li,Mickaël Bressieux,Otmar Hilliges,Marc Pollefeys,Luc Van Gool,Xi Wang
关键词: simple household tasks, household tasks involve, tasks involve numerous, involve numerous object, inherently complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods in challenging in-the-wild videos and we also qualitatively demonstrate the high quality of the reconstructed models.

[CV-26] Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train

链接: https://arxiv.org/abs/2406.19756
作者: Haojun Jiang,Meng Li,Zhenguo Sun,Ning Jia,Yu Sun,Shaqi Luo,Shiji Song,Gao Huang
关键词: cardiac ultrasound images, acquisition cardiac ultrasound, ultrasound images, heart leads, leads to significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:The complex structure of the heart leads to significant challenges in echocardiography, especially in acquiring cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we innovatively propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on the test set with 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits the scanning.

[CV-27] MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

链接: https://arxiv.org/abs/2406.19736
作者: Jihao Liu,Xin Huang,Jinliang Zheng,Boxiao Liu,Jia Wang,Osamu Yoshie,Yu Liu,Hongsheng Li
关键词: high-quality visual instruction, visual instruction data, visual instruction, instruction-following capabilities, instruction data designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Dataset and models are available at this https URL

点击查看摘要

Abstract:This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at this https URL.

[CV-28] EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

链接: https://arxiv.org/abs/2406.19726
作者: Nicola Garau,Giulia Martinelli,Niccolò Bisagno,Denis Tomè,Carsten Stoll
关键词: Monocular Human Pose, Monocular Human, Human Pose Estimation, human joints, Human Pose
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
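
The full-perspective projection step that ties an estimated 3D pose to its 2D observation is standard geometry and can be sketched compactly. The NumPy example below is only an illustration of that relationship; the intrinsic/extrinsic parameter names and toy values are assumptions, not EPOCH's implementation.

```python
import numpy as np

def project_full_perspective(joints_3d, K, R, t):
    """Project 3D joints into the image with a full perspective camera.

    joints_3d: (J, 3) joint positions in world coordinates.
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation.
    Returns (J, 2) pixel coordinates.
    """
    cam = joints_3d @ R.T + t          # world -> camera coordinates
    proj = cam @ K.T                   # apply intrinsics
    return proj[:, :2] / proj[:, 2:3]  # perspective divide

# Toy usage with an identity rotation and the pose placed 3 m in front of the camera
K = np.array([[1000., 0., 512.], [0., 1000., 384.], [0., 0., 1.]])
uv = project_full_perspective(np.random.randn(17, 3), K, np.eye(3), np.array([0., 0., 3.]))
print(uv.shape)  # (17, 2)
```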

[CV-29] Vision Transformer with Key-select Routing Attention for Single Image Dehazing

链接: https://arxiv.org/abs/2406.19703
作者: Lihan Tong,Weijia Li,Qingxia Yang,Liyuan Chen,Peng Chen
关键词: Key-select Routing Attention, Frequency Processing Module, Lightweight Frequency Processing, Multi-scale Key-select Routing, utilizing Multi-scale Key-select
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages,4 figures,IEICE Trans. Information and Systems

点击查看摘要

Abstract:We present Ksformer, utilizing Multi-scale Key-select Routing Attention (MKRA) for intelligent selection of key areas through multi-channel, multi-scale windows with a top-k operator, and Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features, outperforming other dehazing methods in tests.

[CV-30] MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?

链接: https://arxiv.org/abs/2406.19693
作者: Jinming Li,Yichen Zhu,Zhiyuan Xu,Jindong Gu,Minjie Zhu,Xin Liu,Ning Liu,Yaxin Peng,Feifei Feng,Jian Tang
关键词: Multimodal Large Language, Large Language Models, language understanding, fundamentally challenging, assistants in human
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:It is fundamentally challenging for robots to serve as useful assistants in human environments because this requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. The recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated their exceptional abilities in solving complex mathematical problems and mastering commonsense and abstract reasoning. This has led to the recent utilization of MLLMs as the brain in robotic systems, enabling these models to conduct high-level planning prior to triggering low-level control actions for task execution. However, it remains uncertain whether existing MLLMs are reliable in serving the brain role of robots. In this study, we introduce MMRo, the first benchmark for evaluating Multimodal LLMs for Robotics, which tests the capability of MLLMs for robot applications. Specifically, we identify four essential capabilities (perception, task planning, visual reasoning, and safety measurement) that MLLMs must possess to qualify as the robot’s central processing unit. We have developed several scenarios for each capability, resulting in a total of 14 metrics for evaluation. We present experimental results for various MLLMs, including both commercial and open-source models, to assess the performance of existing systems. Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots. Our data can be found in this https URL.

[CV-31] Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation

链接: https://arxiv.org/abs/2406.19690
作者: Niful Islam,Mohaiminul Islam Bhuiyan,Jarin Tasnim Raya,Nur Shazwani Kamarudin,Khan Md Hasib,M. F. Mridha,Dewan Md. Farid
关键词: brain tumor classification, accurate brain tumor, brain tumor, early stage, early death
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain tumors are one of the most common diseases that lead to early death if not diagnosed at an early stage. Traditional diagnostic approaches are extremely time-consuming and prone to errors. In this context, computer vision-based approaches have emerged as an effective tool for accurate brain tumor classification. While some of the existing solutions demonstrate noteworthy accuracy, the models become infeasible to deploy in areas where computational resources are limited. This research addresses the need for accurate and fast classification of brain tumors with a priority of deploying the model in technologically underdeveloped regions. The research presents a novel architecture for precise brain tumor classification fusing pretrained ResNet152V2 and modified VGG16 models. The proposed architecture undergoes a diligent fine-tuning process that ensures fine gradients are preserved in deep neural networks, which are essential for effective brain tumor classification. The proposed solution incorporates various image processing techniques to improve image quality and achieves an astounding accuracy of 98.36% and 98.04% in Figshare and Kaggle datasets respectively. This architecture stands out for having a streamlined profile, with only 2.8 million trainable parameters. We have leveraged 8-bit quantization to produce a model of size 73.881 MB, significantly reducing it from the previous size of 289.45 MB, ensuring smooth deployment in edge devices even in resource-constrained areas. Additionally, the use of Grad-CAM improves the interpretability of the model, offering insightful information regarding its decision-making process. Owing to its high discriminative ability, this model can be a reliable option for accurate brain tumor classification.
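
A hedged Keras sketch of fusing two pretrained backbones by concatenating pooled features is shown below, in the spirit of the described ResNet152V2 + VGG16 fusion; the classification head, input size, and frozen-backbone policy are assumptions for illustration and do not reproduce the paper's exact architecture or 2.8M-parameter configuration.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet152V2, VGG16

def build_fusion_model(num_classes=4, input_shape=(224, 224, 3)):
    # Two ImageNet-pretrained backbones, frozen here to keep the sketch lightweight
    resnet = ResNet152V2(include_top=False, weights="imagenet", input_shape=input_shape)
    vgg = VGG16(include_top=False, weights="imagenet", input_shape=input_shape)
    resnet.trainable = False
    vgg.trainable = False

    inputs = layers.Input(shape=input_shape)
    # Pool each backbone's feature map and concatenate the two streams
    fused = layers.Concatenate()([
        layers.GlobalAveragePooling2D()(resnet(inputs)),
        layers.GlobalAveragePooling2D()(vgg(inputs)),
    ])
    x = layers.Dense(256, activation="relu")(fused)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_fusion_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```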

[CV-32] MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

链接: https://arxiv.org/abs/2406.19680
作者: Yuang Zhang,Jiaxi Gu,Li-Wen Wang,Han Wang,Junqi Cheng,Yuefeng Zhu,Fangyuan Zou
关键词: generative artificial intelligence, achieved significant advancements, recent years, generative artificial, spawning a variety
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: this https URL .

[CV-33] Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey

链接: https://arxiv.org/abs/2406.19675
作者: Uchitha Rajapaksha,Ferdous Sohel,Hamid Laga,Dean Diepeveen,Mohammed Bennamoun
关键词: single RGB images, including autonomous driving, widespread interest due, single RGB, RGB images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 46 pages, 10 figures, The paper has been accepted for publication in ACM Computing Surveys 2024

点击查看摘要

Abstract:Estimating depth from single RGB images and videos is of widespread interest due to its applications in many areas, including autonomous driving, 3D reconstruction, digital entertainment, and robotics. More than 500 deep learning-based papers have been published in the past 10 years, which indicates the growing interest in the task. This paper presents a comprehensive survey of the existing deep learning-based methods, the challenges they address, and how they have evolved in their architecture and supervision methods. It provides a taxonomy for classifying the current work based on their input and output modalities, network architectures, and learning methods. It also discusses the major milestones in the history of monocular depth estimation, and different pipelines, datasets, and evaluation metrics used in existing methods.

[CV-34] Beyond First-Order: A Multi-Scale Approach to Finger Knuckle Print Biometrics

链接: https://arxiv.org/abs/2406.19672
作者: Chengrui Gao,Ziyuan Yang,Andrew Beng Jin Teoh,Min Zhu
关键词: finger knuckle prints, rich textural patterns, FKP recognition, Prior FKP recognition, gained attention due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, finger knuckle prints (FKPs) have gained attention due to their rich textural patterns, positioning them as a promising biometric for identity recognition. Prior FKP recognition methods predominantly leverage first-order feature descriptors, which capture intricate texture details but fail to account for structural information. Emerging research, however, indicates that second-order textures, which describe the curves and arcs of the textures, encompass this overlooked structural information. This paper introduces a novel FKP recognition approach, the Dual-Order Texture Competition Network (DOTCNet), designed to capture texture information in FKP images comprehensively. DOTCNet incorporates three dual-order texture competitive modules (DTCMs), each targeting textures at different scales. Each DTCM employs a learnable texture descriptor, specifically a learnable Gabor filter (LGF), to extract texture features. By leveraging LGFs, the network extracts first and second order textures to describe fine textures and structural features thoroughly. Furthermore, an attention mechanism enhances relevant features in the first-order features, thereby highlighting significant texture details. For second-order features, a competitive mechanism emphasizes structural information while reducing noise from higher-order features. Extensive experimental results reveal that DOTCNet significantly outperforms several standard algorithms on the publicly available PolyU-FKP dataset.

[CV-35] PopAlign: Population-Level Alignment for Fair Text-to-Image Generation

链接: https://arxiv.org/abs/2406.19668
作者: Shufan Li,Harkanwar Singh,Aditya Grover
关键词: models achieve high-fidelity, achieve high-fidelity generation, large datasets, achieve high-fidelity, Direct Preference Optimization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:Text-to-image (T2I) models achieve high-fidelity generation through extensive training on large datasets. However, these models may unintentionally pick up undesirable biases of their training data, such as over-representation of particular identities in gender or ethnicity neutral prompts. Existing alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) fail to address this problem effectively because they operate on pairwise preferences consisting of individual samples, while the aforementioned biases can only be measured at a population level. For example, a single sample for the prompt “doctor” could be male or female, but a model generating predominantly male doctors even with repeated sampling reflects a gender bias. To address this limitation, we introduce PopAlign, a novel approach for population-level preference optimization, while standard optimization would prefer entire sets of samples over others. We further derive a stochastic lower bound that directly optimizes for individual samples from preferred populations over others for scalable training. Using human evaluation and standard image quality and bias metrics, we show that PopAlign significantly mitigates the bias of pretrained T2I models while largely preserving the generation quality. Code is available at this https URL.

[CV-36] CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion

链接: https://arxiv.org/abs/2406.19666
作者: Chih-Chung Hsu,Chih-Chien Ni,Chia-Ming Lee,Li-Wei Kang
关键词: detailed spectral information, industrial applications, capturing detailed spectral, pivotal in diverse, diverse scientific
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Submitted to TIP 2024

点击查看摘要

Abstract:Hyperspectral imaging, capturing detailed spectral information for each pixel, is pivotal in diverse scientific and industrial applications. Yet, the acquisition of high-resolution (HR) hyperspectral images (HSIs) remains difficult due to the hardware limitations of existing imaging systems. A prevalent workaround involves capturing both a high-resolution multispectral image (HR-MSI) and a low-resolution (LR) HSI, subsequently fusing them to yield the desired HR-HSI. Although deep learning-based methods have shown promise in HR-MSI/LR-HSI fusion and LR-HSI super-resolution (SR), their substantial model complexities hinder deployment on resource-constrained imaging devices. This paper introduces a novel knowledge distillation (KD) framework for HR-MSI/LR-HSI fusion to achieve SR of LR-HSI. Our KD framework integrates the proposed Cross-Layer Residual Aggregation (CLRA) block to enhance efficiency for constructing Dual Two-Streamed (DTS) network structure, designed to extract joint and distinct features from LR-HSI and HR-MSI simultaneously. To fully exploit the spatial and spectral feature representations of LR-HSI and HR-MSI, we propose a novel Cross Self-Attention (CSA) fusion module to adaptively fuse those features to improve the spatial and spectral quality of the reconstructed HR-HSI. Finally, the proposed KD-based joint loss function is employed to co-train the teacher and student networks. Our experimental results demonstrate that the student model not only achieves comparable or superior LR-HSI SR performance but also significantly reduces the model-size and computational requirements. This marks a substantial advancement over existing state-of-the-art methods. The source code is available at this https URL.

[CV-37] PM-VIS: High-Performance Video Instance Segmentation without Video Annotation

链接: https://arxiv.org/abs/2406.19665
作者: Zhangjing Yang,Dun Liu,Xin Wang,Zhe Li,Barathwaj Anandan,Yi Wu
关键词: segmentation requires detecting, Video instance segmentation, requires detecting, typically relying, instance segmentation requires
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: MIPR 2024

点击查看摘要

Abstract:Video instance segmentation requires detecting, segmenting, and tracking objects in videos, typically relying on costly video annotations. This paper introduces a method that eliminates video annotations by utilizing image datasets. The PM-VIS algorithm is adapted to handle both bounding box and instance-level pixel annotations dynamically. We introduce ImageNet-bbox to supplement missing categories in video datasets and propose the PM-VIS+ algorithm to adjust supervision based on annotation types. To enhance accuracy, we use pseudo masks and semi-supervised optimization techniques on unannotated video data. This method achieves high video instance segmentation performance without manual video annotations, offering a cost-effective solution and new perspectives for video instance segmentation applications. The code will be available in this https URL

[CV-38] Basketball-SORT: An Association Method for Complex Multi-object Occlusion Problems in Basketball Multi-object Tracking

链接: https://arxiv.org/abs/2406.19655
作者: Qingrui Hu,Atom Scott,Calvin Yeung,Keisuke Fujii
关键词: learning-based object detection, deep learning-based object, Recent deep learning-based, object detection approaches, CMOO problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent deep learning-based object detection approaches have led to significant progress in multi-object tracking (MOT) algorithms. The current MOT methods mainly focus on pedestrian or vehicle scenes, but basketball sports scenes are usually accompanied by three or more object occlusion problems with similar appearances and high-intensity complex motions, which we call complex multi-object occlusion (CMOO). Here, we propose an online and robust MOT approach, named Basketball-SORT, which focuses on the CMOO problems in basketball videos. To overcome the CMOO problem, instead of using the intersection-over-union-based (IoU-based) approach, we use the trajectories of neighboring frames based on the projected positions of the players. Our method designs the basketball game restriction (BGR) and reacquiring Long-Lost IDs (RLLI) based on the characteristics of basketball scenes, and we also solve the occlusion problem based on the player trajectories and appearance features. Experimental results show that our method achieves a Higher Order Tracking Accuracy (HOTA) score of 63.48 % on the basketball fixed video dataset and outperforms other recent popular approaches. Overall, our approach solved the CMOO problem more effectively than recent MOT algorithms.

[CV-39] Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion

链接: https://arxiv.org/abs/2406.19640
作者: Quanmin Liang,Zhilin Huang,Xiawu Zheng,Feidiao Yang,Jun Peng,Kai Huang,Yonghong Tian
关键词: Current Event Stream, Event Stream Super-Resolution, Current Event, complementary information present, Feature Fusion Modules
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current Event Stream Super-Resolution (ESR) methods overlook the redundant and complementary information present in positive and negative events within the event stream, employing a direct mixing approach for super-resolution, which may lead to detail loss and inefficiency. To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) that separates positive and negative events for complementary information extraction, followed by mutual supplementation and refinement. Particularly, we introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM is designed for the fusion of contextual information within neighboring event streams, leveraging the coupling relationship between positive and negative events to alleviate the misleading effect of noise in the respective branches. FEM efficiently promotes the fusion and exchange of information between positive and negative branches, enabling superior local information enhancement and global information complementation. Experimental results demonstrate that our approach achieves over 17% and 31% improvement on synthetic and real datasets, accompanied by a 2.3X acceleration. Furthermore, we evaluate our method on two downstream event-driven applications, i.e., object recognition and video reconstruction, achieving remarkable results that outperform existing methods. Our code and Supplementary Material are available at this https URL.

[CV-40] Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation

链接: https://arxiv.org/abs/2406.19638
作者: Junsung Park,Hyunjung Shim
关键词: Weakly Supervised Semantic, Supervised Semantic Segmentation, Weakly Supervised, Supervised Semantic, employs weak supervision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, accepted in AAAI 2024 Edge Intelligence Workshop

点击查看摘要

Abstract:Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. Despite the impressive achievements of recent WSSS methods, we identify that introducing weak labels with high mean Intersection over Union (mIoU) does not guarantee high segmentation performance. Existing studies have emphasized the importance of prioritizing precision and reducing noise to improve overall performance. In the same vein, we propose ORANDNet, an advanced ensemble approach tailored for WSSS. ORANDNet combines Class Activation Maps (CAMs) from two different classifiers to increase the precision of pseudo-masks (PMs). To further mitigate small noise in the PMs, we incorporate curriculum learning. This involves training the segmentation model initially with pairs of smaller-sized images and corresponding PMs, gradually transitioning to the original-sized pairs. By combining the original CAMs of ResNet-50 and ViT, we significantly improve the segmentation performance over the single-best model and the naive ensemble model, respectively. We further extend our ensemble method to CAMs from AMN (ResNet-like) and MCTformer (ViT-like) models, achieving performance benefits in advanced WSSS models. It highlights the potential of our ORANDNet as a final add-on module for WSSS models.
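
As a loose illustration of ensembling two classifiers' CAMs for higher-precision pseudo-masks, the NumPy sketch below keeps a pixel only where both CAMs agree (an AND of binarized maps) while the soft score is their elementwise maximum (OR-like). This is one plausible reading of a precision-oriented OR/AND combination and is not necessarily ORANDNet's exact rule; the threshold is arbitrary.

```python
import numpy as np

def ensemble_cams(cam_a, cam_b, threshold=0.3):
    """Illustrative precision-oriented fusion of two CAMs (values in [0, 1])."""
    soft = np.maximum(cam_a, cam_b)                      # OR-style soft map
    hard = (cam_a >= threshold) & (cam_b >= threshold)   # AND-style agreement mask
    return soft * hard                                   # high-precision pseudo-mask scores

cam_resnet = np.random.rand(64, 64)
cam_vit = np.random.rand(64, 64)
pseudo_mask = ensemble_cams(cam_resnet, cam_vit)
```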

[CV-41] Model Predictive Simulation Using Structured Graphical Models and Transformers

链接: https://arxiv.org/abs/2406.19635
作者: Xinghua Lou,Meet Dave,Shrinu Kushagra,Miguel Lazaro-Gredilla,Kevin Murphy
关键词: Waymo SimAgents challenge, probabilistic graphical models, Waymo SimAgents, multiple interacting agents, SimAgents challenge
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Special Mention at the Waymo Sim Agents Challenge 2024

点击查看摘要

Abstract:We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model, which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM, which contains factors that encode prior knowledge, such as a preference for smooth trajectories, and avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally, we sample K = 32 trajectories for each of the N ≈ 100 agents for the next T = 8Δ time steps, where Δ = 10 is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach “Model Predictive Simulation” or MPS. We show that MPS improves upon the MTR baseline, especially in safety critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model, and does not require extra training, so we believe it is a valuable contribution to the community.
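
The MPC-style rollout described above can be sketched as a simple loop: sample K candidate futures, refine them with a prior, keep only the first forecast step, then replan. In the hedged Python sketch below the forecaster and refiner are placeholder callables, and averaging over the K samples before taking the first step is a simplification, not the paper's exact selection rule.

```python
import numpy as np

def model_predictive_simulation(initial_state, forecast_fn, refine_fn,
                                num_steps=80, num_samples=32):
    """Roll out a scene by repeatedly forecasting and keeping only the first step.

    forecast_fn(state, K) -> (K, N, T, 2) candidate future trajectories.
    refine_fn(trajs)      -> trajectories adjusted by a prior (e.g. a PGM).
    """
    state = initial_state                    # (N, 2) current agent positions
    rollout = [state]
    for _ in range(num_steps):
        candidates = forecast_fn(state, num_samples)   # sample K futures
        candidates = refine_fn(candidates)             # apply smoothness/collision priors
        state = candidates.mean(axis=0)[:, 0, :]       # keep only the first forecast step
        rollout.append(state)                          # then replan from the new state
    return np.stack(rollout)

# Toy usage with random placeholder models for 4 agents
forecast = lambda s, k: s[None, :, None, :] + 0.1 * np.random.randn(k, s.shape[0], 8, 2)
refine = lambda t: t
traj = model_predictive_simulation(np.zeros((4, 2)), forecast, refine, num_steps=10)
print(traj.shape)  # (11, 4, 2)
```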

[CV-42] PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

链接: https://arxiv.org/abs/2406.19632
作者: Deyi Ji,Wenwei Jin,Hongtao Lu,Feng Zhao
关键词: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, fields necessitates effective, faces challenges due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: IJCAI 2024

点击查看摘要

Abstract:The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel \textbfPseudo Multi-\textbfPerspective \textbfTrans\textbfformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.

[CV-43] Optimal Video Compression using Pixel Shift Tracking

链接: https://arxiv.org/abs/2406.19630
作者: Hitesh Saai Mananchery Panneerselvam,Smit Anand
关键词: hard coded rules, Video comprises approximately, comprises approximately, internet traffic, coded rules
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video comprises approximately 85% of all internet traffic, but video encoding/compression has historically been done with hard-coded rules, which has worked well but only to a certain limit. We have seen a surge in video compression algorithms using ML-based models in the last few years, and many of them have outperformed several legacy codecs. The models range from encoding video end to end using an ML approach to replacing some intermediate steps in legacy codecs with ML models to increase the efficiency of those steps. Optimizing video storage is an essential aspect of video processing, and one possible approach to achieve it is to avoid redundant data in each frame. In this paper, we introduce the removal of redundancies across subsequent frames of a given video as the main approach for video compression. We call this method Redundancy Removal using Shift (R2S). This method can be utilized across various machine learning algorithms, making the compression more accessible and adaptable. In this study, we utilize a computer vision-based pixel point tracking method to identify redundant pixels to encode video for optimal storage.
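
A minimal sketch of frame-to-frame redundancy removal is given below: store a keyframe plus, for each later frame, only the coordinates and values of pixels that changed beyond a tolerance. The tolerance, storage format, and function names are assumptions for illustration and do not reproduce the R2S pixel-tracking method itself.

```python
import numpy as np

def remove_redundant_pixels(frames, tol=2):
    """frames: (T, H, W, 3) uint8 video. Returns a keyframe plus sparse deltas."""
    keyframe = frames[0]
    deltas = []
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        changed = np.abs(cur - prev).max(axis=2) > tol   # (H, W) mask of non-redundant pixels
        ys, xs = np.nonzero(changed)
        deltas.append((ys, xs, frame[ys, xs]))           # sparse per-frame update
        prev = cur
    return keyframe, deltas

def reconstruct(keyframe, deltas):
    out = [keyframe.copy()]
    for ys, xs, vals in deltas:
        frame = out[-1].copy()
        frame[ys, xs] = vals
        out.append(frame)
    return np.stack(out)

video = np.random.randint(0, 255, (5, 8, 8, 3), dtype=np.uint8)
key, deltas = remove_redundant_pixels(video)
```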

[CV-44] A Survey on Deep Clustering: From the Prior Perspective

链接: https://arxiv.org/abs/2406.19602
作者: Yiding Lu,Haobin Li,Yunfan Li,Yijie Lin,Xi Peng
关键词: powerful feature extraction, feature extraction ability, achieved great success, deep clustering, deep clustering methods
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.

[CV-45] SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

链接: https://arxiv.org/abs/2406.19593
作者: Xin Su,Man Luo,Kris W Pan,Tien Pei Chou,Vasudev Lal,Phillip Howard
关键词: gained significant attention, significant attention recently, gained significant, significant attention, attention recently
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever is used to gather relevant information that is then subsequently provided to a generative model via context augmentation. To address this challenging problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs which require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously-proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.

[CV-46] PathAlign: A vision-language model for whole slide images in histopathology

链接: https://arxiv.org/abs/2406.19578
作者: Faruk Ahmed,Andrew Sellergren,Lin Yang,Shawn Xu,Boris Babenko,Abbi Ward,Niels Olson,Arash Mohtashamian,Yossi Matias,Greg S. Corrado,Quang Duong,Dale R. Webster,Shravya Shetty,Daniel Golden,Yun Liu,David F. Steiner,Ellery Wulczyn
关键词: histopathology images underlies, Microscopic interpretation, treatment decisions, underlies many important, Microscopic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures

点击查看摘要

Abstract:Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

[CV-47] What Matters in Detecting AI-Generated Videos like Sora?

链接: https://arxiv.org/abs/2406.19568
作者: Chirui Chang,Zhengzhe Liu,Xiaoyang Lyu,Xiaojuan Qi
关键词: showcased remarkable results, Recent advancements, Stable Video Diffusion, videos remains under-explored, diffusion-based video generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing the Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: this https URL

[CV-48] Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction

链接: https://arxiv.org/abs/2406.19560
作者: Yuxuan Zhang,T.M. Sazzad,Yangyang Song,Spencer J. Chang,Ritesh Chowdhry,Tomas Mejia,Anna Hampton,Shelby Kucharski,Stefan Gerber,Barry Tillman,Marcio F. R. Resende,William M. Hammond,Chris H. Wilson,Alina Zare,Sanjeev J. Koppal
关键词: recently gained increasing, gained increasing attention, including agricultural investigation, Hyper-spectral imaging, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing, and many others. However, the high cost, large physical size, and complicated operation process prevent hyperspectral cameras from being employed in various applications and research fields. In this paper, we introduce a cost-efficient, compact, and easy-to-use active illumination camera that may benefit many applications. We developed a fully functional prototype of such a camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera’s data as ground truth and our camera’s data as input. We demonstrated our camera’s ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible with models and algorithms developed for hyperspectral applications with no modifications required.

[CV-49] Weighted Circle Fusion: Ensembling Circle Representation from Different Object Detection Results

链接: https://arxiv.org/abs/2406.19540
作者: Jialin Yue,Tianyuan Yao,Ruining Deng,Quan Liu,Juming Xiong,Haichun Yang,Yuankai Huo
关键词: Weighted Circle Fusion, identification of spherical, Recently, circle, Weighted Circle
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, the use of circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing isn’t crucial. Unfortunately, this widely adopted strategy is not readily available for combining circle representations. In this paper, we propose Weighted Circle Fusion (WCF), a simple approach for merging predictions from various circle detection models. Our method leverages confidence scores associated with each proposed bounding circle to generate averaged circles. Our method undergoes thorough evaluation on a proprietary dataset for glomerular detection within whole slide imaging (WSI). The findings reveal a performance gain of 5% compared to existing ensemble methods. Furthermore, the Weighted Circle Fusion technique not only improves the precision of object detection in medical images but also notably decreases false detections, presenting a promising direction for future research and application in pathological image analysis.
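
As a rough illustration of the fusion idea described above, the sketch below merges circle detections from several models by grouping overlapping circles and averaging their centers and radii with confidence weights. The circle-IoU matching rule, the threshold, and the function names are assumptions made for illustration, not the authors' released implementation.

```python
import numpy as np

def circle_iou(c1, c2):
    """Approximate IoU of two circles given as (x, y, r)."""
    d = np.hypot(c1[0] - c2[0], c1[1] - c2[1])
    r1, r2 = c1[2], c2[2]
    if d >= r1 + r2:
        return 0.0
    if d <= abs(r1 - r2):
        inter = np.pi * min(r1, r2) ** 2
    else:
        # lens area of two intersecting circles
        a1 = r1**2 * np.arccos((d**2 + r1**2 - r2**2) / (2 * d * r1))
        a2 = r2**2 * np.arccos((d**2 + r2**2 - r1**2) / (2 * d * r2))
        a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
        inter = a1 + a2 - a3
    union = np.pi * (r1**2 + r2**2) - inter
    return inter / union

def weighted_circle_fusion(detections, iou_thr=0.5):
    """detections: list of (x, y, r, score) proposals from different models.
    Greedily groups circles whose IoU exceeds iou_thr and returns
    score-weighted average circles with averaged confidences."""
    dets = sorted(detections, key=lambda d: -d[3])
    fused, used = [], [False] * len(dets)
    for i, d in enumerate(dets):
        if used[i]:
            continue
        group, used[i] = [d], True
        for j in range(i + 1, len(dets)):
            if not used[j] and circle_iou(d[:3], dets[j][:3]) > iou_thr:
                group.append(dets[j])
                used[j] = True
        w = np.array([g[3] for g in group])
        xyr = np.array([g[:3] for g in group])
        fused.append((*(w @ xyr / w.sum()), w.mean()))
    return fused

dets = [(100, 100, 20, 0.9), (102, 98, 22, 0.8), (300, 250, 15, 0.6)]
print(weighted_circle_fusion(dets))
```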

[CV-50] Comparative Analysis Of Color Models For Human Perception And Visual Color Difference

链接: https://arxiv.org/abs/2406.19520
作者: Aruzhan Burambekova,Pakizar Shamoi
关键词: influencing emotions, human visual perception, Color, human experience, visual perception
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper has been submitted to EJMCA journal for consideration. Current version is a preprint

点击查看摘要

Abstract:Color is integral to human experience, influencing emotions, decisions, and perceptions. This paper presents a comparative analysis of various color models’ alignment with human visual perception. The study evaluates color models such as RGB, HSV, HSL, XYZ, CIELAB, and CIELUV to assess their effectiveness in accurately representing how humans perceive color. We evaluate each model based on its ability to accurately reflect visual color differences and dominant palette extraction compatible with the human eye. In image processing, accurate assessment of color difference is essential for applications ranging from digital design to quality control. Current color difference metrics do not always match how people see colors, causing issues in accurately judging subtle differences. Understanding how different color models align with human visual perception is crucial for various applications in image processing, digital media, and design.
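
To make the gap between naive RGB distance and a perceptually motivated color difference concrete, here is a small sketch using scikit-image; the example colors are arbitrary and the comparison is only illustrative of the kind of analysis the paper performs.

```python
import numpy as np
from skimage import color

# Two sRGB colors (floats in [0, 1]) that look subtly different.
rgb_a = np.array([[[0.80, 0.36, 0.36]]])  # soft red
rgb_b = np.array([[[0.78, 0.38, 0.34]]])  # slightly shifted red

# Naive Euclidean distance in RGB treats all channels equally ...
rgb_dist = np.linalg.norm(rgb_a - rgb_b)

# ... while a perceptual difference converts to CIELAB first and uses CIEDE2000.
lab_a = color.rgb2lab(rgb_a)
lab_b = color.rgb2lab(rgb_b)
delta_e = color.deltaE_ciede2000(lab_a, lab_b)

print(f"Euclidean RGB distance : {rgb_dist:.4f}")
print(f"CIEDE2000 (CIELAB)     : {delta_e.item():.4f}")
```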

[CV-51] Stereo Vision Based Robot for Remote Monitoring with VR Support

链接: https://arxiv.org/abs/2406.19498
作者: Mohamed Fazil M. S.,Arockia Selvakumar A.,Daniel Schilberg
关键词: machine vision systems, human-like visual system, playing a significant, significant role, visual monitoring systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 Pages, 10 Figures

点击查看摘要

Abstract:Machine vision systems have been playing a significant role in visual monitoring. With the help of stereo vision and machine learning, they can mimic a human-like visual system and its behaviour towards the environment. In this paper, we present a stereo-vision-based 3-DOF robot for monitoring remote places through a cloud server and internet devices. The 3-DOF robot reproduces human-like head movements, i.e., yaw, pitch, and roll, and produces a 3D stereoscopic video stream in real time. This video stream is sent to the user through any generic internet device with VR box support, such as a smartphone, giving the user a first-person real-time 3D experience, while the user's head motion is transferred back to the robot, also in real time. The robot can also track moving objects and faces as targets using deep neural networks, which enables it to act as a standalone monitoring robot. The user is able to choose specific subjects to monitor in a space. Stereo vision provides depth information for the detected objects, which is used to track objects of human interest together with their distances and is sent to the cloud. A fully working prototype is developed which showcases the capabilities of a monitoring system based on stereo vision, robotics, and machine learning.

[CV-52] ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

链接: https://arxiv.org/abs/2406.19464
作者: Zeyi Liu,Cheng Chi,Eric Cousineau,Naveen Kuppuswamy,Benjamin Burchfiel,Shuran Song
关键词: signals provide rich, provide rich information, Audio signals provide, signals provide, provide rich
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Audio signals provide rich information about robot interactions and object properties through contact. This information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an ‘ear-in-hand’ data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments, by learning from diverse in-the-wild human demonstrations. Project website: this https URL

[CV-53] Efficient and Distributed Large-Scale 3D Map Registration using Tomographic Features

链接: https://arxiv.org/abs/2406.19461
作者: Halil Utku Unlu,Anthony Tzes,Prashanth Krishnamurthy,Farshad Khorrami
关键词: minimally parameterized, map matching, gravity-aligned local maps, suggested algorithm utilizes, algorithm utilizes tomographic
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to Elsevier Journal: Robotics and Autonomous Systems (RAS)

点击查看摘要

Abstract:A robust, resource-efficient, distributed, and minimally parameterized 3D map matching and merging algorithm is proposed. The suggested algorithm utilizes tomographic features from 2D projections of horizontal cross-sections of gravity-aligned local maps, and matches these projection slices at all possible height differences, enabling the estimation of four degrees of freedom in an efficient and parallelizable manner. The advocated algorithm improves state-of-the-art feature extraction and registration pipelines by an order of magnitude in memory use and execution time. Experimental studies are offered to investigate the efficiency of this 3D map merging scheme.

[CV-54] A Sanity Check for AI-generated Image Detection

链接: https://arxiv.org/abs/2406.19435
作者: Shilin Yan,Ouxiang Li,Jiayin Cai,Yanbin Hao,Xiaolong Jiang,Yao Hu,Weidi Xie
关键词: evoked increasing attention, discerning AI-generated content, AI-generated images, industry and academia, rapid development
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL Code: this https URL

点击查看摘要

Abstract:With the rapid development of generative models, discerning AI-generated content has evoked increasing attention from both industry and academia. In this paper, we conduct a sanity check on “whether the task of AI-generated image detection has been solved”. To start with, we present the Chameleon dataset, consisting of AI-generated images that are genuinely challenging for human perception. To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on the Chameleon dataset. Upon analysis, almost all models classify AI-generated images as real ones. Later, we propose AIDE (AI-generated Image DEtector with Hybrid Features), which leverages multiple experts to simultaneously extract visual artifacts and noise patterns. Specifically, to capture the high-level semantics, we utilize CLIP to compute the visual embedding. This effectively enables the model to discern AI-generated images based on semantics or contextual information. Secondly, we select the highest-frequency patches and the lowest-frequency patches in the image, and compute the low-level patchwise features, aiming to detect AI-generated images by low-level artifacts, for example, noise pattern, anti-aliasing, etc. When evaluated on existing benchmarks, for example, AIGCDetectBenchmark and GenImage, AIDE achieves +3.5% and +4.6% improvements over state-of-the-art methods, and on our proposed challenging Chameleon benchmark it also achieves promising results, although the problem of detecting AI-generated images is far from solved. The dataset, codes, and pre-train models will be published at this https URL.

[CV-55] YOLOv10 to Its Genesis: A Decadal and Comprehensive Review of The You Only Look Once Series

链接: https://arxiv.org/abs/2406.19407
作者: Ranjan Sapkota,Rizwan Qureshi,Marco Flores Calero,Muhammad Hussain,Chetan Badjugar,Upesh Nepal,Alwin Poulose,Peter Zeno,Uday Bhanu Prakash Vaddevolu,Hong Yan,Manoj Karkee
关键词: object detection algorithms, review systematically examines, real-time object detection, recently unveiled, object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This review systematically examines the progression of the You Only Look Once (YOLO) object detection algorithms from YOLOv1 to the recently unveiled YOLOv10. Employing a reverse chronological analysis, this study examines the advancements introduced by YOLO algorithms, beginning with YOLOv10 and progressing through YOLOv9, YOLOv8, and subsequent versions to explore each version’s contributions to enhancing speed, accuracy, and computational efficiency in real-time object detection. The study highlights the transformative impact of YOLO across five critical application areas: automotive safety, healthcare, industrial manufacturing, surveillance, and agriculture. By detailing the incremental technological advancements that each iteration brought, this review not only chronicles the evolution of YOLO but also discusses the challenges and limitations observed in each earlier version. The evolution signifies a path towards integrating YOLO with multimodal, context-aware, and Artificial General Intelligence (AGI) systems for the next YOLO decade, promising significant implications for future developments in AI-driven applications.

[CV-56] Deep Convolutional Neural Networks Meet Variational Shape Compactness Priors for Image Segmentation

链接: https://arxiv.org/abs/2406.19400
作者: Kehui Zhang,Lingfeng Li,Hao Liu,Jing Yuan,Xue-Cheng Tai
关键词: key geometrical property, describe interesting regions, image segmentation tasks, Shape compactness, key geometrical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages

点击查看摘要

Abstract:Shape compactness is a key geometrical property to describe interesting regions in many image segmentation tasks. In this paper, we propose two novel algorithms to solve the introduced image segmentation problem that incorporates a shape-compactness prior. Existing algorithms for such a problem often suffer from computational inefficiency, difficulty in reaching a local minimum, and the need to fine-tune the hyperparameters. To address these issues, we propose a novel optimization model along with its equivalent primal-dual model and introduce a new optimization algorithm based on primal-dual threshold dynamics (PD-TD). Additionally, we relax the solution constraint and propose another novel primal-dual soft threshold-dynamics algorithm (PD-STD) to achieve superior performance. Based on the variational explanation of the sigmoid layer, the proposed PD-STD algorithm can be integrated into Deep Neural Networks (DNNs) to enforce compact regions as image segmentation results. Compared to existing deep learning methods, extensive experiments demonstrated that the proposed algorithms outperformed state-of-the-art algorithms in numerical efficiency and effectiveness, especially while applying to the popular networks of DeepLabV3 and IrisParseNet with higher IoU, dice, and compactness metrics on noisy Iris datasets. In particular, the proposed algorithms significantly improve IoU by 20% when trained on a highly noisy image dataset.

[CV-57] Woven Fabric Capture with a Reflection-Transmission Photo Pair

链接: https://arxiv.org/abs/2406.19398
作者: Yingjie Tang,Zixuan Li,Miloš Hašan,Jian Yang,Beibei Wang
关键词: Digitizing woven fabrics, Digitizing woven, fabric parameters, reflection, fabric
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 10 pages, 16 figures (in the main paper). Accepted by SIGGRAPH 2024 conference

点击查看摘要

Abstract:Digitizing woven fabrics would be valuable for many applications, from digital humans to interior design. Previous work introduces a lightweight woven fabric acquisition approach by capturing a single reflection image and estimating the fabric parameters with a differentiable geometric and shading model. The renderings of the estimated fabric parameters can closely match the photo; however, the captured reflection image is insufficient to fully characterize the fabric sample reflectance. For instance, fabrics with different thicknesses might have similar reflection images but lead to significantly different transmission. We propose to recover the woven fabric parameters from two captured images: reflection and transmission. At the core of our method is a differentiable bidirectional scattering distribution function (BSDF) model, handling reflection and transmission, including single and multiple scattering. We propose a two-layer model, where the single scattering uses an SGGX phase function as in previous work, and multiple scattering uses a new azimuthally-invariant microflake definition, which we term ASGGX. This new fabric BSDF model closely matches real woven fabrics in both reflection and transmission. We use a simple setup for capturing reflection and transmission photos with a cell phone camera and two point lights, and estimate the fabric parameters via a lightweight network, together with a differentiable optimization. We also model the out-of-focus effects explicitly with a simple solution to match the thin-lens camera better. As a result, the renderings of the estimated parameters can agree with the input images on both reflection and transmission for the first time. The code for this paper is at this https URL.

[CV-58] HAITCH: A Framework for Distortion and Motion Correction in Fetal Multi-Shell Diffusion-Weighted MRI

链接: https://arxiv.org/abs/2406.20042
作者: Haykel Snoussi,Davood Karimi,Onur Afacan,Mustafa Utkur,Ali Gholipour
关键词: fetal dMRI data, fetal dMRI, Diffusion magnetic resonance, resolution fetal dMRI, dMRI data
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion magnetic resonance imaging (dMRI) is pivotal for probing the microstructure of the rapidly-developing fetal brain. However, fetal motion during scans and its interaction with magnetic field inhomogeneities result in artifacts and data scattering across spatial and angular domains. The effects of those artifacts are more pronounced in high-angular resolution fetal dMRI, where signal-to-noise ratio is very low. Those effects lead to biased estimates and compromise the consistency and reliability of dMRI analysis. This work presents HAITCH, the first and the only publicly available tool to correct and reconstruct multi-shell high-angular resolution fetal dMRI data. HAITCH offers several technical advances that include a blip-reversed dual-echo acquisition for dynamic distortion correction, advanced motion correction for model-free and robust reconstruction, optimized multi-shell design for enhanced information capture and increased tolerance to motion, and outlier detection for improved reconstruction fidelity. The framework is open-source, flexible, and can be used to process any type of fetal dMRI data including single-echo or single-shell acquisitions, but is most effective when used with multi-shell multi-echo fetal dMRI data that cannot be processed with any of the existing tools. Validation experiments on real fetal dMRI scans demonstrate significant improvements and accurate correction across diverse fetal ages and motion levels. HAITCH successfully removes artifacts and reconstructs high-fidelity fetal dMRI data suitable for advanced diffusion modeling, including fiber orientation distribution function estimation. These advancements pave the way for more reliable analysis of the fetal brain microstructure and tractography under challenging imaging conditions.

[CV-59] Malaria Cell Detection Using Deep Neural Networks

链接: https://arxiv.org/abs/2406.20005
作者: Saurabh Sawant,Anurag Singh
关键词: health concerns globally, pressing public health, public health concerns, causing significant morbidity, sub-Saharan Africa
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Malaria remains one of the most pressing public health concerns globally, causing significant morbidity and mortality, especially in sub-Saharan Africa. Rapid and accurate diagnosis is crucial for effective treatment and disease management. Traditional diagnostic methods, such as microscopic examination of blood smears, are labor-intensive and require significant expertise, which may not be readily available in resource-limited settings. This project aims to automate the detection of malaria-infected cells using a deep learning approach. We employed a convolutional neural network (CNN) based on the ResNet50 architecture, leveraging transfer learning to enhance performance. The Malaria Cell Images Dataset from Kaggle, containing 27,558 images categorized into infected and uninfected cells, was used for training and evaluation. Our model demonstrated high accuracy, precision, and recall, indicating its potential as a reliable tool for assisting in malaria diagnosis. Additionally, a web application was developed using Streamlit to allow users to upload cell images and receive predictions about malaria infection, making the technology accessible and user-friendly. This paper provides a comprehensive overview of the methodology, experiments, and results, highlighting the effectiveness of deep learning in medical image analysis.
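
A minimal sketch of the transfer-learning setup the abstract describes (an ImageNet-pretrained ResNet50 with a new binary head for infected vs. uninfected cells). The hyperparameters, the frozen-backbone choice, and the helper function are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Standard ImageNet preprocessing for the cell-image crops (assumed here).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pretrained ResNet50 backbone with a new two-class head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for p in model.parameters():          # freeze the backbone for transfer learning
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # only this layer is trained

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on a batch of preprocessed cell images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```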

[CV-60] Impact of Initialization on Intra-subject Pediatric Brain MR Image Registration: A Comparative Analysis between SyN ANTs and Deep Learning-Based Approaches

链接: https://arxiv.org/abs/2406.19943
作者: Andjela Dimitrijevic,Vincent Noblet,Benjamin De Leener
关键词: intrasubject deformable registration, SyN ANTs, specifically focusing, conventional SyN ANTs, study evaluates
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:This study evaluates the performance of conventional SyN ANTs and learning-based registration methods in the context of pediatric neuroimaging, specifically focusing on intrasubject deformable registration. The comparison involves three approaches: without (NR), with rigid (RR), and with rigid and affine (RAR) initializations. In addition to initialization, performances are evaluated in terms of accuracy, speed, and the impact of age intervals and sex per pair. Data consists of the publicly available MRI scans from the Calgary Preschool dataset, which includes 63 children aged 2-7 years, allowing for 431 registration pairs. We implemented the unsupervised DL framework with a U-Net architecture using DeepReg and it was 5-fold cross-validated. Evaluation includes Dice scores for tissue segmentation from 18 smaller regions obtained by SynthSeg, analysis of log Jacobian determinants, and registration pro-rated training and inference times. Learning-based approaches, with or without linear initializations, exhibit slight superiority over SyN ANTs in terms of Dice scores. Indeed, DL-based implementations with RR and RAR initializations significantly outperform SyN ANTs. Both SyN ANTs and DL-based registration involve parameter optimization, but the choice between these methods depends on the scale of registration: network-based for broader coverage or SyN ANTs for specific structures. Both methods face challenges with larger age intervals due to greater growth changes. The main takeaway is that while DL-based methods show promise with faster and more accurate registrations, SyN ANTs remains robust and generalizable without the need for extensive training, highlighting the importance of method selection based on specific registration needs in the pediatric context. Our code is available at this https URL

[CV-61] Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting

链接: https://arxiv.org/abs/2406.19796
作者: Wei Li,Jingyang Zhang,Pheng-Ann Heng,Lixu Gu
关键词: Generalist segmentation models, Generalist segmentation, increasingly favored, involving various objects, Comprehensive Generative Replay
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI24

点击查看摘要

Abstract:Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. Task-Incremental Learning (TIL) offers a privacy-preserving training paradigm using tasks arriving sequentially, instead of gathering them due to strict data sharing policies. However, the task evolution can span a wide scope that involves shifts in both image appearance and segmentation semantics with intricate correlation, causing concurrent appearance and semantic forgetting. To solve this issue, we propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs to mimic past task data, which focuses on two aspects: modeling image-mask correspondence and promoting scalability for diverse tasks. Specifically, we introduce a novel Bayesian Joint Diffusion (BJD) model for high-quality synthesis of image-mask pairs with their correspondence explicitly preserved by conditional denoising. Furthermore, we develop a Task-Oriented Adapter (TOA) that recalibrates prompt embeddings to modulate the diffusion model, making the data synthesis compatible with different tasks. Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting. Code is available at this https URL.

[CV-62] SPIRONet: Spatial-Frequency Learning and Topological Channel Interaction Network for Vessel Segmentation

链接: https://arxiv.org/abs/2406.19749
作者: De-Xing Huang,Xiao-Hu Zhou,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Zhen-Qiu Feng,Mei-Jiang Gui,Hao Li,Tian-Yu Xiang,Bo-Xian Yao,Zeng-Guang Hou
关键词: Automatic vessel segmentation, developing next-generation interventional, Automatic vessel, paramount for developing, developing next-generation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automatic vessel segmentation is paramount for developing next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performances due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel spatial-frequency learning and topological channel interaction network (SPIRONet) is proposed to address the above issues. Specifically, dual encoders are utilized to comprehensively capture local spatial and global frequency vessel features. Then, a cross-attention fusion module is introduced to effectively fuse spatial and frequency features, thereby enhancing feature discriminability. Furthermore, a topological channel interaction module is designed to filter out task-irrelevant responses based on graph neural networks. Extensive experimental results on several challenging datasets (CADSA, CAXF, DCA1, and XCAD) demonstrate state-of-the-art performances of our method. Moreover, the inference speed of SPIRONet is 21 FPS with a 512x512 input size, surpassing clinical real-time requirements (6~12FPS). These promising outcomes indicate SPIRONet’s potential for integration into vascular interventional navigation systems. Code is available at this https URL.

[CV-63] Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction

链接: https://arxiv.org/abs/2406.19686
作者: Akash Awasthi,Ngan Le,Zhigang Deng,Carol C. Wu,Hien Van Nguyen
关键词: correct perceptual errors, Human-AI collaboration, previously explored, eye gaze data, collaboration to identify
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Under Review in Journal

点击查看摘要

Abstract:Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX, employing a large multimodal model to analyze image embeddings, eye gaze data, and radiology reports. The system’s effectiveness was evaluated based on its referral-making process, the quality of referrals, and performance in collaborative diagnostic settings. CoRaX was tested on a simulated error dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system corrected 21% (71 of 332) of these errors, leaving 7% (22 of 312) unresolved. The Referral-Usefulness score, indicating the accuracy of predicted regions for all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score, reflecting the diagnostic accuracy of CoRaX’s interactions with radiologists, showed that 84% (237 of 280) of these interactions had a score above 0.40. In conclusion, CoRaX efficiently collaborates with radiologists to address perceptual errors across various abnormalities, with potential applications in the education and training of novice radiologists.

[CV-64] AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation

链接: https://arxiv.org/abs/2406.19649
作者: Guanghao Zhu,Jing Zhang,Juanxiu Liu,Xiaohui Du,Ruqian Hao,Yong Liu,Lin Liu
关键词: medical image segmentation, shown considerable potential, Semi-supervised learning, primarily leveraging consistency, image segmentation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). Firstly, we design an adversarial consistency regularization (ACR) approach to enhance knowledge transfer and strengthen prediction consistency under varying perturbation intensities. Second, we apply a feature matching loss for adversarial training to incorporate high-level consistency regularization. Additionally, we present the pyramid channel attention (PCA) and efficient channel and spatial attention (ECSA) modules to improve the discriminator’s performance. Finally, we propose an adaptive self-training (AST) approach to ensure the pseudo-labels’ quality. The proposed AstMatch has been extensively evaluated with cutting-edge SSL methods on three public-available datasets. The experimental results under different labeled ratios indicate that AstMatch outperforms other existing methods, achieving new state-of-the-art performance. Our code will be available at this https URL.

[CV-65] Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation

链接: https://arxiv.org/abs/2406.19557
作者: Jack Highton,Quok Zong Chong,Samuel Finestone,Arian Beqiri,Julia A. Schnabel,Kanwal K. Bhatia
关键词: Deep learning models, medical image segmentation, Deep learning, clinical products, medical image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products. However, as details are rarely provided about the training data, models may unexpectedly fail when cases differ from those in the training distribution. An approach allowing potential users to independently test the robustness of a model, treating it as a black box and using only a few cases from their own site, is key for adoption. To address this, a method to test the robustness of these models against CT image quality variation is presented. In this work we present this framework by demonstrating that, given the same training data, the model architecture and data pre-processing greatly affect the robustness of several frequently used segmentation and object detection methods to simulated CT imaging artifacts and degradation. Our framework also addresses the concern about the sustainability of deep learning models in clinical use, by considering future shifts in image quality due to scanner deterioration or imaging protocol changes which are not reflected in a limited local test dataset.

[CV-66] BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain Diseases

链接: https://arxiv.org/abs/2406.19556
作者: Muhammad Awais,Mehaboobathunnisa Sahul Hameed,Bidisha Bhattacharya,Orly Reiner,Rao Muhammad Anwer
关键词: Recent advances, brain organoids derived, human brain development, advances have enabled, derived from stem
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances have enabled the study of human brain development using brain organoids derived from stem cells. Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies. We introduce BOrg, a dataset designed to study mitotic events in the embryonic development of the brain using confocal microscopy images of brain organoids. BOrg utilizes an efficient annotation pipeline with sparse point annotations and techniques that minimize expert effort, overcoming limitations of standard deep learning approaches on sparse data. We adapt and benchmark state-of-the-art object detection and cell counting models on BOrg for detecting and analyzing mitotic cells across prophase, metaphase, anaphase, and telophase stages. Our results demonstrate these adapted models significantly improve mitosis analysis efficiency and accuracy for brain organoid research compared to existing methods. BOrg facilitates the development of automated tools to quantify statistics like mitosis rates, aiding mechanistic studies of neurodevelopmental processes and disorders. Data and code are available at this https URL.

[CV-67] High-resolution segmentations of the hypothalamus and its subregions for training of segmentation models

链接: https://arxiv.org/abs/2406.19492
作者: Livia Rodrigues,Martina Bocchetta,Oula Puonti,Douglas Greve,Ana Carolina Londe,Marcondes França,Simone Appenzeller,Leticia Rittner,Juan Eugenio Iglesias
关键词: magnetic resonance imaging, relevant neuroimaging topic, highly relevant neuroimaging, resonance imaging, neuroimaging topic
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segmentation of brain structures on magnetic resonance imaging (MRI) is a highly relevant neuroimaging topic, as it is a prerequisite for different analyses such as volumetry or shape analysis. Automated segmentation facilitates the study of brain structures in larger cohorts when compared with manual segmentation, which is time-consuming. However, the development of most automated methods relies on large and manually annotated datasets, which limits the generalizability of these methods. Recently, new techniques using synthetic images have emerged, reducing the need for manual annotation. Here we provide HELM, Hypothalamic ex vivo Label Maps, a dataset composed of label maps built from publicly available ultra-high resolution ex vivo MRI from 10 whole hemispheres, which can be used to develop segmentation methods using synthetic data. The label maps are obtained with a combination of manual labels for the hypothalamic regions and automated segmentations for the rest of the brain, and mirrored to simulate entire brains. We also provide the pre-processed ex vivo scans, as this dataset can support future projects to include other structures after these are manually segmented.

[CV-68] GAPNet: Granularity Attention Network with Anatomy-Prior-Constraint for Carotid Artery Segmentation

链接: https://arxiv.org/abs/2406.19485
作者: Lin Zhang,Chenggang Lu,Xin-yang Shi,Caifeng Shan,Jiong Zhang,Da Chen,Laurent D. Cohen
关键词: primarily affects, affects the arterial, progressive disease, arterial walls, Abstract
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carotid artery (CA) from surrounding structures, especially with changes like atherosclerosis. In order to address these issues, we propose GAPNet, which consists of a novel geometric prior deduced from.

机器学习

[LG-0] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

链接: https://arxiv.org/abs/2406.20095
作者: Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
关键词: Large Language Models, Large Language, extensive world knowledge, strong reasoning skills, Vision Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at this https URL.

[LG-1] Scaling Synthetic Data Creation with 1000000000 Personas

链接: https://arxiv.org/abs/2406.20094
作者: Xin Chan,Xiaoyang Wang,Dian Yu,Haitao Mi,Dong Yu
关键词: large language model, create diverse synthetic, diverse synthetic data, language model, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub – a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.

[LG-2] ProgressGym: Alignment with a Millennium of Moral Progress

链接: https://arxiv.org/abs/2406.20087
作者: Tianyi Qiu,Yang Zhang,Xuchuan Huang,Jasmine Xinze Li,Jiaming Ji,Yaodong Yang
关键词: large language models, including large language, Frontier AI systems, hold increasing influence, including large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at this https URL and this https URL respectively.

[LG-3] Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

链接: https://arxiv.org/abs/2406.20086
作者: Sheridan Feucht,David Atkinson,Byron Wallace,David Bau
关键词: LLMs process text, process text, text as sequences, represented by multiple, tokens
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 13 pages, 14 figures. Code and data at this https URL

点击查看摘要

Abstract:LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s tokenizer splits the word “northeastern” into the tokens [‘_n’, ‘ort’, ‘he’, ‘astern’], none of which correspond to semantically meaningful units like “north” or “east.” Similarly, the overall meanings of named entities like “Neil Young” and multi-word expressions like “break a leg” cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced “erasure” effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to “read out” the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
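
A rough sketch of the kind of layer-wise probing involved: it tracks how the last-token representation of a sentence changes from one layer to the next. GPT-2 stands in here only to keep the example lightweight (the paper studies Llama-2-7b and Llama-3-8B), and adjacent-layer cosine similarity is a simplified proxy for the paper's read-out method, not its actual procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

text = "Neil Young played in the northeastern United States."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).hidden_states  # tuple of (batch, seq, dim) per layer

# Track how the representation at the last token position evolves; a sharp
# early change is the kind of "erasure" signature the paper describes.
last_pos = inputs["input_ids"].shape[1] - 1
for layer in range(1, len(hidden)):
    prev = hidden[layer - 1][0, last_pos]
    curr = hidden[layer][0, last_pos]
    sim = torch.nn.functional.cosine_similarity(prev, curr, dim=0)
    print(f"layer {layer:2d}: cos-sim with previous layer = {sim.item():.3f}")
```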

[LG-4] Segment Anything without Supervision

链接: https://arxiv.org/abs/2406.20081
作者: XuDong Wang,Jingfeng Yang,Trevor Darrell
关键词: labor-intensive data labeling, requires labor-intensive data, data labeling, labor-intensive data, SAM
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code: this https URL

点击查看摘要

Abstract:The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to “discover” the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B’s ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM’s AR by over 6.7% and AP by 3.9% on SA-1B.

[LG-5] Cost-aware Bayesian optimization via the Pandoras Box Gittins index

链接: https://arxiv.org/abs/2406.20062
作者: Qian Xie,Raul Astudillo,Peter Frazier,Ziv Scully,Alexander Terenin
关键词: efficiently optimizing unknown, Bayesian optimization, Pandora Box problem, Bayesian, optimizing unknown functions
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian optimization is a technique for efficiently optimizing unknown functions in a black-box manner. To handle practical settings where gathering data requires use of finite resources, it is desirable to explicitly incorporate function evaluation costs into Bayesian optimization policies. To understand how to do so, we develop a previously-unexplored connection between cost-aware Bayesian optimization and the Pandora’s Box problem, a decision problem from economics. The Pandora’s Box problem admits a Bayesian-optimal solution based on an expression called the Gittins index, which can be reinterpreted as an acquisition function. We study the use of this acquisition function for cost-aware Bayesian optimization, and demonstrate empirically that it performs well, particularly in medium-high dimensions. We further show that this performance carries over to classical Bayesian optimization without explicit evaluation costs. Our work constitutes a first step towards integrating techniques from Gittins index theory into Bayesian optimization.
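
As a sketch of the Gittins-index idea under a Gaussian posterior: the index of a candidate is the threshold g at which the expected improvement over g exactly pays for the evaluation cost, found by root finding. The closed-form expected-improvement expression and the toy numbers below are assumptions for illustration and may differ from the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def expected_improvement(mu, sigma, g):
    """E[(f - g)^+] for f ~ N(mu, sigma^2)."""
    z = (mu - g) / sigma
    return (mu - g) * norm.cdf(z) + sigma * norm.pdf(z)

def gittins_index(mu, sigma, cost):
    """Gittins index of a 'Pandora's Box' with Gaussian posterior N(mu, sigma^2):
    the threshold g where expected improvement over g equals the opening cost."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma   # EI is decreasing in g: bracket and bisect
    return brentq(lambda g: expected_improvement(mu, sigma, g) - cost, lo, hi)

# Candidate points with GP posterior means/stds and per-point evaluation costs.
mus    = np.array([0.2, 0.5, 0.4])
sigmas = np.array([0.3, 0.1, 0.4])
costs  = np.array([0.05, 0.05, 0.20])

indices = [gittins_index(m, s, c) for m, s, c in zip(mus, sigmas, costs)]
best = int(np.argmax(indices))
print("Gittins indices:", np.round(indices, 3), "-> evaluate point", best)
```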

[LG-6] SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting

链接: https://arxiv.org/abs/2406.20055
作者: Sara Sabour,Lily Goli,George Kopanas,Mark Matthews,Dmitry Lagun,Leonidas Guibas,Alec Jacobson,David J. Fleet,Andrea Tagliasacchi
关键词: Gaussian Splatting, offering efficient training, highly controlled environments, require highly controlled, inter-view consistency assumption
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications. However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world captures problematic. We present SpotlessSplats, an approach that leverages pre-trained and general-purpose features coupled with robust optimization to effectively ignore transient distractors. Our method achieves state-of-the-art reconstruction quality both visually and quantitatively, on casual captures.

[LG-7] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

链接: https://arxiv.org/abs/2406.20053
作者: Danny Halawi,Alexander Wei,Eric Wallace,Tony T. Wang,Nika Haghtalab,Jacob Steinhardt
关键词: language models, finetuning, emerging interface, model, Black-box finetuning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

[LG-8] Evaluation of autonomous systems under data distribution shifts

链接: https://arxiv.org/abs/2406.20046
作者: Daniel Sikar,Artur Garcez
关键词: data distribution shift, data distribution, human operator, network predictive accuracy, autonomous system
类目: Machine Learning (cs.LG)
*备注: 13 pages, 10 figures, 4 tables

点击查看摘要

Abstract:We posit that data can only be safe to use up to a certain threshold of the data distribution shift, after which control must be relinquished by the autonomous system and operation halted or handed to a human operator. Using a computer vision toy example, we demonstrate that network predictive accuracy is impacted by data distribution shifts and propose distance metrics between training and testing data to define safe operation limits within said shifts. We conclude that beyond an empirically obtained threshold of the data distribution shift, it is unreasonable to expect network predictive accuracy not to degrade.
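
A toy sketch of the proposed workflow: compute a distance between the training and test feature distributions and hand over control once it exceeds a threshold. The particular metric (standardized mean difference) and the threshold value are hypothetical stand-ins, not the metrics proposed in the paper.

```python
import numpy as np

def feature_shift_distance(train_feats, test_feats):
    """A simple shift measure in feature space: Euclidean distance between
    per-feature means, scaled by the training standard deviations."""
    mu_tr, mu_te = train_feats.mean(axis=0), test_feats.mean(axis=0)
    sd_tr = train_feats.std(axis=0) + 1e-8
    return float(np.linalg.norm((mu_tr - mu_te) / sd_tr))

SAFE_SHIFT_THRESHOLD = 0.5   # hypothetical value, calibrated empirically per task

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(5000, 64))
test = rng.normal(0.3, 1.0, size=(1000, 64))   # shifted test distribution

d = feature_shift_distance(train, test)
action = ("continue autonomous operation" if d < SAFE_SHIFT_THRESHOLD
          else "hand control to a human operator")
print(f"shift distance = {d:.3f} -> {action}")
```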

[LG-9] Explore as a Storm Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

链接: https://arxiv.org/abs/2406.20037
作者: Michael Canesche,Gaurav Verma,Fernando Magno Quintao Pereira
关键词: Machine-learning models consist, Ansor, Machine-learning models, data indexed, linear combination
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 22 pages, 19 figures, original work

点击查看摘要

Abstract:Machine-learning models consist of kernels, which are algorithms applying operations on tensors – data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel’s optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function – typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor’s search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor’s exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM’s MetaSchedule in June 2024.
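
A generic sketch of the coordinate-descent exploitation step described above, sweeping one kernel parameter at a time around the best configuration found during exploration. The configuration space and timing function are toy stand-ins, not Ansor's or TVM's actual interfaces.

```python
def coordinate_descent(config, space, measure, rounds=2):
    """Exploit a promising kernel configuration by sweeping one parameter
    at a time while holding the others fixed (Droplet-Search-style).
    config  -- dict, best configuration found during exploration
    space   -- dict mapping each parameter to its candidate values
    measure -- callable returning execution time for a configuration
    """
    best, best_cost = dict(config), measure(config)
    for _ in range(rounds):
        improved = False
        for param, values in space.items():
            for v in values:                      # sweep a single coordinate
                trial = dict(best, **{param: v})
                cost = measure(trial)
                if cost < best_cost:
                    best, best_cost, improved = trial, cost, True
        if not improved:                          # local minimum reached
            break
    return best, best_cost

# Toy stand-in for a kernel timing function.
def fake_measure(cfg):
    return abs(cfg["tile_x"] - 32) + abs(cfg["tile_y"] - 8) + 0.1 * cfg["unroll"]

space = {"tile_x": [8, 16, 32, 64], "tile_y": [4, 8, 16], "unroll": [0, 1, 2]}
start = {"tile_x": 8, "tile_y": 4, "unroll": 2}
print(coordinate_descent(start, space, fake_measure))
```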

[LG-10] Pairwise Difference Learning for Classification

链接: https://arxiv.org/abs/2406.20031
作者: Mohamed Karim Belaid,Maximilian Rabus,Eyke Hüllermeier
关键词: Pairwise difference learning, Pairwise difference, recently been introduced, technique for regression, PDL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.
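
A minimal sketch of one plausible PDL-style classifier: a base learner is trained on concatenated pairs to predict whether two instances share a class, and at prediction time every training anchor votes with the predicted same-class probability. The exact pairing construction in the paper may differ; the RandomForest base learner and helper names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

class PairwiseDifferenceClassifier:
    """PDL-style classifier: learn on pairs, then average anchor votes."""

    def __init__(self, base=None):
        self.base = base or RandomForestClassifier(n_estimators=100)

    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(self.y_)
        n = len(self.y_)
        idx_i, idx_j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        pairs = np.hstack([self.X_[idx_i.ravel()], self.X_[idx_j.ravel()]])
        same = (self.y_[idx_i.ravel()] == self.y_[idx_j.ravel()]).astype(int)
        self.base.fit(pairs, same)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        probs = np.zeros((len(X), len(self.classes_)))
        for q, x in enumerate(X):
            pairs = np.hstack([np.tile(x, (len(self.X_), 1)), self.X_])
            p_same = self.base.predict_proba(pairs)[:, 1]
            for k, c in enumerate(self.classes_):
                probs[q, k] = p_same[self.y_ == c].mean()
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

# Tiny usage example on synthetic data.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
clf = PairwiseDifferenceClassifier().fit(X[:40], y[:40])
print("held-out accuracy:", (clf.predict(X[40:]) == y[40:]).mean())
```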

[LG-11] On the Trade-off between Flatness and Optimization in Distributed Learning

链接: https://arxiv.org/abs/2406.20006
作者: Ying Cao,Zhaoxian Wu,Kun Yuan,Ali H. Sayed
关键词: nonconvex environments, proposes a theoretical, theoretical framework, framework to evaluate, evaluate and compare
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work discovers two interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minimizers and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.

[LG-12] Wavelets Are All You Need for Autoregressive Image Generation

链接: https://arxiv.org/abs/2406.19997
作者: Wael Mattar,Idan Levy,Nir Sharon,Shai Dekel
关键词: autoregressive image generation, main ingredients, approach to autoregressive, autoregressive image, wavelet image coding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures

点击查看摘要

Abstract:In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows the visual details of an image to be tokenized from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
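
A rough illustration of the coarse-to-fine ordering idea using PyWavelets: subbands are emitted from the coarsest approximation downward and, within each subband, coefficients are ordered by magnitude. This is only a sketch of the ordering principle, not the paper's tokenizer.

```python
import numpy as np
import pywt

def wavelet_coefficient_stream(image, wavelet="haar", level=3):
    """Order an image's information from coarse to fine: multi-level 2D
    wavelet decomposition, with coefficients emitted subband by subband
    (approximation first) and, within a subband, by decreasing magnitude."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    subbands = [coeffs[0]] + [band for triple in coeffs[1:] for band in triple]
    stream = []
    for band in subbands:
        flat = band.ravel()
        order = np.argsort(-np.abs(flat))        # most significant first
        stream.extend(flat[order].tolist())
    return stream

image = np.random.rand(64, 64)
tokens = wavelet_coefficient_stream(image)
print(len(tokens), "coefficients, first five:", np.round(tokens[:5], 3))
```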

[LG-13] Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

链接: https://arxiv.org/abs/2406.19995
作者: Habib Hajimolahoseini,Mohammad Hassanpour,Foozhan Ataiefard,Boxing Chen,Yang Liu
关键词: Low Rank Decomposition, Progressive Low Rank, Progressive Low, Rank Decomposition, Low Rank
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
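
To illustrate how progressively lower ranks can yield a spectrum of smaller models from one parent, here is a sketch that factorizes a single weight matrix with a truncated SVD at several ranks. PLRD itself operates on full pretrained LLMs with additional training, so this shows only the core decomposition step under simplifying assumptions.

```python
import torch

def low_rank_factorize(weight, rank):
    """Replace a dense weight matrix W (out x in) with two factors A @ B of
    the given rank, using a truncated SVD. Progressively lowering `rank`
    yields a family of smaller models derived from the same parent."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out, rank)
    B = Vh[:rank, :]                    # (rank, in)
    return A, B

# Toy example: compress a single linear layer to a spectrum of ranks.
W = torch.randn(1024, 4096)
for rank in (512, 256, 128, 64):
    A, B = low_rank_factorize(W, rank)
    err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
    params = A.numel() + B.numel()
    print(f"rank {rank:4d}: params {params:9d} (vs {W.numel()}), rel. error {err:.3f}")
```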

[LG-14] Machine Learning Predictors for Min-Entropy Estimation

链接: https://arxiv.org/abs/2406.19983
作者: Javier Blanco-Romero,Vicente Lorenzo,Florina Almenares Mendoza,Daniel Díaz-Sánchez
关键词: Random Number Generators, Number Generators, Random Number, essential for cybersecurity, study investigates
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:This study investigates the application of machine learning predictors for min-entropy estimation in Random Number Generators (RNGs), a key component in cryptographic applications where accurate entropy assessment is essential for cybersecurity. Our research indicates that these predictors, and indeed any predictor that leverages sequence correlations, primarily estimate average min-entropy, a metric not extensively studied in this context. We explore the relationship between average min-entropy and the traditional min-entropy, focusing on their dependence on the number of target bits being predicted. Utilizing data from Generalized Binary Autoregressive Models, a subset of Markov processes, we demonstrate that machine learning models (including a hybrid of convolutional and recurrent Long Short-Term Memory layers and the transformer-based GPT-2 model) outperform traditional NIST SP 800-90B predictors in certain scenarios. Our findings underscore the importance of considering the number of target bits in min-entropy assessment for RNGs and highlight the potential of machine learning approaches in enhancing entropy estimation techniques for improved cryptographic security.
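
The distinction between min-entropy and average min-entropy that the abstract emphasizes can be written down in a few lines (a toy illustration only, not the NIST SP 800-90B estimators):

```python
import numpy as np

def min_entropy(p_max: float) -> float:
    """Worst-case min-entropy: -log2 of the single most likely outcome."""
    return -np.log2(p_max)

def average_min_entropy(cond_p_max: np.ndarray, context_probs: np.ndarray) -> float:
    """Average min-entropy: -log2 of the expected best-guess probability over contexts,
    which is what a sequence predictor effectively estimates."""
    return -np.log2(np.sum(context_probs * cond_p_max))

context_probs = np.array([0.5, 0.5])   # two equally likely contexts (e.g. the previous bit)
cond_p_max = np.array([0.9, 0.6])      # predictor's best-guess probability per context
print(min_entropy(0.75))               # marginal worst-case estimate
print(average_min_entropy(cond_p_max, context_probs))
```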

[LG-15] Comparative Analysis of LSTM Neural Networks and Traditional Machine Learning Models for Predicting Diabetes Patient Readmission

链接: https://arxiv.org/abs/2406.19980
作者: Abolfazl Zarghani
关键词: chronic metabolic disorder, problems worldwide due, health problems worldwide, pricey to manage, chronic metabolic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diabetes mellitus is a chronic metabolic disorder that has emerged as one of the major health problems worldwide due to its high prevalence and serious complications, which are costly to manage. Effective management requires good glycemic control and regular follow-up in the clinic; however, non-adherence to scheduled follow-ups is very common. This study uses the Diabetes 130-US Hospitals dataset for analysis and prediction of readmission patients by various traditional machine learning models, such as XGBoost, LightGBM, CatBoost, Decision Tree, and Random Forest, and also uses an in-house LSTM neural network for comparison. The quality of the data was assured by preprocessing it, and the performance evaluation for all these models was based on accuracy, precision, recall, and F1-score. LightGBM turned out to be the best traditional model, while XGBoost was the runner-up. The LSTM model suffered from overfitting despite high training accuracy. A major strength of LSTM is capturing temporal dependencies among the patient data. Further, SHAP values were used, which improved model interpretability, whereby key factors, among them the number of lab procedures and discharge disposition, were identified as critical in the prediction of readmissions. This study demonstrates that model selection, validation, and interpretability are key steps in predictive healthcare modeling. This will help health providers design interventions for improved follow-up adherence and better management of diabetes.

[LG-16] ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting

链接: https://arxiv.org/abs/2406.19976
作者: Rui Pan,Jipeng Zhang,Xingyuan Pan,Renjie Pi,Xiaoyu Wang,Tong Zhang
关键词: machine learning settings, require second-order information, practice require second-order, Bilevel optimization, learning settings
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large-sized LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for different-scaled models, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.

[LG-17] STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

链接: https://arxiv.org/abs/2406.19973
作者: Guohao Sun,Can Qin,Huazhu Fu,Linwei Wang,Zhiqiang Tao
关键词: shown significant potential, assisting medical diagnosis, Large Vision-Language Models, extensive biomedical datasets, leveraging extensive biomedical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.

[LG-18] Text2Robot: Evolutionary Robot Design from Text Descriptions

链接: https://arxiv.org/abs/2406.19963
作者: Ryan P. Ringel,Zachary S. Charlick,Jiaxun Liu,Boxi Xia,Boyuan Chen
关键词: costly and labor-intensive, traditionally been costly, Robot design, Abstract, design
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Our project website is at: this https URL

点击查看摘要

Abstract:Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.

[LG-19] Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition

链接: https://arxiv.org/abs/2406.19931
作者: Xinghao Wu,Xuefeng Liu,Jianwei Niu,Haolin Wang,Shaojie Tang,Guogang Zhu,Hao Su
关键词: Personalized Federated Learning, Federated Learning, Personalized Federated, client-specific knowledge, decouple general knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods by up to 4.9%.
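
The additive decomposition can be pictured as a layer whose weight is the sum of a shared part and a low-rank personalized part. The PyTorch sketch below is an assumed interface for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 4):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)                    # aggregated across clients
        self.pers_A = nn.Parameter(torch.zeros(out_dim, rank))      # kept on the client
        self.pers_B = nn.Parameter(0.01 * torch.randn(rank, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_personal = self.pers_A @ self.pers_B                      # low-rank client-specific weight
        return self.shared(x) + x @ w_personal.t()

layer = DecomposedLinear(16, 8)
print(layer(torch.randn(2, 16)).shape)   # torch.Size([2, 8])
```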

[LG-20] 'Just One More Sensor is Enough' – Iterative Water Leak Localization with Physical Simulation and a Small Number of Pressure Sensors

链接: https://arxiv.org/abs/2406.19900
作者: Michał Cholewa,Michał Romaszewski,Przemysław Głomb,Katarzyna Kołodziej,Michał Gorawski,Jakub Koral,Wojciech Koral,Andrzej Madej,Kryspin Musioł
关键词: complex water delivery, water delivery grid, EPANET software, delivery grid, water pressure sensors
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article, we propose an approach to leak localisation in a complex water delivery grid with the use of data from physical simulation (e.g. EPANET software). This task is usually achieved by a network of multiple water pressure sensors and analysis of the so-called sensitivity matrix of pressure differences between the network’s simulated data and actual data of the network affected by the leak. However, most algorithms using this approach require a significant number of pressure sensors – a condition that is not easy to fulfil in the case of many less equipped networks. Therefore, we answer the question of whether leak localisation is possible by utilising very few sensors but having the ability to relocate one of them. Our algorithm is based on physical simulations (EPANET software) and an iterative scheme for mobile sensor relocation. The experiments show that the proposed system can equalise the low number of sensors with adjustments made for their positioning, giving a very good approximation of the leak’s position both in simulated cases and in a real-life example taken from the BattLeDIM competition L-Town data.
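
The sensitivity-matrix idea can be sketched as follows; this toy example only shows the matching step and omits the EPANET simulation and the iterative relocation of the mobile sensor:

```python
import numpy as np

def localize_leak(sensitivity: np.ndarray, residual: np.ndarray) -> int:
    """sensitivity[i, j]: simulated pressure change at sensor i for a leak at node j.
    residual[i]: measured pressure minus leak-free simulated pressure at sensor i.
    Returns the candidate node whose simulated signature best matches the residual."""
    norms = np.linalg.norm(sensitivity, axis=0) * np.linalg.norm(residual) + 1e-12
    scores = (sensitivity.T @ residual) / norms       # cosine similarity per candidate node
    return int(np.argmax(scores))

S = np.random.randn(4, 50)                 # 4 pressure sensors, 50 candidate leak nodes
true_node = 17
r = S[:, true_node] + 0.05 * np.random.randn(4)
print(localize_leak(S, r))                 # usually recovers node 17 when noise is small
```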

[LG-21] FI-CBL: A Probabilistic Method for Concept-Based Learning with Expert Rules

链接: https://arxiv.org/abs/2406.19897
作者: Lev V. Utkin,Andrei V. Konstantinov,Stanislav R. Kirpichenko
关键词: solving concept-based learning, Frequentist Inference CBL, concept-based learning, solving concept-based, frequentist inference
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A method for solving concept-based learning (CBL) problem is proposed. The main idea behind the method is to divide each concept-annotated image into patches, to transform the patches into embeddings by using an autoencoder, and to cluster the embeddings assuming that each cluster will mainly contain embeddings of patches with certain concepts. To find concepts of a new image, the method implements the frequentist inference by computing prior and posterior probabilities of concepts based on rates of patches from images with certain values of the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). FI-CBL allows us to incorporate the expert rules in the form of logic functions into the inference procedure. An idea behind the incorporation is to update prior and conditional probabilities of concepts to satisfy the rules. The method is transparent because it has an explicit sequence of probabilistic calculations and a clear frequency interpretation. Numerical experiments show that FI-CBL outperforms the concept bottleneck model in cases when the number of training data is small. The code of proposed algorithms is publicly available.

[LG-22] Attention Meets UAVs: A Comprehensive Evaluation of DDoS Detection in Low-Cost UAVs

链接: https://arxiv.org/abs/2406.19881
作者: Ashish Sharma,SVSLN Surya Suhas Vaddhiparthy,Sai Usha Goparaju,Deepak Gangadharan,Harikumar Kandath
关键词: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, Denial of Service, Distributed Denial
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the critical issue of enhancing cybersecurity measures for low-cost, Wi-Fi-based Unmanned Aerial Vehicles (UAVs) against Distributed Denial of Service (DDoS) attacks. In the current work, we have explored three variants of DDoS attacks, namely Transmission Control Protocol (TCP), Internet Control Message Protocol (ICMP), and TCP + ICMP flooding attacks, and developed a detection mechanism that runs on the companion computer of the UAV system. As a part of the detection mechanism, we have evaluated various machine learning, and deep learning algorithms, such as XGBoost, Isolation Forest, Long Short-Term Memory (LSTM), Bidirectional-LSTM (Bi-LSTM), LSTM with attention, Bi-LSTM with attention, and Time Series Transformer (TST) in terms of various classification metrics. Our evaluation reveals that algorithms with attention mechanisms outperform their counterparts in general, and TST stands out as the most efficient model with a run time of 0.1 seconds. TST has demonstrated an F1 score of 0.999, 0.997, and 0.943 for TCP, ICMP, and TCP + ICMP flooding attacks respectively. In this work, we present the necessary steps required to build an on-board DDoS detection mechanism. Further, we also present the ablation study to identify the best TST hyperparameters for DDoS detection, and we have also underscored the advantage of adapting learnable positional embeddings in TST for DDoS detection with an improvement in F1 score from 0.94 to 0.99.

[LG-23] Koopman based trajectory model and computation offloading for high mobility paradigm in ISAC enabled IoT system

链接: https://arxiv.org/abs/2406.19871
作者: Minh-Tuan Tran
关键词: User experience, limited battery capacity, mobile technical evolution, technology advancements, technical evolution
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:User experience on mobile devices is constrained by limited battery capacity and processing power, while 6G technology advancements are rapidly driving mobile technical evolution. Mobile edge computing (MEC) offers a solution, offloading computationally intensive tasks to edge cloud servers, reducing battery drain compared to local processing. The upcoming integrated sensing and communication in mobile communication may improve trajectory prediction and reduce processing delays. This study proposes a greedy resource allocation optimization strategy for multi-user networks to minimize aggregate energy usage. Numerical results show a potential improvement of 33% for every 1000 iterations. Addressing prediction model division and velocity accuracy issues is crucial for better results. A plan for further improvement and achieving objectives is outlined for the upcoming work phase.

[LG-24] Operator World Models for Reinforcement Learning

链接: https://arxiv.org/abs/2406.19861
作者: Pietro Novelli,Marco Pratticò,Massimiliano Pontil,Carlo Ciliberto
关键词: Policy Mirror Descent, theoretically sound methodology, Policy Mirror, Mirror Descent, sequential decision-making
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to Reinforcement Learning (RL) due to the inaccessibility of explicit action-value functions. We address this challenge by introducing a novel approach based on learning a world model of the environment using conditional mean embeddings. We then leverage the operatorial formulation of RL to express the action-value function in terms of this quantity in closed form via matrix operations. Combining these estimators with PMD leads to POWR, a new RL algorithm for which we prove convergence rates to the global optimum. Preliminary experiments in finite and infinite state settings support the effectiveness of our method.

[LG-25] MuGSI: Distilling GNNs with Multi-Granularity Structural Information for Graph Classification

链接: https://arxiv.org/abs/2406.19832
作者: Tianjun Yao,Jiaqi Sun,Defu Cao,Kun Zhang,Guangyi Chen
关键词: GNN superior performance, fast inference speed, Recent works, combine both GNN, GNN superior
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures. Accepted by TheWebConf2024

点击查看摘要

Abstract:Recent works have introduced GNN-to-MLP knowledge distillation (KD) frameworks to combine both GNN’s superior performance and MLP’s fast inference speed. However, existing KD frameworks are primarily designed for node classification within single graphs, leaving their applicability to graph classification largely unexplored. Two main challenges arise when extending KD for node classification to graph classification: (1) The inherent sparsity of learning signals due to soft labels being generated at the graph level; (2) The limited expressiveness of student MLPs, especially in datasets with limited input feature spaces. To overcome these challenges, we introduce MuGSI, a novel KD framework that employs Multi-granularity Structural Information for graph classification. Specifically, we propose multi-granularity distillation loss in MuGSI to tackle the first challenge. This loss function is composed of three distinct components: graph-level distillation, subgraph-level distillation, and node-level distillation. Each component targets a specific granularity of the graph structure, ensuring a comprehensive transfer of structural knowledge from the teacher model to the student model. To tackle the second challenge, MuGSI proposes to incorporate a node feature augmentation component, thereby enhancing the expressiveness of the student MLPs and making them more capable learners. We perform extensive experiments across a variety of datasets and different teacher/student model architectures. The experiment results demonstrate the effectiveness, efficiency, and robustness of MuGSI. Codes are publicly available at this https URL.

[LG-26] Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory

链接: https://arxiv.org/abs/2406.19827
作者: Wenliang Zhong,Haoyu Tang,Qinghai Zheng,Mingzhu Xu,Yupeng Hu,Liqiang Nie
关键词: large language models, managing large datasets, Matching Training Trajectories, large language, managing large
类目: Machine Learning (cs.LG)
*备注: 11 pages

点击查看摘要

Abstract:The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT leverages insights from the linearized dynamics of Neural Tangent Kernel methods to create a convex combination of expert trajectories, guiding the student network to converge rapidly and stably. This trajectory is not only easier to store, but also enables a continuous sampling strategy during distillation, ensuring thorough learning and fitting of the entire expert trajectory. Comprehensive experiments across three public datasets validate the superiority of MCT over traditional MTT methods.
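
The "convexified trajectory" can be pictured as a convex combination of stored expert checkpoints that can be sampled continuously; the snippet below is only a rough illustration of this idea, not the authors' MCT code:

```python
import numpy as np

def convex_combination(checkpoints, weights):
    """checkpoints: flattened parameter vectors along the expert trajectory.
    weights: non-negative and summing to one."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-8
    return sum(w * c for w, c in zip(weights, checkpoints))

ckpts = [np.random.randn(1000) for _ in range(5)]   # a few expert checkpoints
for t in np.linspace(0.0, 1.0, 4):
    # one simple convex scheme: interpolate continuously between the first
    # and last checkpoint instead of storing every intermediate one
    target = convex_combination([ckpts[0], ckpts[-1]], [1.0 - t, t])
    print(f"t={t:.2f}", target[:2])
```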

[LG-27] Reinforcement Learning for Efficient Design and Control Co-optimisation of Energy Systems

链接: https://arxiv.org/abs/2406.19825
作者: Marine Cauz,Adrien Bolland,Nicolas Wyrsch,Christophe Ballif
关键词: ongoing energy transition, energy transition drives, decentralised renewable energy, renewable energy sources, heterogeneous and weather-dependent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ongoing energy transition drives the development of decentralised renewable energy sources, which are heterogeneous and weather-dependent, complicating their integration into energy systems. This study tackles this issue by introducing a novel reinforcement learning (RL) framework tailored for the co-optimisation of design and control in energy systems. Traditionally, the integration of renewable sources in the energy sector has relied on complex mathematical modelling and sequential processes. By leveraging RL’s model-free capabilities, the framework eliminates the need for explicit system modelling. By optimising both control and design policies jointly, the framework enhances the integration of renewable sources and improves system efficiency. This contribution paves the way for advanced RL applications in energy management, leading to more efficient and effective use of renewable energy sources.

[LG-28] Deceptive Diffusion: Generating Synthetic Adversarial Examples

链接: https://arxiv.org/abs/2406.19807
作者: Lucas Beerens,Catherine F. Higham,Desmond J. Higham
关键词: produce adversarial images, deceptive diffusion, deceptive diffusion model, introduce the concept, diffusion model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce the concept of deceptive diffusion – training a generative AI model to produce adversarial images. Whereas a traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassification, the deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images. Deceptive diffusion offers the possibility of strengthening defence algorithms by providing adversarial training data at scale, including types of misclassification that are otherwise difficult to find. In our experiments, we also investigate the effect of training on a partially attacked data set. This highlights a new type of vulnerability for generative diffusion models: if an attacker is able to stealthily poison a portion of the training data, then the resulting diffusion model will generate a similar proportion of misleading outputs.

[LG-29] MulTi-Wise Sampling: Trading Uniform T-Wise Feature Interaction Coverage for Smaller Samples

链接: https://arxiv.org/abs/2406.19801
作者: Tobias Pett,Sebastian Krieter,Thomas Thüm,Ina Schaefer
关键词: t-wise feature interactions, t-wise feature, Ensuring the functional, requires testing representative, Feature Interaction Coverage
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring the functional safety of highly configurable systems often requires testing representative subsets of all possible configurations to reduce testing effort and save resources. The ratio of covered t-wise feature interactions (i.e., T-Wise Feature Interaction Coverage) is a common criterion for determining whether a subset of configurations is representative and capable of finding faults. Existing t-wise sampling algorithms uniformly cover t-wise feature interactions for all features, resulting in lengthy execution times and large sample sizes, particularly when large t-wise feature interactions are considered (i.e., high values of t). In this paper, we introduce a novel approach to t-wise feature interaction sampling, questioning the necessity of uniform coverage across all t-wise feature interactions, called MulTi-Wise. Our approach prioritizes between subsets of critical and non-critical features, considering higher t-values for subsets of critical features when generating a t-wise feature interaction sample. We evaluate our approach using subject systems from real-world applications, including BusyBox, Soletta, Fiasco, and uClibc. Our results show that sacrificing uniform t-wise feature interaction coverage between all features reduces the time needed to generate a sample and the resulting sample size. Hence, MulTi-Wise Sampling offers an alternative to existing approaches if knowledge about feature criticality is available.

[LG-30] Modeling the Real World with High-Density Visual Particle Dynamics

链接: https://arxiv.org/abs/2406.19800
作者: William F. Whitney,Jacob Varley,Deepali Jain,Krzysztof Choromanski,Sumeet Singh,Vikas Sindhwani
关键词: present High-Density Visual, Point Cloud Transformers, High-Density Visual Particle, latent point clouds, massive latent point
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers leveraging intertwined linear-attention Performer layers and graph-based neighbour attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics is twice as fast with the same prediction quality, and can achieve higher quality using 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality with robotic box pushing and can grasping tasks. See videos and particle dynamics rendered by HD-VPD at this https URL.

[LG-31] Improving Performance Prediction of Electrolyte Formulations with Transformer-based Molecular Representation Model

链接: https://arxiv.org/abs/2406.19792
作者: Indra Priyadarsini,Vidushi Sharma,Seiji Takeda,Akihiro Kishimoto,Lisa Hamada,Hajime Shinohara
关键词: energy storage technologies, advancing energy storage, Development of efficient, storage technologies, efficient and high-performing
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: Accepted in ML4LMS Workshop at ICML 2024

点击查看摘要

Abstract:Development of efficient and high-performing electrolytes is crucial for advancing energy storage technologies, particularly in batteries. Predicting the performance of battery electrolytes relies on complex interactions between the individual constituents. Consequently, a strategy that adeptly captures these relationships and forms a robust representation of the formulation is essential for integrating with machine learning models to predict properties accurately. In this paper, we introduce a novel approach leveraging a transformer-based molecular representation model to effectively and efficiently capture the representation of electrolyte formulations. The performance of the proposed approach is evaluated on two battery property prediction tasks and the results show superior performance compared to the state-of-the-art methods.

[LG-32] Self-Supervised Spatial-Temporal Normality Learning for Time Series Anomaly Detection

链接: https://arxiv.org/abs/2406.19770
作者: Yutong Chen,Hongzuo Xu,Guansong Pang,Hezhe Qiao,Yuan Zhou,Mingsheng Shang
关键词: Series Anomaly Detection, Time Series Anomaly, Anomaly Detection, finds widespread applications, time series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 4 figures, accepted in ECML PKDD2024

点击查看摘要

Abstract:Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at this https URL.

[LG-33] Contextualized Hybrid Ensemble Q-learning: Learning Fast with Control Priors

链接: https://arxiv.org/abs/2406.19768
作者: Emma Cramer,Bernd Frauenknecht,Ramil Sabirov,Sebastian Trimpe
关键词: Combining Reinforcement Learning, Combining Reinforcement, Reinforcement Learning, solve complex nonlinear, control prior ensures
类目: Machine Learning (cs.LG)
*备注: 20 pages, 12 figures

点击查看摘要

Abstract:Combining Reinforcement Learning (RL) with a prior controller can yield the best out of two worlds: RL can solve complex nonlinear problems, while the control prior ensures safer exploration and speeds up training. Prior work largely blends both components with a fixed weight, neglecting that the RL agent’s performance varies with the training progress and across regions in the state space. Therefore, we advocate for an adaptive strategy that dynamically adjusts the weighting based on the RL agent’s current capabilities. We propose a new adaptive hybrid RL algorithm, Contextualized Hybrid Ensemble Q-learning (CHEQ). CHEQ combines three key ingredients: (i) a time-invariant formulation of the adaptive hybrid RL problem treating the adaptive weight as a context variable, (ii) a weight adaption mechanism based on the parametric uncertainty of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient RL. Evaluating CHEQ on a car racing task reveals substantially stronger data efficiency, exploration safety, and transferability to unknown scenarios than state-of-the-art adaptive hybrid RL methods.
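
The weight-adaptation ingredient can be sketched as mapping critic-ensemble disagreement to a mixing weight between the RL action and the prior controller. The thresholds and the linear mapping below are illustrative assumptions, not CHEQ's exact formulation:

```python
import numpy as np

def adaptive_weight(q_ensemble: np.ndarray, u_min: float = 0.03, u_max: float = 0.15,
                    w_min: float = 0.2, w_max: float = 1.0) -> float:
    """Higher disagreement among the critic ensemble -> lower weight on the RL agent."""
    uncertainty = float(np.std(q_ensemble))
    alpha = (u_max - uncertainty) / (u_max - u_min)         # 1 at low uncertainty, 0 at high
    return float(np.clip(w_min + alpha * (w_max - w_min), w_min, w_max))

def blended_action(a_rl: np.ndarray, a_prior: np.ndarray, w: float) -> np.ndarray:
    return w * a_rl + (1.0 - w) * a_prior

q_ens = np.array([1.02, 0.98, 1.31, 0.76, 1.10])            # ensemble Q-estimates for (s, a)
w = adaptive_weight(q_ens)
print(w, blended_action(np.array([0.4]), np.array([-0.1]), w))
```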

[LG-34] Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration

链接: https://arxiv.org/abs/2406.19765
作者: Ali Kazemi Arani,Triet Huynh Minh Le,Mansooreh Zahedi,M. Ali Babar
关键词: Machine learning, deep learning, automating Continuous Integration, learning-based methods, analyze raw data
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: This paper has been accepted to be published in IEEE Access

点击查看摘要

Abstract:Context: Machine learning (ML) and deep learning (DL) analyze raw data to extract valuable insights in specific phases. The rise of continuous practices in software projects emphasizes automating Continuous Integration (CI) with these learning-based methods, while the growing adoption of such approaches underscores the need for systematizing knowledge. Objective: Our objective is to comprehensively review and analyze existing literature concerning learning-based methods within the CI domain. We endeavour to identify and analyse various techniques documented in the literature, emphasizing the fundamental attributes of training phases within learning-based solutions in the context of CI. Method: We conducted a Systematic Literature Review (SLR) involving 52 primary studies. Through statistical and thematic analyses, we explored the correlations between CI tasks and the training phases of learning-based methodologies across the selected studies, encompassing a spectrum from data engineering techniques to evaluation metrics. Results: This paper presents an analysis of the automation of CI tasks utilizing learning-based methods. We identify and analyze nine types of data sources, four steps in data preparation, four feature types, nine subsets of data features, five approaches for hyperparameter selection and tuning, and fifteen evaluation metrics. Furthermore, we discuss the latest techniques employed, existing gaps in CI task automation, and the characteristics of the utilized learning-based techniques. Conclusion: This study provides a comprehensive overview of learning-based methods in CI, offering valuable insights for researchers and practitioners developing CI task automation. It also highlights the need for further research to advance these methods in CI.

[LG-35] Backdoor Attack in Prompt-Based Continual Learning

链接: https://arxiv.org/abs/2406.19753
作者: Trang Nguyen,Anh Tran,Nhat Ho
关键词: Prompt-based approaches offer, scenarios involving multiple, private user data, data privacy issues, involving multiple data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) Transferability: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) Resiliency: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) Authenticity: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, achieving up to 100% attack success rate, with further ablation studies confirming our contributions’ effectiveness.

[LG-36] MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

链接: https://arxiv.org/abs/2406.19736
作者: Jihao Liu,Xin Huang,Jinliang Zheng,Boxiao Liu,Jia Wang,Osamu Yoshie,Yu Liu,Hongsheng Li
关键词: high-quality visual instruction, visual instruction data, visual instruction, instruction-following capabilities, instruction data designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Dataset and models are available at this https URL

点击查看摘要

Abstract:This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at this https URL.

[LG-37] EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

链接: https://arxiv.org/abs/2406.19726
作者: Nicola Garau,Giulia Martinelli,Niccolò Bisagno,Denis Tomè,Carsten Stoll
关键词: Monocular Human Pose, Monocular Human, Human Pose Estimation, human joints, Human Pose
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
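
The full-perspective relationship between 3D joints and 2D detections is the standard pinhole projection; here is a minimal sketch (parameter names are illustrative, not the EPOCH code):

```python
import numpy as np

def project_full_perspective(joints_3d: np.ndarray, K: np.ndarray,
                             R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """joints_3d: (J, 3) world coordinates; K: (3, 3) intrinsics; R, t: camera extrinsics."""
    cam = joints_3d @ R.T + t            # world -> camera coordinates
    uvw = cam @ K.T                      # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide -> (J, 2) pixel coordinates

K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
joints = 0.3 * np.random.randn(17, 3) + np.array([0.0, 0.0, 3.0])   # ~3 m in front of the camera
print(project_full_perspective(joints, K, np.eye(3), np.zeros(3)).shape)   # (17, 2)
```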

[LG-38] State Matching and Multiple References in Adaptive Active Automata Learning

链接: https://arxiv.org/abs/2406.19714
作者: Loes Kruger,Sebastian Junges,Jurriaan Rot
关键词: Active automata learning, Active automata, infer state machines, method to infer, machines by interacting
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: Extended paper for FM 2024

点击查看摘要

Abstract:Active automata learning (AAL) is a method to infer state machines by interacting with black-box systems. Adaptive AAL aims to reduce the sample complexity of AAL by incorporating domain specific knowledge in the form of (similar) reference models. Such reference models appear naturally when learning multiple versions or variants of a software system. In this paper, we present state matching, which allows flexible use of the structure of these reference models by the learner. State matching is the main ingredient of adaptive L#, a novel framework for adaptive learning, built on top of L#. Our empirical evaluation shows that adaptive L# improves the state of the art by up to two orders of magnitude.

[LG-39] CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems

链接: https://arxiv.org/abs/2406.19711
作者: Ziming Zhao,Tiehua Zhang,Zhishu Shen,Hai Dong,Xingjun Ma,Xianhui Liu,Yun Yang
关键词: enhanced system availability, distributed microservice architectures, recent years, availability and robustness, widespread adoption
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare with the state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.

[LG-40] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

链接: https://arxiv.org/abs/2406.19707
作者: Wonbeom Lee,Jungi Lee,Junghwan Seo,Jaewoong Sim
关键词: Transformer-based large language, language processing tasks, natural language processing, large language models, Transformer-based large
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: OSDI 2024

点击查看摘要

Abstract:Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.
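
The speculation step described above can be sketched in a few lines of numpy: approximate the next layer's attention scores from a partial query projection and a matching slice of the cached keys, then prefetch only the top-scoring entries. This is illustrative only, not InfiniGen's implementation:

```python
import numpy as np

def speculate_kv_entries(hidden: np.ndarray, Wq_partial: np.ndarray,
                         key_cache_partial: np.ndarray, top_k: int = 8) -> np.ndarray:
    """hidden: (d,) current-layer input for the new token.
    Wq_partial: (d, d_part) slice of the next layer's query projection.
    key_cache_partial: (seq_len, d_part) matching slice of that layer's cached keys."""
    q_approx = hidden @ Wq_partial               # cheap approximate query
    scores = key_cache_partial @ q_approx        # approximate attention logits per cached token
    return np.argsort(scores)[-top_k:]           # indices of KV entries worth prefetching

d, d_part, seq_len = 64, 16, 1024
idx = speculate_kv_entries(np.random.randn(d),
                           np.random.randn(d, d_part),
                           np.random.randn(seq_len, d_part))
print(sorted(idx.tolist()))
```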

[LG-41] Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

链接: https://arxiv.org/abs/2406.19674
作者: Krishna C. Puvvada,Piotr Żelasko,He Huang,Oleksii Hrinchuk,Nithin Rao Koluguri,Kunal Dhawan,Somshubra Majumdar,Elena Rastorgueva,Zhehuai Chen,Vitaly Lavrukhin,Jagadeesh Balam,Boris Ginsburg
关键词: Recent advances, hours of Internet, Internet speech data, Internet speech, rely on hundreds
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted at Interspeech-2024

点击查看摘要

Abstract:Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary - a multilingual ASR and speech translation model - outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.

[LG-42] Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning

链接: https://arxiv.org/abs/2406.19670
作者: Eduardo de Conto,Blaise Genest,Arvind Easwaran
关键词: leverages artificial intelligence, creating computationally efficient, physical systems increasingly, systems increasingly leverages, increasingly leverages artificial
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, to be published in AIware’24

点击查看摘要

Abstract:The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modelization with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad-hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.

[LG-43] Finite basis Kolmogorov-Arnold networks: domain decomposition for data-driven and physics-informed problems

链接: https://arxiv.org/abs/2406.19662
作者: Amanda A. Howard,Bruno Jacob,Sarah H. Murphy,Alexander Heinlein,Panos Stinis
关键词: scientific machine learning, attracted attention recently, multilayer perceptrons, machine learning, attracted attention
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold networks (KANs) have attracted attention recently as an alternative to multilayer perceptrons (MLPs) for scientific machine learning. However, KANs can be expensive to train, even for relatively small networks. Inspired by finite basis physics-informed neural networks (FBPINNs), in this work, we develop a domain decomposition method for KANs that allows for several small KANs to be trained in parallel to give accurate solutions for multiscale problems. We show that finite basis KANs (FBKANs) can provide accurate results with noisy data and for physics-informed training.

[LG-44] LLMEasyQuant – An Easy to Use Toolkit for LLM Quantization

链接: https://arxiv.org/abs/2406.19657
作者: Dong Liu,Meng Jiang,Kaiser Pister
关键词: appeared for LLM, quantization methods appeared, deployed locally, methods appeared, LLM quantization
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, many quantization methods have appeared for LLM quantization, yet few are user-friendly or easy to deploy locally. Packages like TensorRT and Quanto have many underlying structures and self-invoking internal functions, which are not conducive to developers’ personalized development and learning for deployment. Therefore, we develop LLMEasyQuant, a package aiming for easy quantization deployment that is user-friendly and suitable for beginners’ learning.
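
Since the abstract gives no interface details, here is a generic sketch of the kind of operation such a toolkit wraps, symmetric per-tensor int8 weight quantization; it is not LLMEasyQuant's actual API:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization to int8 with a single float scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize_int8(q, s))))   # roughly bounded by scale / 2
```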

[LG-45] ACES: Automatic Cohort Extraction System for Event-Stream Datasets

链接: https://arxiv.org/abs/2406.19653
作者: Justin Xu,Jack Gallifant,Alistair E. W. Johnson,Matthew B. A. McDermott
关键词: machine learning, Automatic Cohort Extraction, Cohort Extraction System, challenge in machine, ACES
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: For ACES Online Documentation, see this https URL

点击查看摘要

Abstract:Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task/cohort definitions are often private, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a significant part of this problem by introducing the Automatic Cohort Extraction System for Event-Stream Datasets (ACES). This tool is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to any dataset for which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality. ACES is available at this https URL.

[LG-46] IDT: Dual-Task Adversarial Attacks for Privacy Protection

链接: https://arxiv.org/abs/2406.19642
作者: Pedro Faustini,Shakila Mahjabin Tonni,Annabelle McIver,Qiongkai Xu,Mark Dras
关键词: Natural language processing, including membership inference, leak private information, Natural language, membership inference
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 28 pages, 1 figure

点击查看摘要

Abstract:Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or – for instance, in cases where users might not trust a model, the sort of scenario of interest here – changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g. the gender of the author, or their location by the writing style) whilst keeping the text useful for its original intention (e.g. the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create extensively different texts from the original ones or face problems such as mode collapse. This paper explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t one task (privacy) whilst keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate different datasets for NLP suitable for different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t privacy task.

[LG-47] Model Predictive Simulation Using Structured Graphical Models and Transformers

链接: https://arxiv.org/abs/2406.19635
作者: Xinghua Lou,Meet Dave,Shrinu Kushagra,Miguel Lazaro-Gredilla,Kevin Murphy
关键词: Waymo SimAgents challenge, probabilistic graphical models, Waymo SimAgents, multiple interacting agents, SimAgents challenge
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Special Mention at the Waymo Sim Agents Challenge 2024

点击查看摘要

Abstract:We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model, which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM, which contains factors which encode prior knowledge, such as a preference for smooth trajectories, and avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally we sample K=32 trajectories for each of the N ~ 100 agents for the next T = 8Δ time steps, where Δ = 10 is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach “Model Predictive Simulation” or MPS. We show that MPS improves upon the MTR baseline, especially in safety critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model, and does not require extra training, so we believe it is a valuable contribution to the community.
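
The replanning loop itself is short; the sketch below is schematic, with `forecast_trajectories` and `refine_with_pgm` standing in for the MTR transformer and the Gauss-Newton MAP refinement described above:

```python
def model_predictive_simulation(initial_state, n_steps, forecast_trajectories, refine_with_pgm):
    state, rollout = initial_state, [initial_state]
    for _ in range(n_steps):
        candidates = forecast_trajectories(rollout)     # several candidate futures per agent
        refined = refine_with_pgm(candidates, rollout)  # smoothness / collision-avoidance prior
        state = refined[0]                              # MPC-style: execute only the first step
        rollout.append(state)                           # ... then replan from the new state
    return rollout

# tiny dummy stand-ins so the loop runs end to end
forecast = lambda hist: [hist[-1] + 1, hist[-1] + 2]
refine = lambda cands, hist: sorted(cands)
print(model_predictive_simulation(0, 5, forecast, refine))   # [0, 1, 2, 3, 4, 5]
```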

[LG-48] Personalized Interpretation on Federated Learning: A Virtual Concepts approach

链接: https://arxiv.org/abs/2406.19631
作者: Peng Yan,Guodong Long,Jing Jiang,Michael Blumenstein
关键词: Tackling non-IID data, Tackling non-IID, federated learning research, open challenge, non-IID data
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Tackling non-IID data is an open challenge in federated learning research. Existing FL methods, including robust FL and personalized FL, are designed to improve model performance without considering how to interpret non-IID data across clients. This paper aims to design a novel FL method that is robust to, and can interpret, non-IID data across clients. Specifically, we interpret each client’s dataset as a mixture of conceptual vectors, each of which represents an interpretable concept to end-users. These conceptual vectors could be pre-defined or refined in a human-in-the-loop process or be learnt via the optimization procedure of the federated learning system. In addition to the interpretability, the clarity of client-specific personalization could also be applied to enhance the robustness of the training process of the FL system. The effectiveness of the proposed method has been validated on benchmark datasets.

[LG-49] Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

链接: https://arxiv.org/abs/2406.19622
作者: Erh-Chung Chen,Pin-Yu Chen,I-Hsin Chung,Che-Rung Lee
关键词: deep neural networks, deep neural, neural networks, robustness, DNNs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The security and robustness of deep neural networks (DNNs) have become increasingly concerning. This paper aims to provide both a theoretical foundation and a practical solution to ensure the reliability of DNNs. We explore the concept of Lipschitz continuity to certify the robustness of DNNs against adversarial attacks, which aim to mislead the network by adding imperceptible perturbations to inputs. We propose a novel algorithm that remaps the input domain into a constrained range, reducing the Lipschitz constant and potentially enhancing robustness. Unlike existing adversarially trained models, where robustness is enhanced by introducing additional examples from other datasets or generative models, our method is almost cost-free as it can be integrated with existing models without requiring re-training. Experimental results demonstrate the generalizability of our method, as it can be combined with various models and achieve enhancements in robustness. Furthermore, our method achieves the best robust accuracy for CIFAR10, CIFAR100, and ImageNet datasets on the RobustBench leaderboard.
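
A minimal sketch of the input-remapping idea, assuming a tanh-based squeeze and a toy linear stand-in for a frozen model; the paper's actual remapping of the input domain may differ:

```python
import numpy as np

# Sketch of remapping inputs into a constrained range before an already-trained
# model, so the composed function has a smaller Lipschitz constant. The
# tanh-based squeeze and the toy linear "model" are assumptions for
# illustration; the paper's actual remapping may differ.

def remap(x, scale=0.5):
    # Squeezes each feature into (-scale, scale); the remap itself has
    # Lipschitz constant <= scale, so it can only shrink the overall constant.
    return scale * np.tanh(x)

def pretrained_model(x, W=np.array([[2.0, -1.0], [0.5, 3.0]])):
    return x @ W.T  # stand-in for a frozen DNN

x = np.array([0.3, -1.2])
delta = np.array([0.05, 0.05])          # small input perturbation
plain_change = np.linalg.norm(pretrained_model(x + delta) - pretrained_model(x))
robust_change = np.linalg.norm(pretrained_model(remap(x + delta)) - pretrained_model(remap(x)))
print(plain_change, robust_change)      # the remapped pipeline reacts less to the perturbation
```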

[LG-50] Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

链接: https://arxiv.org/abs/2406.19621
作者: Yufan Xia,Giuseppe Maria Junior Barca
关键词: Aware Linear Algebra, BLAS Level, Data-Structure Aware Linear, scientific computing, essential for scientific
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Multi-Thread, Matrix Multiplication, Optimization, BLAS, Machine Learning

点击查看摘要

Abstract:BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.

[LG-51] Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity

链接: https://arxiv.org/abs/2406.19617
作者: Qian Yu,Yining Wang,Baihe Huang,Qi Lei,Jason D. Lee
关键词: stochastic zeroth-order feedback, strongly convex functions, convex functions, online learning, stochastic zeroth-order
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning. In this work, we consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm is only accessible to noisy evaluations of the objective function it queries. We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds. We propose an algorithm that features a combination of a bootstrapping stage and a mirror-descent stage. Our main technical innovation consists of a sharp characterization for the spherical-sampling gradient estimator under higher-order smoothness conditions, which allows the algorithm to optimally balance the bias-variance tradeoff, and a new iterative method for the bootstrapping stage, which maintains the performance for unbounded Hessian.
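
A minimal sketch of a two-point spherical-sampling gradient estimator of the kind discussed above, applied to a toy noisy quadratic; the objective, noise level, and step size are assumptions, and the paper's bootstrapping and mirror-descent stages are not reproduced:

```python
import numpy as np

# Sketch of a two-point spherical-sampling gradient estimator for optimization
# with only noisy function evaluations. The toy quadratic, noise level, and
# step size are assumptions for illustration.

rng = np.random.default_rng(1)
d = 5

def f_noisy(x):
    return 0.5 * np.sum((x - 1.0) ** 2) + 0.001 * rng.normal()

def zo_gradient(x, mu=1e-2):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                      # uniform direction on the sphere
    return d * (f_noisy(x + mu * u) - f_noisy(x - mu * u)) / (2 * mu) * u

x = np.zeros(d)
for t in range(2000):
    x -= 0.05 * zo_gradient(x)
print("distance to optimum:", np.linalg.norm(x - 1.0))
```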

[LG-52] VarteX: Enhancing Weather Forecast through Distributed Variable Representation

链接: https://arxiv.org/abs/2406.19615
作者: Ayumu Ueyama,Kazuhiko Kawamoto,Hiroshi Kera
关键词: human activities, Weather forecasting, Weather, Abstract, forecasting
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: ICML 2024, Workshop on Machine Learning for Earth System Modeling

点击查看摘要

Abstract:Weather forecasting is essential for various human activities. Recent data-driven models have outperformed numerical weather prediction in forecasting performance by utilizing deep learning. However, challenges remain in efficiently handling multiple meteorological variables. This study proposes VarteX, a new variable aggregation scheme and an efficient learning framework for that challenge. Experiments show that VarteX outperforms the conventional model in forecast performance, requiring significantly fewer parameters and resources. The effectiveness of learning through multiple aggregations and regional split training is demonstrated, enabling more efficient and accurate deep learning-based weather forecasting.

[LG-53] A Survey on Data Quality Dimensions and Tools for Machine Learning

链接: https://arxiv.org/abs/2406.19614
作者: Yuhan Zhou,Fengjiao Tu,Kewei Sha,Junhua Ding,Haihua Chen
关键词: Machine learning, substantial in practically, practically all aspects, data quality, exploratory data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by The 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper

点击查看摘要

Abstract:Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: this https URL.

[LG-54] A Survey on Deep Clustering: From the Prior Perspective

链接: https://arxiv.org/abs/2406.19602
作者: Yiding Lu,Haobin Li,Yunfan Li,Yijie Lin,Xi Peng
关键词: powerful feature extraction, feature extraction ability, achieved great success, deep clustering, deep clustering methods
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.

[LG-55] Optimizing Cyber Defense in Dynamic Active Directories through Reinforcement Learning

链接: https://arxiv.org/abs/2406.19596
作者: Diksha Goel,Kristen Moore,Mingyu Guo,Derui Wang,Minjune Kim,Seyit Camtepe
关键词: Autonomous Cyber Operations, Cyber Operations, Autonomous Cyber, effective edge-blocking ACO, gap in Autonomous
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The manuscript has been accepted as full paper at European Symposium on Research in Computer Security (ESORICS) 2024

点击查看摘要

Abstract:This paper addresses a significant gap in Autonomous Cyber Operations (ACO) literature: the absence of effective edge-blocking ACO strategies in dynamic, real-world networks. It specifically targets the cybersecurity vulnerabilities of organizational Active Directory (AD) systems. Unlike the existing literature on edge-blocking defenses which considers AD systems as static entities, our study counters this by recognizing their dynamic nature and developing advanced edge-blocking defenses through a Stackelberg game model between attacker and defender. We devise a Reinforcement Learning (RL)-based attack strategy and an RL-assisted Evolutionary Diversity Optimization-based defense strategy, where the attacker and defender improve each other’s strategies via parallel gameplay. To address the computational challenges of training attacker-defender strategies on numerous dynamic AD graphs, we propose an RL Training Facilitator that prunes environments and neural networks to eliminate irrelevant elements, enabling efficient and scalable training for large graphs. We extensively train the attacker strategy, as a sophisticated attacker model is essential for a robust defense. Our empirical results successfully demonstrate that our proposed approach enhances the defender’s proficiency in hardening dynamic AD graphs while ensuring scalability for large-scale AD.

[LG-56] Network Bending of Diffusion Models for Audio-Visual Generation

链接: https://arxiv.org/abs/2406.19589
作者: Luke Dzwonczyk,Carmine Emanuele Cella,David Ban
关键词: create music visualizations, machine learning models, visualizations using pre-trained, paper we present, enables artists
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 8 pages, 5 figures, to be published in the proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), for additional image and video examples see this https URL

点击查看摘要

Abstract:In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pre-trained, generative, machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grain control of image generation which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms.
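
A minimal sketch of network bending on a toy network, assuming PyTorch and a simple point-wise thresholding operator inserted via a forward hook; the paper applies such operators inside image-generation diffusion models such as Stable Diffusion rather than the tiny MLP used here:

```python
import torch
import torch.nn as nn

# Sketch of "network bending": applying a point-wise transform to intermediate
# activations of a pretrained generator via a forward hook. The tiny toy
# network and the thresholding operator are assumptions for illustration.

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def bend(module, inputs, output):
    # Example point-wise operator: zero out small activations, amplify the rest.
    return torch.where(output.abs() < 0.5, torch.zeros_like(output), 2.0 * output)

handle = net[1].register_forward_hook(bend)   # bend the activations after the ReLU

x = torch.randn(1, 8)
print(net(x))
handle.remove()                                # restore the unbent network
```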

[LG-57] HarmonICA: Neural non-stationarity correction and source separation for motor neuron interfaces

链接: https://arxiv.org/abs/2406.19581
作者: Alexander Kenneth Clarke,Agnese Grison,Irene Mendez Guerra,Pranav Mamidanna,Shihan Ma,Silvia Muceli,Dario Farina
关键词: spinal motor neurons, major outstanding problem, estimated in advance, major outstanding, interfacing with spinal
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A major outstanding problem when interfacing with spinal motor neurons is how to accurately compensate for non-stationary effects in the signal during source separation routines, particularly when they cannot be estimated in advance. This forces current systems to instead use undifferentiated bulk signal, which limits the potential degrees of freedom for control. In this study we propose a potential solution, using an unsupervised learning algorithm to blindly correct for the effects of latent processes which drive the signal non-stationarities. We implement this methodology within the theoretical framework of a quasilinear version of independent component analysis (ICA). The proposed design, HarmonICA, sidesteps the identifiability problems of nonlinear ICA, allowing for equivalent predictability to linear ICA whilst retaining the ability to learn complex nonlinear relationships between non-stationary latents and their effects on the signal. We test HarmonICA on both invasive and non-invasive recordings both simulated and real, demonstrating an ability to blindly compensate for the non-stationary effects specific to each, and thus to significantly enhance the quality of a source separation routine.

[LG-58] FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

链接: https://arxiv.org/abs/2406.19580
作者: Saeed Rashidi,William Won,Sudarshan Srinivasan,Puneet Gupta,Tushar Krishna
关键词: Distributed Deep Neural, Deep Neural Network, Deep Neural, Distributed Deep, Neural Network
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making it an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-BW requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively when compared to a baseline waferscale 2D-Mesh fabric.

[LG-59] PathAlign: A vision-language model for whole slide images in histopathology

链接: https://arxiv.org/abs/2406.19578
作者: Faruk Ahmed,Andrew Sellergren,Lin Yang,Shawn Xu,Boris Babenko,Abbi Ward,Niels Olson,Arash Mohtashamian,Yossi Matias,Greg S. Corrado,Quang Duong,Dale R. Webster,Shravya Shetty,Daniel Golden,Yun Liu,David F. Steiner,Ellery Wulczyn
关键词: histopathology images underlies, Microscopic interpretation, treatment decisions, underlies many important, Microscopic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures

点击查看摘要

Abstract:Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

[LG-60] On Counterfactual Interventions in Vector Autoregressive Models

链接: https://arxiv.org/abs/2406.19573
作者: Kurt Butler,Marija Iloska,Petar M. Djuric
关键词: explore hypothetical scenarios, explore hypothetical, hypothetical scenarios, scenarios in order, order to explain
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Counterfactual reasoning allows us to explore hypothetical scenarios in order to explain the impacts of our decisions. However, addressing such inquires is impossible without establishing the appropriate mathematical framework. In this work, we introduce the problem of counterfactual reasoning in the context of vector autoregressive (VAR) processes. We also formulate the inference of a causal model as a joint regression task where for inference we use both data with and without interventions. After learning the model, we exploit linearity of the VAR model to make exact predictions about the effects of counterfactual interventions. Furthermore, we quantify the total causal effects of past counterfactual interventions. The source code for this project is freely available at this https URL.
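
A minimal sketch of the idea on simulated data, assuming a VAR(1) process fitted by least squares; the intervention time and value are made up, and the paper's joint regression over interventional and observational data is not reproduced:

```python
import numpy as np

# Sketch of counterfactual rollout in a VAR(1) model: fit the transition matrix
# by least squares, then replay the series with one variable pinned to an
# intervened value and propagate the effect forward through the linear dynamics.
# The simulated data, intervention time, and value are assumptions.

rng = np.random.default_rng(2)
A_true = np.array([[0.8, 0.1], [0.2, 0.7]])
T = 200
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A_true @ x[t - 1] + rng.normal(scale=0.1, size=2)

# Least-squares fit of x[t] ~ A x[t-1]
B, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
A_hat = B.T

# Counterfactual: what if x[100, 0] had been set to 5.0?
cf = x.copy()
cf[100, 0] = 5.0
for t in range(101, T):
    resid = x[t] - A_hat @ x[t - 1]            # keep the observed noise fixed
    cf[t] = A_hat @ cf[t - 1] + resid
print("effect on x[105]:", cf[105] - x[105])
```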

[LG-61] Instance-Optimal Private Density Estimation in the Wasserstein Distance

链接: https://arxiv.org/abs/2406.19566
作者: Vitaly Feldman,Audra McMillan,Satchit Sivakumar,Kunal Talwar
关键词: Wasserstein distance, Wasserstein, small Wasserstein distance, density, Machine Learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating the density of a distribution from samples is a fundamental problem in statistics. In many practical settings, the Wasserstein distance is an appropriate error metric for density estimation. For example, when estimating population densities in a geographic region, a small Wasserstein distance means that the estimate is able to capture roughly where the population mass is. In this work we study differentially private density estimation in the Wasserstein distance. We design and analyze instance-optimal algorithms for this problem that can adapt to easy instances. For distributions P over \mathbb{R}, we consider a strong notion of instance-optimality: an algorithm that uniformly achieves the instance-optimal estimation rate is competitive with an algorithm that is told that the distribution is either P or Q_P for some distribution Q_P whose probability density function (pdf) is within a factor of 2 of the pdf of P. For distributions over \mathbb{R}^2, we use a different notion of instance optimality. We say that an algorithm is instance-optimal if it is competitive with an algorithm that is given a constant-factor multiplicative approximation of the density of the distribution. We characterize the instance-optimal estimation rates in both these settings and show that they are uniformly achievable (up to polylogarithmic factors). Our approach for \mathbb{R}^2 extends to arbitrary metric spaces as it goes via hierarchically separated trees. As a special case our results lead to instance-optimal private learning in TV distance for discrete distributions.

[LG-62] Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning

链接: https://arxiv.org/abs/2406.19561
作者: Bradley Burega,John D. Martin,Luke Kapeluck,Michael Bowling
关键词: Reinforcement Learning, remain sample-efficient, imperfect model, environment dynamics change, Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.

[LG-63] Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction

链接: https://arxiv.org/abs/2406.19560
作者: Yuxuan Zhang,T.M. Sazzad,Yangyang Song,Spencer J. Chang,Ritesh Chowdhry,Tomas Mejia,Anna Hampton,Shelby Kucharski,Stefan Gerber,Barry Tillman,Marcio F. R. Resende,William M. Hammond,Chris H. Wilson,Alina Zare,Sanjeev J. Koppal
关键词: recently gained increasing, gained increasing attention, including agricultural investigation, Hyper-spectral imaging, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many others. However, the high cost, large physical size and complicated operation process prevent hyperspectral cameras from being employed in many applications and research fields. In this paper, we introduce a cost-efficient, compact and easy-to-use active illumination camera that may benefit many applications. We developed a fully functional prototype of such a camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera’s data as ground truth and our camera’s data as input. We demonstrated our camera’s ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible with models and algorithms developed for hyperspectral applications with no modifications required.

[LG-64] Rethinking harmless refusals when fine-tuning foundation models

链接: https://arxiv.org/abs/2406.19552
作者: Florin Pop,Judd Rosenblatt,Diogo Schwerz de Lucena,Michael Vaiana
关键词: Large Language Models, Large Language, effectively mitigates versus, conceals undesirable behavior, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICLR 2024 AGI Workshop Poster

点击查看摘要

Abstract:In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emphreason-based deception, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.

[LG-65] ASCENT: Amplifying Power Side-Channel Resilience via Learning Monte-Carlo Tree Search

链接: https://arxiv.org/abs/2406.19549
作者: Jitendra Bhandari,Animesh Basak Chowdhury,Ozgur Sinanoglu,Siddharth Garg,Ramesh Karri,Johann Knechtel
关键词: securing cryptographic hardware, cryptographic hardware, PSC, Power side-channel, securing cryptographic
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at 2024 ACM/IEEE International Conference on Computer-Aided Design

点击查看摘要

Abstract:Power side-channel (PSC) analysis is pivotal for securing cryptographic hardware. Prior art focused on securing gate-level netlists obtained as-is from chip design automation, neglecting all the complexities and potential side-effects for security arising from the design automation process. That is, automation traditionally prioritizes power, performance, and area (PPA), sidelining security. We propose a “security-first” approach, refining the logic synthesis stage to enhance the overall resilience of PSC countermeasures. We introduce ASCENT, a learning-and-search-based framework that (i) drastically reduces the time for post-design PSC evaluation and (ii) explores the security-vs-PPA design space. Thus, ASCENT enables an efficient exploration of a large number of candidate netlists, leading to an improvement in PSC resilience compared to regular PPA-optimized netlists. ASCENT is up to 120x faster than traditional PSC analysis and yields a 3.11x improvement in the PSC resilience of state-of-the-art PSC countermeasures.

[LG-66] Dataless Quadratic Neural Networks for the Maximum Independent Set Problem

链接: https://arxiv.org/abs/2406.19532
作者: Ismail Alkhouri,Cedric Le Denmat,Yingjie Li,Cunxi Yu,Jia Liu,Rongrong Wang,Alvaro Velasquez
关键词: Maximum Independent Set, challenging Maximum Independent, Independent Set, Maximum Independent, challenging Maximum
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Combinatorial Optimization (CO) plays a crucial role in addressing various significant problems, among them the challenging Maximum Independent Set (MIS) problem. In light of recent advancements in deep learning methods, efforts have been directed towards leveraging data-driven learning approaches, typically rooted in supervised learning and reinforcement learning, to tackle the NP-hard MIS problem. However, these approaches rely on labeled datasets, exhibit weak generalization, and often depend on problem-specific heuristics. Recently, ReLU-based dataless neural networks were introduced to address combinatorial optimization problems. This paper introduces a novel dataless quadratic neural network formulation, featuring a continuous quadratic relaxation for the MIS problem. Notably, our method eliminates the need for training data by treating the given MIS instance as a trainable entity. More specifically, the graph structure and constraints of the MIS instance are used to define the structure and parameters of the neural network such that training it on a fixed input provides a solution to the problem, thereby setting it apart from traditional supervised or reinforcement learning approaches. By employing a gradient-based optimization algorithm like ADAM and leveraging an efficient off-the-shelf GPU parallel implementation, our straightforward yet effective approach demonstrates competitive or superior performance compared to state-of-the-art learning-based methods. Another significant advantage of our approach is that, unlike exact and heuristic solvers, the running time of our method scales only with the number of nodes in the graph, not the number of edges.
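
A minimal sketch of a continuous quadratic relaxation for MIS on a toy graph, optimized with projected gradient steps; the penalty form, the 5-node cycle, the step size, and the rounding threshold are illustrative assumptions and may differ from the paper's exact formulation and its ADAM-based GPU implementation:

```python
import numpy as np

# Sketch of a dataless continuous relaxation for Maximum Independent Set: the
# graph alone defines a quadratic objective over [0,1]^n, and gradient descent
# on that objective yields a candidate independent set. The penalty form, toy
# 5-cycle graph, and projected-gradient loop are assumptions for illustration.

rng = np.random.default_rng(3)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # a 5-cycle; every maximal independent set has size 2
n, gamma = 5, 2.0

def gradient(x):
    g = -np.ones(n)                                # from the -sum(x) reward term
    for i, j in edges:                             # from the gamma * x_i * x_j edge penalties
        g[i] += gamma * x[j]
        g[j] += gamma * x[i]
    return g

x = rng.uniform(0.2, 0.8, size=n)                  # random start to break symmetry
for _ in range(1000):
    x = np.clip(x - 0.05 * gradient(x), 0.0, 1.0)  # projected gradient step
print("candidate independent set:", [i for i in range(n) if x[i] > 0.5])
```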

[LG-67] TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers

链接: https://arxiv.org/abs/2406.19526
作者: Majd Saleh,Sarra Baghdadi,Stéphane Paquelet
关键词: Natural Language Processing, Language Processing, Natural Language, holds paramount importance, field of Natural
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures

点击查看摘要

Abstract:Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.

[LG-68] Reliable edge machine learning hardware for scientific applications

链接: https://arxiv.org/abs/2406.19522
作者: Tommaso Baldi(1 and 2),Javier Campos(1),Ben Hawks(1),Jennifer Ngadiuba(1),Nhan Tran(1),Daniel Diaz(3),Javier Duarte(3),Ryan Kastner(3),Andres Meza(3),Melissa Quinnan(3),Olivia Weng(3),Caleb Geniesse(4),Amir Gholami(4),Michael W. Mahoney(4),Vladimir Loncar(5),Philip Harris(5),Joshua Agar(6),Shuyu Qin(6) ((1) Fermilab, (2) University of Pisa, (3) UC San Diego, (4) UC Berkeley/LBNL/ICSI, (5) MIT, (6) Drexel University)
关键词: experiments create massive, create massive amounts, Extreme data rate, data rate scientific, rate scientific experiments
类目: Machine Learning (cs.LG)
*备注: IEEE VLSI Test Symposium 2024 (VTS)

点击查看摘要

Abstract:Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. We discuss approaches to developing and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for developing robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of developing autonomous scientific experimentation methods for accelerated scientific discovery.

[LG-69] Too Good to be True? Turn Any Model Differentially Private With DP-Weights

链接: https://arxiv.org/abs/2406.19507
作者: David Zagardo
关键词: Stochastic Gradient Descent, Private Stochastic Gradient, Gradient Descent, Stochastic Gradient, Differentially Private Stochastic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: For code visit the following repository, this https URL

点击查看摘要

Abstract:Imagine training a machine learning model with Differentially Private Stochastic Gradient Descent (DP-SGD), only to discover post-training that the noise level was either too high, crippling your model’s utility, or too low, compromising privacy. The dreaded realization hits: you must start the lengthy training process from scratch. But what if you could avoid this retraining nightmare? In this study, we introduce a groundbreaking approach (to our knowledge) that applies differential privacy noise to the model’s weights after training. We offer a comprehensive mathematical proof for this novel approach’s privacy bounds, use formal methods to validate its privacy guarantees, and empirically evaluate its effectiveness using membership inference attacks and performance evaluations. This method allows for a single training run, followed by post-hoc noise adjustments to achieve optimal privacy-utility trade-offs. We compare this novel fine-tuned model (DP-Weights model) to a traditional DP-SGD model, demonstrating that our approach yields statistically similar performance and privacy guarantees. Our results validate the efficacy of post-training noise application, promising significant time savings and flexibility in fine-tuning differential privacy parameters, making it a practical alternative for deploying differentially private models in real-world scenarios.
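
A minimal sketch of the post-training noise idea, assuming toy weight arrays, a norm clip, and a hand-picked noise multiplier; calibrating the noise so that it certifies a formal privacy bound is the paper's contribution and is not reproduced here:

```python
import numpy as np

# Sketch of the post-training idea above: add Gaussian noise directly to the
# learned weights instead of noising gradients during training. The toy
# weights, clipping norm, and noise multiplier are assumptions for
# illustration; deriving a noise scale that actually certifies an
# (epsilon, delta) guarantee is the paper's contribution.

rng = np.random.default_rng(4)

def dp_weights(weights, clip_norm=1.0, noise_multiplier=0.5):
    noisy = {}
    for name, w in weights.items():
        norm = np.linalg.norm(w)
        w_clipped = w * min(1.0, clip_norm / (norm + 1e-12))   # bound sensitivity
        noisy[name] = w_clipped + rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    return noisy

trained = {"layer1": rng.normal(size=(4, 3)), "layer2": rng.normal(size=(3,))}
private = dp_weights(trained)
print({k: v.shape for k, v in private.items()})
```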

[LG-70] Monitoring Latent World States in Language Models with Propositional Probes

链接: https://arxiv.org/abs/2406.19501
作者: Jiahai Feng,Stuart Russell,Jacob Steinhardt
关键词: Language models, Language, tendencies that lead, input context, Greg
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with ‘propositional probes’, which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context ‘‘Greg is a nurse. Laura is a physicist.’’, we decode the propositions ‘‘WorksAs(Greg, nurse)’’ and ‘‘WorksAs(Laura, physicist)’’ from the model’s activations. Key to this is identifying a ‘binding subspace’ in which bound tokens have high similarity (‘‘Greg’’ and ‘‘nurse’’) but unbound ones do not (‘‘Greg’’ and ‘‘physicist’’). We validate propositional probes in a closed-world setting with finitely many predicates and properties. Despite being trained on simple templated contexts, propositional probes generalize to contexts rewritten as short stories and translated to Spanish. Moreover, we find that in three settings where language models respond unfaithfully to the input context – prompt injections, backdoor attacks, and gender bias – the decoded propositions remain faithful. This suggests that language models often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.

[LG-71] LoPT: Low-Rank Prompt Tuning for Parameter Efficient Language Models

链接: https://arxiv.org/abs/2406.19486
作者: Shouchang Guo,Sonam Damani,Keng-hao Chang
关键词: prompt tuning, suffix text, token indices, text is added, optimized to gain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In prompt tuning, a prefix or suffix text is added to the prompt, and the embeddings (soft prompts) or token indices (hard prompts) of the prefix/suffix are optimized to gain more control over language models for specific tasks. This approach eliminates the need for hand-crafted prompt engineering or explicit model fine-tuning. Prompt tuning is significantly more parameter-efficient than model fine-tuning, as it involves optimizing partial inputs of language models to produce desired outputs. In this work, we aim to further reduce the amount of trainable parameters required for a language model to perform well on specific tasks. We propose Low-rank Prompt Tuning (LoPT), a low-rank model for prompts that achieves efficient prompt optimization. The proposed method demonstrates similar outcomes to full parameter prompt tuning while reducing the number of trainable parameters by a factor of 5. It also provides promising results compared to the state-of-the-art methods that would require 10 to 20 times more parameters.
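
A minimal sketch of a low-rank soft prompt, assuming PyTorch and made-up dimensions; the factorization below illustrates why the trainable parameter count drops, but the rank, initialization, and how the prompt is fed to a frozen language model are assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

# Sketch of a low-rank soft prompt: instead of learning a full
# (num_prompt_tokens x d_model) prompt embedding, learn two small factors A and
# B whose product is prepended to the input embeddings. Dimensions, rank, and
# the stand-in embeddings are assumptions for illustration.

num_prompt_tokens, d_model, rank = 20, 768, 4

class LowRankPrompt(nn.Module):
    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_prompt_tokens, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(rank, d_model) * 0.02)

    def forward(self, input_embeds):             # input_embeds: (batch, seq, d_model)
        prompt = (self.A @ self.B).unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

prompt = LowRankPrompt()
x = torch.randn(2, 10, d_model)                  # stand-in for frozen token embeddings
print(prompt(x).shape)                           # torch.Size([2, 30, 768])
full = num_prompt_tokens * d_model
low_rank = rank * (num_prompt_tokens + d_model)
print(f"trainable prompt params: {low_rank} vs full {full}")
```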

[LG-72] Multi-agent Cooperative Games Using Belief Map Assisted Training

链接: https://arxiv.org/abs/2406.19477
作者: Qinwei Huang,Chen Luo,Alex B. Wu,Simon Khan,Hai Li,Qinru Qiu
关键词: message passing system, gain global situational, global situational awareness, message passing, multi-agent message passing
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learning (RL), the message passing system needs to be optimized together with the agent policies. This consequently increases the model’s complexity and poses significant challenges to the convergence and performance of learning. To address this issue, we propose the Belief-map Assisted Multi-agent System (BAMS), which leverages a neuro-symbolic belief map to enhance training. The belief map decodes the agent’s hidden state to provide a symbolic representation of the agent’s understanding of the environment and other agents’ status. The simplicity of symbolic representation allows the gathering and comparison of the ground truth information with the belief, which provides an additional channel of feedback for learning. Compared to the sporadic and delayed feedback coming from the reward in RL, the feedback from the belief map is more consistent and reliable. Agents using BAMS learn a more effective message passing network and better understand each other, resulting in better performance in a cooperative predator-and-prey game with varying levels of map complexity, compared to previous multi-agent message passing models. The simulation results showed that BAMS reduced training epochs by 66%, and agents who apply the BAMS model completed the game with 34.62% fewer steps on average.

[LG-73] Saliency Attention and Semantic Similarity-Driven Adversarial Perturbation

链接: https://arxiv.org/abs/2406.19413
作者: Hetvi Waghela,Jaydip Sen,Sneha Rakshit
关键词: enhanced textual adversarial, Semantic Similarity driven, introduce an enhanced, enhanced textual, Similarity driven adversarial
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: The paper is 12 pages long. and it contains 5 tables. It is the pre-reviewed version of the paper that has been accepted for oral presentation and publication in the 5th International Conference on Data Science and Applications which will be organized in Jaipur, India from July 17 to 19, 2024. This is not the final version

点击查看摘要

Abstract:In this paper, we introduce an enhanced textual adversarial attack method, known as Saliency Attention and Semantic Similarity driven adversarial Perturbation (SASSP). The proposed scheme is designed to improve the effectiveness of contextual perturbations by integrating saliency, attention, and semantic similarity. Traditional adversarial attack methods often struggle to maintain semantic consistency and coherence while effectively deceiving target models. Our proposed approach addresses these challenges by incorporating a three-pronged strategy for word selection and perturbation. First, we utilize a saliency-based word selection to prioritize words for modification based on their importance to the model’s prediction. Second, attention mechanisms are employed to focus perturbations on contextually significant words, enhancing the attack’s efficacy. Finally, an advanced semantic similarity-checking method is employed that includes embedding-based similarity and paraphrase detection. By leveraging models like Sentence-BERT for embedding similarity and fine-tuned paraphrase detection models from the Sentence Transformers library, the scheme ensures that the perturbed text remains contextually appropriate and semantically consistent with the original. Empirical evaluations demonstrate that SASSP generates adversarial examples that not only maintain high semantic fidelity but also effectively deceive state-of-the-art natural language processing models. Moreover, in comparison to the original scheme of contextual perturbation CLARE, SASSP has yielded a higher attack success rate and lower word perturbation rate.

[LG-74] The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

链接: https://arxiv.org/abs/2406.19958
作者: Yan Shuo Tan,Omer Ronen,Theo Saarinen,Bin Yu
关键词: Bayesian Additive Regression, popular Bayesian non-parametric, Bayesian non-parametric regression, Additive Regression Trees, Bayesian Additive
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In this paper, we show that the BART sampler often converges slowly, confirming empirical observations by other researchers. Assuming discrete covariates, we show that, while the BART posterior concentrates on a set comprising all optimal tree structures (smallest bias and complexity), the Markov chain’s hitting time for this set increases with n (training sample size), under several common data generative settings. As n increases, the approximate BART posterior thus becomes increasingly different from the exact posterior (for the same number of MCMC samples), contrasting with earlier concentration results on the exact posterior. This contrast is highlighted by our simulations showing worsening frequentist undercoverage for approximate posterior intervals and a growing ratio between the MSE of the approximate posterior and that obtainable by artificially improving convergence via averaging multiple sampler chains. Finally, based on our theoretical insights, possibilities are discussed to improve the BART sampler convergence performance.

[LG-75] Kolmogorov-Smirnov GAN

链接: https://arxiv.org/abs/2406.19948
作者: Maciej Falkiewicz,Naoya Takeishi,Alexandros Kalousis
关键词: Generative Adversarial Network, Adversarial Network, deep generative model, Kolmogorov-Smirnov Generative Adversarial, adversarial generative models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Code available at this https URL

点击查看摘要

Abstract:We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance, generalized to handle multivariate distributions. This distance is calculated using the quantile function, which acts as the critic in the adversarial training process. We formally demonstrate that minimizing the KS distance leads to the trained approximate distribution aligning with the target distribution. We propose an efficient implementation and evaluate its effectiveness through experiments. The results show that KSGAN performs on par with existing adversarial methods, exhibiting stability during training, resistance to mode dropping and collapse, and tolerance to variations in hyperparameter settings. Additionally, we review the literature on the Generalized KS test and discuss the connections between KSGAN and existing adversarial generative models.
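
A minimal sketch of the one-dimensional empirical Kolmogorov-Smirnov distance between two samples; the multivariate generalization via quantile functions and its use as the critic in adversarial training are the paper's contributions and are not shown:

```python
import numpy as np

# Sketch of the one-dimensional empirical Kolmogorov-Smirnov distance between
# two samples, computed from their empirical CDFs. The Gaussian samples below
# are made up for illustration.

def ks_distance(a, b):
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(5)
real = rng.normal(0.0, 1.0, size=1000)
fake_good = rng.normal(0.1, 1.0, size=1000)      # close to the real distribution
fake_bad = rng.normal(2.0, 0.5, size=1000)       # far from the real distribution
print(ks_distance(real, fake_good), ks_distance(real, fake_bad))
```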

[LG-76] Classical Bandit Algorithms for Entanglement Detection in Parameterized Qubit States

链接: https://arxiv.org/abs/2406.19738
作者: Bharati. K,Vikesh Siddhu,Krishna Jagannathan
关键词: Phys. Rev, information and computing, wide range, range of tasks, key resource
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:Entanglement is a key resource for a wide range of tasks in quantum information and computing. Thus, verifying availability of this quantum resource is essential. Extensive research on entanglement detection has led to no-go theorems (Lu et al. [Phys. Rev. Lett., 116, 230501 (2016)]) that highlight the need for full state tomography (FST) in the absence of adaptive or joint measurements. Recent advancements, as proposed by Zhu, Teo, and Englert [Phys. Rev. A, 81, 052339, 2010], introduce a single-parameter family of entanglement witness measurements which are capable of conclusively detecting certain entangled states and only resort to FST when all witness measurements are inconclusive. We find a variety of realistic noisy two-qubit quantum states \mathcal{F} that yield conclusive results under this witness family. We solve the problem of detecting entanglement among K quantum states in \mathcal{F}, of which m states are entangled, with m potentially unknown. We recognize a structural connection of this problem to the Bad Arm Identification problem in stochastic Multi-Armed Bandits (MAB). In contrast to existing quantum bandit frameworks, we establish a new correspondence tailored for entanglement detection and term it the (m,K)-quantum Multi-Armed Bandit. We implement two well-known MAB policies for arbitrary states derived from \mathcal{F}, present theoretical guarantees on the measurement/sample complexity and demonstrate the practicality of the policies through numerical simulations. More broadly, this paper highlights the potential for employing classical machine learning techniques for quantum entanglement detection.

[LG-77] Enforcing Equity in Neural Climate Emulators

链接: https://arxiv.org/abs/2406.19636
作者: William Yik,Sam J. Silva
关键词: invaluable tool, wide variety, loss function, loss, weather prediction tasks
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:Neural network emulators have become an invaluable tool for a wide variety of climate and weather prediction tasks. While showing incredibly promising results, these networks do not have an inherent ability to produce equitable predictions. That is, they are not guaranteed to provide a uniform quality of prediction along any particular class or group of people. This potential for inequitable predictions motivates the need for explicit representations of fairness in these neural networks. To that end, we draw on methods for enforcing analytical physical constraints in neural networks to bias networks towards more equitable predictions. We demonstrate the promise of this methodology using the task of climate model emulation. Specifically, we propose a custom loss function which punishes emulators with unequal quality of predictions across any prespecified regions or category, here defined using human development index (HDI). This loss function weighs a standard loss metric such as mean squared error against another metric which captures inequity along the equity category (HDI), allowing us to adjust the priority of each term before training. Importantly, the loss function does not specify a particular definition of equity to bias the neural network towards, opening the door for custom fairness metrics. Our results show that neural climate emulators trained with our loss function provide more equitable predictions and that the equity metric improves with greater weighting in the loss function. We empirically demonstrate that while there is a tradeoff between accuracy and equity when prioritizing the latter during training, an appropriate selection of the equity priority hyperparameter can minimize loss of performance.
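
A minimal sketch of an equity-weighted loss of the kind described above, assuming a variance-of-group-errors penalty, a made-up weighting alpha, and two synthetic groups standing in for HDI-defined regions:

```python
import numpy as np

# Sketch of an equity-weighted loss: a standard MSE term plus a penalty on how
# unevenly the error is distributed across prespecified groups (the paper uses
# HDI-defined regions). The variance-based inequity term, the weight alpha, and
# the synthetic data are assumptions; the paper leaves the equity metric configurable.

def equity_loss(pred, target, group_ids, alpha=1.0):
    sq_err = (pred - target) ** 2
    mse = sq_err.mean()
    group_mses = np.array([sq_err[group_ids == g].mean() for g in np.unique(group_ids)])
    inequity = group_mses.var()                  # spread of error across groups
    return mse + alpha * inequity

rng = np.random.default_rng(6)
target = rng.normal(size=100)
pred = target + np.where(np.arange(100) < 50, 0.1, 0.5) * rng.normal(size=100)
groups = np.where(np.arange(100) < 50, 0, 1)     # two regions with unequal error
print(equity_loss(pred, target, groups, alpha=0.0), equity_loss(pred, target, groups, alpha=1.0))
```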

[LG-78] ScoreFusion: fusing score-based generative models via Kullback-Leibler barycenters

链接: https://arxiv.org/abs/2406.19619
作者: Hao Liu,Junze (Tony) Ye,Jose Blanchet,Nian Si
关键词: auxiliary generative models, generative models, target generative model, fusing pre-trained, study the problem
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 40 pages, 6 figures

点击查看摘要

Abstract:We study the problem of fusing pre-trained (auxiliary) generative models to enhance the training of a target generative model. We propose using KL-divergence weighted barycenters as an optimal fusion mechanism, in which the barycenter weights are optimally trained to minimize a suitable loss for the target population. While computing the optimal KL-barycenter weights can be challenging, we demonstrate that this process can be efficiently executed using diffusion score training when the auxiliary generative models are also trained based on diffusion score methods. Moreover, we show that our fusion method has a dimension-free sample complexity in total variation distance provided that the auxiliary models are well fitted for their own task and the auxiliary tasks combined capture the target well. The main takeaway of our method is that if the auxiliary models are well-trained and can borrow features from each other that are present in the target, our fusion method significantly improves the training of generative models. We provide a concise computational implementation of the fusion algorithm, and validate its efficiency in the low-data regime with numerical experiments involving mixtures models and image datasets.

[LG-79] Private Zeroth-Order Nonsmooth Nonconvex Optimization

链接: https://arxiv.org/abs/2406.19579
作者: Qinzi Zhang,Hoang Tran,Ashok Cutkosky
关键词: private stochastic optimization, epsilon, private stochastic, stochastic optimization, optimization on nonconvex
类目: Optimization and Control (math.OC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives. Given a dataset of size M, our algorithm ensures (\alpha,\alpha\rho^2/2)-Rényi differential privacy and finds a (\delta,\epsilon)-stationary point so long as M=\tilde\Omega\left(\frac{d}{\delta\epsilon^3} + \frac{d^{3/2}}{\rho\delta\epsilon^2}\right). This matches the optimal complexity of its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy ``for free'' whenever \rho \ge \sqrt{d}\,\epsilon.

[LG-80] Deep Temporal Sequence Classification and Mathematical Modeling for Cell Tracking in Dense 3D Microscopy Videos of Bacterial Biofilms

链接: https://arxiv.org/abs/2406.19574
作者: Tanjin Taher Toma,Yibo Wang,Andreas Gahlmann,Scott T. Acton
关键词: Automatic cell tracking, Automatic cell, parent-offspring relationships, dense environments, environments is plagued
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic cell tracking in dense environments is plagued by inaccurate correspondences and misidentification of parent-offspring relationships. In this paper, we introduce a novel cell tracking algorithm named DenseTrack, which integrates deep learning with mathematical model-based strategies to effectively establish correspondences between consecutive frames and detect cell division events in crowded scenarios. We formulate the cell tracking problem as a deep learning-based temporal sequence classification task followed by solving a constrained one-to-one matching optimization problem exploiting the classifier’s confidence scores. Additionally, we present an eigendecomposition-based cell division detection strategy that leverages knowledge of cellular geometry. The performance of the proposed approach has been evaluated by tracking densely packed cells in 3D time-lapse image sequences of bacterial biofilm development. The experimental results on simulated as well as experimental fluorescence image sequences suggest that the proposed tracking method achieves superior performance in terms of both qualitative and quantitative evaluation measures compared to recent state-of-the-art cell tracking approaches.
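
A minimal sketch of the constrained one-to-one matching step, assuming a made-up matrix of classifier confidence scores and SciPy's Hungarian-algorithm solver; the paper's full formulation, including division detection, is more involved:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the matching step: given classifier confidence scores for every
# (cell in frame t, cell in frame t+1) pair, solve a one-to-one assignment that
# maximizes total confidence. The score matrix below is made up for illustration.

scores = np.array([
    [0.90, 0.05, 0.10],
    [0.08, 0.85, 0.20],
    [0.15, 0.10, 0.80],
])                                            # rows: cells at t, cols: cells at t+1

rows, cols = linear_sum_assignment(-scores)   # minimize negative confidence
for r, c in zip(rows, cols):
    print(f"cell {r} at frame t -> cell {c} at frame t+1 (score {scores[r, c]:.2f})")
```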

[LG-81] BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain Diseases

链接: https://arxiv.org/abs/2406.19556
作者: Muhammad Awais,Mehaboobathunnisa Sahul Hameed,Bidisha Bhattacharya,Orly Reiner,Rao Muhammad Anwer
关键词: Recent advances, brain organoids derived, human brain development, advances have enabled, derived from stem
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances have enabled the study of human brain development using brain organoids derived from stem cells. Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies. We introduce BOrg, a dataset designed to study mitotic events in the embryonic development of the brain using confocal microscopy images of brain organoids. BOrg utilizes an efficient annotation pipeline with sparse point annotations and techniques that minimize expert effort, overcoming limitations of standard deep learning approaches on sparse data. We adapt and benchmark state-of-the-art object detection and cell counting models on BOrg for detecting and analyzing mitotic cells across prophase, metaphase, anaphase, and telophase stages. Our results demonstrate these adapted models significantly improve mitosis analysis efficiency and accuracy for brain organoid research compared to existing methods. BOrg facilitates the development of automated tools to quantify statistics like mitosis rates, aiding mechanistic studies of neurodevelopmental processes and disorders. Data and code are available at this https URL.

[LG-82] Forward and Backward State Abstractions for Off-policy Evaluation

链接: https://arxiv.org/abs/2406.19531
作者: Meiling Hao,Pingfan Su,Liyuan Hu,Zoltan Szabo,Qingyuan Zhao,Chengchun Shi
关键词: Off-policy evaluation, target policy impact, policy impact offline, crucial for evaluating, evaluating a target
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 5 figures

点击查看摘要

Abstract:Off-policy evaluation (OPE) is crucial for evaluating a target policy’s impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially simplifies the sample complexity of OPE arising from high cardinality.

[LG-83] Bayesian calibration of stochastic agent based model via random forest

链接: https://arxiv.org/abs/2406.19524
作者: Connor Robertson,Cosmin Safta,Nicholson Collier,Jonathan Ozik,Jaideep Ray
关键词: diverse individual interactions, Agent-based models, provide an excellent, interactions and environments, excellent framework
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Agent-based models (ABM) provide an excellent framework for modeling outbreaks and interventions in epidemiology by explicitly accounting for diverse individual interactions and environments. However, these models are usually stochastic and highly parametrized, requiring precise calibration for predictive performance. When considering realistic numbers of agents and properly accounting for stochasticity, this high dimensional calibration can be computationally prohibitive. This paper presents a random forest based surrogate modeling technique to accelerate the evaluation of ABMs and demonstrates its use to calibrate an epidemiological ABM named CityCOVID via Markov chain Monte Carlo (MCMC). The technique is first outlined in the context of CityCOVID’s quantities of interest, namely hospitalizations and deaths, by exploring dimensionality reduction via temporal decomposition with principal component analysis (PCA) and via sensitivity analysis. The calibration problem is then presented and samples are generated to best match COVID-19 hospitalization and death numbers in Chicago from March to June in 2020. These results are compared with previous approximate Bayesian calibration (IMABC) results and their predictive performance is analyzed showing improved performance with a reduction in computation.
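
A minimal sketch of the surrogate-accelerated calibration loop, assuming a toy stand-in for the ABM and a flat prior: a random forest is fit to a small design of model runs and then queried inside a Metropolis-Hastings sampler instead of the expensive simulator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy stand-in for an expensive ABM run: parameters -> a summary output
def run_abm(theta):
    return np.sin(theta[0]) + 0.1 * theta[1] ** 2

# 1) build the surrogate from a modest design of ABM evaluations
design = np.random.uniform(-2, 2, size=(200, 2))
outputs = np.array([run_abm(t) for t in design])
surrogate = RandomForestRegressor(n_estimators=200).fit(design, outputs)

# 2) Metropolis-Hastings on the cheap surrogate instead of the ABM
observed, noise_sd = 0.9, 0.1
def log_post(theta):
    pred = surrogate.predict(theta.reshape(1, -1))[0]
    return -0.5 * ((observed - pred) / noise_sd) ** 2   # flat prior on the box

theta, lp, samples = np.zeros(2), log_post(np.zeros(2)), []
for _ in range(5000):
    prop = theta + 0.2 * np.random.randn(2)
    lp_prop = log_post(prop)
    if np.log(np.random.rand()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta.copy())
print(np.mean(samples, axis=0))
```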

[LG-84] Stochastic First-Order Methods with Non-smooth and Non-Euclidean Proximal Terms for Nonconvex High-Dimensional Stochastic Optimization

链接: https://arxiv.org/abs/2406.19475
作者: Yue Xie,Jiawen Bi,Hongcheng Liu
关键词: stochastic first-order methods, large-scale problems, dimension-insensitive stochastic first-order, stochastic first-order, first-order methods
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When the nonconvex problem is complicated by stochasticity, the sample complexity of stochastic first-order methods may depend linearly on the problem dimension, which is undesirable for large-scale problems. In this work, we propose dimension-insensitive stochastic first-order methods (DISFOMs) to address nonconvex optimization with expected-valued objective function. Our algorithms allow for non-Euclidean and non-smooth distance functions as the proximal terms. Under mild assumptions, we show that DISFOM using minibatches to estimate the gradient enjoys sample complexity of \mathcalO ( (\log d) / \epsilon^4 ) to obtain an \epsilon -stationary point. Furthermore, we prove that DISFOM employing variance reduction can sharpen this bound to \mathcalO ( (\log d)^2/3/\epsilon^10/3 ) , which perhaps leads to the best-known sample complexity result in terms of d . We provide two choices of the non-smooth distance functions, both of which allow for closed-form solutions to the proximal step. Numerical experiments are conducted to illustrate the dimension insensitive property of the proposed frameworks.

[LG-85] Stock Volume Forecasting with Advanced Information by Conditional Variational Auto-Encoder

链接: https://arxiv.org/abs/2406.19414
作者: Parley R Yang,Alexander Y Shestopaloff
关键词: Conditional Variational Encoder, Variational Encoder, term forecasting tasks, Conditional Variational, daily stock volume
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Pricing of Securities (q-fin.PR); Applications (stat.AP); Machine Learning (stat.ML); Other Statistics (stat.OT)
*备注:

点击查看摘要

Abstract:We demonstrate the use of Conditional Variational Encoder (CVAE) to improve the forecasts of daily stock volume time series in both short and long term forecasting tasks, with the use of advanced information of input variables such as rebalancing dates. CVAE generates non-linear time series as out-of-sample forecasts, which have better accuracy and closer fit of correlation to the actual data, compared to traditional linear models. These generative forecasts can also be used for scenario generation, which aids interpretation. We further discuss correlations in non-stationary time series and other potential extensions from the CVAE forecasts.
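
A compact CVAE sketch in PyTorch, assuming the daily volume curve is a 78-dimensional vector and the conditioning vector carries advance information such as a rebalancing-date flag; both dimensions and the toy data are placeholders, not the paper's setup. Sampling several latent codes from the prior and decoding them yields a family of non-linear out-of-sample curves, which is what enables scenario generation.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal conditional VAE: x = volume curve, c = conditioning features."""
    def __init__(self, x_dim=78, c_dim=4, z_dim=8, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def loss_fn(x_hat, x, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean()
    kld = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)).mean()
    return recon + kld

# one toy training step on random data standing in for historical volume curves
model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, c = torch.randn(32, 78), torch.randn(32, 4)
x_hat, mu, logvar = model(x, c)
loss = loss_fn(x_hat, x, mu, logvar)
loss.backward(); opt.step()

# forecasting/scenario generation: sample z from the prior, decode with known advance info
with torch.no_grad():
    scenarios = model.dec(torch.cat([torch.randn(100, 8), c[:1].repeat(100, 1)], dim=-1))
```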

[LG-86] Temporal distribution of clusters of investors and their application in prediction with expert advice

链接: https://arxiv.org/abs/2406.19403
作者: Wojciech Wisniewski,Yuri Kalnishkan,David Lindsay,Siân Lindsay
关键词: brokers face, face a significant, Financial organisations, traders worldwide, Ewens’ Sampling Distribution
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 20 pages, technical report

点击查看摘要

Abstract:Financial organisations such as brokers face a significant challenge in servicing the investment needs of thousands of their traders worldwide. This task is further compounded since individual traders will have their own risk appetite and investment goals. Traders may look to capture short-term trends in the market which last only seconds to minutes, or they may have longer-term views which last several days to months. To reduce the complexity of this task, client trades can be clustered. By examining such clusters, we would likely observe many traders following common patterns of investment, but how do these patterns vary through time? Knowledge regarding the temporal distributions of such clusters may help financial institutions manage the overall portfolio of risk that accumulates from underlying trader positions. This study contributes to the field by demonstrating that the distribution of clusters derived from the real-world trades of 20k Foreign Exchange (FX) traders (from 2015 to 2017) is described in accordance with Ewens’ Sampling Distribution. Further, we show that the Aggregating Algorithm (AA), an on-line prediction with expert advice algorithm, can be applied to the aforementioned real-world data in order to improve the returns of portfolios of trader risk. However, we found that the AA "struggles" when presented with too many trader "experts", especially when there are many trades with similar overall patterns. To help overcome this challenge, we have applied and compared the use of Statistically Validated Networks (SVN) with a hierarchical clustering approach on a subset of the data, demonstrating that both approaches can be used to significantly improve results of the AA in terms of profitability and smoothness of returns.
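
The flavour of prediction with expert advice can be conveyed with a simplified exponential-weights aggregation over expert (cluster) returns; this is only a sketch of the general idea, not Vovk's full Aggregating Algorithm with its substitution function, and the loss and learning rate are assumptions.

```python
import numpy as np

def exp_weights_portfolio(expert_returns, eta=1.0):
    """Follow a weighted mixture of experts; downweight experts after poor returns."""
    T, n = expert_returns.shape
    w = np.ones(n) / n
    learner_returns = []
    for t in range(T):
        learner_returns.append(w @ expert_returns[t])   # mixture return this period
        loss = -expert_returns[t]                        # treat negative return as loss
        w = w * np.exp(-eta * loss)
        w /= w.sum()
    return np.array(learner_returns)

# toy usage: 3 expert clusters over 250 trading days
rets = 0.001 + 0.01 * np.random.randn(250, 3)
print(exp_weights_portfolio(rets).sum())
```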

[LG-87] Modelling financial volume curves with hierarchical Poisson processes

链接: https://arxiv.org/abs/2406.19402
作者: Creighton Heaukulani,Abhinav Pandey,Lancelot F. James
关键词: financial trading applications, trading volume curves, volume curves, expected volume curve, financial instruments
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Modeling the trading volume curves of financial instruments throughout the day is of key interest in financial trading applications. Predictions of these so-called volume profiles guide trade execution strategies, for example, a common strategy is to trade a desired quantity across many orders in line with the expected volume curve throughout the day so as not to impact the price of the instrument. The volume curves (for each day) are naturally grouped by stock and can be further gathered into higher-level groupings, such as by industry. In order to model such admixtures of volume curves, we introduce a hierarchical Poisson process model for the intensity functions of admixtures of inhomogeneous Poisson processes, which represent the trading times of the stock throughout the day. The model is based on the hierarchical Dirichlet process, and an efficient Markov Chain Monte Carlo (MCMC) algorithm is derived following the slice sampling framework for Bayesian nonparametric mixture models. We demonstrate the method on datasets of different stocks from the Trade and Quote repository maintained by Wharton Research Data Services, including the most liquid stock on the NASDAQ stock exchange, Apple, demonstrating the scalability of the approach.

[LG-88] Discretized Gradient Flow for Manifold Learning in the Space of Embeddings

链接: https://arxiv.org/abs/1901.09057
作者: Dara Gold,Steven Rosenberg
关键词: negative gradient flow, Gradient descent, Gradient, standard technique, technique in optimization
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient descent, or negative gradient flow, is a standard technique in optimization to find minima of functions. Many implementations of gradient descent rely on discretized versions, i.e., moving in the gradient direction for a set step size, recomputing the gradient, and continuing. In this paper, we present an approach to manifold learning where gradient descent takes place in the infinite dimensional space $\mathcal{E} = \mathrm{Emb}(M,\mathbb{R}^N)$ of smooth embeddings $\phi$ of a manifold $M$ into $\mathbb{R}^N$. Implementing a discretized version of gradient descent for $P:\mathcal{E}\to \mathbb{R}$, a penalty function that scores an embedding $\phi \in \mathcal{E}$, requires estimating how far we can move in a fixed direction (the direction of one gradient step) before leaving the space of smooth embeddings. Our main result is to give an explicit lower bound for this step length in terms of the Riemannian geometry of $\phi(M)$. In particular, we consider the case when the gradient of $P$ is pointwise normal to the embedded manifold $\phi(M)$. We prove this case arises when $P$ is invariant under diffeomorphisms of $M$, a natural condition in manifold learning.

信息检索

[IR-0] Interactive Topic Models with Optimal Transport

链接: https://arxiv.org/abs/2406.19928
作者: Garima Dhanania,Sheshera Mysore,Chau Minh Pham,Mohit Iyyer,Hamed Zamani,Andrew McCallum
关键词: analyze document collections, document collections, corpus, Topic, analyze document
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Pre-print; Work in progress

点击查看摘要

Abstract:Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high-level theoretical framework (e.g. political ideology). In these scenarios analysts desire a topic modeling approach which incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, an approach for label name supervised topic modeling. EdTM casts topic modeling as an assignment problem, leveraging LM/LLM-based document-topic affinities and using optimal transport to make globally coherent topic assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers, and topic models based on clustering and LDA. Further, we show EdTM’s ability to incorporate various forms of analyst feedback while remaining robust to noisy analyst inputs.
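
A minimal sketch of the optimal-transport assignment step, assuming document-topic affinities are already available and using entropic OT from the POT package; the cost construction and regularization strength are assumptions, not EdTM's exact choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# affinity[d, t]: affinity between document d and analyst-named topic t
# (placeholder scores; in EdTM these would come from an LM/LLM)
affinity = np.random.rand(100, 5)

a = np.full(100, 1 / 100)           # uniform mass over documents
b = np.full(5, 1 / 5)               # uniform mass over topics
cost = affinity.max() - affinity    # turn affinities into non-negative costs

plan = ot.sinkhorn(a, b, cost, reg=0.05)   # entropic OT gives a globally coherent coupling
topic_assignment = plan.argmax(axis=1)     # hard topic per document
print(topic_assignment[:10])
```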

[IR-1] Rateless Stochastic Coding for Delay-constrained Semantic Communication

链接: https://arxiv.org/abs/2406.19804
作者: Cheng Peng,Rulong Wang,Yong Xiao
关键词: balance between reliability, uncertain channels, joint source-channel, settle the balance, joint source-channel code
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We consider the problem of joint source-channel coding with distortion and perception constraints from a rateless perspective, the purpose of which is to settle the balance between reliability (distortion/perception) and effectiveness (rate) of transmission over uncertain channels. We find a new finite-blocklength bound for the achievable joint source-channel code rate with the above two constraints. To achieve a superior rateless characteristic of JSCC coding, we perform multi-level optimization on various finite-blocklength codes. Based on these two, we then propose a new JSCC coding scheme called rateless stochastic coding (RSC). We experimentally demonstrate that the proposed RSC can achieve variable rates of transmission maintaining an excellent trade-off between distortion and perception.

[IR-2] Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation

链接: https://arxiv.org/abs/2406.19760
作者: Chenlong Deng,Kelong Mao,Zhicheng Dou
关键词: upholding judicial fairness, Legal case retrieval, sourcing similar cases, Legal case, case retrieval
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.

[IR-3] Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search

链接: https://arxiv.org/abs/2406.19647
作者: Kaihao Li,Juexin Lin,Tony Lee
关键词: miss important keywords, e-commerce search engines, vocabulary mismatch, central challenge, miss important
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 1 figure, SIGIR 2024 Workshop on eCommerce

点击查看摘要

Abstract:Addressing the “vocabulary mismatch” issue in information retrieval is a central challenge for e-commerce search engines, because product pages often miss important keywords that customers search for. Doc2Query[1] is a popular document-expansion technique that predicts search queries for a document and includes the predicted queries with the document for retrieval. However, this approach can be inefficient for e-commerce search, because the predicted query tokens are often already present in the document. In this paper, we propose Doc2Token, a technique that predicts relevant tokens (instead of queries) that are missing from the document and includes these tokens in the document for retrieval. For the task of predicting missing tokens, we introduce a new metric, “novel ROUGE score”. Doc2Token is demonstrated to be superior to Doc2Query in terms of novel ROUGE score and diversity of predictions. Doc2Token also exhibits efficiency gains by reducing both training and inference times. We deployed the feature to production and observed significant revenue gain in an online A/B test, and launched the feature to full traffic on this http URL. [1] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, arXiv preprint arXiv:1904.08375 (2019)
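
The abstract does not define the metric precisely, but one plausible reading of a "novel ROUGE score" is ROUGE-1 recall restricted to query tokens absent from the product document; the sketch below implements that assumption, so treat it as an illustration rather than the paper's definition.

```python
def novel_rouge_1(predicted_tokens, reference_query_tokens, document_tokens):
    """ROUGE-1 recall computed only over reference tokens NOT already in the document
    (an assumed reading of the paper's "novel ROUGE score")."""
    doc = set(document_tokens)
    novel_ref = [t for t in reference_query_tokens if t not in doc]
    if not novel_ref:
        return 0.0
    hits = sum(1 for t in novel_ref if t in set(predicted_tokens))
    return hits / len(novel_ref)

doc = "wireless noise cancelling headphones black".split()
queries = "bluetooth over ear headphones".split()
preds = ["bluetooth", "wireless", "earbuds"]
print(novel_rouge_1(preds, queries, doc))  # ~0.33: only "bluetooth" among the novel tokens is recovered
```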

[IR-4] TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers

链接: https://arxiv.org/abs/2406.19526
作者: Majd Saleh,Sarra Baghdadi,Stéphane Paquelet
关键词: Natural Language Processing, Language Processing, Natural Language, holds paramount importance, field of Natural
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures

点击查看摘要

Abstract:Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.

人工智能

[AI-0] Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

链接: https://arxiv.org/abs/2406.20098
作者: Sukmin Yun,Haokun Lin,Rusiru Thushara,Mohammad Qazim Bhat,Yongxin Wang,Zutao Jiang,Mingkai Deng,Jinhong Wang,Tianhua Tao,Junbo Li,Haonan Li,Preslav Nakov,Timothy Baldwin,Zhengzhong Liu,Eric P. Xing,Xiaodan Liang,Zhiqiang Shen
关键词: Multimodal large language, shown impressive success, Multimodal large, HTML code, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Website at this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage’s HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs’ abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at this https URL.

[AI-1] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

链接: https://arxiv.org/abs/2406.20095
作者: Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
关键词: Large Language Models, Large Language, extensive world knowledge, strong reasoning skills, Vision Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at this https URL.

[AI-2] ProgressGym: Alignment with a Millennium of Moral Progress

链接: https://arxiv.org/abs/2406.20087
作者: Tianyi Qiu,Yang Zhang,Xuchuan Huang,Jasmine Xinze Li,Jiaming Ji,Yaodong Yang
关键词: large language models, including large language, Frontier AI systems, hold increasing influence, including large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at this https URL and this https URL respectively.

[AI-3] AI for Extreme Event Modeling and Understanding: Methodologies and Challenges

链接: https://arxiv.org/abs/2406.20080
作者: Gustau Camps-Valls,Miguel-Ángel Fernández-Torres,Kai-Hendrik Cohrs,Adrian Höhl,Andrea Castelletti,Aytac Pacal,Claire Robin,Francesco Martinuzzi,Ioannis Papoutsis,Ioannis Prapas,Jorge Pérez-Aracil,Katja Weigel,Maria Gonzalez-Calabuig,Markus Reichstein,Martin Rabel,Matteo Giuliani,Miguel Mahecha,Oana-Iuliana Popescu,Oscar J. Pellicer-Valero,Said Ouala,Sancho Salcedo-Sanz,Sebastian Sippel,Spyros Kondylatos,Tamara Happé,Tristan Williams
关键词: Earth system sciences, including Earth system, including Earth, Earth system, artificial intelligence
类目: Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:In recent years, artificial intelligence (AI) has deeply impacted various fields, including Earth system sciences. Here, AI improved weather forecasting, model emulation, parameter estimation, and the prediction of extreme events. However, the latter comes with specific challenges, such as developing accurate predictors from noisy, heterogeneous and limited annotated data. This paper reviews how AI is being used to analyze extreme events (like floods, droughts, wildfires and heatwaves), highlighting the importance of creating accurate, transparent, and reliable AI models. We discuss the hurdles of dealing with limited data, integrating information in real-time, deploying models, and making them understandable, all crucial for gaining the trust of stakeholders and meeting regulatory needs. We provide an overview of how AI can help identify and explain extreme events more effectively, improving disaster response and communication. We emphasize the need for collaboration across different fields to create AI solutions that are practical, understandable, and trustworthy for analyzing and predicting extreme events. Such collaborative efforts aim to enhance disaster readiness and disaster risk reduction.

[AI-4] Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification

链接: https://arxiv.org/abs/2406.20079
作者: Anisha Gunjal,Greg Durrett
关键词: large language model, Automatic factuality verification, Automatic factuality, language model, combat hallucinations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automatic factuality verification of large language model (LLM) generations is becoming more and more widely used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontextuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.

[AI-5] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

链接: https://arxiv.org/abs/2406.20053
作者: Danny Halawi,Alexander Wei,Eric Wallace,Tony T. Wang,Nika Haghtalab,Jacob Steinhardt
关键词: language models, finetuning, emerging interface, model, Black-box finetuning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

[AI-6] Electrostatics-based particle sampling and approximate inference

链接: https://arxiv.org/abs/2406.20044
作者: Yongchao Huang
关键词: Newton mechanics principles, electrostatics and Newton, Newton mechanics, based on electrostatics, mechanics principles
类目: Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A new particle-based sampling and approximate inference method, based on electrostatics and Newton mechanics principles, is introduced with theoretical ground, algorithm design and experimental validation. This method simulates an interacting particle system (IPS) where particles, i.e. the freely-moving negative charges and spatially-fixed positive charges with magnitudes proportional to the target distribution, interact with each other via attraction and repulsion induced by the resulting electric fields described by Poisson’s equation. The IPS evolves towards a steady-state where the distribution of negative charges conforms to the target distribution. This physics-inspired method offers deterministic, gradient-free sampling and inference, achieving comparable performance as other particle-based and MCMC methods in benchmark tasks of inferring complex densities, Bayesian logistic regression and dynamical system identification. A discrete-time, discrete-space algorithmic design, readily extendable to continuous time and space, is provided for usage in more general inference problems occurring in probabilistic machine learning scenarios such as Bayesian inference, generative modelling, and beyond.

[AI-7] BMW Agents – A Framework For Task Automation Through Multi-agent Collaboration

链接: https://arxiv.org/abs/2406.20041
作者: Noel Crawford,Edward B. Duffy,Iman Evazzade,Torsten Foehr,Gregory Robbins,Debbrata Kumar Saha,Jiya Varma,Marcin Ziolkowski
关键词: Large Language Models, Language Models, Large Language, driven by Large, offer enormous potential
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 24 pages. 21 PDF images

点击查看摘要

Abstract:Autonomous agents driven by Large Language Models (LLMs) offer enormous potential for automation. Early proof of this technology can be found in various demonstrations of agents solving complex tasks, interacting with external systems to augment their knowledge, and triggering actions. In particular, workflows involving multiple agents solving complex tasks in a collaborative fashion exemplify their capacity to operate in less strict and less well-defined environments. Thus, a multi-agent approach has great potential for serving as a backbone in many industrial applications, ranging from complex knowledge retrieval systems to next generation robotic process automation. Given the reasoning abilities within the current generation of LLMs, complex processes require a multi-step approach that includes a plan of well-defined and modular tasks. Depending on the level of complexity, these tasks can be executed either by a single agent or a group of agents. In this work, we focus on designing a flexible agent engineering framework with careful attention to planning and execution, capable of handling complex use case applications across various domains. The proposed framework provides reliability in industrial applications and presents techniques to ensure a scalable, flexible, and collaborative workflow for multiple autonomous agents working together towards solving tasks.

[AI-8] Pairwise Difference Learning for Classification

链接: https://arxiv.org/abs/2406.20031
作者: Mohamed Karim Belaid,Maximilian Rabus,Eyke Hüllermeier
关键词: Pairwise difference learning, Pairwise difference, recently been introduced, technique for regression, PDL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.
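
One way to concretize PDL for classification (an assumption for illustration, not necessarily the paper's exact construction) is to train a binary model on paired instances that predicts whether the two share a class, and then let every training example vote for the query's class, as sketched below.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# paired training set: concatenated features of two instances, label = "same class?"
idx = rng.integers(0, len(X), size=(5000, 2))
X_pairs = np.hstack([X[idx[:, 0]], X[idx[:, 1]]])
y_pairs = (y[idx[:, 0]] == y[idx[:, 1]]).astype(int)
pair_model = RandomForestClassifier(n_estimators=200).fit(X_pairs, y_pairs)

def predict(query, anchors_X, anchors_y):
    """Pair the query with every training anchor and average same-class probabilities per class."""
    pairs = np.hstack([np.repeat(query[None, :], len(anchors_X), axis=0), anchors_X])
    p_same = pair_model.predict_proba(pairs)[:, 1]
    classes = np.unique(anchors_y)
    scores = [p_same[anchors_y == c].mean() for c in classes]
    return classes[int(np.argmax(scores))]

print(predict(X[0], X, y), y[0])   # the two should usually agree
```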

[AI-9] ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

链接: https://arxiv.org/abs/2406.20015
作者: Yuxiang Zhang,Jing Chen,Junjie Wang,Yaxin Liu,Cheng Yang,Chufan Shi,Xinyu Zhu,Zihao Lin,Hanwen Wan,Yujiu Yang,Tetsuya Sakai,Tian Feng,Hayato Yamana
关键词: Tool-augmented large language, Tool-augmented large, large language models, real-world applications, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM’s hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve a total score of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play a crucial role in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.

[AI-10] Wavelets Are All You Need for Autoregressive Image Generation

链接: https://arxiv.org/abs/2406.19997
作者: Wael Mattar,Idan Levy,Nir Sharon,Shai Dekel
关键词: autoregressive image generation, main ingredients, approach to autoregressive, autoregressive image, wavelet image coding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 10 figures

点击查看摘要

Abstract:In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows tokenizing the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.

[AI-11] Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

链接: https://arxiv.org/abs/2406.19995
作者: Habib Hajimolahoseini,Mohammad Hassanpour,Foozhan Ataiefard,Boxing Chen,Yang Liu
关键词: Low Rank Decomposition, Progressive Low Rank, Progressive Low, Rank Decomposition, Low Rank
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
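
The core low-rank idea can be illustrated with truncated SVD of a single weight matrix at progressively lower ranks; PLRD's actual decompression schedule and continued training are not reproduced here, so this only shows how progressively lower-rank factors shrink a layer.

```python
import torch

def low_rank_factors(weight, rank):
    """Approximate a weight matrix by a rank-r product via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r)
    B = Vh[:rank, :]             # (r, in)
    return A, B

W = torch.randn(4096, 4096)      # a pre-trained layer's weight (placeholder)
for rank in (1024, 512, 256):    # progressively lower ranks -> smaller family members
    A, B = low_rank_factors(W, rank)
    rel_err = torch.norm(W - A @ B) / torch.norm(W)
    params = A.numel() + B.numel()
    print(f"rank {rank}: params {params / W.numel():.2%} of original, rel. error {rel_err:.3f}")
```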

[AI-12] Into the Unknown: Generating Geospatial Descriptions for New Environments

链接: https://arxiv.org/abs/2406.19967
作者: Tzuf Paz-Argaman,John Palowitch,Sayali Kulkarni,Reut Tsarfaty,Jason Baldridge
关键词: task requires reasoning, task requires, observer viewpoint, focus on bridging, bridging the gap
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer’s viewpoint) using non-sequential navigation instructions and maps. However, performance substantially drops in new environments with no training data. Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text resulting in low geolocation resolution. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments using readily available geospatial data. Our method constructs a grounded knowledge-graph, capturing entity relationships. Sampled entities and relations (“shop north of school”) generate navigation instructions via (i) generating numerous templates using context-free grammar (CFG) to embed specific entities and relations; (ii) feeding the entities and relation into a large language model (LLM) for instruction generation. A comprehensive evaluation on RVS showed that our approach improves the 100-meter accuracy by 45.83% on unseen environments. Furthermore, we demonstrate that models trained with CFG-based augmentation achieve superior performance compared with those trained with LLM-based augmentation, both in unseen and seen environments. These findings suggest that explicitly structuring spatial information for text-based geospatial reasoning in previously unknown environments can unlock data-scarce scenarios.
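
A toy illustration of CFG-based template generation, with made-up entities and relations standing in for those sampled from the grounded knowledge graph; the grammar itself is an assumption for illustration.

```python
import random

# a tiny context-free grammar for spatial navigation templates
grammar = {
    "S":        [["the ", "ENTITY1", " is ", "RELATION", " the ", "ENTITY2", "."],
                 ["head to the ", "ENTITY1", ", which lies ", "RELATION", " the ", "ENTITY2", "."]],
    "ENTITY1":  [["shop"], ["pharmacy"], ["cafe"]],
    "ENTITY2":  [["school"], ["park"]],
    "RELATION": [["north of"], ["two blocks east of"], ["across from"]],
}

def expand(symbol):
    if symbol not in grammar:                 # terminal string
        return symbol
    production = random.choice(grammar[symbol])
    return "".join(expand(s) for s in production)

for _ in range(3):
    print(expand("S"))   # e.g. "the shop is north of the school."
```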

[AI-13] Text2Robot: Evolutionary Robot Design from Text Descriptions

链接: https://arxiv.org/abs/2406.19963
作者: Ryan P. Ringel,Zachary S. Charlick,Jiaxun Liu,Boxi Xia,Boyuan Chen
关键词: costly and labor-intensive, traditionally been costly, Robot design, Abstract, design
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Our project website is at: this https URL

点击查看摘要

Abstract:Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.

[AI-14] From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

链接: https://arxiv.org/abs/2406.19934
作者: Chuanqi Cheng,Jian Guan,Wei Wu,Rui Yan
关键词: explore multi-step reasoning, reasoning, explore multi-step, multi-step reasoning, visual
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we construct 50k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at this https URL.

[AI-15] Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition

链接: https://arxiv.org/abs/2406.19931
作者: Xinghao Wu,Xuefeng Liu,Jianwei Niu,Haolin Wang,Shaojie Tang,Guogang Zhu,Hao Su
关键词: Personalized Federated Learning, Federated Learning, Personalized Federated, client-specific knowledge, decouple general knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods by up to 4.9%.
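
A minimal sketch of the additive decomposition: each weight is the sum of a shared full matrix and a personalized low-rank product, trained here with a simple alternation between the two parameter groups. The rank, learning rates, and alternation schedule are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    """Linear layer whose weight = shared full-rank part + personalized low-rank part."""
    def __init__(self, in_dim, out_dim, rank=4):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)  # aggregated by the server
        self.A = nn.Parameter(torch.zeros(out_dim, rank))                # kept on the client
        self.B = nn.Parameter(torch.randn(rank, in_dim) * 0.01)          # kept on the client
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        weight = self.shared + self.A @ self.B   # shared + personalized low-rank
        return x @ weight.t() + self.bias

layer = DecomposedLinear(32, 16)
opt_personal = torch.optim.SGD([layer.A, layer.B], lr=0.1)
opt_shared = torch.optim.SGD([layer.shared, layer.bias], lr=0.1)

x, y = torch.randn(8, 32), torch.randn(8, 16)
# one alternation round: first update only the personalized factors, then only the shared part
for opt in (opt_personal, opt_shared):
    opt.zero_grad()
    loss = ((layer(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```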

[AI-16] AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models Generating Benchmark Dataset

链接: https://arxiv.org/abs/2406.19896
作者: Bhaskar Joshi,Sepideh HajiHossein Khani,Arash HabibiLashkari
关键词: origin and behavior, Source Code, Code, Code Authorship Attribution, Source Code Authorship
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Source Code Authorship Attribution (SCAA) is crucial for software classification because it provides insights into the origin and behavior of software. By accurately identifying the author or group behind a piece of code, experts can better understand the motivations and techniques of developers. In the cybersecurity era, this attribution helps trace the source of malicious software, identify patterns in the code that may indicate specific threat actors or groups, and ultimately enhance threat intelligence and mitigation strategies. This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA, focusing on lexical, semantic, syntactic, and N-gram features. Our research explores author identification in C++ by examining 24,000 source code samples from 3,000 authors. Our methodology integrates Random Forest, Gradient Boosting, and XGBoost models, enhanced with SHAP for interpretability. The study demonstrates how ensemble models can effectively discern individual coding styles, offering insights into the unique attributes of code authorship. This approach is pivotal in understanding and interpreting complex patterns in authorship attribution, especially for malware classification.

[AI-17] Fine-tuning of Geospatial Foundation Models for Aboveground Biomass Estimation

链接: https://arxiv.org/abs/2406.19888
作者: Michal Muszynski,Levente Klein,Ademir Ferreira da Silva,Anjani Prasad Atluri,Carlos Gomes,Daniela Szwarcman,Gurkanwar Singh,Kewen Gu,Maciel Zortea,Naomi Simumba,Paolo Fraccaro,Shraddha Singh,Steve Meliksetian,Campbell Watson,Daiki Kimura,Harini Srinivasan
关键词: carbon sequestration initiatives, nature-based carbon sequestration, vegetation structure mapping, global carbon cycle, Global vegetation structure
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global vegetation structure mapping is critical for understanding the global carbon cycle and maximizing the efficacy of nature-based carbon sequestration initiatives. Moreover, vegetation structure mapping can help reduce the impacts of climate change by, for example, guiding actions to improve water security, increase biodiversity and reduce flood risk. Global satellite measurements provide an important set of observations for monitoring and managing deforestation and degradation of existing forests, natural forest regeneration, reforestation, biodiversity restoration, and the implementation of sustainable agricultural practices. In this paper, we explore the effectiveness of fine-tuning of a geospatial foundation model to estimate above-ground biomass (AGB) using space-borne data collected across different eco-regions in Brazil. The fine-tuned model architecture consisted of a Swin-B transformer as the encoder (i.e., backbone) and a single convolutional layer for the decoder head. All results were compared to a U-Net which was trained as the baseline model. Experimental results of this sparse-label prediction task demonstrate that the fine-tuned geospatial foundation model with a frozen encoder has comparable performance to a U-Net trained from scratch. This is despite the fine-tuned model having 13 times fewer parameters requiring optimization, which saves both time and compute resources. Further, we explore the transfer-learning capabilities of the geospatial foundation models by fine-tuning on satellite imagery with sparse labels from different eco-regions in Brazil.
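
The frozen-encoder setup can be sketched as follows; a dummy convolutional stack stands in for the pre-trained Swin-B geospatial backbone (which would be loaded from a checkpoint in practice), and a single convolution regresses biomass from its features under a sparse-label mask. Band counts, resolutions, and label sparsity below are placeholder assumptions.

```python
import torch
import torch.nn as nn

# placeholder encoder standing in for the pre-trained Swin-B geospatial foundation model
encoder = nn.Sequential(
    nn.Conv2d(6, 128, kernel_size=4, stride=4),   # 6 spectral bands -> feature map
    nn.GELU(),
    nn.Conv2d(128, 256, kernel_size=2, stride=2),
)
for p in encoder.parameters():
    p.requires_grad = False                        # frozen encoder: only the head is tuned

head = nn.Conv2d(256, 1, kernel_size=1)            # single conv layer regressing AGB per pixel
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

imgs = torch.randn(4, 6, 224, 224)                 # satellite patches (placeholder data)
agb = torch.rand(4, 1, 28, 28) * 300               # biomass targets at feature resolution (placeholder)
mask = torch.rand(4, 1, 28, 28) < 0.05             # only ~5% of pixels carry labels (sparse supervision)

pred = head(encoder(imgs))
loss = ((pred[mask] - agb[mask]) ** 2).mean()
loss.backward(); opt.step()
```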

[AI-18] Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood

链接: https://arxiv.org/abs/2406.19874
作者: Yang Xu,Yu Wang,Hao An,Zhichen Liu,Yongyuan Li
关键词: distinguished by examining, examining the magnitude, Human and model-generated, model-generated texts, Abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, it is becoming increasingly difficult as language model’s capabilities of generating human-like texts keep evolving. This study provides a new perspective by using the relative likelihood values instead of absolute ones, and extracting useful features from the spectrum-view of likelihood for the human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, respectively, which results in competitive performances with previous zero-shot detection methods and a new state-of-the-art on short-text detection. Our method can also reveal subtle differences between human and model languages, which find theoretical roots in psycholinguistics studies. Our code is available at this https URL
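
A rough sketch of spectrum features over per-token log-likelihoods using GPT-2; how the paper defines its relative likelihood and spectrum view is not reproduced exactly, so the centering step and Fourier feature choice below are assumptions.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loglik_spectrum(text, n_freq=16):
    """Per-token log-likelihoods -> centered ("relative") sequence -> low-frequency magnitudes."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, :-1]                        # predict token t+1 from its prefix
    logprobs = torch.log_softmax(logits, dim=-1)
    token_ll = logprobs.gather(1, ids[0, 1:, None]).squeeze(1).numpy()
    rel = token_ll - token_ll.mean()                           # relative rather than absolute likelihood
    return np.abs(np.fft.rfft(rel, n=2 * n_freq))[:n_freq]     # spectrum-view features

print(loglik_spectrum("The quick brown fox jumps over the lazy dog."))
```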

[AI-19] MetaDesigner: Advancing Artistic Typography through AI-Driven User-Centric and Multilingual WordArt Synthesis

链接: https://arxiv.org/abs/2406.19859
作者: Jun-Yan He,Zhi-Qi Cheng,Chenyang Li,Jingdong Sun,Qi He,Wangmeng Xiang,Hanyuan Chen,Jin-Peng Lan,Xianhui Lin,Kang Zhu,Bin Luo,Yifeng Geng,Xuansong Xie,Alexander G. Hauptmann
关键词: Large Language Models, revolutionizes artistic typography, artistic typography synthesis, Large Language, design paradigm centered
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: 18 pages, 16 figures, Project: this https URL

点击查看摘要

Abstract:MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner’s capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results.

[AI-20] YuLan: An Open-source Large Language Model

链接: https://arxiv.org/abs/2406.19853
作者: Yutao Zhu,Kun Zhou,Kelong Mao,Wentong Chen,Yiding Sun,Zhipeng Chen,Qian Cao,Yihan Wu,Yushuo Chen,Feng Wang,Lei Zhang,Junyi Li,Xiaolei Wang,Lei Wang,Beichen Zhang,Zican Dong,Xiaoxue Cheng,Yuhan Chen,Xinyu Tang,Yupeng Hou,Qiangqiang Ren,Xincheng Pang,Shufang Xie,Wayne Xin Zhao,Zhicheng Dou,Jiaxin Mao,Yankai Lin,Ruihua Song,Jun Xu,Xu Chen,Rui Yan,Zhewei Wei,Di Hu,Wenbing Huang,Ze-Feng Gao,Yueguo Chen,Weizheng Lu,Ji-Rong Wen
关键词: Large language models, understanding natural language, Large language, natural language, leveraging their extensive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan’s overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan’s training was finished in January 2024, and it has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at this https URL.

[AI-21] AnomaLLMy – Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions

链接: https://arxiv.org/abs/2406.19840
作者: Waligóra Witold
关键词: black-box Large Language, Large Language Models, Large Language, paper introduces AnomaLLMy, black-box Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:This paper introduces AnomaLLMy, a novel technique for the automatic detection of anomalous tokens in black-box Large Language Models (LLMs) with API-only access. Utilizing low-confidence single-token predictions as a cost-effective indicator, AnomaLLMy identifies irregularities in model behavior, addressing the issue of anomalous tokens degrading the quality and reliability of models. Validated on the cl100k_base dataset, the token set of GPT-4, AnomaLLMy detected 413 major and 65 minor anomalies, demonstrating the method’s efficiency with just $24.39 spent in API credits. The insights from this research are expected to be beneficial for enhancing the robustness and accuracy of LLMs, particularly in the development and assessment of tokenizers.
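
The detection idea can be sketched with a hypothetical wrapper around a black-box API that returns the top-1 next-token log-probability; the prompt template, threshold, and API interface below are all assumptions, not the paper's or any particular vendor's actual interface.

```python
import math

def top1_probability(model_api, prompt):
    """Hypothetical wrapper: `model_api` is assumed to return the log-probability of
    the single most likely next token for a prompt (e.g. via an API's logprobs field)."""
    return math.exp(model_api(prompt))

def scan_for_anomalous_tokens(model_api, tokens, threshold=0.2,
                              template="Please repeat exactly: {}"):
    """Flag tokens the model is unusually unsure about when asked to simply repeat them;
    low top-1 confidence serves as a cheap anomaly indicator."""
    flagged = []
    for t in tokens:
        p = top1_probability(model_api, template.format(t))
        if p < threshold:
            flagged.append((t, p))
    return flagged

# toy stand-in for the black-box model: confident on normal tokens, unsure on one glitch token
fake_api = lambda prompt: math.log(0.05) if "SolidGoldMagikarp" in prompt else math.log(0.95)
print(scan_for_anomalous_tokens(fake_api, ["hello", "SolidGoldMagikarp", "world"]))
```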

[AI-22] BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering

链接: https://arxiv.org/abs/2406.19820
作者: Zheng Chu,Jingchang Chen,Qianglong Chen,Haotian Wang,Kun Zhu,Xiyuan Du,Weijiang Yu,Ming Liu,Bing Qin
关键词: Large language models, Large language, strong reasoning capabilities, demonstrated strong reasoning, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of the question. Concretely, we parse the complex questions into trees, which include atomic and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
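
A toy sketch of the aggregation step: answer candidates with probabilities from several knowledge sources are merged by summing scores and keeping the top beam. Normalization and the way the paper actually combines candidates across hops are assumptions here.

```python
from collections import defaultdict

def aggregate_answers(candidate_lists, beam_size=2):
    """Sum probabilities of identical (normalized) answers across sources and keep the top beam."""
    scores = defaultdict(float)
    for candidates in candidate_lists:
        for answer, prob in candidates:
            scores[answer.strip().lower()] += prob
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:beam_size]

# toy usage: the same atomic question answered from three knowledge sources
wiki  = [("Paris", 0.7), ("Lyon", 0.2)]
table = [("Paris", 0.6), ("Marseille", 0.3)]
llm   = [("paris", 0.5), ("Lyon", 0.4)]
print(aggregate_answers([wiki, table, llm]))   # approximately [('paris', 1.8), ('lyon', 0.6)]
```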

[AI-23] Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features

链接: https://arxiv.org/abs/2406.19815
作者: Feng Liu,Qing Xu,Qijian Zheng
关键词: hot topic, skeletal motions, Adversarial attack, adversarial attack method, skeletal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Adversarial attack on skeletal motion is a hot topic. However, existing research only considers part of the dynamic features when measuring the distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method to attack action recognizers for skeletal motions. Firstly, our method systematically proposes a dynamic distance function to measure the difference between skeletal motions. Meanwhile, we innovatively introduce emotional features for complementary information. In addition, we use the Alternating Direction Method of Multipliers (ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by l-norms is the same, the dynamic perturbations generated by our method are much lower than those of other methods. What’s more, we are the first to prove the effectiveness of emotional features, and provide a new idea for measuring the distance between skeletal motions.

[AI-24] Fuzzy Logic Guided Reward Function Variation: An Oracle for Testing Reinforcement Learning Programs

链接: https://arxiv.org/abs/2406.19812
作者: Shiyu Zhang,Haoyang Song,Qixin Wang,Yu Pei
关键词: Reinforcement Learning, gained significant attention, gained significant, significant attention, oracle
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Reinforcement Learning (RL) has gained significant attention across various domains. However, the increasing complexity of RL programs presents testing challenges, particularly the oracle problem: defining the correctness of the RL program. Conventional human oracles struggle to cope with the complexity, leading to inefficiencies and potential unreliability in RL testing. To alleviate this problem, we propose an automated oracle approach that leverages RL properties using fuzzy logic. Our oracle quantifies an agent’s behavioral compliance with reward policies and analyzes its trend over training episodes. It labels an RL program as “Buggy” if the compliance trend violates expectations derived from RL characteristics. We evaluate our oracle on RL programs with varying complexities and compare it with human oracles. Results show that while human oracles perform well in simpler testing scenarios, our fuzzy oracle demonstrates superior performance in complex environments. The proposed approach shows promise in addressing the oracle problem for RL testing, particularly in complex cases where manual testing falls short. It offers a potential solution to improve the efficiency, reliability, and scalability of RL program testing. This research takes a step towards automated testing of RL programs and highlights the potential of fuzzy logic-based oracles in tackling the oracle problem.
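
A toy sketch of the fuzzy-logic oracle idea described above: map each episode's measured compliance with the reward policy to a fuzzy membership degree, track the trend over training, and flag the program as "Buggy" if the trend fails an expected-improvement check. The trapezoidal membership function and the monotone-trend expectation are illustrative assumptions, not the paper's exact rules.

```python
# Toy sketch of a fuzzy-logic test oracle for an RL program.
import numpy as np

def compliance_membership(compliance: float) -> float:
    """Fuzzy degree of 'behaves according to the reward policy' in [0, 1]."""
    # Below 0.2 -> clearly non-compliant, above 0.8 -> clearly compliant.
    return float(np.clip((compliance - 0.2) / 0.6, 0.0, 1.0))

def oracle_verdict(per_episode_compliance, min_slope=1e-3):
    degrees = np.array([compliance_membership(c) for c in per_episode_compliance])
    episodes = np.arange(len(degrees))
    # Fit the compliance trend over training episodes.
    slope = np.polyfit(episodes, degrees, deg=1)[0]
    # RL characteristics suggest compliance should improve during training.
    return "Buggy" if slope < min_slope else "Likely OK"

# Example: compliance that stagnates near zero should be flagged.
print(oracle_verdict([0.1, 0.12, 0.09, 0.11, 0.1]))   # -> "Buggy"
```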

[AI-25] Deceptive Diffusion: Generating Synthetic Adversarial Examples

链接: https://arxiv.org/abs/2406.19807
作者: Lucas Beerens,Catherine F. Higham,Desmond J. Higham
关键词: produce adversarial images, deceptive diffusion, deceptive diffusion model, introduce the concept, diffusion model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce the concept of deceptive diffusion – training a generative AI model to produce adversarial images. Whereas a traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassificaton, the deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images. Deceptive diffusion offers the possibility of strengthening defence algorithms by providing adversarial training data at scale, including types of misclassification that are otherwise difficult to find. In our experiments, we also investigate the effect of training on a partially attacked data set. This highlights a new type of vulnerability for generative diffusion models: if an attacker is able to stealthily poison a portion of the training data, then the resulting diffusion model will generate a similar proportion of misleading outputs.

[AI-26] Self-Supervised Spatial-Temporal Normality Learning for Time Series Anomaly Detection

链接: https://arxiv.org/abs/2406.19770
作者: Yutong Chen,Hongzuo Xu,Guansong Pang,Hezhe Qiao,Yuan Zhou,Mingsheng Shang
关键词: Series Anomaly Detection, Time Series Anomaly, Anomaly Detection, finds widespread applications, time series data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 4 figures, accepted in ECML PKDD2024

点击查看摘要

Abstract:Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at this https URL.

[AI-27] xSemAD: Explainable Semantic Anomaly Detection in Event Logs Using Sequence-to-Sequence Models

链接: https://arxiv.org/abs/2406.19763
作者: Kiran Busch,Timotheus Kampik,Henrik Leopold
关键词: anomaly detection methods, semantic anomaly detection, anomaly detection, detection methods, anomaly
类目: Artificial Intelligence (cs.AI)
*备注: Accepted at BPM 2024

点击查看摘要

Abstract:The identification of undesirable behavior in event logs is an important aspect of process mining that is often addressed by anomaly detection methods. Traditional anomaly detection methods tend to focus on statistically rare behavior and neglect the subtle difference between rarity and undesirability. The introduction of semantic anomaly detection has opened a promising avenue by identifying semantically deviant behavior. This work addresses a gap in semantic anomaly detection, which typically indicates the occurrence of an anomaly without explaining the nature of the anomaly. We propose xSemAD, an approach that uses a sequence-to-sequence model to go beyond pure identification and provides extended explanations. In essence, our approach learns constraints from a given process model repository and then checks whether these constraints hold in the considered event log. This approach not only helps understand the specifics of the undesired behavior, but also facilitates targeted corrective actions. Our experiments demonstrate that our approach outperforms existing state-of-the-art semantic anomaly detection methods.
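
A minimal sketch of the constraint-checking half described above, using DECLARE-style "response" constraints ("if A occurs, B must occur later"). The constraint format is an assumption borrowed from declarative process modelling; the paper's sequence-to-sequence model would be the component that produces such constraints from a model repository.

```python
# Minimal sketch of checking learned, human-readable constraints against an event log.

def violates_response(trace, a, b):
    """True if activity `a` occurs in the trace without `b` occurring afterwards."""
    for i, activity in enumerate(trace):
        if activity == a and b not in trace[i + 1:]:
            return True
    return False

def check_log(event_log, constraints):
    findings = []
    for case_id, trace in event_log.items():
        for (a, b) in constraints:
            if violates_response(trace, a, b):
                findings.append(f"case {case_id}: '{a}' not followed by '{b}'")
    return findings

log = {"c1": ["create order", "approve", "ship"],
       "c2": ["create order", "ship"]}               # missing approval
print(check_log(log, [("create order", "approve")]))
```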

[AI-28] Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train

链接: https://arxiv.org/abs/2406.19756
作者: Haojun Jiang,Meng Li,Zhenguo Sun,Ning Jia,Yu Sun,Shaqi Luo,Shiji Song,Gao Huang
关键词: cardiac ultrasound images, acquisition cardiac ultrasound, ultrasound images, heart leads, leads to significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Technical report

点击查看摘要

Abstract:The complex structure of the heart leads to significant challenges in echocardiography, especially in the acquisition of cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we innovatively propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on the test set with 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits the scanning.

[AI-29] ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

链接: https://arxiv.org/abs/2406.19741
作者: Christopher E. Mower,Yuhui Wan,Hongzhan Yu,Antoine Grosnit,Jonas Gonzalez-Billandon,Matthieu Zimmer,Jinlong Wang,Xinyu Zhang,Yao Zhao,Anbang Zhai,Puze Liu,Davide Tateo,Cesar Cadena,Marco Hutter,Jan Peters,Guangjian Tian,Yuzheng Zhuang,Kun Shao,Xingyue Quan,Jianye Hao,Jun Wang,Haitham Bou-Ammar
关键词: Robot Operating System, leveraging natural language, intuitive robot programming, natural language prompts, Operating System
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: This document contains 26 pages and 13 figures

点击查看摘要

Abstract:We present a framework for intuitive robot programming by non-experts, leveraging natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface. Key features of the framework include: integration of ROS with an AI agent connected to a plethora of open-source and commercial LLMs, automatic extraction of a behavior from the LLM output and execution of ROS actions/services, support for three behavior modes (sequence, behavior tree, state machine), imitation learning for adding new robot actions to the library of possible actions, and LLM reflection via human and environment feedback. Extensive experiments validate the framework, showcasing robustness, scalability, and versatility in diverse scenarios, including long-horizon tasks, tabletop rearrangements, and remote supervisory control. To facilitate the adoption of our framework and support the reproduction of our results, we have made our code open-source. You can access it at: this https URL.

[AI-30] MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

链接: https://arxiv.org/abs/2406.19736
作者: Jihao Liu,Xin Huang,Jinliang Zheng,Boxiao Liu,Jia Wang,Osamu Yoshie,Yu Liu,Hongsheng Li
关键词: high-quality visual instruction, visual instruction data, visual instruction, instruction-following capabilities, instruction data designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Dataset and models are available at this https URL

点击查看摘要

Abstract:This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at this https URL.

[AI-31] CUPID: Improving Battle Fairness and Position Satisfaction in Online MOBA Games with a Re-matchmaking System

链接: https://arxiv.org/abs/2406.19720
作者: Ge Fan,Chaoyun Zhang,Kai Wang,Yingjie Li,Junyang Chen,Zenglin Xu
关键词: attracting considerable research, considerable research interest, online battle arena, multiplayer online battle, gained significant popularity
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 38 pages, accepted by CSCW 24

点击查看摘要

Abstract:The multiplayer online battle arena (MOBA) genre has gained significant popularity and economic success, attracting considerable research interest within the Human-Computer Interaction community. Enhancing the gaming experience requires a deep understanding of player behavior, and a crucial aspect of MOBA games is matchmaking, which aims to assemble teams of comparable skill levels. However, existing matchmaking systems often neglect important factors such as players’ position preferences and team assignment, resulting in imbalanced matches and reduced player satisfaction. To address these limitations, this paper proposes a novel framework called CUPID, which introduces a novel process called "re-matchmaking" to optimize team and position assignments to improve both fairness and player satisfaction. CUPID incorporates a pre-filtering step to ensure a minimum level of matchmaking quality, followed by a pre-match win-rate prediction model that evaluates the fairness of potential assignments. By simultaneously considering players’ position satisfaction and game fairness, CUPID aims to provide an enhanced matchmaking experience. Extensive experiments were conducted on two large-scale, real-world MOBA datasets to validate the effectiveness of CUPID. The results surpass all existing state-of-the-art baselines, with an average relative improvement of 7.18% in terms of win prediction accuracy. Furthermore, CUPID has been successfully deployed in a popular online mobile MOBA game. The deployment resulted in significant improvements in match fairness and player satisfaction, as evidenced by critical Human-Computer Interaction (HCI) metrics covering usability, accessibility, and engagement, observed through A/B testing. To the best of our knowledge, CUPID is the first re-matchmaking system designed specifically for large-scale MOBA games.

[AI-32] Uncertainty Quantification in Large Language Models Through Convex Hull Analysis

链接: https://arxiv.org/abs/2406.19712
作者: Ferhat Ozgur Catak,Murat Kuzlu
关键词: requiring reliable outputs, Uncertainty quantification approaches, large language models, high-risk applications requiring, applications requiring reliable
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 17 pages

点击查看摘要

Abstract:Uncertainty quantification approaches have become increasingly critical in large language models (LLMs), particularly in high-risk applications requiring reliable outputs. However, traditional methods for uncertainty quantification, such as probabilistic models and ensemble techniques, face challenges when applied to the complex and high-dimensional nature of LLM-generated outputs. This study proposes a novel geometric approach to uncertainty quantification using convex hull analysis. The proposed method leverages the spatial properties of response embeddings to measure the dispersion and variability of model outputs. The prompts are categorized into three types, i.e., 'easy', 'moderate', and 'confusing', to generate multiple responses using different LLMs at varying temperature settings. The responses are transformed into high-dimensional embeddings via a BERT model and subsequently projected into a two-dimensional space using Principal Component Analysis (PCA). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is utilized to cluster the embeddings and compute the convex hull for each selected cluster. The experimental results indicate that the uncertainty of the model for LLMs depends on the prompt complexity, the model, and the temperature setting.
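
The pipeline described above (embed responses, PCA to 2D, DBSCAN clusters, convex hull per cluster) maps directly onto standard libraries. Below is a hedged sketch under the assumption that total hull area serves as the dispersion score; `embed_responses` stands in for the BERT embedding step and is not the paper's code.

```python
# Sketch of a convex-hull dispersion measure for sampled LLM responses.
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

def embed_responses(responses):
    """Placeholder for the BERT embedding of each response (n x d matrix)."""
    raise NotImplementedError

def convex_hull_uncertainty(responses, eps=0.5, min_samples=3):
    points = PCA(n_components=2).fit_transform(embed_responses(responses))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    areas = []
    for label in set(labels) - {-1}:                  # -1 marks DBSCAN noise
        cluster = points[labels == label]
        if len(cluster) >= 3:                         # a 2D hull needs >= 3 points
            areas.append(ConvexHull(cluster).volume)  # .volume is area in 2D
    # Larger / more dispersed hulls indicate higher output variability.
    return sum(areas)
```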

[AI-33] A Differentiable Approach to Multi-scale Brain Modeling

链接: https://arxiv.org/abs/2406.19708
作者: Chaoming Wang,Muyang Lyu,Tianqiu Zhang,Sichao He,Si Wu
关键词: modeling workflow utilizing, brain modeling workflow, combines accurate brain, workflow utilizing BrainPy, powerful gradient-based optimization
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neurons and Cognition (q-bio.NC)
*备注: 2nd Differentiable Almost Everything Workshop at ICML 2024

点击查看摘要

Abstract:We present a multi-scale differentiable brain modeling workflow utilizing BrainPy, a unique differentiable brain simulator that combines accurate brain simulation with powerful gradient-based optimization. We leverage this capability of BrainPy across different brain scales. At the single-neuron level, we implement differentiable neuron models and employ gradient methods to optimize their fit to electrophysiological data. On the network level, we incorporate connectomic data to construct biologically constrained network models. Finally, to replicate animal behavior, we train these models on cognitive tasks using gradient-based learning rules. Experiments demonstrate that our approach achieves superior performance and speed in fitting generalized leaky integrate-and-fire and Hodgkin-Huxley single neuron models. Additionally, training a biologically-informed network of excitatory and inhibitory spiking neurons on working memory tasks successfully replicates observed neural activity and synaptic weight distributions. Overall, our differentiable multi-scale simulation approach offers a promising tool to bridge neuroscience data across electrophysiological, anatomical, and behavioral scales.

[AI-34] DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems

链接: https://arxiv.org/abs/2406.19705
作者: Kexiong Yu,Hang Zhao,Yuhang Huang,Renjiao Yi,Kai Xu,Chenyang Zhu
关键词: demanding time-sensitive response, numerous practical applications, Combinatorial Optimization, entailing enormous solution, Combinatorial Optimization problems
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Combinatorial Optimization (CO) problems are fundamentally crucial in numerous practical applications across diverse industries, characterized by entailing enormous solution space and demanding time-sensitive response. Despite significant advancements made by recent neural solvers, their limited expressiveness does not conform well to the multi-modal nature of CO landscapes. While some research has pivoted towards diffusion models, they require simulating a Markov chain with many steps to produce a sample, which is time-consuming and does not meet the efficiency requirement of real applications, especially at scale. We propose DISCO, an efficient DIffusion Solver for Combinatorial Optimization problems that excels in both solution quality and inference speed. DISCO’s efficacy is two-pronged: Firstly, it achieves rapid denoising of solutions through an analytically solvable form, allowing for direct sampling from the solution space with very few reverse-time steps, thereby drastically reducing inference time. Secondly, DISCO enhances solution quality by restricting the sampling space to a more constrained, meaningful domain guided by solution residues, while still preserving the inherent multi-modality of the output probabilistic distributions. DISCO achieves state-of-the-art results on very large Traveling Salesman Problems with 10000 nodes and challenging Maximal Independent Set benchmarks, with its per-instance denoising time up to 44.8 times faster. Through further combining a divide-and-conquer strategy, DISCO can be generalized to solve arbitrary-scale problem instances off the shelf, even outperforming models trained specifically on corresponding scales.

[AI-35] Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation

链接: https://arxiv.org/abs/2406.19690
作者: Niful Islam,Mohaiminul Islam Bhuiyan,Jarin Tasnim Raya,Nur Shazwani Kamarudin,Khan Md Hasib,M. F. Mridha,Dewan Md. Farid
关键词: brain tumor classification, accurate brain tumor, brain tumor, early stage, early death
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain tumors are one of the most common diseases that lead to early death if not diagnosed at an early stage. Traditional diagnostic approaches are extremely time-consuming and prone to errors. In this context, computer vision-based approaches have emerged as an effective tool for accurate brain tumor classification. While some of the existing solutions demonstrate noteworthy accuracy, the models become infeasible to deploy in areas where computational resources are limited. This research addresses the need for accurate and fast classification of brain tumors with a priority of deploying the model in technologically underdeveloped regions. The research presents a novel architecture for precise brain tumor classification fusing pretrained ResNet152V2 and modified VGG16 models. The proposed architecture undergoes a diligent fine-tuning process that ensures fine gradients are preserved in deep neural networks, which are essential for effective brain tumor classification. The proposed solution incorporates various image processing techniques to improve image quality and achieves an astounding accuracy of 98.36% and 98.04% in Figshare and Kaggle datasets respectively. This architecture stands out for having a streamlined profile, with only 2.8 million trainable parameters. We have leveraged 8-bit quantization to produce a model of size 73.881 MB, significantly reducing it from the previous size of 289.45 MB, ensuring smooth deployment in edge devices even in resource-constrained areas. Additionally, the use of Grad-CAM improves the interpretability of the model, offering insightful information regarding its decision-making process. Owing to its high discriminative ability, this model can be a reliable option for accurate brain tumor classification.
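
A hedged Keras sketch of the two-backbone fusion idea: concatenate pooled features from pretrained ResNet152V2 and VGG16 and classify with a small head. The paper's specific VGG16 modifications, fine-tuning schedule, and 8-bit quantization step are not reproduced here; the head sizes are illustrative.

```python
# Minimal sketch of fusing two pretrained backbones for classification.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet152V2, VGG16

def build_fusion_model(num_classes=4, input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)
    resnet = ResNet152V2(include_top=False, weights="imagenet",
                         input_shape=input_shape, pooling="avg")
    vgg = VGG16(include_top=False, weights="imagenet",
                input_shape=input_shape, pooling="avg")
    # Concatenate globally pooled features from both backbones.
    fused = layers.Concatenate()([resnet(inputs), vgg(inputs)])
    x = layers.Dense(256, activation="relu")(fused)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```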

[AI-36] MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

链接: https://arxiv.org/abs/2406.19680
作者: Yuang Zhang,Jiaxi Gu,Li-Wen Wang,Han Wang,Junqi Cheng,Yuefeng Zhu,Fangyuan Zou
关键词: generative artificial intelligence, achieved significant advancements, recent years, generative artificial, spawning a variety
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: this https URL .

[AI-37] Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning

链接: https://arxiv.org/abs/2406.19670
作者: Eduardo de Conto,Blaise Genest,Arvind Easwaran
关键词: leverages artificial intelligence, creating computationally efficient, physical systems increasingly, systems increasingly leverages, increasingly leverages artificial
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, to be published in AIware’24

点击查看摘要

Abstract:The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modelization with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad-hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.

[AI-38] ACES: Automatic Cohort Extraction System for Event-Stream Datasets

链接: https://arxiv.org/abs/2406.19653
作者: Justin Xu,Jack Gallifant,Alistair E. W. Johnson,Matthew B. A. McDermott
关键词: machine learning, Automatic Cohort Extraction, Cohort Extraction System, challenge in machine, ACES
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: For ACES Online Documentation, see this https URL

点击查看摘要

Abstract:Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task/cohort definitions are often private, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a significant part of this problem by introducing the Automatic Cohort Extraction System for Event-Stream Datasets (ACES). This tool is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to any dataset for which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality. ACES is available at this https URL.

[AI-39] CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion

链接: https://arxiv.org/abs/2406.19651
作者: Xianzhi Zeng,Zhuoyan Wu,Xinjing Hu,Xuanhua Shi,Shixuan Sun,Shuhao Zhang
关键词: natural language processing, including information retrieval, Approximate Nearest Neighbor, Nearest Neighbor Search, computer vision
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Approximate K Nearest Neighbor (AKNN) algorithms play a pivotal role in various AI applications, including information retrieval, computer vision, and natural language processing. Although numerous AKNN algorithms and benchmarks have been developed recently to evaluate their effectiveness, the dynamic nature of real-world data presents significant challenges that existing benchmarks fail to address. Traditional benchmarks primarily assess retrieval effectiveness in static contexts and often overlook update efficiency, which is crucial for handling continuous data ingestion. This limitation results in an incomplete assessment of an AKNN algorithm’s ability to adapt to changing data patterns, thereby restricting insights into its performance in dynamic environments. To address these gaps, we introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference to supplant traditional heuristic scans, and improved distance computation methods to reduce computational overhead. Our extensive evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency. These findings challenge established beliefs about the necessity of algorithmic complexity for high performance. Furthermore, our results underscore existing challenges and illuminate future research opportunities. We have made the datasets and implementation methods available at: this https URL.

[AI-40] Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task

链接: https://arxiv.org/abs/2406.19648
作者: Sion Yoon,Tae Eun Kim,Yoo Jung Oh
关键词: dynamics of human-AI, human-AI communication, communication, multiple language model, language model chatbots
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction between humans and multiple chatbots. This study examines the impact of multi-chatbot communication in a specific persuasion setting: promoting charitable donations. We developed an online environment that enables multi-chatbot communication and conducted a pilot experiment utilizing two GPT-based chatbots, Save the Children and UNICEF chatbots, to promote charitable donations. In this study, we present our development process of the multi-chatbot interface and present preliminary findings from a pilot experiment. Analyses of qualitative and quantitative feedback are presented, and limitations are addressed.

[AI-41] Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

链接: https://arxiv.org/abs/2406.19644
作者: Zichao Shen,Tianchen Zhu,Qingyun Sun,Shiqi Gao,Jianxin Li
关键词: evaluating policy trajectories, Reinforcement learning, intricate game tasks, game tasks due, Preference-based reinforcement learning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty in designing comprehensive and precise reward functions. This inherent difficulty curtails the broader application of RL within game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals, thereby circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under conditions marked by complex constraints. To tackle this challenge, we propose an LLM-enabled automatic preference generation framework named LLM4PG, which harnesses the capabilities of large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrated the effectiveness of our LLM-enabled reward functions, accelerating RL convergence and overcoming stagnation caused by slow or absent progress under original reward structures. This approach mitigates the reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance RL’s effectiveness in complex environments in the wild.
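
A hedged sketch of replacing human preference labels with LLM-generated ones: describe two trajectories in text, ask an LLM which better satisfies the task's language constraints, and feed the resulting pairs into a standard preference-based reward learner. `query_llm`, the trajectory summary, and the prompt format are placeholders, not the paper's implementation.

```python
# Illustrative sketch of LLM-generated preference labels for PbRL.
def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

def summarize(trajectory) -> str:
    """Abstract a trajectory (list of (state, action) pairs) into text."""
    return "; ".join(f"{s} -> {a}" for s, a in trajectory)

def llm_preference(traj_a, traj_b, constraint: str) -> int:
    prompt = (
        f"Task constraint: {constraint}\n"
        f"Trajectory A: {summarize(traj_a)}\n"
        f"Trajectory B: {summarize(traj_b)}\n"
        "Which trajectory better satisfies the constraint? Answer A or B."
    )
    answer = query_llm(prompt).strip().upper()
    return 0 if answer.startswith("A") else 1   # index of the preferred trajectory

# The resulting (traj_a, traj_b, preferred) triples can then train a reward
# model exactly as human preference labels would in standard PbRL pipelines.
```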

[AI-42] Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation

链接: https://arxiv.org/abs/2406.19643
作者: Zhe Hu,Hou Pong Chan,Jing Li,Yu Yin
关键词: challenging task, argument writing, Writing, high-level beliefs, persuasive arguments
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such a debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.
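
A hedged sketch of the persona debate loop described above: each persona agent critiques and revises a shared plan in turn over several rounds, and the final plan conditions the essay generation. The personas, round count, and prompt wording are illustrative assumptions; `query_llm` is a placeholder for any chat-completion call.

```python
# Sketch of a persona-based debate loop for argument planning.
def query_llm(prompt: str) -> str:
    raise NotImplementedError

def debate_plan(topic, personas, rounds=2):
    plan = "No plan yet."
    for _ in range(rounds):
        for persona in personas:
            prompt = (
                f"You are {persona}. Topic: {topic}\n"
                f"Current argument plan:\n{plan}\n"
                "Critique the plan from your perspective and revise it."
            )
            plan = query_llm(prompt)          # each agent updates the shared plan
    return plan

def write_essay(topic, plan):
    return query_llm(f"Write a persuasive essay on '{topic}' following this plan:\n{plan}")

personas = ["an economist focused on costs",
            "an ethicist focused on fairness",
            "a skeptic who stress-tests every claim"]
# essay = write_essay("universal basic income",
#                     debate_plan("universal basic income", personas))
```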

[AI-43] Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation

链接: https://arxiv.org/abs/2406.19638
作者: Junsung Park,Hyunjung Shim
关键词: Weakly Supervised Semantic, Supervised Semantic Segmentation, Weakly Supervised, Supervised Semantic, employs weak supervision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, accepted in AAAI 2024 Edge Intelligence Workshop

点击查看摘要

Abstract:Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. Despite the impressive achievements of recent WSSS methods, we identify that introducing weak labels with high mean Intersection over Union (mIoU) does not guarantee high segmentation performance. Existing studies have emphasized the importance of prioritizing precision and reducing noise to improve overall performance. In the same vein, we propose ORANDNet, an advanced ensemble approach tailored for WSSS. ORANDNet combines Class Activation Maps (CAMs) from two different classifiers to increase the precision of pseudo-masks (PMs). To further mitigate small noise in the PMs, we incorporate curriculum learning. This involves training the segmentation model initially with pairs of smaller-sized images and corresponding PMs, gradually transitioning to the original-sized pairs. By combining the original CAMs of ResNet-50 and ViT, we significantly improve the segmentation performance over the single-best model and the naive ensemble model, respectively. We further extend our ensemble method to CAMs from AMN (ResNet-like) and MCTformer (ViT-like) models, achieving performance benefits in advanced WSSS models. It highlights the potential of our ORANDNet as a final add-on module for WSSS models.

[AI-44] Optimal Video Compression using Pixel Shift Tracking

链接: https://arxiv.org/abs/2406.19630
作者: Hitesh Saai Mananchery Panneerselvam,Smit Anand
关键词: hard coded rules, Video comprises approximately, comprises approximately, internet traffic, coded rules
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video comprises approximately 85% of all internet traffic, but video encoding/compression has historically been done with hard-coded rules, which have worked well but only to a certain limit. We have seen a surge in video compression algorithms using ML-based models in the last few years, and many of them have outperformed several legacy codecs. The models range from encoding video end to end using an ML approach to replacing some intermediate steps in legacy codecs with ML models to increase the efficiency of those steps. Optimizing video storage is an essential aspect of video processing, and one possible approach is to avoid redundant data in each frame. In this paper, we introduce the removal of redundancies in subsequent frames of a given video as the main approach to video compression. We call this method Redundancy Removal using Shift (R²S). This method can be utilized across various machine learning model algorithms, and makes the compression more accessible and adaptable. In this study, we have utilized a computer vision-based pixel point tracking method to identify redundant pixels to encode video for optimal storage.
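
A toy numpy sketch of the frame-to-frame redundancy idea: store a full keyframe, then for each following frame keep only the pixels that changed beyond a threshold. Simple differencing stands in for the learned pixel point tracking used in the paper; the threshold is an illustrative assumption.

```python
# Toy sketch of frame-to-frame redundancy removal with sparse per-frame updates.
import numpy as np

def encode(frames, threshold=8):
    keyframe = frames[0]
    deltas = []
    prev = keyframe.astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        changed = np.abs(cur - prev).max(axis=-1) > threshold   # per-pixel change test
        ys, xs = np.nonzero(changed)
        deltas.append((ys, xs, frame[ys, xs]))                  # store only changed pixels
        prev = cur
    return keyframe, deltas

def decode(keyframe, deltas):
    frames = [keyframe.copy()]
    for ys, xs, values in deltas:
        frame = frames[-1].copy()
        frame[ys, xs] = values
        frames.append(frame)
    return frames
```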

[AI-45] Safety through feedback in Constrained RL

链接: https://arxiv.org/abs/2406.19626
作者: Shashank Reddy Chirra,Pradeep Varakantham,Praveen Paruchuri
关键词: additional cost function, cost function, agent safe behaviour, modifying the reward, function
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent’s safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g. aggressive lane changes) is inherently complex. In such scenarios, the cost function can be learned from feedback collected offline in between training rounds. This feedback can be system generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends to beyond state-level feedback, thus, reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback on every trajectory generated by the agent, hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce novelty-based sampling that selectively involves the evaluator only when the agent encounters a novel trajectory. We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios.
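
A hedged sketch of the novelty-based sampling idea: keep an archive of trajectory embeddings and query the evaluator only when a new trajectory is sufficiently far from everything seen so far. The mean-state embedding and the distance threshold below are illustrative assumptions.

```python
# Sketch of novelty-based sampling for feedback requests.
import numpy as np

class NoveltyFilter:
    def __init__(self, threshold=1.0):
        self.archive = []          # embeddings of trajectories already shown
        self.threshold = threshold

    def embed(self, trajectory):
        # Trajectory given as a list of state vectors; use the mean state as a
        # crude embedding (an assumption for illustration only).
        return np.mean(np.asarray(trajectory, dtype=float), axis=0)

    def should_query_evaluator(self, trajectory) -> bool:
        emb = self.embed(trajectory)
        if self.archive:
            nearest = min(np.linalg.norm(emb - a) for a in self.archive)
            if nearest < self.threshold:
                return False       # similar to something already evaluated
        self.archive.append(emb)
        return True                # novel enough to warrant human/system feedback
```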

[AI-46] Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

链接: https://arxiv.org/abs/2406.19622
作者: Erh-Chung Chen,Pin-Yu Chen,I-Hsin Chung,Che-Rung Lee
关键词: deep neural networks, deep neural, neural networks, robustness, DNNs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The security and robustness of deep neural networks (DNNs) have become increasingly concerning. This paper aims to provide both a theoretical foundation and a practical solution to ensure the reliability of DNNs. We explore the concept of Lipschitz continuity to certify the robustness of DNNs against adversarial attacks, which aim to mislead the network by adding imperceptible perturbations to inputs. We propose a novel algorithm that remaps the input domain into a constrained range, reducing the Lipschitz constant and potentially enhancing robustness. Unlike existing adversarially trained models, where robustness is enhanced by introducing additional examples from other datasets or generative models, our method is almost cost-free as it can be integrated with existing models without requiring re-training. Experimental results demonstrate the generalizability of our method, as it can be combined with various models and achieve enhancements in robustness. Furthermore, our method achieves the best robust accuracy for CIFAR10, CIFAR100, and ImageNet datasets on the RobustBench leaderboard.
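
A PyTorch sketch of the core idea: prepend a fixed remapping that squashes inputs into a constrained range before a pretrained classifier, bounding the effective input domain without any re-training. The tanh-based remap and its scale are illustrative choices, not the paper's exact remapping function.

```python
# Sketch of input-domain remapping composed with a pretrained classifier.
import torch
import torch.nn as nn

class InputRemap(nn.Module):
    def __init__(self, scale=0.5):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        # Smoothly map images in [0, 1] into a narrower band around 0.5,
        # constraining the input range seen by the downstream model.
        return 0.5 + self.scale * torch.tanh(x - 0.5)

def harden(pretrained_model: nn.Module) -> nn.Module:
    # No re-training required: the remap is simply composed with the model.
    return nn.Sequential(InputRemap(), pretrained_model)
```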

[AI-47] A Survey on Data Quality Dimensions and Tools for Machine Learning

链接: https://arxiv.org/abs/2406.19614
作者: Yuhan Zhou,Fengjiao Tu,Kewei Sha,Junhua Ding,Haihua Chen
关键词: Machine learning, substantial in practically, practically all aspects, data quality, exploratory data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by The 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper

点击查看摘要

Abstract:Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: this https URL.

[AI-48] Optimizing Cyber Defense in Dynamic Active Directories through Reinforcement Learning

链接: https://arxiv.org/abs/2406.19596
作者: Diksha Goel,Kristen Moore,Mingyu Guo,Derui Wang,Minjune Kim,Seyit Camtepe
关键词: Autonomous Cyber Operations, Cyber Operations, Autonomous Cyber, effective edge-blocking ACO, gap in Autonomous
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The manuscript has been accepted as full paper at European Symposium on Research in Computer Security (ESORICS) 2024

点击查看摘要

Abstract:This paper addresses a significant gap in Autonomous Cyber Operations (ACO) literature: the absence of effective edge-blocking ACO strategies in dynamic, real-world networks. It specifically targets the cybersecurity vulnerabilities of organizational Active Directory (AD) systems. Unlike the existing literature on edge-blocking defenses, which considers AD systems as static entities, our study counters this by recognizing their dynamic nature and developing advanced edge-blocking defenses through a Stackelberg game model between attacker and defender. We devise a Reinforcement Learning (RL)-based attack strategy and an RL-assisted Evolutionary Diversity Optimization-based defense strategy, where the attacker and defender improve each other’s strategies via parallel gameplay. To address the computational challenges of training attacker-defender strategies on numerous dynamic AD graphs, we propose an RL Training Facilitator that prunes environments and neural networks to eliminate irrelevant elements, enabling efficient and scalable training for large graphs. We extensively train the attacker strategy, as a sophisticated attacker model is essential for a robust defense. Our empirical results successfully demonstrate that our proposed approach enhances the defender’s proficiency in hardening dynamic AD graphs while ensuring scalability for large-scale AD.

[AI-49] PathAlign: A vision-language model for whole slide images in histopathology

链接: https://arxiv.org/abs/2406.19578
作者: Faruk Ahmed,Andrew Sellergren,Lin Yang,Shawn Xu,Boris Babenko,Abbi Ward,Niels Olson,Arash Mohtashamian,Yossi Matias,Greg S. Corrado,Quang Duong,Dale R. Webster,Shravya Shetty,Daniel Golden,Yun Liu,David F. Steiner,Ellery Wulczyn
关键词: histopathology images underlies, Microscopic interpretation, treatment decisions, underlies many important, Microscopic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures

点击查看摘要

Abstract:Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

[AI-50] Synthetic Cancer – Augmenting Worms with LLMs

链接: https://arxiv.org/abs/2406.19570
作者: Benjamin Zimmerman,David Zollikofer
关键词: large language models, abuse rises drastically, increasingly sophisticated large, sophisticated large language, language models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Won first place at the Swiss AI Safety Prize. Some technical details omitted, contact authors for more information

点击查看摘要

Abstract:With increasingly sophisticated large language models (LLMs), the potential for abuse rises drastically. As a submission to the Swiss AI Safety Prize, we present a novel type of metamorphic malware leveraging LLMs for two key processes. First, LLMs are used for automatic code rewriting to evade signature-based detection by antimalware programs. The malware then spreads its copies via email by utilizing an LLM to socially engineer email replies to encourage recipients to execute the attached malware. Our submission includes a functional minimal prototype, highlighting the risks that LLMs pose for cybersecurity and underscoring the need for further research into intelligent malware.

[AI-51] What Matters in Detecting AI-Generated Videos like Sora?

链接: https://arxiv.org/abs/2406.19568
作者: Chirui Chang,Zhengzhe Liu,Xiaoyang Lyu,Xiaojuan Qi
关键词: showcased remarkable results, Recent advancements, Stable Video Diffusion, videos remains under-explored, diffusion-based video generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing the Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: this https URL
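
A hedged sketch of the ensemble-of-experts decision described above: three expert classifiers score the appearance, optical-flow, and depth streams, and their probabilities are fused into a single real/fake call. Simple averaging is used here for illustration; the expert models themselves and the actual fusion rule are placeholders.

```python
# Sketch of combining appearance, motion, and geometry experts for fake-video detection.
import numpy as np

def fake_probability(video, appearance_expert, motion_expert, geometry_expert):
    p_appearance = appearance_expert(video["frames"])        # vision-feature cues
    p_motion = motion_expert(video["optical_flow"])          # motion cues
    p_geometry = geometry_expert(video["monocular_depth"])   # geometry cues
    p_fake = float(np.mean([p_appearance, p_motion, p_geometry]))
    return p_fake, p_fake > 0.5   # fused probability and the final fake/real call
```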

[AI-52] Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning

链接: https://arxiv.org/abs/2406.19561
作者: Bradley Burega,John D. Martin,Luke Kapeluck,Michael Bowling
关键词: Reinforcement Learning, remain sample-efficient, imperfect model, environment dynamics change, Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.

[AI-53] Rethinking harmless refusals when fine-tuning foundation models

链接: https://arxiv.org/abs/2406.19552
作者: Florin Pop,Judd Rosenblatt,Diogo Schwerz de Lucena,Michael Vaiana
关键词: Large Language Models, Large Language, effectively mitigates versus, conceals undesirable behavior, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICLR 2024 AGI Workshop Poster

点击查看摘要

Abstract:In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term reason-based deception, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.

[AI-54] Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations

链接: https://arxiv.org/abs/2406.19545
作者: Ritam Dutt,Zhen Wu,Kelly Shi,Divyanshu Sheth,Prakhar Gupta,Carolyn Penstein Rose
关键词: Large Language Models, leverages Large Language, Language Models, Large Language, implicitly encoded social
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: To appear at The Proceedings of the Association for Computational Linguistics, 2024

点击查看摘要

Abstract:We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.
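
A hedged sketch of the rationale augmentation step described above: prompt an LLM for a short explanation connecting visible cues to the underlying social meaning, append it to the utterance, and classify the augmented text. The prompt wording and the downstream classifier are placeholders, not the paper's multi-faceted prompt.

```python
# Sketch of rationale-augmented classification for social-meaning detection.
def query_llm(prompt: str) -> str:
    raise NotImplementedError

def classify_with_rationale(utterance, context, classifier, label="politeness"):
    rationale = query_llm(
        f"Conversation context: {context}\n"
        f"Utterance: {utterance}\n"
        f"Briefly explain which cues in the utterance signal the speaker's {label}."
    )
    augmented = f"{utterance} [RATIONALE] {rationale}"
    return classifier(augmented)   # any text classifier, fine-tuned or few-shot
```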

[AI-55] Handling Ontology Gaps in Semantic Parsing

链接: https://arxiv.org/abs/2406.19537
作者: Andrea Bacciu,Marco Damonte,Marco Basaldella,Emilio Monti
关键词: Neural Semantic Parsing, Semantic Parsing, Neural Semantic, majority of Neural, target symbols
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The majority of Neural Semantic Parsing (NSP) models are developed with the assumption that there are no concepts outside the ones such models can represent with their target symbols (closed-world assumption). This assumption leads models to generate hallucinated outputs rather than admitting their lack of knowledge. Hallucinations can lead to wrong or potentially offensive responses to users. Hence, a mechanism to prevent this behavior is crucial to build trusted NSP-based Question Answering agents. To that end, we propose the Hallucination Simulation Framework (HSF), a general setting for stimulating and analyzing NSP model hallucinations. The framework can be applied to any NSP task with a closed-ontology. Using the proposed framework and KQA Pro as the benchmark dataset, we assess state-of-the-art techniques for hallucination detection. We then present a novel hallucination detection strategy that exploits the computational graph of the NSP model to detect the NSP hallucinations in the presence of ontology gaps, out-of-domain utterances, and to recognize NSP errors, improving the F1-Score by ~21%, ~24%, and ~1%, respectively. This is the first work in closed-ontology NSP that addresses the problem of recognizing ontology gaps. We release our code and checkpoints at this https URL.

[AI-56] Using Large Language Models to Assist Video Content Analysis: An Exploratory Study of Short Videos on Depression

链接: https://arxiv.org/abs/2406.19528
作者: Jiaying Liu,Yunlong Wang,Yao Lyu,Yiheng Su,Shuo Niu,Xuhai “Orson” Xu,Yan Zhang
关键词: Large Language Models, leveraging Large Language, Language Models, Large Language, leveraging Large
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 6 pages, 2 figures, under review in CSCW 24

点击查看摘要

Abstract:Despite the growing interest in leveraging Large Language Models (LLMs) for content analysis, current studies have primarily focused on text-based content. In the present work, we explored the potential of LLMs in assisting video content analysis by conducting a case study that followed a new workflow of LLM-assisted multimodal content analysis. The workflow encompasses codebook design, prompt engineering, LLM processing, and human evaluation. We strategically crafted annotation prompts to get LLM Annotations in structured form and explanation prompts to generate LLM Explanations for a better understanding of LLM reasoning and transparency. To test LLM’s video annotation capabilities, we analyzed 203 keyframes extracted from 25 YouTube short videos about depression. We compared the LLM Annotations with those of two human coders and found that LLM has higher accuracy in object and activity Annotations than emotion and genre Annotations. Moreover, we identified the potential and limitations of LLM’s capabilities in annotating videos. Based on the findings, we explore opportunities and challenges for future research and improvements to the workflow. We also discuss ethical concerns surrounding future studies based on LLM-assisted video analysis.

[AI-57] Captioning Visualizations with Large Language Models (CVLLM): A Tutorial

链接: https://arxiv.org/abs/2406.19512
作者: Giuseppe Carenini,Jordon Johnson,Ali Salamatian
关键词: Automatically captioning visualizations, large language models, Automatically captioning, open exciting, exciting new possibilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Automatically captioning visualizations is not new, but recent advances in large language models(LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.

[AI-58] Too Good to be True? Turn Any Model Differentially Private With DP-Weights

链接: https://arxiv.org/abs/2406.19507
作者: David Zagardo
关键词: Stochastic Gradient Descent, Private Stochastic Gradient, Gradient Descent, Stochastic Gradient, Differentially Private Stochastic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: For code visit the following repository, this https URL

点击查看摘要

Abstract:Imagine training a machine learning model with Differentially Private Stochastic Gradient Descent (DP-SGD), only to discover post-training that the noise level was either too high, crippling your model’s utility, or too low, compromising privacy. The dreaded realization hits: you must start the lengthy training process from scratch. But what if you could avoid this retraining nightmare? In this study, we introduce a groundbreaking approach (to our knowledge) that applies differential privacy noise to the model’s weights after training. We offer a comprehensive mathematical proof for this novel approach’s privacy bounds, use formal methods to validate its privacy guarantees, and empirically evaluate its effectiveness using membership inference attacks and performance evaluations. This method allows for a single training run, followed by post-hoc noise adjustments to achieve optimal privacy-utility trade-offs. We compare this novel fine-tuned model (DP-Weights model) to a traditional DP-SGD model, demonstrating that our approach yields statistically similar performance and privacy guarantees. Our results validate the efficacy of post-training noise application, promising significant time savings and flexibility in fine-tuning differential privacy parameters, making it a practical alternative for deploying differentially private models in real-world scenarios.
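
The central move is to leave training untouched and add calibrated noise to the finished weights, so different privacy levels can be tried without retraining. Below is a minimal sketch of that post-hoc step in PyTorch; the `noise_std` knob is a placeholder, since the paper's own privacy accounting (clipping norm, epsilon, delta) determines the actual scale and is not reproduced here.

```python
import copy
import torch

def add_dp_noise_to_weights(model: torch.nn.Module, noise_std: float) -> torch.nn.Module:
    """Return a copy of `model` with Gaussian noise added to every parameter.

    `noise_std` is a stand-in: the DP-Weights paper derives the scale from
    its own privacy analysis, which is not reproduced in this sketch.
    """
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * noise_std)
    return noisy

# Train once (any ordinary loop), then sweep noise levels post hoc instead of
# retraining from scratch for every privacy setting.
model = torch.nn.Linear(16, 2)
for std in (0.01, 0.05, 0.1):
    candidate = add_dp_noise_to_weights(model, std)
    # evaluate utility / run membership-inference tests on `candidate` here
```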

[AI-59] Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning

链接: https://arxiv.org/abs/2406.19502
作者: Miyoung Ko,Sue Hyun Park,Joonsuk Park,Minjoon Seo
关键词: significant advancements, large language models, large language, knowledge, questions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Work in progress; code is available at this https URL

点击查看摘要

Abstract:Despite significant advancements, there is a limited understanding of how large language models (LLMs) utilize knowledge for reasoning. To address this, we propose a method that deconstructs complex real-world questions into a graph, representing each question as a node with parent nodes of background knowledge needed to solve the question. We develop the DepthQA dataset, deconstructing questions into three depths: (i) recalling conceptual knowledge, (ii) applying procedural knowledge, and (iii) analyzing strategic knowledge. Based on a hierarchical graph, we quantify forward discrepancy, discrepancies in LLMs’ performance on simpler sub-problems versus complex questions. We also measure backward discrepancy, where LLMs answer complex questions but struggle with simpler ones. Our analysis shows that smaller models have more discrepancies than larger models. Additionally, guiding models from simpler to complex questions through multi-turn interactions improves performance across model sizes, highlighting the importance of structured intermediate steps in knowledge reasoning. This work enhances our understanding of LLM reasoning and suggests ways to improve their problem-solving abilities.
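
Forward and backward discrepancy as defined above reduce to simple counts over the question graph: cases where the sub-questions are answered but the composite question is not, and the reverse. A small sketch of that bookkeeping follows; the dictionary layout is an assumption for illustration, not the DepthQA schema.

```python
# Tallying forward/backward discrepancies over deconstructed questions.
# The field names ("correct", "subquestions") are assumed for illustration.

def discrepancy_rates(questions):
    """Each item: {"correct": bool, "subquestions": [bool, ...]}."""
    forward = backward = 0
    for q in questions:
        subs_ok = all(q["subquestions"])
        if subs_ok and not q["correct"]:
            forward += 1   # solves the pieces, misses the whole
        if q["correct"] and not subs_ok:
            backward += 1  # answers the whole, stumbles on a piece
    n = max(len(questions), 1)
    return forward / n, backward / n

print(discrepancy_rates([
    {"correct": False, "subquestions": [True, True]},
    {"correct": True,  "subquestions": [True, False]},
]))  # (0.5, 0.5)
```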

[AI-60] Knowledge acquisition for dialogue agents using reinforcement learning on graph representations

链接: https://arxiv.org/abs/2406.19500
作者: Selene Baez Santamaria,Shihan Wang,Piek Vossen
关键词: artificial agent motivated, initial training, develop an artificial, motivated to augment, Abstract
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We develop an artificial agent motivated to augment its knowledge base beyond its initial training. The agent actively participates in dialogues with other agents, strategically acquiring new information. The agent models its knowledge as an RDF knowledge graph, integrating new beliefs acquired through conversation. Responses in dialogue are generated by identifying graph patterns around these new integrated beliefs. We show that policies can be learned using reinforcement learning to select effective graph patterns during an interaction, without relying on explicit user feedback. Within this context, our study is a proof of concept for leveraging users as effective sources of information.

[AI-61] Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts

链接: https://arxiv.org/abs/2406.19497
作者: Naseela Pervez,Alexander J. Titus
关键词: helping authors enhance, Large language models, helping authors, increasingly utilized, utilized to assist
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilized to assist in scientific and academic writing, helping authors enhance the coherence of their articles. Previous studies have highlighted stereotypes and biases present in LLM outputs, emphasizing the need to evaluate these models for their alignment with human narrative styles and potential gender biases. In this study, we assess the alignment of three prominent LLMs - Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash - by analyzing their performance on benchmark text-generation tasks for scientific abstracts. We employ the Linguistic Inquiry and Word Count (LIWC) framework to extract lexical, psychological, and social features from the generated texts. Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases. This research highlights the importance of developing LLMs that maintain a diversity of writing styles to promote inclusivity in academic discourse.
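
LIWC is a licensed dictionary-based tool, so the sketch below uses tiny hand-made category lexicons purely to illustrate the kind of lexical and psychological features being compared across model outputs; the word lists are invented and far smaller than the real LIWC dictionaries.

```python
import re
from collections import Counter

# Toy category lexicons standing in for LIWC dictionaries (invented here).
CATEGORIES = {
    "first_person": {"i", "me", "my", "we", "our"},
    "certainty": {"always", "never", "clearly", "definitely"},
    "tentative": {"maybe", "perhaps", "possibly", "seems"},
}

def lexical_profile(text: str) -> dict:
    """Fraction of tokens falling in each category, as a crude LIWC stand-in."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {name: sum(counts[w] for w in words) / total
            for name, words in CATEGORIES.items()}

print(lexical_profile("We clearly show that our method perhaps generalizes."))
```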

[AI-62] Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems

链接: https://arxiv.org/abs/2406.19493
作者: Anubhab Majumder,Kausik Bhattacharya,Amaresh Chakrabarti
关键词: Large Language Models, Representing systems, SAPPhIRE causality model, SAPPhIRE model, leverage Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool’s success - focusing on the factual accuracy and reliability of outcomes.

[AI-63] LoPT: Low-Rank Prompt Tuning for Parameter Efficient Language Models

链接: https://arxiv.org/abs/2406.19486
作者: Shouchang Guo,Sonam Damani,Keng-hao Chang
关键词: prompt tuning, suffix text, token indices, text is added, optimized to gain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In prompt tuning, a prefix or suffix text is added to the prompt, and the embeddings (soft prompts) or token indices (hard prompts) of the prefix/suffix are optimized to gain more control over language models for specific tasks. This approach eliminates the need for hand-crafted prompt engineering or explicit model fine-tuning. Prompt tuning is significantly more parameter-efficient than model fine-tuning, as it involves optimizing partial inputs of language models to produce desired outputs. In this work, we aim to further reduce the amount of trainable parameters required for a language model to perform well on specific tasks. We propose Low-rank Prompt Tuning (LoPT), a low-rank model for prompts that achieves efficient prompt optimization. The proposed method demonstrates similar outcomes to full parameter prompt tuning while reducing the number of trainable parameters by a factor of 5. It also provides promising results compared to the state-of-the-art methods that would require 10 to 20 times more parameters.
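
The stated idea is that the soft prompt itself can be factorized into a low-rank product, so only the two small factors are trained. A minimal sketch is below; it is not necessarily the paper's exact parameterization, and the prompt length, embedding size, and rank are placeholders.

```python
import torch
import torch.nn as nn

class LowRankSoftPrompt(nn.Module):
    """Soft prompt parameterized as a rank-r product; a sketch of the
    low-rank prompt idea, not necessarily LoPT's exact formulation."""

    def __init__(self, prompt_len: int = 20, embed_dim: int = 768, rank: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(prompt_len, rank) * 0.02)
        self.b = nn.Parameter(torch.randn(rank, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the (prompt_len, embed_dim) soft prompt to every sequence.
        prompt = (self.a @ self.b).unsqueeze(0)
        return torch.cat([prompt.expand(input_embeds.shape[0], -1, -1), input_embeds], dim=1)

soft_prompt = LowRankSoftPrompt()
out = soft_prompt(torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 30, 768])
# Trainable parameters: (20 + 768) * 4 = 3,152 vs 20 * 768 = 15,360 for a full
# soft prompt, roughly the 5x reduction the abstract reports.
```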

[AI-64] Sparse Regression for Machine Translation

链接: https://arxiv.org/abs/2406.19478
作者: Ergun Biçici
关键词: machine translation outputs, transductive regression techniques, generate machine translation, regularized regression, learn correct feature
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of L1-regularized regression (*lasso*) to learn the mappings between sparsely observed feature sets versus L2-regularized regression. Proper selection of training instances plays an important role in learning correct feature mappings within limited computational resources and at expected accuracy levels. We introduce the *dice* instance selection method for proper selection of training instances, which plays an important role in learning correct feature mappings and improving the source and target coverage of the training set. We show that L1-regularized regression performs better than L2-regularized regression both in regression measurements and in the translation experiments using graph decoding. We present encouraging results when translating from German to English and Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.
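
The mapping step amounts to regressing target-side feature vectors on source-side feature vectors with an L1 penalty so that the learned mapping stays sparse. The sketch below uses scikit-learn's Lasso on synthetic sparse features; the data, dimensions, and regularization strength are placeholders rather than anything from the paper's corpora.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-ins for sparse source/target feature vectors of aligned
# sentence pairs; real features would come from a parallel corpus.
n_pairs, n_src, n_tgt = 200, 50, 30
X_src = rng.poisson(0.3, size=(n_pairs, n_src)).astype(float)
true_map = rng.normal(size=(n_src, n_tgt)) * (rng.random((n_src, n_tgt)) < 0.1)
Y_tgt = X_src @ true_map + 0.01 * rng.normal(size=(n_pairs, n_tgt))

# One L1-regularized regressor per target feature: the "lasso" mapping.
mappings = [Lasso(alpha=0.01).fit(X_src, Y_tgt[:, j]) for j in range(n_tgt)]

def predict_target_features(x_src: np.ndarray) -> np.ndarray:
    """Map a single source feature vector to predicted target features."""
    return np.array([m.predict(x_src[None, :])[0] for m in mappings])

print(predict_target_features(X_src[0]).shape)  # (30,)
```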

[AI-65] ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

链接: https://arxiv.org/abs/2406.19464
作者: Zeyi Liu,Cheng Chi,Eric Cousineau,Naveen Kuppuswamy,Benjamin Burchfiel,Shuran Song
关键词: signals provide rich, provide rich information, Audio signals provide, signals provide, provide rich
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Audio signals provide rich information about robot interactions and object properties through contact. This information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an ‘ear-in-hand’ data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments, by learning from diverse in-the-wild human demonstrations. Project website: this https URL

[AI-66] Lightweight Predictive 3D Gaussian Splats

链接: https://arxiv.org/abs/2406.19434
作者: Junli Cao,Vidit Goel,Chaoyang Wang,Anil Kag,Ju Hu,Sergei Korolev,Chenfanfu Jiang,Sergey Tulyakov,Jian Ren
关键词: Recent approaches representing, Recent approaches, splats show increased, show increased rendering, increased rendering speed
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Recent approaches representing 3D objects and scenes using Gaussian splats show increased rendering speed across a variety of platforms and devices. While rendering such representations is indeed extremely efficient, storing and transmitting them is often prohibitively expensive. To represent large-scale scenes, one often needs to store millions of 3D Gaussians, occupying gigabytes of disk space. This poses a very practical limitation, prohibiting widespread adoption. Several solutions have been proposed to strike a balance between disk size and rendering quality, noticeably reducing the visual quality. In this work, we propose a new representation that dramatically reduces the hard drive footprint while featuring similar or improved quality when compared to the standard 3D Gaussian splats. When compared to other compact solutions, ours offers higher quality renderings with significantly reduced storage, being able to efficiently run on a mobile device in real-time. Our key observation is that nearby points in the scene can share similar representations. Hence, only a small ratio of 3D points needs to be stored. We introduce an approach to identify such points which are called parent points. The discarded points called children points along with attributes can be efficiently predicted by tiny MLPs.
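
The key mechanism is storing only the "parent" points and letting a tiny MLP regenerate the attributes of the discarded "children" at load time. A minimal sketch of such a predictor is below; the input and output dimensions and the assumed attribute layout (offset, scale, rotation, color, opacity) are illustrative guesses, not the paper's design.

```python
import torch
import torch.nn as nn

class ChildSplatPredictor(nn.Module):
    """Tiny MLP mapping a parent point's feature vector to one child splat's
    attributes. The 14-dim output (3 offset + 3 scale + 4 rotation + 3 color
    + 1 opacity) is an assumed layout, not the paper's exact one."""

    def __init__(self, parent_dim: int = 32, child_dim: int = 14, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(parent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, child_dim),
        )

    def forward(self, parent_feat: torch.Tensor) -> torch.Tensor:
        return self.net(parent_feat)

# Only parent features go on disk; children are regenerated after loading.
stored_parent_features = torch.randn(1000, 32)   # placeholder for loaded data
children = ChildSplatPredictor()(stored_parent_features)
print(children.shape)  # torch.Size([1000, 14])
```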

[AI-67] A Quantization-based Technique for Privacy Preserving Distributed Learning

链接: https://arxiv.org/abs/2406.19418
作者: Maurizio Colombo,Rasool Asal,Ernesto Damiani,Lamees Mahmoud AlQassem,Al Anoud Almemari,Yousof Alhammadi
关键词: Machine Learning, deployment of Machine, massive deployment, raises serious concerns, distributed learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The massive deployment of Machine Learning (ML) models raises serious concerns about data protection. Privacy-enhancing technologies (PETs) offer a promising first step, but hard challenges persist in achieving confidentiality and differential privacy in distributed learning. In this paper, we describe a novel, regulation-compliant data protection technique for the distributed training of ML models, applicable throughout the ML life cycle regardless of the underlying ML architecture. Designed from the data owner’s perspective, our method protects both training data and ML model parameters by employing a protocol based on a quantized multi-hash data representation Hash-Comb combined with randomization. The hyper-parameters of our scheme can be shared using standard Secure Multi-Party computation protocols. Our experimental results demonstrate the robustness and accuracy-preserving properties of our approach.

[AI-68] “Glue pizza and eat rocks” – Exploiting Vulnerabilities in Retrieval-Augmented Generative Models

链接: https://arxiv.org/abs/2406.19417
作者: Zhen Tan,Chengshuai Zhao,Raha Moraffah,Yifan Li,Song Wang,Jundong Li,Tianlong Chen,Huan Liu
关键词: enhance Large Language, Large Language Models, models enhance Large, Large Language, Retrieval-Augmented Generative
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model’s behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scrapings and user-contributed data pools. To be more realistic, we target a realistic setting where the adversary has no knowledge of users’ queries, knowledge base data, and the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content.

[AI-69] Uncovering the hidden core-periphery structure in hyperbolic networks

链接: https://arxiv.org/abs/2406.19953
作者: Imran Ansari,Pawanesh Yadav,Niteesh Sahni
关键词: high-clustering coefficient, hyperbolic network models, exhibit very fundamental, fundamental and essential, network models exhibit
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The hyperbolic network models exhibit very fundamental and essential features, like small-worldness, scale-freeness, high-clustering coefficient, and community structure. In this paper, we comprehensively explore the presence of an important feature, the core-periphery structure, in the hyperbolic network models, which is often exhibited by real-world networks. We focused on well-known hyperbolic models such as popularity-similarity optimization model (PSO) and S1/H2 models and studied core-periphery structures using a well-established method that is based on standard random walk Markov chain model. The observed core-periphery centralization values indicate that the core-periphery structure can be very pronounced under certain conditions. We also validate our findings by statistically testing for the significance of the observed core-periphery structure in the network geometry. This study extends network science and reveals core-periphery insights applicable to various domains, enhancing network performance and resiliency in transportation and information systems.

[AI-70] Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?

链接: https://arxiv.org/abs/2406.19755
作者: Yang Tan,Lirong Zheng,Bozitao Zhong,Liang Hong,Bingxin Zhou
关键词: crucial tool, tool in studying, amino acid types, Deep learning, amino acid
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input as a default operation for many inference tasks. This study demonstrates, with a structure alignment task, that embedding amino acid types may in some cases not help a deep learning model learn a better representation. To this end, we propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation. The effectiveness of ProtLOCA is examined by a global structure-matching task on protein pairs with an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution to highlight common local structures among proteins with different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure to infer function.

[AI-71] Classical Bandit Algorithms for Entanglement Detection in Parameterized Qubit States

链接: https://arxiv.org/abs/2406.19738
作者: Bharati. K,Vikesh Siddhu,Krishna Jagannathan
关键词: Phys. Rev, information and computing, wide range, range of tasks, key resource
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:Entanglement is a key resource for a wide range of tasks in quantum information and computing. Thus, verifying availability of this quantum resource is essential. Extensive research on entanglement detection has led to no-go theorems (Lu et al. [Phys. Rev. Lett., 116, 230501 (2016)]) that highlight the need for full state tomography (FST) in the absence of adaptive or joint measurements. Recent advancements, as proposed by Zhu, Teo, and Englert [Phys. Rev. A, 81, 052339, 2010], introduce a single-parameter family of entanglement witness measurements which are capable of conclusively detecting certain entangled states and only resort to FST when all witness measurements are inconclusive. We find a variety of realistic noisy two-qubit quantum states \mathcal{F} that yield conclusive results under this witness family. We solve the problem of detecting entanglement among K quantum states in \mathcal{F}, of which m states are entangled, with m potentially unknown. We recognize a structural connection of this problem to the Bad Arm Identification problem in stochastic Multi-Armed Bandits (MAB). In contrast to existing quantum bandit frameworks, we establish a new correspondence tailored for entanglement detection and term it the (m,K)-quantum Multi-Armed Bandit. We implement two well-known MAB policies for arbitrary states derived from \mathcal{F}, present theoretical guarantees on the measurement/sample complexity and demonstrate the practicality of the policies through numerical simulations. More broadly, this paper highlights the potential for employing classical machine learning techniques for quantum entanglement detection.
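
Framing the task as bad-arm identification means repeatedly "pulling" each candidate state (running its witness measurement) and flagging arms whose empirical mean confidently crosses a threshold. The sketch below is a generic Hoeffding-style threshold bandit, not the specific policies analyzed in the paper; the threshold, confidence level, and reward model are placeholders.

```python
import numpy as np

def identify_bad_arms(pull, n_arms: int, threshold: float,
                      delta: float = 0.05, max_pulls: int = 20000):
    """Flag arms whose mean reward confidently falls below `threshold`.

    `pull(i)` returns a stochastic reward in [0, 1] for arm i (for example,
    a binary witness-measurement outcome). Hoeffding-style confidence radii;
    a generic threshold-bandit sketch, not the paper's policies.
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    undecided = set(range(n_arms))
    bad = set()
    for _ in range(max_pulls):
        if not undecided:
            break
        i = min(undecided, key=lambda a: counts[a])   # pull the least-sampled arm
        sums[i] += pull(i)
        counts[i] += 1
        mean = sums[i] / counts[i]
        radius = np.sqrt(np.log(2 * n_arms * max_pulls / delta) / (2 * counts[i]))
        if mean + radius < threshold:
            bad.add(i); undecided.discard(i)          # confidently "bad"
        elif mean - radius > threshold:
            undecided.discard(i)                      # confidently "good"
    return bad

rng = np.random.default_rng(1)
means = [0.2, 0.6, 0.7, 0.1]   # arms 0 and 3 fall below the 0.5 threshold
print(identify_bad_arms(lambda i: float(rng.random() < means[i]), 4, 0.5))  # {0, 3}
```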

[AI-72] Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction

链接: https://arxiv.org/abs/2406.19686
作者: Akash Awasthi,Ngan Le,Zhigang Deng,Carol C. Wu,Hien Van Nguyen
关键词: correct perceptual errors, Human-AI collaboration, previously explored, eye gaze data, collaboration to identify
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Under Review in Journal

点击查看摘要

Abstract:Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX, employing a large multimodal model to analyze image embeddings, eye gaze data, and radiology reports. The system’s effectiveness was evaluated based on its referral-making process, the quality of referrals, and performance in collaborative diagnostic settings. CoRaX was tested on a simulated error dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system corrected 21% (71 of 332) of these errors, leaving 7% (22 of 312) unresolved. The Referral-Usefulness score, indicating the accuracy of predicted regions for all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score, reflecting the diagnostic accuracy of CoRaX’s interactions with radiologists, showed that 84% (237 of 280) of these interactions had a score above 0.40. In conclusion, CoRaX efficiently collaborates with radiologists to address perceptual errors across various abnormalities, with potential applications in the education and training of novice radiologists.

[AI-73] Multimodal Data Integration for Precision Oncology: Challenges and Future Directions

链接: https://arxiv.org/abs/2406.19611
作者: Huajun Zhou,Fengtao Zhou,Chenyu Zhao,Yingxue Xu,Luyang Luo,Hao Chen
关键词: tailor targeted treatments, precision oncology lies, multimodal data integration, precision oncology, commitment to tailor
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:The essence of precision oncology lies in its commitment to tailor targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides, showcasing remarkable progress in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we summarize the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, derived from the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.

[AI-74] Temporal distribution of clusters of investors and their application in prediction with expert advice

链接: https://arxiv.org/abs/2406.19403
作者: Wojciech Wisniewski,Yuri Kalnishkan,David Lindsay,Siân Lindsay
关键词: brokers face, face a significant, Financial organisations, traders worldwide, Ewens’ Sampling Distribution
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 20 pages, technical report

点击查看摘要

Abstract:Financial organisations such as brokers face a significant challenge in servicing the investment needs of thousands of their traders worldwide. This task is further compounded since individual traders will have their own risk appetite and investment goals. Traders may look to capture short-term trends in the market which last only seconds to minutes, or they may have longer-term views which last several days to months. To reduce the complexity of this task, client trades can be clustered. By examining such clusters, we would likely observe many traders following common patterns of investment, but how do these patterns vary through time? Knowledge regarding the temporal distributions of such clusters may help financial institutions manage the overall portfolio of risk that accumulates from underlying trader positions. This study contributes to the field by demonstrating that the distribution of clusters derived from the real-world trades of 20k Foreign Exchange (FX) traders (from 2015 to 2017) is described in accordance with Ewens’ Sampling Distribution. Further, we show that the Aggregating Algorithm (AA), an on-line prediction with expert advice algorithm, can be applied to the aforementioned real-world data in order to improve the returns of portfolios of trader risk. However, we found that the AA ‘struggles’ when presented with too many trader “experts”, especially when there are many trades with similar overall patterns. To help overcome this challenge, we have applied and compared the use of Statistically Validated Networks (SVN) with a hierarchical clustering approach on a subset of the data, demonstrating that both approaches can be used to significantly improve results of the AA in terms of profitability and smoothness of returns.
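
The Aggregating Algorithm maintains weights over experts that decay exponentially with cumulative loss and predicts with a mixture of the experts' forecasts. The sketch below is a plain exponential-weights scheme under square loss, in the spirit of the AA but without its substitution function; the learning rate and toy data are placeholders.

```python
import numpy as np

def exponential_weights(expert_preds: np.ndarray, outcomes: np.ndarray,
                        eta: float = 2.0) -> np.ndarray:
    """Online mixing of K experts over T rounds under square loss.

    expert_preds: (T, K) expert predictions; outcomes: (T,) realized values.
    Returns the learner's predictions. A simplified exponential-weights
    scheme, not the AA's exact substitution-function form.
    """
    T, K = expert_preds.shape
    log_w = np.zeros(K)
    learner = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        learner[t] = w @ expert_preds[t]                     # weighted prediction
        log_w -= eta * (expert_preds[t] - outcomes[t]) ** 2  # exponential loss update
    return learner

# Toy run: expert 1 tracks the outcome closely, expert 0 is pure noise.
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 6, 200))
preds = np.stack([rng.normal(size=200), y + 0.05 * rng.normal(size=200)], axis=1)
print(np.mean((exponential_weights(preds, y) - y) ** 2))  # small squared error
```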

[AI-75] Predicting Customer Goals in Financial Institution Services: A Data-Driven LSTM Approach

链接: https://arxiv.org/abs/2406.19399
作者: Andrew Estornell,Stylianos Loukas Vasileiou,William Yeoh,Daniel Borrajo,Rui Silva
关键词: competitive financial landscape, optimized user experience, today competitive financial, predicting customer goals, anticipating customer goals
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: Accepted at the FinPlan 2023 workshop at ICAPS 2023

点击查看摘要

Abstract:In today’s competitive financial landscape, understanding and anticipating customer goals is crucial for institutions to deliver a personalized and optimized user experience. This has given rise to the problem of accurately predicting customer goals and actions. Focusing on that problem, we use historical customer traces generated by a realistic simulator and present two simple models for predicting customer goals and future actions – an LSTM model and an LSTM model enhanced with state-space graph embeddings. Our results demonstrate the effectiveness of these models when it comes to predicting customer goals and actions.
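
Below is a minimal PyTorch sketch of the plain LSTM variant, predicting a goal label from a sequence of discrete customer actions; the vocabulary size, number of goals, and layer sizes are placeholders, and the state-space graph-embedding enhancement is not reproduced here.

```python
import torch
import torch.nn as nn

class GoalLSTM(nn.Module):
    """Predict a customer's goal from a sequence of discrete action IDs.
    Sizes are placeholders; the paper's exact features (including the
    state-space graph embeddings) are not reproduced in this sketch."""

    def __init__(self, n_actions: int = 100, n_goals: int = 10,
                 embed_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_actions, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_goals)

    def forward(self, action_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(action_ids)            # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)            # final hidden state summarizes the trace
        return self.head(h_n[-1])             # (batch, n_goals) logits

model = GoalLSTM()
logits = model(torch.randint(0, 100, (8, 15)))   # batch of 8 traces, length 15
print(logits.shape)  # torch.Size([8, 10])
```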

[AI-76] How scanning probe microscopy can be supported by Artificial Intelligence and quantum computing

链接: https://arxiv.org/abs/2406.19397
作者: Agnieszka Pregowska,Agata Roszkiewicz,Magdalena Osial,Michael Giersig
关键词: Scanning Probe Microscopy, Machine Learning, Probe Microscopy measurements, Artificial Intelligence, supporting Scanning Probe
类目: Neurons and Cognition (q-bio.NC); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:We focus on the potential for supporting Scanning Probe Microscopy measurements, emphasizing the application of Artificial Intelligence, especially Machine Learning, as well as quantum computing. Artificial Intelligence can help automate routine experimental processes, support the algorithmic search for good sample regions, and shed light on structure-property relationships. Thus, it contributes to increasing the efficiency and accuracy of optical nanoscopy scanning probes. Moreover, the combination of Artificial Intelligence-based algorithms and quantum computing may have huge potential to increase the practical application of Scanning Probe Microscopy. The limitations are also discussed. Finally, we outline a research path for improving the proposed approach.

[AI-77] Shaping New Norms for AI

链接: https://arxiv.org/abs/2307.08564
作者: Andrea Baronchelli
关键词: Artificial Intelligence, increasingly integrated, norm formation, norms, Intelligence
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:As Artificial Intelligence (AI) becomes increasingly integrated into our lives, the need for new norms is urgent. However, AI evolves at a much faster pace than the characteristic time of norm formation, posing an unprecedented challenge to our societies. This paper examines possible criticalities of the processes of norm formation surrounding AI. Thus, it focuses on how new norms can be established, rather than on what these norms should be. It distinguishes different scenarios based on the centralisation or decentralisation of the norm formation process, analysing the cases where new norms are shaped by formal authorities, informal institutions, or emerge spontaneously in a bottom-up fashion. On the latter point, the paper reports a conversation with ChatGPT in which the LLM discusses some of the emerging norms it has observed. Far from seeking exhaustiveness, this article aims to offer readers interpretive tools to understand society’s response to the growing pervasiveness of AI. An outlook on how AI could influence the formation of future social norms emphasises the importance for open societies to anchor their formal deliberation process in an open, inclusive, and transparent public discourse.

附件下载

点击下载今日全部论文列表